29 Jun, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, delay: tsc based udelay should have rdtsc_barrier
    x86, setup: correct include file in <asm/boot.h>
    x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h>
    x86, mce: percpu mcheck_timer should be pinned
    x86: Add sysctl to allow panic on IOCK NMI error
    x86: Fix uv bau sending buffer initialization
    x86, mce: Fix mce resume on 32bit
    x86: Move init_gbpages() to setup_arch()
    x86: ensure percpu lpage doesn't consume too much vmalloc space
    x86: implement percpu_alloc kernel parameter
    x86: fix pageattr handling for lpage percpu allocator and re-enable it
    x86: reorganize cpa_process_alias()
    x86: prepare setup_pcpu_lpage() for pageattr fix
    x86: rename remap percpu first chunk allocator to lpage
    x86: fix duplicate free in setup_pcpu_remap() failure path
    percpu: fix too lazy vunmap cache flushing
    x86: Set cpu_llc_id on AMD CPUs

    Linus Torvalds
     

24 Jun, 2009

7 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • After downing/upping a cpu, an attempt to set
    /proc/sys/vm/percpu_pagelist_fraction results in an oops in
    percpu_pagelist_fraction_sysctl_handler().

    If a processor is downed then we need to set the pageset pointer back to
    the boot pageset.

    Updates of the high water marks should not access pagesets of unpopulated
    zones (those pointers point to the boot pagesets, which would no longer be
    functional if their size were increased beyond zero).

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
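
    A minimal sketch of the shape of that fix, assuming the handler loops
    over zones resizing per-cpu pagesets as mm/page_alloc.c did in this
    era (setup_pagelist_highmark() and zone_pcp() per that code; the
    exact patch may differ):

        /* percpu_pagelist_fraction_sysctl_handler() -- sketch */
        struct zone *zone;
        int cpu;

        /*
         * Walk populated zones only: unpopulated zones still point
         * at the boot pagesets, whose high water marks must never
         * be raised above zero.
         */
        for_each_populated_zone(zone) {
                for_each_online_cpu(cpu) {
                        unsigned long high;

                        high = zone->present_pages /
                                        percpu_pagelist_fraction;
                        setup_pagelist_highmark(zone_pcp(zone, cpu), high);
                }
        }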
     
  • If a kthread happens to use get_user_pages() on an mm (as KSM does),
    there's a chance that it will end up trying to read in a swap page, then
    oops in grab_swap_token() because the kthread has no mm: GUP passes down
    the right mm, so grab_swap_token() ought to be using it.

    We have not identified a stronger case than KSM's daemon (not yet in
    mainline), but the issue must have come up before, since RHEL has included
    a fix for this for years (though a different fix: they simply back out of
    grab_swap_token() if current->mm is unset, which is what we first proposed;
    but using the right mm here seems more correct).

    Reported-by: Izik Eidus
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
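
    The direction described above, sketched (the function lives in
    mm/thrash.c in this era; the exact signature is assumed):

        /* was: void grab_swap_token(void), dereferencing current->mm */
        void grab_swap_token(struct mm_struct *mm)
        {
                /* mm comes from the fault path, so it is valid even
                 * when a kthread is doing get_user_pages() */
                /* ... token bookkeeping otherwise unchanged ... */
        }

        /* caller in do_swap_page(), mm/memory.c -- sketch */
        grab_swap_token(mm);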
     
  • * 'kmemleak' of git://linux-arm.org/linux-2.6:
    kmemleak: Do not force the slab debugging Kconfig options
    kmemleak: use pr_fmt

    Linus Torvalds
     
  • Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE,
    and it's tempting to devise a set of Grand Unified Paging flags;
    but not today. So until then, let's rely upon the compiler to spot
    the coincidence, "rather than have that subtle dependency and a
    comment for it" - as you remarked in another context yesterday.

    Signed-off-by: Hugh Dickins
    Acked-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Hugh Dickins
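
    One way to let the compiler spot the coincidence is a build-time
    assertion plus direct reuse of the bit -- a sketch, not necessarily
    the exact form the commit took (gup_flags is an assumed name):

        /* in some convenient function: fail the build if the two
         * flag values ever diverge */
        BUILD_BUG_ON(FOLL_WRITE != FAULT_FLAG_WRITE);

        /* ...which lets GUP hand the bit straight through: */
        unsigned int fault_flags = gup_flags & FOLL_WRITE;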
     
  • handle_mm_fault() is now passing fault flags rather than write_access
    down to hugetlb_fault(), so better recognize that in hugetlb_fault(),
    and in hugetlb_no_page().

    Signed-off-by: Hugh Dickins
    Acked-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Hugh Dickins
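
    The shape of the interface change, sketched (argument names assumed):

        /* was: int hugetlb_fault(..., int write_access); */
        int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                          unsigned long address, unsigned int flags);

        /* inside hugetlb_fault()/hugetlb_no_page(), tests become: */
        if (flags & FAULT_FLAG_WRITE)
                /* ... write-fault handling ... */;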
     
  • The isolated page is "cursor_page" not "page".

    This could cause LRU list corruption under memory pressure, caught by
    CONFIG_DEBUG_LIST.

    Reported-by: Ingo Molnar
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Tested-by: Daisuke Nishimura
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
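
    Sketched against the lumpy-reclaim loop in isolate_lru_pages()
    (mm/vmscan.c); the assumed one-liner is that the memcg LRU hook must
    be told about the page that was actually isolated:

        list_move(&cursor_page->lru, dst);
        mem_cgroup_del_lru(cursor_page); /* was: mem_cgroup_del_lru(page) */
        nr_taken++;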
     

22 Jun, 2009

4 commits

  • According to Andi, it isn't clear whether the lpage allocator is worth
    the trouble as there are many processors where PMD TLB is far scarcer
    than PTE TLB. The advantage or disadvantage probably depends on the
    actual size of the percpu area and the specific processor. As performance
    degradation due to TLB pressure tends to be highly workload specific
    and subtle, it is difficult to decide which way to go without more
    data.

    This patch implements percpu_alloc kernel parameter to allow selecting
    which first chunk allocator to use to ease debugging and testing.

    While at it, make sure all the failure paths report why something
    failed, to help determine why a certain allocator isn't working. Also,
    kill the "Great future plan" comment which had already been realized
    quite some time ago.

    [ Impact: allow explicit percpu first chunk allocator selection ]

    Signed-off-by: Tejun Heo
    Reported-by: Jan Beulich
    Cc: Andi Kleen
    Cc: Ingo Molnar

    Tejun Heo
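
    A sketch of the parameter plumbing (identifier and option names are
    assumptions; the era's first-chunk allocators were embed, 4k and
    lpage):

        /* arch/x86/kernel/setup_percpu.c -- sketch */
        enum { PCPU_FC_AUTO, PCPU_FC_EMBED, PCPU_FC_4K, PCPU_FC_LPAGE };
        static int pcpu_chosen_fc __initdata = PCPU_FC_AUTO;

        static int __init percpu_alloc_setup(char *str)
        {
                if (!strcmp(str, "embed"))
                        pcpu_chosen_fc = PCPU_FC_EMBED;
                else if (!strcmp(str, "4k"))
                        pcpu_chosen_fc = PCPU_FC_4K;
                else if (!strcmp(str, "lpage"))
                        pcpu_chosen_fc = PCPU_FC_LPAGE;
                else
                        pr_warning("PERCPU: unknown allocator %s\n", str);
                return 0;
        }
        early_param("percpu_alloc", percpu_alloc_setup);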
     
  • In pcpu_unmap(), flushing virtual cache on vunmap can't be delayed as
    the page is going to be returned to the page allocator. Only TLB
    flushing can be put off such that vmalloc code can handle it lazily.
    Fix it.

    [ Impact: fix subtle virtual cache flush bug ]

    Signed-off-by: Tejun Heo
    Cc: Nick Piggin
    Cc: Ingo Molnar

    Tejun Heo
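
    The ordering rule, sketched (the exact code in mm/percpu.c differs):

        /* pcpu_unmap() -- sketch */
        unmap_kernel_range_noflush(addr, size);

        /* must happen now: these pages go straight back to the page
         * allocator, so any virtually-indexed cache lines for this
         * mapping have to be written back/invalidated first */
        flush_cache_vunmap(addr, addr + size);

        /* the TLB flush alone may be left to vmalloc's lazy flushing */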
     
  • This allows the callers to now pass down the full set of FAULT_FLAG_xyz
    flags to handle_mm_fault(). All callers have been (mechanically)
    converted to the new calling convention; there's almost certainly room
    for architectures to clean up their code and then add FAULT_FLAG_RETRY
    when that support is added.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
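
    The mechanical conversion at each call site, sketched from the
    description above (individual architectures vary):

        /* typical arch page-fault handler -- sketch */
        fault = handle_mm_fault(mm, vma, address,
                                write ? FAULT_FLAG_WRITE : 0);
        /* was: fault = handle_mm_fault(mm, vma, address, write); */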
     
  • The fault handling routines really want more fine-grained flags than a
    single "was it a write fault" boolean - the callers will want to set
    flags like "you can return a retry error" etc.

    And that's actually how the VM works internally, but right now the
    top-level fault handling functions in mm/memory.c all pass just the
    'write_access' boolean around.

    This switches them over to pass around the FAULT_FLAG_xyzzy 'flags'
    variable instead. The 'write_access' calling convention still exists
    for the exported 'handle_mm_fault()' function, but that is next.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
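
    The flag space being passed around, as it stood in include/linux/mm.h
    of this era (retry-style flags come later):

        #define FAULT_FLAG_WRITE     0x01  /* fault was a write fault */
        #define FAULT_FLAG_NONLINEAR 0x02  /* fault via nonlinear mapping */

        /* and internally, e.g. in mm/memory.c -- sketch: */
        if (flags & FAULT_FLAG_WRITE)      /* was: if (write_access) */
                /* ... */;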
     

21 Jun, 2009

1 commit

  • da456f1 "page allocator: do not disable interrupts in free_page_mlock()"
    moved the PG_mlocked clearing after the flag sanity checking, which makes
    mlocked pages always trigger 'bad page'. Fix this by clearing the bit up
    front.

    Reported-and-debugged-by: Peter Chubb
    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Tested-by: Maxim Levitsky
    Signed-off-by: Linus Torvalds

    Johannes Weiner
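
    A sketch of the reordering in the free path (TestClearPageMlocked()
    as the up-front clear is an assumed detail):

        /* __free_pages_ok()/free_hot_cold_page() -- sketch */
        int wasMlocked = TestClearPageMlocked(page); /* clear up front */

        /* the bad_page()/free_pages_check() sanity checks now see a
         * clean flag set */

        if (unlikely(wasMlocked))
                free_page_mlock(page);  /* counter adjustment only */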
     

19 Jun, 2009

7 commits

  • The page allocator also needs the masking of gfp flags during boot,
    so this moves it out of slab/slub and uses it with the page allocator
    as well.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
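
    The shared-mask idea, sketched from the description above (exact
    placement may differ):

        /* one boot-time mask consulted by slab and the page allocator;
         * GFP_BOOT_MASK strips e.g. __GFP_WAIT/__GFP_IO/__GFP_FS */
        gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, ...)
        {
                gfp_mask &= gfp_allowed_mask; /* safe while irqs are
                                               * still off during boot */
                /* ... normal allocation path ... */
        }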
     
  • Try to fix memcg's LRU rotation sanity: make memcg use the same logic as
    the global LRU does.

    Now, when __isolate_lru_page() returns -EBUSY, the page is rotated to the
    tail of the LRU in the global LRU's isolate-LRU-pages path. But in memcg
    it's not handled. This makes memcg behave the same as the global LRU and
    rotate the LRU when the page is busy.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
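
    A sketch of the -EBUSY handling in mem_cgroup_isolate_pages()
    (mm/memcontrol.c), mirroring the global LRU's isolator; structure
    assumed:

        switch (__isolate_lru_page(page, mode, file)) {
        case 0:
                list_move(&page->lru, dst);
                nr_taken++;
                break;
        case -EBUSY:
                /* rotate the busy page, as the global LRU does */
                list_move(&page->lru, src);
                break;
        default:
                break;
        }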
     
  • A user can set memcg.limit_in_bytes == memcg.memsw.limit_in_bytes when the
    user just wants to limit the total size of applications; in other words,
    when the user is not very interested in memory usage itself. In this case,
    swap-out will be done only by the global LRU.

    But under the current implementation, memory.limit_in_bytes is checked
    first, and try_to_free_page() may swap out pages. That swap-out is useless
    for memsw.limit_in_bytes, and the thread may hit the limit again.

    This patch fixes the current behavior in the memory.limit == memsw.limit
    case. Documentation is updated to explain the behavior of this special
    case.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
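
    A sketch of the special case (memsw_is_minimum is an assumed field
    name): when the two limits are equal, swapping pages out cannot
    reduce mem+swap usage, so reclaim is told not to bother with swap:

        /* at limit-setting time: remember the relationship */
        memcg->memsw_is_minimum =
                (res_counter_read_u64(&memcg->res, RES_LIMIT) >=
                 res_counter_read_u64(&memcg->memsw, RES_LIMIT));

        /* at reclaim time: */
        if (memcg->memsw_is_minimum)
                noswap = true;  /* swap-out would be useless here */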
     
  • This patch fixes mis-accounting of swap usage in memcg.

    In the current implementation, memcg's swap account is uncharged only when
    swap is completely freed. But there are several cases where swap cannot be
    freed cleanly. To handle them, this patch changes memcg to uncharge the
    swap account when the swap entry has no references other than the cache.

    By this, memcg's swap entry accounting can be fully synchronous with the
    application's behavior.

    This patch also changes memcg's hooks for swap-out.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
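
    A heavily hedged sketch of the new uncharge point (the swap-map
    encoding shown is an assumption): the memcg swap charge is dropped as
    soon as the last non-cache reference to the entry goes away, rather
    than waiting for the entry itself to be freed:

        /* mm/swapfile.c, swap_entry_free() -- sketch */
        count = p->swap_map[offset];
        has_cache = count & SWAP_HAS_CACHE;
        count &= ~SWAP_HAS_CACHE;
        /* ... drop one usage ... */
        if (!count)
                mem_cgroup_uncharge_swap(ent); /* only the cache remains */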
     
  • We don't need to check do_swap_account in functions that will never get
    called if do_swap_account == 0.

    Signed-off-by: Li Zefan
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Add file RSS tracking per memory cgroup

    We currently don't track file RSS; the RSS we report is actually anon RSS.
    All file-mapped pages come in through the page cache and get accounted
    there. This patch adds support for accounting file RSS pages.
    It should

    1. Help improve the metrics reported by the memory resource controller
    2. Will form the basis for a future shared memory accounting heuristic
    that has been proposed by Kamezawa.

    Unfortunately, we cannot rename the existing "rss" keyword used in
    memory.stat to "anon_rss". We do, however, add "mapped_file" data and hope
    to educate the end user through documentation.

    [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Dhaval Giani
    Cc: Daisuke Nishimura
    Cc: YAMAMOTO Takashi
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
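
    A sketch of the accounting hook named in the bracketed fixup above,
    called with +1/-1 from the file rmap add/remove paths (internals
    assumed):

        void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
        {
                struct page_cgroup *pc = lookup_page_cgroup(page);
                struct mem_cgroup_stat_cpu *stat; /* via pc->mem_cgroup,
                                                   * lookup elided */

                __mem_cgroup_stat_add_safe(stat,
                                MEM_CGROUP_STAT_MAPPED_FILE, val);
        }

        /* page_add_file_rmap():  mem_cgroup_update_mapped_file_stat(page, 1);
         * page_remove_rmap():    mem_cgroup_update_mapped_file_stat(page, -1); */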
     
  • Fix some cgroup messages to read better.
    Update MAINTAINERS to include mm/*cgroup* files.

    Signed-off-by: Randy Dunlap
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

17 Jun, 2009

9 commits

  • Conflicts:
    mm/slub.c

    Pekka Enberg
     
  • Pekka Enberg
     
  • * akpm: (182 commits)
    fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
    fbdev: *bfin*: fix __dev{init,exit} markings
    fbdev: *bfin*: drop unnecessary calls to memset
    fbdev: bfin-t350mcqb-fb: drop unused local variables
    fbdev: blackfin has __raw I/O accessors, so use them in fb.h
    fbdev: s1d13xxxfb: add accelerated bitblt functions
    tcx: use standard fields for framebuffer physical address and length
    fbdev: add support for handoff from firmware to hw framebuffers
    intelfb: fix a bug when changing video timing
    fbdev: use framebuffer_release() for freeing fb_info structures
    radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
    s3c-fb: CPUFREQ frequency scaling support
    s3c-fb: fix resource releasing on error during probing
    carminefb: fix possible access beyond end of carmine_modedb[]
    acornfb: remove fb_mmap function
    mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
    mb862xxfb: restrict compliation of platform driver to PPC
    Samsung SoC Framebuffer driver: add Alpha Channel support
    atmel-lcdc: fix pixclock upper bound detection
    offb: use framebuffer_alloc() to allocate fb_info struct
    ...

    Manually fix up conflicts due to kmemcheck in mm/slab.c

    Linus Torvalds
     
  • During lumpy reclaim, a page that __isolate_lru_page() failed to take can
    be pushed back to the "src" list by list_move(). But the page may not be
    from the "src" list at all, so this pushes it back to the wrong LRU. The
    list_move() is also unnecessary because the page is not at the top of the
    LRU. So leave the page as it is if __isolate_lru_page() fails.

    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
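
    The assumed shape of the fix in the lumpy-reclaim loop of
    isolate_lru_pages() (mm/vmscan.c):

        switch (__isolate_lru_page(cursor_page, mode, file)) {
        case 0:
                list_move(&cursor_page->lru, dst);
                nr_taken++;
                break;
        default:
                /* was: list_move(&cursor_page->lru, src);
                 * cursor_page may not be from "src" at all, and it is
                 * not at the top of the LRU -- just leave it alone */
                break;
        }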
     
  • On NUMA machines, the administrator can configure zone_reclaim_mode, a
    more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile, but it is
    possible that the heuristic will fail and the CPU gets tied up scanning
    uselessly. Detecting the situation requires some guesswork and
    experimentation, so this patch adds a counter "zreclaim_failed" to
    /proc/vmstat. If during high CPU utilisation this counter is increasing
    rapidly, then the resolution to the problem may be to set
    /proc/sys/vm/zone_reclaim_mode to 0.

    [akpm@linux-foundation.org: name things consistently]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
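
    The counter, sketched (after akpm's naming cleanup the /proc/vmstat
    line is plausibly "zone_reclaim_failed"; the event name is an
    assumption):

        /* include/linux/vmstat.h: new vm_event_item, sketch */
        PGSCAN_ZONE_RECLAIM_FAILED,

        /* wherever zone_reclaim() fails to reclaim: */
        count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);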
     
  • On NUMA machines, the administrator can configure zone_reclaim_mode, a
    more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met. The problem is that zone_reclaim() failing at all means the
    zone gets marked full.

    This can cause situations where a zone is usable, but is being skipped
    because it has been considered full. Take a situation where a large tmpfs
    mount is occupying a large percentage of memory overall. The pages do not
    get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
    and the zonelist cache considers them not worth trying in the future.

    This patch makes zone_reclaim() return more fine-grained information about
    what occurred when zone_reclaim() failed. The zone only gets marked full
    if it really is unreclaimable. If the scan did not occur, or if not enough
    pages were reclaimed with the limited reclaim_mode, then the zone is
    simply skipped.

    There is a side-effect to this patch. Currently, if zone_reclaim()
    successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
    ahead. With this patch applied, zone watermarks are rechecked after
    zone_reclaim() does some work.

    This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
    ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
    zonelist_cache was introduced. It was not intended that zone_reclaim()
    aggressively consider the zone to be full when it failed as full direct
    reclaim can still be an option. Due to the age of the bug, it should be
    considered a -stable candidate.

    Signed-off-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
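
    A sketch of the fine-grained outcomes and how the zonelist walk
    reacts (constant and label names assumed):

        #define ZONE_RECLAIM_NOSCAN  -2  /* heuristics skipped the scan */
        #define ZONE_RECLAIM_FULL    -1  /* scan ran, nothing reclaimable */
        #define ZONE_RECLAIM_SOME     0  /* reclaimed some, not enough */
        #define ZONE_RECLAIM_SUCCESS  1  /* reclaimed enough pages */

        /* get_page_from_freelist() -- sketch */
        switch (zone_reclaim(zone, gfp_mask, order)) {
        case ZONE_RECLAIM_NOSCAN:
                goto try_next_zone;      /* skip, but don't mark full */
        case ZONE_RECLAIM_FULL:
                goto this_zone_full;     /* genuinely unreclaimable */
        default:
                /* even on success, recheck the watermark first */
                if (!zone_watermark_ok(zone, order, mark,
                                       classzone_idx, alloc_flags))
                        goto this_zone_full;
        }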
     
  • A bug was brought to my attention against a distro kernel but it affects
    mainline and I believe problems like this have been reported in various
    guises on the mailing lists although I don't have specific examples at the
    moment.

    The reported problem was that malloc() stalled for a long time (minutes in
    some cases) if a large tmpfs mount was occupying a large percentage of
    memory overall. The pages did not get cleaned or reclaimed by
    zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
    were uselessly scanned frequently, making the CPU spin at near 100%.

    This patchset intends to address that bug and bring the behaviour of
    zone_reclaim() more in line with expectations which were noticed during
    investigation. It is based on top of mmotm and takes advantage of
    Kosaki's work with respect to zone_reclaim().

    Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
    scan should go ahead. The broken heuristic is what was causing the
    malloc() stall as it uselessly scanned the LRU constantly. Currently,
    zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
    could not deal with tmpfs pages at all. This fixes up the heuristic so
    that an unnecessary scan is more likely to be correctly avoided.

    Patch 2 notes that zone_reclaim() returning a failure automatically means
    the zone is marked full. This is not always true. It could have
    failed because the GFP mask or zone_reclaim_mode were unsuitable.

    Patch 3 introduces a counter zreclaim_failed that will increment each
    time the zone_reclaim scan-avoidance heuristics fail. If that
    counter is rapidly increasing, then zone_reclaim_mode should be
    set to 0 as a temporary resolution and a bug reported because
    the scan-avoidance heuristic is still broken.

    This patch:

    On NUMA machines, the administrator can configure zone_reclaim_mode, a
    more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile, but the
    problem is that the heuristic is not being properly applied and basically
    assumes zone_reclaim_mode is 1 if it is enabled. The lack of proper
    detection can manifest as high CPU usage as the LRU list is scanned
    uselessly.

    Historically, once enabled, it depended on NR_FILE_PAGES, which may
    include swapcache pages that the reclaim_mode cannot deal with. Patch
    vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
    Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
    pages that were not file-backed such as swapcache and made a calculation
    based on the inactive, active and mapped files. This is far superior when
    zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
    reasonable starting figure.

    This patch alters how zone_reclaim() works out how many pages it might be
    able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
    the reclaim_mode, it will consider NR_FILE_PAGES as potential candidates;
    otherwise it uses NR_{IN}ACTIVE_PAGES - NR_FILE_MAPPED to discount
    swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
    then NR_FILE_DIRTY pages are not candidates. If RECLAIM_SWAP is not set,
    then NR_FILE_MAPPED pages are not candidates either.

    [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
    [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
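
    A sketch of the estimate (helper names assumed, but the arithmetic
    follows the description and the two bracketed fixups above):

        static long zone_pagecache_reclaimable(struct zone *zone)
        {
                long nr_pagecache_reclaimable;
                long delta = 0;

                /* RECLAIM_SWAP: swapcache is fair game, count all file
                 * pages; otherwise discount non-file-backed pages */
                if (zone_reclaim_mode & RECLAIM_SWAP)
                        nr_pagecache_reclaimable =
                                zone_page_state(zone, NR_FILE_PAGES);
                else
                        nr_pagecache_reclaimable =
                                zone_unmapped_file_pages(zone);

                /* can't write back: dirty pages aren't candidates */
                if (!(zone_reclaim_mode & RECLAIM_WRITE))
                        delta += zone_page_state(zone, NR_FILE_DIRTY);

                /* guard against estimate-induced underflow */
                if (unlikely(delta > nr_pagecache_reclaimable))
                        delta = nr_pagecache_reclaimable;

                return nr_pagecache_reclaimable - delta;
        }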
     
  • When a task is chosen for oom kill and is found to be PF_EXITING,
    __oom_kill_task() is called to elevate the task's timeslice and give it
    access to memory reserves so that it may quickly exit.

    This privilege is unnecessary, however, if the task has already detached
    its mm. Although it's possible for the mm to become detached later since
    task_lock() is not held, __oom_kill_task() will simply be a no-op in such
    circumstances.

    Subsequently, it is no longer necessary to warn about killing mm-less
    tasks since it is a no-op.

    Signed-off-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
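
    The assumed shape of the change in mm/oom_kill.c:

        static void __oom_kill_task(struct task_struct *p, int verbose)
        {
                if (is_global_init(p)) {
                        /* ... refuse to kill init ... */
                        return;
                }
                if (!p->mm)
                        return; /* already detached its mm: nothing to
                                 * boost, and no longer worth a warning */
                /* ... elevate timeslice, grant reserves, kill ... */
        }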
     
  • Commit 2e2e425989080cc534fc0fca154cae515f971cf5 ("vmscan,memcg:
    reintroduce sc->may_swap") added the may_swap flag and handles it in
    get_scan_ratio().

    But the result of get_scan_ratio() is ignored when priority == 0, so the
    anon LRU is scanned even if may_swap == 0 or nr_swap_pages == 0. IMHO,
    this is not expected behavior.

    As for memcg especially, because of this behavior many, many pages are
    swapped out in vain when the OOM killer is invoked by the mem+swap limit.

    This patch handles the may_swap flag more strictly.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
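
    A sketch of the stricter handling in shrink_zone() (mm/vmscan.c),
    following the description above; local names are assumptions:

        enum lru_list l;
        unsigned long percent[2];       /* anon @ 0, file @ 1 */
        int noswap = 0;

        /* no swap allowed/available: don't scan anon at all */
        if (!sc->may_swap || (nr_swap_pages <= 0)) {
                noswap = 1;
                percent[0] = 0;
                percent[1] = 100;
        } else
                get_scan_ratio(zone, sc, percent);

        for_each_evictable_lru(l) {
                unsigned long scan = zone_nr_pages(zone, sc, l);

                /* apply the ratio even at priority == 0 when noswap
                 * is set, instead of force-scanning everything */
                if (priority || noswap) {
                        scan >>= priority;
                        scan = (scan * percent[is_file_lru(l)]) / 100;
                }
                /* ... */
        }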