17 Jun, 2009

40 commits

  • list_add() lost a parameter in sample code.

    Signed-off-by: Figo.zhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Figo.zhang
     
  • Add EISA IDs for Network Peripherals FDDI boards. Descriptions taken from
    the respective EISA configuration files.

    It's unlikely we'll ever support these cards, the problem being the lack
    of documentation. Assuming the policy for the EISA ID database is the
    same as for PCI I'm sending these entries for the sake of completeness.

    Signed-off-by: Maciej W. Rozycki
    Cc: Marc Zyngier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maciej W. Rozycki
     
  • sys_pipe2 is declared twice in include/linux/syscalls.h.

    Signed-off-by: Masatake YAMATO
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masatake YAMATO
     
  • Convert most arches to use asm-generic/kmap_types.h.

    Move the KM_FENCE_ macro additions into asm-generic/kmap_types.h,
    controlled by __WITH_KM_FENCE from each arch's kmap_types.h file.
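
    As a hedged sketch, a converted per-arch kmap_types.h ends up looking
    roughly like this (the guard name and the DEBUG_HIGHMEM condition are
    illustrative and vary per arch):

    #ifndef _ASM_KMAP_TYPES_H
    #define _ASM_KMAP_TYPES_H

    #ifdef CONFIG_DEBUG_HIGHMEM
    #define __WITH_KM_FENCE
    #endif

    #include <asm-generic/kmap_types.h>

    #undef __WITH_KM_FENCE

    #endif /* _ASM_KMAP_TYPES_H */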

    Would be nice to be able to add custom KM_types per arch, but I don't yet
    see a nice, clean way to do that.

    Built on x86_64, i386, mips, sparc, alpha(tonyb), powerpc(tonyb), and
    68k(tonyb).

    Note: avr32 should be able to remove KM_PTE2 (since it's not used) and
    then just use the generic kmap_types.h file. Get avr32 maintainer
    approval.

    Signed-off-by: Randy Dunlap
    Acked-by: Mike Frysinger
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Bryan Wu
    Cc: Mikael Starvik
    Cc: Hirokazu Takata
    Cc: "Luck Tony"
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Fix compilation warning:

    Documentation/accounting/getdelays.c: In function `main':
    Documentation/accounting/getdelays.c:249: warning: `cmd_type' may be used uninitialized in this function

    This is in fact a false positive.

    Signed-off-by: Jaswinder Singh Rajput
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • For example:
    hex_dump_to_buffer("AB", 2, 16, 1, buf, 100, 0);
    pr_info("[%s]\n", buf);

    I'd expect the output to be "[41 42]", but actually it's "[41 42 ]"

    This patch also reduces the required buffer size to the minimum. To print
    the hex format of "AB", a buffer of size 6 should be sufficient, but
    hex_dump_to_buffer() previously required at least 8.
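
    With the fix, a 6-byte buffer suffices for those two bytes ("41 42" plus
    the trailing NUL); an illustration based on the example above:

    char buf[6];
    hex_dump_to_buffer("AB", 2, 16, 1, buf, sizeof(buf), 0);
    pr_info("[%s]\n", buf);    /* prints "[41 42]" with no trailing space */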

    Signed-off-by: Li Zefan
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • There are some places that could use printk_once() instead of open-coding
    the same pattern.
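
    A hedged before/after illustration (the message and the guard variable are
    made up):

    /* open-coded "print only once" pattern: */
    static bool printed;
    if (!printed) {
            printed = true;
            printk(KERN_INFO "example: feature enabled\n");
    }

    /* with the helper: */
    printk_once(KERN_INFO "example: feature enabled\n");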

    Signed-off-by: Minchan Kim
    Cc: Dominik Brodowski
    Cc: David S. Miller
    Cc: Ingo Molnar
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Round the slow work queue's cull and OOM timeouts to whole second boundary
    with round_jiffies(). The slow work queue uses a pair of timers to cull
    idle threads and, after OOM, to delay new thread creation.

    This patch also extracts the mod_timer() logic for the cull timer into a
    separate helper function.

    By rounding non-time-critical timers such as these to whole seconds, they
    will be batched up to fire at the same time rather than being spread out.
    This allows the CPU to wake up less often, which saves power.
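
    A minimal sketch of the rounding, with an illustrative timer name and
    timeout rather than the exact slow-work code:

    mod_timer(&slow_work_cull_timer,
              round_jiffies(jiffies + 5 * HZ));   /* fire on a whole-second boundary */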

    Signed-off-by: Chris Peterson
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Peterson
     
  • radix_tree_lookup() and radix_tree_lookup_slot() have much the
    same code except for the return value.

    Introduce radix_tree_lookup_element() to do the real work.

    /*
     * is_slot == 1 : search for the slot.
     * is_slot == 0 : search for the node.
     */
    static void *radix_tree_lookup_element(struct radix_tree_root *root,
                                           unsigned long index, int is_slot);
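
    The two existing entry points then become thin wrappers; a plausible
    sketch, consistent with the prototype above (the exact patch may differ):

    void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
    {
            return radix_tree_lookup_element(root, index, 0);
    }

    void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
    {
            return (void **)radix_tree_lookup_element(root, index, 1);
    }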

    Signed-off-by: Huang Shijie
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Move the supplementary groups implementation to kernel/groups.c;
    kernel/sys.c has already accumulated quite a lot of random stuff.

    This is strictly copy/paste plus the headers required to compile.
    Compile-tested on many configs and arches.

    Signed-off-by: Alexey Dobriyan
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • put_cpu_no_resched() is an optimization of put_cpu() which unfortunately
    can cause high latencies.

    The nfs iostats code uses put_cpu_no_resched() in a code sequence where a
    reschedule request caused by an interrupt between the get_cpu() and the
    put_cpu_no_resched() can delay the reschedule for at least HZ.

    The other users of put_cpu_no_resched() optimize correctly in interrupt
    code, but there is no real harm in using the put_cpu() function, which is
    an alias for preempt_enable(). The extra check of the preempt count is
    not as critical as having a potential source of missed reschedules.
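
    An illustration of the problematic pattern and the fix, based on the
    description above (not the actual nfs iostats code):

    cpu = get_cpu();
    /* ... per-CPU work; an interrupt here may set the need-resched flag ... */
    put_cpu_no_resched();    /* preempt_enable_no_resched(): reschedule missed */

    /* the fix: */
    put_cpu();               /* alias for preempt_enable(): honours a pending reschedule */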

    Debugged in the preempt-rt tree and verified in mainline.

    Impact: remove a high latency source

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Cc: Tony Luck
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Since we have had a LANANA major number for years, and it is documented in
    devices.txt, I think that this first patch can go upstream.

    Signed-off-by: Philipp Reisner
    Signed-off-by: Lars Ellenberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philipp Reisner
     
  • They're in linux/bug.h at present, which causes include order tangles. In
    particular, linux/bug.h cannot be used by linux/atomic.h because,
    according to Nikanth:

    linux/bug.h pulls in linux/module.h => linux/spinlock.h => asm/spinlock.h
    (which uses atomic_inc) => asm/atomic.h.

    bug.h is a pretty low-level thing and module.h is a higher-level thing,
    IMO.

    Cc: Nikanth Karthikesan
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • After the introduction of the keyed wakeups Davide Libenzi did for epoll,
    we are able to avoid spurious wakeups in the poll()/select() code too.

    For example, a typical use of poll()/select() is to wait for incoming
    network frames on many sockets. But TX completion for UDP/TCP frames calls
    sock_wfree(), which in turn wakes up the thread.

    When woken, the thread does a full scan of all polled fds and can go back
    to sleep again, because nothing is really available. If the number of fds
    is large, this causes significant load.

    This patch makes select()/poll() aware of keyed wakeups so that useless
    wakeups are avoided. This reduces the number of context switches by about
    50% on some setups, as well as the work performed by softirq handlers.
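
    The core of the idea, as a hedged sketch of the poll wakeup callback
    (struct member names assumed; the key carries the mask of POLL* events
    from the keyed wakeup):

    static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
    {
            struct poll_table_entry *entry;

            entry = container_of(wait, struct poll_table_entry, wait);
            /* ignore wakeups whose key has none of the events this fd waits for */
            if (key && !((unsigned long)key & entry->key))
                    return 0;
            return __pollwake(wait, mode, sync, key);
    }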

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Acked-by: Andi Kleen
    Acked-by: Ingo Molnar
    Acked-by: Davide Libenzi
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Signed-off-by: Robert P. J. Day
    Cc: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • _atomic_dec_and_lock() should not unconditionally take the lock before
    calling atomic_dec_and_test() in the UP case. For consistency reasons it
    should behave exactly like in the SMP case.

    Besides that, this works around the problem that, with
    CONFIG_DEBUG_SPINLOCK, it spins in __spin_lock_debug() if the lock is
    already taken, even if the counter doesn't drop to 0.
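
    For reference, a sketch of the SMP-style behaviour the UP case now follows
    (mirroring the generic lib/dec_and_lock.c code rather than the exact patch):

    int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
    {
            /* subtract 1 from the counter unless that would drop it to 0 */
            if (atomic_add_unless(atomic, -1, 1))
                    return 0;

            /* otherwise take the lock and re-check under it */
            spin_lock(lock);
            if (atomic_dec_and_test(atomic))
                    return 1;
            spin_unlock(lock);
            return 0;
    }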

    Signed-off-by: Jan Blunck
    Acked-by: Paul E. McKenney
    Acked-by: Nick Piggin
    Cc: Valerie Aurora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • The members of the new_utsname structure are defined with magic numbers
    that *should* correspond to the constant __NEW_UTS_LEN+1. Everywhere
    else, code assumes this and uses the constant, so this patch makes the
    structure match.

    Originally suggested by Serge here:

    https://lists.linux-foundation.org/pipermail/containers/2009-March/016258.html
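
    A hedged sketch of the resulting structure, where the magic 65 becomes
    __NEW_UTS_LEN + 1:

    struct new_utsname {
            char sysname[__NEW_UTS_LEN + 1];    /* was char sysname[65]; etc. */
            char nodename[__NEW_UTS_LEN + 1];
            char release[__NEW_UTS_LEN + 1];
            char version[__NEW_UTS_LEN + 1];
            char machine[__NEW_UTS_LEN + 1];
            char domainname[__NEW_UTS_LEN + 1];
    };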

    Signed-off-by: Dan Smith
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Smith
     
  • `ELF_CORE_COPY_REGS(x, y)' produces expansions like
    `(y)[0] = (x)->x.gp[0]', but the correct expansion is `(y)[0] = (x)->regs.gp[0]'.

    Signed-off-by: Roel Kluin
    Cc: WANG Cong
    Cc: Jeff Dike

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • When compiling uml on x86_64:

    MODPOST vmlinux.o
    WARNING: vmlinux.o (.__syscall_stub.2): unexpected non-allocatable section.
    Did you forget to use "ax"/"aw" in a .S file?
    Note that for example <linux/init.h> contains
    section definitions for use in .S files.

    This is because modpost checks for a missing SHF_ALLOC section flag, so
    just add it.

    Signed-off-by: WANG Cong
    Cc: Jeff Dike
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
    been kept around for migration reasons. After more than two years it's
    time to remove them finally.

    This patch cleans up one of the remaining users. When all such patches
    hit mainline we can remove the defines and typedefs finally.

    Impact: cleanup

    Convert the last remaining users to struct irq_chip and remove the
    define.

    Signed-off-by: Thomas Gleixner
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • See ancient discussion at
    http://marc.info/?l=user-mode-linux-devel&m=101990155831279&w=2

    Addresses http://bugzilla.kernel.org/show_bug.cgi?id=7854

    Signed-off-by: Alan Cox
    Reported-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Jeff Dike
    Cc: Roland Kletzing
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
    been kept around for migration reasons. After more than two years it's
    time to remove them finally.

    This patch cleans up one of the remaining users. When all such patches
    hit mainline we can remove the defines and typedefs finally.

    Impact: cleanup

    Convert the last remaining users to struct irq_chip and remove the
    define.

    Signed-off-by: Thomas Gleixner
    Cc: Hirokazu Takata

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • `for_each_mem_cluster(x, y, z)' will expand to
    `for ((x) = (y)->x ...' but correct is `for ((x) = (y)->cluster ...'

    Signed-off-by: Roel Kluin
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
    been kept around for migration reasons. After more than two years it's
    time to remove them finally.

    This patch cleans up one of the remaining users. When all such patches
    hit mainline we can remove the defines and typedefs finally.

    Impact: cleanup

    Convert the last remaining users to struct irq_chip and remove the
    define.

    Signed-off-by: Thomas Gleixner
    Cc: Richard Henderson

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • In lumpy reclaim, a page that __isolate_lru_page() fails to take can be
    pushed back to the "src" list by list_move(). But the page may not have
    come from the "src" list, so this pushes the page back to the wrong LRU.
    The list_move() is also unnecessary because the page is not at the top of
    the LRU. So leave the page as it is if __isolate_lru_page() fails.
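
    A hedged sketch of the resulting handling in isolate_lru_pages(); the
    surrounding loop is abbreviated and the exact patch may differ:

    switch (__isolate_lru_page(cursor_page, mode, file)) {
    case 0:
            list_move(&cursor_page->lru, dst);
            nr_taken++;
            break;
    default:
            /* was: list_move(&cursor_page->lru, src) on -EBUSY;
             * now the page is simply left where it is */
            break;
    }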

    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • On NUMA machines, the administrator can configure zone_reclaim_mode, which
    is a more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile, but it is
    possible for the heuristic to fail and for the CPU to get tied up scanning
    uselessly. Detecting the situation requires some guesswork and
    experimentation, so this patch adds a counter "zreclaim_failed" to
    /proc/vmstat. If this counter is increasing rapidly during periods of high
    CPU utilisation, then the resolution to the problem may be to set
    /proc/sys/vm/zone_reclaim_mode to 0.

    [akpm@linux-foundation.org: name things consistently]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On NUMA machines, the administrator can configure zone_reclaim_mode, which
    is a more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met. The problem is that zone_reclaim() failing at all means the
    zone gets marked full.

    This can cause situations where a zone is usable, but is being skipped
    because it has been considered full. Take a situation where a large tmpfs
    mount is occupying a large percentage of memory overall. The pages do not
    get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
    and the zonelist cache considers them not worth trying in the future.

    This patch makes zone_reclaim() return more fine-grained information about
    what occurred when zone_reclaim() failed. The zone only gets marked full
    if it really is unreclaimable. If the scan did not occur, or if not enough
    pages were reclaimed with the limited reclaim_mode, then the zone is
    simply skipped.
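
    Roughly, the patch distinguishes the possible outcomes along these lines
    (a sketch; the exact names and values may differ):

    #define ZONE_RECLAIM_NOSCAN   -2   /* heuristics decided not to scan */
    #define ZONE_RECLAIM_FULL     -1   /* scanned but nothing was reclaimable */
    #define ZONE_RECLAIM_SOME      0   /* reclaimed some, but not enough */
    #define ZONE_RECLAIM_SUCCESS   1   /* reclaimed enough pages */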

    There is a side effect to this patch. Currently, if zone_reclaim()
    successfully reclaims SWAP_CLUSTER_MAX pages, an allocation attempt will
    go ahead. With this patch applied, the zone watermarks are rechecked after
    zone_reclaim() does some work.

    This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
    ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
    zonelist_cache was introduced. It was not intended that zone_reclaim()
    aggressively consider the zone to be full when it failed as full direct
    reclaim can still be an option. Due to the age of the bug, it should be
    considered a -stable candidate.

    Signed-off-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A bug was brought to my attention against a distro kernel but it affects
    mainline and I believe problems like this have been reported in various
    guises on the mailing lists although I don't have specific examples at the
    moment.

    The reported problem was that malloc() stalled for a long time (minutes in
    some cases) if a large tmpfs mount was occupying a large percentage of
    memory overall. The pages did not get cleaned or reclaimed by
    zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
    are uselessly scanned frequently, making the CPU spin at near 100%.

    This patchset intends to address that bug and bring the behaviour of
    zone_reclaim() more in line with expectations which were noticed during
    investigation. It is based on top of mmotm and takes advantage of
    Kosaki's work with respect to zone_reclaim().

    Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
    scan should go ahead. The broken heuristic is what was causing the
    malloc() stall as it uselessly scanned the LRU constantly. Currently,
    zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
    could not deal with tmpfs pages at all. This fixes up the heuristic so
    that an unnecessary scan is more likely to be correctly avoided.

    Patch 2 notes that zone_reclaim() returning a failure automatically means
    the zone is marked full. This is not always true. It could have
    failed because the GFP mask or zone_reclaim_mode were unsuitable.

    Patch 3 introduces a counter zreclaim_failed that will increment each
    time the zone_reclaim scan-avoidance heuristics fail. If that
    counter is rapidly increasing, then zone_reclaim_mode should be
    set to 0 as a temporary resolution and a bug reported because
    the scan-avoidance heuristic is still broken.

    This patch:

    On NUMA machines, the administrator can configure zone_reclaim_mode, which
    is a more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile, but the
    problem is that the heuristic is not being properly applied and basically
    assumes zone_reclaim_mode is 1 if it is enabled. The lack of proper
    detection can manifest as high CPU usage as the LRU list is scanned
    uselessly.

    Historically, once enabled, it depended on NR_FILE_PAGES, which may
    include swapcache pages that the reclaim_mode cannot deal with. Patch
    vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
    Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
    pages that were not file-backed, such as swapcache, and made a calculation
    based on the inactive, active and mapped files. This is far superior when
    zone_reclaim==1, but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
    reasonable starting figure.

    This patch alters how zone_reclaim() works out how many pages it might be
    able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
    the reclaim_mode it will either consider NR_FILE_PAGES as potential
    candidates or else use NR_{IN}ACTIVE_PAGES - NR_FILE_MAPPED to discount
    swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
    then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
    not set, then NR_FILE_MAPPED are not.
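
    A condensed, hedged sketch of that accounting (the helper name is assumed
    and the real patch also guards against underflow):

    static long zone_pagecache_reclaimable(struct zone *zone)
    {
            long nr_pagecache_reclaimable;

            /* with RECLAIM_SWAP, swapcache is reclaimable too, so all of
             * NR_FILE_PAGES is a candidate; otherwise count only file LRU
             * pages that are not mapped */
            if (zone_reclaim_mode & RECLAIM_SWAP)
                    nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
            else
                    nr_pagecache_reclaimable =
                            zone_page_state(zone, NR_INACTIVE_FILE) +
                            zone_page_state(zone, NR_ACTIVE_FILE) -
                            zone_page_state(zone, NR_FILE_MAPPED);

            /* if we cannot write pages back, dirty pages are not candidates */
            if (!(zone_reclaim_mode & RECLAIM_WRITE))
                    nr_pagecache_reclaimable -=
                            zone_page_state(zone, NR_FILE_DIRTY);

            return nr_pagecache_reclaimable;
    }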

    [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
    [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • 1) I_FREEING tests should be coupled with I_CLEAR

    The two I_FREEING tests are racy because clear_inode() can set i_state to
    I_CLEAR between the clear of I_SYNC and the test of I_FREEING.

    2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
    races with generic_forget_inode()

    generic_forget_inode() sets I_WILL_FREE and calls writeback on its own, so
    generic_sync_sb_inodes() shall not try to step in and create possible races:

    generic_forget_inode():
        inode->i_state |= I_WILL_FREE;
        spin_unlock(&inode_lock);

    generic_sync_sb_inodes():
        spin_lock(&inode_lock);
        __iget(inode);
        __writeback_single_inode()
            // sees a non-zero i_count
            may WARN here ==> WARN_ON(inode->i_state & I_WILL_FREE);
        spin_unlock(&inode_lock);
        may call generic_forget_inode() again ==> iput(inode);

    The above race and warning didn't turn up because writeback_inodes() holds
    the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
    early. But we are not sure the UBIFS calls and future callers will
    guarantee that. So skip I_WILL_FREE inodes for the sake of safety.
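
    A hedged sketch of the two resulting checks in the writeback path (exact
    placement in fs/fs-writeback.c abbreviated):

    /* 1) couple the I_FREEING tests with I_CLEAR */
    if (inode->i_state & (I_FREEING | I_CLEAR))
            continue;

    /* 2) leave I_WILL_FREE inodes to generic_forget_inode()'s own writeback */
    if (inode->i_state & I_WILL_FREE)
            continue;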

    Cc: Eric Sandeen
    Acked-by: Jeff Layton
    Cc: Masayoshi MIZUMA
    Signed-off-by: Wu Fengguang
    Cc: Artem Bityutskiy
    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • When a task is chosen for oom kill and is found to be PF_EXITING,
    __oom_kill_task() is called to elevate the task's timeslice and give it
    access to memory reserves so that it may quickly exit.

    This privilege is unnecessary, however, if the task has already detached
    its mm. Although it is possible for the mm to become detached later since
    task_lock() is not held, __oom_kill_task() will simply be a no-op in such
    circumstances.

    Subsequently, it is no longer necessary to warn about killing mm-less
    tasks since it is a no-op.

    Signed-off-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 2e2e425989080cc534fc0fca154cae515f971cf5 ("vmscan,memcg:
    reintroduce sc->may_swap") added the may_swap flag and handled it in
    get_scan_ratio().

    But the result of get_scan_ratio() is ignored when priority == 0, so the
    anon LRU is scanned even if may_swap == 0 or nr_swap_pages == 0. IMHO,
    this is not the expected behavior.

    For memcg especially, because of this behavior many, many pages are
    swapped out in vain when an OOM is invoked by the mem+swap limit.

    This patch handles the may_swap flag more strictly.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • The "move pages to active list" and "move pages to inactive list" code
    blocks are mostly identical and can be served by a function.

    Thanks to Andrew Morton for pointing this out.

    Note that the buffer_heads_over_limit check will also be carried out for
    re-activated pages, which is slightly different from pre-2.6.28 kernels.
    Also, Rik's "vmscan: evict use-once pages first" patch could totally stop
    scans of the active file list when memory pressure is low. So the net
    effect could be that the number of buffer heads is now more likely to grow
    large.

    However that's fine according to Johannes' comments:

    I don't think that this could be harmful. We just preserve the buffer
    mappings of what we consider the working set and with low memory
    pressure, as you say, this set is not big.

    As to stripping of reactivated pages: the only pages we re-activate
    for now are those VM_EXEC mapped ones. Since we don't expect IO from
    or to these pages, removing the buffer mappings in case they grow too
    large should be okay, I guess.

    Cc: Pekka Enberg
    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Protect referenced PROT_EXEC mapped pages from being deactivated.

    PROT_EXEC (or its internal representation, VM_EXEC) pages normally belong to
    currently running executables and their linked libraries; they should really
    be cached aggressively to provide a good user experience.

    Thanks to Johannes Weiner for the advice to reuse the VMA walk in
    page_referenced() to get the PROT_EXEC bit.

    [more details]

    ( The consequences of this patch will have to be discussed together with
    Rik van Riel's recent patch "vmscan: evict use-once pages first". )

    ( Some of the good points and insights are taken into this changelog.
    Thanks to all the involved people for the great LKML discussions. )

    the problem
    ===========

    For a typical desktop, the most precious working set is composed of
    *actively accessed*
    (1) memory mapped executables
    (2) and their anonymous pages
    (3) and other files
    (4) and the dcache/icache/.. slabs
    while the least important data are
    (5) infrequently used or use-once files

    For a typical desktop, one major problem is bursty and large amounts of (5)
    use-once files flushing out the working set.

    Inside the working set, (4) dcache/icache have already been too sticky ;-)
    So we only have to care about (2) anonymous and (1)(3) file pages.

    anonymous pages
    ===============

    Anonymous pages are effectively immune to the streaming IO attack, because we
    now have separate file/anon LRU lists. When the use-once files crowd into the
    file LRU, the list's "quality" is significantly lowered. Therefore the scan
    balance policy in get_scan_ratio() will choose to scan the (low quality) file
    LRU much more frequently than the anon LRU.

    file pages
    ==========

    Rik proposed to *not* scan the active file LRU when the inactive list grows
    larger than the active list. This guarantees that when there is use-once
    streaming IO, and the working set is not too large (so that active_size <
    inactive_size), the active file LRU will *not* be scanned at all. So the
    not-too-large working set can be well protected.

    But there are also situations where the file working set is a bit large so
    that (active_size >= inactive_size), or the streaming IOs are not purely
    use-once. In these cases, the active list will be scanned slowly, because
    the current shrink_active_list() policy is to deactivate active pages
    regardless of their referenced bits. The deactivated pages become
    susceptible to the streaming IO attack: the inactive list could be scanned
    fast (500MB / 50MBps = 10s), so the deactivated pages don't have enough
    time to get re-referenced, because a user tends to switch between windows
    in intervals from seconds to minutes.

    This patch holds mapped executable pages in the active list as long as they
    are referenced during each full scan of the active list. Because the active
    list is normally scanned much more slowly, they get a longer grace period
    (e.g. 100s) for further references, which better matches the pace of user
    operations.
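
    The gist of the change in shrink_active_list(), as a hedged sketch (the
    surrounding loop and the memcg argument are abbreviated):

    if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
            /* referenced, executable file pages get one more trip around
             * the active list instead of being deactivated */
            if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
                    list_add(&page->lru, &l_active);
                    continue;
            }
    }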

    Therefore this patch greatly prolongs the in-cache time of executable code,
    when there are moderate memory pressures.

    before patch: guaranteed to be cached if reference intervals < I
    after patch: guaranteed to be cached if reference intervals < I+A
    (except when randomly reclaimed by the lumpy reclaim)
    where
    A = time to fully scan the active file LRU
    I = time to fully scan the inactive file LRU

    Note that normally A >> I.

    side effects
    ============

    This patch is safe in general; it restores the pre-2.6.28 mmap() behavior,
    but in a much smaller and well-targeted scope.

    One may worry about someone abusing the PROT_EXEC heuristic. But as
    Andrew Morton stated, there are other tricks for getting that sort of boost.

    Another concern is the PROT_EXEC mapped pages growing large in rare cases,
    and therefore hurting reclaim efficiency. But a sane application targeted at
    a large audience will never use PROT_EXEC for data mappings. If some
    home-made application tries to abuse that bit, it should be aware of the
    consequences. If it is abused to the scale of 2/3 of total memory, it gains
    nothing but overhead.

    benchmarks
    ==========

    1) memory tight desktop

    1.1) brief summary

    - clock time and major faults are reduced by 50%;
    - pswpin numbers are reduced to ~1/3.

    That means X desktop responsiveness is doubled under high memory/swap pressure.

    1.2) test scenario

    - nfsroot gnome desktop with 512M physical memory
    - run some programs, and switch between the existing windows
    after starting each new program.

    1.3) progress timing (seconds)

    before after programs
    0.02 0.02 N xeyes
    0.75 0.76 N firefox
    2.02 1.88 N nautilus
    3.36 3.17 N nautilus --browser
    5.26 4.89 N gthumb
    7.12 6.47 N gedit
    9.22 8.16 N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
    13.58 12.55 N xterm
    15.87 14.57 N mlterm
    18.63 17.06 N gnome-terminal
    21.16 18.90 N urxvt
    26.24 23.48 N gnome-system-monitor
    28.72 26.52 N gnome-help
    32.15 29.65 N gnome-dictionary
    39.66 36.12 N /usr/games/sol
    43.16 39.27 N /usr/games/gnometris
    48.65 42.56 N /usr/games/gnect
    53.31 47.03 N /usr/games/gtali
    58.60 52.05 N /usr/games/iagno
    65.77 55.42 N /usr/games/gnotravex
    70.76 61.47 N /usr/games/mahjongg
    76.15 67.11 N /usr/games/gnome-sudoku
    86.32 75.15 N /usr/games/glines
    92.21 79.70 N /usr/games/glchess
    103.79 88.48 N /usr/games/gnomine
    113.84 96.51 N /usr/games/gnotski
    124.40 102.19 N /usr/games/gnibbles
    137.41 114.93 N /usr/games/gnobots2
    155.53 125.02 N /usr/games/blackjack
    179.85 135.11 N /usr/games/same-gnome
    224.49 154.50 N /usr/bin/gnome-window-properties
    248.44 162.09 N /usr/bin/gnome-default-applications-properties
    282.62 173.29 N /usr/bin/gnome-at-properties
    323.72 188.21 N /usr/bin/gnome-typing-monitor
    363.99 199.93 N /usr/bin/gnome-at-visual
    394.21 206.95 N /usr/bin/gnome-sound-properties
    435.14 224.49 N /usr/bin/gnome-at-mobility
    463.05 234.11 N /usr/bin/gnome-keybinding-properties
    503.75 248.59 N /usr/bin/gnome-about-me
    554.00 276.27 N /usr/bin/gnome-display-properties
    615.48 304.39 N /usr/bin/gnome-network-preferences
    693.03 342.01 N /usr/bin/gnome-mouse-properties
    759.90 388.58 N /usr/bin/gnome-appearance-properties
    937.90 508.47 N /usr/bin/gnome-control-center
    1109.75 587.57 N /usr/bin/gnome-keyboard-properties
    1399.05 758.16 N : oocalc
    1524.64 830.03 N : oodraw
    1684.31 900.03 N : ooimpress
    1874.04 993.91 N : oomath
    2115.12 1081.89 N : ooweb
    2369.02 1161.99 N : oowriter

    Note that the last ": oo*" commands are actually commented out.

    1.4) vmstat numbers (some relevant ones are marked with *)

    before after
    nr_free_pages 1293 3898
    nr_inactive_anon 59956 53460
    nr_active_anon 26815 30026
    nr_inactive_file 2657 3218
    nr_active_file 2019 2806
    nr_unevictable 4 4
    nr_mlock 4 4
    nr_anon_pages 26706 27859
    *nr_mapped 3542 4469
    nr_file_pages 72232 67681
    nr_dirty 1 0
    nr_writeback 123 19
    nr_slab_reclaimable 3375 3534
    nr_slab_unreclaimable 11405 10665
    nr_page_table_pages 8106 7864
    nr_unstable 0 0
    nr_bounce 0 0
    *nr_vmscan_write 394776 230839
    nr_writeback_temp 0 0
    numa_hit 6843353 3318676
    numa_miss 0 0
    numa_foreign 0 0
    numa_interleave 1719 1719
    numa_local 6843353 3318676
    numa_other 0 0
    *pgpgin 5954683 2057175
    *pgpgout 1578276 922744
    *pswpin 1486615 512238
    *pswpout 394568 230685
    pgalloc_dma 277432 56602
    pgalloc_dma32 6769477 3310348
    pgalloc_normal 0 0
    pgalloc_movable 0 0
    pgfree 7048396 3371118
    pgactivate 2036343 1471492
    pgdeactivate 2189691 1612829
    pgfault 3702176 3100702
    *pgmajfault 452116 201343
    pgrefill_dma 12185 7127
    pgrefill_dma32 334384 653703
    pgrefill_normal 0 0
    pgrefill_movable 0 0
    pgsteal_dma 74214 22179
    pgsteal_dma32 3334164 1638029
    pgsteal_normal 0 0
    pgsteal_movable 0 0
    pgscan_kswapd_dma 1081421 1216199
    pgscan_kswapd_dma32 58979118 46002810
    pgscan_kswapd_normal 0 0
    pgscan_kswapd_movable 0 0
    pgscan_direct_dma 2015438 1086109
    pgscan_direct_dma32 55787823 36101597
    pgscan_direct_normal 0 0
    pgscan_direct_movable 0 0
    pginodesteal 3461 7281
    slabs_scanned 564864 527616
    kswapd_steal 2889797 1448082
    kswapd_inodesteal 14827 14835
    pageoutrun 43459 21562
    allocstall 9653 4032
    pgrotated 384216 228631

    1.5) free numbers at the end of the tests

    before patch:
    total used free shared buffers cached
    Mem: 474 467 7 0 0 236
    -/+ buffers/cache: 230 243
    Swap: 1023 418 605

    after patch:
    total used free shared buffers cached
    Mem: 474 457 16 0 0 236
    -/+ buffers/cache: 221 253
    Swap: 1023 404 619

    2) memory flushing in a file server

    2.1) brief summary

    The number of major faults is reduced from 50 to 3 during 10% cache-hot reads.

    That means this patch successfully stops major faults when the active file
    list is slowly scanned while there is partially cache-hot streaming IO.

    2.2) test scenario

    Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
    pages will be activated:

    for i in `seq 0 100 10000000`; do echo $i 110; done > pattern-hot-10
    iotrace.rb --load pattern-hot-10 --play /b/sparse
    vmmon nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree

    and monitor /proc/vmstat during the time. The test box has 2G memory.

    I carried out tests on a freshly booted console as well as an X desktop, and
    fetched the vmstat numbers at

    (1) begin: shortly after the big read IO starts;
    (2) end: just before the big read IO stops;
    (3) restore: the big read IO stops and the zsh working set is restored;
    (4) restore X: after the IO, switch back and forth between the urxvt and
    firefox windows to restore their working set.

    2.3) console mode results

    nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree

    2.6.29 VM_EXEC protection ON:
    begin: 2481 2237 8694 630 0 574299
    end: 275 231976 233914 633 776271 20933042
    restore: 370 232154 234524 691 777183 20958453

    2.6.29 VM_EXEC protection ON (second run):
    begin: 2434 2237 8493 629 0 574195
    end: 284 231970 233536 632 771918 20896129
    restore: 399 232218 234789 690 774526 20957909

    2.6.30-rc4-mm VM_EXEC protection OFF:
    begin: 2479 2344 9659 210 0 579643
    end: 284 232010 234142 260 772776 20917184
    restore: 379 232159 234371 301 774888 20967849

    The above console numbers show that

    - The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
    I'd attribute that improvement to the mmap readahead improvements :-)

    - The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
    That's a huge improvement - which means that, with the VM_EXEC protection
    logic, active mmap pages are pretty safe even under partially cache-hot
    streaming IO.

    - When the active:inactive file LRU size reaches 1:1, their scan rates are
    1:20.8 under 10% cache-hot IO (computed with the formula
    Dpgdeactivate:Dpgfree). That roughly means the active mmap pages get 20.8
    times more chances to get re-referenced and stay in memory.

    - The absolute nr_mapped drops considerably, to 1/9, during the big IO, and
    the dropped pages are mostly inactive ones. The patch has almost no impact
    in this respect, which means it won't unnecessarily increase memory
    pressure. (In contrast, your 20% mmap protection ratio would keep them all,
    and therefore eliminate the extra 41 major faults needed to restore the
    working set of zsh etc.)

    The iotrace.rb read throughput is
    151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
    which means the inactive list is rotated at the speed of 250MB/s, so a full
    scan of it takes about 3.5 seconds, while a full scan of the active file
    list takes about 77 seconds.

    2.4) X mode results

    We can reach roughly the same conclusions for X desktop:

    nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree

    2.6.30-rc4-mm VM_EXEC protection ON:
    begin: 9740 8920 64075 561 0 678360
    end: 768 218254 220029 565 798953 21057006
    restore: 857 218543 220987 606 799462 21075710
    restore X: 2414 218560 225344 797 799462 21080795

    2.6.30-rc4-mm VM_EXEC protection OFF:
    begin: 9368 5035 26389 554 0 633391
    end: 770 218449 221230 661 646472 17832500
    restore: 1113 218466 220978 710 649881 17905235
    restore X: 2687 218650 225484 947 802700 21083584

    - the absolute nr_mapped drops considerably (to 1/13 of the original size)
    during the streaming IO.
    - the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
    during the whole process.

    Cc: Elladan
    Cc: Nick Piggin
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Collect vma->vm_flags of the VMAs that actually referenced the page.

    This prepares for more informed reclaim heuristics, e.g. to protect
    executable file pages more aggressively. For now only the VM_EXEC bit
    will be used by the caller.
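
    Concretely, the interface gains an output parameter; a hedged sketch of
    the new prototype (parameter names assumed):

    int page_referenced(struct page *page, int is_locked,
                        struct mem_cgroup *mem_cont, unsigned long *vm_flags);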

    Thanks to Johannes, Peter and Minchan for all the good tips.

    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The page allocation failure messages include a line that looks like

    page allocation failure. order:1, mode:0x4020

    The mode is easy to translate but irritating for the lazy and a bit error
    prone. This patch adds a very simple helper script gfp-translate for the
    mode: portion of the page allocation failure messages. An example usage
    looks like

    mel@machina:~/linux-2.6 $ scripts/gfp-translate 0x4020
    Source: /home/mel/linux-2.6
    Parsing: 0x4020
    #define __GFP_HIGH (0x20) /* Should access emergency pools? */
    #define __GFP_COMP (0x4000) /* Add compound page metadata */

    The script is not a work of art but it has come in handy for me a few
    times so I thought I would share.

    [akpm@linux-foundation.org: clarify an error message]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Pekka Enberg
    Cc: Christoph Hellwig
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As shmem_file_setup() does not modify/allocate/free/pass on the given
    filename, mark it as const.
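
    A hedged sketch of the prototype change (the pre-patch signature is
    reconstructed from the description):

    /* before: */
    struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags);
    /* after: */
    struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags);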

    Signed-off-by: Sergei Trofimovich
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergei Trofimovich
     
  • The file argument dates from address_space's ->readpage() a long time ago.

    We don't use it any more, so let's remove the unnecessary argument.
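
    A hedged sketch of the resulting signature change (assuming the swap read
    path entry point; the unused argument is simply dropped):

    /* before: */
    int swap_readpage(struct file *file, struct page *page);
    /* after: */
    int swap_readpage(struct page *page);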

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Hugh removed add_to_swap's gfp_mask argument ("mm: remove gfp_mask from
    add_to_swap"), so we have to remove the gfp_mask annotation from the
    function's documentation comment as well.

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • SRAT tables may contain nodes of very small size. The arch code may
    decide not to activate such a node. However, currently the early boot
    code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be
    active even though they have no present pages.

    For 64-bit, N_HIGH_MEMORY == N_NORMAL_MEMORY, so this works for 64-bit too.

    Signed-off-by: Yinghai Lu
    Tested-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu