16 Jun, 2011

40 commits

  • Commit a77aea92010acf ("cgroup: remove the ns_cgroup") removed the
    ns_cgroup but it forgot to remove the related doc in
    feature-removal-schedule.txt.

    Signed-off-by: WANG Cong
    Cc: Daniel Lezcano
    Cc: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     
  • Asynchronous compaction is used when promoting to huge pages. This is
    all very nice, but if a number of processes are compacting memory, a
    large number of pages can be isolated. An "asynchronous" process can
    stall for long periods of time as a result, with one user reporting
    that firefox could stall for tens of seconds. This patch aborts
    asynchronous compaction if too many pages are isolated, as it's better
    to fail a hugepage promotion than to stall a process.
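
    The bail-out can be sketched in userspace C. The structure and the
    half-of-LRU threshold below are illustrative stand-ins, not the
    kernel's actual too_many_isolated() logic:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone counters, standing in for NR_INACTIVE_* and
 * NR_ISOLATED_* vmstat counters. */
struct zone_counts {
    long inactive;  /* pages still on the LRU lists */
    long isolated;  /* pages currently isolated for migration */
};

/*
 * Sketch of the "too many isolated" bail-out: asynchronous compaction
 * gives up once the number of isolated pages rivals the number of
 * pages left on the LRU, rather than stalling the allocating process.
 */
static bool should_abort_async_compaction(const struct zone_counts *zc)
{
    return zc->isolated > zc->inactive / 2;
}
```

    Failing early here simply falls back to a base-page allocation, which
    is the cheap outcome the changelog prefers over a multi-second stall.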

    [minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
    Reported-and-tested-by: Ury Stankevich
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is unsafe to run page_count during the physical pfn scan because
    compound_head could trip on a dangling pointer when reading
    page->first_page if the compound page is being freed by another CPU.

    [mgorman@suse.de: split out patch]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Compaction works with two scanners, a migration and a free scanner.
    When the scanners cross over, migration within the zone is complete.
    The location of each scanner is recorded on each cycle to avoid
    excessive scanning.

    When a zone is small and mostly reserved, it's very easy for the
    migration scanner to be close to the end of the zone. Then the
    following situation can occur:

    o the migration scanner isolates some pages near the end of the zone
    o the free scanner starts at the end of the zone but finds that the
      migration scanner is already there
    o the free scanner gets reinitialised for the next cycle as
      cc->migrate_pfn + pageblock_nr_pages, moving the free scanner into
      the next zone
    o the migration scanner moves into the next zone

    When this happens, NR_ISOLATED accounting goes haywire because some of
    the accounting happens against the wrong zone. One zone's counter
    remains positive while the other goes negative, even though the
    overall global count is accurate. This was reported on X86-32 with
    !SMP because !SMP allows the negative counters to be visible; the bug
    should theoretically be possible elsewhere as well.
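
    The fix can be sketched as a clamp when the free scanner is restarted;
    the function name and constant below are illustrative, not the
    kernel's exact code:

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES 512UL  /* illustrative pageblock size */

/*
 * Sketch of the fix: when the free scanner is restarted just above the
 * migration scanner (cc->migrate_pfn + pageblock_nr_pages), clamp it so
 * it can never run past the end of the zone and into the next zone's
 * pfn range, where isolation would be accounted against the wrong zone.
 */
static unsigned long restart_free_scanner(unsigned long migrate_pfn,
                                          unsigned long zone_end_pfn)
{
    unsigned long free_pfn = migrate_pfn + PAGEBLOCK_NR_PAGES;

    if (free_pfn > zone_end_pfn)
        free_pfn = zone_end_pfn;  /* stay inside this zone */
    return free_pfn;
}
```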

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • fragmentation_index() returns -1000 when the allocation might succeed.
    This doesn't match the comment and code in compaction_suitable(). I
    thought compaction_suitable() should return COMPACT_PARTIAL in the
    -1000 case, because in this case the allocation could succeed
    depending on watermarks.

    The impact of this is that compaction starts and compact_finished() is
    called, which rechecks the watermarks and the free lists. The result
    should be the same, in that compaction should not start, but it is
    more expensive.

    Acked-by: Mel Gorman
    Signed-off-by: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Pages isolated for migration are accounted with the vmstat counters
    NR_ISOLATE_[ANON|FILE]. Callers of migrate_pages() are expected to
    increment these counters when pages are isolated from the LRU. Once the
    pages have been migrated, they are put back on the LRU or freed and the
    isolated count is decremented.

    Memory failure is not properly accounting for the pages it isolates,
    causing the NR_ISOLATED counters to go negative. On SMP builds, this goes
    unnoticed as negative counters are treated as 0 due to expected per-cpu
    drift. On UP builds, the counter is treated by too_many_isolated() as a
    large value causing processes to enter D state during page reclaim or
    compaction. This patch accounts for pages isolated by memory failure
    correctly.
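
    Why a negative counter is so harmful on UP can be shown with a tiny
    userspace model; the comparison below mimics how vmstat counters are
    read as unsigned values (the function is a simplified stand-in, not
    the kernel's too_many_isolated()):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Demo: the kernel compares vmstat counters as unsigned quantities, so
 * a small negative NR_ISOLATED value wraps to a huge positive one and
 * too_many_isolated() then stalls reclaim indefinitely (D state).
 */
static bool too_many_isolated(long isolated, long inactive)
{
    return (unsigned long)isolated > (unsigned long)inactive / 2;
}
```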

    [mel@csn.ul.ie: rewrote changelog]
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Minchan Kim
    Cc: Andi Kleen
    Acked-by: Mel Gorman

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • CONFIG_CONSTRUCTORS controls support for running constructor functions at
    kernel init time. According to commit b99b87f70c7785ab ("kernel:
    constructor support"), gcov (CONFIG_GCOV_KERNEL) needs this. However,
    CONFIG_CONSTRUCTORS currently defaults to y, with no option to disable it,
    and CONFIG_GCOV_KERNEL depends on it. Instead, default it to n and have
    CONFIG_GCOV_KERNEL select it, so that the normal case of
    CONFIG_GCOV_KERNEL=n will result in CONFIG_CONSTRUCTORS=n.
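
    A rough sketch of the resulting Kconfig shape (the exact option text,
    help, and dependencies in the real Kconfig files may differ):

```
config CONSTRUCTORS
	bool

config GCOV_KERNEL
	bool "Enable gcov-based kernel profiling"
	select CONSTRUCTORS
```

    With `select`, CONSTRUCTORS is pulled in only when gcov needs it, so
    the default configuration no longer carries the constructor-running
    init code.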

    Observed in the short list of =y values in a minimal kernel configuration.

    Signed-off-by: Josh Triplett
    Acked-by: WANG Cong
    Acked-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     
  • I shall maintain the legacy eeprom driver, until we finally get rid of it.

    Signed-off-by: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • Based on Michal Hocko's comment.

    We are not draining per-cpu cached charges during soft limit reclaim
    because background reclaim doesn't care about charges: it tries to
    free some memory, and freeing the charges would not contribute any.

    Cached charges might influence only the selection of the biggest soft
    limit offender, but as the call is made only after the selection has
    already been done, it makes no difference.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For performance, the memory cgroup caches some "charge" from
    res_counter into a per-cpu cache. This works well, but because it's a
    cache, it needs to be flushed in some cases. Typical cases are:

    1. when someone hits the limit.

    2. when rmdir() is called and the charges need to drop to 0.

    But "1" has a problem.

    Recently, with large SMP machines, we see many kworker runs because of
    flushing memcg's cache. The bad thing in the implementation is that the
    drain code is called even if a cpu's cache belongs to a memcg unrelated
    to the memcg which hit its limit.

    This patch:
    A) checks whether the percpu cache contains useful data,
    B) checks that no other asynchronous percpu drain is running,
    C) doesn't call the local cpu callback.

    (*) This patch avoids changing the calling condition for the hard
    limit.
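
    Checks A and B can be sketched as a filter applied before scheduling a
    drain work item on a remote cpu. The struct and function below are a
    simplified userspace model, not the kernel's actual stock/drain code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of one cpu's cached charge ("stock"). */
struct stock {
    int cached_memcg;       /* id of the memcg this cache belongs to */
    unsigned long charge;   /* cached charge, in pages */
    bool draining;          /* an async drain is already in flight */
};

/*
 * Filter before queueing drain work on a remote cpu:
 *  A) skip cpus whose cache holds nothing useful for the target memcg,
 *  B) skip cpus where an asynchronous drain is already running.
 * (C, draining the local cpu directly instead of via a work item, is
 * not modelled here.)
 */
static bool should_schedule_drain(const struct stock *s, int target_memcg)
{
    if (s->charge == 0 || s->cached_memcg != target_memcg)
        return false;   /* A: nothing relevant cached here */
    if (s->draining)
        return false;   /* B: a drain is already in flight */
    return true;
}
```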

    When I run "cat 1Gfile > /dev/null" under 300M limit memcg,

    [Before]
    13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat
    58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1
    60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1
    4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0
    57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1
    61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1
    62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1
    63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1

    [After]
    2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat
    2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top
    1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0

    [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Tested-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Hierarchical reclaim doesn't swap out if memsw and resource limits are
    the same (memsw_is_minimum == true) because we would hit the mem+swap
    limit anyway (during hard limit reclaim).

    If it comes to the soft limit, we shouldn't consider memsw_is_minimum
    at all, because it doesn't make much sense there. Either the soft
    limit is below the hard limit and then we cannot hit the mem+swap
    limit, or the direct reclaim takes precedence.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The following crash was reported:

    > Call Trace:
    > [] mem_cgroup_from_task+0x15/0x17
    > [] __mem_cgroup_try_charge+0x148/0x4b4
    > [] ? need_resched+0x23/0x2d
    > [] ? preempt_schedule+0x46/0x4f
    > [] mem_cgroup_charge_common+0x9a/0xce
    > [] mem_cgroup_newpage_charge+0x5d/0x5f
    > [] khugepaged+0x5da/0xfaf
    > [] ? __init_waitqueue_head+0x4b/0x4b
    > [] ? add_mm_counter.constprop.5+0x13/0x13
    > [] kthread+0xa8/0xb0
    > [] ? sub_preempt_count+0xa1/0xb4
    > [] kernel_thread_helper+0x4/0x10
    > [] ? retint_restore_args+0x13/0x13
    > [] ? __init_kthread_worker+0x5a/0x5a

    What happens is that khugepaged tries to charge a huge page against an mm
    whose last possible owner has already exited, and the memory controller
    crashes when the stale mm->owner is used to look up the cgroup to charge.

    mm->owner has never been set to NULL with the last owner going away, but
    nobody cared until khugepaged came along.

    Even then it wasn't a problem because the final mmput() on an mm was
    forced to acquire and release mmap_sem in write-mode, preventing an
    exiting owner to go away while the mmap_sem was held, and until "692e0b3
    mm: thp: optimize memcg charge in khugepaged", the memory cgroup charge
    was protected by mmap_sem in read-mode.

    Instead of going back to relying on the mmap_sem to enforce lifetime of a
    task, this patch ensures that mm->owner is properly set to NULL when the
    last possible owner is exiting, which the memory controller can handle
    just fine.
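
    The two halves of the fix can be sketched with stand-in types; the
    struct names and helpers below are hypothetical simplifications of
    task_struct, mm_struct, and the memcg charge path:

```c
#include <assert.h>
#include <stddef.h>

struct task { int unused; };    /* stand-in for struct task_struct */

struct mm {
    struct task *owner;         /* stand-in for mm->owner */
};

/*
 * Exit path: clear mm->owner when the exiting task was the owner and no
 * successor can be found, instead of leaving a dangling pointer behind.
 */
static void mm_clear_owner(struct mm *mm, struct task *exiting)
{
    if (mm->owner == exiting)
        mm->owner = NULL;       /* no other thread can own this mm */
}

/*
 * Charge path: treat a NULL owner as "nothing to charge" instead of
 * dereferencing a stale pointer, which is what crashed khugepaged.
 */
static int can_charge(const struct mm *mm)
{
    return mm->owner != NULL;
}
```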

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Hugh Dickins
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Reported-by: Hugh Dickins
    Reported-by: Dave Jones
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 21a3c9646873 ("memcg: allocate memory cgroup structures in local
    nodes") makes page_cgroup allocation NUMA aware. But that caused a
    problem: https://bugzilla.kernel.org/show_bug.cgi?id=36192.

    The problem was getting a NID from invalid struct pages, which were
    not initialized because they were out-of-node, i.e. outside of
    [node_start_pfn, node_end_pfn).

    With sparsemem, page_cgroup_init() scans pfns from 0 to max_pfn, so it
    may scan a pfn which is not on any node and access a memmap which is
    not initialized.

    This makes page_cgroup_init() for SPARSEMEM node aware and removes the
    code that got the nid from page->flags. (Then, a valid NID is always
    used.)
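
    The node-aware walk can be sketched as iterating each node's own pfn
    range instead of the flat 0..max_pfn scan. The struct and counter
    below are an illustrative userspace model, not the real
    page_cgroup_init():

```c
#include <assert.h>

/* Illustrative per-node pfn ranges; the kernel reads these from
 * NODE_DATA(nid)->node_start_pfn / node_spanned_pages. */
struct node_range {
    unsigned long start_pfn, end_pfn;   /* [start, end) */
};

/*
 * Iterate each node's own [node_start_pfn, node_end_pfn) range, so the
 * nid is valid by construction and holes between nodes (whose memmap
 * may be uninitialized) are never touched. The flat 0..max_pfn scan
 * would have walked pfns 100..199 in the test data below.
 */
static unsigned long count_scanned_pfns(const struct node_range *nodes,
                                        int nr_nodes)
{
    unsigned long scanned = 0;

    for (int nid = 0; nid < nr_nodes; nid++)
        for (unsigned long pfn = nodes[nid].start_pfn;
             pfn < nodes[nid].end_pfn; pfn++)
            scanned++;  /* init_section_page_cgroup(pfn, nid) in reality */
    return scanned;
}
```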

    [akpm@linux-foundation.org: try to fix up comments]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 406eb0c9ba76 ("memcg: add memory.numastat api for numa
    statistics") adds memory.numa_stat file for memory cgroup. But the file
    permissions are wrong.

    [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
    ---------- 1 root root 0 Jun 9 18:36 /cgroup/memory/A/memory.numa_stat

    This patch fixes the permission as

    [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
    -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • It seems that when a config option does not depend on the menuconfig,
    it messes up the display of the remaining config options, even if it
    is a hidden one.

    Signed-off-by: Eric Miao
    Cc: Richard Purdie
    Cc: Valdis Kletnieks
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Miao
     
  • When 1GB hugepages are allocated on a system, free(1) reports less
    available memory than what really is installed in the box. Also, if the
    total size of hugepages allocated on a system is over half of the total
    memory size, CommitLimit becomes a negative number.

    The problem is that gigantic hugepages (order > MAX_ORDER) can only be
    allocated at boot with bootmem, thus their frames are not accounted to
    'totalram_pages'. However, they are accounted to hugetlb_total_pages().

    What happens to turn CommitLimit into a negative number is this
    calculation, in fs/proc/meminfo.c:

    allowed = ((totalram_pages - hugetlb_total_pages())
    * sysctl_overcommit_ratio / 100) + total_swap_pages;

    A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.

    Also, every vm statistic which depends on 'totalram_pages' will render
    confusing values, as if system were 'missing' some part of its memory.
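
    The underflow is easy to demonstrate in userspace. The function below
    is a simplified model of the 'allowed' calculation quoted above, with
    illustrative page counts:

```c
#include <assert.h>

/*
 * Model of the CommitLimit calculation: totalram_pages excludes
 * boot-time gigantic hugepages, but hugetlb_total_pages() includes
 * them, so the subtraction can wrap around to a huge unsigned value
 * (rendered as a negative number in /proc/meminfo).
 */
static unsigned long commit_limit(unsigned long totalram_pages,
                                  unsigned long hugetlb_pages,
                                  unsigned long overcommit_ratio,
                                  unsigned long total_swap_pages)
{
    return (totalram_pages - hugetlb_pages) * overcommit_ratio / 100
           + total_swap_pages;
}
```

    With 1000 accounted pages but 2000 hugetlb pages (the gigantic pages
    never having been added to totalram_pages), the subtraction wraps and
    the "limit" becomes absurdly large.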

    Impact of this bug:

    When gigantic hugepages are allocated and sysctl_overcommit_memory ==
    OVERCOMMIT_NEVER, __vm_enough_memory() goes through the mentioned
    'allowed' calculation and might end up mistakenly returning -ENOMEM,
    thus forcing the system to start reclaiming pages earlier than would
    be usual, and this could cause a detrimental impact to overall system
    performance, depending on the workload.

    Besides the aforementioned scenario, I can only think of this causing
    annoyances with memory reports from /proc/meminfo and free(1).

    [akpm@linux-foundation.org: standardize comment layout]
    Reported-by: Russ Anderson
    Signed-off-by: Rafael Aquini
    Acked-by: Russ Anderson
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • During memory hotplug we refresh zonelists when we online a page in a new
    zone. It means that the node's zonelist is not initialized until pages
    are onlined. So for example, "nid" passed by MEM_GOING_ONLINE notifier
    will point to NODE_DATA(nid) which has no zone fallback list. Moreover,
    if we hot-add cpu-only nodes, alloc_pages() will do no fallback.

    This patch makes a zonelist when a new pgdat is available.

    Note: in production, at fujitsu, memory should be onlined before cpu
    and our server didn't have any memory-less nodes and had no problems.

    But recent changes in MEM_GOING_ONLINE+page_cgroup will access the
    not-yet-initialized zonelist of the node. In any case, memory-less
    nodes exist and need some care.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Remove calibrate_delay_direct()'s KERN_DEBUG printk related to bogomips
    calculation as it appears when booting every core on setups with
    'ignore_loglevel' which dmesg people scan for possible issues. As the
    message doesn't show very useful information to the widest audience of
    kernel boot message gazers, it should be removed.

    Introduced by commit d2b463135f84 ("init/calibrate.c: fix for critical
    bogoMIPS intermittent calculation failure").

    Signed-off-by: Borislav Petkov
    Cc: Andrew Worsley
    Cc: Phil Carmody
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • On m68k (which doesn't support generic hardirqs yet):

    drivers/w1/masters/ds1wm.c: In function `ds1wm_probe':
    drivers/w1/masters/ds1wm.c: error: implicit declaration of function `irq_set_irq_type'

    Signed-off-by: Geert Uytterhoeven
    Cc: Evgeniy Polyakov
    Cc: Jean-François Dagenais
    Cc: Matt Reimer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Signed-off-by: Nicolas Kaiser
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Kaiser
     
  • Add maintainers for the videobuf2 V4L2 driver framework.

    Signed-off-by: Pawel Osciak
    Acked-by: Marek Szyprowski
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pawel Osciak
     
  • Commit 4440673a95e6 ("leds: provide helper to register "leds-gpio"
    devices") broke the display of the NEW_LEDS menu as it didn't depend on
    NEW_LEDS and so made "LED drivers" and "LED Triggers" appear at the same
    level as "LED Support" instead of below it as it was before 4440673a.

    Moving LEDS_GPIO_REGISTER out of the menuconfig NEW_LEDS fixes this
    unintended side effect.

    Reported-by: Axel Lin
    Signed-off-by: Uwe Kleine-König
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • We call led_classdev_unregister/led_classdev_register in
    asic3_led_remove/asic3_led_probe, thus make LEDS_ASIC3 depend on
    LEDS_CLASS.

    This patch fixes below build error if LEDS_CLASS is not configured.

    LD .tmp_vmlinux1
    drivers/built-in.o: In function `asic3_led_remove':
    clkdev.c:(.devexit.text+0x1860): undefined reference to `led_classdev_unregister'
    drivers/built-in.o: In function `asic3_led_probe':
    clkdev.c:(.devinit.text+0xcee8): undefined reference to `led_classdev_register'
    make: *** [.tmp_vmlinux1] Error 1

    Signed-off-by: Axel Lin
    Cc: Paul Parsons
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Lin
     
  • Update my email address. Email to the old address will start bouncing
    soon.

    Signed-off-by: Balbir Singh
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • The "hostname" tool falls back to setting the hostname to "localhost" if
    /etc/hostname does not exist. Distribution init scripts have the same
    fallback. However, if userspace never calls sethostname, such as when
    booting with init=/bin/sh, or otherwise booting a minimal system without
    the usual init scripts, the default hostname of "(none)" remains,
    unhelpfully appearing in various places such as prompts ("root@(none):~#")
    and logs. Furthermore, "(none)" doesn't typically resolve to anything
    useful.

    Make the default hostname configurable. This removes the need for the
    standard fallback, provides a useful default for systems that never call
    sethostname, and makes minimal systems that much more useful with less
    configuration. Distributions could choose to use "localhost" here to
    avoid the fallback, while embedded systems may wish to use a specific
    target hostname.
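
    The option this adds is CONFIG_DEFAULT_HOSTNAME; a rough sketch of
    its Kconfig entry (wording of the real help text may differ):

```
config DEFAULT_HOSTNAME
	string "Default hostname"
	default "(none)"
	help
	  This option determines the default system hostname before
	  userspace calls sethostname(). Distributions may set it to
	  "localhost"; embedded systems can set a target-specific name.
```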

    Signed-off-by: Josh Triplett
    Acked-by: Linus Torvalds
    Acked-by: David Miller
    Cc: Serge Hallyn
    Cc: Kel Modderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     
  • BUILD_BUG_ON_ZERO and BUILD_BUG_ON_NULL must return values, even in
    the CHECKER case, otherwise various users of them become syntactically
    invalid.
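
    The non-checker definition shows why a value is required. This is the
    well-known negative-bitfield-width trick (a GCC extension as used in
    the kernel headers; the `_Static_assert`-free form shown is how older
    kernels spelled it):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Evaluates to 0 when e is false; fails to compile (negative bitfield
 * width) when e is true. Because it is an *expression* with a value, it
 * can be embedded inside other expressions, e.g. array-size sanity
 * checks -- so the sparse/CHECKER variant must also evaluate to a value
 * (plain 0), not expand to nothing.
 */
#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int : -!!(e); }))

/* Example use: the addition only compiles if int is at least 2 bytes. */
static char usage_example[1 + BUILD_BUG_ON_ZERO(sizeof(int) < 2)];
```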

    Signed-off-by: Dr. David Alan Gilbert
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dr. David Alan Gilbert
     
  • Commit 56de7263fcf3 ("mm: compaction: direct compact when a high-order
    allocation fails") introduced a check for cc->order == -1 in
    compact_finished. We should continue compacting in that case because
    the request came from userspace and there is no particular order to
    compact for. Similar check has been added by 82478fb7 (mm: compaction:
    prevent division-by-zero during user-requested compaction) for
    compaction_suitable.

    The check is, however, done after zone_watermark_ok(), which uses
    order as a right-hand argument for shifts. Not only is the watermark
    check pointless if we can break out without it, but it also evaluates
    1 << -1, which is not well defined (at least by the C standard).
    Let's move the -1 check above zone_watermark_ok().
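
    The reordering can be sketched as follows; the enum, stand-in
    watermark helper, and function below are illustrative simplifications
    of compact_finished(), not the kernel's code:

```c
#include <assert.h>
#include <stdbool.h>

enum compact_result { COMPACT_CONTINUE, COMPACT_PARTIAL };

/* Stand-in for zone_watermark_ok(); note it shifts by order, so a
 * caller must never pass order == -1 into it. */
static bool watermark_ok(long free_pages, int order)
{
    return free_pages >= (1L << order);   /* 1 << -1 would be UB here */
}

/*
 * Sketch of the fix: test for user-requested compaction (order == -1)
 * *before* the watermark check, so the undefined shift is never
 * evaluated and the pointless watermark computation is skipped.
 */
static enum compact_result finished(long free_pages, int order)
{
    if (order == -1)
        return COMPACT_CONTINUE;          /* compact the whole zone */
    if (!watermark_ok(free_pages, order))
        return COMPACT_CONTINUE;          /* allocation can't succeed yet */
    return COMPACT_PARTIAL;
}
```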

    [minchan.kim@gmail.com: caught compaction_suitable]
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Running a ktest.pl test, I hit the following bug on x86_32:

    ------------[ cut here ]------------
    WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
    Hardware name:
    Modules linked in:
    Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
    Call Trace:
    [] warn_slowpath_common+0x7c/0x91
    [] ? __kunmap_atomic+0x64/0xc1
    [] ? __kunmap_atomic+0x64/0xc1^M
    [] warn_slowpath_null+0x22/0x24
    [] __kunmap_atomic+0x64/0xc1
    [] unmap_vmas+0x43a/0x4e0
    [] exit_mmap+0x91/0xd2
    [] mmput+0x43/0xad
    [] exit_mm+0x111/0x119
    [] do_exit+0x1ff/0x5fa
    [] ? set_current_blocked+0x3c/0x40
    [] ? sigprocmask+0x7e/0x8e
    [] do_group_exit+0x65/0x88
    [] sys_exit_group+0x18/0x1c
    [] sysenter_do_call+0x12/0x38
    ---[ end trace 8055f74ea3c0eb62 ]---

    Running a ktest.pl git bisect, found the culprit: commit e303297e6c3a
    ("mm: extended batches for generic mmu_gather")

    But although this was the commit triggering the bug, it was not the one
    originally responsible for the bug. That was commit d16dfc550f53 ("mm:
    mmu_gather rework").

    The code in zap_pte_range() has something that looks like the following:

    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    do {
            [...]
    } while (pte++, addr += PAGE_SIZE, addr != end);
    pte_unmap_unlock(pte - 1, ptl);

    The pte starts off pointing at the first element in the page table
    directory that was returned by the pte_offset_map_lock(). When it's done
    with the page, pte will be pointing to anything between the next entry and
    the first entry of the next page inclusive. By doing a pte - 1, this puts
    the pte back onto the original page, which is all that pte_unmap_unlock()
    needs.

    In most archs (64 bit), this is not an issue as the pte is ignored in
    pte_unmap_unlock(). But on 32 bit archs, where things may be kmapped,
    it is essential that the pte passed to pte_unmap_unlock() resides on
    the same page that was given by pte_offset_map_lock().

    The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced
    a "break;" from the while loop. This alone did not seem to easily trigger
    the bug. But the modifications made by e303297e6 caused that "break;" to
    be hit on the first iteration, before the pte++.

    The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to
    be pointing to the previous page. This will cause the wrong page to be
    unmapped, and also trigger the warning above.

    The simple solution is to just save the pointer given by
    pte_offset_map_lock() and use it in the unlock.
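
    The pointer-arithmetic hazard and the fix can be modelled in
    userspace with a plain array standing in for the page table (names
    and types below are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

#define NENTRIES 4

/*
 * Userspace model of the zap_pte_range() pattern. The fragile version
 * unlocked with (pte - 1), which points *before* the table when the
 * break fires on the very first iteration (before pte++ ever runs).
 * The fix saves the pointer returned by pte_offset_map_lock() and uses
 * that saved pointer for the unlock.
 */
static const int *walk_and_return_unlock_ptr(const int *table, int stop_at)
{
    const int *start_pte = table;       /* saved pointer, as in the fix */
    const int *pte = start_pte;
    const int *end = table + NENTRIES;

    do {
        if (*pte == stop_at)
            break;                      /* may fire before any pte++ */
    } while (pte++, pte != end);

    return start_pte;                   /* pte_unmap_unlock(start_pte, ptl) */
}
```

    With the saved pointer, the unlock address is correct no matter where
    the loop breaks, including on the first iteration.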

    Signed-off-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Fix the wrong `if' condition in the check of whether the requested
    timer is available.

    The bitmap avail is used to store whether a timer is still available.
    test_bit() is used to check if the requested timer is available. If a
    bit in the avail bitmap is set, it means that the timer is available.

    The runtime effect would be that allocating a specific timer always fails
    (versus telling cs5535_mfgpt_alloc_timer to allocate the first available
    timer, which works).
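
    The corrected logic can be sketched in userspace; the helper below is
    a simplified value-based test_bit, not the kernel's pointer-based
    one, and the allocator is illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified test_bit: a set bit in 'avail' means the timer is free,
 * matching the cs5535-mfgpt convention described above. */
static bool test_bit(int nr, unsigned long avail)
{
    return (avail >> nr) & 1;
}

/*
 * A specific timer can be handed out only while its bit is still set;
 * the buggy condition had the test inverted, so requesting a specific
 * timer always failed.
 */
static int alloc_specific_timer(int nr, unsigned long *avail)
{
    if (!test_bit(nr, *avail))
        return -1;              /* timer already in use */
    *avail &= ~(1UL << nr);     /* mark it taken */
    return nr;
}
```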

    Signed-off-by: Christian Gmeiner
    Acked-by: Andres Salomon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Gmeiner
     
  • In the case of goto err_kzalloc, we should kfree target.

    Signed-off-by: Axel Lin
    Acked-by: Pratyush Anand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Lin
     
  • Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
    that zone_reclaim_mode doesn't work properly on his new NUMA server
    (Dual Xeon E5520 + Intel S5520UR MB). He is using Cyrus IMAPd, which
    is built on a very traditional single-process model.

    * a master process which reads config files and manages the other
      processes
    * multiple imapd processes, one per connection
    * multiple pop3d processes, one per connection
    * multiple lmtpd processes, one per connection
    * periodical "cleanup" processes.

    There are thousands of independent processes. The problem is that
    recent Intel motherboards turn on zone_reclaim_mode by default, and
    traditional prefork-model software doesn't work well with it.
    Unfortunately, such models are still typical even in the 21st
    century. We can't ignore them.

    This patch raises the zone_reclaim_mode threshold to 30. 30 doesn't
    have any specific meaning, but 20 means one-hop QPI/Hypertransport,
    and such relatively cheap 2-4 socket machines are often used for
    traditional servers as above. The intention is that these machines
    don't use zone_reclaim_mode.

    Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
    This patch doesn't change such high-end NUMA machine behavior.
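
    The policy being tuned can be sketched as a single comparison; the
    function below is an illustrative model of the boot-time decision,
    not the kernel's exact code:

```c
#include <assert.h>
#include <stdbool.h>

#define RECLAIM_DISTANCE 30     /* raised from 20 by this patch */

/*
 * zone_reclaim_mode is switched on at boot only when some pair of nodes
 * is further apart than RECLAIM_DISTANCE. With the threshold at 30, the
 * distance of 21 reported by cheap 2-4 socket boards no longer turns
 * node-local reclaim on.
 */
static bool want_zone_reclaim(int max_node_distance)
{
    return max_node_distance > RECLAIM_DISTANCE;
}
```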

    Dave Hansen said:

    : I know specifically of pieces of x86 hardware that set the information
    : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
    : behavior which that implies.
    :
    : They've done performance testing and run very large and scary benchmarks
    : to make sure that they _want_ this turned on. What this means for them
    : is that they'll probably be de-optimized, at least on newer versions of
    : the kernel.
    :
    : If you want to do this for particular systems, maybe _that_'s what we
    : should do. Have a list of specific configurations that need the
    : defaults overridden either because they're buggy, or they have an
    : unusual hardware configuration not really reflected in the distance
    : table.

    And later said:

    : The original change in the hardware tables was for the benefit of a
    : benchmark. Said benchmark isn't going to get run on mainline until the
    : next batch of enterprise distros drops, at which point the hardware where
    : this was done will be irrelevant for the benchmark. I'm sure any new
    : hardware will just set this distance to another yet arbitrary value to
    : make the kernel do what it wants. :)
    :
    : Also, when the hardware got _set_ to this initially, I complained. So, I
    : guess I'm getting my way now, with this patch. I'm cool with it.

    Reported-by: Robert Mueller
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Acked-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Warn about uses of printk_ratelimit() because it uses a global state and
    can hide subsequent useful messages.

    Signed-off-by: Joe Perches
    Cc: Andy Whitcroft
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Fix when CONFIG_PRINTK is not enabled:

    include/linux/kmsg_dump.h:56: error: 'EINVAL' undeclared (first use in this function)
    include/linux/kmsg_dump.h:61: error: 'EINVAL' undeclared (first use in this function)

    Looks like commit 595dd3d8bf95 ("kmsg_dump: fix build for
    CONFIG_PRINTK=n") uses EINVAL without having the needed header file(s),
    but I'm sure that I build tested that patch also. oh well.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • [akpm@linux-foundation.org: rework text, fit it into 80-cols]
    Signed-off-by: Ying Han
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • While testing the memcg-aware swap token, I observed that the swap
    token was often grabbed by an intermittently running process (eg init,
    auditd) and never released.

    Why?

    Some processes (eg init, auditd, audispd) wake up when a process
    exits. And the swap token can be grabbed by the first process to page
    in after a process exit leaves the token without an owner. Thus such
    intermittently running processes often get the token.

    And currently, the swap token priority is only decreased on the page
    fault path. So if the process sleeps immediately after grabbing the
    swap token, the token priority is never decreased. That's obviously
    undesirable.

    This patch implements a very poor (and lightweight) priority aging.
    It only affects the above corner case and doesn't change swap-heavy
    workload performance (eg a multi-process qsbench load).

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This is useful for observing swap token activity.

    example output:

    zsh-1845 [000] 598.962716: update_swap_token_priority:
    mm=ffff88015eaf7700 old_prio=1 new_prio=0
    memtoy-1830 [001] 602.033900: update_swap_token_priority:
    mm=ffff880037a45880 old_prio=947 new_prio=949
    memtoy-1830 [000] 602.041509: update_swap_token_priority:
    mm=ffff880037a45880 old_prio=949 new_prio=951
    memtoy-1830 [000] 602.051959: update_swap_token_priority:
    mm=ffff880037a45880 old_prio=951 new_prio=953
    memtoy-1830 [000] 602.052188: update_swap_token_priority:
    mm=ffff880037a45880 old_prio=953 new_prio=955
    memtoy-1830 [001] 602.427184: put_swap_token:
    token_mm=ffff880037a45880
    zsh-1789 [000] 602.427281: replace_swap_token:
    old_token_mm= (null) old_prio=0 new_token_mm=ffff88015eaf7018
    new_prio=2
    zsh-1789 [001] 602.433456: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=2 new_prio=4
    zsh-1789 [000] 602.437613: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=4 new_prio=6
    zsh-1789 [000] 602.443924: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=6 new_prio=8
    zsh-1789 [000] 602.451873: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=8 new_prio=10
    zsh-1789 [001] 602.462639: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=10 new_prio=12

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, memcg reclaim can disable the swap token even if the swap
    token mm doesn't belong to its memory cgroup. That's slightly risky.
    If an admin creates a very small mem-cgroup and somebody runs a
    contentious, heavy memory pressure workload in it, every task is
    going to lose the swap token and the system may become unresponsive.
    That's bad.

    This patch adds a 'memcg' parameter to disable_swap_token(); if the
    parameter doesn't match the swap token, the VM doesn't disable it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Cc: Michael Hennerich
    Cc: Mike Frysinger
    Cc: Matthew Garrett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Signed-off-by: Michael Hennerich
    Signed-off-by: Mike Frysinger
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Hennerich
     
  • Commit a8bef8ff6ea1 ("mm: migration: avoid race between shift_arg_pages()
    and rmap_walk() during migration by not migrating temporary stacks")
    introduced a BUG_ON() to ensure that VM_STACK_FLAGS and
    VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time
    one, so BUILD_BUG_ON is more appropriate.

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko