20 Aug, 2013

1 commit

  • commit 2b047252d087be7f2ba088b4933cd904f92e6fce upstream.

    Ben Tebulin reported:

    "Since v3.7.2 on two independent machines a very specific Git
    repository fails in 9/10 cases on git-fsck due to an SHA1/memory
    failures. This only occurs on a very specific repository and can be
    reproduced stably on two independent laptops. Git mailing list ran
    out of ideas and for me this looks like some very exotic kernel issue"

    and bisected the failure to the backport of commit 53a59fc67f97 ("mm:
    limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

    That commit itself is not actually buggy, but what it does is to make it
    much more likely to hit the partial TLB invalidation case, since it
    introduces a new case in tlb_next_batch() that previously only ever
    happened when running out of memory.

    The real bug is that the TLB gather virtual memory range setup is subtly
    buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather:
    enable tlb flush range in generic mmu_gather"), and the range handling
    was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB
    range flushed when __tlb_remove_page() runs out of slots"), but that fix
    was not complete.

    The problem with the TLB gather virtual address range is that it isn't
    set up by the initial tlb_gather_mmu() initialization (which didn't get
    the TLB range information), but it is set up ad-hoc later by the
    functions that actually flush the TLB. And so any such case that forgot
    to update the TLB range entries would potentially miss TLB invalidates.

    Rather than try to figure out exactly which particular ad-hoc range
    setup was missing (I personally suspect it's the hugetlb case in
    zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
    did), this patch just gets rid of the problem at the source: make the
    TLB range information available to tlb_gather_mmu(), and initialize it
    when initializing all the other tlb gather fields.

    This makes the patch larger, but conceptually much simpler. And the end
    result is much more understandable; even if you want to play games with
    partial ranges when invalidating the TLB contents in chunks, now the
    range information is always there, and anybody who doesn't want to
    bother with it won't introduce subtle bugs.
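
    To illustrate the shape of the change, here is a minimal stand-alone
    model (plain C with invented names; the real code deals with struct
    mmu_gather and tlb_gather_mmu()): the flush range becomes part of the
    gather state at initialization time, instead of being patched in ad hoc
    by whoever happens to flush.

    #include <stdio.h>

    /* Toy stand-in for struct mmu_gather: the range travels with it. */
    struct gather {
            unsigned long start;
            unsigned long end;
    };

    /* Modeled on the fix: the caller hands the range in up front. */
    static void gather_init(struct gather *g, unsigned long start,
                            unsigned long end)
    {
            g->start = start;
            g->end = end;
    }

    static void gather_flush(const struct gather *g)
    {
            /* Whoever flushes always has a valid range to work with. */
            printf("flush TLB range [%#lx, %#lx)\n", g->start, g->end);
    }

    int main(void)
    {
            struct gather g;

            gather_init(&g, 0x400000, 0x800000);
            gather_flush(&g);
            return 0;
    }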

    Ben verified that this fixes his problem.

    Reported-bisected-and-tested-by: Ben Tebulin
    Build-testing-by: Stephen Rothwell
    Build-testing-by: Richard Weinberger
    Reviewed-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

04 Aug, 2013

1 commit

  • commit e6c495a96ce02574e765d5140039a64c8d4e8c9e upstream.

    zap_pte_range loops from @addr to @end. In the middle, if it runs out of
    batching slots, TLB entries need to be flushed for @start to @interim,
    NOT @interim to @end.

    Since the ARC port doesn't use page free batching I can't test it myself,
    but this seems like the right thing to do.

    Observed this when working on a fix for the issue at thread:
    http://www.spinics.net/lists/linux-arch/msg21736.html
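
    A small self-contained model of the loop (not the kernel code; the page
    size and slot count are made up) shows the bookkeeping the fix wants:
    when the batch fills up mid-range, the partial flush covers what has
    already been unmapped, and the next batch starts where the flush ended.

    #include <stdio.h>

    #define PAGE   0x1000UL
    #define SLOTS  4                /* pretend batch capacity */

    static void flush(unsigned long start, unsigned long end)
    {
            printf("flush [%#lx, %#lx)\n", start, end);
    }

    int main(void)
    {
            unsigned long start = 0x1000, end = 0x9000;
            unsigned long addr, flushed_to = start;
            int used = 0;

            for (addr = start; addr < end; addr += PAGE) {
                    /* "unmap" the page at addr ... */
                    if (++used == SLOTS) {
                            /* out of slots: flush start..interim,
                             * not interim..end */
                            flush(flushed_to, addr + PAGE);
                            flushed_to = addr + PAGE;
                            used = 0;
                    }
            }
            if (used)
                    flush(flushed_to, end);
            return 0;
    }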

    Signed-off-by: Vineet Gupta
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Acked-by: Catalin Marinas
    Cc: Max Filippov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vineet Gupta
     

06 Jun, 2013

1 commit

  • Since the introduction of preemptible mmu_gather, TLB fast mode has been
    broken. TLB fast mode relies on there being absolutely no concurrency;
    it frees pages first and invalidates TLBs later.

    However now we can get concurrency and stuff goes *bang*.

    This patch removes all tlb_fast_mode() code; this was found to be the
    better option versus trying to patch the hole by entangling TLB
    invalidation with the scheduler.

    Cc: Thomas Gleixner
    Cc: Russell King
    Cc: Tony Luck
    Reported-by: Max Filippov
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

01 May, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small
    code cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
    mm: Convert print_symbol to %pSR
    gfs2: Convert print_symbol to %pSR
    m32r: Convert print_symbol to %pSR
    iostats.txt: add easy-to-find description for field 6
    x86 cmpxchg.h: fix wrong comment
    treewide: Fix typo in printk and comments
    doc: devicetree: Fix various typos
    docbook: fix 8250 naming in device-drivers
    pata_pdc2027x: Fix compiler warning
    treewide: Fix typo in printks
    mei: Fix comments in drivers/misc/mei
    treewide: Fix typos in kernel messages
    pm44xx: Fix comment for "CONFIG_CPU_IDLE"
    doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
    mmzone: correct "pags" to "pages" in comment.
    kernel-parameters: remove outdated 'noresidual' parameter
    Remove spurious _H suffixes from ifdef comments
    sound: Remove stray pluses from Kconfig file
    radio-shark: Fix printk "CONFIG_LED_CLASS"
    doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
    ...

    Linus Torvalds
     

30 Apr, 2013

1 commit

  • Currently the memory barrier in __do_huge_pmd_anonymous_page doesn't
    work: lru_cache_add_lru uses a pagevec, so it can easily miss the
    spinlock, the intended ordering rule is broken, and users might see
    inconsistent data.

    I was not the first person to point out the problem. Mel and Peter
    pointed it out a few months ago, and Peter noted further that even
    spin_lock/unlock can't guarantee the ordering:

    http://marc.info/?t=134333512700004

    In particular:

    *A = a;
    LOCK
    UNLOCK
    *B = b;

    may occur as:

    LOCK, STORE *B, STORE *A, UNLOCK

    At last, Hugh pointed out that we don't even need a memory barrier
    there, because __SetPageUptodate already does it explicitly, since
    Nick's commit 0ed361dec369 ("mm: fix PageUptodate data race").

    So this patch fixes the comment on THP and adds the same comment to
    do_anonymous_page too, because everybody except Hugh was missing it;
    that in itself shows we need such a comment.

    Signed-off-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

29 Apr, 2013

1 commit


17 Apr, 2013

1 commit

  • Various drivers end up replicating the code to mmap() their memory
    buffers into user space, and our core memory remapping function may be
    very flexible but it is unnecessarily complicated for the common cases
    to use.

    Our internal VM uses pfn's ("page frame numbers") which simplifies
    things for the VM, and allows us to pass physical addresses around in a
    denser and more efficient format than passing a "phys_addr_t" around,
    and having to shift it up and down by the page size. But it just means
    that drivers end up doing that shifting instead at the interface level.

    It also means that drivers end up mucking around with internal VM things
    like the vma details (vm_pgoff, vm_start/end) way more than they really
    need to.

    So this just exports a function to map a certain physical memory range
    into user space (using a phys_addr_t based interface that is much more
    natural for a driver) and hides all the complexity from the driver.
    Some drivers will still end up tweaking the vm_page_prot details for
    things like prefetching or cacheability etc, but that's actually
    relevant to the driver, rather than caring about what the page offset of
    the mapping is into the particular IO memory region.
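
    A hedged sketch of what a driver's mmap() handler can now look like (the
    device structure and field names here are invented; the point is that
    the driver passes a phys_addr_t and a length and never touches vm_pgoff
    or pfns itself):

    #include <linux/fs.h>
    #include <linux/mm.h>

    /* Hypothetical device private data: one physical buffer to expose. */
    struct mydev {
            phys_addr_t buf_phys;
            unsigned long buf_len;
    };

    static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
    {
            struct mydev *dev = file->private_data;

            /* The driver only decides on what it actually cares about,
             * such as cacheability; the helper does the pfn math, the
             * range checks and the remap_pfn_range() call. */
            vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
            return vm_iomap_memory(vma, dev->buf_phys, dev->buf_len);
    }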

    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Apr, 2013

1 commit

  • This patch attempts to fix:

    https://bugzilla.kernel.org/show_bug.cgi?id=56461

    The symptom is a crash and messages like this:

    chrome: Corrupted page table at address 34a03000
    *pdpt = 0000000000000000 *pde = 0000000000000000
    Bad pagetable: 000f [#1] PREEMPT SMP

    Ingo guesses this got introduced by commit 611ae8e3f520 ("x86/tlb:
    enable tlb flush range support for x86") since that code started to free
    unused pagetables.

    On x86-32 PAE kernels, that new code has the potential to free an entire
    PMD page and will clear one of the four page-directory-pointer-table
    (aka pgd_t) entries.

    The hardware aggressively "caches" these top-level entries and invlpg
    does not actually affect the CPU's copy. If we clear one we *HAVE* to
    do a full TLB flush, otherwise we might continue using a freed pmd page.
    (note, we do this properly on the population side in pud_populate()).

    This patch tracks whenever we clear one of these entries in the 'struct
    mmu_gather', and ensures that we follow up with a full tlb flush.

    BTW, I disassembled and checked that:

    if (tlb->fullmm == 0)
    and
    if (!tlb->fullmm && !tlb->need_flush_all)

    generate essentially the same code, so there should be zero impact there
    to the !PAE case.
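
    A tiny model of the bookkeeping (userspace C, names shortened and
    invented): the gather structure remembers that a top-level entry was
    cleared, and the flush path upgrades to a full flush whenever that flag
    or fullmm is set.

    #include <stdio.h>

    struct gather {
            int fullmm;
            int need_flush_all;
            unsigned long start, end;
    };

    /* Called wherever a page-directory-pointer-table entry is cleared. */
    static void note_pud_clear(struct gather *g)
    {
            g->need_flush_all = 1;
    }

    static void finish(struct gather *g)
    {
            if (g->fullmm || g->need_flush_all)
                    puts("full TLB flush");
            else
                    printf("range flush [%#lx, %#lx)\n", g->start, g->end);
    }

    int main(void)
    {
            struct gather g = { 0, 0, 0x400000, 0x800000 };

            note_pud_clear(&g);     /* a PMD page was freed */
            finish(&g);             /* prints "full TLB flush" */
            return 0;
    }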

    Signed-off-by: Dave Hansen
    Cc: Peter Anvin
    Cc: Ingo Molnar
    Cc: Artem S Tashkinov
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

26 Feb, 2013

1 commit

  • Pull module update from Rusty Russell:
    "The sweeping change is to make add_taint() explicitly indicate whether
    to disable lockdep, but it's a mechanical change."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Add option to not sign modules during modules_install
    MODSIGN: Add -s option to sign-file
    MODSIGN: Specify the hash algorithm on sign-file command line
    MODSIGN: Simplify Makefile with a Kconfig helper
    module: clean up load_module a little more.
    modpost: Ignore ARC specific non-alloc sections
    module: constify within_module_*
    taint: add explicit flag to show whether lock dep is still OK.
    module: printk message when module signature fail taints kernel.

    Linus Torvalds
     

24 Feb, 2013

8 commits

  • I dislike the way in which "swapcache" gets used in do_swap_page():
    there is always a page from swapcache there (even if maybe uncached by
    the time we lock it), but tests are made according to "swapcache".
    Rework that with "page != swapcache", as has been done in unuse_pte().

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In "ksm: remove old stable nodes more thoroughly" I said that I'd never
    seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing,
    but it soon appeared once I tried fuller tests on the whole series.

    It turned out to be due to the KSM page migration itself:
    unmerge_and_remove_all_rmap_items() failed to locate and replace all
    the KSM pages, because of that hiatus in page migration when the old
    pte has been replaced by a migration entry but not yet by the new pte.
    follow_page() finds no page at that instant, but a KSM page reappears
    shortly after, without a fault.

    Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
    for KSM's break_cow(). I'd have preferred to avoid another flag, and do
    it every time, in case someone else makes the same easy mistake; but did
    not find another transgressor (the common get_user_pages() is of course
    safe), and cannot be sure that every follow_page() caller is prepared to
    sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
    already sleep there, since anon_vma locking was changed to mutex, but
    maybe that's somehow excluded.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This change adds a follow_page_mask function which is equivalent to
    follow_page, but with an extra page_mask argument.

    follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
    a THP page, and to 0 in other cases.

    __get_user_pages() makes use of this in order to accelerate populating
    THP ranges - that is, when both the pages and vmas arrays are NULL, we
    don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
    we also avoid taking mm->page_table_lock that many times).
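
    The arithmetic a caller can do with *page_mask is easy to show
    stand-alone (a userspace demo only; HPAGE_PMD_NR is assumed to be 512,
    i.e. a 2MB THP made of 4k pages):

    #include <stdio.h>

    #define PAGE_SHIFT      12
    #define HPAGE_PMD_NR    512     /* 2MB THP / 4kB pages, as on x86 */

    int main(void)
    {
            unsigned long start = 0x200000 + 3 * 4096;  /* 3 pages into a THP */
            unsigned long page_mask = HPAGE_PMD_NR - 1; /* as reported */

            /* Base pages covered in one step: the remainder of the huge
             * page from 'start' onward, instead of looping 512 times. */
            unsigned long page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);

            printf("advance by %lu base pages\n", page_increm);  /* 509 */
            return 0;
    }

    Starting three base pages into a 2MB huge page, a single step advances
    by the remaining 509 base pages instead of iterating over them one by
    one.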

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Use long type for page counts in mm_populate() so as to avoid integer
    overflow when running the following test code:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
            void *p = mmap(NULL, 0x100000000000, PROT_READ,
                           MAP_PRIVATE | MAP_ANON, -1, 0);
            printf("p: %p\n", p);
            mlockall(MCL_CURRENT);
            printf("done\n");
            return 0;
    }

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Switching merge_across_nodes after running KSM is liable to oops on stale
    nodes still left over from the previous stable tree. It's not something
    that people will often want to do, but it would be lame to demand a reboot
    when they're trying to determine which merge_across_nodes setting is best.

    How can this happen? We only permit switching merge_across_nodes when
    pages_shared is 0, and usually set run 2 to force that beforehand, which
    ought to unmerge everything: yet oopses still occur when you then run 1.

    Three causes:

    1. The old stable tree (built according to the inverse
    merge_across_nodes) has not been fully torn down. A stable node
    lingers until get_ksm_page() notices that the page it references no
    longer references it: but the page is not necessarily freed as soon as
    expected, particularly when swapcache.

    Fix this with a pass through the old stable tree, applying
    get_ksm_page() to each of the remaining nodes (most found stale and
    removed immediately), with forced removal of any left over. Unless the
    page is still mapped: I've not seen that case, it shouldn't occur, but
    better to WARN_ON_ONCE and EBUSY than BUG.

    2. __ksm_enter() has a nice little optimization, to insert the new mm
    just behind ksmd's cursor, so there's a full pass for it to stabilize
    (or be removed) before ksmd addresses it. Nice when ksmd is running,
    but not so nice when we're trying to unmerge all mms: we were missing
    those mms forked and inserted behind the unmerge cursor. Easily fixed
    by inserting at the end when KSM_RUN_UNMERGE.

    3. It is possible for a KSM page to be faulted back from swapcache
    into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
    it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
    private to ksm.c, so dissolve the distinction between
    ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
    the one call into ksm.c.

    A long outstanding, unrelated bugfix sneaks in with that third fix:
    ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
    error when read in from swap) to a page which it then marks Uptodate. Fix
    this case by not copying, letting do_swap_page() discover the error.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit
    NUMA configuration with NUMA Balancing will still need an extra page
    field. As Peter notes "Completely dropping 32bit support for
    CONFIG_NUMA_BALANCING would simplify things, but it would also remove
    the warning if we grow enough 64bit only page-flags to push the last-cpu
    out."

    [mgorman@suse.de: minor modifications]
    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
    the vma type - we know we're working with a stack. So, we can call
    directly into __mlock_vma_pages_range(), and remove the last
    make_pages_present() call site.

    Note that we don't use mm_populate() here, so we can't release the
    mmap_sem while allocating new stack pages. This is deemed acceptable,
    because the stack vmas grow by a bounded number of pages at a time, and
    these are anon pages so we don't have to read from disk to populate
    them.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When ex-KSM pages are faulted from swap cache, the fault handler is not
    capable of re-establishing anon_vma-spanning KSM pages. In this case, a
    copy of the page is created instead, just like during a COW break.

    These freshly made copies are known to be exclusive to the faulting VMA
    and there is no reason to go look for this page in parent and sibling
    processes during rmap operations.

    Use page_add_new_anon_rmap() for these copies. This also puts them on
    the proper LRU lists and marks them SwapBacked, so we can get rid of
    doing this ad-hoc in the KSM copy code.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Simon Jeons
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Satoru Moriya
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Jan, 2013

1 commit


10 Jan, 2013

1 commit

  • The check for a pmd being in the process of being split was dropped by
    mistake by commit d10e63f29488 ("mm: numa: Create basic numa page
    hinting infrastructure"). Put it back.

    Reported-by: Dave Jones
    Debugged-by: Hillf Danton
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Kirill Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Jan, 2013

1 commit

  • Since commit e303297e6c3a ("mm: extended batches for generic
    mmu_gather") we are batching pages to be freed until either
    tlb_next_batch cannot allocate a new batch or we are done.

    This works just fine most of the time, but we can get into trouble with
    non-preemptible kernels (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
    on large machines, where too aggressive batching might lead to soft
    lockups during the process exit path (exit_mmap) because there are no
    scheduling points down the free_pages_and_swap_cache path and so the
    freeing can take long enough to trigger the soft lockup.

    The lockup is harmless except when the system is set up to panic on
    soft lockup, which is not that unusual.

    The simplest way to work around this issue is to limit the maximum
    number of batches in a single mmu_gather. 10k of collected pages should
    be safe to prevent soft lockups (we would have 2ms for one) even if
    they are all freed without an explicit scheduling point.

    This patch doesn't add any new explicit scheduling points; it relies on
    zap_pmd_range, which calls cond_resched once per PMD while zapping the
    page tables.
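
    A rough stand-alone model of the cap (constants invented; the real limit
    works out to roughly 10k collected pages per mmu_gather): once the batch
    count hits the limit, no further batch is handed out, which forces a
    flush before more pages can be gathered.

    #include <stdbool.h>
    #include <stdio.h>

    #define BATCH_PAGES      512                    /* pages per batch (made up) */
    #define MAX_BATCH_COUNT  (10000 / BATCH_PAGES)  /* cap at ~10k pages */

    struct gather {
            int  batch_count;
            long gathered;
    };

    /* Model of tlb_next_batch(): refuse a new batch once the cap is hit. */
    static bool next_batch(struct gather *g)
    {
            if (g->batch_count == MAX_BATCH_COUNT)
                    return false;
            g->batch_count++;
            return true;
    }

    int main(void)
    {
            struct gather g = { 0, 0 };
            long page = 0;

            while (page < 100000) {
                    if (g.gathered == (long)g.batch_count * BATCH_PAGES &&
                        !next_batch(&g)) {
                            printf("flush after %ld pages\n", g.gathered);
                            g.batch_count = 0;      /* batches freed by the flush */
                            g.gathered = 0;
                            continue;               /* retry this page */
                    }
                    g.gathered++;
                    page++;
            }
            if (g.gathered)
                    printf("final flush of %ld pages\n", g.gathered);
            return 0;
    }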

    The following lockup has been reported for a 3.0 kernel with a huge
    process (on the order of hundreds of gigs, but I do not know any more
    details).

    BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
    Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
    Supported: Yes
    CPU 56
    Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
    RIP: 0010: _raw_spin_unlock_irqrestore+0x8/0x10
    RSP: 0018:ffff883ec1037af0 EFLAGS: 00000206
    RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
    RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
    RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
    R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
    R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
    FS: 00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
    Call Trace:
    release_pages+0xc5/0x260
    free_pages_and_swap_cache+0x9d/0xc0
    tlb_flush_mmu+0x5c/0x80
    tlb_finish_mmu+0xe/0x50
    exit_mmap+0xbd/0x120
    mmput+0x49/0x120
    exit_mm+0x122/0x160
    do_exit+0x17a/0x430
    do_group_exit+0x3d/0xb0
    get_signal_to_deliver+0x247/0x480
    do_signal+0x71/0x1b0
    do_notify_resume+0x98/0xb0
    int_signal+0x12/0x17
    DWARF2 unwinder stuck at int_signal+0x12/0x17

    Signed-off-by: Michal Hocko
    Cc: [3.0+]
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Dec, 2012

2 commits


17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs, but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

4 commits

  • page_mkwrite is initialized with zero and only set once; from that point
    on there is no way to get to the oom or oom_free_new labels.

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Dominik Dingel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominik Dingel
     
  • We have two different implementation of is_zero_pfn() and my_zero_pfn()
    helpers: for architectures with and without zero page coloring.

    Let's consolidate them in <asm-generic/pgtable.h>.
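
    The difference between the two variants is small; a simplified model
    (plain C with illustrative values, not the kernel helpers themselves)
    looks like this:

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long zero_pfn = 100;        /* pfn of the (first) zero page */
    static unsigned long zero_page_count = 8;   /* > 1 only with page coloring */

    /* Without coloring there is exactly one zero page ... */
    static bool is_zero_pfn_plain(unsigned long pfn)
    {
            return pfn == zero_pfn;
    }

    /* ... with coloring there is a small contiguous block of zero pages
     * and any pfn inside it counts. */
    static bool is_zero_pfn_colored(unsigned long pfn)
    {
            return pfn - zero_pfn < zero_page_count;
    }

    int main(void)
    {
            printf("%d %d\n", is_zero_pfn_plain(103), is_zero_pfn_colored(103));
            return 0;
    }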

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pass vma instead of mm and add address parameter.

    In most cases we already have the vma on the stack. We provide
    split_huge_page_pmd_mm() for the few cases when we have the mm but not
    the vma.

    This change is preparation to huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • On write access to huge zero page we alloc a new huge page and clear it.

    If that fails with ENOMEM, we fall back gracefully: create a new pmd
    table and set the pte around the fault address to a newly allocated
    normal (4k) page, with all other ptes in the pmd set to the normal zero
    page.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Dec, 2012

1 commit

  • On x86 memory accesses to pages without the ACCESSED flag set result in
    the ACCESSED flag being set automatically. With the ARM architecture a
    page access fault is raised instead (and it will continue to be raised
    until the ACCESSED flag is set for the appropriate PTE/PMD).

    For normal memory pages, handle_pte_fault will call pte_mkyoung
    (effectively setting the ACCESSED flag). For transparent huge pages,
    pmd_mkyoung will only be called for a write fault.

    This patch ensures that faults on transparent hugepages which do not
    result in a CoW update the access flags for the faulting pmd.

    Signed-off-by: Will Deacon
    Cc: Chris Metcalf
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Ni zhan Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     

11 Dec, 2012

8 commits

  • The PTE scanning rate and fault rates are two of the biggest sources of
    system CPU overhead with automatic NUMA placement. Ideally a proper policy
    would detect if a workload was properly placed, schedule and adjust the
    PTE scanning rate accordingly. We do not track the necessary information
    to do that but we at least know if we migrated or not.

    This patch scans slower if a page was not migrated as the result of a
    NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
    now higher than the previous default. Once every minute it will reset
    the scanner in case of phase changes.

    This is hilariously crude and the numbers are arbitrary. Workloads will
    converge quite slowly in comparison to what a proper policy should be able
    to do. On the plus side, we will chew up less CPU for workloads that have
    no need for automatic balancing.
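
    A toy model of the backoff (userspace C; the constants are invented and
    the kernel works in terms of p->numa_scan_period and sysctls): each
    hinting fault that did not migrate anything pushes the scan period
    towards the maximum, and a periodic reset catches phase changes.

    #include <stdbool.h>
    #include <stdio.h>

    #define SCAN_PERIOD_MIN_MS   1000       /* invented values */
    #define SCAN_PERIOD_MAX_MS  60000
    #define SCAN_PERIOD_STEP_MS   500

    static unsigned int scan_period_ms = SCAN_PERIOD_MIN_MS;

    /* Called from the (modeled) NUMA hinting fault path. */
    static void numa_hinting_fault(bool migrated)
    {
            if (migrated)
                    return;  /* only the non-migrated case is described above */

            /* Page was already well placed: scan slower, up to the max. */
            scan_period_ms += SCAN_PERIOD_STEP_MS;
            if (scan_period_ms > SCAN_PERIOD_MAX_MS)
                    scan_period_ms = SCAN_PERIOD_MAX_MS;
    }

    /* Once a minute the period is reset in case the workload changed phase. */
    static void periodic_reset(void)
    {
            scan_period_ms = SCAN_PERIOD_MIN_MS;
    }

    int main(void)
    {
            for (int i = 0; i < 5; i++)
                    numa_hinting_fault(false);
            printf("period after 5 non-migrating faults: %u ms\n",
                   scan_period_ms);
            periodic_reset();
            printf("period after the periodic reset: %u ms\n", scan_period_ms);
            return 0;
    }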

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • To say that the PMD handling code was incorrectly transferred from autonuma
    is an understatement. The intention was to handle a PMD's worth of pages
    in the same fault and effectively batch the taking of the PTL and page
    migration. The copied version instead has the impact of clearing a number
    of pte_numa PTE entries and whether any page migration takes place depends
    on racing. This just happens to work in some cases.

    This patch handles pte_numa faults in batch when a pmd_numa fault is
    handled. The pages are migrated if they are currently misplaced.
    Essentially this is making an assumption that NUMA locality is
    on a PMD boundary but that could be addressed by only setting
    pmd_numa if all the pages within that PMD are on the same node
    if necessary.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u         = basic unit = sizeof(void *)
    Ca        = cost of struct page access = sizeof(struct page) / u
    Cpte      = Cost PTE access = Ca
    Cupdate   = Cost PTE update = (2 * Cpte) + (2 * Wlock)
                where Cpte is incurred twice for a read and a write and
                Wlock is a constant representing the cost of taking or
                releasing a lock
    Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
    Cpagerw   = Cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci        = Cost of page isolation = Ca + Wi
                where Wi is a constant that should reflect the approximate
                cost of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
                where Wnuma is the approximate NUMA factor. 1 is local.
                1.2 would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte      * numa_pte_updates +
                     Cnumahint * numa_hint_faults +
                     Ci        * numa_pages_migrated +
                     Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges, the
    expectation would be that the number of remote numa hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
                  where this is measured for the last N number of
                  numa hints recorded. When the workload is fully
                  converged the value is 1.

    This can measure if the placement policy is converging and how fast it is
    doing it.
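
    Written out as a small program, the model is easy to plug numbers into
    (all constants and counter values below are placeholders; only the
    formulas come from the description above):

    #include <stdio.h>

    int main(void)
    {
            /* Constants from the model above; magnitudes are placeholders. */
            double u     = sizeof(void *);
            double Ca    = 64.0 / u;            /* sizeof(struct page) ~ 64 */
            double Cpte  = Ca;
            double Wi    = 10, Wnuma = 1.2;
            double Cnumahint = 1000;
            double Cpagerw   = Ca + 4096.0 / u; /* PAGE_SIZE assumed 4096 */
            double Ci        = Ca + Wi;
            double Cpagecopy = Cpagerw + Cpagerw * Wnuma + Ci + Ci * Wnuma;

            /* Example vmstat counters (made up). */
            double numa_pte_updates       = 1e6;
            double numa_hint_faults       = 2e5;
            double numa_hint_faults_local = 1.5e5;
            double numa_pages_migrated    = 5e4;

            double cost = Cpte      * numa_pte_updates +
                          Cnumahint * numa_hint_faults +
                          Ci        * numa_pages_migrated +
                          Cpagecopy * numa_pages_migrated;
            double convergence = numa_hint_faults_local / numa_hint_faults;

            printf("balancing cost (in units of u): %.0f\n", cost);
            printf("convergence: %.2f\n", convergence);
            return 0;
    }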

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
     
  • NOTE: This patch is based on "sched, numa, mm: Add fault driven
    placement and migration policy" but as it throws away all the policy
    to just leave a basic foundation I had to drop the signed-offs-by.

    This patch creates a bare-bones method for setting PTEs pte_numa in the
    context of the scheduler that when faulted later will be faulted onto the
    node the CPU is running on. In itself this does nothing useful but any
    placement policy will fundamentally depend on receiving hints on placement
    from fault context and doing something intelligent about it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Peter Zijlstra
     
  • Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
    sufficiently different that the signed-off-bys were dropped

    Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
    pieces into an effective migrate on fault scheme.

    Note that (on x86) we rely on PROT_NONE pages being !present and avoid
    the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
    page-migration performance.

    Based-on-work-by: Peter Zijlstra
    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Note: This patch started as "mm/mpol: Create special PROT_NONE
    infrastructure" and preserves the basic idea but steals *very*
    heavily from "autonuma: numa hinting page faults entry points" for
    the actual fault handlers without the migration parts. The end
    result is barely recognisable as either patch so all Signed-off
    and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
    this version, I will re-add the signed-offs-by to reflect the history.

    In order to facilitate a lazy -- fault driven -- migration of pages, create
    a special transient PAGE_NUMA variant; we can then use the 'spurious'
    protection faults to drive our migrations from it.

    The meaning of PAGE_NUMA depends on the architecture but on x86 it is
    effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
    NUMA faults for the reason that the page fault code checks the permission on
    the VMA (and will throw a segmentation fault on actual PROT_NONE mappings),
    before it ever calls handle_mm_fault.

    [dhillf@gmail.com: Fix typo]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • Introduce FOLL_NUMA to tell follow_page to check
    pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
    so because it always invokes handle_mm_fault and retries the
    follow_page later.

    KVM secondary MMU page faults will trigger the NUMA hinting page
    faults through gup_fast -> get_user_pages -> follow_page ->
    handle_mm_fault.

    Other follow_page callers like KSM should not use FOLL_NUMA, or they
    would fail to get the pages if they use follow_page instead of
    get_user_pages.

    [ This patch was picked up from the AutoNUMA tree. ]

    Originally-by: Andrea Arcangeli
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ ported to this tree. ]
    Signed-off-by: Ingo Molnar
    Reviewed-by: Rik van Riel

    Andrea Arcangeli
     
  • With transparent hugepage support, handle_mm_fault() has to be careful
    that a normal PMD has been established before handling a PTE fault. To
    achieve this, it used __pte_alloc() directly instead of pte_alloc_map
    as pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map()
    is called once it is known the PMD is safe.

    pte_alloc_map() is smart enough to check if a PTE is already present
    before calling __pte_alloc but this check was lost. As a consequence,
    PTEs may be allocated unnecessarily and the page table lock taken.
    This useless PTE does get cleaned up but it's a performance hit which
    is visible in page_test from aim9.

    This patch simply re-adds the check normally done by pte_alloc_map to
    check if the PTE needs to be allocated before taking the page table
    lock. The effect is noticeable in page_test from aim9.

    AIM9
                     2.6.38-vanilla      2.6.38-checkptenone
    creat-clo        446.10 ( 0.00%)     424.47 (-5.10%)
    page_test         38.10 ( 0.00%)      42.04 ( 9.37%)
    brk_test          52.45 ( 0.00%)      51.57 (-1.71%)
    exec_test        382.00 ( 0.00%)     456.90 (16.39%)
    fork_test         60.11 ( 0.00%)      67.79 (11.34%)
    MMTests Statistics: duration
    Total Elapsed Time (seconds)         611.90              612.22

    (While this affects 2.6.38, it is a performance rather than a
    functional bug and normally outside the rules -stable. While the big
    performance differences are to a microbench, the difference in fork
    and exec performance may be significant enough that -stable wants to
    consider the patch)

    Reported-by: Raz Ben Yehuda
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrea Arcangeli
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    [ Picked this up from the AutoNUMA tree to help
    it upstream and to allow apples-to-apples
    performance comparisons. ]
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Nov, 2012

1 commit

  • do_wp_page() sets mmun_called if mmun_start and mmun_end were
    initialized and, if so, may call mmu_notifier_invalidate_range_end()
    with these values. This doesn't prevent gcc from emitting a build
    warning though:

    mm/memory.c: In function `do_wp_page':
    mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function
    mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function

    It's much easier to initialize the variables to impossible values and do
    a simple comparison to determine whether they were initialized, which
    removes the bool entirely.
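
    The pattern is simple enough to show in isolation (a stand-alone sketch
    with shortened names; the kernel code deals with mmun_start/mmun_end in
    do_wp_page()):

    #include <stdio.h>

    int main(void)
    {
            /* Start with values the range can never legitimately take, so a
             * plain comparison replaces the separate "was it set?" bool that
             * confused gcc's uninitialized-variable analysis. */
            unsigned long mmun_start = 0, mmun_end = 0;
            int take_notifier_path = 1;     /* stand-in for the real condition */

            if (take_notifier_path) {
                    mmun_start = 0x1000;
                    mmun_end   = 0x2000;
            }

            if (mmun_end > mmun_start)      /* true iff the range was set up */
                    printf("invalidate_range_end(%#lx, %#lx)\n",
                           mmun_start, mmun_end);
            return 0;
    }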

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Oct, 2012

3 commits

  • When a transparent hugepage is mapped and it is included in an mlock()
    range, follow_page() incorrectly avoids setting the page's mlock bit and
    moving it to the unevictable lru.

    This is evident if you try to mlock(), munlock(), and then mlock() a
    range again. Currently:

    #define MAP_SIZE (4UL << 30)    /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
    Inactive(anon): 6304 kB
    Unevictable: 4213924 kB

    munlock(ptr, MAP_SIZE);

    Inactive(anon): 4186252 kB
    Unevictable: 19652 kB

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4198556 kB
    Unevictable: 21684 kB

    Notice that less than 2MB was added to the unevictable list; this is
    because these pages in the range are not transparent hugepages since the
    4GB range was allocated with mmap() and has no specific alignment. If
    posix_memalign() were used instead, unevictable would not have grown at
    all on the second mlock().

    The fix is to call mlock_vma_page() so that the mlock bit is set and the
    page is added to the unevictable list. With this patch:

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4056 kB
    Unevictable: 4213940 kB

    munlock(ptr, MAP_SIZE);

    Inactive(anon): 4198268 kB
    Unevictable: 19636 kB

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4008 kB
    Unevictable: 4213940 kB

    Signed-off-by: David Rientjes
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • In order to allow sleeping during invalidate_page mmu notifier calls, we
    need to avoid calling when holding the PT lock. In addition to its direct
    calls, invalidate_page can also be called as a substitute for a change_pte
    call, in case the notifier client hasn't implemented change_pte.

    This patch drops the invalidate_page call from change_pte, and instead
    wraps all calls to change_pte with invalidate_range_start and
    invalidate_range_end calls.

    Note that change_pte still cannot sleep after this patch, and that clients
    implementing change_pte should not take action on it in case the number of
    outstanding invalidate_range_start calls is larger than one, otherwise
    they might miss a later invalidation.

    Signed-off-by: Haggai Eran
    Cc: Andrea Arcangeli
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haggai Eran