04 Jul, 2014

5 commits

  • Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
    in isolate_lru_pages() at mm/vmscan.c:1281!

    Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") looks like interrupted work-in-progress.

    mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
    - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
    when the page is always visible in radix_tree, and often already on LRU.

    Revert change to shmem_write_begin(), and use init_page_accessed() or
    mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().
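
    A rough sketch of the rule as applied in shmem_getpage_gfp()
    (illustrative only, not the actual diff):

        /* page found in cache or brought back from swap: may be on LRU */
        if (sgp == SGP_WRITE)
                mark_page_accessed(page);

        /* freshly allocated page: not yet visible, so safe to use */
        if (sgp == SGP_WRITE)
                init_page_accessed(page);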

    SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
    before; but since many other filesystems use [__]page_symlink(), which did
    and does mark the page accessed, consider this as rectifying an oversight.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Until now, the kernel has had the same policy for handling victimized
    page frames that belong to kernel space (reserved/slab subsystem) or
    are non-LRU (unknown page state). In other words, the result of
    handling either of these victimized page frames is (IGNORED | FAILED),
    and the return value of memory_failure() is -EBUSY.

    This patch avoids memory_failure() returning early merely because
    (!PageLRU(p)) is true, and it also ensures that action_result() can
    report more precise information ("reserved kernel", "kernel slab", and
    "unknown page state") instead of "non LRU", especially for memory
    errors detected by memory scrubbing.
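
    An illustrative sketch of the intent (not the actual diff; the exact
    call sites and result codes in mm/memory-failure.c may differ):

        /*
         * Instead of bailing out with a generic "non LRU" result as soon
         * as !PageLRU(p) is seen, classify the page first:
         */
        if (PageReserved(p))
                action_result(pfn, "reserved kernel", IGNORED);
        else if (PageSlab(p))
                action_result(pfn, "kernel slab", IGNORED);
        else
                action_result(pfn, "unknown page state", IGNORED);
        return -EBUSY;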

    Andi said:

    : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
    :
    : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
    : page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000
    : page flags: 0x3ffff800004081(locked|slab|head)
    : ------------[ cut here ]------------
    : kernel BUG at mm/rmap.c:1495!
    :
    : I think what happened is that a LRU page turned into a slab page in
    : parallel with offlining. memory_failure initially tests for this case,
    : but doesn't retest later after the page has been locked.
    :
    : ...
    :
    : I ran this patch in a loop over night with some stress plus
    : the mcelog test suite running in a loop. I cannot guarantee it hit it,
    : but it should have given it a good beating.
    :
    : The kernel survived with no messages, although the mcelog test suite
    : got killed at some point because it couldn't fork anymore. Probably
    : some unrelated problem.
    :
    : So the patch is ok for me for .16.

    Signed-off-by: Chen Yucong
    Acked-by: Naoya Horiguchi
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Fix a regression caused by 7fc34a62ca44 ("mm/msync.c: sync only the
    requested range in msync()").

    The xfstests generic/075 failure occurred on ext4 in data=journal mode
    because the intended range was not synced due to a wrong fstart
    calculation.

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Reported-by: Eric Whitney
    Tested-by: Eric Whitney
    Acked-by: Matthew Wilcox
    Reviewed-by: Lukas Czerner
    Tested-by: Lukas Czerner
    Reviewed-by: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namjae Jeon
     
  • min_partial means the minimum number of slabs cached on the node
    partial list. So, if nr_partial is less than it, we keep a newly empty
    slab on the node partial list rather than freeing it. But if
    nr_partial is equal to or greater than it, we already have enough
    partial slabs and should free the newly empty slab. The current
    implementation misses the equal case, so if min_partial is set to 0,
    at least one slab can still be cached. This is a critical problem for
    the kmemcg destroying logic, which doesn't work properly if any slabs
    are cached. This patch fixes that problem.
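
    A minimal sketch of the off-by-one fix in __slab_free() (the real
    patch adjusts the same comparison in more than one place):

        /*
         * Free the newly empty slab once the node already caches at
         * least min_partial slabs; the old ">" let one extra slab linger.
         */
        if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
                goto slab_empty;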

    Fixes: 91cb69620284 ("slub: make dead memcg caches discard free slabs
    immediately")

    Signed-off-by: Joonsoo Kim
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
    the following is triggered at early boot:

    SMP: Total of 8 processors activated.
    devtmpfs: initialized
    Unable to handle kernel NULL pointer dereference at virtual address 00000008
    pgd = fffffe0000050000
    [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
    Internal error: Oops: 96000006 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
    task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
    PC is at __list_add+0x10/0xd4
    LR is at free_one_page+0x270/0x638
    ...
    Call trace:
    __list_add+0x10/0xd4
    free_one_page+0x26c/0x638
    __free_pages_ok.part.52+0x84/0xbc
    __free_pages+0x74/0xbc
    init_cma_reserved_pageblock+0xe8/0x104
    cma_init_reserved_areas+0x190/0x1e4
    do_one_initcall+0xc4/0x154
    kernel_init_freeable+0x204/0x2a8
    kernel_init+0xc/0xd4

    This happens because init_cma_reserved_pageblock() calls
    __free_one_page() with pageblock_order as page order but it is bigger
    than MAX_ORDER. This in turn causes accesses past zone->free_list[].

    Fix the problem by changing init_cma_reserved_pageblock() such that it
    splits pageblock into individual MAX_ORDER pages if pageblock is bigger
    than a MAX_ORDER page.
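
    A sketch of the shape of the fix (illustrative; helper names as used
    elsewhere in mm/page_alloc.c):

        /* init_cma_reserved_pageblock(): never free with order > MAX_ORDER-1 */
        if (pageblock_order >= MAX_ORDER) {
                int i = pageblock_nr_pages;
                struct page *p = page;

                do {
                        set_page_refcounted(p);
                        __free_pages(p, MAX_ORDER - 1);
                        p += MAX_ORDER_NR_PAGES;
                } while (i -= MAX_ORDER_NR_PAGES);
        } else {
                set_page_refcounted(page);
                __free_pages(page, pageblock_order);
        }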

    In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
    architectures except for ia64, powerpc and tile at the moment, the
    "pageblock_order > MAX_ORDER" condition will be optimised out since
    both sides of the operator are constants. In cases where pageblock
    size is variable, the performance degradation should not be
    significant anyway, since init_cma_reserved_pageblock() is called only
    at boot time, at most MAX_CMA_AREAS times, which by default is eight.

    Signed-off-by: Michal Nazarewicz
    Reported-by: Mark Salter
    Tested-by: Mark Salter
    Tested-by: Christopher Covington
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Marek Szyprowski
    Cc: Catalin Marinas
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     

24 Jun, 2014

9 commits

  • In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    introduced vma merging to mbind(), but it should have also changed the
    convention of passing start vma from queue_pages_range() (formerly
    check_range()) to new_vma_page(): vma merging may have already freed
    that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
    worse crashes.

    Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    Reported-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: [2.6.34+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit b1cb0982bdd6 ("change the management method of free objects of
    the slab") introduced a bug in the slab leak detector
    ('/proc/slab_allocators'). The detector works as follows:

    1. traverse all objects on all the slabs.
    2. determine whether it is active or not.
    3. if active, print who allocate this object.

    But that commit changed how free objects are managed, so the logic for
    determining whether an object is active also had to change.
    Previously, we regarded objects in the cpu caches as inactive; with
    that commit, we mistakenly regard objects in the cpu caches as active.

    This introduces a kernel oops if DEBUG_PAGEALLOC is enabled. With
    DEBUG_PAGEALLOC, kernel_map_pages() is used to detect who corrupts
    free memory in the slab: it unmaps the page table mapping if an object
    is free and maps it if the object is active. When the slab leak
    detector checks an object in the cpu caches, it mistakenly thinks the
    object is active and tries to access the object's memory to retrieve
    the caller of the allocation. At that point no page table mapping to
    the object exists, so the oops occurs.

    Following is oops message reported from Dave.

    It blew up when something tried to read /proc/slab_allocators
    (Just cat it, and you should see the oops below)

    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    [snip...]
    CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
    task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
    RIP: 0010:[] [] handle_slab+0x8a/0x180
    RSP: 0018:ffff880076925de0 EFLAGS: 00010002
    RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
    RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
    RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
    R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
    FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
    DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
    Call Trace:
    leaks_show+0xce/0x240
    seq_read+0x28e/0x490
    proc_reg_read+0x3d/0x80
    vfs_read+0x9b/0x160
    SyS_read+0x58/0xb0
    tracesys+0xd4/0xd9
    Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
    RIP handle_slab+0x8a/0x180

    To fix the problem, I introduce an object status buffer on each slab.
    With this, we can track object status precisely, so the slab leak
    detector no longer accesses freed objects and no kernel oops occurs.
    The memory overhead caused by this fix is only imposed on
    CONFIG_DEBUG_SLAB_LEAK, which is mainly used for debugging, so the
    overhead isn't a big problem.

    Signed-off-by: Joonsoo Kim
    Reported-by: Dave Jones
    Reported-by: Tetsuo Handa
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Trinity finds that mmap access to a hole while it's punched from shmem
    can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
    from completing, until the reader chooses to stop; with the puncher's
    hold on i_mutex locking out all other writers until it can complete.

    It appears that the tmpfs fault path is too light in comparison with its
    hole-punching path, lacking an i_data_sem to obstruct it; but we don't
    want to slow down the common case.

    Extend shmem_fallocate()'s existing range notification mechanism, so
    shmem_fault() can refrain from faulting pages into the hole while it's
    punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
    faulting when not).
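
    A simplified sketch of the fault-side check (field names follow
    struct shmem_falloc in mm/shmem.c; the real patch also handles the
    wait and retry details):

        /* shmem_fault(): back off while a hole is punched over this index */
        if (unlikely(inode->i_private)) {
                struct shmem_falloc *shmem_falloc = inode->i_private;

                if (shmem_falloc &&
                    vmf->pgoff >= shmem_falloc->start &&
                    vmf->pgoff < shmem_falloc->next)
                        return VM_FAULT_RETRY; /* or sleep on i_mutex when safe */
        }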

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Trinity has reported:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
    CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
    3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
    lock_acquire (arch/x86/include/asm/current.h:14
    kernel/locking/lockdep.c:3602)
    _raw_spin_lock (include/linux/spinlock_api_smp.h:143
    kernel/locking/spinlock.c:151)
    remove_migration_pte (mm/migrate.c:137)
    rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
    remove_migration_ptes (mm/migrate.c:224)
    migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
    migrate_misplaced_page (mm/migrate.c:1733)
    __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
    handle_mm_fault (mm/memory.c:3948)
    __get_user_pages (mm/memory.c:1851)
    __mlock_vma_pages_range (mm/mlock.c:255)
    __mm_populate (mm/mlock.c:711)
    SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

    I believe this comes about because, whereas collapsing and splitting THP
    functions take anon_vma lock in write mode (which excludes concurrent
    rmap walks), faulting THP functions (write protection and misplaced
    NUMA) do not - and mostly they do not need to.

    But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
    an instant (indeed, for a long instant, given the inter-CPU TLB flush
    in there), leaves *pmd neither present nor trans_huge.

    Which can confuse a concurrent rmap walk, as when removing migration
    ptes, seen in the dumped trace. Although that rmap walk has a 4k page
    to insert, anon_vmas containing THPs are in no way segregated from
    4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
    that instant when a trans_huge pmd is temporarily absent.

    I don't think we need to strengthen the locking at the THP end: it's
    easily handled with an ACCESS_ONCE() before testing both conditions.
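
    Roughly, the check in mm_find_pmd() becomes (sketch):

        pmd_t pmde = ACCESS_ONCE(*pmd);

        /*
         * A racing THP fault may momentarily leave *pmd neither present
         * nor trans_huge; read it once and bail out in either case.
         */
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                return NULL;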

    And since mm_find_pmd() had only one caller who wanted a THP rather than
    a pmd, let's slightly repurpose it to fail when it hits a THP or
    non-present pmd, and open code split_huge_page_address() again.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: Bob Liu
    Cc: Christoph Lameter
    Cc: Dave Jones
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops
    in copy_page_rep() called from copy_user_huge_page() called from
    do_huge_pmd_wp_page().

    I believe this is a DEBUG_PAGEALLOC false positive, due to the source
    page being split, and a tail page freed, while copy is in progress; and
    not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will
    prevent a miscopy from being made visible.

    Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to
    the usual get_page() and put_page() on head page in the usual config;
    but get and put references to all of the tail pages when
    DEBUG_PAGEALLOC.
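
    A sketch of what such a helper could look like (illustrative only;
    HPAGE_PMD_NR is the number of subpages in a THP):

        static void get_user_huge_page(struct page *page)
        {
                if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
                        struct page *endpage = page + HPAGE_PMD_NR;

                        /* pin every subpage so no tail page is freed mid-copy */
                        for (; page < endpage; page++)
                                get_page(page);
                } else {
                        get_page(page);
                }
        }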

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.
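
    A sketch of the handler logic (the constant name is an assumption):

        /* 8 is the documented minimum fraction; 0 restores the default */
        #define MIN_PERCPU_PAGELIST_FRACTION    (8)

        /* Reject 1..7; accept 0 (restore default) and anything >= 8 */
        if (percpu_pagelist_fraction &&
            percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION)
                return -EINVAL;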

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There's a race between fork() and hugepage migration; as a result we
    try to "dereference" a swap entry as a normal pte, causing a kernel
    panic. The cause of the problem is that copy_hugetlb_page_range()
    can't handle the "swap entry" family (migration entries and hwpoisoned
    entries), so let's fix it.
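
    The shape of the fix, roughly (sketch only; the real code also writes
    the downgraded entry back to the source pte):

        entry = huge_ptep_get(src_pte);
        if (unlikely(!huge_pte_none(entry) && !pte_present(entry))) {
                swp_entry_t swp_entry = pte_to_swp_entry(entry);

                if (is_write_migration_entry(swp_entry) && cow) {
                        /* COW mappings require writable entries to be read-only */
                        make_migration_entry_read(&swp_entry);
                        entry = swp_entry_to_pte(swp_entry);
                }
                set_huge_pte_at(dst, addr, dst_pte, entry);
        }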

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE
    support being added to fallocate(), but didn't realize until now that
    I had been too stupid to future-proof shmem_fallocate() against new
    additions. Return -EOPNOTSUPP for unsupported modes instead of going
    on to ordinary fallocation.
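
    The fix amounts to a mode check at the top of shmem_fallocate()
    (sketch):

        if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
                return -EOPNOTSUPP;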

    Signed-off-by: Hugh Dickins
    Reviewed-by: Lukas Czerner
    Cc: [3.15]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mm can be removed from the current task struct, so use the previous
    vma->vm_mm instead.

    It crashes on blackfin after updating to Linux 3.15. The commit "mm:
    per-thread vma caching" caused the crash: mm can be removed from the
    current task struct before

    mmput()->
    exit_mmap()->
    delete_vma_from_mm()

    the detailed fault information:

    NULL pointer access
    Kernel OOPS in progress
    Deferred Exception context
    CURRENT PROCESS:
    COMM=modprobe PID=278 CPU=0
    invalid mm
    return address: [0x000531de]; contents of:
    0x000531b0: c727 acea 0c42 181d 0000 0000 0000 a0a8
    0x000531c0: b090 acaa 0c42 1806 0000 0000 0000 a0e8
    0x000531d0: b0d0 e801 0000 05b3 0010 e522 0046 [a090]
    0x000531e0: 6408 b090 0c00 17cc 3042 e3ff f37b 2fc8

    CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25
    task: 0572b720 ti: 0569e000 task.ti: 0569e000
    Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0)
    ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off)
    Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014

    SEQUENCER STATUS: Not tainted
    SEQSTAT: 00000027 IPEND: 8008 IMASK: ffff SYSCFG: 2806
    EXCAUSE : 0x27
    physical IVG3 asserted : { _trap + 0x0 }
    physical IVG15 asserted : { _evt_system_call + 0x0 }
    logical irq 6 mapped : { _bfin_coretmr_interrupt + 0x0 }
    logical irq 7 mapped : { _bfin_fault_routine + 0x0 }
    logical irq 11 mapped : { _l2_ecc_err + 0x0 }
    logical irq 13 mapped : { _bfin_fault_routine + 0x0 }
    logical irq 39 mapped : { _bfin_twi_interrupt_entry + 0x0 }
    logical irq 40 mapped : { _bfin_twi_interrupt_entry + 0x0 }
    RETE: /* Maybe null pointer? */
    RETN: /* kernel dynamic memory (maybe user-space) */
    RETX: /* Maybe fixed code section */
    RETS: { _exit_mmap + 0x28 }
    PC : { _delete_vma_from_mm + 0x92 }
    DCPLB_FAULT_ADDR: /* Maybe null pointer? */
    ICPLB_FAULT_ADDR: { _delete_vma_from_mm + 0x92 }
    PROCESSOR STATE:
    R0 : 00000004 R1 : 0569e000 R2 : 00bf3db4 R3 : 00000000
    R4 : 057f9800 R5 : 00000001 R6 : 0569ddd0 R7 : 0572b720
    P0 : 0572b854 P1 : 00000004 P2 : 00000000 P3 : 0569dda0
    P4 : 0572b720 P5 : 0566c368 FP : 0569fe5c SP : 0569fd74
    LB0: 057f523f LT0: 057f523e LC0: 00000000
    LB1: 0005317c LT1: 00053172 LC1: 00000002
    B0 : 00000000 L0 : 00000000 M0 : 0566f5bc I0 : 00000000
    B1 : 00000000 L1 : 00000000 M1 : 00000000 I1 : ffffffff
    B2 : 00000001 L2 : 00000000 M2 : 00000000 I2 : 00000000
    B3 : 00000000 L3 : 00000000 M3 : 00000000 I3 : 057f8000
    A0.w: 00000000 A0.x: 00000000 A1.w: 00000000 A1.x: 00000000
    USP : 056ffcf8 ASTAT: 02003024

    Hardware Trace:
    0 Target : { _trap_c + 0x0 }
    Source : { _exception_to_level5 + 0xa0 } JUMP.L
    1 Target : { _exception_to_level5 + 0x0 }
    Source : { _bfin_return_from_exception + 0x6 } RTX
    2 Target : { _bfin_return_from_exception + 0x0 }
    Source : { _ex_trap_c + 0x70 } JUMP.S
    3 Target : { _ex_trap_c + 0x0 }
    Source : { _trap + 0x2a } JUMP (P4)
    4 Target : { _trap + 0x0 }
    FAULT : { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2]
    Source : { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18]
    5 Target : { _delete_vma_from_mm + 0x8e }
    Source : { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel
    6 Target : { _delete_vma_from_mm + 0x0 }
    Source : { _exit_mmap + 0x24 } JUMP.L
    7 Target : { _exit_mmap + 0x1c }
    Source : { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP)
    8 Target : { _exit_mmap + 0x34 }
    Source : { __cond_resched + 0x20 } RTS
    9 Target : { __cond_resched + 0x0 }
    Source : { _exit_mmap + 0x30 } JUMP.L
    10 Target : { _exit_mmap + 0x30 }
    Source : { _delete_vma + 0xb2 } RTS
    11 Target : { _delete_vma + 0xac }
    Source : { _kmem_cache_free + 0xba } RTS
    12 Target : { _kmem_cache_free + 0xa8 }
    Source : { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP)
    13 Target : { _kmem_cache_free + 0x92 }
    Source : { _kmem_cache_free + 0x5a } IF CC JUMP pcrel
    14 Target : { _kmem_cache_free + 0x34 }
    Source : { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP)
    15 Target : { _kmem_cache_free + 0x0 }
    Source : { _delete_vma + 0xa8 } JUMP.L
    Kernel Stack
    Stack info:
    SP: [0x0569ff24] /* kernel dynamic memory (maybe user-space) */
    Memory from 0x0569ff20 to 056a0000
    0569ff20: 00000001 [04e8da5a] 00008000 00000000 00000000 056a0000 04e8da5a 04e8da5a
    0569ff40: 04eb9eea ffa00dce 02003025 04ea09c5 057f523f 04ea09c4 057f523e 00000000
    0569ff60: 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000
    0569ff80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    0569ffa0: 0566f5bc 057f8000 057f8000 00000001 04ec0170 056ffcf8 056ffd04 057f9800
    0569ffc0: 04d1d498 057f9800 057f8fe4 057f8ef0 00000001 057f928c 00000001 00000001
    0569ffe0: 057f9800 00000000 00000008 00000007 00000001 00000001 00000001
    Return addresses in stack:
    address : { _show_cpuinfo + 0x2d2 }
    Modules linked in:
    Kernel panic - not syncing: Kernel exception
    [ end Kernel panic - not syncing: Kernel exception

    Signed-off-by: Steven Miao
    Acked-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Miao
     

15 Jun, 2014

1 commit

  • Tetsuo Handa wrote:
    "Commit 62a8067a7f35 ("bio_vec-backed iov_iter") introduced an unnamed
    union inside a struct which gcc-4.4.7 cannot handle. Name the unnamed
    union as u in order to fix build failure"

    Let's do this instead: there is only one place in the entire tree that
    steps into this breakage. Anon structs and unions work in older gcc
    versions; as a matter of fact, we have those in the tree - see e.g.
    struct ieee80211_tx_info in include/net/mac80211.h

    What doesn't work is handling their initializers:

    struct {
            int a;
            union {
                    int b;
                    char c;
            };
    } x[2] = {{.a = 1, .c = 'a'}, {.a = 0, .b = 1}};

    is the obvious syntax for an initializer, perfectly fine for C11 and
    handled correctly by gcc-4.7 or later.

    Earlier versions, though, break on it - declaration is fine and so's
    access to fields (i.e. x[0].c = 'a'; would produce the right code), but
    members of the anon structs and unions are not inserted into the right
    namespace. Tellingly, those older versions will not barf on struct {int
    a; struct {int a;};}; - looks like they just have it hacked up somewhere
    around the handling of . and -> instead of doing the right thing.

    The easiest way to deal with that crap is to turn the initialization
    of those fields (in the only place where we have such an initializer
    of iov_iter) into plain assignment.
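
    A self-contained illustration of the workaround (plain userspace C,
    not the kernel code in question):

        #include <stdio.h>

        int main(void)
        {
                struct {
                        int a;
                        union {
                                int b;
                                char c;
                        };
                } x[2] = {{.a = 1}, {.a = 0}};

                /*
                 * Assigning to the anon-union members works even where
                 * the designated-initializer form breaks on old gcc.
                 */
                x[0].c = 'a';
                x[1].b = 1;

                printf("%c %d\n", x[0].c, x[1].b);
                return 0;
        }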

    Reported-by: Tetsuo Handa
    Reported-by: Russell King
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

12 Jun, 2014

2 commits


10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

09 Jun, 2014

3 commits

  • shrink_inactive_list() used to wait 0.1s to avoid congestion when all
    the pages that were isolated from the inactive list were dirty but not
    under active writeback. That makes no real sense, and apparently causes
    major interactivity issues under some loads since 3.11.

    The ostensible reason for it was to wait for kswapd to start writing
    pages, but that seems questionable as well, since the congestion wait
    code seems to trigger for kswapd itself as well. Also, the logic behind
    delaying anything when we haven't actually started writeback is not
    clear - it only delays actually starting that writeback.

    We'll still trigger the congestion waiting if

    (a) the process is kswapd, and we hit pages flagged for immediate
    reclaim

    (b) the process is not kswapd, and the zone backing dev writeback is
    actually congested.

    This probably needs to be revisited, but as it is this fixes a reported
    regression.

    Reported-by: Felipe Contreras
    Pinpointed-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "Clean ups and miscellaneous bug fixes, in particular for the new
    collapse_range and zero_range fallocate functions. In addition,
    improve the scalability of adding and removing inodes from the orphan
    list"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
    ext4: handle symlink properly with inline_data
    ext4: fix wrong assert in ext4_mb_normalize_request()
    ext4: fix zeroing of page during writeback
    ext4: remove unused local variable "stored" from ext4_readdir(...)
    ext4: fix ZERO_RANGE test failure in data journalling
    ext4: reduce contention on s_orphan_lock
    ext4: use sbi in ext4_orphan_{add|del}()
    ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
    ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
    ext4: remove unnecessary double parentheses
    ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
    ext4: make local functions static
    ext4: fix block bitmap validation when bigalloc, ^flex_bg
    ext4: fix block bitmap initialization under sparse_super2
    ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
    ext4: avoid unneeded lookup when xattr name is invalid
    ext4: fix data integrity sync in ordered mode
    ext4: remove obsoleted check
    ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
    ext4: fix locking for O_APPEND writes
    ...

    Linus Torvalds
     
  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

07 Jun, 2014

14 commits

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.
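
    For reference, a typical file-local pr_fmt definition looks like this
    (it must be defined before the printk headers are pulled in):

        #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

        pr_warn("allocation failed\n"); /* message is prefixed automatically */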

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     
  • Signed-off-by: Fabian Frederick
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • The remap_file_pages() system call is used to create a nonlinear
    mapping, that is, a mapping in which the pages of the file are mapped
    into a nonsequential order in memory. The advantage of using
    remap_file_pages() over using repeated calls to mmap(2) is that the
    former approach does not require the kernel to create additional VMA
    (Virtual Memory Area) data structures.

    Supporting nonlinear mappings requires a significant amount of
    non-trivial code in the kernel virtual memory subsystem, including hot
    paths. Also, to make nonlinear mappings work, the kernel needs a way
    to distinguish normal page table entries from entries with a file
    offset (pte_file). The kernel reserves a flag in the PTE for this
    purpose. PTE flags are a scarce resource, especially on some CPU
    architectures, and it would be nice to free up the flag for other
    usage.

    Fortunately, there are not many users of remap_file_pages() in the
    wild. The only known user is one enterprise RDBMS implementation that
    uses the syscall on 32-bit systems to map files bigger than can
    linearly fit into the 32-bit virtual address space. This use-case is
    not critical anymore since 64-bit systems are widely available.

    The plan is to deprecate the syscall and replace it with an emulation.
    The emulation will create new VMAs instead of nonlinear mappings. It's
    going to work slower for rare users of remap_file_pages() but ABI is
    preserved.

    One side effect of the emulation (apart from performance) is that the
    user can hit the vm.max_map_count limit more easily due to additional
    VMAs. See the comment for DEFAULT_MAX_MAP_COUNT for more details on
    the limit.
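
    A sketch of how such a deprecation warning might look (the exact
    message and documentation path are assumptions):

        /* SYSCALL_DEFINE5(remap_file_pages, ...) */
        pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
                     "See Documentation/vm/remap_file_pages.txt.\n",
                     current->comm, current->pid);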

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Armin Rigo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Jianyu Zhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg zoneinfo lookup sites have either the page, the zone, or the node
    id and zone index, but sites that only have the zone have to look up the
    node id and zone index themselves, whereas sites that already have those
    two integers use a function for a simple pointer chase.

    Provide mem_cgroup_zone_zoneinfo() that takes a zone pointer, and let
    sites that already have the node id and zone index - all the
    for-each-node, for-each-zone iterators - use
    &memcg->nodeinfo[nid]->zoneinfo[zid] directly.

    Rename page_cgroup_zoneinfo() to mem_cgroup_page_zoneinfo() to match.
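
    A sketch of the new helper, following the lookup described above (the
    return type is assumed from the existing memcg zoneinfo structures):

        static struct mem_cgroup_per_zone *
        mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
        {
                int nid = zone_to_nid(zone);
                int zid = zone_idx(zone);

                return &memcg->nodeinfo[nid]->zoneinfo[zid];
        }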

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • Kmemleak could ignore memory blocks allocated via memblock_alloc()
    leading to false positives during scanning. This patch adds the
    corresponding callbacks and removes kmemleak_free_* calls in
    mm/nobootmem.c to avoid duplication.

    The kmemleak_alloc() in mm/nobootmem.c is kept since
    __alloc_memory_core_early() does not use memblock_alloc() directly.

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • When mempool_alloc() returns an existing pool object, kmemleak_alloc()
    is no longer called and the stack trace corresponds to the original
    object allocation. This patch updates the kmemleak allocation stack
    trace for such objects to make it more useful for debugging.
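
    Illustratively, the update happens where mempool_alloc() hands out a
    pre-allocated element (sketch):

        element = pool->elements[--pool->curr_nr];
        spin_unlock_irqrestore(&pool->lock, flags);
        /* re-point the recorded kmemleak trace at this allocation site */
        kmemleak_update_trace(element);
        return element;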

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • The memory allocation stack trace is not always useful for debugging a
    memory leak (e.g. radix_tree_preload). This function, when called,
    updates the stack trace for an already allocated object.

    Signed-off-by: Catalin Marinas
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Signed-off-by: Jianpeng Ma
    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianpeng Ma
     
  • Memory reclaim always uses swappiness of the reclaim target memcg
    (origin of the memory pressure) or vm_swappiness for global memory
    reclaim. This behavior was consistent (except for the difference between
    global and hard limit reclaim) because swappiness was enforced to be
    consistent within each memcg hierarchy.

    After "mm: memcontrol: remove hierarchy restrictions for swappiness and
    oom_control" each memcg can have its own swappiness independent of
    hierarchical parents, though, so the consistency guarantee is gone.
    This can lead to an unexpected behavior. Say that a group is explicitly
    configured to not swapout by memory.swappiness=0 but its memory gets
    swapped out anyway when the memory pressure comes from its parent with a
    It is also unexpected that the knob is meaningless without setting the
    hard limit which would trigger the reclaim and enforce the swappiness.
    There are setups where the hard limit is configured higher in the
    hierarchy by an administrator and children groups are under control of
    somebody else who is interested in the swapout behavior but not
    necessarily about the memory limit.

    From a semantic point of view, swappiness is an attribute defining
    anon vs. file proportional scanning of the LRU, which is memcg
    specific (unlike charges, which are propagated up the hierarchy), so
    it should be applied to the particular memcg's LRU regardless of where
    the memory pressure comes from.

    This patch removes vmscan_swappiness() and stores the swappiness into
    the scan_control structure. mem_cgroup_swappiness is then used to
    provide the correct value before shrink_lruvec is called. The global
    vm_swappiness is used for the root memcg.
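
    Roughly, the per-reclaim decision becomes (sketch; global_reclaim() is
    the existing vmscan helper that tests for non-memcg reclaim):

        /* decided once per reclaim and carried in struct scan_control */
        sc->swappiness = global_reclaim(sc) ? vm_swappiness
                                            : mem_cgroup_swappiness(memcg);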

    [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This typedef is unnecessary and should just be removed.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently, if the allocation is not constrained to a particular node
    (NUMA_NO_NODE), we search for a partial slab on the numa_node_id()
    node. This doesn't work properly on a system with memoryless nodes,
    since such a node has no memory and therefore no partial slabs.

    On a memoryless node, page allocation always falls back to
    numa_mem_id() first. So searching for a partial slab on numa_mem_id()
    is the proper solution for the memoryless node case.
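
    The change in get_partial() is roughly (sketch):

        int searchnode = node;

        if (node == NUMA_NO_NODE)
                searchnode = numa_mem_id();     /* was numa_node_id() */

        object = get_partial_node(s, get_node(s, searchnode), c, flags);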

    Signed-off-by: Joonsoo Kim
    Acked-by: Nishanth Aravamudan
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Wanpeng Li
    Cc: Han Pingtian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    >
    > lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.
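
    The cleanup at the end of the kswapd thread function amounts to
    (sketch):

        tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
        lockdep_clear_current_reclaim_state();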

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The page table walker doesn't check non-present hugetlb entries in the
    common path, so hugetlb_entry() callbacks must check them. The reason
    for this behavior is that some callers want to handle them in their
    own way.

    [ I think that reason is bogus, btw - it should just do what the regular
    code does, which is to call the "pte_hole()" function for such hugetlb
    entries - Linus]

    However, some callers don't check it now, which causes unpredictable
    results, for example when we have a race between migrating a hugepage
    and reading /proc/pid/numa_maps. This patch fixes it by adding
    !pte_present checks to the buggy callbacks.
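
    The callback-side fix looks roughly like this (sketch with a
    hypothetical callback name):

        static int some_hugetlb_entry(pte_t *pte, unsigned long hmask,
                        unsigned long addr, unsigned long end,
                        struct mm_walk *walk)
        {
                /* skip migration/hwpoison entries: not a present huge page */
                if (!pte_present(*pte))
                        return 0;
                /* ... existing accounting of the huge page ... */
                return 0;
        }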

    This bug has existed for years and became visible with the
    introduction of hugepage migration.

    ChangeLog v2:
    - fix if condition (check !pte_present() instead of pte_present())

    Reported-by: Sasha Levin
    Signed-off-by: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    [ Backported to 3.15. Signed-off-by: Josh Boyer ]
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Jun, 2014

1 commit

  • While working on the address sanitizer for the kernel I've discovered
    a use-after-free bug in __put_anon_vma.

    For the last anon_vma, anon_vma->root is freed before the child
    anon_vma. Later, in anon_vma_free(anon_vma), we reference the already
    freed anon_vma->root to check the rwsem.

    This fixes it by freeing the child anon_vma before freeing
    anon_vma->root.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Peter Zijlstra
    Cc: # v3.0+
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

05 Jun, 2014

3 commits

  • Pull x86 vdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less 'special'"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     
  • zswap_dstmem is a percpu block of memory, which should be allocated using
    kmalloc_node(), to get better NUMA locality.

    Without it, all the blocks are allocated from a single node.
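
    The change amounts to (sketch; PAGE_SIZE * 2 is kept as in the
    existing zswap code):

        /* was: dst = kmalloc(PAGE_SIZE * 2, GFP_KERNEL); */
        dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));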

    Signed-off-by: Eric Dumazet
    Acked-by: Seth Jennings
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Now we can build zsmalloc as a module, because unmap_kernel_range was
    exported.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim