Eric Lee / smarc-fsl-linux-kernel

12 Oct, 2016

1 commit

9099daed9 mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping ... Browse Code »

Some of the kmemleak_*() callbacks in memblock, bootmem, CMA convert a
physical address to a virtual one using __va(). However, such physical
addresses may sometimes be located in highmem and using __va() is
incorrect, leading to inconsistent object tracking in kmemleak.

The following functions have been added to the kmemleak API and they take
a physical address as the object pointer. They only perform the
corresponding action if the address has a lowmem mapping:

kmemleak_alloc_phys
kmemleak_free_part_phys
kmemleak_not_leak_phys
kmemleak_ignore_phys

The affected calling places have been updated to use the new kmemleak
API.

Link: http://lkml.kernel.org/r/1471531432-16503-1-git-send-email-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas
Reported-by: Vignesh R
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Catalin Marinas
2016-10-12 06:06:33 +0800

11 Oct, 2016

7 commits

101105b17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning

Linus Torvalds
2016-10-11 11:16:43 +0800
3873691e5 Merge remote-tracking branch 'ovl/rename2' into for-linus Browse Code »

Al Viro
2016-10-11 11:02:51 +0800
97d211670 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs xattr updates from Al Viro:
"xattr stuff from Andreas

This completes the switch to xattr_handler ->get()/->set() from
->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Remove {get,set,remove}xattr inode operations
xattr: Stop calling {get,set,remove}xattr inode operations
vfs: Check for the IOP_XATTR flag in listxattr
xattr: Add __vfs_{get,set,remove}xattr helpers
libfs: Use IOP_XATTR flag for empty directory handling
vfs: Use IOP_XATTR flag for bad-inode handling
vfs: Add IOP_XATTR inode operations flag
vfs: Move xattr_resolve_name to the front of fs/xattr.c
ecryptfs: Switch to generic xattr handlers
sockfs: Get rid of getxattr iop
sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
kernfs: Switch to generic xattr handlers
hfs: Switch to generic xattr handlers
jffs2: Remove jffs2_{get,set,remove}xattr macros
xattr: Remove unnecessary NULL attribute name check

Linus Torvalds
2016-10-11 08:11:50 +0800
fed41f7d0 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull splice fixups from Al Viro:
"A couple of fixups for interaction of pipe-backed iov_iter with
O_DIRECT reads + constification of a couple of primitives in uio.h
missed by previous rounds.

Kudos to davej - his fuzzing has caught those bugs"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
[btrfs] fix check_direct_IO() for non-iovec iterators
constify iov_iter_count() and iter_is_iovec()
fix ITER_PIPE interaction with direct_IO

Linus Torvalds
2016-10-11 04:38:49 +0800
abb5a14fa Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.

There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...

Linus Torvalds
2016-10-11 04:04:49 +0800
93c26d7dc Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull protection keys syscall interface from Thomas Gleixner:
"This is the final step of Protection Keys support which adds the
syscalls so user space can actually allocate keys and protect memory
areas with them. Details and usage examples can be found in the
documentation.

The mm side of this has been acked by Mel"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pkeys: Update documentation
x86/mm/pkeys: Do not skip PKRU register if debug registers are not used
x86/pkeys: Fix pkeys build breakage for some non-x86 arches
x86/pkeys: Add self-tests
x86/pkeys: Allow configuration of init_pkru
x86/pkeys: Default to a restrictive init PKRU
pkeys: Add details of system call use to Documentation/
generic syscalls: Wire up memory protection keys syscalls
x86: Wire up protection keys system calls
x86/pkeys: Allocation/free syscalls
x86/pkeys: Make mprotect_key() mask off additional vm_flags
mm: Implement new pkey_mprotect() system call
x86/pkeys: Add fault handling for PF_PK page fault bit

Linus Torvalds
2016-10-11 02:01:51 +0800
c3a690240 fix ITER_PIPE interaction with direct_IO ... Browse Code »

by making sure we call iov_iter_advance() on original
iov_iter even if direct_IO (done on its copy) has returned 0.
It's a no-op for old iov_iter flavours and does the right thing
(== truncation of the stuff we'd allocated, but not filled) in
ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with
by cleanup in generic_file_read_iter().

Signed-off-by: Al Viro

Al Viro
2016-10-11 01:36:06 +0800

08 Oct, 2016

32 commits

e55f1d1d1 Merge remote-tracking branch 'jk/vfs' into work.misc Browse Code »

Al Viro
2016-10-08 23:06:08 +0800
b66484cd7 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge updates from Andrew Morton:

- fsnotify updates

- ocfs2 updates

- all of MM

* emailed patches from Andrew Morton : (127 commits)
console: don't prefer first registered if DT specifies stdout-path
cred: simpler, 1D supplementary groups
CREDITS: update Pavel's information, add GPG key, remove snail mail address
mailmap: add Johan Hovold
.gitattributes: set git diff driver for C source code files
uprobes: remove function declarations from arch/{mips,s390}
spelling.txt: "modeled" is spelt correctly
nmi_backtrace: generate one-line reports for idle cpus
arch/tile: adopt the new nmi_backtrace framework
nmi_backtrace: do a local dump_stack() instead of a self-NMI
nmi_backtrace: add more trigger_*_cpu_backtrace() methods
min/max: remove sparse warnings when they're nested
Documentation/filesystems/proc.txt: add more description for maps/smaps
mm, proc: fix region lost in /proc/self/smaps
proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
proc: add LSM hook checks to /proc//timerslack_ns
proc: relax /proc//timerslack_ns capability requirements
meminfo: break apart a very long seq_printf with #ifdefs
seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
proc: faster /proc/*/status
...

Linus Torvalds
2016-10-08 12:38:00 +0800
fd50ecadd vfs: Remove {get,set,remove}xattr inode operations ... Browse Code »

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher
Signed-off-by: Al Viro

Andreas Gruenbacher
2016-10-08 09:48:36 +0800
75ba1d07f seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char ... Browse Code »

Allow some seq_puts removals by taking a string instead of a single
char.

[akpm@linux-foundation.org: update vmstat_show(), per Joe]
Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
Signed-off-by: Joe Perches
Cc: Joe Perches
Cc: Andi Kleen
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2016-10-08 09:46:30 +0800
68ba0326b proc: much faster /proc/vmstat ... Browse Code »

Every current KDE system has process named ksysguardd polling files
below once in several seconds:

$ strace -e trace=open -p $(pidof ksysguardd)
Process 1812 attached
open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
open("/proc/net/dev", O_RDONLY) = 8
open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/stat", O_RDONLY) = 8
open("/proc/vmstat", O_RDONLY) = 8

Hell knows what it is doing but speed up reading /proc/vmstat by 33%!

Benchmark is open+read+close 1.000.000 times.

BEFORE
$ perf stat -r 10 taskset -c 3 ./proc-vmstat

Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% )
15 context-switches # 0.001 K/sec ( +- 1.41% )
1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
104 page-faults # 0.008 K/sec ( +- 0.57% )
45,489,799,349 cycles # 3.460 GHz ( +- 0.03% )
9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% )
2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% )
79,241,190,850 instructions # 1.74 insn per cycle
# 0.13 stalled cycles per insn ( +- 0.00% )
17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% )
176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% )

13.691078109 seconds time elapsed ( +- 0.03% )
^^^^^^^^^^^^

AFTER
$ perf stat -r 10 taskset -c 3 ./proc-vmstat

Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% )
10 context-switches # 0.001 K/sec ( +- 2.13% )
1 cpu-migrations # 0.000 K/sec
104 page-faults # 0.012 K/sec ( +- 0.56% )
30,384,010,730 cycles # 3.497 GHz ( +- 0.07% )
12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% )
3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% )
28,969,052,879 instructions # 0.95 insn per cycle
# 0.42 stalled cycles per insn ( +- 0.01% )
6,308,245,891 branches # 726.058 M/sec ( +- 0.00% )
214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% )

9.146081052 seconds time elapsed ( +- 0.07% )
^^^^^^^^^^^

vsnprintf() is slow because:

1. format_decode() is busy looking for format specifier: 2 branches
per character (not in this case, but in others)

2. approximately million branches while parsing format mini language
and everywhere

3. just look at what string() does /proc/vmstat is good case because
most of its content are strings

Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by
Signed-off-by: Alexey Dobriyan
Cc: Joe Perches
Cc: Andi Kleen
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2016-10-08 09:46:30 +0800
72e2936c0 mm: remove unnecessary condition in remove_inode_hugepages ... Browse Code »

When the huge page is added to the page cahce (huge_add_to_page_cache),
the page private flag will be cleared. since this code
(remove_inode_hugepages) will only be called for pages in the page
cahce, PagePrivate(page) will always be false.

The patch remove the code without any functional change.

Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang
Reviewed-by: Naoya Horiguchi
Reviewed-by: Mike Kravetz
Tested-by: Mike Kravetz
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

zhong jiang
2016-10-08 09:46:29 +0800
63f53dea0 mm: warn about allocations which stall for too long ... Browse Code »

Currently we do warn only about allocation failures but small
allocations are basically nofail and they might loop in the page
allocator for a long time. Especially when the reclaim cannot make any
progress - e.g. GFP_NOFS cannot invoke the oom killer and rely on a
different context to make a forward progress in case there is a lot
memory used by filesystems.

Give us at least a clue when something like this happens and warn about
allocations which take more than 10s. Print the basic allocation
context information along with the cumulative time spent in the
allocation as well as the allocation stack. Repeat the warning after
every 10 seconds so that we know that the problem is permanent rather
than ephemeral.

Link: http://lkml.kernel.org/r/20160929084407.7004-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: Vlastimil Babka
Cc: Tetsuo Handa
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-10-08 09:46:29 +0800
7877cdcc3 mm: consolidate warn_alloc_failed users ... Browse Code »

warn_alloc_failed is currently used from the page and vmalloc
allocators. This is a good reuse of the code except that vmalloc would
appreciate a slightly different warning message. This is already
handled by the fmt parameter except that

"%s: page allocation failure: order:%u, mode:%#x(%pGg)"

is printed anyway. This might be quite misleading because it might be a
vmalloc failure which leads to the warning while the page allocator is
not the culprit here. Fix this by always using the fmt string and only
print the context that makes sense for the particular context (e.g.
order makes only very little sense for the vmalloc context).

Rename the function to not miss any user and also because a later patch
will reuse it also for !failure cases.

Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Tetsuo Handa
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-10-08 09:46:29 +0800
c2a9737f4 vfs,mm: fix a dead loop in truncate_inode_pages_range() ... Browse Code »

We triggered a deadloop in truncate_inode_pages_range() on 32 bits
architecture with the test case bellow:

...
fd = open();
write(fd, buf, 4096);
preadv64(fd, &iovec, 1, 0xffffffff000);
ftruncate(fd, 0);
...

Then ftruncate() will not return forever.

The filesystem used in this case is ubifs, but it can be triggered on
many other filesystems.

When preadv64() is called with offset=0xffffffff000, a page with
index=0xffffffff will be added to the radix tree of ->mapping. Then
this page can be found in ->mapping with pagevec_lookup(). After that,
truncate_inode_pages_range(), which is called in ftruncate(), will fall
into an infinite loop:

- find a page with index=0xffffffff, since index>=end, this page won't
be truncated

- index++, and index become 0

- the page with index=0xffffffff will be found again

The data type of index is unsigned long, so index won't overflow to 0 on
64 bits architecture in this case, and the dead loop won't happen.

Since truncate_inode_pages_range() is executed with holding lock of
inode->i_rwsem, any operation related with this lock will be blocked,
and a hung task will happen, e.g.:

INFO: task truncate_test:3364 blocked for more than 120 seconds.
...
call_rwsem_down_write_failed+0x17/0x30
generic_file_write_iter+0x32/0x1c0
ubifs_write_iter+0xcc/0x170
__vfs_write+0xc4/0x120
vfs_write+0xb2/0x1b0
SyS_write+0x46/0xa0

The page with index=0xffffffff added to ->mapping is useless. Fix this
by checking the read position before allocating pages.

Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
Signed-off-by: Wei Fang
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Al Viro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wei Fang
2016-10-08 09:46:29 +0800
461a71843 mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE ... Browse Code »

Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic
page. No functional change with this patch.

Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie
Suggested-by: Michal Hocko
Acked-by: Michal Hocko
Acked-by: Naoya Horiguchi
Acked-by: Hillf Danton
Cc: Hanjun Guo
Cc: Will Deacon
Cc: Dave Hansen
Cc: Sudeep Holla
Cc: Catalin Marinas
Cc: Mark Rutland
Cc: Rob Herring
Cc: Mike Kravetz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yisheng Xie
2016-10-08 09:46:29 +0800
82e7d3abe oom: print nodemask in the oom report ... Browse Code »

We have received a hard to explain oom report from a customer. The oom
triggered regardless there is a lot of free memory:

PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
PoolThread cpuset=/ mems_allowed=0-7
Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1
Call Trace:
dump_trace+0x75/0x300
dump_stack+0x69/0x6f
dump_header+0x8e/0x110
oom_kill_process+0xa6/0x350
out_of_memory+0x2b7/0x310
__alloc_pages_slowpath+0x7dd/0x820
__alloc_pages_nodemask+0x1e9/0x200
alloc_pages_vma+0xe1/0x290
do_anonymous_page+0x13e/0x300
do_page_fault+0x1fd/0x4c0
page_fault+0x25/0x30
[...]
active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
active_file:13093 inactive_file:222506 isolated_file:0
unevictable:262144 dirty:2 writeback:0 unstable:0
free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
mapped:261139 shmem:166297 pagetables:2228282 bounce:0
[...]
Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2892 775542 775542
Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 0 772650 772650
Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0

So all nodes but Node 0 have a lot of free memory which should suggest
that there is an available memory especially when mems_allowed=0-7. One
could speculate that a massive process has managed to terminate and free
up a lot of memory while racing with the above allocation request.
Although this is highly unlikely it cannot be ruled out.

A further debugging, however shown that the faulting process had
mempolicy (not cpuset) to bind to Node 0. We cannot see that
information from the report though. mems_allowed turned out to be more
confusing than really helpful.

Fix this by always priting the nodemask. It is either mempolicy mask
(and non-null) or the one defined by the cpusets. The new output for
the above oom report would be

PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0

This patch doesn't touch show_mem and the node filtering based on the
cpuset node mask because mempolicy is always a subset of cpusets and
seeing the full cpuset oom context might be helpful for tunning more
specific mempolicies inside cpusets (e.g. when they turn out to be too
restrictive). To prevent from ugly ifdefs the mask is printed even for
!NUMA configurations but this should be OK (a single node will be
printed).

Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Sellami Abdelkader
Acked-by: Vlastimil Babka
Cc: David Rientjes
Cc: Sellami Abdelkader
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-10-08 09:46:29 +0800
9996f05ea mm: clarify why we avoid page_mapcount() for slab pages in dump_page() ... Browse Code »

Let's add comment on why we skip page_mapcount() for sl[aou]b pages.

Link: http://lkml.kernel.org/r/20160922105532.GB24593@node
Signed-off-by: Kirill A. Shutemov
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-10-08 09:46:29 +0800
8f26e0b17 mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb ... Browse Code »

The old code was always doing:

vma->vm_end = next->vm_end
vma_rb_erase(next) // in __vma_unlink
vma->vm_next = next->vm_next // in __vma_unlink
next = vma->vm_next
vma_gap_update(next)

The new code still does the above for remove_next == 1 and 2, but for
remove_next == 3 it has been changed and it does:

next->vm_start = vma->vm_start
vma_rb_erase(vma) // in __vma_unlink
vma_gap_update(next)

In the latter case, while unlinking "vma", validate_mm_rb() is told to
ignore "vma" that is being removed, but next->vm_start was reduced
instead. So for the new case, to avoid the false positive from
validate_mm_rb, it should be "next" that is ignored when "vma" is
being unlinked.

"vma" and "next" in the above comment, considered pre-swap().

Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Tested-by: Shaun Tancheff
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Jan Vorlicek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2016-10-08 09:46:29 +0800
86d12e471 mm: vma_adjust: minor comment correction ... Browse Code »

The cases are three not two.

Link: http://lkml.kernel.org/r/1474492522-2261-3-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Jan Vorlicek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2016-10-08 09:46:29 +0800
97a42cd43 mm: vma_adjust: remove superfluous check for next not NULL ... Browse Code »

If next would be NULL we couldn't reach such code path.

Link: http://lkml.kernel.org/r/1474309513-20313-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Jan Vorlicek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2016-10-08 09:46:29 +0800
e86f15ee6 mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk ... Browse Code »

The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations). So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially removing
the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.

The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.

The side effect of this race condition is pte corruption during migrate
as remove_migration_ptes when run on a address of the "next" vma that
got removed, used the vm_page_prot of the current vma.

migrate mprotect
------------ -------------
migrating in "next" vma
vma_merge() # removes "next" vma and
# extends "current" vma
# current vma is not with
# vm_page_prot updated
remove_migration_ptes
read vm_page_prot of current "vma"
establish pte with wrong permissions
vm_set_page_prot(vma) # too late!
change_protection in the old vma range
only, next range is not updated

This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.

Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
which confirms the case 8 is only buggy one where the race can trigger,
in all other vma_merge cases the above cannot happen.

This fix removes the oddness factor from case 8 and it converts it from:

AAAA
PPPPNNNNXXXX -> PPPPNNNNNNNN

to:

AAAA
PPPPNNNNXXXX -> PPPPXXXXXXXX

XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully. It has the added benefits
that the callers could stop updating vma properties when vma_merge
succeeds however the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).

Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Reported-by: Aditya Mandaleeka
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Jan Vorlicek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2016-10-08 09:46:29 +0800
fb8c41e9a mm: vma_adjust: remove superfluous confusing update in remove_next == 1 case ... Browse Code »

mm->highest_vm_end doesn't need any update.

After finally removing the oddness from vma_merge case 8 that was
causing:

1) constant risk of trouble whenever anybody would check vma fields
from rmap_walks, like it happened when page migration was
introduced and it read the vma->vm_page_prot from a rmap_walk

2) the callers of vma_merge to re-initialize any value different from
the current vma, instead of vma_merge() more reliably returning a
vma that already matches all fields passed as parameter

.. it is also worth to take the opportunity of cleaning up superfluous
code in vma_adjust(), that if not removed adds up to the hard
readability of the function.

Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Jan Vorlicek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2016-10-08 09:46:29 +0800
6d2329f88 mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE ... Browse Code »

vma->vm_page_prot is read lockless from the rmap_walk, it may be updated
concurrently and this prevents the risk of reading intermediate values.

Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Jan Vorlicek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2016-10-08 09:46:29 +0800
6213055f2 mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node() ... Browse Code »

According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can
in rare cases cause a hung task warning.

At present, if alloc_stable_node() allocation fails, two break_cows may
want to allocate a couple of pages, and the issue will come up when free
memory is under pressure.

We fix it by adding __GFP_HIGH to GFP, to grant access to memory
reserves, increasing the likelihood of allocation success.

[akpm@linux-foundation.org: tweak comment]
Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang
Suggested-by: Hugh Dickins
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

zhong jiang
2016-10-08 09:46:29 +0800
ac34dcd26 mm/page_isolation: fix typo: "paes" -> "pages" ... Browse Code »

Fix typo in comment.

Link: http://lkml.kernel.org/r/1474788764-5774-1-git-send-email-ysxie@foxmail.com
Signed-off-by: Yisheng Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yisheng Xie
2016-10-08 09:46:29 +0800
eb03aa008 mm/hugetlb: improve locking in dissolve_free_huge_pages() ... Browse Code »

For every pfn aligned to minimum_order, dissolve_free_huge_pages() will
call dissolve_free_huge_page() which takes the hugetlb spinlock, even if
the page is not huge at all or a hugepage that is in-use.

Improve this by doing the PageHuge() and page_count() checks already in
dissolve_free_huge_pages() before calling dissolve_free_huge_page(). In
dissolve_free_huge_page(), when holding the spinlock, those checks need
to be revalidated.

Link: http://lkml.kernel.org/r/20160926172811.94033-4-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer
Acked-by: Michal Hocko
Acked-by: Naoya Horiguchi
Cc: "Kirill A . Shutemov"
Cc: Vlastimil Babka
Cc: Mike Kravetz
Cc: "Aneesh Kumar K . V"
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Rui Teng
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2016-10-08 09:46:29 +0800
082d5b6b6 mm/hugetlb: check for reserved hugepages during memory offline ... Browse Code »

In dissolve_free_huge_pages(), free hugepages will be dissolved without
making sure that there are enough of them left to satisfy hugepage
reservations.

Fix this by adding a return value to dissolve_free_huge_pages() and
checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
lead to the situation where dissolve_free_huge_page() returns an error
and all free hugepages that were dissolved before that error are lost,
while the memory block still cannot be set offline.

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer
Acked-by: Michal Hocko
Acked-by: Naoya Horiguchi
Cc: "Kirill A . Shutemov"
Cc: Vlastimil Babka
Cc: Mike Kravetz
Cc: "Aneesh Kumar K . V"
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Rui Teng
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2016-10-08 09:46:29 +0800
2247bb335 mm/hugetlb: fix memory offline with hugepage size > memory block size ... Browse Code »

Patch series "mm/hugetlb: memory offline issues with hugepages", v4.

This addresses several issues with hugepages and memory offline. While
the first patch fixes a panic, and is therefore rather important, the
last patch is just a performance optimization.

The second patch fixes a theoretical issue with reserved hugepages,
while still leaving some ugly usability issue, see description.

This patch (of 3):

dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
list corruption and addressing exception when trying to set a memory
block offline that is part (but not the first part) of a "gigantic"
hugetlb page with a size > memory block size.

When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
will trigger directly. In the other case we will run into an addressing
exception later, because dissolve_free_huge_page() will not work on the
head page of the compound hugetlb page which will result in a NULL
hstate from page_hstate().

To fix this, first remove the VM_BUG_ON() because it is wrong, and then
use the compound head page in dissolve_free_huge_page(). This means
that an unused pre-allocated gigantic page that has any part of itself
inside the memory block that is going offline will be dissolved
completely. Losing an unused gigantic hugepage is preferable to failing
the memory offline, for example in the situation where a (possibly
faulty) memory DIMM needs to go offline.

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer
Acked-by: Michal Hocko
Acked-by: Naoya Horiguchi
Cc: "Kirill A . Shutemov"
Cc: Vlastimil Babka
Cc: Mike Kravetz
Cc: "Aneesh Kumar K . V"
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Rui Teng
Cc: Dave Hansen
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2016-10-08 09:46:29 +0800
914a05165 mm: nobootmem: move the comment of free_all_bootmem ... Browse Code »

Commit b4def3509d18 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
removed the unnecessary nodeid argument, after that, this comment
becomes more confused. We should move it to the right place.

Fixes: b4def3509d18c1db9 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
Link: http://lkml.kernel.org/r/1473996082-14603-1-git-send-email-wanlong.gao@gmail.com
Signed-off-by: Wanlong Gao
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wanlong Gao
2016-10-08 09:46:29 +0800
19938e350 mm/shmem.c: constify anon_ops ... Browse Code »

Every other dentry_operations instance is const, and this one might as
well be.

Link: http://lkml.kernel.org/r/1473890528-7009-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rasmus Villemoes
2016-10-08 09:46:29 +0800
2d7580738 mm: memcontrol: consolidate cgroup socket tracking ... Browse Code »

The cgroup core and the memory controller need to track socket ownership
for different purposes, but the tracking sites being entirely different
is kind of ugly.

Be a better citizen and rename the memory controller callbacks to match
the cgroup core callbacks, then move them to the same place.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Tejun Heo
Cc: "David S. Miller"
Cc: Michal Hocko
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-10-08 09:46:29 +0800
cc30c5d64 mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE() ... Browse Code »

So they are CONFIG_DEBUG_VM-only and more informative.

Cc: Al Viro
Cc: David S. Miller
Cc: Hugh Dickins
Cc: Jens Axboe
Cc: Joe Perches
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Santosh Shilimkar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2016-10-08 09:46:29 +0800
a104808e2 mm: don't emit warning from pagefault_out_of_memory() ... Browse Code »

Commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path
raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
to warn when we raced with disabling the OOM killer.

Now, patch "oom, suspend: fix oom_killer_disable vs. pm suspend
properly" introduced a timeout for oom_killer_disable(). Even if we
raced with disabling the OOM killer and the system is OOM livelocked,
the OOM killer will be enabled eventually (in 20 seconds by default) and
the OOM livelock will be solved. Therefore, we no longer need to warn
when we raced with disabling the OOM killer.

Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa
Acked-by: Michal Hocko
Cc: David Rientjes
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tetsuo Handa
2016-10-08 09:46:29 +0800
203114202 mm, compaction: restrict fragindex to costly orders ... Browse Code »

Fragmentation index and the vm.extfrag_threshold sysctl is meant as a
heuristic to prevent excessive compaction for costly orders (i.e. THP).
It's unlikely to make any difference for non-costly orders, especially
with the default threshold. But we cannot afford any uncertainty for
the non-costly orders where the only alternative to successful
reclaim/compaction is OOM. After the recent patches we are guaranteed
maximum effort without heuristics from compaction before deciding OOM,
and fragindex is the last remaining heuristic. Therefore skip fragindex
altogether for non-costly orders.

Suggested-by: Michal Hocko
Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Mel Gorman
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vlastimil Babka
2016-10-08 09:46:29 +0800
cc5c9f098 mm, compaction: ignore fragindex from compaction_zonelist_suitable() ... Browse Code »

The compaction_zonelist_suitable() function tries to determine if
compaction will be able to proceed after sufficient reclaim, i.e.
whether there are enough reclaimable pages to provide enough order-0
freepages for compaction.

This addition of reclaimable pages to the free pages works well for the
order-0 watermark check, but in the fragmentation index check we only
consider truly free pages. Thus we can get fragindex value close to 0
which indicates failure do to lack of memory, and wrongly decide that
compaction won't be suitable even after reclaim.

Instead of trying to somehow adjust fragindex for reclaimable pages,
let's just skip it from compaction_zonelist_suitable().

Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Mel Gorman
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vlastimil Babka
2016-10-08 09:46:29 +0800
423b452e1 mm, page_alloc: pull no_progress_loops update to should_reclaim_retry() ... Browse Code »

The should_reclaim_retry() makes decisions based on no_progress_loops,
so it makes sense to also update the counter there. It will be also
consistent with should_compact_retry() and compaction_retries. No
functional change.

[hillf.zj@alibaba-inc.com: fix missing pointer dereferences]
Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Hillf Danton
Acked-by: Michal Hocko
Cc: Mel Gorman
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vlastimil Babka
2016-10-08 09:46:29 +0800
9f7e33879 mm, compaction: make full priority ignore pageblock suitability ... Browse Code »

Several people have reported premature OOMs for order-2 allocations
(stack) due to OOM rework in 4.7. In the scenario (parallel kernel
build and dd writing to two drives) many pageblocks get marked as
Unmovable and compaction free scanner struggles to isolate free pages.
Joonsoo Kim pointed out that the free scanner skips pageblocks that are
not movable to prevent filling them and forcing non-movable allocations
to fallback to other pageblocks. Such heuristic makes sense to help
prevent long-term fragmentation, but premature OOMs are relatively more
urgent problem. As a compromise, this patch disables the heuristic only
for the ultimate compaction priority.

Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
Reported-by: Ralf-Peter Rohbeck
Reported-by: Arkadiusz Miskiewicz
Reported-by: Olaf Hering
Suggested-by: Joonsoo Kim
Signed-off-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Michal Hocko
Cc: Mel Gorman
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Rik van Riel
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vlastimil Babka
2016-10-08 09:46:29 +0800