20 Nov, 2015

3 commits

  • Pass the correct argument to subsys_cgroup_allow_attach(), which
    expects a 'struct cgroup_subsys_state *' argument; we were passing
    a 'struct cgroup *' instead, which is wrong.

    This fixes the following 'incompatible pointer type' compiler warning:
    ----------
    CC mm/memcontrol.o
    mm/memcontrol.c: In function ‘mem_cgroup_allow_attach’:
    mm/memcontrol.c:5052:2: warning: passing argument 1 of ‘subsys_cgroup_allow_attach’ from incompatible pointer type [enabled by default]
    In file included from include/linux/memcontrol.h:22:0,
    from mm/memcontrol.c:29:
    include/linux/cgroup.h:953:5: note: expected ‘struct cgroup_subsys_state *’ but argument is of type ‘struct cgroup *’
    ----------
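
    A minimal sketch of the corrected caller, assuming the Android
    common-kernel prototypes (mem_cgroup_allow_attach() receives the css
    directly and just forwards it):

    static int mem_cgroup_allow_attach(struct cgroup_subsys_state *css,
                                       struct cgroup_taskset *tset)
    {
            /* forward the css itself; do not pass css->cgroup */
            return subsys_cgroup_allow_attach(css, tset);
    }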

    Signed-off-by: Amit Pundir

    Amit Pundir
     
  • Use the 'allow_attach' handler for the 'mem' cgroup to allow
    non-root processes to add arbitrary processes to a 'mem' cgroup
    if they have the CAP_SYS_NICE capability set.
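
    A hedged sketch of what such a helper might look like (the exact
    Android implementation may differ; the iteration macro and the
    credential checks here are assumptions):

    int subsys_cgroup_allow_attach(struct cgroup_subsys_state *css,
                                   struct cgroup_taskset *tset)
    {
            const struct cred *cred = current_cred(), *tcred;
            struct task_struct *task;

            /* CAP_SYS_NICE lets a non-root process attach arbitrary tasks */
            if (capable(CAP_SYS_NICE))
                    return 0;

            cgroup_taskset_for_each(task, tset) {
                    tcred = __task_cred(task);
                    if (current != task &&
                        !uid_eq(cred->euid, tcred->uid) &&
                        !uid_eq(cred->euid, tcred->suid))
                            return -EACCES;
            }
            return 0;
    }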

    Bug: 18260435
    Change-Id: If7d37bf90c1544024c4db53351adba6a64966250
    Signed-off-by: Rom Lemarchand

    Rom Lemarchand
     
  • NOT FOR STAGING
    This patch re-adds the original shmem_set_file to mm/shmem.c
    and converts ashmem.c back to using it.
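
    For reference, a sketch of the helper being re-added (assuming the
    shape it had in earlier Android trees):

    void shmem_set_file(struct vm_area_struct *vma, struct file *file)
    {
            if (vma->vm_file)
                    fput(vma->vm_file);
            vma->vm_file = file;
            vma->vm_ops = &shmem_vm_ops;
    }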

    CC: Brian Swetland
    CC: Colin Cross
    CC: Arve Hjønnevåg
    CC: Dima Zavin
    CC: Robert Love
    CC: Greg KH
    Signed-off-by: John Stultz

    John Stultz
     

18 Jun, 2015

1 commit

  • It appears that, at some point last year, XFS made directory handling
    changes which bring it into lockdep conflict with shmem_zero_setup():
    it is surprising that mmap() can clone an inode while holding mmap_sem,
    but that has been so for many years.

    Since those few lockdep traces that I've seen all implicated selinux,
    I'm hoping that we can use the __shmem_file_setup(,,,S_PRIVATE) which
    v3.13's commit c7277090927a ("security: shmem: implement kernel private
    shmem inodes") introduced to avoid LSM checks on kernel-internal inodes:
    the mmap("/dev/zero") cloned inode is indeed a kernel-internal detail.

    This also covers the !CONFIG_SHMEM use of ramfs to support /dev/zero
    (and MAP_SHARED|MAP_ANONYMOUS). I thought there were also drivers
    which cloned inode in mmap(), but if so, I cannot locate them now.
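
    A minimal sketch of the resulting shmem_zero_setup(), assuming the
    __shmem_file_setup() signature from v3.13's c7277090927a:

    int shmem_zero_setup(struct vm_area_struct *vma)
    {
            struct file *file;
            loff_t size = vma->vm_end - vma->vm_start;

            /*
             * The cloned inode is a kernel-internal detail: S_PRIVATE
             * skips LSM checks and so avoids the lockdep conflict.
             */
            file = __shmem_file_setup("dev/zero", size, vma->vm_flags,
                                      S_PRIVATE);
            if (IS_ERR(file))
                    return PTR_ERR(file);

            if (vma->vm_file)
                    fput(vma->vm_file);
            vma->vm_file = file;
            vma->vm_ops = &shmem_vm_ops;
            return 0;
    }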

    Reported-and-tested-by: Prarit Bhargava
    Reported-and-tested-by: Daniel Wagner
    Reported-and-tested-by: Morten Stevens
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Jun, 2015

4 commits

  • If zs_create_pool()->create_handle_cache()->kmem_cache_create() or
    pool->name allocation fails, zs_create_pool()->destroy_handle_cache()
    will dereference the NULL pool->handle_cachep.

    Modify destroy_handle_cache() to avoid this.
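
    The fix is a one-line guard; a sketch matching the description:

    static void destroy_handle_cache(struct zs_pool *pool)
    {
            /* handle_cachep may still be NULL on the error path */
            if (pool->handle_cachep)
                    kmem_cache_destroy(pool->handle_cachep);
    }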

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg
    swapout path because the spin_lock_irq(&mapping->tree_lock) in the
    caller doesn't actually disable hardware interrupts - which is fine,
    because on -rt the top halves run in process context, so we are still
    safe from preemption while updating the statistics.

    Remove the VM_BUG_ON() but keep the comment of what we rely on.
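
    What remains in mem_cgroup_swapout() is roughly this (a sketch; the
    retained comment is paraphrased and the statistics call approximated):

    /*
     * Interrupts should be disabled here because the caller holds
     * mapping->tree_lock, which is taken with interrupts off. On -rt the
     * lock does not disable hard interrupts, but its holder still cannot
     * be preempted, so the per-CPU statistics update remains safe.
     */
    mem_cgroup_charge_statistics(memcg, page, -1);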

    Signed-off-by: Johannes Weiner
    Reported-by: Clark Williams
    Cc: Fernando Lopez-Lezcano
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When trimming memcg consumption excess (see memory.high), we call
    try_to_free_mem_cgroup_pages without checking if we are allowed to sleep
    in the current context, which can result in a deadlock. Fix this.
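
    A plausible shape of the fix, assuming the __GFP_WAIT flag of that
    era as the "may sleep" test:

    /* only trim memory.high excess when the context may sleep */
    if (gfp_mask & __GFP_WAIT) {
            mem_cgroup_events(memcg, MEMCG_HIGH, 1);
            try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
    }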

    Fixes: 241994ed8649 ("mm: memcontrol: default hierarchy interface for memory")
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Izumi found the following oops when hot re-adding a node:

    BUG: unable to handle kernel paging request at ffffc90008963690
    IP: __wake_up_bit+0x20/0x70
    Oops: 0000 [#1] SMP
    CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
    Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
    task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
    RIP: 0010:[] [] __wake_up_bit+0x20/0x70
    RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
    RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
    RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
    RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
    R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
    FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
    Call Trace:
    unlock_page+0x6d/0x70
    generic_write_end+0x53/0xb0
    xfs_vm_write_end+0x29/0x80 [xfs]
    generic_perform_write+0x10a/0x1e0
    xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
    xfs_file_write_iter+0x79/0x120 [xfs]
    __vfs_write+0xd4/0x110
    vfs_write+0xac/0x1c0
    SyS_write+0x58/0xd0
    system_call_fastpath+0x12/0x76
    Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
    RIP [] __wake_up_bit+0x20/0x70
    RSP
    CR2: ffffc90008963690

    Reproduction method (re-add a node):
    Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)

    This looks like a use-after-free problem, and the root cause is that
    zone->wait_table was not set to NULL after being freed in
    try_offline_node().

    When hot re-adding a node, we reuse its pgdat (and hence its zone
    structs), and when adding pages to the target zone, the zone is
    initialized first (including the wait_table) if it has not been
    initialized yet. The zone-initialized check is based on
    zone->wait_table:

    static inline bool zone_is_initialized(struct zone *zone)
    {
            return !!zone->wait_table;
    }

    So if we do not set zone->wait_table to NULL after freeing it, the
    memory hotplug routine will skip initializing the zone when the node
    is hot re-added, the wait_table will still point to the freed memory,
    and we then access an invalid address when trying to wake up waiters
    after page I/O completes, as in the oops above.
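
    The fix is to clear the pointer where the table is freed in
    try_offline_node(); a sketch:

    vfree(zone->wait_table);
    /* force wait_table (re)initialization on the next hot-add */
    zone->wait_table = NULL;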

    Signed-off-by: Gu Zheng
    Reported-by: Taku Izumi
    Reviewed-by: Yasuaki Ishimatsu
    Cc: KAMEZAWA Hiroyuki
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     

29 May, 2015

1 commit

  • bdi_unregister() now contains very little functionality.

    It contains a WARN_ON if bdi->dev is NULL. This warning is of no
    real consequence, as bdi->dev isn't needed by anything else in the
    function, and it triggers if
    blk_cleanup_queue() -> bdi_destroy()
    is called before bdi_unregister(), which has happened since
    commit 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")

    So this warning isn't wanted.

    It also calls bdi_set_min_ratio(). This needs to be called after
    writes through the bdi have all been flushed, and before the bdi is destroyed.
    Calling it early is better than calling it late as it frees up a global
    resource.

    Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
    perfectly fits these requirements.

    So bdi_unregister() can be discarded, with the important content moved
    to bdi_destroy(), as can the
    writeback_bdi_unregister
    tracepoint, which is already unused.
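
    A sketch of the resulting ordering in bdi_destroy() (helper names as
    in the 4.0-era backing-dev code; details elided):

    void bdi_destroy(struct backing_dev_info *bdi)
    {
            bdi_wb_shutdown(bdi);        /* writes through the bdi flushed */
            bdi_set_min_ratio(bdi, 0);   /* release the global min_ratio */
            /* device unregistration and freeing follow */
    }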

    Reported-by: Mike Snitzer
    Cc: stable@vger.kernel.org (v4.0)
    Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info")
    Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Dan Williams
    Tested-by: Nicholas Moulin
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

15 May, 2015

3 commits

  • NUMA balancing is meant to be disabled by default on UMA machines, but
    the check uses nr_node_ids (the highest node ID) instead of
    num_online_nodes (the number of online nodes).

    The consequence is that a UMA machine whose single node has an ID of
    1 or higher will enable NUMA balancing. This incurs useless overhead
    from minor faults, with the impact depending on the workload. This is
    the impact on the stats when running a kernel build on a single-node
    machine whose node ID happened to be 1:

                                    vanilla  patched
    NUMA base PTE updates           5113158        0
    NUMA huge PMD updates               643        0
    NUMA page range updates         5442374        0
    NUMA hint faults                2109622        0
    NUMA hint local faults          2109622        0
    NUMA hint local percent             100      100
    NUMA pages migrated                   0        0
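
    The fix is essentially a predicate swap; a sketch (the surrounding
    init code is assumed):

    /* before: enables balancing on UMA if the only node's ID is > 0 */
    if (nr_node_ids > 1)
            set_numabalancing_state(true);

    /* after: enables balancing only with more than one online node */
    if (num_online_nodes() > 1)
            set_numabalancing_state(true);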

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • I had an issue:

    Unable to handle kernel NULL pointer dereference at virtual address 0000082a
    pgd = cc970000
    [0000082a] *pgd=00000000
    Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    PC is at get_pageblock_flags_group+0x5c/0xb0
    LR is at unset_migratetype_isolate+0x148/0x1b0
    pc : [] lr : [] psr: 80000093
    sp : c7029d00 ip : 00000105 fp : c7029d1c
    r10: 00000001 r9 : 0000000a r8 : 00000004
    r7 : 60000013 r6 : 000000a4 r5 : c0a357e4 r4 : 00000000
    r3 : 00000826 r2 : 00000002 r1 : 00000000 r0 : 0000003f
    Flags: Nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
    Control: 10c5387d Table: 2cb7006a DAC: 00000015
    Backtrace:
    get_pageblock_flags_group+0x0/0xb0
    unset_migratetype_isolate+0x0/0x1b0
    undo_isolate_page_range+0x0/0xdc
    __alloc_contig_range+0x0/0x34c
    alloc_contig_range+0x0/0x18

    This issue occurs because, when unset_migratetype_isolate() is called
    to unset a part of CMA memory, it tries to access the buddy page to
    get its status:

    if (order >= pageblock_order) {
            page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
            buddy_idx = __find_buddy_index(page_idx, order);
            buddy = page + (buddy_idx - page_idx);

            if (!is_migrate_isolate_page(buddy)) {

    But the start address of this part of CMA memory is very close to
    memory that was reserved at boot time (and is not in the buddy
    system), so the computed buddy page may not be valid. Add a check
    before accessing it.
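
    A sketch of the guarded access (pfn_valid_within() is the usual
    helper for exactly this case):

    if (pfn_valid_within(page_to_pfn(buddy)) &&
        !is_migrate_isolate_page(buddy)) {
            __isolate_free_page(page, order);
            kernel_map_pages(page, (1 << order), 1);
            set_page_refcounted(page);
            isolated_page = page;
    }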

    [akpm@linux-foundation.org: use conventional code layout]
    Signed-off-by: Hui Zhu
    Suggested-by: Laura Abbott
    Suggested-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • Not all kmem allocations should be accounted to memcg. The following
    patch gives an example where accounting a certain type of allocation to
    memcg can effectively result in a memory leak. This patch adds the
    __GFP_NOACCOUNT flag which, if passed to kmalloc and friends, will force
    the allocation to go through the root cgroup. It will be used by the
    next patch.

    Note: since, with kmemleak enabled, each kmalloc implies yet another
    allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
    gfp_kmemleak_mask.

    Alternatively, we could introduce a per kmem cache flag disabling
    accounting for all allocations of a particular kind, but (a) we would not
    be able to bypass accounting for kmalloc then and (b) a kmem cache with
    this flag set could not be merged with a kmem cache without this flag,
    which would increase the number of global caches and therefore
    fragmentation even if the memory cgroup controller is not used.

    Despite its generic name, currently __GFP_NOACCOUNT disables accounting
    only for kmem allocations, while user page allocations are always
    charged. To catch abuse of this flag, a warning is issued on any
    attempt to pass it to mem_cgroup_try_charge.
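
    Usage is just an extra GFP flag at the allocation site; a minimal
    sketch:

    void *obj;

    /* bypass memcg kmem accounting: charge to the root cgroup */
    obj = kmalloc(size, GFP_KERNEL | __GFP_NOACCOUNT);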

    Signed-off-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Greg Thelen
    Cc: Greg Kroah-Hartman
    Cc: [4.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 May, 2015

1 commit

  • Pull block fixes from Jens Axboe:
    "A collection of fixes since the merge window;

    - fix for a double elevator module release, from Chao Yu. Ancient bug.

    - the splice() MORE flag fix from Christophe Leroy.

    - a fix for NVMe, fixing a patch that went in during the merge window.
    From Keith.

    - two fixes for blk-mq CPU hotplug handling, from Ming Lei.

    - bdi vs blockdev lifetime fix from Neil Brown, fixing an oops in md.

    - two blk-mq fixes from Shaohua, fixing a race on queue stop and a
    bad merge issue with FUA writes.

    - division-by-zero fix for writeback from Tejun.

    - a block bounce page accounting fix, making sure we inc/dec after
    bouncing so that pre/post IO pages match up. From Wang YanQing"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    splice: sendfile() at once fails for big files
    blk-mq: don't lose requests if a stopped queue restarts
    blk-mq: fix FUA request hang
    block: destroy bdi before blockdev is unregistered.
    block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
    elevator: fix double release of elevator module
    writeback: use |1 instead of +1 to protect against div by zero
    blk-mq: fix CPU hotplug handling
    blk-mq: fix race between timeout and CPU hotplug
    NVMe: Fix VPD B0 max sectors translation

    Linus Torvalds
     

06 May, 2015

4 commits

  • The hwpoison injector checks PageLRU of the raw target page to find out
    whether the page is an appropriate target, but the current code now
    filters out thp tail pages, which prevents us from testing such cases
    via this interface. So let's check hpage instead of p.
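
    The change, in miniature (surrounding logic in mm/hwpoison-inject.c
    assumed):

    /* was: if (!PageLRU(p) && !PageHuge(p)) */
    if (!PageLRU(hpage) && !PageHuge(p))
            return 0;       /* not an appropriate target */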

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Hwpoison injection via debugfs:hwpoison/corrupt-pfn takes a refcount on
    the target page, but the current code doesn't release it if the target
    page is not supposed to be injected, which results in a memory leak.
    This patch simply adds the refcount-releasing code.
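
    Conceptually, the filtered path must now drop the reference it took
    (labels illustrative):

    if (!PageLRU(hpage) && !PageHuge(p))
            goto put_out;   /* filtered target: release the page */

    return memory_failure(pfn, 18, MF_COUNT_INCREASED);

    put_out:
    put_page(p);
    return 0;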

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If multiple soft offline events hit one free page/hugepage concurrently,
    soft_offline_page() can handle the free page/hugepage multiple times,
    which makes the num_poisoned_pages counter increase more than once. This
    patch fixes the wrong counting by checking TestSetPageHWPoison for normal
    pages and by checking the return value of dequeue_hwpoisoned_huge_page()
    for hugepages.
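
    A sketch of the idea (both primitives report whether someone else got
    there first, so the counter is only bumped once):

    if (PageHuge(page)) {
            if (!dequeue_hwpoisoned_huge_page(hpage))
                    atomic_long_add(1 << compound_order(hpage),
                                    &num_poisoned_pages);
    } else {
            if (!TestSetPageHWPoison(page))
                    atomic_long_inc(&num_poisoned_pages);
    }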

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently memory_failure() calls shake_page() to sweep pages out from
    pcplists only when the victim page is a 4kB LRU page or a thp head
    page. But we should do this for a thp tail page too.

    Consider that a memory error hits a thp tail page whose head page is on
    a pcplist when memory_failure() runs. Then the current kernel skips
    the shake_page() part, so hwpoison_user_mappings() returns without
    calling split_huge_page() or try_to_unmap(), because PageLRU of the thp
    head is still cleared as a result of skipping shake_page().

    As a result, me_huge_page() runs for the thp, which is broken behavior.

    One effect is a leak of the thp. Another is a failure to isolate the
    memory error, so a later access to the error address causes another
    MCE, which kills the processes that used the thp.

    This patch fixes the problem by calling shake_page() in the thp tail
    case too.
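
    One plausible shape of the change in memory_failure() (the exact
    condition there is assumed):

    /*
     * Sweep the head page out of the pcplists for thp tails too, so
     * PageLRU is set again before hwpoison_user_mappings() runs.
     */
    if (!PageHuge(p) && !PageLRU(hpage))
            shake_page(hpage, 0);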

    Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU")
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Acked-by: Dean Nelson
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Cc: Jin Dongming
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

24 Apr, 2015

1 commit

  • mm/page-writeback.c has several places where 1 is added to the divisor
    to prevent division by zero exceptions; however, if the original
    divisor is equivalent to -1, adding 1 leads to division by zero.

    There are three places where +1 is used for this purpose - one in
    pos_ratio_polynom() and two in bdi_position_ratio(). The second one
    in bdi_position_ratio() actually triggered a div-by-zero oops on a
    machine running a 3.10 kernel. The divisor is

    x_intercept - bdi_setpoint + 1 == span + 1

    span is confirmed to be (u32)-1. It isn't clear how it ended up that
    way, but it could be from the write bandwidth calculation underflow
    fixed by c72efb658f7c ("writeback: fix possible underflow in write
    bandwidth calculation").

    At any rate, +1 isn't a proper protection against div-by-zero. This
    patch converts all +1 protections to |1. Note that
    bdi_update_dirty_ratelimit() was already using |1 before this patch.
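
    Why |1 works where +1 does not, in miniature:

    u64 span = x_intercept - bdi_setpoint;  /* confirmed (u32)-1 here */
    u32 bad  = span + 1;  /* 0xffffffff + 1 truncates to 0: div-by-zero */
    u32 good = span | 1;  /* always odd, hence never 0; like +1, it
                             perturbs the divisor by at most 1 */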

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Apr, 2015

1 commit

  • Pull third hunk of vfs changes from Al Viro:
    "This contains the ->direct_IO() changes from Omar + saner
    generic_write_checks() + dealing with fcntl()/{read,write}() races
    (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
    repeatedly looking at ->f_flags, which can be changed by fcntl(2),
    check ->ki_flags - which cannot) + infrastructure bits for dhowells'
    d_inode annotations + Christophs switch of /dev/loop to
    vfs_iter_write()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
    block: loop: switch to VFS ITER_BVEC
    configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
    VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
    VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
    VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
    NFS: Don't use d_inode as a variable name
    VFS: Impose ordering on accesses of d_inode and d_flags
    VFS: Add owner-filesystem positive/negative dentry checks
    nfs: generic_write_checks() shouldn't be done on swapout...
    ocfs2: use __generic_file_write_iter()
    mirror O_APPEND and O_DIRECT into iocb->ki_flags
    switch generic_write_checks() to iocb and iter
    ocfs2: move generic_write_checks() before the alignment checks
    ocfs2_file_write_iter: stop messing with ppos
    udf_file_write_iter: reorder and simplify
    fuse: ->direct_IO() doesn't need generic_write_checks()
    ext4_file_write_iter: move generic_write_checks() up
    xfs_file_aio_write_checks: switch to iocb/iov_iter
    generic_write_checks(): drop isblk argument
    blkdev_write_iter: expand generic_file_checks() call in there
    ...

    Linus Torvalds
     

16 Apr, 2015

20 commits

  • Merge second patchbomb from Andrew Morton:

    - the rest of MM

    - various misc bits

    - add ability to run /sbin/reboot at reboot time

    - printk/vsprintf changes

    - fiddle with seq_printf() return value

    * akpm: (114 commits)
    parisc: remove use of seq_printf return value
    lru_cache: remove use of seq_printf return value
    tracing: remove use of seq_printf return value
    cgroup: remove use of seq_printf return value
    proc: remove use of seq_printf return value
    s390: remove use of seq_printf return value
    cris fasttimer: remove use of seq_printf return value
    cris: remove use of seq_printf return value
    openrisc: remove use of seq_printf return value
    ARM: plat-pxa: remove use of seq_printf return value
    nios2: cpuinfo: remove use of seq_printf return value
    microblaze: mb: remove use of seq_printf return value
    ipc: remove use of seq_printf return value
    rtc: remove use of seq_printf return value
    power: wakeup: remove use of seq_printf return value
    x86: mtrr: if: remove use of seq_printf return value
    linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
    MAINTAINERS: CREDITS: remove Stefano Brivio from B43
    .mailmap: add Ricardo Ribalda
    CREDITS: add Ricardo Ribalda Delgado
    ...

    Linus Torvalds
     
  • Do not perform cond_resched() before the busy compaction loop in
    __zs_compact(), because this loop does it when needed.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • There is no point in overriding the size class below. It causes fatal
    corruption on the next chunk in the 3264-byte size class, which is the
    last size class that is not huge.

    For example, if the requested size was exactly 3264 bytes, current
    zsmalloc allocates and returns a chunk from the 3264-byte size class,
    not the 4096-byte one. User access to this chunk may then overwrite
    the head of the next adjacent chunk.

    Here is the panic log captured when the freelist was corrupted by this:

    Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
    Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
    Modules linked in:
    exynos-snapshot: core register saved(CPU:5)
    CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
    exynos-snapshot: context saved(CPU:5)
    exynos-snapshot: item - log_kevents is disabled
    CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
    task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
    PC is at obj_idx_to_offset+0x0/0x1c
    LR is at obj_malloc+0x44/0xe8
    pc : [] lr : [] pstate: a0000045
    sp : ffffffc0b71eb790
    x29: ffffffc0b71eb790 x28: ffffffc00204c000
    x27: 000000000001d96f x26: 0000000000000000
    x25: ffffffc098cc3500 x24: ffffffc0a13f2810
    x23: ffffffc098cc3501 x22: ffffffc0a13f2800
    x21: 000011e1a02006e3 x20: ffffffc0a13f2800
    x19: ffffffbc02a7e000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000feb
    x15: 0000000000000000 x14: 00000000a01003e3
    x13: 0000000000000020 x12: fffffffffffffff0
    x11: ffffffc08b264000 x10: 00000000e3a01004
    x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
    x7 : ffffffc000307d24 x6 : 0000000000000000
    x5 : 0000000000000038 x4 : 000000000000011e
    x3 : ffffffbc00003e90 x2 : 0000000000000cc0
    x1 : 00000000d0100371 x0 : ffffffbc00003e90

    Reported-by: Sooyong Suk
    Signed-off-by: Heesub Shin
    Tested-by: Sooyong Suk
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heesub Shin
     
  • In putback_zspage(), we don't need to re-insert a zspage into the
    size_class's zspage list just to fix its fullness group. We can fix it
    directly, without the reinsertion, and save some instructions.

    Reported-by: Heesub Shin
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Ganesh Mahendran
    Cc: Luigi Semenzato
    Cc: Gunho Lee
    Cc: Juneho Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • A micro-optimization. Avoid additional branching and reduce (a bit)
    register pressure (e.g. s_off += size; d_off += size; may be calculated
    twice: first for the >= PAGE_SIZE check and later for the offset update
    in the "else" clause).

    scripts/bloat-o-meter shows some improvement

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
    function                        old     new   delta
    zs_object_copy                  550     540     -10

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Do not synchronize RCU in zs_compact(): neither zsmalloc nor zram
    uses RCU.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Yinghao Xie
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghao Xie
     
  • Create zsmalloc documentation explaining the design concept and stat
    information.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When investigating compaction, per-class fullness information is
    helpful for seeing how well compaction works. With it, we can see more
    clearly how compaction behaves on each size class.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We store the handle in the header of each allocated object, which
    increases the size of each object by sizeof(unsigned long).

    If zram stores 4096 bytes to zsmalloc (ie, bad compression), zsmalloc
    needs a 4104B class to make room for the handle.

    However, the 4104B class has pages_per_zspage == 1, so the size wasted
    to internal fragmentation is 8192 - 4104 bytes, which is terrible.

    So this patch records the handle in page->private for such huge objects
    (ie, pages_per_zspage == 1 && maxobj_per_zspage == 1) instead of in the
    header of each object, so we can use the 4096B class rather than the
    4104B class, as sketched below.
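
    A sketch of the allocation-side branch (struct-page field reuse as in
    zsmalloc of this era; names are approximations):

    if (!class->huge)
            /* record the handle at the head of the allocated chunk */
            link->handle = handle;
    else
            /* huge object: keep the handle on the zspage's first page */
            set_page_private(first_page, handle);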

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Currently, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if the zspage
    has under 1/4 of its objects used (ie, fullness_threshold_frac). This
    can result in loose packing, since zsmalloc migrates only
    ZS_ALMOST_EMPTY zspages out.

    This patch changes the rule so that zsmalloc marks a zspage with more
    than 3/4 of its objects used as ZS_ALMOST_FULL, giving tighter packing.
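
    A sketch of the classifier after the change (the struct-page field
    reuse follows zsmalloc of this era and is an assumption):

    static enum fullness_group get_fullness_group(struct page *first_page)
    {
            int inuse = first_page->inuse;
            int objs = first_page->objects;
            enum fullness_group fg;

            if (inuse == 0)
                    fg = ZS_EMPTY;
            else if (inuse == objs)
                    fg = ZS_FULL;
            else if (inuse <= 3 * objs / fullness_threshold_frac)
                    fg = ZS_ALMOST_EMPTY;   /* at most 3/4 used */
            else
                    fg = ZS_ALMOST_FULL;    /* above 3/4 used */

            return fg;
    }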

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch provides the core functions for migration in zsmalloc. The
    migration policy is simple, as follows:

    for each size class {
            while {
                    src_page = get zs_page from ZS_ALMOST_EMPTY
                    if (!src_page)
                            break;
                    dst_page = get zs_page from ZS_ALMOST_FULL
                    if (!dst_page)
                            dst_page = get zs_page from ZS_ALMOST_EMPTY
                    if (!dst_page)
                            break;
                    migrate(from src_page, to dst_page);
            }
    }

    For migration, we need to identify which objects in a zspage are
    allocated so we can migrate them out. We could learn this by iterating
    over the free objects in a zspage, because a zspage's first_page keeps
    the free objects in a singly-linked list, but that is not efficient.
    Instead, this patch adds a tag (ie, OBJ_ALLOCATED_TAG) in the header of
    each object (ie, the handle) so we can easily check whether an object
    is allocated; see the sketch below.

    This patch also adds another status bit in the handle to synchronize
    user access through zs_map_object with migration. During migration, we
    cannot move objects a user is accessing, because of data coherency
    between the old object and the new one.
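
    A sketch of the allocated-object test during migration (helper names
    as in zsmalloc of this era; an approximation):

    head = obj_to_head(class, page, addr);  /* handle word or free link */
    if (head & OBJ_ALLOCATED_TAG) {
            handle = head & ~OBJ_ALLOCATED_TAG;
            /* allocated object: pin the handle, then migrate it */
    }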

    [akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()]
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • A later patch's migration support needs parts of zs_malloc and zs_free,
    so this patch factors them out.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Recently, we started to use zram heavily, and some issues popped up.

    1) external fragmentation

    I got a report from Juneho Choi that fork failed although there were
    plenty of free pages in the system. His investigation revealed that
    zram is one of the culprits behind heavy fragmentation, so there was no
    contiguous 16K page left for the pgd needed by fork on ARM.

    2) non-movable pages

    Another problem with zram is that users inherently want to use it as
    swap on small-memory systems, so they combine zram with CMA to use
    memory efficiently. Unfortunately, this doesn't work well, because zram
    cannot use CMA's movable pages as long as it doesn't support
    compaction. I got several reports of OOM happening with zram although
    there was lots of swap space and free space in the CMA area.

    3) internal fragmentation

    zram has started supporting a memory limitation feature to limit memory
    usage, and I sent a patchset (https://lkml.org/lkml/2014/9/21/148) for
    the VM to be harmonized with zram-swap: stop anonymous page reclaim if
    zram has consumed memory up to the limit although there is free space
    on the swap device. One problem with that direction is that zram has no
    way to know about holes in the memory space zsmalloc allocated, created
    by internal fragmentation, so zram would regard the swap as full
    although there is free space in zsmalloc. To solve the issue, zram
    wants to trigger compaction of zsmalloc before deciding whether it is
    full.

    This patchset is the first step toward addressing the issues above. For
    that, it adds an indirection layer between the handle and the object
    location, and supports manual compaction, solving the third problem
    first of all.

    After this patchset is merged, the next step is to make the VM aware of
    zsmalloc compaction so that generic compaction will move
    zsmalloc-allocated pages automatically at runtime.

    In my synthetic experiment (ie, highly compressible data with heavy
    swap in/out on an 8G zram swap), the data is as follows:

    Before compaction:
    zram allocated object : 60212066 bytes
    zram total used       : 140103680 bytes
    ratio                 : 42.98 percent
    MemFree               : 840192 kB

    After compaction:
    zram allocated object : 60212066 bytes
    zram total used       : 76185600 bytes
    ratio                 : 79.03 percent
    MemFree               : 901932 kB

    Juneho reported the numbers below from his real platform with light
    aging, so I think the benefit would be bigger on a system aged over a
    long time.

    - frag_ratio increased 3% (ie, higher is better)
    - memfree increased about 6MB
    - In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased

    frag ratio after swap fragment
    used : 156677 kbytes
    total: 166092 kbytes
    frag_ratio : 94
    meminfo before compaction
    MemFree: 83724 kB
    Node 0, zone Normal 13642 1364 57 10 61 17 9 5 4 0 0
    Node 0, zone HighMem 425 29 1 0 0 0 0 0 0 0 0

    num_migrated : 23630
    compaction done

    frag ratio after compaction
    used : 156673 kbytes
    total: 160564 kbytes
    frag_ratio : 97
    meminfo after compaction
    MemFree: 89060 kB
    Node 0, zone Normal 14076 1544 67 14 61 17 9 5 4 0 0
    Node 0, zone HighMem 863 50 1 0 0 0 0 0 0 0 0

    This patchset adds more logic (about 480 lines) to zsmalloc, but when I
    tested a heavy swap-in/out program, the regression in swap-in/out speed
    was marginal, because most of the overhead comes from
    compress/decompress and other MM reclaim work.

    This patch (of 7):

    Currently, a zsmalloc handle encodes the object's location directly,
    which makes supporting migration hard.

    This patch decouples handle and object by adding an indirection layer.
    For that, it allocates the handle dynamically and returns it to the
    user. The handle is an address allocated by the slab allocator, so it
    is unique, and we can keep the object's location in the memory
    allocated for the handle.

    With that, we can change an object's position without changing the
    handle itself.
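
    In miniature (helper names follow zsmalloc of this era and are
    assumptions):

    handle = alloc_handle(pool);   /* slab object: a stable address */
    obj = obj_malloc(first_page, class, handle);  /* <page,index> encoded */
    record_obj(handle, obj);       /* *(unsigned long *)handle = obj */
    return handle;                 /* obj may move later; handle won't */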

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • mm/compaction.c:250:13: warning: 'suitable_migration_target' defined but not used [-Wunused-function]

    Reported-by: Fengguang Wu
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This will allow a FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs)
    to get notified when an access is a write to a read-only PFN.

    This can happen if we mmap() a file, then first mmap-read from it to
    page-in a read-only PFN, and then mmap-write to the same page.

    We need this functionality to fix a DAX bug, where in the scenario above
    we fail to set ctime/mtime though we modified the file. An xfstest is
    attached to this patchset that shows the failure and the fix. (A DAX
    patch will follow)

    This functionality is extra important for us, because upon dirtying of a
    pmem page we also want to RDMA the page to a remote cluster node.

    We define a new pfn_mkwrite and do not reuse page_mkwrite because:
    1 - The name ;-)
    2 - But mainly because it would take a very long and tedious
    audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
    users, to make sure they do not now crash. For example, current
    DAX code (which this is for) would crash.
    If we wanted to reuse page_mkwrite, we would need to first patch
    all users so they do not crash on no-page, and only then enable
    this patch. But even if I did that, I would not sleep so well at
    night. Adding a new vector is the safest thing to do, and it is
    not that expensive: an extra pointer in a static function vector
    per driver. The new vector is also better for performance, because
    otherwise we would call all current kernel vectors just to
    check-we-have-no-page, do nothing, and return.

    No need to call it from do_shared_fault because do_wp_page is called to
    change pte permissions anyway.
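
    A sketch of how a VM_PFNMAP/VM_MIXEDMAP driver might hook the new
    vector (the my_* names and the handler body are illustrative):

    static int my_pfn_mkwrite(struct vm_area_struct *vma,
                              struct vm_fault *vmf)
    {
            /* first write to a read-only pfn: account the modification */
            file_update_time(vma->vm_file);
            return 0;
    }

    static const struct vm_operations_struct my_vm_ops = {
            .fault       = my_fault,
            .pfn_mkwrite = my_pfn_mkwrite,
    };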

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • A lot of filesystems use generic_file_mmap() and filemap_fault(), so
    f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

    This patch additionally prints the file name, vm_ops->fault, f_op->mmap
    and a_ops->readpage (which is almost always implemented and
    filesystem-specific).

    Example:

    [ 23.676410] BUG: Bad page map in process sh pte:1b7e6025 pmd:19bbd067
    [ 23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
    [ 23.677481] flags: 0x10000000000000c(referenced|uptodate)
    [ 23.677896] page dumped because: bad pte
    [ 23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma: (null) mapping:ffff8800196426c0 index:97
    [ 23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage
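
    The print itself might look roughly like this (a sketch; the
    surrounding print_bad_pte() plumbing is omitted):

    if (vma->vm_file) {
            struct address_space *mapping = vma->vm_file->f_mapping;

            pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
                     vma->vm_file,
                     vma->vm_ops ? vma->vm_ops->fault : NULL,
                     vma->vm_file->f_op->mmap,
                     mapping ? mapping->a_ops->readpage : NULL);
    }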

    [akpm@linux-foundation.org: use pr_alert, per Kirill]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Mempools keep allocated objects in reserve for situations when an
    ordinary allocation may not be possible to satisfy. These objects
    shouldn't be accessed before they leave the pool.

    This patch poisons elements when they enter the pool and unpoisons them
    when they leave it. This lets KASan detect use-after-free of a
    mempool's elements.
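
    Conceptually (the kasan_*_element helper names are assumptions, not
    necessarily the exact API introduced here):

    static void add_element(mempool_t *pool, void *element)
    {
            BUG_ON(pool->curr_nr >= pool->min_nr);
            kasan_poison_element(pool, element);    /* now off limits */
            pool->elements[pool->curr_nr++] = element;
    }

    static void *remove_element(mempool_t *pool)
    {
            void *element = pool->elements[--pool->curr_nr];

            BUG_ON(pool->curr_nr < 0);
            kasan_unpoison_element(pool, element);  /* usable again */
            return element;
    }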

    Signed-off-by: Andrey Ryabinin
    Tested-by: David Rientjes
    Cc: Catalin Marinas
    Cc: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Like EXPORT_SYMBOL(): the positioning communicates that the macro pertains
    to the immediately preceding function.

    Cc: Dmitry Safonov
    Cc: Michal Nazarewicz
    Cc: Stefan Strogin
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Pintu Kumar
    Cc: Weijie Yang
    Cc: Laurent Pinchart
    Cc: Vyacheslav Tyrtov
    Cc: Aleksei Mateosian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Here are two functions that provide an interface to compute the used
    size and the size of the biggest free chunk in a CMA region; one is
    sketched below. Add that information to debugfs.
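
    A sketch of the used-size side (assuming the cma->lock/bitmap layout
    of mm/cma.c at the time):

    static int cma_used_get(void *data, u64 *val)
    {
            struct cma *cma = data;
            unsigned long used;

            mutex_lock(&cma->lock);
            /* bits set in the bitmap are allocated chunks */
            used = bitmap_weight(cma->bitmap, (int)cma->count);
            mutex_unlock(&cma->lock);
            *val = (u64)used << cma->order_per_bit;

            return 0;
    }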

    [akpm@linux-foundation.org: move debug code from cma.c into cma_debug.c]
    [stefan.strogin@gmail.com: move code from cma_get_used() and cma_get_maxchunk() to cma_used_get() and cma_maxchunk_get()]
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Stefan Strogin
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Pintu Kumar
    Cc: Weijie Yang
    Cc: Laurent Pinchart
    Cc: Vyacheslav Tyrtov
    Cc: Aleksei Mateosian
    Signed-off-by: Stefan Strogin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov