16 Apr, 2015

40 commits

  • In check_hung_uninterruptible_tasks(), avoid the use of the
    deprecated while_each_thread().

    The "max_count" logic will prevent a livelock - see commit 0c740d0a
    ("introduce for_each_thread() to replace the buggy while_each_thread()").
    Having said this, let's use for_each_process_thread().
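
    A hedged sketch of the resulting loop shape (simplified; the actual
    hung_task.c batching and max_count bookkeeping are elided):

    struct task_struct *g, *t;

    rcu_read_lock();
    for_each_process_thread(g, t) {
            if (t->state == TASK_UNINTERRUPTIBLE)
                    check_hung_task(t, timeout);    /* existing helper */
    }
    rcu_read_unlock();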

    Signed-off-by: Aaron Tomlin
    Acked-by: Oleg Nesterov
    Cc: David Rientjes
    Cc: Dave Wysochanski
    Cc: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     
  • All users of __check_region(), check_region(), and check_mem_region() are
    gone. We got rid of the last user in v4.0-rc1. Remove them.

    bloat-o-meter on x86_64 shows:

    add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
    function                  old   new   delta
    __kstrtab___check_region   15     -     -15
    __ksymtab___check_region   16     -     -16
    __check_region             71     -     -71
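
    For reference, the non-racy interface that superseded the removed
    check-then-claim helpers is request_region(), which tests and
    reserves in one step; a hedged sketch (io_base/io_len illustrative):

    if (!request_region(io_base, io_len, "mydev"))
            return -EBUSY;          /* region already claimed */
    /* ... use the I/O region ... */
    release_region(io_base, io_len);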

    Signed-off-by: Jakub Sitnicki
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jakub Sitnicki
     
  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is
    available under the CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, the capable() function was moved in order to
    avoid adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25 KB goes down a bit with allnoconfig, but not that much.)

    The kernel was booted in Qemu. All the common functionality works.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
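
    As a quick check, a hedged userspace sketch (not part of the patch)
    of the behaviour noted above -- on a !CONFIG_MULTIUSER kernel the
    compiled-out syscalls fail with ENOSYS:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            if (setuid(1000) == -1 && errno == ENOSYS)
                    printf("setuid: ENOSYS (CONFIG_MULTIUSER disabled)\n");
            return 0;
    }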

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     
  • A `const char *...[]` array is not itself const; it is a (writable)
    array of pointers to const char. Such arrays therefore cannot be
    __initconst, but must be __initdata.

    This fixes section conflicts with LTO.
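
    A minimal sketch of the distinction (array names illustrative):

    /* The elements point to const strings, but the array itself is
     * writable, so it must live in a writable init section: */
    static const char *names[] __initdata = { "foo", "bar" };

    /* Only with a second const is the array object itself read-only: */
    static const char * const names2[] __initconst = { "foo", "bar" };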

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • The verbose module parameter can be set to 2 for extremely verbose
    messages, so the type should be int instead of bool.
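
    A hedged sketch of the shape of the fix (surrounding driver code
    assumed):

    static int verbose;               /* was: static bool verbose; */
    module_param(verbose, int, 0644); /* 0=quiet, 1=verbose, 2=very verbose */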

    Signed-off-by: Dan Carpenter
    Cc: Tim Waugh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • Commit 607ca46e97a1 ("UAPI: (Scripted) Disintegrate include/linux") left
    behind some empty conditional blocks. Since they are useless and may
    cause a reader to wonder whether something is missing, remove them.

    Signed-off-by: Rasmus Villemoes
    Cc: Geert Uytterhoeven
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • If an issue occurs inside a container, a user on the host cannot tell
    which process is in trouble from the guest pid alone: users inside the
    container only know the pids of their own pid namespace. This is an
    obstacle for troubleshooting.

    This patch adds four fields: NStgid, NSpid, NSpgid and NSsid:

    a) In init_pid_ns, nothing changed;

    b) Inside a pid namespace, they show the pid at each namespace level:
    NStgid: 21776 5 1
    NSpid: 21776 5 1
    NSpgid: 21776 5 1
    NSsid: 21729 1 0
    ** Process id is 21776 in level 0, 5 in level 1, 1 in level 2.

    c) If the pidns is nested, the values depend on which pidns you are in.
    NStgid: 5 1
    NSpid: 5 1
    NSpgid: 5 1
    NSsid: 1 0
    ** Views from level 1
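
    A hedged userspace sketch (not part of the patch) that dumps the new
    fields for the current task:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/status", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f))
                    if (!strncmp(line, "NS", 2)) /* NStgid/NSpid/NSpgid/NSsid */
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }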

    [akpm@linux-foundation.org: add CONFIG_PID_NS ifdef]
    Signed-off-by: Chen Hanxiao
    Acked-by: Serge Hallyn
    Acked-by: "Eric W. Biederman"
    Tested-by: Serge Hallyn
    Tested-by: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Hanxiao
     
  • Return a negative error code on failure.

    A simplified version of the semantic match that finds this problem is as
    follows: (http://coccinelle.lip6.fr/)

    //
    @@
    identifier ret; expression e1,e2;
    @@

    (
    if (\(ret < 0\|ret != 0\))
     { ... return ret; }
    |
    ret = 0
    )
    ... when != ret = e1
        when != &ret
    *if(...)
    {
      ... when != ret = e2
          when forall
     return ret;
    }
    //
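
    As a hedged illustration (hypothetical function, not from the patched
    file), the bug class the rule flags looks like this:

    struct foo { void *buf; };      /* hypothetical */

    static int setup_buffer(struct foo *f)
    {
            int ret = 0;

            f->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
            if (!f->buf)
                    return ret;     /* bug: returns 0 (success) on
                                     * failure; should be -ENOMEM */
            return 0;
    }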

    Signed-off-by: Julia Lawall
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • Do not perform cond_resched() before the busy compaction loop in
    __zs_compact(), because this loop does it when needed.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • There is no point in overriding the size class below. It causes fatal
    corruption of the next chunk in the 3264-byte size class, which is the
    last size class that is not huge.

    For example, if the requested size was exactly 3264 bytes, current
    zsmalloc allocates and returns a chunk from the 3264-byte size class,
    not 4096. User access to this chunk may overwrite the head of the next
    adjacent chunk.

    Here is the panic log captured when the freelist was corrupted by this:

    Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
    Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
    Modules linked in:
    exynos-snapshot: core register saved(CPU:5)
    CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
    exynos-snapshot: context saved(CPU:5)
    exynos-snapshot: item - log_kevents is disabled
    CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
    task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
    PC is at obj_idx_to_offset+0x0/0x1c
    LR is at obj_malloc+0x44/0xe8
    pc : [] lr : [] pstate: a0000045
    sp : ffffffc0b71eb790
    x29: ffffffc0b71eb790 x28: ffffffc00204c000
    x27: 000000000001d96f x26: 0000000000000000
    x25: ffffffc098cc3500 x24: ffffffc0a13f2810
    x23: ffffffc098cc3501 x22: ffffffc0a13f2800
    x21: 000011e1a02006e3 x20: ffffffc0a13f2800
    x19: ffffffbc02a7e000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000feb
    x15: 0000000000000000 x14: 00000000a01003e3
    x13: 0000000000000020 x12: fffffffffffffff0
    x11: ffffffc08b264000 x10: 00000000e3a01004
    x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
    x7 : ffffffc000307d24 x6 : 0000000000000000
    x5 : 0000000000000038 x4 : 000000000000011e
    x3 : ffffffbc00003e90 x2 : 0000000000000cc0
    x1 : 00000000d0100371 x0 : ffffffbc00003e90

    Reported-by: Sooyong Suk
    Signed-off-by: Heesub Shin
    Tested-by: Sooyong Suk
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heesub Shin
     
  • In putback_zspage(), we don't need to insert a zspage into the
    size_class's zspage list again just to fix its fullness group. We can
    set the fullness group directly, without reinsertion, saving some
    instructions.

    Reported-by: Heesub Shin
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Ganesh Mahendran
    Cc: Luigi Semenzato
    Cc: Gunho Lee
    Cc: Juneho Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • A micro-optimization. Avoid additional branching and reduce (a bit)
    register pressure (e.g. s_off += size; d_off += size; may be calculated
    twice: first for the >= PAGE_SIZE check and later for the offset update
    in the "else" clause).

    scripts/bloat-o-meter shows some improvement

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
    function         old   new   delta
    zs_object_copy   550   540     -10
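
    A minimal sketch of the idea (simplified, not the actual zsmalloc
    code):

    /* before: the sum may be evaluated twice */
    if (off + size >= PAGE_SIZE) {
            /* crossed a page boundary */
    } else {
            off += size;
    }

    /* after: one addition, then branch on the result */
    off += size;
    if (off >= PAGE_SIZE) {
            /* crossed a page boundary */
    }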

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Do not synchronize rcu in zs_compact(). Neither zsmalloc nor zram
    uses rcu.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Add Documentation/ABI/obsolete/sysfs-block-zram file and list obsolete and
    deprecated attributes there. The patch also adds additional information
    to zram documentation and describes the basic strategy:

    - the existing RW nodes will be downgraded to WO nodes (in 4.11)
    - deprecated RO sysfs nodes will eventually be removed (in 4.11)

    Users will be additionally notified about deprecated attr usage by
    pr_warn_once() (added to every deprecated attr _show()), as suggested by
    Minchan Kim.

    User space is advised to use zram/stat, zram/io_stat and
    zram/mm_stat files.

    Signed-off-by: Sergey Senozhatsky
    Reported-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Per-device `zram/mm_stat' file provides mm statistics of a particular
    zram device in a format similar to block layer statistics. The file
    consists of a single line and represents the following stats (separated by
    whitespace):

    orig_data_size
    compr_data_size
    mem_used_total
    mem_limit
    mem_used_max
    zero_pages
    num_migrated

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Per-device `zram/io_stat' file provides accumulated I/O statistics of
    a particular zram device in a format similar to block layer statistics.
    The file consists of a single line and represents the following stats
    (separated by whitespace):

    failed_reads
    failed_writes
    invalid_io
    notify_free

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Briefly describe exported device stat attrs in zram documentation. We
    will eventually get rid of per-stat sysfs nodes and, thus, clean up
    Documentation/ABI/testing/sysfs-block-zram file, which is the only source
    of information about device sysfs nodes.

    Add a `num_migrated' description: since there is no independent
    `num_migrated' sysfs node (and no corresponding sysfs-block-zram
    entry), it will be exported via the zram/mm_stat file.

    At this point we can provide minimal description, because sysfs-block-zram
    still contains detailed information.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Use the generic_start_io_acct() and generic_end_io_acct() bio helpers
    to account the device's block layer statistics. This lets users monitor
    zram activity using sysstat and similar packages/tools.

    Apart from the usual per-stat sysfs attrs, zram IO stats are now also
    available in the '/sys/block/zram<id>/stat' and '/proc/diskstats' files.

    We will slowly get rid of per-stat sysfs files.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • A cosmetic change. We have a new code layout that keeps zram's
    per-device sysfs store and show functions in one place. Move
    compact_store() to that handler block to conform to the current layout.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This patch introduces a rework of zram stats. We have per-stat sysfs
    nodes, and that makes things a bit hard to use from user space: it
    doesn't give an immediate stats 'snapshot', and it requires user space
    to make more syscalls - open, read, close for every stat file, with
    appropriate error checks at every step, etc.

    First, zram now accounts block layer statistics, available in the
    /sys/block/zram<id>/stat and /proc/diskstats files. So some new stats
    are available (see Documentation/block/stat.txt); besides, zram's
    activity can now be monitored by sysstat's iostat or similar tools.

    Example:
    cat /sys/block/zram0/stat
    248 0 1984 0 251029 0 2008232 5120 0 5116 5116

    Second, group the nodes currently exported on a per-stat basis into
    two categories (files):

    -- zram/io_stat
    accumulates the device's IO stats that are not accounted by the block
    layer, and contains:
    failed_reads
    failed_writes
    invalid_io
    notify_free

    Example:
    cat /sys/block/zram0/io_stat
    0 0 0 652572

    -- zram/mm_stat
    accumulates zram mm stats and contains:
    orig_data_size
    compr_data_size
    mem_used_total
    mem_limit
    mem_used_max
    zero_pages
    num_migrated

    Example:
    cat /sys/block/zram0/mm_stat
    434634752 270288572 279158784 0 579895296 15060 0

    Per-stat sysfs nodes are now considered deprecated and we plan to
    remove them (and clean up some of the existing stat code) in two years
    (as of now, no warning is printed to syslog about deprecated stats
    being used). User space is advised to use the above mentioned 3 files.
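
    A hedged userspace sketch (not part of the patch) that reads the
    seven whitespace-separated mm_stat fields in the order listed above:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long v[7];
            FILE *f = fopen("/sys/block/zram0/mm_stat", "r");

            if (!f)
                    return 1;
            if (fscanf(f, "%llu %llu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]) == 7)
                    printf("orig=%llu compr=%llu used=%llu migrated=%llu\n",
                           v[0], v[1], v[2], v[6]);
            fclose(f);
            return 0;
    }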

    This patch (of 7):

    Remove the sysfs `num_migrated' attribute. We are moving away from
    per-stat device attrs towards 3 stat files that will accumulate io and
    mm stats in a format similar to the block layer statistics in
    /sys/block/<dev>/stat. That will be easier to use in user space, and
    will reduce the number of syscalls needed to read zram device
    statistics.

    `num_migrated' will come back in the zram/mm_stat file.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Yinghao Xie
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghao Xie
     
  • Create zsmalloc doc which explains design concept and stat information.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When investigating compaction, per-class fullness information is
    helpful for seeing how well compaction works. With it, we can see more
    clearly how compaction behaves on each size class.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We store the handle in the header of each allocated object, so it
    increases the size of each object by sizeof(unsigned long).

    If zram stores 4096 bytes to zsmalloc (ie, bad compression), zsmalloc
    needs the 4104B class to fit the handle.

    However, the 4104B class has 1-pages_per_zspage, so the size wasted to
    internal fragmentation is 8192 - 4104, which is terrible.

    So this patch records the handle in page->private for such huge
    objects (ie, pages_per_zspage == 1 && maxobj_per_zspage == 1) instead
    of in the header of each object, so we can use the 4096B class instead
    of the 4104B class.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now that zsmalloc supports compaction, zram can use it. For the first
    step, this patch exports compact knob via sysfs so user can do compaction
    via "echo 1 > /sys/block/zram0/compact".

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Currently, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if the zspage
    has under 1/4 of its objects used (ie, fullness_threshold_frac). This
    can result in loose packing, since zsmalloc migrates only
    ZS_ALMOST_EMPTY zspages out.

    This patch changes the rule so that zsmalloc marks a zspage with more
    than 3/4 of its objects used as ZS_ALMOST_FULL, enabling tighter
    packing.
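
    A hedged sketch of the new rule (simplified; not the exact zsmalloc
    code):

    static enum fullness_group get_fullness_group(int inuse, int max_objs)
    {
            if (inuse == 0)
                    return ZS_EMPTY;
            if (inuse == max_objs)
                    return ZS_FULL;
            if (inuse > max_objs * 3 / 4)
                    return ZS_ALMOST_FULL;  /* packed tightly; migration dst */
            return ZS_ALMOST_EMPTY;         /* candidate to be migrated out */
    }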

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch provides the core functions for zsmalloc migration. The
    migration policy is simple:

    for each size class {
            while {
                    src_page = get zs_page from ZS_ALMOST_EMPTY
                    if (!src_page)
                            break;
                    dst_page = get zs_page from ZS_ALMOST_FULL
                    if (!dst_page)
                            dst_page = get zs_page from ZS_ALMOST_EMPTY
                    if (!dst_page)
                            break;
                    migrate(from src_page, to dst_page);
            }
    }

    For migration, we need to identify which objects in a zspage are
    allocated, to migrate them out. We could learn this by iterating over
    the free objects in a zspage (first_page of a zspage keeps the free
    objects as a singly-linked list), but that is not efficient. Instead,
    this patch adds a tag (ie, OBJ_ALLOCATED_TAG) in the header of each
    object (ie, the handle) so we can easily check whether an object is
    allocated.

    This patch also adds another status bit in the handle to synchronize
    between user access through zs_map_object() and migration. During
    migration, we cannot move objects that users are accessing, to keep
    data coherent between the old and the new object.
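
    A hedged sketch of the tag check (the exact bit layout is
    simplified):

    #define OBJ_ALLOCATED_TAG 1UL   /* low bit of the stored handle word */

    static int is_obj_allocated(unsigned long head)
    {
            return head & OBJ_ALLOCATED_TAG;
    }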

    [akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()]
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • A later patch's migration support needs parts of zs_malloc() and
    zs_free(), so this patch factors them out.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Recently, we started to use zram heavily and some issues popped up.

    1) external fragmentation

    I got a report from Juneho Choi that fork failed although there were
    plenty of free pages in the system. His investigation revealed zram is
    one of the culprits behind heavy fragmentation, so there were no
    contiguous 16K pages left for the pgd needed by fork on ARM.

    2) non-movable pages

    The other problem with zram is that, inherently, users want to use
    zram as swap on small-memory systems, so they combine zram with CMA to
    use memory efficiently. Unfortunately, it doesn't work well because
    zram cannot use CMA's movable pages unless it supports compaction. I
    got several reports of OOMs happening with zram although there was
    lots of swap space and free space in the CMA area.

    3) internal fragmentation

    zram has started supporting a memory limit feature to bound memory
    usage, and I sent a patchset (https://lkml.org/lkml/2014/9/21/148) for
    the VM to be harmonized with zram-swap, stopping anonymous page
    reclaim once zram has consumed memory up to the limit even though
    there is free space on the swap. One problem with that direction is
    that zram has no way to know about holes, caused by internal
    fragmentation, in the memory space zsmalloc allocated, so zram would
    regard the swap as full although there is free space in zsmalloc. To
    solve this, zram wants to trigger compaction of zsmalloc before it
    decides whether it is full.

    This patchset is the first step toward addressing the above issues.
    For that, it adds an indirection layer between handle and object
    location, and supports manual compaction, solving the third problem
    first of all.

    After this patchset is merged, the next step is to make the VM aware
    of zsmalloc compaction so that generic compaction will move
    zsmalloc-ed pages automatically at runtime.

    In my synthetic experiment (ie, high-compression-ratio data with heavy
    swap in/out on an 8G zram-swap), the data is as follows:

    Before =
    zram allocated object : 60212066 bytes
    zram total used: 140103680 bytes
    ratio: 42.98 percent
    MemFree: 840192 kB

    Compaction

    After =
    frag ratio after compaction
    zram allocated object : 60212066 bytes
    zram total used: 76185600 bytes
    ratio: 79.03 percent
    MemFree: 901932 kB

    Juneho reported the numbers below from his real platform with light
    aging, so I think the benefit would be bigger on a system aged over a
    long time.

    - frag_ratio increased 3% (ie, higher is better)
    - memfree increased about 6MB
    - In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased

    frag ratio after swap fragment
    used : 156677 kbytes
    total: 166092 kbytes
    frag_ratio : 94
    meminfo before compaction
    MemFree: 83724 kB
    Node 0, zone Normal 13642 1364 57 10 61 17 9 5 4 0 0
    Node 0, zone HighMem 425 29 1 0 0 0 0 0 0 0 0

    num_migrated : 23630
    compaction done

    frag ratio after compaction
    used : 156673 kbytes
    total: 160564 kbytes
    frag_ratio : 97
    meminfo after compaction
    MemFree: 89060 kB
    Node 0, zone Normal 14076 1544 67 14 61 17 9 5 4 0 0
    Node 0, zone HighMem 863 50 1 0 0 0 0 0 0 0 0

    This patchset adds more logic (about 480 lines) to zsmalloc, but when
    I tested a heavy swap-in/out program, the regression in swap-in/out
    speed was marginal because most of the overhead comes from
    compress/decompress and other MM reclaim work.

    This patch (of 7):

    Currently, a zsmalloc handle encodes the object's location directly,
    which makes supporting migration hard.

    This patch decouples handle and object by adding an indirection layer.
    For that, it allocates the handle dynamically and returns it to the
    user. The handle is an address obtained from slab allocation, so it is
    unique, and we can keep the object's location in the memory allocated
    for the handle.

    With this, we can change an object's position without changing the
    handle itself.
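
    A hedged sketch of the indirection (simplified; not the exact
    zsmalloc code):

    static unsigned long alloc_handle(struct kmem_cache *handle_cachep)
    {
            /* the handle is just the address of a slab-allocated word */
            return (unsigned long)kmem_cache_alloc(handle_cachep, GFP_KERNEL);
    }

    static void record_obj(unsigned long handle, unsigned long obj)
    {
            /* store the encoded object location behind the handle;
             * migration can later rewrite this word in place */
            *(unsigned long *)handle = obj;
    }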

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • mm/compaction.c:250:13: warning: 'suitable_migration_target' defined but not used [-Wunused-function]

    Reported-by: Fengguang Wu
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The original dax patchset split the ext2/4_file_operations because of the
    two NULL splice_read/splice_write in the dax case.

    In the vfs if splice_read/splice_write are NULL we then call
    default_splice_read/write.

    What we do here is make generic_file_splice_read aware of IS_DAX() so the
    original ext2/4_file_operations can be used as is.

    For writes, it appears that iter_file_splice_write is just fine. It
    uses the regular f_op->write(file, ...) or new_sync_write(file, ...).
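
    A hedged sketch of the shape of the check (simplified, not the full
    patch):

    ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
                                     struct pipe_inode_info *pipe,
                                     size_t len, unsigned int flags)
    {
            if (IS_DAX(file_inode(in)))
                    return default_file_splice_read(in, ppos, pipe,
                                                    len, flags);
            /* ... existing page-cache splice path ... */
    }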

    Signed-off-by: Boaz Harrosh
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • From: Yigal Korman

    [v1]
    Without this patch, c/mtime is not updated correctly when an mmap'ed
    page is first read from and then written to.

    A new xfstest has been submitted for testing this (generic/080).

    [v2]
    Jan Kara has pointed out that if we add the sb_start/end_pagefault
    pair in the new pfn_mkwrite, we also fix another bug: a user could
    start writing to the page while the filesystem is frozen.

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • This will allow an FS that uses VM_PFNMAP | VM_MIXEDMAP (no page
    structs) to get notified when an access is a write to a read-only PFN.

    This can happen if we mmap() a file, then first mmap-read from it to
    page in a read-only PFN, then mmap-write to the same page.

    We need this functionality to fix a DAX bug: in the scenario above we
    fail to set ctime/mtime even though we modified the file. An xfstest
    is attached to this patchset that shows the failure and the fix. (A
    DAX patch will follow.)

    This functionality is extra important for us, because upon dirtying of a
    pmem page we also want to RDMA the page to a remote cluster node.

    We define a new pfn_mkwrite and do not reuse page_mkwrite because
    1 - The name ;-)
    2 - But mainly because it would take a very long and tedious
    audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
    users to make sure they do not now crash; for example, the
    current DAX code (which this is for) would crash. If we wanted
    to reuse page_mkwrite, we would need to first patch all users so
    they do not crash on no-page, and only then enable this patch.
    But even if I did that, I would not sleep so well at night.
    Adding a new vector is the safest thing to do, and is not that
    expensive: an extra pointer in a static function vector per
    driver. The new vector is also better for performance, because
    otherwise we would call all current kernel vectors only to
    check-has-no-page, do nothing, and return.

    No need to call it from do_shared_fault because do_wp_page is called to
    change pte permissions anyway.
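
    A hedged sketch of how a VM_PFNMAP/VM_MIXEDMAP driver would opt in
    (handler body illustrative; drivers that don't set the vector are
    unaffected):

    static int my_pfn_mkwrite(struct vm_area_struct *vma,
                              struct vm_fault *vmf)
    {
            /* mark backing store dirty, update c/mtime, etc. */
            return 0;
    }

    static const struct vm_operations_struct my_vm_ops = {
            /* .fault, etc. as before */
            .pfn_mkwrite = my_pfn_mkwrite,
    };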

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • A lot of filesystems use generic_file_mmap() and filemap_fault(), so
    f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

    This prints file name, vm_ops->fault, f_op->mmap and a_ops->readpage
    (which is almost always implemented and filesystem-specific).

    Example:

    [ 23.676410] BUG: Bad page map in process sh pte:1b7e6025 pmd:19bbd067
    [ 23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
    [ 23.677481] flags: 0x10000000000000c(referenced|uptodate)
    [ 23.677896] page dumped because: bad pte
    [ 23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma: (null) mapping:ffff8800196426c0 index:97
    [ 23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage

    [akpm@linux-foundation.org: use pr_alert, per Kirill]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Mempools keep allocated objects in reserve for situations when
    ordinary allocations cannot be satisfied. These objects shouldn't be
    accessed before they leave the pool.

    This patch poisons elements when they enter the pool and unpoisons
    them when they leave it. This lets KASan detect use-after-free of
    mempool elements.
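
    A hedged sketch of the idea (poison/unpoison helper names
    hypothetical; add_element()/remove_element() are the existing mempool
    entry points):

    static void add_element(mempool_t *pool, void *element)
    {
            poison_element(pool, element);   /* unreadable while pooled */
            pool->elements[pool->curr_nr++] = element;
    }

    static void *remove_element(mempool_t *pool)
    {
            void *element = pool->elements[--pool->curr_nr];

            unpoison_element(pool, element); /* accessible again */
            return element;
    }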

    Signed-off-by: Andrey Ryabinin
    Tested-by: David Rientjes
    Cc: Catalin Marinas
    Cc: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Like EXPORT_SYMBOL(): the positioning communicates that the macro pertains
    to the immediately preceding function.

    Cc: Dmitry Safonov
    Cc: Michal Nazarewicz
    Cc: Stefan Strogin
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Pintu Kumar
    Cc: Weijie Yang
    Cc: Laurent Pinchart
    Cc: Vyacheslav Tyrtov
    Cc: Aleksei Mateosian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Here are two functions that provide an interface to compute/get the
    used size and the size of the biggest free chunk in a cma region. Add
    that information to debugfs.

    [akpm@linux-foundation.org: move debug code from cma.c into cma_debug.c]
    [stefan.strogin@gmail.com: move code from cma_get_used() and cma_get_maxchunk() to cma_used_get() and cma_maxchunk_get()]
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Stefan Strogin
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Pintu Kumar
    Cc: Weijie Yang
    Cc: Laurent Pinchart
    Cc: Vyacheslav Tyrtov
    Cc: Aleksei Mateosian
    Signed-off-by: Stefan Strogin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     
  • Few trivial cleanups:

    - no need to call set_recommended_min_free_kbytes() from
    late_initcall() -- start_khugepaged() calls it;

    - no need to call set_recommended_min_free_kbytes() from
    start_khugepaged() if khugepaged is not started;

    - there isn't much point in running start_khugepaged() if we've just
    set transparent_hugepage_flags to zero;

    - start_khugepaged() is misnamed -- it also used to stop the thread;

    Signed-off-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The most-used page->mapping helper -- page_mapping() -- has already
    been uninlined. Let's uninline page_rmapping() and page_anon_vma() as
    well. Depending on configuration, this saves around 400 bytes of text:

      text    data    bss      dec     hex filename
    660318   99254 410000  1169572  11d8a4 mm/built-in.o-before
    659854   99254 410000  1169108  11d6d4 mm/built-in.o

    I also tried to make the code a bit cleaner.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add trace events for cma_alloc() and cma_release().

    The cma_alloc tracepoint is used for both successful and failed
    allocations; in case of allocation failure, pfn=-1UL is stored and
    printed.

    Signed-off-by: Stefan Strogin
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Laurent Pinchart
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Strogin