Eric Lee / smarc-fsl-linux-kernel

11 Dec, 2014

40 commits

c6cb898b5 binfmt_misc: replace get_unused_fd() with get_unused_fd_flags(0)

This patch replaces calls to get_unused_fd() with an equivalent call to
get_unused_fd_flags(0) to preserve current behavior for existing code.

In a further patch, get_unused_fd() will be removed so that new code starts
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or chosen by userspace.

Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yann Droneaud
2014-12-11 09:41:10 +0800
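
As a userspace illustration of the O_CLOEXEC point made in the binfmt_misc commit above, the following minimal sketch opens a file with and without O_CLOEXEC and inspects the resulting descriptor flag. It is not kernel code and does not call get_unused_fd_flags(); the helper names open_readonly() and fd_is_cloexec() are hypothetical and exist only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Opens path read-only; extra_flags must be 0 or O_CLOEXEC. */
int open_readonly(const char *path, int extra_flags)
{
    assert(extra_flags == 0 || extra_flags == O_CLOEXEC);
    return open(path, O_RDONLY | extra_flags);
}
```

Passing 0 mirrors the "preserve current behavior" choice in the commit: the descriptor stays open across exec unless the caller asks for close-on-exec explicitly.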
6b9cdf39c ppc/cell: replace get_unused_fd() with get_unused_fd_flags(0)

This patch replaces calls to get_unused_fd() with an equivalent call to
get_unused_fd_flags(0) to preserve current behavior for existing code.

In a further patch, get_unused_fd() will be removed so that new code starts
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or chosen by userspace.

Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yann Droneaud
2014-12-11 09:41:10 +0800
aeb682dd1 ia64: replace get_unused_fd() with get_unused_fd_flags(0)

This patch replaces calls to get_unused_fd() with an equivalent call to
get_unused_fd_flags(0) to preserve current behavior for existing code.

In a further patch, get_unused_fd() will be removed so that new code starts
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or chosen by userspace.

Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yann Droneaud
2014-12-11 09:41:10 +0800
7c8bd2322 exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()

Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
we can simply pass "dead_children" list to exit_ptrace() and remove
another release_task() loop. Plus this way we do not need to drop and
reacquire tasklist_lock.

Also shift the list_empty(ptraced) check, if we want this optimization it
makes sense to eliminate the function call altogether.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:10 +0800
2831096e2 exit: reparent: cleanup the usage of reparent_leader()

1. Now that reparent_leader() doesn't abuse ->sibling we can shift
list_move_tail() from reparent_leader() to forget_original_parent()
and turn it into a single list_splice_tail_init(). This also makes
BUG_ON(!list_empty()) and list_for_each_entry_safe() unnecessary.

2. This also allows to shift the same_thread_group() check, it looks
a bit more clear in the caller.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:10 +0800
57a059187 exit: reparent: cleanup the changing of ->parent

1. Cosmetic, but "if (t->parent == father)" looks a bit confusing.
We need to change t->parent if and only if t is not traced.

2. If we actually want this BUG_ON() to ensure that parent/ptrace
match each other, then we should also take ptrace_reparented()
case into account too.

3. Change this code to use for_each_thread() instead of deprecated
while_each_thread().

[dan.carpenter@oracle.com: silence a bogus static checker warning]
Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:10 +0800
dc2fd4b00 exit: reparent: use ->ptrace_entry rather than ->sibling for EXIT_DEAD tasks

reparent_leader() reuses ->sibling as a list node to add an EXIT_DEAD task
into dead_children list we are going to release. This obviously removes
the dead task from its real_parent->children list and this is even good;
the parent can do nothing with the EXIT_DEAD reparented zombie, it only
makes do_wait() slower.

But, this also means that it can not be reparented once again, so if its
new parent dies too nobody will update ->parent/real_parent, they can
point to the freed memory even before release_task() we are going to call,
this breaks the code which relies on pid_alive() to access
->real_parent/parent.

Fortunately this is mostly theoretical, this can only happen if init or
PR_SET_CHILD_SUBREAPER process ignores SIGCHLD and the new parent
sub-thread exits right after we drop tasklist_lock.

Change this code to use ->ptrace_entry instead, we know that the child is
not traced so nobody can ever use this member. This also allows to unify
this logic with exit_ptrace(), see the next changes.

Note: we really need to change release_task() to nullify real_parent/
parent/group_leader pointers, but we need to change the current users
first somehow. And it would be better to reap this zombie immediately but
release_task_locked() we need is complicated by proc_flush_task().

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:10 +0800
a90e984c8 sched_show_task: fix unsafe usage of ->real_parent

rcu_read_lock() can not protect p->real_parent if release_task(p) was
already called, change sched_show_task() to check pid_alive() like other
users do.

Note: we need some helpers to cleanup the code like this. And it seems
that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Acked-by: Peter Zijlstra (Intel)
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800
abdba6e9e proc: task_state: ptrace_parent() doesn't need pid_alive() check

p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check. Other callers already use it
directly without additional checks.

Note: with or without this patch ptrace_parent() can return the pointer to
the freed task, this will be explained/fixed later.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800
b0fafc111 proc: task_state: move the main seq_printf() outside of rcu_read_lock()

task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
tgid/ngid and drop rcu lock.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Reviewed-by: Paul E. McKenney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800
0f4a0d53f proc: task_state: deuglify the max_fds calculation

1. The usage of fdt looks very ugly, it can't be NULL if ->files is
not NULL. We can use "unsigned int max_fds" instead.

2. This also allows to move seq_printf(max_fds) outside of task_lock()
and join it with the previous seq_printf(). See also the next patch.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800
4af1036df proc: task_state: read cred->group_info outside of task_lock()

task_state() reads cred->group_info under task_lock() because long ago
it was task_struct->group_info and it was actually protected by
task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just
adds confusion; move task_unlock() up.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800
2fc1e948e fs/proc.c: use rb_entry_safe() instead of rb_entry()

Better to use the existing macro than to rewrite it.

Signed-off-by: Nicolas Dichtel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nicolas Dichtel
2014-12-11 09:41:09 +0800
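
The difference between the two macros named in the commit above can be sketched in userspace. rb_entry() converts an embedded node pointer to its containing structure and must not be given NULL; rb_entry_safe() returns NULL when the node pointer is NULL. The struct and function names below (node, entry, entry_of, entry_of_safe) are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *left, *right;
};

struct entry {
    int key;
    struct node link;   /* embedded tree node */
};

/* container_of-style conversion: the caller must not pass NULL. */
struct entry *entry_of(struct node *n)
{
    assert(n != NULL);
    return (struct entry *)((char *)n - offsetof(struct entry, link));
}

/* NULL-tolerant variant: the behaviour a "safe" entry macro provides. */
struct entry *entry_of_safe(struct node *n)
{
    return n ? entry_of(n) : NULL;
}
```

A caller that walks a tree with "first node, then next node" helpers can feed their possibly NULL results straight into the safe variant instead of open-coding the NULL test each time.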
b208d54b7 procfs: fix error handling of proc_register()

proc_register() error paths are leaking inodes and directory refcounts.

Signed-off-by: Debabrata Banerjee
Cc: Alexander Viro
Acked-by: Nicolas Dichtel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Debabrata Banerjee
2014-12-11 09:41:09 +0800
710585d49 fs/proc: use a rb tree for the directory entries

When a lot of netdevices are created, one of the bottlenecks is the
creation of proc entries. This series aims to accelerate this part.

The current implementation for the directories in /proc uses a singly
linked list. This is slow when handling directories with large numbers of
entries (e.g. netdevice-related entries when lots of tunnels are opened).

This patch replaces this linked list with a red-black tree.

Here are some numbers:

dummy30000.batch contains 30 000 times 'link add type dummy'.

Before the patch:
  $ time ip -b dummy30000.batch
  real    2m31.950s
  user    0m0.440s
  sys     2m21.440s
  $ time rmmod dummy
  real    1m35.764s
  user    0m0.000s
  sys     1m24.088s

After the patch:
  $ time ip -b dummy30000.batch
  real    2m0.874s
  user    0m0.448s
  sys     1m49.720s
  $ time rmmod dummy
  real    1m13.988s
  user    0m0.000s
  sys     1m1.008s

The idea of improving this part was suggested by Thierry Herbelot.

[akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
Signed-off-by: Nicolas Dichtel
Acked-by: David S. Miller
Cc: Thierry Herbelot
Acked-by: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nicolas Dichtel
2014-12-11 09:41:09 +0800
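
To illustrate why the rb-tree commit above helps, here is a minimal sketch of name lookup in an ordered binary tree. It is deliberately a plain, unbalanced binary search tree standing in for a red-black tree, so it shows the ordered descent but not the rebalancing. The names dir_entry, entry_cmp, dir_insert and dir_lookup are hypothetical, and ordering by name length before byte content is an assumption made for the example, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct dir_entry {
    const char *name;
    size_t namelen;
    struct dir_entry *left, *right;
};

/* Order entries by name length first, then by byte content. */
int entry_cmp(const char *name, size_t len, const struct dir_entry *de)
{
    if (len < de->namelen)
        return -1;
    if (len > de->namelen)
        return 1;
    return memcmp(name, de->name, len);
}

/* Insert de; returns 0 on success, -1 if the name already exists. */
int dir_insert(struct dir_entry **root, struct dir_entry *de)
{
    assert(root != NULL && de != NULL);
    de->left = de->right = NULL;
    while (*root) {
        int c = entry_cmp(de->name, de->namelen, *root);

        if (c == 0)
            return -1;
        root = c < 0 ? &(*root)->left : &(*root)->right;
    }
    *root = de;
    return 0;
}

/* Descend from the root; each step discards one subtree. */
struct dir_entry *dir_lookup(struct dir_entry *root, const char *name)
{
    size_t len = strlen(name);

    while (root) {
        int c = entry_cmp(name, len, root);

        if (c == 0)
            return root;
        root = c < 0 ? root->left : root->right;
    }
    return NULL;
}
```

With a linked list, both the duplicate check on insert and every lookup walk all entries. With a balanced tree the same operations take a number of steps proportional to the logarithm of the entry count, which matters when tens of thousands of entries are created in a batch.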
9edad6ea0 mm: move page->mem_cgroup bad page handling into generic code

Now that the external page_cgroup data structure and its lookup are
gone, let the generic bad_page() check for page->mem_cgroup sanity.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: David S. Miller
Cc: KAMEZAWA Hiroyuki
Cc: "Kirill A. Shutemov"
Cc: Tejun Heo
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:09 +0800
5d1ea48bd mm: page_cgroup: rename file to mm/swap_cgroup.c

Now that the external page_cgroup data structure and its lookup are gone,
the only code remaining in there is swap slot accounting.

Rename it and move the conditional compilation into mm/Makefile.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: David S. Miller
Acked-by: KAMEZAWA Hiroyuki
Cc: "Kirill A. Shutemov"
Cc: Tejun Heo
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:09 +0800
1306a85ae mm: embed the memcg pointer directly into struct page

Memory cgroups used to have 5 per-page pointers. To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.

There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged. The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory. With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space. Remaining users that care can
still compile their kernels without CONFIG_MEMCG.

   text    data     bss      dec     hex filename
8828345 1725264  983040 11536649  b00909 vmlinux.old
8827425 1725264  966656 11519345  afc571 vmlinux.new

[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: David S. Miller
Acked-by: KAMEZAWA Hiroyuki
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Tejun Heo
Cc: Joonsoo Kim
Acked-by: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:09 +0800
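
The 0.19% figure in the commit above can be checked with a short calculation. The sketch below assumes one 8-byte pointer per 4096-byte page, which is typical for a 64-bit system but is not stated in the commit message; the function name is hypothetical.

```c
#include <assert.h>

/*
 * Per-page overhead in hundredths of a percent, rounded down.
 * A return value of 19 means 0.19%.
 */
unsigned long overhead_hundredths_of_percent(unsigned long bytes_per_page,
                                             unsigned long page_size)
{
    assert(page_size > 0);
    return bytes_per_page * 10000UL / page_size;
}
```

Under those assumptions, 8 / 4096 is about 0.00195, or roughly 0.19% of memory once rounded down to two decimal places.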
22811c6bc mm: memcontrol: remove stale page_cgroup_lock comment

There is no cgroup-specific page lock anymore.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:08 +0800
a1ad28973 mm/frontswap.c: fix the condition in BUG_ON

The largest index of swap device is MAX_SWAPFILES-1. So the type should
be less than MAX_SWAPFILES.

Signed-off-by: Haifeng Li
Acked-by: Konrad Rzeszutek Wilk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Haifeng
2014-12-11 09:41:08 +0800
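
The bounds rule described in the frontswap commit above (valid indices run from 0 to the maximum minus one) can be sketched as follows. MAX_DEVICES, type_is_valid() and slot_for() are hypothetical stand-ins chosen for this example; the value 32 is arbitrary and not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEVICES 32  /* hypothetical stand-in for MAX_SWAPFILES */

static int table[MAX_DEVICES];

/* Valid indices are 0 .. MAX_DEVICES - 1, so the test is strictly "<". */
bool type_is_valid(unsigned int type)
{
    return type < MAX_DEVICES;
}

/* The assertion plays the role of the BUG_ON: it rejects type == MAX_DEVICES. */
int *slot_for(unsigned int type)
{
    assert(type_is_valid(type));
    return &table[type];
}
```

A check written as "type > MAX_DEVICES" would let type == MAX_DEVICES through and index one element past the end of the table.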
26086de3f mm: fix a spelling mistake

Signed-off-by: Wei Yuan
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wei Yuan
2014-12-11 09:41:08 +0800
569f48b85 mm: hugetlb: fix __unmap_hugepage_range()

First, after flushing the TLB, there is no need to scan ptes from the
start again. Second, before bailing out of the loop, the address is
advanced one step.

Signed-off-by: Hillf Danton
Reviewed-by: Michal Hocko
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hillf Danton
2014-12-11 09:41:08 +0800
e4bd6a024 mm, memcg: fix potential undefined behaviour in page stat accounting

Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback
page accounting") mem_cgroup_end_page_stat consumes locked and flags
variables directly rather than via pointers which might trigger C
undefined behavior as those variables are initialized only in the slow
path of mem_cgroup_begin_page_stat.

Although mem_cgroup_end_page_stat handles parameters correctly and
touches them only when they hold a sensible value, it is the caller which
loads a potentially uninitialized value, which then might allow the
compiler to do crazy things.

I haven't seen any warning from gcc and it seems that the current
version (4.9) doesn't exploit this type of undefined behavior, but Sasha
has reported the following:

UBSan: Undefined behaviour in mm/rmap.c:1084:2
load of value 255 is not a valid value for type '_Bool'
CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
Call Trace:
dump_stack (lib/dump_stack.c:52)
ubsan_epilogue (lib/ubsan.c:159)
__ubsan_handle_load_invalid_value (lib/ubsan.c:482)
page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
unmap_single_vma (mm/memory.c:1348)
unmap_vmas (mm/memory.c:1377 (discriminator 3))
exit_mmap (mm/mmap.c:2837)
mmput (kernel/fork.c:659)
do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
SyS_exit_group (kernel/exit.c:901)
tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

Fix this by using pointer parameters for both locked and flags and be
more robust for future compiler changes even though the current code is
implemented correctly.

Signed-off-by: Michal Hocko
Reported-by: Sasha Levin
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2014-12-11 09:41:08 +0800
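
The pattern behind the fix in the commit above can be sketched in userspace. In this simplified model, the "begin" function fills in its out-parameters only on some paths, and the "end" function takes pointers so that nothing is loaded from those variables unless they were initialized. The names group, stat_begin and stat_end are hypothetical, and the integer lock_depth stands in for a real spinlock; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct group {
    int moving;      /* nonzero while charges are being moved */
    int lock_depth;  /* stand-in for a real spinlock */
};

/* A NULL group returns early and leaves *locked and *flags untouched. */
struct group *stat_begin(struct group *g, bool *locked, unsigned long *flags)
{
    assert(locked != NULL && flags != NULL);
    if (!g)
        return NULL;
    *locked = false;
    if (!g->moving)
        return g;
    g->lock_depth++;    /* slow path: take the lock */
    *flags = 1;         /* pretend this is the saved IRQ state */
    *locked = true;
    return g;
}

/* Pointer parameters: *locked is read only when g is non-NULL. */
void stat_end(struct group *g, bool *locked, unsigned long *flags)
{
    if (g && *locked) {
        (void)*flags;   /* a real unlock would restore the IRQ state */
        g->lock_depth--;
    }
}
```

Had stat_end() taken "bool locked" by value, the caller would have to load the variable at the call site even on the path where stat_begin() never wrote it, which is exactly the uninitialized load that UBSan reported.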
c164e038e mm: fix huge zero page accounting in smaps report

Like the small zero page, the huge zero page should not be accounted in
the smaps report as a normal page.

For small pages we rely on vm_normal_page() to filter out the zero page,
but vm_normal_page() is not designed to handle pmds. We only get here due
to a hackish cast of pmd to pte in smaps_pte_range() -- pte and pmd format
is not necessarily compatible on each and every architecture.

Let's add a separate codepath to handle pmds. follow_trans_huge_pmd() will
detect the huge zero page for us.

We need a pmd_dirty() helper to do this properly. The patch adds it
to THP-enabled architectures which don't yet have one.

[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov"
Reported-by: Fengguang Wu
Tested-by: Fengwei Yin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2014-12-11 09:41:08 +0800
2314b42db mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree()

None of the mem_cgroup_same_or_subtree() callers actually require it to
take the RCU lock, either because they hold it themselves or they have css
references. Remove it.

To make the API change clear, rename the leftover helper to
mem_cgroup_is_descendant() to match cgroup_is_descendant().

Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:08 +0800
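
The "same or subtree" relation that the commit above renames to an is-descendant check can be sketched as a walk up a parent chain. The struct and function below are hypothetical illustrations and leave out everything the kernel helper deals with beyond the ancestry test itself, including locking and hierarchy settings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct group {
    struct group *parent;   /* NULL for the top-level group */
};

/* True if g is root itself or lies anywhere below it. */
bool group_is_descendant(const struct group *g, const struct group *root)
{
    assert(root != NULL);
    for (; g; g = g->parent)
        if (g == root)
            return true;
    return false;
}
```

The walk only reads parent pointers, so in this sketch it is up to the caller to keep the groups alive for the duration of the call. That is the same division of responsibility the commit describes: callers either hold the RCU lock themselves or hold references.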
413918bb6 mm: memcontrol: pull the NULL check from __mem_cgroup_same_or_subtree()

The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It
makes a lot more sense to check where it's looked up, rather than check
for it in __mem_cgroup_same_or_subtree() where it's unexpected.

No other callsite passes NULL to __mem_cgroup_same_or_subtree().

Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:08 +0800
c01f46c7c mm: memcontrol: remove bogus NULL check after mem_cgroup_from_task()

That function acts like a typecast - unless NULL is passed in, no NULL can
come out. task_in_mem_cgroup() callers don't pass NULL tasks.

Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:08 +0800
312722cbb mm: memcontrol: shorten the page statistics update slowpath

While moving charges from one memcg to another, page stat updates must
acquire the old memcg's move_lock to prevent double accounting. That
situation is denoted by an increased memcg->move_accounting. However, the
charge moving code declares this way too early for now, even before
summing up the RSS and pre-allocating destination charges.

Shorten this slowpath mode by increasing memcg->move_accounting only right
before walking the task's address space with the intention of actually
moving the pages.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:08 +0800
e544a4e74 thp: do not mark zero-page pmd write-protected explicitly

Zero pages can be used only in anonymous mappings, which never have
writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X
definitions.

Let's drop redundant pmd_wrprotect() in set_huge_zero_page().

Signed-off-by: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2014-12-11 09:41:08 +0800
b047501cd memcg: use generic slab iterators for showing slabinfo

Let's use generic slab_start/next/stop for showing memcg caches info. In
contrast to the current implementation, this will work even if all memcg
caches' info doesn't fit into a seq buffer (a page), plus it simply looks
neater.

Actually, the main reason I do this isn't mere cleanup. I'm going to zap
the memcg_slab_caches list, because I find it useless provided we have the
slab_caches list, and this patch is a step in this direction.

It should be noted that before this patch an attempt to read
memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
in -EIO, while after this patch it will silently show nothing except the
header, but I don't think it will frustrate anyone.

Signed-off-by: Vladimir Davydov
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-12-11 09:41:07 +0800
4ef461e8f memcg: remove mem_cgroup_reclaimable check from soft reclaim

mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages on
*any* NUMA node. However, the only place where it's called is
mem_cgroup_soft_reclaim(), which tries to reclaim memory from a *specific*
zone. So the way it is used is incorrect - it will return true even if
the cgroup doesn't have pages on the zone we're scanning.

I think we can get rid of this check completely, because
mem_cgroup_shrink_node_zone(), which is called by
mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
equivalent to shrink_lruvec(), which exits almost immediately if the
lruvec passed to it is empty. So there's no need to optimize anything
here. Besides, we don't have such a check in the general scan path
(shrink_zone) either.

Signed-off-by: Vladimir Davydov
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-12-11 09:41:07 +0800
97ad2be1d mm, hugetlb: correct bit shift in hstate_sizelog()

hstate_sizelog() would shift left an int rather than long, triggering
undefined behaviour and passing an incorrect value when the requested
page size was more than 4GB, thus breaking >4GB pages.

Signed-off-by: Sasha Levin
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2014-12-11 09:41:07 +0800
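
The class of bug described in the hstate_sizelog() commit above is easy to reproduce in any C code that turns a log2 value into a size. The sketch below shows only the corrected form; size_from_log() is a hypothetical name, and it uses unsigned long long rather than the kernel's long so that the result is 64 bits wide on every platform.

```c
#include <assert.h>
#include <limits.h>

/* Size in bytes for a given log2 size, computed in a 64-bit type. */
unsigned long long size_from_log(unsigned int log)
{
    assert(log < sizeof(unsigned long long) * CHAR_BIT);
    /*
     * Writing "1 << log" would shift a plain int, which is undefined
     * once the result no longer fits in int. Widening the constant
     * before the shift avoids that.
     */
    return 1ULL << log;
}
```

With a 32-bit int, any page size of 4GB or more needs a shift count of 32 or higher, so the type of the left operand decides whether the expression is well defined.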
247b1447b mm: memcontrol: fold mem_cgroup_start_move()/mem_cgroup_end_move()

Having these functions and their documentation split out and somewhere
else makes it harder, not easier, to follow what's going on.

Inline them directly where charge moving is prepared and finished, and put
an explanation right next to it.

Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:07 +0800
4e2f245d3 mm: memcontrol: don't pass a NULL memcg to mem_cgroup_end_move()

mem_cgroup_end_move() checks if the passed memcg is NULL, along with a
lengthy comment to explain why this seemingly nonsensical situation is
even possible.

Check in cancel_attach() itself whether can_attach() set up the move
context or not, it's a lot more obvious from there. Then remove the check
and comment in mem_cgroup_end_move().

Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:07 +0800
354a4783a mm: memcontrol: inline memcg->move_lock locking

The wrappers around taking and dropping the memcg->move_lock spinlock add
nothing of value. Inline the spinlock calls into the callsites.

Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:07 +0800
298333157 mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flag

pc->mem_cgroup had to be left intact after uncharge for the final LRU
removal, and !PCG_USED indicated whether the page was uncharged. But
since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") pages
are uncharged after the final LRU removal. Uncharge can simply clear
the pointer and the PCG_USED/PageCgroupUsed sites can test that instead.

Because this is the last page_cgroup flag, this patch reduces the memcg
per-page overhead to a single pointer.

[akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal]
Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:07 +0800
f4aaa8b43 mm: memcontrol: remove unnecessary PCG_MEM memory charge flag

PCG_MEM is a remnant from an earlier version of 0a31bc97c80c ("mm:
memcontrol: rewrite uncharge API"), used to tell whether migration cleared
a charge while leaving pc->mem_cgroup valid and PCG_USED set. But in the
final version, mem_cgroup_migrate() directly uncharges the source page,
rendering this distinction unnecessary. Remove it.

Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:07 +0800
18eca2e63 mm: memcontrol: remove unnecessary PCG_MEMSW memory+swap charge flag

Now that mem_cgroup_swapout() fully uncharges the page, every page that is
still in use when reaching mem_cgroup_uncharge() is known to carry both
the memory and the memory+swap charge. Simplify the uncharge path and
remove the PCG_MEMSW page flag accordingly.

Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:07 +0800
7bdd143c3 mm: memcontrol: uncharge pages on swapout

This series gets rid of the remaining page_cgroup flags, thus cutting the
memcg per-page overhead down to one pointer.

This patch (of 4):

mem_cgroup_swapout() is called with exclusive access to the page at the
end of the page's lifetime. Instead of clearing the PCG_MEMSW flag and
deferring the uncharge, just do it right away. This allows follow-up
patches to simplify the uncharge code.

Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-12-11 09:41:07 +0800
b9982f8d2 mm: memcontrol: micro-optimize mem_cgroup_split_huge_fixup()

Don't call lookup_page_cgroup() when memcg is disabled.

Cc: Johannes Weiner
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2014-12-11 09:41:07 +0800