11 Dec, 2014
40 commits
-
This patch replaces calls to get_unused_fd() with equivalent calls to
get_unused_fd_flags(0) to preserve the current behavior of existing code.

In a further patch, get_unused_fd() will be removed so that new code starts
using get_unused_fd_flags(), with the hope that O_CLOEXEC could be used,
either by default or chosen by userspace.

Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch replaces calls to get_unused_fd() with equivalent calls to
get_unused_fd_flags(0) to preserve the current behavior of existing code.

In a further patch, get_unused_fd() will be removed so that new code starts
using get_unused_fd_flags(), with the hope that O_CLOEXEC could be used,
either by default or chosen by userspace.

Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch replaces calls to get_unused_fd() with equivalent calls to
get_unused_fd_flags(0) to preserve the current behavior of existing code.

In a further patch, get_unused_fd() will be removed so that new code starts
using get_unused_fd_flags(), with the hope that O_CLOEXEC could be used,
either by default or chosen by userspace.

Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
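The O_CLOEXEC motivation behind the get_unused_fd_flags() conversions above can be observed from userspace. The following is only a minimal sketch, assuming a POSIX system where /dev/null exists; the helper names fd_is_cloexec() and probe_cloexec() are hypothetical and are not part of the patches.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd is closed automatically on exec(),
 * 0 if it would be inherited, and -1 if fcntl() fails. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Opens /dev/null with the given extra open(2) flags and reports whether
 * the new descriptor carries FD_CLOEXEC. */
int probe_cloexec(int extra_flags)
{
    int fd = open("/dev/null", O_RDONLY | extra_flags);
    int ret;

    assert(fd >= 0);
    ret = fd_is_cloexec(fd);
    close(fd);
    return ret;
}
```

Calling probe_cloexec(0) corresponds to the flags value of 0 that the patches preserve, while probe_cloexec(O_CLOEXEC) shows the behaviour the author hopes new code will adopt.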
Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
we can simply pass "dead_children" list to exit_ptrace() and remove
another release_task() loop. Plus this way we do not need to drop and
reacquire tasklist_lock.

Also shift the list_empty(ptraced) check: if we want this optimization, it
makes sense to eliminate the function call altogether.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
1. Now that reparent_leader() doesn't abuse ->sibling we can shift
   list_move_tail() from reparent_leader() to forget_original_parent()
   and turn it into a single list_splice_tail_init(). This also makes
   BUG_ON(!list_empty()) and list_for_each_entry_safe() unnecessary.

2. This also allows shifting the same_thread_group() check; it looks
   a bit clearer in the caller.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
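The replacement of a per-entry list_move_tail() loop by one list_splice_tail_init() call, as described in the commit above, can be illustrated outside the kernel. This is a userspace sketch only: the list type mirrors the kernel's circular doubly linked list, but every function here is a hypothetical stand-in written for this example, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

int list_is_empty(const struct list_head *head)
{
    return head->next == head;
}

void list_push_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Move every node of src to the tail of dst in O(1), then reinitialise
 * src so that it is a valid empty list again. */
void list_splice_tail_and_init(struct list_head *src, struct list_head *dst)
{
    struct list_head *first, *last, *tail;

    assert(src != dst);
    if (list_is_empty(src))
        return;

    first = src->next;
    last = src->prev;
    tail = dst->prev;

    tail->next = first;
    first->prev = tail;
    last->next = dst;
    dst->prev = last;

    list_init(src);
}

size_t list_length(const struct list_head *head)
{
    size_t n = 0;
    const struct list_head *pos;

    for (pos = head->next; pos != head; pos = pos->next)
        n++;
    return n;
}
```

Because the splice touches only the four boundary pointers, its cost does not depend on the number of entries, and the source list is left empty, which is why no separate emptiness check is needed afterwards.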
1. Cosmetic, but "if (t->parent == father)" looks a bit confusing.
   We need to change t->parent if and only if t is not traced.

2. If we actually want this BUG_ON() to ensure that parent/ptrace
   match each other, then we should also take the ptrace_reparented()
   case into account.

3. Change this code to use for_each_thread() instead of the deprecated
   while_each_thread().

[dan.carpenter@oracle.com: silence a bogus static checker warning]
Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
reparent_leader() reuses ->sibling as a list node to add an EXIT_DEAD task
to the dead_children list we are going to release. This obviously removes
the dead task from its real_parent->children list, and this is even good:
the parent can do nothing with the EXIT_DEAD reparented zombie, it only
makes do_wait() slower.

But this also means that it cannot be reparented once again, so if its
new parent dies too, nobody will update ->parent/real_parent. They can
point to freed memory even before the release_task() we are going to call,
which breaks the code that relies on pid_alive() to access
->real_parent/parent.

Fortunately this is mostly theoretical: it can only happen if init or a
PR_SET_CHILD_SUBREAPER process ignores SIGCHLD and the new parent
sub-thread exits right after we drop tasklist_lock.

Change this code to use ->ptrace_entry instead; we know that the child is
not traced, so nobody can ever use this member. This also allows unifying
this logic with exit_ptrace(), see the next changes.

Note: we really need to change release_task() to nullify the real_parent/
parent/group_leader pointers, but we need to change the current users
first somehow. And it would be better to reap this zombie immediately, but
the release_task_locked() we need is complicated by proc_flush_task().

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
rcu_read_lock() cannot protect p->real_parent if release_task(p) was
already called; change sched_show_task() to check pid_alive() like other
users do.

Note: we need some helpers to clean up code like this. And it seems
that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe either.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Acked-by: Peter Zijlstra (Intel)
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check. Other callers already use it
directly without additional checks.

Note: with or without this patch ptrace_parent() can return a pointer to
a freed task; this will be explained/fixed later.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
tgid/ngid and drop the RCU lock.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Reviewed-by: Paul E. McKenney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
1. The usage of fdt looks very ugly; it can't be NULL if ->files is
   not NULL. We can use "unsigned int max_fds" instead.

2. This also allows moving seq_printf(max_fds) outside of task_lock()
   and joining it with the previous seq_printf(). See also the next patch.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
task_state() reads cred->group_info under task_lock() because long ago
it was task_struct->group_info and it was actually protected by
task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just
adds confusion; move task_unlock() up.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Better to use the existing macros than to rewrite them.
Signed-off-by: Nicolas Dichtel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
proc_register() error paths are leaking inodes and directory refcounts.
Signed-off-by: Debabrata Banerjee
Cc: Alexander Viro
Acked-by: Nicolas Dichtel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When a lot of netdevices are created, one of the bottlenecks is the
creation of proc entries. This series aims to accelerate this part.

The current implementation for the directories in /proc uses a singly
linked list. This is slow when handling directories with large numbers of
entries (e.g. netdevice-related entries when lots of tunnels are opened).

This patch replaces this linked list with a red-black tree.

Here are some numbers:

dummy30000.batch contains 30 000 times 'link add type dummy'.

Before the patch:
$ time ip -b dummy30000.batch
real 2m31.950s
user 0m0.440s
sys 2m21.440s
$ time rmmod dummy
real 1m35.764s
user 0m0.000s
sys 1m24.088s

After the patch:
$ time ip -b dummy30000.batch
real 2m0.874s
user 0m0.448s
sys 1m49.720s
$ time rmmod dummy
real 1m13.988s
user 0m0.000s
sys 1m1.008s

The idea of improving this part was suggested by Thierry Herbelot.

[akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
Signed-off-by: Nicolas Dichtel
Acked-by: David S. Miller
Cc: Thierry Herbelot
Acked-by: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now that the external page_cgroup data structure and its lookup is
gone, let the generic bad_page() check for page->mem_cgroup sanity.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: David S. Miller
Cc: KAMEZAWA Hiroyuki
Cc: "Kirill A. Shutemov"
Cc: Tejun Heo
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now that the external page_cgroup data structure and its lookup is gone,
the only code remaining in there is swap slot accounting.

Rename it and move the conditional compilation into mm/Makefile.
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: David S. Miller
Acked-by: KAMEZAWA Hiroyuki
Cc: "Kirill A. Shutemov"
Cc: Tejun Heo
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Memory cgroups used to have 5 per-page pointers. To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.

There is now only one page pointer remaining: the memcg pointer, which
indicates which cgroup the page is associated with when charged. The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory. With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space. Remaining users that care can
still compile their kernels without CONFIG_MEMCG.

   text    data     bss      dec     hex filename
8828345 1725264  983040 11536649  b00909 vmlinux.old
8827425 1725264  966656 11519345  afc571 vmlinux.new

[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: David S. Miller
Acked-by: KAMEZAWA Hiroyuki
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Tejun Heo
Cc: Joonsoo Kim
Acked-by: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
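The "potential 0.19% of memory" quoted in the commit above is consistent with one pointer per page. The arithmetic can be sketched as follows, assuming an 8-byte pointer and 4096-byte pages; both sizes are assumptions made here for illustration, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Per-page bookkeeping overhead as a percentage of the page size. */
double per_page_overhead_percent(size_t bytes_per_page, size_t page_size)
{
    assert(page_size > 0);
    return 100.0 * (double)bytes_per_page / (double)page_size;
}
```

Under those assumptions, per_page_overhead_percent(8, 4096) gives about 0.195, while the earlier layout with five pointers, per_page_overhead_percent(40, 4096), gives about 0.98.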
There is no cgroup-specific page lock anymore.
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The largest index of swap device is MAX_SWAPFILES-1. So the type should
be less than MAX_SWAPFILES.

Signed-off-by: Haifeng Li
Acked-by: Konrad Rzeszutek Wilk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
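The bound described in the commit above is the usual off-by-one rule for array indices. A minimal sketch, in which the table size of 32 and both function names are assumptions chosen for illustration (the kernel derives MAX_SWAPFILES from its configuration):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed table size for illustration only. */
#define SKETCH_MAX_SWAPFILES 32U

/* Valid indices are 0 .. SKETCH_MAX_SWAPFILES - 1, so an index equal to
 * the table size must be rejected as well ("<", not "<="). */
bool swap_type_is_valid(unsigned int type)
{
    return type < SKETCH_MAX_SWAPFILES;
}

/* Hypothetical lookup that relies on the check above before indexing. */
int swap_table_lookup(const int table[SKETCH_MAX_SWAPFILES], unsigned int type)
{
    assert(swap_type_is_valid(type));
    return table[type];
}
```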
Signed-off-by: Wei Yuan
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
First, after flushing the TLB, we have no need to scan ptes from the start
again. Second, before bailing out of the loop, the address is advanced one
step.

Signed-off-by: Hillf Danton
Reviewed-by: Michal Hocko
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback
page accounting"), mem_cgroup_end_page_stat() consumes the locked and flags
variables directly rather than via pointers, which might trigger C
undefined behavior as those variables are initialized only in the slow
path of mem_cgroup_begin_page_stat().

Although mem_cgroup_end_page_stat() handles its parameters correctly and
touches them only when they hold a sensible value, it is the caller which
loads a potentially uninitialized value, which then might allow the
compiler to do crazy things.

I haven't seen any warning from gcc and it seems that the current
version (4.9) doesn't exploit this type of undefined behavior, but Sasha
has reported the following:

UBSan: Undefined behaviour in mm/rmap.c:1084:2
load of value 255 is not a valid value for type '_Bool'
CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
Call Trace:
dump_stack (lib/dump_stack.c:52)
ubsan_epilogue (lib/ubsan.c:159)
__ubsan_handle_load_invalid_value (lib/ubsan.c:482)
page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
unmap_single_vma (mm/memory.c:1348)
unmap_vmas (mm/memory.c:1377 (discriminator 3))
exit_mmap (mm/mmap.c:2837)
mmput (kernel/fork.c:659)
do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
SyS_exit_group (kernel/exit.c:901)
tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

Fix this by using pointer parameters for both locked and flags, which is
more robust against future compiler changes even though the current code
is implemented correctly.

Signed-off-by: Michal Hocko
Reported-by: Sasha Levin
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
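The difference between passing locked and flags by value and by pointer, as discussed in the commit above, can be sketched in plain C. This is an illustrative model only: struct group, begin_stat() and end_stat() are hypothetical names, and the locking is reduced to a counter so the example stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup. */
struct group {
    int moving;     /* > 0 while charges are being moved */
    int lock_held;  /* models the move lock */
};

/* On the NULL fast path nothing is written through locked or flags, so
 * the caller's variables may stay uninitialised. */
struct group *begin_stat(struct group *g, bool *locked, unsigned long *flags)
{
    if (!g)
        return NULL;
    *locked = false;
    if (g->moving <= 0)
        return g;
    g->lock_held = 1;   /* slow path: take the lock */
    *locked = true;
    *flags = 1UL;       /* placeholder for saved interrupt state */
    return g;
}

/* Taking pointers means the caller never loads a possibly uninitialised
 * value just to pass it along; *locked is read only when g is non-NULL,
 * in which case begin_stat() has written it. */
void end_stat(struct group *g, bool *locked, unsigned long *flags)
{
    assert(locked != NULL && flags != NULL);
    if (g && *locked) {
        (void)*flags;   /* would restore interrupt state here */
        g->lock_held = 0;
    }
}
```

With by-value parameters, the call end_stat(g, locked, flags) would itself read the uninitialised variables on the fast path, which is the load that UBSan flagged in the report above.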
Like the small zero page, the huge zero page should not be accounted in
the smaps report as a normal page.

For small pages we rely on vm_normal_page() to filter out the zero page, but
vm_normal_page() is not designed to handle pmds. We only get here due to a
hackish cast of pmd to pte in smaps_pte_range() -- the pte and pmd formats
are not necessarily compatible on each and every architecture.

Let's add a separate codepath to handle pmds. follow_trans_huge_pmd() will
detect the huge zero page for us.

We would need a pmd_dirty() helper to do this properly. The patch adds it
to THP-enabled architectures which don't yet have one.

[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov"
Reported-by: Fengguang Wu
Tested-by: Fengwei Yin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
None of the mem_cgroup_same_or_subtree() callers actually require it to
take the RCU lock, either because they hold it themselves or they have css
references. Remove it.

To make the API change clear, rename the leftover helper to
mem_cgroup_is_descendant() to match cgroup_is_descendant().

Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
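What an "is descendant" helper such as the renamed one above answers can be sketched with a plain parent-pointer walk. This is an assumption-laden illustration, not the kernel code: struct group and group_is_descendant() are hypothetical, and the sketch treats a group as a descendant of itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hierarchy node; the root has a NULL parent. */
struct group {
    struct group *parent;
};

/* Returns true if g is root itself or lies anywhere below root. */
bool group_is_descendant(const struct group *g, const struct group *root)
{
    assert(root != NULL);
    for (; g != NULL; g = g->parent) {
        if (g == root)
            return true;
    }
    return false;
}
```

The sketch takes no lock of its own, mirroring the commit's point that callers are expected to already hold whatever keeps the hierarchy stable.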
The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It
makes a lot more sense to check where it's looked up, rather than check
for it in __mem_cgroup_same_or_subtree() where it's unexpected.

No other callsite passes NULL to __mem_cgroup_same_or_subtree().
Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
That function acts like a typecast - unless NULL is passed in, no NULL can
come out. task_in_mem_cgroup() callers don't pass NULL tasks.

Signed-off-by: Johannes Weiner
Reviewed-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds -
While moving charges from one memcg to another, page stat updates must
acquire the old memcg's move_lock to prevent double accounting. That
situation is denoted by an increased memcg->move_accounting. However, the
charge moving code declares this way too early for now, even before
summing up the RSS and pre-allocating destination charges.

Shorten this slowpath mode by increasing memcg->move_accounting only right
before walking the task's address space with the intention of actually
moving the pages.

Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Zero pages can be used only in anonymous mappings, which never have
writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X
definitions.

Let's drop the redundant pmd_wrprotect() in set_huge_zero_page().
Signed-off-by: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Let's use generic slab_start/next/stop for showing memcg caches info. In
contrast to the current implementation, this will work even if all memcg
caches' info doesn't fit into a seq buffer (a page), plus it simply looks
neater.

Actually, the main reason I do this isn't mere cleanup. I'm going to zap
the memcg_slab_caches list, because I find it useless provided we have the
slab_caches list, and this patch is a step in this direction.

It should be noted that before this patch an attempt to read
memory.kmem.slabinfo of a cgroup that doesn't have a kmem limit set resulted
in -EIO, while after this patch it will silently show nothing except the
header, but I don't think it will frustrate anyone.

Signed-off-by: Vladimir Davydov
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages on
*any* NUMA node. However, the only place where it's called is
mem_cgroup_soft_reclaim(), which tries to reclaim memory from a *specific*
zone. So the way it is used is incorrect - it will return true even if
the cgroup doesn't have pages on the zone we're scanning.

I think we can get rid of this check completely, because
mem_cgroup_shrink_node_zone(), which is called by
mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
equivalent to shrink_lruvec(), which exits almost immediately if the
lruvec passed to it is empty. So there's no need to optimize anything
here. Besides, we don't have such a check in the general scan path
(shrink_zone) either.

Signed-off-by: Vladimir Davydov
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
hstate_sizelog() would shift left an int rather than long, triggering
undefined behaviour and passing an incorrect value when the requested
page size was more than 4GB, thus breaking >4GB pages.

Signed-off-by: Sasha Levin
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
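The class of bug described in the commit above, shifting an int where a wider type is needed, can be shown with a small standalone function. This is a sketch under assumptions: page_size_from_log() is a hypothetical name and not the kernel's hstate_sizelog().

```c
#include <assert.h>
#include <limits.h>

/* Page size in bytes for a given log2 value. The constant being shifted
 * is unsigned long long, so shifts of 32 and above are well defined.
 * Shifting a plain int (1 << page_size_log) by 31 or more would be
 * undefined behaviour where int is 32 bits wide. */
unsigned long long page_size_from_log(unsigned int page_size_log)
{
    assert(page_size_log < sizeof(unsigned long long) * CHAR_BIT);
    return 1ULL << page_size_log;
}
```

For example, page_size_from_log(34) yields 17179869184 (16GB), a value an int-typed shift cannot produce.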
Having these functions and their documentation split out and somewhere
else makes it harder, not easier, to follow what's going on.

Inline them directly where charge moving is prepared and finished, and put
an explanation right next to it.

Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mem_cgroup_end_move() checks if the passed memcg is NULL, along with a
lengthy comment to explain why this seemingly non-sensical situation is
even possible.

Check in cancel_attach() itself whether can_attach() set up the move
context or not; it's a lot more obvious from there. Then remove the check
and comment in mem_cgroup_end_move().

Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The wrappers around taking and dropping the memcg->move_lock spinlock add
nothing of value. Inline the spinlock calls into the callsites.

Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
pc->mem_cgroup had to be left intact after uncharge for the final LRU
removal, and !PCG_USED indicated whether the page was uncharged. But
since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") pages
are uncharged after the final LRU removal. Uncharge can simply clear
the pointer and the PCG_USED/PageCgroupUsed sites can test that instead.

Because this is the last page_cgroup flag, this patch reduces the memcg
per-page overhead to a single pointer.

[akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal]
Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
PCG_MEM is a remnant from an earlier version of 0a31bc97c80c ("mm:
memcontrol: rewrite uncharge API"), used to tell whether migration cleared
a charge while leaving pc->mem_cgroup valid and PCG_USED set. But in the
final version, mem_cgroup_migrate() directly uncharges the source page,
rendering this distinction unnecessary. Remove it.

Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now that mem_cgroup_swapout() fully uncharges the page, every page that is
still in use when reaching mem_cgroup_uncharge() is known to carry both
the memory and the memory+swap charge. Simplify the uncharge path and
remove the PCG_MEMSW page flag accordingly.

Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Reviewed-by: Vladimir Davydov
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This series gets rid of the remaining page_cgroup flags, thus cutting the
memcg per-page overhead down to one pointer.

This patch (of 4):

mem_cgroup_swapout() is called with exclusive access to the page at the
end of the page's lifetime. Instead of clearing the PCG_MEMSW flag and
deferring the uncharge, just do it right away. This allows follow-up
patches to simplify the uncharge code.

Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Don't call lookup_page_cgroup() when memcg is disabled.
Cc: Johannes Weiner
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds