11 Dec, 2014

40 commits

  • Use the more common pr_warn.

    Coalesce formats, realign arguments.

    Signed-off-by: Joe Perches
    Acked-by: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Coalesce the formats and align arguments.

    Signed-off-by: Joe Perches
    Acked-by: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Eliminate the unlikely possibility of message interleaving for
    early_printk/early_vprintk use.

    early_vprintk can be done via the %pV extension so remove this
    unnecessary function and change early_printk to have the equivalent
    vprintk code.

    All uses of early_printk already end with a newline so also remove the
    unnecessary newline from the early_printk function.
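
    For context, a minimal sketch of the %pV pattern this relies on
    (struct va_format and the %pV printk extension are existing kernel
    interfaces; the surrounding names here are illustrative):

        /* wrap a (fmt, va_list) pair so a single printk() emits it
         * atomically, with no chance of interleaved messages */
        struct va_format vaf;
        va_list ap;

        va_start(ap, fmt);
        vaf.fmt = fmt;
        vaf.va = &ap;
        printk(KERN_INFO "%pV", &vaf);
        va_end(ap);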

    Signed-off-by: Joe Perches
    Acked-by: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • There have been several times where I have had to rebuild a kernel to
    cause a panic when hitting a WARN() in the code in order to get a crash
    dump from a system. Sometimes this is easy to do, other times (such as
    in the case of a remote admin) it is not trivial to send new images to
    the user.

    A much easier method would be a switch to change the WARN() over to a
    panic. This makes debugging easier in that I can now test the actual
    image the WARN() was seen on and I do not have to engage in remote
    debugging.

    This patch adds a panic_on_warn kernel parameter and a
    /proc/sys/kernel/panic_on_warn sysctl; when set, panic() is called
    from the warn_slowpath_common() path. The location of the warning is
    still printed first. A sketch of the guard follows.
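
    A minimal sketch of the added guard on the warn_slowpath_common()
    path, assuming the sysctl is wired to an int panic_on_warn (treat the
    body as illustrative):

        if (panic_on_warn) {
                /* clear the flag first so the panic path itself cannot
                 * recurse into another WARN() and panic again */
                panic_on_warn = 0;
                panic("panic_on_warn set ...\n");
        }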

    An example of the panic_on_warn output:

    The first line below comes from the WARN_ON() and gives the
    WARN_ON()'s location. After that, the panic() output is displayed.

    WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
    Kernel panic - not syncing: panic_on_warn set ...

    CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57
    Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
    0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
    0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
    ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
    Call Trace:
    [] dump_stack+0x46/0x58
    [] panic+0xd0/0x204
    [] ? init_dummy+0x1f/0x30 [dummy_module]
    [] warn_slowpath_common+0xd0/0xd0
    [] ? dummy_greetings+0x40/0x40 [dummy_module]
    [] warn_slowpath_null+0x1a/0x20
    [] init_dummy+0x1f/0x30 [dummy_module]
    [] do_one_initcall+0xd4/0x210
    [] ? __vunmap+0xc2/0x110
    [] load_module+0x16a9/0x1b30
    [] ? store_uevent+0x70/0x70
    [] ? copy_module_from_fd.isra.44+0x129/0x180
    [] SyS_finit_module+0xa6/0xd0
    [] system_call_fastpath+0x12/0x17

    Successfully tested by me.

    hpa said: There is another very valid use for this: many operators
    would rather a machine shut down than be potentially compromised,
    either functionally or security-wise.

    Signed-off-by: Prarit Bhargava
    Cc: Jonathan Corbet
    Cc: Rusty Russell
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Masami Hiramatsu
    Acked-by: Yasuaki Ishimatsu
    Cc: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     
    Macro get_unused_fd() is used to allocate a file descriptor with
    default flags. Those default flags (0) don't enable close-on-exec.

    This can be seen as an unsafe default: in most cases close-on-exec
    should be enabled so as not to leak file descriptors across exec().

    It would be better to have a "safer" default set of flags, e.g. one
    with O_CLOEXEC set so that close-on-exec is enabled.

    Instead, this patch removes get_unused_fd() so that out-of-tree
    modules won't be affected by a runtime behavior change, which might
    introduce other kinds of bugs: it's better to catch the change at
    build time, making it easier to fix.

    Removing the macro will also promote use of get_unused_fd_flags() (or
    anon_inode_getfd()) with flags provided by userspace, or, if flags
    cannot be given by userspace, with flags set to O_CLOEXEC by default.
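
    For reference, the removed macro was a trivial wrapper, and the
    preferred replacement makes the close-on-exec decision explicit (a
    sketch; fd/file and the error handling are illustrative):

        #define get_unused_fd() get_unused_fd_flags(0)  /* being removed */

        fd = get_unused_fd_flags(O_CLOEXEC);
        if (fd < 0)
                return fd;
        fd_install(fd, file);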

    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     
    This patch replaces calls to get_unused_fd() with equivalent calls to
    get_unused_fd_flags(0) to preserve the current behavior of existing
    code.

    In a further patch, get_unused_fd() will be removed so that new code
    starts using get_unused_fd_flags(), with the hope that O_CLOEXEC can
    be used, either by default or chosen by userspace.
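
    The conversion is purely mechanical, along these lines (illustrative):

        fd = get_unused_fd();           /* before */
        fd = get_unused_fd_flags(0);    /* after: same behavior, explicit flags */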

    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     
    This patch replaces calls to get_unused_fd() with equivalent calls to
    get_unused_fd_flags(0) to preserve the current behavior of existing
    code.

    In a further patch, get_unused_fd() will be removed so that new code
    starts using get_unused_fd_flags(), with the hope that O_CLOEXEC can
    be used, either by default or chosen by userspace.

    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     
    This patch replaces calls to get_unused_fd() with equivalent calls to
    get_unused_fd_flags(0) to preserve the current behavior of existing
    code.

    In a further patch, get_unused_fd() will be removed so that new code
    starts using get_unused_fd_flags(), with the hope that O_CLOEXEC can
    be used, either by default or chosen by userspace.

    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     
    This patch replaces calls to get_unused_fd() with equivalent calls to
    get_unused_fd_flags(0) to preserve the current behavior of existing
    code.

    In a further patch, get_unused_fd() will be removed so that new code
    starts using get_unused_fd_flags(), with the hope that O_CLOEXEC can
    be used, either by default or chosen by userspace.

    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     
    Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD
    tasks, we can simply pass the "dead_children" list to exit_ptrace()
    and remove another release_task() loop. Plus, this way we do not need
    to drop and reacquire tasklist_lock.

    Also shift the list_empty(ptraced) check: if we want this
    optimization, it makes sense to eliminate the function call
    altogether.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    1. Now that reparent_leader() doesn't abuse ->sibling, we can shift
    the list_move_tail() from reparent_leader() to forget_original_parent()
    and turn it into a single list_splice_tail_init(). This also makes the
    BUG_ON(!list_empty()) and list_for_each_entry_safe() unnecessary.

    2. This also allows shifting the same_thread_group() check; it looks
    a bit clearer in the caller.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    1. Cosmetic, but "if (t->parent == father)" looks a bit confusing.
    We need to change t->parent if and only if t is not traced.

    2. If we actually want this BUG_ON() to ensure that parent/ptrace
    match each other, then we should take the ptrace_reparented() case
    into account too.

    3. Change this code to use for_each_thread() instead of the
    deprecated while_each_thread(), as sketched below.
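
    A sketch of point 3, assuming a process p and thread iterator t (both
    macros are real kernel interfaces; the loop bodies are placeholders):

        /* deprecated pattern */
        t = p;
        do {
                /* ... per-thread work ... */
        } while_each_thread(p, t);

        /* preferred replacement */
        for_each_thread(p, t) {
                /* ... per-thread work ... */
        }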

    [dan.carpenter@oracle.com: silence a bogus static checker warning]
    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    reparent_leader() reuses ->sibling as a list node to add an EXIT_DEAD
    task into the dead_children list we are going to release. This
    obviously removes the dead task from its real_parent->children list,
    and this is even good; the parent can do nothing with the EXIT_DEAD
    reparented zombie, it only makes do_wait() slower.

    But this also means that the task can not be reparented once again,
    so if its new parent dies too, nobody will update its
    ->parent/real_parent; they can point to freed memory even before the
    release_task() we are going to call. This breaks the code which
    relies on pid_alive() to access ->real_parent/parent.

    Fortunately this is mostly theoretical: it can only happen if init or
    a PR_SET_CHILD_SUBREAPER process ignores SIGCHLD and the new parent
    sub-thread exits right after we drop tasklist_lock.

    Change this code to use ->ptrace_entry instead; we know that the
    child is not traced, so nobody can ever use this member. This also
    allows unifying this logic with exit_ptrace(), see the next changes.

    Note: we really need to change release_task() to nullify the
    real_parent/parent/group_leader pointers, but we need to change the
    current users first somehow. And it would be better to reap this
    zombie immediately, but the release_task_locked() we would need is
    complicated by proc_flush_task().

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    rcu_read_lock() can not protect p->real_parent if release_task(p) was
    already called; change sched_show_task() to check pid_alive() like
    other users do.

    Note: we need some helpers to clean up code like this. And it seems
    that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe
    either.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Acked-by: Peter Zijlstra (Intel)
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    p->ptrace != 0 means that release_task(p) was not called, so
    pid_alive() buys nothing and we can remove this check. Other callers
    already use it directly without additional checks.

    Note: with or without this patch, ptrace_parent() can return a
    pointer to a freed task; this will be explained/fixed later.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    task_state() does seq_printf() under rcu_read_lock(), but this is
    only needed for task_tgid_nr_ns() and task_numa_group_id(). We can
    calculate tgid/ngid first and then drop the rcu lock.
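
    The resulting shape is roughly this (a sketch; the printing of the
    other status fields is elided):

        pid_t tgid, ngid;

        rcu_read_lock();
        tgid = task_tgid_nr_ns(p, ns);
        ngid = task_numa_group_id(p);
        rcu_read_unlock();

        /* no RCU protection needed for the actual output */
        seq_printf(m, "Tgid:\t%d\nNgid:\t%d\n", tgid, ngid);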

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    1. The usage of fdt looks very ugly; it can't be NULL if ->files is
    not NULL. We can use "unsigned int max_fds" instead.

    2. This also allows moving the seq_printf(max_fds) outside of
    task_lock() and joining it with the previous seq_printf(). See also
    the next patch.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    task_state() reads cred->group_info under task_lock() because long
    ago it was task_struct->group_info and was actually protected by
    task->alloc_lock. Today this task_unlock() after rcu_read_unlock()
    just adds confusion; move task_unlock() up.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    Better to use the existing macros than to rewrite them.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Dichtel
     
  • proc_register() error paths are leaking inodes and directory refcounts.

    Signed-off-by: Debabrata Banerjee
    Cc: Alexander Viro
    Acked-by: Nicolas Dichtel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Debabrata Banerjee
     
    When a lot of netdevices are created, one of the bottlenecks is the
    creation of proc entries. This series aims to accelerate this part.

    The current implementation for the directories in /proc is using a single
    linked list. This is slow when handling directories with large numbers of
    entries (eg netdevice-related entries when lots of tunnels are opened).

    This patch replaces this linked list by a red-black tree.

    Here are some numbers:

    dummy30000.batch contains 30 000 times 'link add type dummy'.

    Before the patch:
    $ time ip -b dummy30000.batch
    real 2m31.950s
    user 0m0.440s
    sys 2m21.440s
    $ time rmmod dummy
    real 1m35.764s
    user 0m0.000s
    sys 1m24.088s

    After the patch:
    $ time ip -b dummy30000.batch
    real 2m0.874s
    user 0m0.448s
    sys 1m49.720s
    $ time rmmod dummy
    real 1m13.988s
    user 0m0.000s
    sys 1m1.008s

    The idea of improving this part was suggested by Thierry Herbelot.
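
    A minimal sketch of the lookup after the change, assuming the subdir
    list becomes an rb_root keyed by entry name (the field names here are
    illustrative):

        static struct proc_dir_entry *pde_subdir_find(struct proc_dir_entry *dir,
                                                      const char *name,
                                                      unsigned int len)
        {
                struct rb_node *node = dir->subdir.rb_node;

                while (node) {
                        struct proc_dir_entry *de = rb_entry(node,
                                        struct proc_dir_entry, subdir_node);
                        int result = strncmp(name, de->name, len);

                        if (result == 0 && de->namelen > len)
                                result = -1;    /* shorter name sorts first */
                        if (result < 0)
                                node = node->rb_left;
                        else if (result > 0)
                                node = node->rb_right;
                        else
                                return de;
                }
                return NULL;
        }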

    [akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
    Signed-off-by: Nicolas Dichtel
    Acked-by: David S. Miller
    Cc: Thierry Herbelot
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Dichtel
     
    Now that the external page_cgroup data structure and its lookup are
    gone, let the generic bad_page() check for page->mem_cgroup sanity.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Cc: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Now that the external page_cgroup data structure and its lookup are
    gone, the only code remaining in there is swap slot accounting.

    Rename it and move the conditional compilation into mm/Makefile.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Acked-by: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroups used to have 5 per-page pointers. To allow users to
    disable that amount of overhead during runtime, those pointers were
    allocated in a separate array, with a translation layer between them and
    struct page.

    There is now only one page pointer remaining: the memcg pointer,
    which indicates which cgroup the page is associated with when
    charged. The complexity of runtime allocation and the runtime
    translation overhead is no longer justified to save that *potential*
    0.19% of memory. With CONFIG_SLUB, page->mem_cgroup actually sits in
    the doubleword padding after the page->private member and doesn't
    even increase struct page, so this patch actually saves space.
    Remaining users that care can still compile their kernels without
    CONFIG_MEMCG.

    text data bss dec hex filename
    8828345 1725264 983040 11536649 b00909 vmlinux.old
    8827425 1725264 966656 11519345 afc571 vmlinux.new

    [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Acked-by: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Acked-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is no cgroup-specific page lock anymore.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    The largest valid index of a swap device is MAX_SWAPFILES-1, so a
    swap entry's type must be less than MAX_SWAPFILES.
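
    In code terms, any validation of a swap entry's type takes this form
    (an illustrative sketch, not the exact hunk):

        if (swp_type(entry) >= MAX_SWAPFILES)
                goto bad_entry;         /* valid types: 0 .. MAX_SWAPFILES-1 */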

    Signed-off-by: Haifeng Li
    Acked-by: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Haifeng
     
    Signed-off-by: Wei Yuan
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yuan
     
    First, after flushing the TLB, there is no need to scan the ptes from
    the start again. Second, before bailing out of the loop, the address
    is advanced one step.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
    Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback
    page accounting"), mem_cgroup_end_page_stat consumes the locked and
    flags variables directly rather than via pointers, which might
    trigger C undefined behavior because those variables are initialized
    only in the slow path of mem_cgroup_begin_page_stat.

    Although mem_cgroup_end_page_stat handles the parameters correctly
    and touches them only when they hold a sensible value, it is the
    caller which loads a potentially uninitialized value, which then
    might allow the compiler to do crazy things.

    I haven't seen any warning from gcc, and it seems that the current
    version (4.9) doesn't exploit this type of undefined behavior, but
    Sasha has reported the following:

    UBSan: Undefined behaviour in mm/rmap.c:1084:2
    load of value 255 is not a valid value for type '_Bool'
    CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    ubsan_epilogue (lib/ubsan.c:159)
    __ubsan_handle_load_invalid_value (lib/ubsan.c:482)
    page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
    unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
    unmap_single_vma (mm/memory.c:1348)
    unmap_vmas (mm/memory.c:1377 (discriminator 3))
    exit_mmap (mm/mmap.c:2837)
    mmput (kernel/fork.c:659)
    do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
    do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
    SyS_exit_group (kernel/exit.c:901)
    tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

    Fix this by using pointer parameters for both locked and flags, and
    be more robust against future compiler changes even though the
    current code is implemented correctly.
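
    The shape of the fix, assuming the pre-patch signature took the
    values directly (a sketch of the prototypes only):

        /* before: callers load possibly-uninitialized locals to pass by value */
        void mem_cgroup_end_page_stat(struct mem_cgroup *memcg,
                                      bool locked, unsigned long flags);

        /* after: pass pointers; only the slow path ever dereferences them */
        void mem_cgroup_end_page_stat(struct mem_cgroup *memcg,
                                      bool *locked, unsigned long *flags);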

    Signed-off-by: Michal Hocko
    Reported-by: Sasha Levin
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Like the small zero page, the huge zero page should not be accounted
    in smaps reports as a normal page.

    For small pages we rely on vm_normal_page() to filter out the zero
    page, but vm_normal_page() is not designed to handle pmds. We only
    get here due to a hackish cast from pmd to pte in smaps_pte_range()
    -- the pte and pmd formats are not necessarily compatible on each
    and every architecture.

    Let's add a separate codepath to handle pmds. follow_trans_huge_pmd()
    will detect the huge zero page for us.

    We need a pmd_dirty() helper to do this properly. The patch adds it
    to the THP-enabled architectures which don't yet have one.
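
    On x86, for instance, such a helper reduces to a flag test (a sketch
    following the usual pattern for these helpers):

        static inline int pmd_dirty(pmd_t pmd)
        {
                return pmd_flags(pmd) & _PAGE_DIRTY;
        }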

    [akpm@linux-foundation.org: use do_div to fix 32-bit build]
    Signed-off-by: "Kirill A. Shutemov"
    Reported-by: Fengguang Wu
    Tested-by: Fengwei Yin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • None of the mem_cgroup_same_or_subtree() callers actually require it to
    take the RCU lock, either because they hold it themselves or they have css
    references. Remove it.

    To make the API change clear, rename the leftover helper to
    mem_cgroup_is_descendant() to match cgroup_is_descendant().

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It
    makes a lot more sense to check where it's looked up, rather than check
    for it in __mem_cgroup_same_or_subtree() where it's unexpected.

    No other callsite passes NULL to __mem_cgroup_same_or_subtree().

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • That function acts like a typecast - unless NULL is passed in, no NULL can
    come out. task_in_mem_cgroup() callers don't pass NULL tasks.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    While moving charges from one memcg to another, page stat updates
    must acquire the old memcg's move_lock to prevent double accounting.
    That situation is denoted by an increased memcg->moving_account.
    However, the charge moving code currently declares this state way too
    early, even before summing up the RSS and pre-allocating destination
    charges.

    Shorten this slowpath mode by increasing memcg->moving_account only
    right before walking the task's address space with the intention of
    actually moving the pages.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Zero pages can be used only in anonymous mappings, which never have
    writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X
    definitions.

    Let's drop redundant pmd_wrprotect() in set_huge_zero_page().

    Signed-off-by: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Let's use the generic slab_start/next/stop iterators for showing
    memcg caches info. In contrast to the current implementation, this
    will work even if all the memcg caches' info doesn't fit into a seq
    buffer (a page), plus it simply looks neater.

    Actually, the main reason I do this isn't mere cleanup. I'm going to zap
    the memcg_slab_caches list, because I find it useless provided we have the
    slab_caches list, and this patch is a step in this direction.

    It should be noted that before this patch an attempt to read
    memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
    in -EIO, while after this patch it will silently show nothing except the
    header, but I don't think it will frustrate anyone.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages on
    *any* NUMA node. However, the only place where it's called is
    mem_cgroup_soft_reclaim(), which tries to reclaim memory from a *specific*
    zone. So the way it is used is incorrect - it will return true even if
    the cgroup doesn't have pages on the zone we're scanning.

    I think we can get rid of this check completely, because
    mem_cgroup_shrink_node_zone(), which is called by
    mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
    equivalent to shrink_lruvec(), which exits almost immediately if the
    lruvec passed to it is empty. So there's no need to optimize anything
    here. Besides, we don't have such a check in the general scan path
    (shrink_zone) either.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    hstate_sizelog() would shift left an int rather than a long,
    triggering undefined behaviour and passing an incorrect value when
    the requested page size was more than 4GB, thus breaking >4GB pages.
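
    The class of bug, in miniature (illustrative):

        /* undefined behaviour once page_size_log >= 32: the shift is
         * performed on a 32-bit int before any widening */
        unsigned long bad = 1 << page_size_log;

        /* correct: force the shift to happen in unsigned long */
        unsigned long good = 1UL << page_size_log;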

    Signed-off-by: Sasha Levin
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
    Having these functions and their documentation split out somewhere
    else makes it harder, not easier, to follow what's going on.

    Inline them directly where charge moving is prepared and finished, and put
    an explanation right next to it.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_end_move() checks if the passed memcg is NULL, along with a
    lengthy comment to explain why this seemingly non-sensical situation is
    even possible.

    Check in cancel_attach() itself whether can_attach() set up the move
    context or not; it's a lot more obvious from there. Then remove the
    check and comment in mem_cgroup_end_move().

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner