06 Mar, 2019

2 commits

  • Since it is technically possible to put the global init process into a
    memory cgroup, oom_kill_memcg_member() must check for it.

    Tasks in /test1 are going to be killed due to memory.oom.group set
    Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
    oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    int main(int argc, char *argv[])
    {
            static char buffer[10485760]; /* 10MB touched by each child */
            static int pipe_fd[2] = { EOF, EOF };
            unsigned int i;
            int fd;
            char buf[64] = { };

            if (pipe(pipe_fd))
                    return 1;
            if (chdir("/sys/fs/cgroup/"))
                    return 1;
            fd = open("cgroup.subtree_control", O_WRONLY);
            write(fd, "+memory", 7);
            close(fd);
            mkdir("test1", 0755);
            fd = open("test1/memory.oom.group", O_WRONLY);
            write(fd, "1", 1);
            close(fd);
            fd = open("test1/cgroup.procs", O_WRONLY);
            write(fd, "1", 1); /* move the global init process into /test1 */
            snprintf(buf, sizeof(buf) - 1, "%d", getpid());
            write(fd, buf, strlen(buf)); /* ... and this process as well */
            close(fd);
            snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
            fd = open("test1/memory.max", O_WRONLY);
            write(fd, buf, strlen(buf)); /* memory.max = 50MB */
            close(fd);
            for (i = 0; i < 10; i++)
                    if (fork() == 0) {
                            char c;

                            close(pipe_fd[1]);
                            read(pipe_fd[0], &c, 1); /* wait for the parent to close the pipe */
                            memset(buffer, 0, sizeof(buffer)); /* fault in 10MB */
                            sleep(3);
                            _exit(0);
                    }
            close(pipe_fd[0]);
            close(pipe_fd[1]); /* release all children at once */
            sleep(3);
            return 0;
    }

    [ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
    [ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    [ 37.062954][ T9185] Call Trace:
    [ 37.063976][ T9185] dump_stack+0x67/0x95
    [ 37.065263][ T9185] dump_header+0x51/0x570
    [ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110
    [ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70
    [ 37.069967][ T9185] oom_kill_process+0x18d/0x210
    [ 37.071515][ T9185] out_of_memory+0x11b/0x380
    [ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0
    [ 37.074601][ T9185] try_charge+0x790/0x820
    [ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0
    [ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30
    [ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0
    [ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070
    [ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0
    [ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0
    [ 37.085181][ T9185] __do_page_fault+0x255/0x4c0
    [ 37.086529][ T9185] do_page_fault+0x28/0x260
    [ 37.087788][ T9185] ? page_fault+0x8/0x30
    [ 37.088978][ T9185] page_fault+0x1e/0x30
    [ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
    [ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
    [ 37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
    [ 37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
    [ 37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
    [ 37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
    [ 37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
    [ 37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
    [ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
    [ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
    [ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
    [ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
    [ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
    [ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
    [ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
    [ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
    [ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
    [ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
    [ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
    [ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
    [ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
    [ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
    [ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
    [ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
    [ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    [ 37.361685][ T1] Call Trace:
    [ 37.363239][ T1] dump_stack+0x67/0x95
    [ 37.365010][ T1] panic+0xfc/0x2b0
    [ 37.366853][ T1] do_exit+0xd55/0xd60
    [ 37.368595][ T1] do_group_exit+0x47/0xc0
    [ 37.370415][ T1] get_signal+0x32a/0x920
    [ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70
    [ 37.374596][ T1] do_signal+0x32/0x6e0
    [ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b
    [ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0
    [ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b
    [ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0
    [ 37.384594][ T1] ? page_fault+0x8/0x30
    [ 37.386453][ T1] retint_user+0x8/0x18
    [ 37.388160][ T1] RIP: 0033:0x7f42c06974a8
    [ 37.389922][ T1] Code: Bad RIP value.
    [ 37.391788][ T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
    [ 37.394075][ T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
    [ 37.396963][ T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
    [ 37.399550][ T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
    [ 37.402334][ T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
    [ 37.404890][ T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0
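
    The fix is to never pick the global init process when killing the
    members of a memcg with memory.oom.group set. A minimal sketch of that
    check, assuming oom_kill_memcg_member() keeps roughly the shape it has
    in mm/oom_kill.c of this era (the OOM_SCORE_ADJ_MIN test and the
    get_task_struct() pairing come from the surrounding code, not from this
    changelog):

    static int oom_kill_memcg_member(struct task_struct *task, void *message)
    {
            /*
             * Skip tasks with OOM protection, and skip the global init
             * process: killing init only panics the kernel, as the log
             * above shows.
             */
            if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
                !is_global_init(task)) {
                    get_task_struct(task);
                    __oom_kill_process(task, message);
            }
            return 0;
    }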

    Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
    Fixes: 3d8b38eb81cac813 ("mm, oom: introduce memory.oom.group")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Since the start of Linux's git history, the kernel, after selecting
    the worst process to be oom-killed, has preferred to kill one of its
    children (if the child does not share an mm with the parent). Later
    this was changed to prefer killing whichever child is worst; if the
    parent is still the worst, the parent will be killed.

    This heuristic assumes that the children have done less work than
    their parent, so killing one of them loses less work. However, this is
    very workload dependent. A workload that benefits from this heuristic
    can use oom_score_adj to prefer that children be killed before the
    parent.

    select_bad_process() has already selected the worst process in the
    system/memcg, so there is no need to recheck the badness of its
    children in the hope of finding a worse candidate. That is a lot of
    unneeded, racy work. The heuristic is also dangerous because it makes
    fork-bomb-like workloads recover much later, since we constantly pick
    and kill processes which are not memory hogs. So, let's remove this
    whole heuristic.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

02 Feb, 2019

2 commits

  • A syzbot instance running on an upstream kernel found a use-after-free
    bug in oom_kill_process. On further inspection, it seems the process
    selected to be oom-killed had exited even before reaching
    read_lock(&tasklist_lock) in oom_kill_process(). More specifically,
    tsk->usage is 1 (due to the get_task_struct() in oom_evaluate_task()),
    and the put_task_struct() within for_each_thread() frees the tsk while
    for_each_thread() is still accessing it. The easiest fix is to do a
    get/put across the for_each_thread() on the selected task.

    The next question is whether we should continue with the oom-kill when
    the previously selected task has exited. But before adding more
    complexity and heuristics, let's ask why we even look at the children
    of the oom-kill selected task. select_bad_process() has already
    selected the worst process in the system/memcg. Due to a race, the
    selected process might not be the worst at kill time, but does that
    matter? Userspace can use the oom_score_adj interface to prefer that
    children be killed before the parent. I looked at the history, but
    this behavior seems to predate the git history.
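
    A minimal sketch of the described get/put fix inside
    oom_kill_process(), assuming the child-scanning loop keeps the shape
    it had at the time (the loop body is elided):

    /*
     * The selected task 'p' might exit while we scan its children for a
     * better victim: the put_task_struct() on a replaced victim can drop
     * the last reference to 'p' while for_each_thread(p, t) is still
     * walking p's thread list. Hold an extra reference across the walk.
     */
    get_task_struct(p);
    read_lock(&tasklist_lock);
    for_each_thread(p, t) {
            /* ... consider p's children as alternative victims ... */
    }
    read_unlock(&tasklist_lock);
    put_task_struct(p);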

    Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
    Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
    Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Arkadiusz reported that enabling memcg's group oom killing causes
    strange memcg statistics: a memcg reports a non-zero number of tasks
    even though no task remains in it. It turned out that there is a bug in
    wake_oom_reaper() which allows the same task to be enqueued twice; the
    resulting refcount leak makes it impossible for the task count in that
    memcg to drop to 0.

    This bug has existed since the OOM reaper became invokable from the
    task_will_free_mem(current) path in out_of_memory() in Linux 4.7:

    T1@P1, T2@P1 and T3@P1 are threads of the same process P1, while the
    OOM reaper starts out busy processing an OOM victim in a different
    memcg domain:

      T1@P1, T2@P1, T3@P1, each:
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
      T3@P1, the first to take oom_lock:
        out_of_memory()
          oom_kill_process(P1)
            do_send_sig_info(SIGKILL, @P1)
            mark_oom_victim(T1@P1)
            wake_oom_reaper(T1@P1)  # T1@P1 is enqueued.
        mutex_unlock(&oom_lock)
      T2@P1, already killed, via task_will_free_mem(current):
        out_of_memory()
          mark_oom_victim(T2@P1)
          wake_oom_reaper(T2@P1)    # T2@P1 is enqueued.
        mutex_unlock(&oom_lock)
      T1@P1, already killed, via task_will_free_mem(current):
        out_of_memory()
          mark_oom_victim(T1@P1)
          wake_oom_reaper(T1@P1)    # T1@P1 is enqueued again due to
                                    # oom_reaper_list == T2@P1 &&
                                    # T1@P1->oom_reaper_list == NULL.
        mutex_unlock(&oom_lock)
      OOM reaper, after completing the victim in the other memcg domain:
        spin_lock(&oom_reaper_lock)
        # T1@P1 is dequeued.
        spin_unlock(&oom_reaper_lock)

    Memcg's group oom killing made this bug easier to trigger, because it
    can call wake_oom_reaper() on the same task from a single
    out_of_memory() request.

    Fix this bug using the approach of commit 855b018325737f76 ("oom,
    oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a
    side effect, this patch also avoids enqueuing multiple threads sharing
    memory via the task_will_free_mem(current) path.
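
    A minimal sketch of that approach: a per-mm "already queued" bit
    tested atomically in wake_oom_reaper() (the bit is called
    MMF_OOM_REAP_QUEUED here, and the list manipulation mirrors the code
    described above; treat the exact shape as an assumption):

    static void wake_oom_reaper(struct task_struct *tsk)
    {
            /* The victim's mm was already queued for reaping; a second
             * wake_oom_reaper() for any thread of the same process must
             * not enqueue it again. */
            if (test_and_set_bit(MMF_OOM_REAP_QUEUED,
                                 &tsk->signal->oom_mm->flags))
                    return;

            get_task_struct(tsk);

            spin_lock(&oom_reaper_lock);
            tsk->oom_reaper_list = oom_reaper_list;
            oom_reaper_list = tsk;
            spin_unlock(&oom_reaper_lock);
            wake_up(&oom_reaper_wait);
    }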

    Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
    Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
    Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
    Signed-off-by: Tetsuo Handa
    Reported-by: Arkadiusz Miskiewicz
    Tested-by: Arkadiusz Miskiewicz
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Tejun Heo
    Cc: Aleksa Sarai
    Cc: Jay Kamat
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

29 Dec, 2018

4 commits

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this
    patch.
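
    A minimal sketch of the grouped parameters, assuming the structure and
    init helper keep the shape they had when this was merged (fields
    simplified):

    struct mmu_notifier_range {
            struct mm_struct *mm;
            unsigned long start;
            unsigned long end;
            bool blockable;
    };

    /* A caller fills the range once and passes a single pointer around,
     * so adding a field no longer touches every call site. */
    struct mmu_notifier_range range;

    mmu_notifier_range_init(&range, mm, start, end);
    mmu_notifier_invalidate_range_start(&range);
    /* ... modify or unmap the PTEs covering [start, end) ... */
    mmu_notifier_invalidate_range_end(&range);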

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The current oom report doesn't display the victim's memcg context
    during a global OOM situation. While this information is not strictly
    needed, it can be really helpful in containerized environments for
    locating which container has lost a process. Now that we have a single
    line for the oom context, we can trivially add both the oom memcg (this
    can be either global_oom or the specific memcg which hit its hard
    limit) and task_memcg, which is the victim's memcg.

    Below is the single line output in the oom report after this patch.

    - global oom context information:

    oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,global_oom,task_memcg=,task=,pid=,uid=

    - memcg oom context information:

    oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,oom_memcg=,task_memcg=,task=,pid=,uid=

    [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()]
    Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp
    Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com
    Signed-off-by: yuzhoujian
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Tetsuo Handa
    Cc: Roman Gushchin
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yuzhoujian
     
  • The OOM report contains several sections. The first one is the
    allocation context that triggered the OOM. Then we have the cpuset
    context, followed by the stack trace of the OOM path. The third one is
    the OOM memory information, followed by the current memory state of
    all system tasks. Finally, we show the oom-eligible tasks and the
    information about the chosen oom victim.

    One thing that makes parsing more awkward than necessary is that we do
    not have a single, easily parsable line about the oom context. This
    patch reorganizes the oom report into:

    1) who invoked oom and what was the allocation request

    [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

    2) OOM stack trace

    [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
    [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
    [ 515.906821] Call Trace:
    [ 515.908062] dump_stack+0x5a/0x73
    [ 515.909311] dump_header+0x55/0x28c
    [ 515.914260] oom_kill_process+0x2d8/0x300
    [ 515.916708] out_of_memory+0x145/0x4a0
    [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16
    [ 515.919157] __alloc_pages_nodemask+0x277/0x290
    [ 515.920367] filemap_fault+0x3d0/0x6c0
    [ 515.921529] ? filemap_map_pages+0x2b8/0x420
    [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4]
    [ 515.923884] __do_fault+0x20/0x80
    [ 515.925032] __handle_mm_fault+0xbc0/0xe80
    [ 515.926195] handle_mm_fault+0xfa/0x210
    [ 515.927357] __do_page_fault+0x233/0x4c0
    [ 515.928506] do_page_fault+0x32/0x140
    [ 515.929646] ? page_fault+0x8/0x30
    [ 515.930770] page_fault+0x1e/0x30

    3) OOM memory information

    [ 515.958093] Mem-Info:
    [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
    active_file:4402672 inactive_file:483963 isolated_file:1344
    unevictable:0 dirty:4886753 writeback:0 unstable:0
    slab_reclaimable:148442 slab_unreclaimable:18741
    mapped:1347 shmem:1347 pagetables:58669 bounce:0
    free:88663 free_pcp:0 free_cma:0
    ...

    4) current memory state of all system tasks

    [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal
    [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad
    [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd
    [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd
    [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd
    [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance
    [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd
    [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager
    [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned
    [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic
    [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic
    [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic
    [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic
    [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic

    5) oom context (constraints and the chosen victim).

    oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0

    An admin can easily get the full oom context on a single line, which
    makes parsing much easier.

    Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
    Signed-off-by: yuzhoujian
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: "Kirill A . Shutemov"
    Cc: Roman Gushchin
    Cc: Tetsuo Handa
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yuzhoujian
     
  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 so it seemed
    better to remove the lock and convert the variables to atomics, with
    the prevention of potential store-to-read tearing as a bonus.
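
    A minimal sketch of the conversion, assuming the accessor shape that
    this change puts into include/linux/mm.h:

    extern atomic_long_t _totalram_pages;

    /* Readers go through the accessor; the atomic read prevents
     * store-to-read tearing without the old managed_page_count_lock. */
    static inline unsigned long totalram_pages(void)
    {
            return (unsigned long)atomic_long_read(&_totalram_pages);
    }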

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

24 Oct, 2018

1 commit

  • …iederm/user-namespace

    Pull siginfo updates from Eric Biederman:
    "I have been slowly sorting out siginfo and this is the culmination of
    that work.

    The primary result is that in several ways the signal infrastructure
    has been made less error prone. The code has been updated so that
    manually specifying SEND_SIG_FORCED is never necessary. The conversion
    to the new siginfo sending functions is now complete, which makes it
    difficult to send a signal without filling in the proper siginfo
    fields.

    At the tail end of the patchset comes the optimization of decreasing
    the size of struct siginfo in the kernel from 128 bytes to about 48
    bytes on 64bit. The fundamental observation that enables this is that,
    by definition, none of the known ways to use struct siginfo uses the
    extra bytes.

    This comes at the cost of a small user space observable difference.
    For the rare case of siginfo being injected into the kernel only what
    can be copied into kernel_siginfo is delivered to the destination, the
    rest of the bytes are set to 0. For cases where the signal and the
    si_code are known this is safe, because we know those bytes are not
    used. For cases where the signal and si_code combination is unknown
    the bits that won't fit into struct kernel_siginfo are tested to
    verify they are zero, and the send fails if they are not.

    I made an extensive search through userspace code and I could not find
    anything that would break because of the above change. If it turns out
    I did break something it will take just the revert of a single change
    to restore kernel_siginfo to the same size as userspace siginfo.

    Testing did reveal dependencies on preferring the signo passed to
    sigqueueinfo over si->signo, so I bit the bullet and added the
    complexity necessary to handle that case.

    Testing also revealed bad things can happen if a negative signal
    number is passed into the system calls. Something no sane application
    will do but something a malicious program or a fuzzer might do. So I
    have fixed the code that performs the bounds checks to ensure negative
    signal numbers are handled"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
    signal: Guard against negative signal numbers in copy_siginfo_from_user32
    signal: Guard against negative signal numbers in copy_siginfo_from_user
    signal: In sigqueueinfo prefer sig not si_signo
    signal: Use a smaller struct siginfo in the kernel
    signal: Distinguish between kernel_siginfo and siginfo
    signal: Introduce copy_siginfo_from_user and use it's return value
    signal: Remove the need for __ARCH_SI_PREAMBLE_SIZE and SI_PAD_SIZE
    signal: Fail sigqueueinfo if si_signo != sig
    signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
    signal/unicore32: Use force_sig_fault where appropriate
    signal/unicore32: Generate siginfo in ucs32_notify_die
    signal/unicore32: Use send_sig_fault where appropriate
    signal/arc: Use force_sig_fault where appropriate
    signal/arc: Push siginfo generation into unhandled_exception
    signal/ia64: Use force_sig_fault where appropriate
    signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
    signal/ia64: Use the generic force_sigsegv in setup_frame
    signal/arm/kvm: Use send_sig_mceerr
    signal/arm: Use send_sig_fault where appropriate
    signal/arm: Use force_sig_fault where appropriate
    ...

    Linus Torvalds
     

12 Sep, 2018

1 commit

  • Now that siginfo is never allocated for SIGKILL and SIGSTOP there is
    no difference between SEND_SIG_PRIV and SEND_SIG_FORCED for SIGKILL
    and SIGSTOP. This makes SEND_SIG_FORCED unnecessary and redundant in
    the presence of SIGKILL and SIGSTOP. Therefore change users of
    SEND_SIG_FORCED that are sending SIGKILL or SIGSTOP to use
    SEND_SIG_PRIV instead.

    This removes the last users of SEND_SIG_FORCED.

    Reviewed-by: Thomas Gleixner
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Sep, 2018

2 commits

  • Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
    notifiers") has added an ability to skip over vmas with blockable mmu
    notifiers. This however didn't call tlb_finish_mmu as it should have.

    As a result, inc_tlb_flush_pending has been called without its pairing
    dec_tlb_flush_pending, and all callers of mm_tlb_flush_pending would
    flush even though this is not really needed. This alone is not harmful,
    and it seems there shouldn't be any such callers for oom victims at
    all, but there is no real reason to skip tlb_finish_mmu on early skip
    either, so call it.

    [mhocko@suse.com: new changelog]
    Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • When the memcg OOM killer runs out of killable tasks, it currently
    prints a WARN with no further OOM context. This has caused some user
    confusion.

    Warnings indicate a kernel problem. In a reported case, however, the
    situation was triggered by a nonsensical memcg configuration (hard limit
    set to 0). But without any VM context this wasn't obvious from the
    report, and it took some back and forth on the mailing list to identify
    what is actually a trivial issue.

    Handle this OOM condition like we handle it in the global OOM killer:
    dump the full OOM context and tell the user we ran out of tasks.

    This way the user can identify misconfigurations easily by themselves
    and rectify the problem - without having to go through the hassle of
    running into an obscure but unsettling warning, finding the appropriate
    kernel mailing list and waiting for a kernel developer to remote-analyze
    that the memcg configuration caused this.

    If users cannot make sense of why the OOM killer was triggered or why it
    failed, they will still report it to the mailing list, we know that from
    experience. So in case there is an actual kernel bug causing this,
    kernel developers will very likely hear about it.

    Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Aug, 2018

7 commits

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - procfs updates

    - various misc things

    - more y2038 fixes

    - get_maintainer updates

    - lib/ updates

    - checkpatch updates

    - various epoll updates

    - autofs updates

    - hfsplus

    - some reiserfs work

    - fatfs updates

    - signal.c cleanups

    - ipc/ updates

    * emailed patches from Andrew Morton : (166 commits)
    ipc/util.c: update return value of ipc_getref from int to bool
    ipc/util.c: further variable name cleanups
    ipc: simplify ipc initialization
    ipc: get rid of ids->tables_initialized hack
    lib/rhashtable: guarantee initial hashtable allocation
    lib/rhashtable: simplify bucket_table_alloc()
    ipc: drop ipc_lock()
    ipc/util.c: correct comment in ipc_obtain_object_check
    ipc: rename ipcctl_pre_down_nolock()
    ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
    ipc: reorganize initialization of kern_ipc_perm.seq
    ipc: compute kern_ipc_perm.id under the ipc lock
    init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
    fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
    adfs: use timespec64 for time conversion
    kernel/sysctl.c: fix typos in comments
    drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
    fork: don't copy inconsistent signal handler state to child
    signal: make get_signal() return bool
    signal: make sigkill_pending() return bool
    ...

    Linus Torvalds
     
  • For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    Historically, there have been two common solutions for this
    problem:
    1) enabling panic_on_oom,
    2) using a userspace daemon to monitor OOMs and kill
    all outstanding processes.

    Both approaches have their downsides: rebooting on each OOM is an obvious
    waste of capacity, and handling all in userspace is tricky and requires a
    userspace agent, which will monitor all cgroups for OOMs.

    In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
    the necessity of enabling panic_on_oom. Also, it can simplify the cgroup
    management for userspace applications.

    This commit introduces a new knob for the cgroup v2 memory controller:
    memory.oom.group. The knob determines whether the cgroup should be
    treated as an indivisible workload by the OOM killer. If set, all tasks
    belonging to the cgroup or to its descendants (if the memory cgroup is not
    a leaf cgroup) are killed together or not at all.

    To determine which cgroup has to be killed, we traverse the cgroup
    hierarchy from the victim task's cgroup up to the OOMing cgroup (or
    root), looking for the highest-level cgroup with memory.oom.group set.

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
    an exception and are never killed.

    This patch doesn't change the OOM victim selection algorithm.
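
    A minimal sketch of that upward traversal, modeled on what became
    mem_cgroup_get_oom_group() (locking, reference counting and the
    root-cgroup checks are elided; treat the exact shape as an
    assumption):

    struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
                                                struct mem_cgroup *oom_domain)
    {
            struct mem_cgroup *oom_group = NULL;
            struct mem_cgroup *memcg = mem_cgroup_from_task(victim);

            /* Walk from the victim's memcg up to the OOMing domain (or
             * root), remembering the highest level with oom.group set. */
            for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                    if (memcg->oom_group)
                            oom_group = memcg;
                    if (memcg == oom_domain)
                            break;
            }
            return oom_group;
    }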

    Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "introduce memory.oom.group", v2.

    This is a tiny implementation of a cgroup-aware OOM killer, which adds
    the ability to kill a cgroup as a single unit and so guarantee the
    integrity of the workload.

    Although it has only limited functionality in comparison to what now
    resides in the mm tree (it doesn't change the victim task selection
    algorithm, doesn't look at memory stats at the cgroup level, etc.), it's
    also much simpler and more straightforward. So, hopefully, we can avoid
    the long debates we had with the full implementation.

    As it doesn't prevent any further development, and implements a useful
    and complete feature, it looks like a sane way forward.

    This patch (of 2):

    oom_kill_process() consists of two logical parts: the first is
    responsible for considering the task's children as potential victims
    and printing the debug information. The second half is responsible for
    sending SIGKILL to all tasks sharing the mm struct with the given
    victim.

    This commit splits oom_kill_process() with the intention of reusing the
    second half: __oom_kill_process().

    The cgroup-aware OOM killer will kill multiple tasks belonging to the
    victim cgroup. We don't need to print the debug information for each
    task, or play with task selection (considering the task's children),
    so we can't use the existing oom_kill_process().

    Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com
    Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Cc: Vladimir Davydov
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably

    - Undocumented return value.

    - comment "failed to reap part..." is misleading - sounds like it's
    referring to something which happened in the past, is in fact
    referring to something which might happen in the future.

    - fails to call trace_finish_task_reaping() in one case

    - code duplication.

    - Increases mmap_sem hold time a little by moving
    trace_finish_task_reaping() inside the locked region. So sue me ;)

    - Sharing the finish: path means that the trace event won't
    distinguish between the two sources of finishing.

    Add a short explanation for the return value and fix the rest by
    reorganizing the function a bit to have unified function exit paths.

    Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
    Suggested-by: Andrew Morton
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The default memory unit of the OOM task dump might not be intuitive
    and can mislead the non-initiated when debugging OOM events: these are
    pages and not kBs. Add a small printk prior to the task dump informing
    that the memory units are actually memory _pages_.

    Also extend the PID field to align on up to 7 characters.
    Reference https://lkml.org/lkml/2018/7/3/1201

    Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com
    Signed-off-by: Rodrigo Freire
    Acked-by: David Rientjes
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rodrigo Freire
     
  • oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper:
    close race with exiting task"). We do not really need the lock anymore
    though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run
    concurrently") has removed serialization with the exit path based on the
    mm reference count and so we do not really rely on the oom_lock anymore.

    Tetsuo argued that at least MMF_OOM_SKIP should be set under the lock
    to prevent races where the page allocator doesn't manage to get the
    freed (reaped) memory in __alloc_pages_may_oom but sees the flag later
    on and moves on to another victim. Although this is possible in
    principle, let's wait for it to actually happen in real life before we
    make the locking more complex again.

    Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
    oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem +
    MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path
    now.

    [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
    Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee a forward progress so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable
    locks, there is no reason to automatically assume those locks are held.
    Moreover, the majority of notifiers only care about a portion of the
    address space, and there is absolutely zero reason to fail when we are
    unmapping an unrelated range. Many notifiers do really block and wait
    for HW, which is harder to handle, and there we have to bail out.

    This patch handles the low hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and
    callbacks are not allowed to sleep if the flag is set to false. This is
    achieved by using a trylock instead of the sleepable lock for most
    callbacks and continuing as long as we do not block down the call
    chain.
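
    A minimal sketch of a callback under this scheme; the driver context
    (my_ctx) and its mutex are hypothetical, but the bool blockable
    parameter and the int return match the interface this change
    introduces:

    static int my_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end,
                                         bool blockable)
    {
            struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

            if (blockable)
                    mutex_lock(&ctx->lock);
            else if (!mutex_trylock(&ctx->lock))
                    return -EAGAIN; /* would block; the oom_reaper retries */

            /* ... invalidate device mappings covering [start, end) ... */

            mutex_unlock(&ctx->lock);
            return 0;
    }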

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the oom. This can be done e.g. after the test faults in all
    the mmu notifier managed memory and set the hard limit to something really
    small. Then we are looking for a proper process tear down.

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Aug, 2018

1 commit

  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from every completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead of
    something only for very special cases. The part starts using
    PIDTYPE_TGID enough so that in __send_signal, where signals are
    actually delivered, we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

18 Aug, 2018

2 commits

  • Add comments describing oom_lock's scope.

    Requested-by: David Rientjes
    Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Tetsuo has pointed out that since 27ae357fa82b ("mm, oom: fix concurrent
    munlock and oom reaper unmap, v3") we have a strong synchronization
    between the oom_killer and victim's exiting because both have to take
    the oom_lock. Therefore the original heuristic to sleep for a short
    time in out_of_memory doesn't serve the original purpose.

    Moreover, Tetsuo has noticed that the short sleep can be more harmful
    than actually useful. Hammering the system with many processes can lead
    to starvation when the task holding the oom_lock blocks for a long
    time (minutes) and stalls any further progress, because the oom_reaper
    depends on the oom_lock as well.

    Drop the short sleep from out_of_memory when we hold the lock. Keep the
    sleep when the trylock fails to throttle the concurrent OOM paths a bit.
    This should be solved in a more reasonable way (e.g. a sleep proportional
    to the time spent in active reclaim etc.), but that is a much more
    complex thing to achieve. This is a quick fixup to remove stale code.
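
    A minimal sketch of the resulting behavior on the allocator side
    (shaped after __alloc_pages_may_oom() of the era; simplified):

    /* Throttle only when the trylock fails, i.e. while somebody else is
     * already handling the OOM; no sleep is taken under oom_lock. */
    if (!mutex_trylock(&oom_lock)) {
            *did_some_progress = 1;
            schedule_timeout_uninterruptible(1);
            return NULL;
    }

    /* ... */
    if (out_of_memory(&oc))        /* returns without the old short sleep */
            *did_some_progress = 1;
    mutex_unlock(&oom_lock);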

    Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Reviewed-by: Andrew Morton
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Jul, 2018

1 commit


15 Jun, 2018

1 commit

  • Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate
    when waking pollers") converted most of the memcg event counters to
    per-memcg atomics, which made them less confusing for a user. The
    "oom_kill" counter remained untouched, so now it behaves differently
    than the other counters (including "oom"). This adds nothing but
    confusion.

    Let's fix this by adding the MEMCG_OOM_KILL event, and follow the
    MEMCG_OOM approach.

    This also removes a hack from count_memcg_event_mm(), introduced earlier
    specially for the OOM_KILL counter.

    [akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch]
    Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

08 Jun, 2018

1 commit

  • This patch renames struct page_counter fields:
    count -> usage
    limit -> max

    and the corresponding functions:
    page_counter_limit() -> page_counter_set_max()
    mem_cgroup_get_limit() -> mem_cgroup_get_max()
    mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
    memcg_update_kmem_limit() -> memcg_update_kmem_max()
    memcg_update_tcp_limit() -> memcg_update_tcp_max()

    The idea behind this renaming is to have a direct match between the
    memory cgroup knobs (low, high, max) and the page_counter API.

    This is pure renaming; this patch doesn't bring any functional change.

    Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

12 May, 2018

1 commit

  • Since exit_mmap() is done without the protection of mm->mmap_sem, it is
    possible for the oom reaper to concurrently operate on an mm until
    MMF_OOM_SKIP is set.

    This allows munlock_vma_pages_all() to concurrently run while the oom
    reaper is operating on a vma. Since munlock_vma_pages_range() depends
    on clearing VM_LOCKED from vm_flags before actually doing the munlock to
    determine if any other vmas are locking the same memory, the check for
    VM_LOCKED in the oom reaper is racy.

    This is especially noticeable on architectures such as powerpc where
    clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
    is zapped by the oom reaper during follow_page_mask() after the check
    for pmd_none() is bypassed, this ends up dereferencing a NULL ptl,
    causing a kernel oops.

    Fix this by manually freeing all possible memory from the mm before
    doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not
    run on the mm anymore so the munlock is safe to do in exit_mmap(). It
    also matches the logic that the oom reaper currently uses for
    determining when to set MMF_OOM_SKIP itself, so there's no new risk of
    excessive oom killing.

    This fixes CVE-2018-1000200.
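
    A minimal sketch of the resulting exit_mmap() ordering (a simplified
    fragment; treat the exact shape as an assumption based on the code of
    the era):

    if (unlikely(mm_is_oom_victim(mm))) {
            /* Free everything the oom reaper could free, so there is
             * nothing left for it to race with. */
            __oom_reap_task_mm(mm);

            /* Disregard this mm from now on and wait out any reaper
             * already running: it backs off once MMF_OOM_SKIP is set
             * unless it already holds mmap_sem. */
            set_bit(MMF_OOM_SKIP, &mm->flags);
            down_write(&mm->mmap_sem);
            up_write(&mm->mmap_sem);
    }

    if (mm->locked_vm) {
            /* Safe now: the oom reaper can no longer run on this mm. */
            for (vma = mm->mmap; vma; vma = vma->vm_next)
                    if (vma->vm_flags & VM_LOCKED)
                            munlock_vma_pages_all(vma);
    }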

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: David Rientjes
    Suggested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 Apr, 2018

3 commits

  • I got "oom_reaper: unable to reap pid:" messages when the victim thread
    was blocked inside free_pgtables() (which occurred after returning from
    unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when
    exit_mmap() already set MMF_OOM_SKIP.

    Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
    oom_reaper: unable to reap pid:7558 (a.out)
    a.out D13272 7558 6931 0x00100084
    Call Trace:
    schedule+0x2d/0x80
    rwsem_down_write_failed+0x2bb/0x440
    call_rwsem_down_write_failed+0x13/0x20
    down_write+0x49/0x60
    unlink_file_vma+0x28/0x50
    free_pgtables+0x36/0x100
    exit_mmap+0xbb/0x180
    mmput+0x50/0x110
    copy_process.part.41+0xb61/0x1fe0
    _do_fork+0xe6/0x560
    do_syscall_64+0x74/0x230
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Since the 2.6 kernel, the oom killer has slightly biased away from
    CAP_SYS_ADMIN processes by discounting some of its memory usage in
    comparison to other processes.

    This has always been implicit and nothing exactly relies on the
    behavior.

    Gaurav noticed that __task_cred() can dereference a potentially freed
    pointer if the task under consideration is exiting, because a reference
    to the task_struct is not held.

    Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.

    If any CAP_SYS_ADMIN process would like to be biased against, it is
    always allowed to adjust /proc/pid/oom_score_adj.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reported-by: Gaurav Kohli
    Acked-by: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

01 Feb, 2018

1 commit

  • This uses the new annotation to determine if an mm has mmu notifiers
    with blockable invalidate range callbacks to avoid oom reaping.
    Otherwise, the callbacks are used around unmap_page_range().

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141330120.74052@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Paolo Bonzini
    Cc: Christian König
    Cc: Dimitri Sivanich
    Cc: Andrea Arcangeli
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Oded Gabbay
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Joerg Roedel
    Cc: Doug Ledford
    Cc: Jani Nikula
    Cc: Mike Marciniszyn
    Cc: Sean Hefty
    Cc: Boris Ostrovsky
    Cc: Jérôme Glisse
    Cc: Radim Krčmář
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Dec, 2017

1 commit

  • David Rientjes has reported the following memory corruption while the
    oom reaper tries to unmap the victim's address space:

    BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000
    addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d
    file: (null) fault: (null) mmap: (null) readpage: (null)
    CPU: 2 PID: 1001 Comm: oom_reaper
    Call Trace:
    unmap_page_range+0x1068/0x1130
    __oom_reap_task_mm+0xd5/0x16b
    oom_reaper+0xff/0x14c
    kthread+0xc1/0xe0

    Tetsuo Handa has noticed that the synchronization inside exit_mmap is
    insufficient. We only synchronize with the oom reaper if
    tsk_is_oom_victim, which is not true if the final __mmput is called
    from a different context than the oom victim's exit path. This can
    trivially happen from the context of any task which has grabbed an mm
    reference (e.g. to read a /proc/<pid>/ file which requires the mm,
    etc.).

    The race would look like this:

    oom_reaper              oom_victim              task
                                                    mmget_not_zero
                            do_exit
                              mmput
    __oom_reap_task_mm                              mmput
                                                      __mmput
                                                        exit_mmap
                                                          remove_vma
      unmap_page_range

    Fix this issue by providing a new mm_is_oom_victim() helper which
    operates on the mm struct rather than a task. Any context which
    operates on a remote mm struct should use this helper in place of
    tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
    so it is stable in the exit_mmap path.

    Debugged by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: Michal Hocko
    Reported-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Argangeli
    Cc: [4.14]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

30 Nov, 2017

1 commit

  • tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual
    memory space. In this case, tlb->fullmm is true. Some archs like arm64
    don't flush the TLB when tlb->fullmm is true; see

    commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"),

    which causes TLB entries to be leaked.

    Will clarifies his patch:
    "Basically, we tag each address space with an ASID (PCID on x86) which
    is resident in the TLB. This means we can elide TLB invalidation when
    pulling down a full mm because we won't ever assign that ASID to
    another mm without doing TLB invalidation elsewhere (which actually
    just nukes the whole TLB).

    I think that means that we could potentially not fault on a kernel
    uaccess, because we could hit in the TLB"

    There could be a window between complete_signal() sending IPIs to other
    cores and all threads sharing this mm actually being kicked off their
    cores. In this window, the oom reaper may call tlb_flush_mmu_tlbonly()
    to flush the TLB and then free pages. However, due to the above
    problem, the TLB entries are not really flushed on arm64, so other
    threads can still access these pages through stale TLB entries.
    Moreover, a copy_to_user() can also write to these pages without
    generating a page fault, causing use-after-free bugs.

    This patch gathers each vma instead of gathering the full vm space, so
    tlb->fullmm is not true. The behavior of the oom reaper becomes similar
    to munmapping before do_exit, which should be safe for all archs.
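
    A minimal sketch of the per-vma gathering in the reaper loop (a
    fragment modeled on __oom_reap_task_mm() after this change; helper
    names as in mm/oom_kill.c of the era):

    for (vma = mm->mmap; vma; vma = vma->vm_next) {
            if (!can_madv_dontneed_vma(vma))
                    continue;

            /* Only anonymous, non-shared memory is reaped. */
            if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
                    struct mmu_gather tlb;

                    /* A per-vma range means tlb.fullmm is false, so
                     * arm64 cannot elide the TLB invalidation. */
                    tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
                    unmap_page_range(&tlb, vma, vma->vm_start,
                                     vma->vm_end, NULL);
                    tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
            }
    }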

    Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Wang Nan
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Will Deacon
    Cc: Bob Liu
    Cc: Ingo Molnar
    Cc: Roman Gushchin
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Nan
     

16 Nov, 2017

6 commits

  • alloc_warn() and dump_header() have to explicitly handle NULL nodemask
    which forces both paths to use pr_cont. We can do better. printk
    already handles NULL pointers properly so all we need is to teach
    nodemask_pr_args to handle NULL nodemask carefully. This allows
    simplification of both alloc_warn() and dump_header() and gets rid of
    pr_cont altogether.

    This patch has been motivated by patch from Joe Perches

    http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com

    [akpm@linux-foundation.org: fix tile warning, per Arnd]
    Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Acked-by: Joe Perches
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Since oom_init() is called before userspace processes start, memory
    allocation failure for creating the OOM reaper kernel thread will let
    the OOM killer call panic() rather than wake up the OOM reaper.

    Link: http://lkml.kernel.org/r/1510137800-4602-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Currently, we account page tables separately for each page table level,
    but that's redundant -- we only make use of total memory allocated to
    page tables for oom_badness calculation. We also provide the
    information to userspace, but it has dubious value there too.

    This patch switches page table accounting to single counter.

    mm->pgtables_bytes is now used to account all page table levels. We use
    bytes, because page table size for different levels of page table tree
    may be different.

    The change has a user-visible effect: VmPMD and VmPUD are no longer
    reported in /proc/[pid]/status. It is not clear that anybody uses them.
    (As an alternative, we could always report 0 kB for them.)

    The OOM-killer report is also slightly changed: we now report
    pgtables_bytes instead of nr_ptes, nr_pmds, and nr_puds.

    Apart from reducing the number of per-mm counters, the benefit is that
    we now calculate oom_badness() more correctly on machines where page
    table size differs between levels or where page tables are smaller
    than a page.
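
    For example, oom_badness() can then charge page tables in pages
    regardless of per-level table sizes (a sketch, not the exact diff):

    points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
             mm_pgtables_bytes(p->mm) / PAGE_SIZE;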

    The only downside is debuggability: we no longer know which page table
    level could be leaking. But I do not remember many bugs that would have
    been caught by separate counters, so I wouldn't lose sleep over this.

    [akpm@linux-foundation.org: fix mm/huge_memory.c]
    Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    [kirill.shutemov@linux.intel.com: fix build]
    Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's add wrappers for ->nr_ptes with the same interface as for
    nr_pmds and nr_puds.

    The patch also makes nr_ptes accounting dependent on CONFIG_MMU. Page
    table accounting doesn't make sense if you don't have page tables.

    It's preparation for consolidation of page-table counters in mm_struct.
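
    The wrapper shape this describes, with the CONFIG_MMU dependency,
    might look like the following (a sketch; the counter is still the
    per-level nr_ptes at this point):

    #ifdef CONFIG_MMU
    static inline void mm_inc_nr_ptes(struct mm_struct *mm)
    {
            atomic_long_inc(&mm->nr_ptes);
    }

    static inline void mm_dec_nr_ptes(struct mm_struct *mm)
    {
            atomic_long_dec(&mm->nr_ptes);
    }
    #else
    /* no page tables, nothing to account */
    static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
    static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
    #endif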

    Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • On a machine with 5-level paging support, a process can allocate a
    significant amount of memory and stay unnoticed by the oom-killer and
    the memory cgroup. The trick is to allocate a lot of PUD page tables:
    we don't account PUD page tables, only PMD and PTE.

    We already addressed the same issue for PMD page tables, see commit
    dc6c9a35b66b ("mm: account pmd page tables to the process").
    Introduction of 5-level paging brings the same issue for PUD page
    tables.

    The patch expands accounting to the PUD level; see the sketch below.
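
    A sketch of the accounting hook, modeled on the existing PMD case
    (simplified; arch-specific ifdefs and memory barriers are elided):

    int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
    {
            pud_t *new = pud_alloc_one(mm, address);

            if (!new)
                    return -ENOMEM;

            spin_lock(&mm->page_table_lock);
            if (!p4d_present(*p4d)) {
                    mm_inc_nr_puds(mm);     /* account the new PUD table */
                    p4d_populate(mm, p4d, new);
            } else {
                    pud_free(mm, new);      /* another thread beat us to it */
            }
            spin_unlock(&mm->page_table_lock);
            return 0;
    }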

    [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
    Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
    [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
    Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
    Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Heiko Carstens
    Acked-by: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The kernel may panic when an OOM happens without a killable process.
    Sometimes this is caused by huge unreclaimable slabs used by the
    kernel.

    Although kdump could help debug such a problem, kdump is not available
    on all architectures and it may itself malfunction. And, since the
    kernel is already panicking, it is worth capturing such information in
    dmesg to aid troubleshooting.

    Print out unreclaimable slab info (used size and total size) for slabs
    whose actual memory usage is not zero (num_objs * size != 0) when the
    amount of unreclaimable slabs is greater than total user memory (LRU
    pages); see the sketch after the sample output below.

    The output looks like:

    Unreclaimable slab info:
    Name               Used     Total
    rpc_buffers        31KB     31KB
    rpc_tasks          7KB      7KB
    ebitmap_node       1964KB   1964KB
    avtab_node         5024KB   5024KB
    xfs_buf            1402KB   1402KB
    xfs_ili            134KB    134KB
    xfs_efi_item       115KB    115KB
    xfs_efd_item       115KB    115KB
    xfs_buf_item       134KB    134KB
    xfs_log_item_desc  342KB    342KB
    xfs_trans          1412KB   1412KB
    xfs_ifork          212KB    212KB
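
    A sketch of the trigger condition described above (the exact set of
    LRU counters summed is an assumption):

    static bool is_dump_unreclaim_slabs(void)
    {
            unsigned long nr_lru;

            nr_lru = global_node_page_state(NR_ACTIVE_ANON) +
                     global_node_page_state(NR_INACTIVE_ANON) +
                     global_node_page_state(NR_ACTIVE_FILE) +
                     global_node_page_state(NR_INACTIVE_FILE) +
                     global_node_page_state(NR_ISOLATED_ANON) +
                     global_node_page_state(NR_ISOLATED_FILE) +
                     global_node_page_state(NR_UNEVICTABLE);

            return global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru;
    }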

    [yang.s@alibaba-inc.com: v11]
    Link: http://lkml.kernel.org/r/1507656303-103845-4-git-send-email-yang.s@alibaba-inc.com
    Link: http://lkml.kernel.org/r/1507152550-46205-4-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

04 Oct, 2017

1 commit

  • Andrea has noticed that the oom_reaper doesn't invalidate the range
    via mmu notifiers (mmu_notifier_invalidate_range_start/end), which can
    corrupt the memory of a KVM guest, for example.

    tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
    sufficient as per Andrea:

    "mmu_notifier_invalidate_range cannot be used in replacement of
    mmu_notifier_invalidate_range_start/end. For KVM
    mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
    notifier implementation has to implement either ->invalidate_range
    method or the invalidate_range_start/end methods, not both. And if you
    implement invalidate_range_start/end like KVM is forced to do, calling
    mmu_notifier_invalidate_range in common code is a noop for KVM.

    For those MMU notifiers that can get away only implementing
    ->invalidate_range, the ->invalidate_range is implicitly called by
    mmu_notifier_invalidate_range_end(). And only those secondary MMUs
    that share the same pagetable with the primary MMU (like AMD iommuv2)
    can get away only implementing ->invalidate_range"

    As the callback is allowed to sleep and its implementation is out of
    the MM's hands, it is safer to simply bail out if an mmu notifier is
    registered. In order not to fail too early, perform the
    mm_has_notifiers() check under the oom_lock and take a short nap
    before failing, to give the current oom victim some more time to exit.
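
    A sketch of the bail-out in the reaper path (the surrounding locking
    follows the description above; illustrative only):

    /*
     * Notifier callbacks may sleep and their behavior is out of the MM's
     * hands, so skip reaping and give the victim more time to exit.
     */
    if (mm_has_notifiers(mm)) {
            up_read(&mm->mmap_sem);
            schedule_timeout_idle(HZ);      /* the little nap */
            goto unlock_oom;                /* drops oom_lock */
    }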

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Michal Hocko
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Sep, 2017

1 commit

  • This is purely required because exit_aio() may block and exit_mmap()
    may never start if oom_reap_task() cannot start running on an mm with
    mm_users == 0.

    At the same time, if the OOM reaper didn't wait at all for the memory
    of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it
    would generate a spurious OOM kill.

    If it weren't for exit_aio or similar blocking functions in the last
    mmput, it would be enough to change oom_reap_task(), in the case it
    finds mm_users == 0, to wait for a timeout or for __mmput to set
    MMF_OOM_SKIP itself; but exit_mmap is not the only problem here, so
    letting exit_mmap and oom_reap_task run concurrently is apparently
    warranted.

    It's a non-standard runtime: exit_mmap() runs without mmap_sem, while
    oom_reap_task() runs with mmap_sem for reading, as usual (kind of like
    MADV_DONTNEED).

    The race between the two is solved with a combination of
    tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
    (serialized by a dummy down_write/up_write cycle on the same lines of
    the ksm_exit method).

    If oom_reap_task() may be running concurrently during exit_mmap,
    exit_mmap will wait for it to finish in down_write (before taking down
    mm structures that would make oom_reap_task fail with a
    use-after-free).

    If exit_mmap comes first, oom_reap_task() will skip the mm if
    MMF_OOM_SKIP is already set; in that case all memory has already been
    freed and, furthermore, the mm data structures may already have been
    taken down by free_pgtables.
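
    A sketch of the exit_mmap() side of this handshake (simplified; the
    unmap_vmas() and tlb teardown above it are elided):

    if (unlikely(mm_is_oom_victim(mm))) {
            /*
             * Setting MMF_OOM_SKIP first means a later oom_reap_task()
             * will skip this mm; the dummy write-lock cycle waits for a
             * reaper already inside its read-locked section, so the
             * free_pgtables() below cannot race with it.
             */
            set_bit(MMF_OOM_SKIP, &mm->flags);
            down_write(&mm->mmap_sem);
            up_write(&mm->mmap_sem);
    }
    free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);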

    [aarcange@redhat.com: incremental one liner]
    Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
    [rientjes@google.com: remove unused mmput_async]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
    [aarcange@redhat.com: microoptimization]
    Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
    Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
    Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: David Rientjes
    Reported-by: David Rientjes
    Tested-by: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli