05 Sep, 2018

1 commit

  • [ Upstream commit 7e97de0b033bcac4fa9a35cef72e0c06e6a22c67 ]

    In case of memcg_online_kmem() failure, mem_cgroup::id remains hashed
    in mem_cgroup_idr even after the memcg memory is freed. This leads to a
    leak of the ID in mem_cgroup_idr.

    This patch adds removal into mem_cgroup_css_alloc(), which fixes the
    problem. For better readability, it adds a generic helper which is used
    in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.
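
    A rough sketch of what such a helper can look like (the body is
    reconstructed from the description above, not copied from the patch):

    static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
    {
            /* unhash the ID exactly once; safe to call on error paths */
            if (memcg->id.id > 0) {
                    idr_remove(&mem_cgroup_idr, memcg->id.id);
                    memcg->id.id = 0;
            }
    }

    With one helper, the failure path of mem_cgroup_css_alloc(),
    mem_cgroup_alloc() and mem_cgroup_id_put_many() can all unhash the ID
    through the same code instead of open-coding idr_remove().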

    Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
    Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Signed-off-by: Kirill Tkhai
    Acked-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     

25 Jul, 2018

1 commit

  • commit 9f15bde671355c351cf20d9f879004b234353100 upstream.

    It was reported that a kernel crash happened in mem_cgroup_iter(), which
    can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.

    Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
    ......
    Call trace:
    mem_cgroup_iter+0x2e0/0x6d4
    shrink_zone+0x8c/0x324
    balance_pgdat+0x450/0x640
    kswapd+0x130/0x4b8
    kthread+0xe8/0xfc
    ret_from_fork+0x10/0x20

    The crash happens at the css_tryget(css) call in mem_cgroup_iter(): the
    iterator dereferences the memcg cached in iter->position, which has been
    freed before and filled with POISON_FREE (0x6b).

    The root cause of the use-after-free is that invalidate_reclaim_iterators()
    fails to reset iter->position to NULL when the css of the memcg is released
    in non-hierarchical mode.
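
    A simplified sketch of what this implies (per-node iterator layout
    abbreviated): in non-hierarchical mode each memcg is its own reclaim
    root, so the stale iter->position lives in the dying memcg's own
    iterators and the invalidation walk has to start at the dying memcg
    itself, not at its parent.

    static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
    {
            struct mem_cgroup *memcg = dead_memcg;
            int nid, i;

            /* walk the dying memcg and all of its ancestors */
            for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                    for_each_node(nid) {
                            struct mem_cgroup_per_node *mz = memcg->nodeinfo[nid];

                            for (i = 0; i <= DEF_PRIORITY; i++)
                                    /* drop a stale cached position, if any */
                                    cmpxchg(&mz->iter[i].position, dead_memcg, NULL);
                    }
            }
    }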

    Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
    Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
    Signed-off-by: Jing Xia
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jing Xia
     

21 Jun, 2018

1 commit

  • [ Upstream commit c892fd82cc0632d425ae011a4dd75eb59e9f84ee ]

    Under heavy memory pressure, a page allocation with __GFP_NOWAIT fails
    easily even though it is an order-0 request. I got the warning below 9
    times during a normal boot.

    : page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
    .. snip ..
    Call trace:
    dump_backtrace+0x0/0x4
    dump_stack+0xa4/0xc0
    warn_alloc+0xd4/0x15c
    __alloc_pages_nodemask+0xf88/0x10fc
    alloc_slab_page+0x40/0x18c
    new_slab+0x2b8/0x2e0
    ___slab_alloc+0x25c/0x464
    __kmalloc+0x394/0x498
    memcg_kmem_get_cache+0x114/0x2b8
    kmem_cache_alloc+0x98/0x3e8
    mmap_region+0x3bc/0x8c0
    do_mmap+0x40c/0x43c
    vm_mmap_pgoff+0x15c/0x1e4
    sys_mmap+0xb0/0xc8
    el0_svc_naked+0x24/0x28
    Mem-Info:
    active_anon:17124 inactive_anon:193 isolated_anon:0
    active_file:7898 inactive_file:712955 isolated_file:55
    unevictable:0 dirty:27 writeback:18 unstable:0
    slab_reclaimable:12250 slab_unreclaimable:23334
    mapped:19310 shmem:212 pagetables:816 bounce:0
    free:36561 free_pcp:1205 free_cma:35615
    Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
    lowmem_reserve[]: 0 1842 1842
    Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
    lowmem_reserve[]: 0 0 0
    DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
    Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
    721350 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    945512 pages RAM
    0 pages HighMem/MovableOnly
    63408 pages reserved
    51200 pages cma reserved

    __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
    and the worker allocation failure is not really critical because we will
    retry on the next kmem charge. We might miss some charges but that
    shouldn't be critical. The excessive allocation failure report is not
    very helpful.
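
    The usual remedy for a best-effort allocation that is retried later is
    __GFP_NOWARN. A sketch of the allocation site with that annotation
    (the worker setup is shown schematically):

    static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
                                                   struct kmem_cache *cachep)
    {
            struct memcg_kmem_cache_create_work *cw;

            /* best effort: silence the failure report, we simply retry on
             * the next kmem charge */
            cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
            if (!cw)
                    return;

            cw->memcg = memcg;
            cw->cachep = cachep;
            INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
            queue_work(memcg_kmem_cache_wq, &cw->work);
    }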

    [mhocko@kernel.org: changelog update]
    Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

16 May, 2018

1 commit

  • commit 4eaf431f6f71bbed40a4c733ffe93a7e8cedf9d9 upstream.

    syzbot has triggered a NULL ptr dereference when allocation fault
    injection enforces a failure and alloc_mem_cgroup_per_node_info()
    initializes memcg->nodeinfo only halfway through.

    But __mem_cgroup_free() still tries to free all per-node data and
    dereferences pn->lruvec_stat_cpu unconditionally even if the specific
    per-node data hasn't been initialized.

    The bug is quite unlikely to hit because small allocations do not fail
    and we would need quite some numa nodes to make struct
    mem_cgroup_per_node large enough to cross the costly order.
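
    The straightforward hardening is a NULL check on the per-node pointer
    before touching it; a sketch of the free path:

    static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
    {
            struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];

            /* the allocation may have failed before this node was set up */
            if (!pn)
                    return;

            free_percpu(pn->lruvec_stat_cpu);
            kfree(pn);
    }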

    Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
    Reported-by: syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com
    Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure")
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrey Ryabinin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

13 Feb, 2018

1 commit

  • [ Upstream commit edbe69ef2c90fc86998a74b08319a01c508bd497 ]

    This patch effectively reverts commit 9f1c2674b328 ("net: memcontrol:
    defer call to mem_cgroup_sk_alloc()").

    Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
    memcg socket memory accounting, as packets received before memcg
    pointer initialization are not accounted and are causing refcounting
    underflow on socket release.

    Actually the free-after-use problem was fixed by
    commit c0576e397508 ("net: call cgroup_sk_alloc() earlier in
    sk_clone_lock()") for the cgroup pointer.

    So, let's revert it and call mem_cgroup_sk_alloc() just before
    cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
    we're cloning, and it holds a reference to the memcg.

    Also, let's drop the BUG_ON(mem_cgroup_is_root()) check from
    mem_cgroup_sk_alloc(). Bumping the root memcg counter is not a good
    reason to panic, and there is no realistic way to hit it.

    Signed-off-by: Roman Gushchin
    Cc: Eric Dumazet
    Cc: David S. Miller
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

05 Dec, 2017

1 commit

  • commit d08afa149acfd00871484ada6dabc3880524cd1c upstream.

    Commit d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout()
    support THP") changed mem_cgroup_swapout() to support transparent huge
    page (THP).

    However the patch missed one location which should be changed for
    correctly handling THPs. The resulting bug will cause the memory
    cgroups whose THPs were swapped out to become zombies on deletion.
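
    The general shape of the fix: every per-swap-entry operation in
    mem_cgroup_swapout() has to scale with the number of entries backing
    the page. An illustrative fragment (the wrapper name
    memcg_swapout_put_refs is made up for the sketch):

    /* a THP occupies hpage_nr_pages(page) swap slots, so the memcg ID
     * references taken per swap entry must be dropped the same number of
     * times, not just once */
    static void memcg_swapout_put_refs(struct mem_cgroup *memcg, struct page *page)
    {
            unsigned int nr_entries = hpage_nr_pages(page);

            mem_cgroup_id_put_many(memcg, nr_entries);
    }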

    Link: http://lkml.kernel.org/r/20171128161941.20931-1-shakeelb@google.com
    Fixes: d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout() support THP")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Huang Ying
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Signed-off-by: Greg Kroah-Hartman

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

10 Oct, 2017

1 commit

  • Instead of calling mem_cgroup_sk_alloc() from BH context,
    it is better to call it from inet_csk_accept() in process context.

    Not only does this remove code from mem_cgroup_sk_alloc(), it also
    fixes a bug: the listener might have been dismantled already, and
    css_get() might cause a use-after-free.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Signed-off-by: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2017

2 commits

    Fix for 4.14: zone device pages always have an elevated refcount of one,
    and thus the page count sanity check in uncharge_page() is inappropriate
    for them.
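
    In practice the sanity check has to tolerate the permanent reference a
    ZONE_DEVICE page carries; a sketch of the adjusted assertion:

    static void uncharge_page(struct page *page, struct uncharge_gather *ug)
    {
            /* a non-zero refcount is only suspicious for ordinary pages;
             * zone device pages always keep one reference */
            VM_BUG_ON_PAGE(page_count(page) && !is_zone_device_page(page), page);

            /* ... the uncharge bookkeeping itself is unchanged ... */
    }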

    [mhocko@suse.com: nano-optimize VM_BUG_ON in uncharge_page]
    Link: http://lkml.kernel.org/r/20170914190011.5217-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Michal Hocko
    Reported-by: Evgeny Baskakov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The following lockdep splat has been noticed during LTP testing

    ======================================================
    WARNING: possible circular locking dependency detected
    4.13.0-rc3-next-20170807 #12 Not tainted
    ------------------------------------------------------
    a.out/4771 is trying to acquire lock:
    (cpu_hotplug_lock.rw_sem){++++++}, at: [] drain_all_stock.part.35+0x18/0x140

    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x175/0x530

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc9/0x230
    __might_fault+0x70/0xa0
    _copy_to_user+0x23/0x70
    filldir+0xa7/0x110
    xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
    xfs_readdir+0x1fa/0x2c0 [xfs]
    xfs_file_readdir+0x30/0x40 [xfs]
    iterate_dir+0x17a/0x1a0
    SyS_getdents+0xb0/0x160
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    -> #2 (&type->i_mutex_dir_key#3){++++++}:
    lock_acquire+0xc9/0x230
    down_read+0x51/0xb0
    lookup_slow+0xde/0x210
    walk_component+0x160/0x250
    link_path_walk+0x1a6/0x610
    path_openat+0xe4/0xd50
    do_filp_open+0x91/0x100
    file_open_name+0xf5/0x130
    filp_open+0x33/0x50
    kernel_read_file_from_path+0x39/0x80
    _request_firmware+0x39f/0x880
    request_firmware_direct+0x37/0x50
    request_microcode_fw+0x64/0xe0
    reload_store+0xf7/0x180
    dev_attr_store+0x18/0x30
    sysfs_kf_write+0x44/0x60
    kernfs_fop_write+0x113/0x1a0
    __vfs_write+0x37/0x170
    vfs_write+0xc7/0x1c0
    SyS_write+0x58/0xc0
    do_syscall_64+0x6c/0x1f0
    return_from_SYSCALL_64+0x0/0x7a

    -> #1 (microcode_mutex){+.+.+.}:
    lock_acquire+0xc9/0x230
    __mutex_lock+0x88/0x960
    mutex_lock_nested+0x1b/0x20
    microcode_init+0xbb/0x208
    do_one_initcall+0x51/0x1a9
    kernel_init_freeable+0x208/0x2a7
    kernel_init+0xe/0x104
    ret_from_fork+0x2a/0x40

    -> #0 (cpu_hotplug_lock.rw_sem){++++++}:
    __lock_acquire+0x153c/0x1550
    lock_acquire+0xc9/0x230
    cpus_read_lock+0x4b/0x90
    drain_all_stock.part.35+0x18/0x140
    try_charge+0x3ab/0x6e0
    mem_cgroup_try_charge+0x7f/0x2c0
    shmem_getpage_gfp+0x25f/0x1050
    shmem_fault+0x96/0x200
    __do_fault+0x1e/0xa0
    __handle_mm_fault+0x9c3/0xe00
    handle_mm_fault+0x16e/0x380
    __do_page_fault+0x24a/0x530
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

    other info that might help us debug this:

    Chain exists of:
    cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem

    Possible unsafe locking scenario:

    CPU0                               CPU1
    ----                               ----
    lock(&mm->mmap_sem);
                                       lock(&type->i_mutex_dir_key#3);
                                       lock(&mm->mmap_sem);
    lock(cpu_hotplug_lock.rw_sem);

    *** DEADLOCK ***

    2 locks held by a.out/4771:
    #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x175/0x530
    #1: (percpu_charge_mutex){+.+...}, at: [] try_charge+0x397/0x6e0

    The problem is very similar to the one fixed by commit a459eeb7b852
    ("mm, page_alloc: do not depend on cpu hotplug locks inside the
    allocator"). We are taking hotplug locks while we can be sitting on top
    of basically arbitrary locks. This just calls for problems.

    We can get rid of {get,put}_online_cpus, fortunately. We do not have to
    be worried about races with memory hotplug because drain_local_stock,
    which is called from both the WQ draining and the memory hotplug
    contexts, is always operating on the local cpu stock with IRQs disabled.

    The only thing to be careful about is that the target memcg doesn't
    vanish while we are still in drain_all_stock, so take a reference on it.
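
    A condensed sketch of drain_all_stock() along those lines: no
    {get,put}_online_cpus(), and each cached memcg is pinned with
    css_tryget() before it is inspected (comments trimmed):

    static void drain_all_stock(struct mem_cgroup *root_memcg)
    {
            int cpu, curcpu;

            if (!mutex_trylock(&percpu_charge_mutex))
                    return;

            curcpu = get_cpu();
            for_each_online_cpu(cpu) {
                    struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                    struct mem_cgroup *memcg = stock->cached;

                    /* pin the cached memcg so it cannot vanish under us */
                    if (!memcg || !stock->nr_pages || !css_tryget(&memcg->css))
                            continue;

                    if (mem_cgroup_is_descendant(memcg, root_memcg) &&
                        !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
                            if (cpu == curcpu)
                                    drain_local_stock(&stock->work);
                            else
                                    schedule_work_on(cpu, &stock->work);
                    }
                    css_put(&memcg->css);
            }
            put_cpu();
            mutex_unlock(&percpu_charge_mutex);
    }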

    Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Artem Savkov
    Tested-by: Artem Savkov
    Cc: Johannes Weiner
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Sep, 2017

6 commits

    Cache the rightmost node of the soft-limit tree such that we can
    optimize __mem_cgroup_largest_soft_limit_node(). The only overhead is
    the extra footprint for the cached pointer, but this should not be an
    issue for mem_cgroup_tree_per_node.
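
    The idea as a sketch (the field name rb_rightmost is illustrative):
    keep a cached pointer to the rightmost node, i.e. the memcg with the
    largest soft-limit excess, maintained on insert/erase, so the lookup
    becomes O(1):

    struct mem_cgroup_tree_per_node {
            struct rb_root rb_root;
            struct rb_node *rb_rightmost;   /* cached largest-excess node */
            spinlock_t lock;
    };

    static struct mem_cgroup_per_node *
    __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
    {
            if (!mctz->rb_rightmost)
                    return NULL;    /* tree is empty */

            return rb_entry(mctz->rb_rightmost,
                            struct mem_cgroup_per_node, tree_node);
    }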

    [dave@stgolabs.net: brain fart #2]
    Link: http://lkml.kernel.org/r/20170731160114.GE21328@linux-80c1.suse
    Link: http://lkml.kernel.org/r/20170719014603.19029-17-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    We've noticed quite a noticeable performance overhead on some hosts with
    significant network traffic when socket memory accounting is enabled.

    Perf top shows that the socket memory uncharge path is hot:
    2.13% [kernel] [k] page_counter_cancel
    1.14% [kernel] [k] __sk_mem_reduce_allocated
    1.14% [kernel] [k] _raw_spin_lock
    0.87% [kernel] [k] _raw_spin_lock_irqsave
    0.84% [kernel] [k] tcp_ack
    0.84% [kernel] [k] ixgbe_poll
    0.83% < workload >
    0.82% [kernel] [k] enqueue_entity
    0.68% [kernel] [k] __fget
    0.68% [kernel] [k] tcp_delack_timer_handler
    0.67% [kernel] [k] __schedule
    0.60% < workload >
    0.59% [kernel] [k] __inet6_lookup_established
    0.55% [kernel] [k] __switch_to
    0.55% [kernel] [k] menu_select
    0.54% libc-2.20.so [.] __memcpy_avx_unaligned

    To address this issue, the existing per-cpu stock infrastructure can be
    used.

    refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
    charge to a per-cpu stock instead of calling atomic
    page_counter_uncharge().

    To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
    will explicitly drain the cached charge, if the cached value exceeds
    CHARGE_BATCH.
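
    A sketch of the two sides of that change (cgroup-v1 tcpmem handling and
    some locking details omitted):

    static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
    {
            struct memcg_stock_pcp *stock;
            unsigned long flags;

            local_irq_save(flags);

            stock = this_cpu_ptr(&memcg_stock);
            if (stock->cached != memcg) {
                    drain_stock(stock);
                    stock->cached = memcg;
            }
            stock->nr_pages += nr_pages;

            /* keep the per-cpu stocks bounded */
            if (stock->nr_pages > CHARGE_BATCH)
                    drain_stock(stock);

            local_irq_restore(flags);
    }

    void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
    {
            mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);

            /* return pages to the cheap per-cpu stock instead of hitting
             * the atomic page counter */
            refill_stock(memcg, nr_pages);
    }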

    This allows us to significantly optimize the load:
    1.21% [kernel] [k] _raw_spin_lock
    1.01% [kernel] [k] ixgbe_poll
    0.92% [kernel] [k] _raw_spin_lock_irqsave
    0.90% [kernel] [k] enqueue_entity
    0.86% [kernel] [k] tcp_ack
    0.85% < workload >
    0.74% perf-11120.map [.] 0x000000000061bf24
    0.73% [kernel] [k] __schedule
    0.67% [kernel] [k] __fget
    0.63% [kernel] [k] __inet6_lookup_established
    0.62% [kernel] [k] menu_select
    0.59% < workload >
    0.59% [kernel] [k] __switch_to
    0.57% libc-2.20.so [.] __memcpy_avx_unaligned

    Link: http://lkml.kernel.org/r/20170829100150.4580-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache-coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for un-addressable device memory, but without all the corner
    cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    HMM pages (private or public device pages) are ZONE_DEVICE pages and
    thus need special handling when it comes to lru or refcount. This patch
    makes sure that memcontrol handles them properly when it faces them.
    These pages are used like regular pages in a process address space,
    either as anonymous pages or as file-backed pages, so from the memcg
    point of view we want to handle them like regular pages, for now at
    least.

    Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Balbir Singh
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    HMM pages (private or public device pages) are ZONE_DEVICE pages and
    thus the page->lru fields of those pages cannot be used. This patch
    rearranges the uncharge path to allow a single page to be uncharged
    without modifying the lru field of the struct page.

    There is no change to the memcontrol logic; it is the same as it was
    before this patch.

    Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().
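
    For reference, the check named in item 2 boils down to "populated but
    not present", which is exactly what a pmd migration entry looks like; a
    sketch of such a helper:

    static inline int is_swap_pmd(pmd_t pmd)
    {
            /* a swap/migration pmd is neither empty nor present */
            return !pmd_none(pmd) && !pmd_present(pmd);
    }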

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

07 Sep, 2017

7 commits

  • Pull cgroup updates from Tejun Heo:
    "Several notable changes this cycle:

    - Thread mode was merged. This will be used for cgroup2 support for
    CPU and possibly other controllers. Unfortunately, CPU controller
    cgroup2 support didn't make this pull request but most contentions
    have been resolved and the support is likely to be merged before
    the next merge window.

    - cgroup.stat now shows the number of descendant cgroups.

    - cpuset now can enable the easier-to-configure v2 behavior on v1
    hierarchy"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cpuset: Allow v2 behavior in v1 cgroup
    cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
    cgroup: remove unneeded checks
    cgroup: misc changes
    cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
    cgroup: re-use the parent pointer in cgroup_destroy_locked()
    cgroup: add cgroup.stat interface with basic hierarchy stats
    cgroup: implement hierarchy limits
    cgroup: keep track of number of descent cgroups
    cgroup: add comment to cgroup_enable_threaded()
    cgroup: remove unnecessary empty check when enabling threaded mode
    cgroup: update debug controller to print out thread mode information
    cgroup: implement cgroup v2 thread support
    cgroup: implement CSS_TASK_ITER_THREADED
    cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
    cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
    cgroup: reorganize cgroup.procs / task write path
    cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
    cgroup: distinguish local and children populated states
    cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
    ...

    Linus Torvalds
     
    TIF_MEMDIE is set only on tasks which were either directly selected by
    the OOM killer or passed through mark_oom_victim from the allocator
    path. tsk_is_oom_victim is more generic and allows identifying all
    tasks (threads) which share the mm with the oom victim.

    Please note that the freezer still needs to check TIF_MEMDIE because we
    cannot thaw tasks which do not participate in oom_victims counting,
    otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
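
    For reference, the helper is roughly of this shape (it looks at the
    oom_mm marker on the shared signal struct, so every thread of the
    victim is covered, not only the one carrying TIF_MEMDIE):

    static inline bool tsk_is_oom_victim(struct task_struct *tsk)
    {
            return tsk->signal->oom_mm;
    }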

    Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    This patch makes mem_cgroup_swapout() work for transparent huge pages
    (THP), moving the memory cgroup charge from memory to swap for a whole
    THP.

    This will be used for the THP swap support. Where a THP may be swapped
    out as a whole to a set of (HPAGE_PMD_NR) continuous swap slots on the
    swap device.

    Link: http://lkml.kernel.org/r/20170724051840.2309-11-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    For a THP (Transparent Huge Page), tail_page->mem_cgroup is NULL. So to
    check whether the page is charged already, we need to check the head
    page. This was not an issue before because it was impossible for a THP
    to be in the swap cache. But after we add support for delaying THP
    splitting until after swap-out, it becomes possible.
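
    In code this is the difference between looking at page->mem_cgroup and
    looking at compound_head(page)->mem_cgroup; a tiny illustrative helper
    (the name is made up for the sketch):

    /* charge state of any sub-page of a THP is recorded on the head page */
    static inline bool page_is_charged(struct page *page)
    {
            return compound_head(page)->mem_cgroup != NULL;
    }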

    Link: http://lkml.kernel.org/r/20170724051840.2309-10-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    A PTE-mapped THP (Transparent Huge Page) is ignored when moving memory
    cgroup charges. But for a THP in the swap cache, the current
    implementation may move the memory cgroup charge for the swap entry of
    a tail page. That isn't correct, because the swap charge for all
    sub-pages of a THP should be moved together. Following the handling of
    PTE-mapped THPs, charge moving for the swap entry of a tail page of a
    THP is now ignored as well.

    Link: http://lkml.kernel.org/r/20170724051840.2309-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Several functions use an enum type as parameter for an event/state, but
    are called in some locations with an argument of a different enum type.
    Adjust the interface of these functions to reality by changing the
    parameter to int.

    This fixes a ton of enum-conversion warnings that are generated when
    building the kernel with clang.

    [mka@chromium.org: also change parameter type of inc/dec/mod_memcg_page_state()]
    Link: http://lkml.kernel.org/r/20170728213442.93823-1-mka@chromium.org
    Link: http://lkml.kernel.org/r/20170727211004.34435-1-mka@chromium.org
    Signed-off-by: Matthias Kaehlcke
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Doug Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
     
    A removed memory cgroup with a defined memory.low and some pagecache
    belonging to it has a very low chance of being freed.

    If a cgroup has been removed, there is likely no memory pressure inside
    the cgroup, and the pagecache is protected from external pressure by
    the defined low limit. The cgroup will be freed only after all of its
    pages have been reclaimed, and that will not happen while there is any
    other reclaimable memory in the system. That means there is a good
    chance that a cold pagecache will reside in memory for an undefined
    amount of time, wasting system resources.

    This problem was fixed earlier by fa06235b8eb0 ("cgroup: reset css on
    destruction"), but it's not the best way to do it, as we can't really
    reset all limits/counters during cgroup offlining.

    Link: http://lkml.kernel.org/r/20170727130428.28856-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

19 Aug, 2017

1 commit

  • Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
    to update the memcg stats:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
    IP: test_clear_page_writeback+0x12e/0x2c0
    [...]
    RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
    Call Trace:

    end_page_writeback+0x47/0x70
    f2fs_write_end_io+0x76/0x180 [f2fs]
    bio_endio+0x9f/0x120
    blk_update_request+0xa8/0x2f0
    scsi_end_request+0x39/0x1d0
    scsi_io_completion+0x211/0x690
    scsi_finish_command+0xd9/0x120
    scsi_softirq_done+0x127/0x150
    __blk_mq_complete_request_remote+0x13/0x20
    flush_smp_call_function_queue+0x56/0x110
    generic_smp_call_function_single_interrupt+0x13/0x30
    smp_call_function_single_interrupt+0x27/0x40
    call_function_single_interrupt+0x89/0x90
    RIP: 0010:native_safe_halt+0x6/0x10

    (gdb) l *(test_clear_page_writeback+0x12e)
    0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
    614 mod_node_page_state(page_pgdat(page), idx, val);
    615 if (mem_cgroup_disabled() || !page->mem_cgroup)
    616 return;
    617 mod_memcg_state(page->mem_cgroup, idx, val);
    618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
    619 this_cpu_add(pn->lruvec_stat->count[idx], val);
    620 }
    621
    622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
    623 gfp_t gfp_mask,

    The issue is that writeback doesn't hold a page reference and the page
    might get freed after PG_writeback is cleared (and the mapping is
    unlocked) in test_clear_page_writeback(). The stat functions looking up
    the page's node or zone are safe, as those attributes are static across
    allocation and free cycles. But page->mem_cgroup is not, and it will
    get cleared if we race with truncation or migration.

    It appears this race window has been around for a while, but it was less
    likely to trigger when the memcg stats were updated first thing after
    PG_writeback was cleared. Recent changes reshuffled this code to update
    the global node stats before the memcg ones, though, stretching the race
    window out to the point where people can reproduce the problem.

    Update test_clear_page_writeback() to look up and pin page->mem_cgroup
    before clearing PG_writeback, then not use that pointer afterward. It
    is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
    but leaves the pageref-holding callsites that aren't affected alone.
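
    A condensed sketch of that ordering, assuming lock_page_memcg() is made
    to return the pinned memcg and a matching __unlock_page_memcg() takes
    it (stat updates simplified; the real function also adjusts node and
    writeback statistics):

    int test_clear_page_writeback(struct page *page)
    {
            struct mem_cgroup *memcg;
            int ret;

            /* pin page->mem_cgroup while the page is still ours */
            memcg = lock_page_memcg(page);

            ret = TestClearPageWriteback(page);
            /* from here on the page may be freed at any time */
            if (ret && memcg)
                    mod_memcg_state(memcg, NR_WRITEBACK, -1);

            /* unlock via the pinned memcg, not via the page */
            __unlock_page_memcg(memcg);
            return ret;
    }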

    Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
    Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
    Signed-off-by: Johannes Weiner
    Reported-by: Jaegeuk Kim
    Tested-by: Jaegeuk Kim
    Reported-by: Bradley Bolen
    Tested-by: Brad Bolen
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Jul, 2017

1 commit

  • css_task_iter currently always walks all tasks. With the scheduled
    cgroup v2 thread support, the iterator would need to handle multiple
    types of iteration. As a preparation, add @flags to
    css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
    is not specified, it walks all tasks as before. When asserted, the
    iterator only walks the group leaders.

    For now, the only user of the flag is cgroup v2 "cgroup.procs" file
    which no longer needs to skip non-leader tasks in cgroup_procs_next().
    Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
    v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
    cgroup" but "list all thread group id's with any threads in the
    cgroup".

    While at it, update cgroup_procs_show() to use task_pid_vnr() instead
    of task_tgid_vnr(). As the iteration guarantees that the function
    only sees group leaders, this doesn't change the output and will allow
    sharing the function for thread iteration.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

11 Jul, 2017

2 commits

  • Alice has reported the following UBSAN splat:

    UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
    signed integer overflow:
    -2147483644 - 2147483525 cannot be represented in type 'long int'
    CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P O 4.9.25-gentoo #4
    Hardware name: XXXXXX, BIOS YYYYYY
    Call Trace:
    dump_stack+0x59/0x87
    ubsan_epilogue+0xe/0x40
    handle_overflow+0xbb/0xf0
    __ubsan_handle_sub_overflow+0x12/0x20
    memcg_check_events.isra.36+0x223/0x360
    mem_cgroup_commit_charge+0x55/0x140
    wp_page_copy+0x34e/0xb80
    do_wp_page+0x1e6/0x1300
    handle_mm_fault+0x88b/0x1990
    __do_page_fault+0x2de/0x8a0
    do_page_fault+0x1a/0x20
    error_code+0x67/0x6c

    The reason is that we subtract two signed values. Let's fix this by
    truly mimicking time_after() and casting the result of the subtraction.
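
    The time_after() idiom does the subtraction on unsigned values and only
    then interprets the sign, which is well defined; schematically (the
    helper name is illustrative):

    /* true once "val" has passed "next", without signed-overflow UB */
    static inline bool threshold_passed(unsigned long val, unsigned long next)
    {
            return (long)(next - val) < 0;
    }

    The threshold check in mem_cgroup_event_ratelimit() takes this form
    after the fix.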

    Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Alice Ferrazzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Make @root exclusive in mem_cgroup_low; it is never considered low when
    looked at directly and is not checked when traversing the tree. In
    effect, @root is handled identically to how root_mem_cgroup was
    previously handled by mem_cgroup_low.

    If @root is not excluded from the checks, a cgroup underneath @root will
    never be considered low during targeted reclaim of @root, e.g. due to
    memory.current > memory.high, unless @root is misconfigured to have
    memory.low > memory.high.

    Excluding @root enables using memory.low to prioritize memory usage
    between cgroups within a subtree of the hierarchy that is limited by
    memory.high or memory.max, e.g. when ROOT owns @root's controls but
    delegates the @root directory to a USER so that USER can create and
    administer children of @root.

    For example, given cgroup A with children B and C:

         A
        / \
       B   C

    and

    1. A/memory.current > A/memory.high
    2. A/B/memory.current < A/B/memory.low
    3. A/C/memory.current >= A/C/memory.low

    As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we
    should reclaim from 'C' until 'A' is no longer high or until we can no
    longer reclaim from 'C'. If 'A', i.e. @root, isn't excluded by
    mem_cgroup_low when reclaiming from 'A', then 'B' won't be considered low
    and we will reclaim indiscriminately from both 'B' and 'C'.
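
    A sketch of mem_cgroup_low() with @root excluded, reconstructed from
    the description above:

    bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
    {
            if (mem_cgroup_disabled())
                    return false;

            if (!root)
                    root = root_mem_cgroup;

            /* @root itself is never low and is not checked while walking up */
            if (memcg == root)
                    return false;

            for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
                    if (page_counter_read(&memcg->memory) >= memcg->low)
                            return false;
            }

            return true;
    }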

    Here is the test I used to confirm the bug and the patch.

    20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
    #!/bin/bash

    x62mb=$((62<<20))
    x66mb=$((66<<20))
    x94mb=$((94<<20))
    x98mb=$((98<<20))

    setup() {
    set -e

    trap teardown EXIT HUP INT TERM

    if [[ ! -e /mnt/1gb.swap ]]; then
    sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
    sudo mkswap /mnt/1gb.swap > /dev/null
    fi
    if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
    sudo swapon /mnt/1gb.swap
    fi

    if [[ ! -e /cgroup/cgroup.controllers ]]; then
    sudo mount -t cgroup2 none /cgroup
    fi

    grep -q memory /cgroup/cgroup.controllers

    sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"

    sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
    sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
    sudo sh -c "echo '96m' > /cgroup/A/memory.high"

    mkdir /cgroup/A/0
    mkdir /cgroup/A/1

    echo 64m > /cgroup/A/0/memory.low
    }

    teardown() {
    set +e

    trap - EXIT HUP INT TERM

    if [[ -z $1 ]]; then
    printf "\n"
    printf "%0.s*" {1..35}
    printf "\nFAILED!\n\n"
    tail /cgroup/A/**/memory.current
    printf "%0.s*" {1..35}
    printf "\n\n"
    fi

    ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %

    sleep 2

    if [[ -e /cgroup/A/0 ]]; then
    rmdir /cgroup/A/0
    fi
    if [[ -e /cgroup/A/1 ]]; then
    rmdir /cgroup/A/1
    fi
    if [[ -e /cgroup/A ]]; then
    sudo rmdir /cgroup/A
    fi
    }

    stress_test() {
    sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
    stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &

    sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
    stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &

    sudo sh -c "echo $$ > /cgroup/cgroup.procs"

    sleep 1

    # A/0 should be consuming more memory than A/1
    [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]

    # A/0 should be consuming ~64mb
    [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]

    # A should cumulatively be consuming ~96mb
    [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]

    # Stop the stressors
    ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
    }

    teardown 1
    setup

    for ((i=1;i
    Acked-by: Vladimir Davydov
    Acked-by: Balbir Singh
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean Christopherson
     

07 Jul, 2017

5 commits

  • lruvecs are at the intersection of the NUMA node and memcg, which is the
    scope for most paging activity.

    Introduce a convenient accounting infrastructure that maintains
    statistics per node, per memcg, and the lruvec itself.

    Then convert over accounting sites for statistics that are already
    tracked in both nodes and memcgs and can be easily switched.

    [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
    Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
    [hannes@cmpxchg.org: don't track uncharged pages at all]
    Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
    [hannes@cmpxchg.org: add missing free_percpu()]
    Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
    [linux@roeck-us.net: hexagon: fix build error caused by include file order]
    Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
    Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Guenter Roeck
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that the slab counters are moved from the zone to the node level we
    can drop the private memcg node stats and use the official ones.

    Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Show the count of oom killer invocations in /proc/vmstat and the count of
    processes killed in a memory cgroup in the knob "memory.events" (in
    memory.oom_control for cgroup v1).

    Also describe the difference between "oom" and "oom_kill" in the memory
    cgroup documentation. Currently, oom in a memory cgroup kills tasks if and
    only if the shortage has happened inside a page fault.

    These counters help in monitoring oom kills; until now the only way was
    grepping for magic words in the kernel log.

    [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
    [akpm@linux-foundation.org: fix comment, per Konstantin]
    Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Roman Guschin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stats interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that
    we cannot saturate the disk bandwidth with a single logical CPU when
    doing page swap-out, even on a high-end server machine, because storage
    performance has improved faster than that of a single logical CPU, and
    it seems that this trend will not change in the near future. On the
    other hand, THP becomes more and more popular because of increased
    memory sizes. So it becomes necessary to optimize THP swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. This is
    particularly helpful for swap reads, which are usually 4k random IO.
    This will improve the performance of THP swap too.

    - It will help with memory fragmentation, especially when THP is
    heavily used by applications. The 2M of continuous pages will be
    freed up after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because the speed at which khugepaged collapses normal pages into a
    THP is quite slow. After a THP is split during swap-out, it will take
    quite a long time for the normal pages to collapse back into a THP
    after being swapped in. High THP utilization also helps the efficiency
    of page-based memory management.

    There are some concerns regarding THP swap in, mainly because possible
    enlarged read/write IO size (for swap in/out) may put more overhead on
    the storage device. To deal with that, the THP swap in should be turned
    on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

    This patchset is the first step for the THP swap support. The plan is
    to delay splitting THP step by step, finally avoid splitting THP during
    the THP swapping out and swap out/in the THP as a whole.

    As the first step, in this patchset, the splitting huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache. This will
    reduce lock acquiring/releasing for the locks used for the swap cache
    management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, splitting huge page is delayed from almost the first step
    of swapping out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding the THP into the swap cache. This
    will batch the corresponding operation, thus improve THP swap out
    throughput.

    This is the first step for the THP swap optimization. The plan is to
    delay splitting the THP step by step and avoid splitting the THP
    finally.

    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out. So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on x86_64 architecture (512). For other
    architectures which want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
    the architecture. In effect, this will enlarge the swap cluster size by
    2 times on x86_64, which may make it harder to find a free cluster when
    the swap space becomes fragmented. So this may reduce continuous swap
    space allocation and sequential writes in theory; the performance tests
    in 0day show no regressions caused by this.

    In the future of THP swap optimization, some information of the swapped
    out THP (such as compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support charge
    or uncharge a swap cluster backing a THP as a whole.
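
    As an illustration of the interface shape this implies (simplified and
    partly reconstructed; cgroup-v1 memsw handling and stat updates left
    out), the uncharge side releases all of the cluster's entries in one
    call:

    void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
    {
            struct mem_cgroup *memcg;
            unsigned short id;

            /* clear the ownership records for all nr_pages entries */
            id = swap_cgroup_record(entry, 0, nr_pages);

            rcu_read_lock();
            memcg = mem_cgroup_from_id(id);
            if (memcg) {
                    if (!mem_cgroup_is_root(memcg))
                            page_counter_uncharge(&memcg->swap, nr_pages);
                    mem_cgroup_id_put_many(memcg, nr_pages);
            }
            rcu_read_unlock();
    }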

    Swap cluster allocate/free functions are added to allocate/free a swap
    cluster for a THP. A fairly simple algorithm is used for swap cluster
    allocation: only the first swap device in the priority list is tried
    when allocating a swap cluster. If that fails, the caller falls back to
    allocating a single swap slot instead. This works well enough for
    normal cases. If the difference in the number of free swap clusters
    among multiple swap devices is significant, it is possible that some
    THPs are split earlier than necessary, for example because of a big
    size difference among multiple swap devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may
    be enhanced in the future with a multi-order radix tree, but because we
    will split the THP soon during swap-out, that optimization doesn't make
    much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swap-out. The page lock will be held while
    allocating the swap cluster, adding the THP to the swap cache and
    splitting the THP. So in code paths other than swap-out, if the THP
    needs to be split, PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... was pretty unclear (to me) as to what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

13 May, 2017

1 commit

    Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he hit a bad_page("page still charged to cgroup")
    when onlining a hwpoison page. While this looks like something that
    shouldn't happen in the first place, because onlining hwpoison pages and
    returning them to the page allocator makes little sense, it shows a
    real problem.

    hwpoison pages do not get freed usually so we do not uncharge them (at
    least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
    API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
    take a css reference for each charged page")) as well and so the
    mem_cgroup and the associated state will never go away. Fix this leak
    by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
    We also have to tweak uncharge_list because it cannot rely on zero ref
    count for these pages.
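
    A sketch of the delete_from_lru_cache() side of the fix (the comments
    paraphrase the reasoning above):

    static int delete_from_lru_cache(struct page *p)
    {
            if (!isolate_lru_page(p)) {
                    ClearPageActive(p);
                    ClearPageUnevictable(p);

                    /* a poisoned page may never drop its refcount to zero,
                     * so uncharge it from its memcg by hand */
                    mem_cgroup_uncharge(p);

                    /* drop the reference taken by isolate_lru_page() */
                    put_page(p);
                    return 0;
            }
            return -EIO;
    }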

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

6 commits

    The memory controller's stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items.

    This increases the size of the event array, but we'll eventually want
    most of the VM events tracked on a per-cgroup basis anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We only ever count single events, drop the @nr parameter. Rename the
    function accordingly. Remove low-information kerneldoc.

    Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty badly
    by it, with refault/majfault rates doubling and tripling during cache
    transitions, and the machines sustaining half-hour periods of 100% IO
    utilization, where they'd previously have sub-minute peaks at 60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cgroups currently don't report how much shmem they use, which can be
    useful data to have, in particular since shmem is included in the
    cache/file item while being reclaimed like anonymous memory.

    Add a counter to track shmem pages during charging and uncharging.

    Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Chris Down
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner