06 Apr, 2018

3 commits

  • Kswapd will not wake up if per-zone watermarks are not failing or if too
    many previous attempts at background reclaim have failed.

    This can be true if there is a lot of free memory available. For high-
    order allocations, kswapd is responsible for waking up kcompactd for
    background compaction. If the zone is not below its watermarks or
    reclaim has recently failed (lots of free memory, nothing left to
    reclaim), kcompactd does not get woken up.

    When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
    woken up even if kswapd will not reclaim. This allows high-order
    allocations, such as thp, to still trigger background compaction even
    when the zone has an abundance of free memory.
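
    The new decision can be sketched as a small userspace model (the
    function names and boolean inputs are illustrative, not the kernel's
    actual code):

    ```c
    #include <stdbool.h>

    /* kswapd stays asleep when watermarks are fine or reclaim keeps failing. */
    bool kswapd_should_wake(bool watermarks_ok, bool reclaim_failed)
    {
        return !watermarks_ok && !reclaim_failed;
    }

    /*
     * Before the fix, kcompactd was only kicked by kswapd, so plenty of
     * free memory meant no background compaction.  After the fix, an
     * allocation that may not enter direct reclaim (!__GFP_DIRECT_RECLAIM)
     * can still get kcompactd woken.
     */
    bool kcompactd_should_wake(bool watermarks_ok, bool reclaim_failed,
                               bool direct_reclaim_allowed)
    {
        if (kswapd_should_wake(watermarks_ok, reclaim_failed))
            return true;    /* kswapd will run and kick kcompactd itself */
        return !direct_reclaim_allowed;    /* the fix */
    }
    ```

    A THP fault with an abundance of free memory (`watermarks_ok = true`)
    and no `__GFP_DIRECT_RECLAIM` now yields true, i.e. background
    compaction still gets triggered.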

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since we no longer use the return value of shrink_slab() for normal
    reclaim, the comment is no longer true. If some do_shrink_slab() call
    takes unexpectedly long (the root cause of the stall is currently
    unknown) while register_shrinker()/unregister_shrinker() is pending,
    trying to drop caches via /proc/sys/vm/drop_caches could become an
    infinite cond_resched() loop if many mem_cgroups are defined. For
    safety, let's not pretend forward progress.
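
    The failure mode can be illustrated with a small userspace model (the
    "more than 10 objects freed" threshold mirrors drop_slab_node(); the
    rest is made up for illustration):

    ```c
    /*
     * drop_caches keeps calling the shrink pass while it reports more than
     * 10 objects freed.  If a pass that actually frees nothing "pretends"
     * one object of progress per memcg, a machine with many memcgs reports
     * nr_memcgs objects freed on every pass and the loop never terminates.
     * Returns the number of passes executed (capped at max_passes).
     */
    int drop_slab_model(int nr_memcgs, int pretend_progress, int max_passes)
    {
        int passes = 0;
        unsigned long freed;

        do {
            freed = 0;
            for (int i = 0; i < nr_memcgs; i++) {
                /* nothing left to reclaim in this memcg... */
                unsigned long ret = 0;
                /* ...but a "pretend progress" return claims otherwise */
                if (pretend_progress)
                    ret = 1;
                freed += ret;
            }
            passes++;
        } while (freed > 10 && passes < max_passes);

        return passes;
    }
    ```

    With 100 memcgs and pretended progress the loop only stops at the
    artificial cap; reporting honest zero progress terminates after one
    pass.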

    Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Dave Chinner
    Cc: Glauber Costa
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address space to be freed at the
    same time. This may lead to the following race.

    CPU1                                                 CPU2

    truncate(inode)                                      shrink_active_list()
      ...                                                  page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                          mapping = page_mapping(page);
                page->mapping = NULL;
              ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded
        address_space

                                                             mapping_unevictable(mapping)
                                                               test_bit(AS_UNEVICTABLE, &mapping->flags);
      - we've dereferenced mapping which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable()
    too.

    The address_space embedded in the inode and in the swap cache will be
    freed after an RCU grace period. So the races are fixed by enclosing
    the page_mapping() call and the address_space usage in
    rcu_read_lock()/rcu_read_unlock(). Comments are added in the code to
    make it clear what is protected by the RCU read lock.

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

23 Mar, 2018

1 commit

  • Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
    pages on the LRU") added flusher invocation to shrink_inactive_list()
    when many dirty pages on the LRU are encountered.

    However, shrink_inactive_list() doesn't wake up flushers for legacy
    cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old
    flusher wakeup from direct reclaim path") removed the only source of
    flusher wakeup in the legacy mem cgroup reclaim path.

    This leads to premature OOM if there are too many dirty pages in the cgroup:
    # mkdir /sys/fs/cgroup/memory/test
    # echo $$ > /sys/fs/cgroup/memory/test/tasks
    # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # dd if=/dev/zero of=tmp_file bs=1M count=100
    Killed

    dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0

    Call Trace:
    dump_stack+0x46/0x65
    dump_header+0x6b/0x2ac
    oom_kill_process+0x21c/0x4a0
    out_of_memory+0x2a5/0x4b0
    mem_cgroup_out_of_memory+0x3b/0x60
    mem_cgroup_oom_synchronize+0x2ed/0x330
    pagefault_out_of_memory+0x24/0x54
    __do_page_fault+0x521/0x540
    page_fault+0x45/0x50

    Task in /test killed as a result of limit of /test
    memory: usage 51200kB, limit 51200kB, failcnt 73
    memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
    mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
    Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
    oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Wake up flushers in legacy cgroup reclaim too.

    Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
    Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
    Signed-off-by: Andrey Ryabinin
    Tested-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

22 Feb, 2018

1 commit

  • When a thread mlocks an address range backed either by file pages which
    are currently not present in memory or by swapped-out anon pages (not in
    swapcache), a new page is allocated and added to the local pagevec
    (lru_add_pvec), I/O is triggered, and the thread then sleeps on the page.
    On I/O completion, the thread can wake on a different CPU; the mlock
    syscall will then set the PageMlocked() bit of the page but will not be
    able to put that page on the unevictable LRU because the page is on the
    pagevec of a different CPU. Even on drain, that page will go to the
    evictable LRU because the PageMlocked() bit is not checked on pagevec
    drain.

    The page will eventually go to the right LRU on reclaim, but the LRU
    stats will remain skewed for a long time.

    This patch puts all the pages, even unevictable ones, on the pagevecs;
    on drain, the pages are added to their LRUs correctly by checking their
    evictability. This resolves the issue of mlocked pages on other CPUs'
    pagevecs because when those pagevecs are drained, the mlocked file
    pages go to the unevictable LRU. This also makes the race with munlock
    easier to resolve because the pagevec drains happen under the LRU lock.

    However, there is still one place which makes a page evictable and does
    a PageLRU check on that page without the LRU lock and needs special
    attention: TestClearPageMlocked() and isolate_lru_page() in
    clear_page_mlock().

    #0: __pagevec_lru_add_fn           #1: clear_page_mlock

    SetPageLRU()                       if (!TestClearPageMlocked())
                                         return
    smp_mb() //
    Acked-by: Vlastimil Babka
    Cc: Jérôme Glisse
    Cc: Huang Ying
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Jan Kara
    Cc: Nicholas Piggin
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

07 Feb, 2018

1 commit


01 Feb, 2018

4 commits

  • Minchan Kim asked the following question: what lock protects the
    address_space from being destroyed when a race happens between inode
    truncation and __isolate_lru_page()? Jan Kara clarified by describing
    the race as follows:

    CPU1                                                 CPU2

    truncate(inode)                                      __isolate_lru_page()
      ...
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                        mapping = page_mapping(page);
                page->mapping = NULL;
              ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded
        address_space

                                                           if (mapping && !mapping->a_ops->migratepage)
      - we've dereferenced mapping which is potentially already free.

    The race is theoretically possible but unlikely. Before
    delete_from_page_cache, truncate_cleanup_page is called, so the page is
    likely to be !PageDirty or PageWriteback, which gets skipped by the
    only caller that checks the mapping in __isolate_lru_page. Even if the
    race occurs, a substantial amount of work has to happen during a tiny
    window with no preemption, but it could potentially be done using a
    virtual machine to artificially slow one CPU or halt it during the
    critical window.

    This patch should eliminate the race with truncation by try-locking the
    page before dereferencing mapping and aborting if the lock was not
    acquired. There was a suggestion from Huang Ying to use RCU as a
    side effect to prevent the mapping being freed. However, I do not like
    that solution, as it's an unconventional means of preserving a mapping
    and it's not a context where rcu_read_lock is obviously protecting RCU
    data.

    Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
    Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again")
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the unused function pgdat_reclaimable_pages(), along with
    node_page_state_snapshot(), which thereby becomes unused as well.

    Link: http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Shakeel Butt reported that he has observed in production systems that
    the job loader gets stuck for 10s of seconds while doing a mount
    operation. It turns out that it was stuck in register_shrinker()
    because some unrelated job was under memory pressure and was spending
    time in shrink_slab(). Machines have a lot of shrinkers registered, and
    jobs under memory pressure have to traverse all of those memcg-aware
    shrinkers, affecting unrelated jobs which want to register their own
    shrinkers.

    To solve the issue, this patch simply bails out of slab shrinking if it
    is found that someone wants to register a shrinker in parallel. A
    downside is that it could cause unfair shrinking between shrinkers.
    However, that should be rare, and we can add more complicated logic if
    it turns out not to be enough.
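
    The bail-out can be sketched in userspace C (illustrative; the kernel
    checks rwsem_is_contended(&shrinker_rwsem) inside the shrinker walk --
    here a pending register_shrinker() is modeled as contention appearing
    after `contention_at` shrinkers have been visited):

    ```c
    #include <stdbool.h>

    /*
     * Walk nr_shrinkers registered shrinkers, but stop early as soon as a
     * writer (a register_shrinker() caller) is waiting, so that reclaim
     * does not block registration for the whole walk.
     * contention_at < 0 means "no writer ever shows up".
     */
    unsigned long shrink_slab_model(int nr_shrinkers, int contention_at,
                                    int *nr_visited)
    {
        unsigned long freed = 0;

        *nr_visited = 0;
        for (int i = 0; i < nr_shrinkers; i++) {
            bool writer_waiting = contention_at >= 0 && i >= contention_at;

            if (writer_waiting)
                break;      /* bail out so register_shrinker() can proceed */
            freed += 1;     /* pretend each shrinker frees one object */
            (*nr_visited)++;
        }
        return freed;
    }
    ```

    The cost of the early exit is exactly the unfairness mentioned above:
    shrinkers past the contention point are not shrunk on this pass.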

    [akpm@linux-foundation.org: tweak code comment]
    Link: http://lkml.kernel.org/r/20171115005602.GB23810@bbox
    Link: http://lkml.kernel.org/r/1511481899-20335-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Signed-off-by: Shakeel Butt
    Reported-by: Shakeel Butt
    Tested-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Previously we were using the ratio of the number of LRU pages scanned
    to the number of eligible LRU pages to determine the number of slab
    objects to scan. The problem with this is that these two things have
    nothing to do with each other, so in slab-heavy workloads where there
    is little to no page cache, we can end up with the number of pages
    scanned being very low. This means that we reclaim next to no slab
    pages and waste a lot of time reclaiming small amounts of space.

    Consider the following scenario, where we have the following values and
    the rest of the memory usage is in slab

    Active: 58840 kB
    Inactive: 46860 kB

    Every time we do a get_scan_count() we do this

    scan = size >> sc->priority

    where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
    through reclaim would result in a scan target of 2 pages out of 11715
    total inactive pages, and 3 pages out of 14710 total active pages. This
    is a really small target for a system that is entirely slab pages. And
    this is optimistic; it assumes we even get to scan these pages. We
    don't increment sc->nr_scanned unless we 1) isolate the page, which
    assumes it's not in use, and 2) can lock the page. Under pressure these
    numbers could go down further; there are surely some random pages from
    daemons that aren't actually in use, so the targets get even smaller.

    Instead, use sc->priority in the same way we use it to determine scan
    amounts for the LRUs. This generally equates to pages. Consider the
    following

    slab_pages = (nr_objects * object_size) / PAGE_SIZE

    What we would like to do is

    scan = slab_pages >> sc->priority

    but we don't know the number of slab pages each shrinker controls, only
    the objects. However say that theoretically we knew how many pages a
    shrinker controlled, we'd still have to convert this to objects, which
    would look like the following

    scan = shrinker_pages >> sc->priority
    scan_objects = (PAGE_SIZE / object_size) * scan

    or written another way

    scan_objects = (shrinker_pages >> sc->priority) *
    (PAGE_SIZE / object_size)

    which can thus be written

    scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
    sc->priority

    which is just

    scan_objects = nr_objects >> sc->priority

    We don't need to know exactly how many pages each shrinker represents;
    its object counts are all the information we need. Making this change
    allows us to place an appropriate amount of pressure on the shrinker
    pools relative to their size.
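
    The algebra above can be spot-checked with a tiny C model (PAGE_SIZE
    and the object size are example values; the identity holds exactly when
    shrinker_pages is a multiple of 2^priority and object_size divides the
    page size):

    ```c
    #define MODEL_PAGE_SIZE 4096UL

    /* scan in pages first, then convert pages back to objects */
    unsigned long scan_via_pages(unsigned long shrinker_pages,
                                 unsigned long object_size, int priority)
    {
        return (shrinker_pages >> priority) * (MODEL_PAGE_SIZE / object_size);
    }

    /* what the patch actually does: scale the object count directly */
    unsigned long scan_via_objects(unsigned long nr_objects, int priority)
    {
        return nr_objects >> priority;
    }
    ```

    With object_size = 256, shrinker_pages = 1 << 20 (4 GiB of slab) and
    priority = 12 (DEF_PRIORITY), both routes give a scan target of 4096
    objects, confirming that the object count alone is enough.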

    Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Dave Chinner
    Acked-by: Andrey Ryabinin
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

19 Dec, 2017

1 commit

  • Syzbot caught an oops at unregister_shrinker() because the combination
    of commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
    injection made register_shrinker() fail, and the caller of
    register_shrinker() did not check for failure.

    ----------
    [ 554.881422] FAULT_INJECTION: forcing a failure.
    [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0
    [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.881445] Call Trace:
    [ 554.881459] dump_stack+0x194/0x257
    [ 554.881474] ? arch_local_irq_restore+0x53/0x53
    [ 554.881486] ? find_held_lock+0x35/0x1d0
    [ 554.881507] should_fail+0x8c0/0xa40
    [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0
    [ 554.881537] ? check_noncircular+0x20/0x20
    [ 554.881546] ? find_next_zero_bit+0x2c/0x40
    [ 554.881560] ? ida_get_new_above+0x421/0x9d0
    [ 554.881577] ? find_held_lock+0x35/0x1d0
    [ 554.881594] ? __lock_is_held+0xb6/0x140
    [ 554.881628] ? check_same_owner+0x320/0x320
    [ 554.881634] ? lock_downgrade+0x990/0x990
    [ 554.881649] ? find_held_lock+0x35/0x1d0
    [ 554.881672] should_failslab+0xec/0x120
    [ 554.881684] __kmalloc+0x63/0x760
    [ 554.881692] ? lock_downgrade+0x990/0x990
    [ 554.881712] ? register_shrinker+0x10e/0x2d0
    [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320
    [ 554.881737] register_shrinker+0x10e/0x2d0
    [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0
    [ 554.881755] ? _down_write_nest_lock+0x120/0x120
    [ 554.881765] ? memcpy+0x45/0x50
    [ 554.881785] sget_userns+0xbcd/0xe20
    (...snipped...)
    [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled
    [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 554.898732] general protection fault: 0000 [#1] SMP KASAN
    [ 554.898737] Dumping ftrace buffer:
    [ 554.898741] (ftrace buffer empty)
    [ 554.898743] Modules linked in:
    [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
    [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
    [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
    [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
    [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
    [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
    [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
    [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
    [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    [ 554.898818] Call Trace:
    [ 554.898828] unregister_shrinker+0x79/0x300
    [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750
    [ 554.898844] ? down_write+0x87/0x120
    [ 554.898851] ? deactivate_super+0x139/0x1b0
    [ 554.898857] ? down_read+0x150/0x150
    [ 554.898864] ? check_same_owner+0x320/0x320
    [ 554.898875] deactivate_locked_super+0x64/0xd0
    [ 554.898883] deactivate_super+0x141/0x1b0
    ----------

    Since allowing register_shrinker() callers to call unregister_shrinker()
    even when register_shrinker() failed can simplify the error recovery
    path, this patch makes unregister_shrinker() a no-op when
    register_shrinker() failed. Also, reset shrinker->nr_deferred in case
    unregister_shrinker() is mistakenly called twice.
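
    The idea can be modeled in userspace C (illustrative, not the kernel
    code): nr_deferred doubles as the "successfully registered" marker, so
    unregistering after a failed or already-undone registration is a
    harmless no-op.

    ```c
    #include <stdlib.h>

    struct shrinker_model {
        long *nr_deferred;  /* NULL means "not (successfully) registered" */
    };

    int register_shrinker_model(struct shrinker_model *s, int inject_failure)
    {
        s->nr_deferred = inject_failure ? NULL : calloc(1, sizeof(long));
        if (!s->nr_deferred)
            return -1;      /* allocation failed; nothing was registered */
        return 0;
    }

    void unregister_shrinker_model(struct shrinker_model *s)
    {
        if (!s->nr_deferred)    /* register failed, or already unregistered */
            return;
        free(s->nr_deferred);
        s->nr_deferred = NULL;  /* makes a second unregister a no-op too */
    }
    ```

    Callers can thus unregister unconditionally on their error paths
    without first remembering whether registration succeeded.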

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Aliaksei Karaliou
    Reported-by: syzbot
    Cc: Glauber Costa
    Cc: Al Viro
    Signed-off-by: Al Viro

    Tetsuo Handa
     

16 Nov, 2017

2 commits

  • Most callers of free_hot_cold_page claim the pages being released are
    cache hot. The exception is the page reclaim paths, where it is likely
    that enough pages will be freed in the near future that the per-cpu
    lists are going to be recycled and the cache hotness information is
    lost. As no one really cares about the hotness of pages being released
    to the allocator, just ditch the parameter.

    The APIs are renamed to indicate that it's no longer about hot/cold
    pages. It should also be less confusing, as there are subtle
    differences between them: __free_pages drops a reference and frees a
    page when the refcount reaches zero, while free_hot_cold_page handled
    pages whose refcount was already zero, which is non-obvious from the
    name. free_unref_page should be more obvious.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    [mgorman@techsingularity.net: add pages to head, not tail]
    Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
    Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") 'pgdat->inactive_ratio' is not used, except for printing
    "node_inactive_ratio: 0" in /proc/zoneinfo output.

    Remove it.

    Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to apply to a
    file was done in a spreadsheet of side-by-side results from the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

03 Oct, 2017

1 commit


07 Sep, 2017

4 commits

  • When swapping out a THP (Transparent Huge Page), instead of swapping
    out the THP as a whole, sometimes we have to fall back to splitting the
    THP into normal pages before swapping, because no free swap clusters
    are available, the cgroup limit is exceeded, etc. To count the number
    of such fallbacks, a new VM event THP_SWPOUT_FALLBACK is added and
    counted when we fall back to splitting the THP.

    Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • In this patch, splitting a transparent huge page (THP) during swap-out
    is delayed from after adding the THP to the swap cache until after
    swap-out finishes. After the patch, more operations for anonymous THP
    reclaim, such as writing the THP to the swap device and removing the
    THP from the swap cache, can be batched, so the performance of
    anonymous THP swap-out is improved.

    This is the second step for the THP swap support. The plan is to delay
    splitting the THP step by step and avoid splitting the THP finally.

    With the patchset, the swap out throughput improves 42% (from about
    5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
    with 16 processes. At the same time, the IPI (reflect TLB flushing)
    reduced about 78.9%. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    Link: http://lkml.kernel.org/r/20170724051840.2309-12-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Tetsuo Handa has reported [1][2][3] that direct reclaimers might get
    stuck in the too_many_isolated loop essentially forever because the
    last few pages on the LRU lists are isolated by kswapd, which is stuck
    on fs locks when doing the pageout or slab reclaim. This in turn means
    that there is nobody to actually trigger the OOM killer and the system
    is basically unusable.

    too_many_isolated was introduced by commit 35cd78156c49 ("vmscan:
    throttle direct reclaim when too many pages are isolated already") to
    prevent premature OOM killer invocations, because back then no reclaim
    progress could indeed trigger the OOM killer too early.

    But since the OOM detection rework in commit 0a0337e0d1d1 ("mm, oom:
    rework oom detection"), the allocation/reclaim retry loop considers all
    the reclaimable pages and throttles the allocation at that layer, so we
    can loosen the direct reclaim throttling.

    Make the shrink_inactive_list loop over too_many_isolated bounded and
    return immediately when the situation hasn't resolved after the first
    sleep.

    Replace congestion_wait with a simple schedule_timeout_interruptible,
    because we are not really waiting on IO congestion in this path.
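
    The bounded loop can be modeled in userspace C (illustrative; the
    kernel sleeps ~100ms between checks, modeled here by counting sleeps):

    ```c
    #include <stdbool.h>

    /*
     * At most one sleep is attempted: if the isolation count is still too
     * high afterwards, reclaim gives up (returns with gave_up set) instead
     * of looping forever.  Returns the number of sleeps performed.
     */
    int throttle_isolation(bool (*too_many_isolated)(void *ctx), void *ctx,
                           bool *gave_up)
    {
        bool stalled = false;
        int sleeps = 0;

        *gave_up = false;
        while (too_many_isolated(ctx)) {
            if (stalled) {
                *gave_up = true;   /* bail out; reclaim returns 0 progress */
                break;
            }
            sleeps++;              /* stand-in for the 100ms sleep */
            stalled = true;
        }
        return sleeps;
    }

    /* helpers for exercising the model */
    bool always_too_many(void *ctx) { (void)ctx; return true; }
    bool never_too_many(void *ctx) { (void)ctx; return false; }
    ```

    Even if pages stay isolated indefinitely (the kswapd-stuck-on-fs-locks
    case), a direct reclaimer now sleeps once and moves on, so the OOM
    killer can eventually be reached.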

    Please note that this patch can theoretically cause the OOM killer to
    trigger earlier while there are many pages isolated for reclaim that is
    making progress only very slowly. This would be obvious from the OOM
    report, as the number of isolated pages is printed there. If we ever
    hit this, should_reclaim_retry should consider those numbers in its
    evaluation in one way or another.

    [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
    [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
    [3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp

    [mhocko@suse.com: switch to uninterruptible sleep]
    Link: http://lkml.kernel.org/r/20170724065048.GB25221@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170710074842.23175-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some shrinkers may only be able to free a bunch of objects at a time,
    and so free more than the requested nr_to_scan in one pass.

    Other shrinkers, meanwhile, may find themselves unable to scan as many
    objects as they counted, and so underreport. Account for the extra
    freed/scanned objects against the total number of objects we intend to
    scan, otherwise we may end up penalising the slab far more than
    intended. Similarly, we want to add the underperforming scan to the
    deferred pass so that we try harder and harder in future passes.

    Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
    Signed-off-by: Chris Wilson
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Joonas Lahtinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

10 Aug, 2017

1 commit

  • A while ago someone, and I cannot find the email just now, asked if we
    could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
    like we use for other things like workqueues etc. I think this should
    be possible which allows reducing the 'irq' states and will reduce the
    amount of __bfs() lookups we do.

    Removing the 1 IRQ state results in 4 fewer __bfs() walks per
    dependency, improving lockdep performance. And by moving this
    annotation out of the lockdep code it becomes easier for the mm people
    to extend.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: iamjoonsoo.kim@lge.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Jul, 2017

1 commit

    __GFP_REPEAT was designed to add retry-but-eventually-fail semantics to
    the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantics for those requests, and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of __GFP_REPEAT flag has been removed for !costly requests we can
    give the original flag a better name and more importantly a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the user
    that the allocator will try really hard but there is no promise of
    success. This works independently of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which
    doesn't even kick background reclaim. Should be used carefully because
    it might deplete the memory and the next user might hit the more
    aggressive reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non-sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because this matches their existing semantics. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user-defined fallback
    behavior is more sensible than retrying forever in the page allocator.

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jul, 2017

1 commit

  • The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan:
    do not swap anon pages just because free+file is low'") reintroduces is
    to prefer swapping anonymous memory rather than thrashing the file lru.

    If the anonymous inactive lru for the set of eligible zones is
    considered low, however, or the length of the list for the given reclaim
    priority does not allow for effective anonymous-only reclaiming, then
    avoid forcing SCAN_ANON. Forcing SCAN_ANON will end up thrashing the
    small list and leave unreclaimed memory on the file lrus.

    If the inactive list is insufficient, fallback to balanced reclaim so
    the file lru doesn't remain untouched.

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Suggested-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Jul, 2017

7 commits

  • Patch series "mm: per-lruvec slab stats"

    Josef is working on a new approach to balancing slab caches and the page
    cache. For this to work, he needs slab cache statistics on the lruvec
    level. These patches implement that by adding infrastructure that
    allows updating and reading generic VM stat items per lruvec, then
    switches some existing VM accounting sites, including the slab
    accounting ones, to this new cgroup-aware API.

    I'll follow up with more patches on this, because there is actually
    substantial simplification that can be done to the memory controller
    when we replace private memcg accounting with making the existing VM
    accounting sites cgroup-aware. But this is enough for Josef to base his
    slab reclaim work on, so here goes.

    This patch (of 5):

    To re-implement slab cache vs. page cache balancing, we'll need the
    slab counters at the lruvec level, which, ever since lru reclaim was
    moved from the zone to the node, is the intersection of the node, not
    the zone, and the memcg.

    We could retain the per-zone counters for when the page allocator dumps
    its memory information on failures, and have counters on both levels -
    which on all but NUMA node 0 is usually redundant. But let's keep it
    simple for now and just move them. If anybody complains we can restore
    the per-zone counters.

    [hannes@cmpxchg.org: fix oops]
    Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stats interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • If there is no compound map for a THP (Transparent Huge Page), it is
    possible that the map count of some sub-pages of the THP is 0. So it is
    better to split the THP before swapping out. In this way, the sub-pages
    not mapped will be freed, and we can avoid the unnecessary swap out
    operations for these sub-pages.

    Link: http://lkml.kernel.org/r/20170515112522.32457-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To swap out a THP (Transparent Huge Page), before splitting the THP, the
    swap cluster will be allocated and the THP will be added into the swap
    cache. But it is possible that the THP cannot be split, so that we must
    delete the THP from the swap cache and free the swap cluster. To avoid
    that, in this patch, whether the THP can be split is checked first.
    The check can only be done racily, but it is good enough for most cases.

    With the patch, the swap out throughput improves 3.6% (from about
    4.16GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    Link: http://lkml.kernel.org/r/20170515112522.32457-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Kirill A. Shutemov [for can_split_huge_page()]
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • add_to_swap() aims to allocate swap space (ie, a swap slot and
    swapcache entry), so if it fails due to lack of space, as can happen
    with a THP (or an hdd swap device asked to do a THP swapout), the
    *caller* rather than add_to_swap() itself should split the THP page
    and retry with base pages, which is more natural.

    Link: http://lkml.kernel.org/r/20170515112522.32457-4-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now that get_swap_page() takes a struct page and allocates swap space
    according to the page size (ie, normal or THP), it is cleaner to
    introduce put_swap_page() as the counterpart of get_swap_page(). It
    calls the right swap slot free function depending on the page's size.

    [ying.huang@intel.com: minor cleanup and fix]
    Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Clang and its -Wunsequenced emits a warning

    mm/vmscan.c:2961:25: error: unsequenced modification and access to 'gfp_mask' [-Wunsequenced]
    .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
    ^

    While it is not clear to me whether the initialization code violates the
    specification (6.7.8 par 19 of ISO/IEC 9899 looks like it disagrees),
    the code is quite confusing and worth cleaning up anyway. Fix this by
    reusing sc.gfp_mask rather than the updated input gfp_mask parameter.

    Link: http://lkml.kernel.org/r/20170510154030.10720-1-nick.desaulniers@gmail.com
    Signed-off-by: Nick Desaulniers
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Desaulniers
     

23 May, 2017

1 commit

  • To enable smp_processor_id() and might_sleep() debug checks earlier, it's
    required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

    Adjust the system_state check in kswapd_run() to handle the extra states.

    Tested-by: Mark Rutland
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt (VMware)
    Acked-by: Vlastimil Babka
    Cc: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20170516184736.119158930@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

13 May, 2017

1 commit

  • Although there are a ton of free swap and anonymous LRU pages in
    eligible zones, OOM happened.

    balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null), order=0, oom_score_adj=0
    CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    oom_kill_process+0x21d/0x3f0
    out_of_memory+0xd8/0x390
    __alloc_pages_slowpath+0xbc1/0xc50
    __alloc_pages_nodemask+0x1a5/0x1c0
    pte_alloc_one+0x20/0x50
    __pte_alloc+0x1e/0x110
    __handle_mm_fault+0x919/0x960
    handle_mm_fault+0x77/0x120
    __do_page_fault+0x27a/0x550
    trace_do_page_fault+0x43/0x150
    do_async_page_fault+0x2c/0x90
    async_page_fault+0x28/0x30
    Mem-Info:
    active_anon:424716 inactive_anon:65314 isolated_anon:0
    active_file:52 inactive_file:46 isolated_file:0
    unevictable:0 dirty:27 writeback:0 unstable:0
    slab_reclaimable:3967 slab_unreclaimable:4125
    mapped:133 shmem:43 pagetables:1674 bounce:0
    free:4637 free_pcp:225 free_cma:0
    Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 992 992 1952
    DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 959
    Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
    DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
    Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
    378 total pagecache pages
    17 pages in swap cache
    Swap cache stats: add 17325, delete 17302, find 0/27
    Free swap = 978940kB
    Total swap = 1048572kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    12629 pages reserved
    0 pages cma reserved
    0 pages hwpoisoned
    [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
    [ 433] 0 433 4904 5 14 3 82 0 upstart-udev-br
    [ 438] 0 438 12371 5 27 3 191 -1000 systemd-udevd

    Investigation showed that the page skipping in isolate_lru_pages makes
    reclaim void because it easily returns zero nr_taken, so LRU shrinking
    effectively does nothing and just raises the reclaim priority
    aggressively. Finally, OOM happens.

    The problem is that get_scan_count determines nr_to_scan with eligible
    zones so although priority drops to zero, it couldn't reclaim any pages
    if the LRU contains mostly ineligible pages.

    get_scan_count:

    size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
    size = size >> sc->priority;

    Assume sc->priority is 0 and the LRU list is as follows.

    N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H

    (Ie, a few eligible pages ('N') are at the head of the LRU but the
    rest ('H') are ineligible pages)

    In that case, size becomes 4, so the VM wants to scan 4 pages, but the
    4 pages at the tail of the LRU are not eligible pages. If
    get_scan_count counts skipped pages, none of the remaining pages are
    reclaimed after scanning 4 pages, so it ends up with OOM.

    This patch makes isolate_lru_pages try to scan pages until it
    encounters pages from eligible zones.

    [akpm@linux-foundation.org: clean up mind-bending `for' statement. Tweak comment text]
    Fixes: 3db65812d688 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
    Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

09 May, 2017

1 commit

  • The previous patch ("mm: prevent potential recursive reclaim due to
    clearing PF_MEMALLOC") has shown that simply setting and clearing
    PF_MEMALLOC in current->flags can result in wrongly clearing a
    pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
    Let's introduce helpers that support proper nesting by saving the
    previous state of the flag, similar to the existing memalloc_noio_* and
    memalloc_nofs_* helpers. Convert existing setting/clearing of
    PF_MEMALLOC within mm to the new helpers.

    There are no known issues with the converted code, but the change makes
    it more robust.

    Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andrey Ryabinin
    Cc: Boris Brezillon
    Cc: Chris Leech
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Josef Bacik
    Cc: Lee Duncan
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

04 May, 2017

7 commits

  • The memory controllers stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We only ever count single events, drop the @nr parameter. Rename the
    function accordingly. Remove low-information kerneldoc.

    Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty bad
    by it, with refault/majfault rates doubling and tripling during cache
    transitions, and the machines sustaining half-hour periods of 100% IO
    utilization, where they'd previously have sub-minute peaks at 60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL, so it's suitable for
    a boolean return value. This patch changes it accordingly.

    Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In 2002, [1] introduced SWAP_AGAIN. At that time, try_to_unmap_one used
    spin_trylock(&mm->page_table_lock), so it was really easy to contend
    and fail to take the lock, and returning SWAP_AGAIN to keep the page's
    LRU status made sense.

    However, the lock has since been changed to a mutex-based one that can
    block instead of skipping the pte, so there is only a small window in
    which SWAP_AGAIN could be returned. Remove SWAP_AGAIN and just return
    SWAP_FAIL.

    [1] c48c43e, minimal rmap

    Link: http://lkml.kernel.org/r/1489555493-14659-7-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • ttu doesn't need to return SWAP_MLOCK. Instead, just return SWAP_FAIL
    because it means the page is not swappable, so it should move to
    another LRU list (active or unevictable). The putback functions will
    move it to the right list depending on the page's LRU flags.

    Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim