06 Apr, 2018

3 commits

  • Kswapd will not wake up if per-zone watermarks are not failing or if too
    many previous attempts at background reclaim have failed.

    This can be true if there is a lot of free memory available. For high-
    order allocations, kswapd is responsible for waking up kcompactd for
    background compaction. If the zone is not below its watermarks or
    reclaim has recently failed (lots of free memory, nothing left to
    reclaim), kcompactd does not get woken up.

    When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
    woken up even if kswapd will not reclaim. This allows high-order
    allocations, such as thp, to still trigger background compaction even
    when the zone has an abundance of free memory.
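
    The new decision can be sketched as a small userspace model (the
    function names and boolean inputs are illustrative, not the kernel's
    actual code):

    ```c
    #include <stdbool.h>

    /* kswapd stays asleep when watermarks are fine or reclaim keeps failing. */
    bool kswapd_should_wake(bool watermarks_ok, bool reclaim_failed)
    {
        return !watermarks_ok && !reclaim_failed;
    }

    /*
     * Before the fix, kcompactd was only kicked by kswapd, so plenty of
     * free memory meant no background compaction.  After the fix, an
     * allocation that may not enter direct reclaim (!__GFP_DIRECT_RECLAIM)
     * can still get kcompactd woken.
     */
    bool kcompactd_should_wake(bool watermarks_ok, bool reclaim_failed,
                               bool direct_reclaim_allowed)
    {
        if (kswapd_should_wake(watermarks_ok, reclaim_failed))
            return true;    /* kswapd will run and kick kcompactd itself */
        return !direct_reclaim_allowed;    /* the fix */
    }
    ```

    A THP fault with an abundance of free memory (`watermarks_ok = true`)
    and no `__GFP_DIRECT_RECLAIM` now yields true, i.e. background
    compaction still gets triggered.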

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since we no longer use the return value of shrink_slab() for normal
    reclaim, the comment is no longer true. If some do_shrink_slab() call
    takes unexpectedly long (the root cause of the stall is currently
    unknown) while register_shrinker()/unregister_shrinker() is pending,
    trying to drop caches via /proc/sys/vm/drop_caches could become an
    infinite cond_resched() loop if many mem_cgroups are defined. For
    safety, let's not pretend forward progress.
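
    The failure mode can be illustrated with a small userspace model (the
    "more than 10 objects freed" threshold mirrors drop_slab_node(); the
    rest is made up for illustration):

    ```c
    /*
     * drop_caches keeps calling the shrink pass while it reports more than
     * 10 objects freed.  If a pass that actually frees nothing "pretends"
     * one object of progress per memcg, a machine with many memcgs reports
     * nr_memcgs objects freed on every pass and the loop never terminates.
     * Returns the number of passes executed (capped at max_passes).
     */
    int drop_slab_model(int nr_memcgs, int pretend_progress, int max_passes)
    {
        int passes = 0;
        unsigned long freed;

        do {
            freed = 0;
            for (int i = 0; i < nr_memcgs; i++) {
                /* nothing left to reclaim in this memcg... */
                unsigned long ret = 0;
                /* ...but a "pretend progress" return claims otherwise */
                if (pretend_progress)
                    ret = 1;
                freed += ret;
            }
            passes++;
        } while (freed > 10 && passes < max_passes);

        return passes;
    }
    ```

    With 100 memcgs and pretended progress the loop only stops at the
    artificial cap; reporting honest zero progress terminates after one
    pass.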

    Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Dave Chinner
    Cc: Glauber Costa
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address space to be freed at the
    same time. This may lead to the following race.

    CPU1                                                 CPU2

    truncate(inode)                                      shrink_active_list()
      ...                                                  page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                          mapping = page_mapping(page);
                page->mapping = NULL;
              ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded
        address_space

                                                             mapping_unevictable(mapping)
                                                               test_bit(AS_UNEVICTABLE, &mapping->flags);
      - we've dereferenced mapping which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable()
    too.

    The address_space embedded in the inode and in the swap cache will be
    freed after an RCU grace period. So the races are fixed by enclosing
    the page_mapping() call and the address_space usage in
    rcu_read_lock()/rcu_read_unlock(). Comments are added in the code to
    make it clear what is protected by the RCU read lock.

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

23 Mar, 2018

1 commit

  • Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
    pages on the LRU") added flusher invocation to shrink_inactive_list()
    when many dirty pages on the LRU are encountered.

    However, shrink_inactive_list() doesn't wake up flushers for legacy
    cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old
    flusher wakeup from direct reclaim path") removed the only source of
    flusher wakeup in the legacy mem cgroup reclaim path.

    This leads to premature OOM if there are too many dirty pages in the cgroup:
    # mkdir /sys/fs/cgroup/memory/test
    # echo $$ > /sys/fs/cgroup/memory/test/tasks
    # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # dd if=/dev/zero of=tmp_file bs=1M count=100
    Killed

    dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0

    Call Trace:
    dump_stack+0x46/0x65
    dump_header+0x6b/0x2ac
    oom_kill_process+0x21c/0x4a0
    out_of_memory+0x2a5/0x4b0
    mem_cgroup_out_of_memory+0x3b/0x60
    mem_cgroup_oom_synchronize+0x2ed/0x330
    pagefault_out_of_memory+0x24/0x54
    __do_page_fault+0x521/0x540
    page_fault+0x45/0x50

    Task in /test killed as a result of limit of /test
    memory: usage 51200kB, limit 51200kB, failcnt 73
    memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
    mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
    Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
    oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Wake up flushers in legacy cgroup reclaim too.

    Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
    Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
    Signed-off-by: Andrey Ryabinin
    Tested-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

22 Feb, 2018

1 commit

  • When a thread mlocks an address range backed either by file pages which
    are currently not present in memory or by swapped-out anon pages (not in
    swapcache), a new page is allocated and added to the local pagevec
    (lru_add_pvec), I/O is triggered, and the thread then sleeps on the page.
    On I/O completion, the thread can wake on a different CPU; the mlock
    syscall will then set the PageMlocked() bit of the page but will not be
    able to put that page on the unevictable LRU because the page is on the
    pagevec of a different CPU. Even on drain, that page will go to the
    evictable LRU because the PageMlocked() bit is not checked on pagevec
    drain.

    The page will eventually go to the right LRU on reclaim, but the LRU
    stats will remain skewed for a long time.

    This patch puts all the pages, even unevictable ones, on the pagevecs;
    on drain, the pages are added to their LRUs correctly by checking their
    evictability. This resolves the issue of mlocked pages on other CPUs'
    pagevecs because when those pagevecs are drained, the mlocked file
    pages go to the unevictable LRU. This also makes the race with munlock
    easier to resolve because the pagevec drains happen under the LRU lock.

    However, there is still one place which makes a page evictable and does
    a PageLRU check on that page without the LRU lock and needs special
    attention: TestClearPageMlocked() and isolate_lru_page() in
    clear_page_mlock().

    #0: __pagevec_lru_add_fn           #1: clear_page_mlock

    SetPageLRU()                       if (!TestClearPageMlocked())
                                         return
    smp_mb() //
    Acked-by: Vlastimil Babka
    Cc: Jérôme Glisse
    Cc: Huang Ying
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Jan Kara
    Cc: Nicholas Piggin
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

07 Feb, 2018

1 commit


01 Feb, 2018

4 commits

  • Minchan Kim asked the following question: what lock protects the
    address_space from being destroyed when a race happens between inode
    truncation and __isolate_lru_page()? Jan Kara clarified by describing
    the race as follows:

    CPU1                                                 CPU2

    truncate(inode)                                      __isolate_lru_page()
      ...
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                        mapping = page_mapping(page);
                page->mapping = NULL;
              ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded
        address_space

                                                           if (mapping && !mapping->a_ops->migratepage)
      - we've dereferenced mapping which is potentially already free.

    The race is theoretically possible but unlikely. Before
    delete_from_page_cache, truncate_cleanup_page is called, so the page is
    likely to be !PageDirty or PageWriteback, which gets skipped by the
    only caller that checks the mapping in __isolate_lru_page. Even if the
    race occurs, a substantial amount of work has to happen during a tiny
    window with no preemption, but it could potentially be done using a
    virtual machine to artificially slow one CPU or halt it during the
    critical window.

    This patch should eliminate the race with truncation by try-locking the
    page before dereferencing mapping and aborting if the lock was not
    acquired. There was a suggestion from Huang Ying to use RCU as a
    side effect to prevent the mapping being freed. However, I do not like
    that solution, as it's an unconventional means of preserving a mapping
    and it's not a context where rcu_read_lock is obviously protecting RCU
    data.

    Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
    Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again")
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the unused function pgdat_reclaimable_pages(), along with
    node_page_state_snapshot(), which thereby becomes unused as well.

    Link: http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Shakeel Butt reported that he has observed in production systems that
    the job loader gets stuck for 10s of seconds while doing a mount
    operation. It turns out that it was stuck in register_shrinker()
    because some unrelated job was under memory pressure and was spending
    time in shrink_slab(). Machines have a lot of shrinkers registered, and
    jobs under memory pressure have to traverse all of those memcg-aware
    shrinkers, affecting unrelated jobs which want to register their own
    shrinkers.

    To solve the issue, this patch simply bails out of slab shrinking if it
    is found that someone wants to register a shrinker in parallel. A
    downside is that it could cause unfair shrinking between shrinkers.
    However, that should be rare, and we can add more complicated logic if
    it turns out not to be enough.
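
    The bail-out can be sketched in userspace C (illustrative; the kernel
    checks rwsem_is_contended(&shrinker_rwsem) inside the shrinker walk --
    here a pending register_shrinker() is modeled as contention appearing
    after `contention_at` shrinkers have been visited):

    ```c
    #include <stdbool.h>

    /*
     * Walk nr_shrinkers registered shrinkers, but stop early as soon as a
     * writer (a register_shrinker() caller) is waiting, so that reclaim
     * does not block registration for the whole walk.
     * contention_at < 0 means "no writer ever shows up".
     */
    unsigned long shrink_slab_model(int nr_shrinkers, int contention_at,
                                    int *nr_visited)
    {
        unsigned long freed = 0;

        *nr_visited = 0;
        for (int i = 0; i < nr_shrinkers; i++) {
            bool writer_waiting = contention_at >= 0 && i >= contention_at;

            if (writer_waiting)
                break;      /* bail out so register_shrinker() can proceed */
            freed += 1;     /* pretend each shrinker frees one object */
            (*nr_visited)++;
        }
        return freed;
    }
    ```

    The cost of the early exit is exactly the unfairness mentioned above:
    shrinkers past the contention point are not shrunk on this pass.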

    [akpm@linux-foundation.org: tweak code comment]
    Link: http://lkml.kernel.org/r/20171115005602.GB23810@bbox
    Link: http://lkml.kernel.org/r/1511481899-20335-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Signed-off-by: Shakeel Butt
    Reported-by: Shakeel Butt
    Tested-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Previously we were using the ratio of the number of LRU pages scanned
    to the number of eligible LRU pages to determine the number of slab
    objects to scan. The problem with this is that these two things have
    nothing to do with each other, so in slab-heavy workloads where there
    is little to no page cache, we can end up with the number of pages
    scanned being very low. This means that we reclaim next to no slab
    pages and waste a lot of time reclaiming small amounts of space.

    Consider the following scenario, where we have the following values and
    the rest of the memory usage is in slab

    Active: 58840 kB
    Inactive: 46860 kB

    Every time we do a get_scan_count() we do this

    scan = size >> sc->priority

    where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
    through reclaim would result in a scan target of 2 pages out of 11715
    total inactive pages, and 3 pages out of 14710 total active pages. This
    is a really small target for a system that is entirely slab pages. And
    this is optimistic; it assumes we even get to scan these pages. We
    don't increment sc->nr_scanned unless we 1) isolate the page, which
    assumes it's not in use, and 2) can lock the page. Under pressure these
    numbers could go down further; there are surely some random pages from
    daemons that aren't actually in use, so the targets get even smaller.

    Instead, use sc->priority in the same way we use it to determine scan
    amounts for the LRUs. This generally equates to pages. Consider the
    following

    slab_pages = (nr_objects * object_size) / PAGE_SIZE

    What we would like to do is

    scan = slab_pages >> sc->priority

    but we don't know the number of slab pages each shrinker controls, only
    the objects. However say that theoretically we knew how many pages a
    shrinker controlled, we'd still have to convert this to objects, which
    would look like the following

    scan = shrinker_pages >> sc->priority
    scan_objects = (PAGE_SIZE / object_size) * scan

    or written another way

    scan_objects = (shrinker_pages >> sc->priority) *
    (PAGE_SIZE / object_size)

    which can thus be written

    scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
    sc->priority

    which is just

    scan_objects = nr_objects >> sc->priority

    We don't need to know exactly how many pages each shrinker represents;
    its object counts are all the information we need. Making this change
    allows us to place an appropriate amount of pressure on the shrinker
    pools relative to their size.
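
    The algebra above can be spot-checked with a tiny C model (PAGE_SIZE
    and the object size are example values; the identity holds exactly when
    shrinker_pages is a multiple of 2^priority and object_size divides the
    page size):

    ```c
    #define MODEL_PAGE_SIZE 4096UL

    /* scan in pages first, then convert pages back to objects */
    unsigned long scan_via_pages(unsigned long shrinker_pages,
                                 unsigned long object_size, int priority)
    {
        return (shrinker_pages >> priority) * (MODEL_PAGE_SIZE / object_size);
    }

    /* what the patch actually does: scale the object count directly */
    unsigned long scan_via_objects(unsigned long nr_objects, int priority)
    {
        return nr_objects >> priority;
    }
    ```

    With object_size = 256, shrinker_pages = 1 << 20 (4 GiB of slab) and
    priority = 12 (DEF_PRIORITY), both routes give a scan target of 4096
    objects, confirming that the object count alone is enough.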

    Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Dave Chinner
    Acked-by: Andrey Ryabinin
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

19 Dec, 2017

1 commit

  • Syzbot caught an oops at unregister_shrinker() because the combination
    of commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
    injection made register_shrinker() fail, and the caller of
    register_shrinker() did not check for failure.

    ----------
    [ 554.881422] FAULT_INJECTION: forcing a failure.
    [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0
    [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.881445] Call Trace:
    [ 554.881459] dump_stack+0x194/0x257
    [ 554.881474] ? arch_local_irq_restore+0x53/0x53
    [ 554.881486] ? find_held_lock+0x35/0x1d0
    [ 554.881507] should_fail+0x8c0/0xa40
    [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0
    [ 554.881537] ? check_noncircular+0x20/0x20
    [ 554.881546] ? find_next_zero_bit+0x2c/0x40
    [ 554.881560] ? ida_get_new_above+0x421/0x9d0
    [ 554.881577] ? find_held_lock+0x35/0x1d0
    [ 554.881594] ? __lock_is_held+0xb6/0x140
    [ 554.881628] ? check_same_owner+0x320/0x320
    [ 554.881634] ? lock_downgrade+0x990/0x990
    [ 554.881649] ? find_held_lock+0x35/0x1d0
    [ 554.881672] should_failslab+0xec/0x120
    [ 554.881684] __kmalloc+0x63/0x760
    [ 554.881692] ? lock_downgrade+0x990/0x990
    [ 554.881712] ? register_shrinker+0x10e/0x2d0
    [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320
    [ 554.881737] register_shrinker+0x10e/0x2d0
    [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0
    [ 554.881755] ? _down_write_nest_lock+0x120/0x120
    [ 554.881765] ? memcpy+0x45/0x50
    [ 554.881785] sget_userns+0xbcd/0xe20
    (...snipped...)
    [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled
    [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 554.898732] general protection fault: 0000 [#1] SMP KASAN
    [ 554.898737] Dumping ftrace buffer:
    [ 554.898741] (ftrace buffer empty)
    [ 554.898743] Modules linked in:
    [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
    [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
    [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
    [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
    [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
    [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
    [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
    [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
    [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    [ 554.898818] Call Trace:
    [ 554.898828] unregister_shrinker+0x79/0x300
    [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750
    [ 554.898844] ? down_write+0x87/0x120
    [ 554.898851] ? deactivate_super+0x139/0x1b0
    [ 554.898857] ? down_read+0x150/0x150
    [ 554.898864] ? check_same_owner+0x320/0x320
    [ 554.898875] deactivate_locked_super+0x64/0xd0
    [ 554.898883] deactivate_super+0x141/0x1b0
    ----------

    Since allowing register_shrinker() callers to call unregister_shrinker()
    even when register_shrinker() failed can simplify the error recovery
    path, this patch makes unregister_shrinker() a no-op when
    register_shrinker() failed. Also, reset shrinker->nr_deferred in case
    unregister_shrinker() is mistakenly called twice.
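
    The idea can be modeled in userspace C (illustrative, not the kernel
    code): nr_deferred doubles as the "successfully registered" marker, so
    unregistering after a failed or already-undone registration is a
    harmless no-op.

    ```c
    #include <stdlib.h>

    struct shrinker_model {
        long *nr_deferred;  /* NULL means "not (successfully) registered" */
    };

    int register_shrinker_model(struct shrinker_model *s, int inject_failure)
    {
        s->nr_deferred = inject_failure ? NULL : calloc(1, sizeof(long));
        if (!s->nr_deferred)
            return -1;      /* allocation failed; nothing was registered */
        return 0;
    }

    void unregister_shrinker_model(struct shrinker_model *s)
    {
        if (!s->nr_deferred)    /* register failed, or already unregistered */
            return;
        free(s->nr_deferred);
        s->nr_deferred = NULL;  /* makes a second unregister a no-op too */
    }
    ```

    Callers can thus unregister unconditionally on their error paths
    without first remembering whether registration succeeded.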

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Aliaksei Karaliou
    Reported-by: syzbot
    Cc: Glauber Costa
    Cc: Al Viro
    Signed-off-by: Al Viro

    Tetsuo Handa
     

16 Nov, 2017

2 commits

  • Most callers of free_hot_cold_page claim the pages being released are
    cache hot. The exception is the page reclaim paths, where it is likely
    that enough pages will be freed in the near future that the per-cpu
    lists are going to be recycled and the cache hotness information is
    lost. As no one really cares about the hotness of pages being released
    to the allocator, just ditch the parameter.

    The APIs are renamed to indicate that it's no longer about hot/cold
    pages. It should also be less confusing, as there are subtle
    differences between them: __free_pages drops a reference and frees a
    page when the refcount reaches zero, while free_hot_cold_page handled
    pages whose refcount was already zero, which is non-obvious from the
    name. free_unref_page should be more obvious.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    [mgorman@techsingularity.net: add pages to head, not tail]
    Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
    Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") 'pgdat->inactive_ratio' is not used, except for printing
    "node_inactive_ratio: 0" in /proc/zoneinfo output.

    Remove it.

    Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to apply to a
    file was done in a spreadsheet of side-by-side results from the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

03 Oct, 2017

1 commit


07 Sep, 2017

4 commits

  • When swapping out a THP (Transparent Huge Page), instead of swapping
    out the THP as a whole, sometimes we have to fall back to splitting the
    THP into normal pages before swapping, because no free swap clusters
    are available, the cgroup limit is exceeded, etc. To count the number
    of such fallbacks, a new VM event THP_SWPOUT_FALLBACK is added and
    counted when we fall back to splitting the THP.

    Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • In this patch, splitting a transparent huge page (THP) during swap-out
    is delayed from after adding the THP to the swap cache until after
    swap-out finishes. After the patch, more operations for anonymous THP
    reclaim, such as writing the THP to the swap device and removing the
    THP from the swap cache, can be batched, so the performance of
    anonymous THP swap-out is improved.

    This is the second step for the THP swap support. The plan is to delay
    splitting the THP step by step and avoid splitting the THP finally.

    With the patchset, the swap out throughput improves 42% (from about
    5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
    with 16 processes. At the same time, the IPI (reflect TLB flushing)
    reduced about 78.9%. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    Link: http://lkml.kernel.org/r/20170724051840.2309-12-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Tetsuo Handa has reported [1][2][3] that direct reclaimers might get
    stuck in the too_many_isolated loop essentially forever because the
    last few pages on the LRU lists are isolated by kswapd, which is stuck
    on fs locks when doing the pageout or slab reclaim. This in turn means
    that there is nobody to actually trigger the OOM killer and the system
    is basically unusable.

    too_many_isolated was introduced by commit 35cd78156c49 ("vmscan:
    throttle direct reclaim when too many pages are isolated already") to
    prevent premature OOM killer invocations, because back then no reclaim
    progress could indeed trigger the OOM killer too early.

    But since the OOM detection rework in commit 0a0337e0d1d1 ("mm, oom:
    rework oom detection"), the allocation/reclaim retry loop considers all
    the reclaimable pages and throttles the allocation at that layer, so we
    can loosen the direct reclaim throttling.

    Make the shrink_inactive_list loop over too_many_isolated bounded and
    return immediately when the situation hasn't resolved after the first
    sleep.

    Replace congestion_wait with a simple schedule_timeout_interruptible,
    because we are not really waiting on IO congestion in this path.
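
    The bounded loop can be modeled in userspace C (illustrative; the
    kernel sleeps ~100ms between checks, modeled here by counting sleeps):

    ```c
    #include <stdbool.h>

    /*
     * At most one sleep is attempted: if the isolation count is still too
     * high afterwards, reclaim gives up (returns with gave_up set) instead
     * of looping forever.  Returns the number of sleeps performed.
     */
    int throttle_isolation(bool (*too_many_isolated)(void *ctx), void *ctx,
                           bool *gave_up)
    {
        bool stalled = false;
        int sleeps = 0;

        *gave_up = false;
        while (too_many_isolated(ctx)) {
            if (stalled) {
                *gave_up = true;   /* bail out; reclaim returns 0 progress */
                break;
            }
            sleeps++;              /* stand-in for the 100ms sleep */
            stalled = true;
        }
        return sleeps;
    }

    /* helpers for exercising the model */
    bool always_too_many(void *ctx) { (void)ctx; return true; }
    bool never_too_many(void *ctx) { (void)ctx; return false; }
    ```

    Even if pages stay isolated indefinitely (the kswapd-stuck-on-fs-locks
    case), a direct reclaimer now sleeps once and moves on, so the OOM
    killer can eventually be reached.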

    Please note that this patch can theoretically cause the OOM killer to
    trigger earlier while there are many pages isolated for reclaim that is
    making progress only very slowly. This would be obvious from the OOM
    report, as the number of isolated pages is printed there. If we ever
    hit this, should_reclaim_retry should consider those numbers in its
    evaluation in one way or another.

    [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
    [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
    [3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp

    [mhocko@suse.com: switch to uninterruptible sleep]
    Link: http://lkml.kernel.org/r/20170724065048.GB25221@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170710074842.23175-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some shrinkers may only be able to free a bunch of objects at a time,
    and so free more than the requested nr_to_scan in one pass.

    Other shrinkers, meanwhile, may find themselves unable to scan as many
    objects as they counted, and so underreport. Account for the extra
    freed/scanned objects against the total number of objects we intend to
    scan, otherwise we may end up penalising the slab far more than
    intended. Similarly, we want to add the underperforming scan to the
    deferred pass so that we try harder and harder in future passes.

    Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
    Signed-off-by: Chris Wilson
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Joonas Lahtinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

10 Aug, 2017

1 commit

  • A while ago someone, and I cannot find the email just now, asked if we
    could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
    like we use for other things like workqueues etc. I think this should
    be possible which allows reducing the 'irq' states and will reduce the
    amount of __bfs() lookups we do.

    Removing the 1 IRQ state results in 4 fewer __bfs() walks per
    dependency, improving lockdep performance. And by moving this
    annotation out of the lockdep code it becomes easier for the mm people
    to extend.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: iamjoonsoo.kim@lge.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Jul, 2017

1 commit

    __GFP_REPEAT was designed to add retry-but-eventually-fail semantics to
    the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantics for those requests, and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of __GFP_REPEAT flag has been removed for !costly requests we can
    give the original flag a better name and more importantly a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the user
    that the allocator will try really hard but there is no promise of
    success. This works independently of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which
    doesn't even kick background reclaim. Should be used carefully because
    it might deplete the memory and the next user might hit the more
    aggressive reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non-sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because this matches their existing semantics. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user-defined fallback
    behavior is more sensible than retrying forever in the page allocator.

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jul, 2017

1 commit

  • The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan:
    do not swap anon pages just because free+file is low'") reintroduces is
    to prefer swapping anonymous memory rather than thrashing the file lru.

    If the anonymous inactive lru for the set of eligible zones is
    considered low, however, or the length of the list for the given reclaim
    priority does not allow for effective anonymous-only reclaiming, then
    avoid forcing SCAN_ANON. Forcing SCAN_ANON will end up thrashing the
    small list and leave unreclaimed memory on the file lrus.

    If the inactive list is insufficient, fallback to balanced reclaim so
    the file lru doesn't remain untouched.

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Suggested-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Jul, 2017

7 commits

  • Patch series "mm: per-lruvec slab stats"

    Josef is working on a new approach to balancing slab caches and the page
    cache. For this to work, he needs slab cache statistics on the lruvec
    level. These patches implement that by adding infrastructure that
    allows updating and reading generic VM stat items per lruvec, then
    switches some existing VM accounting sites, including the slab
    accounting ones, to this new cgroup-aware API.

    I'll follow up with more patches on this, because there is actually
    substantial simplification that can be done to the memory controller
    when we replace private memcg accounting with making the existing VM
    accounting sites cgroup-aware. But this is enough for Josef to base his
    slab reclaim work on, so here goes.

    This patch (of 5):

    To re-implement slab cache vs. page cache balancing, we'll need the
    slab counters at the lruvec level, which, ever since lru reclaim was
    moved from the zone to the node, is the intersection of the node, not
    the zone, and the memcg.

    We could retain the per-zone counters for when the page allocator dumps
    its memory information on failures, and have counters on both levels -
    which on all but NUMA node 0 is usually redundant. But let's keep it
    simple for now and just move them. If anybody complains we can restore
    the per-zone counters.

    [hannes@cmpxchg.org: fix oops]
    Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stats interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • If there is no compound map for a THP (Transparent Huge Page), it is
    possible that the map count of some sub-pages of the THP is 0. So it is
    better to split the THP before swapping out. In this way, the sub-pages
    not mapped will be freed, and we can avoid the unnecessary swap out
    operations for these sub-pages.

    Link: http://lkml.kernel.org/r/20170515112522.32457-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To swap out a THP (Transparent Huge Page), before splitting the THP, the
    swap cluster will be allocated and the THP will be added into the swap
    cache. But it is possible that the THP cannot be split, so that we must
    delete the THP from the swap cache and free the swap cluster. To avoid
    that, in this patch, whether the THP can be split is checked first.
    The check can only be done racily, but it is good enough for most cases.

    With the patch, the swap out throughput improves 3.6% (from about
    4.16GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    Link: http://lkml.kernel.org/r/20170515112522.32457-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Kirill A. Shutemov [for can_split_huge_page()]
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • add_to_swap() aims to allocate swap space (ie, a swap slot and
    swapcache entry), so if it fails due to lack of space, as can happen
    with a THP (or an hdd swap device asked to do a THP swapout), the
    *caller* rather than add_to_swap() itself should split the THP page
    and retry with base pages, which is more natural.

    Link: http://lkml.kernel.org/r/20170515112522.32457-4-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now that get_swap_page() takes a struct page and allocates swap space
    according to the page size (ie, normal or THP), it is cleaner to
    introduce put_swap_page() as the counterpart of get_swap_page(). It
    calls the right swap slot free function depending on the page's size.

    [ying.huang@intel.com: minor cleanup and fix]
    Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Clang and its -Wunsequenced emits a warning

    mm/vmscan.c:2961:25: error: unsequenced modification and access to 'gfp_mask' [-Wunsequenced]
    .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
    ^

    While it is not clear to me whether the initialization code violates the
    specification (6.7.8 par 19 of ISO/IEC 9899 looks like it disagrees),
    the code is quite confusing and worth cleaning up anyway. Fix this by
    reusing sc.gfp_mask rather than the updated input gfp_mask parameter.

    Link: http://lkml.kernel.org/r/20170510154030.10720-1-nick.desaulniers@gmail.com
    Signed-off-by: Nick Desaulniers
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Desaulniers
     

23 May, 2017

1 commit

  • To enable smp_processor_id() and might_sleep() debug checks earlier, it's
    required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

    Adjust the system_state check in kswapd_run() to handle the extra states.

    Tested-by: Mark Rutland
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt (VMware)
    Acked-by: Vlastimil Babka
    Cc: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20170516184736.119158930@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

13 May, 2017

1 commit

  • Although there are a ton of free swap and anonymous LRU pages in
    eligible zones, OOM happened.

    balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null), order=0, oom_score_adj=0
    CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    oom_kill_process+0x21d/0x3f0
    out_of_memory+0xd8/0x390
    __alloc_pages_slowpath+0xbc1/0xc50
    __alloc_pages_nodemask+0x1a5/0x1c0
    pte_alloc_one+0x20/0x50
    __pte_alloc+0x1e/0x110
    __handle_mm_fault+0x919/0x960
    handle_mm_fault+0x77/0x120
    __do_page_fault+0x27a/0x550
    trace_do_page_fault+0x43/0x150
    do_async_page_fault+0x2c/0x90
    async_page_fault+0x28/0x30
    Mem-Info:
    active_anon:424716 inactive_anon:65314 isolated_anon:0
    active_file:52 inactive_file:46 isolated_file:0
    unevictable:0 dirty:27 writeback:0 unstable:0
    slab_reclaimable:3967 slab_unreclaimable:4125
    mapped:133 shmem:43 pagetables:1674 bounce:0
    free:4637 free_pcp:225 free_cma:0
    Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 992 992 1952
    DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 959
    Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
    DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
    Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
    378 total pagecache pages
    17 pages in swap cache
    Swap cache stats: add 17325, delete 17302, find 0/27
    Free swap = 978940kB
    Total swap = 1048572kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    12629 pages reserved
    0 pages cma reserved
    0 pages hwpoisoned
    [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
    [ 433] 0 433 4904 5 14 3 82 0 upstart-udev-br
    [ 438] 0 438 12371 5 27 3 191 -1000 systemd-udevd

    Investigation showed that the page skipping in isolate_lru_pages makes
    reclaim void because it easily returns zero nr_taken, so LRU shrinking
    effectively does nothing and just raises the reclaim priority
    aggressively. Finally, OOM happens.

    The problem is that get_scan_count determines nr_to_scan with eligible
    zones so although priority drops to zero, it couldn't reclaim any pages
    if the LRU contains mostly ineligible pages.

    get_scan_count:

    size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
    size = size >> sc->priority;

    Assume sc->priority is 0 and the LRU list is as follows.

    N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H

    (Ie, a few eligible pages ('N') are at the head of the LRU but the
    rest ('H') are ineligible pages)

    In that case, size becomes 4, so the VM wants to scan 4 pages, but the
    4 pages at the tail of the LRU are not eligible pages. If
    get_scan_count counts skipped pages, none of the remaining pages are
    reclaimed after scanning 4 pages, so it ends up with OOM.

    This patch makes isolate_lru_pages try to scan pages until it
    encounters pages from eligible zones.

    [akpm@linux-foundation.org: clean up mind-bending `for' statement. Tweak comment text]
    Fixes: 3db65812d688 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
    Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

09 May, 2017

1 commit

  • The previous patch ("mm: prevent potential recursive reclaim due to
    clearing PF_MEMALLOC") has shown that simply setting and clearing
    PF_MEMALLOC in current->flags can result in wrongly clearing a
    pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
    Let's introduce helpers that support proper nesting by saving the
    previous state of the flag, similar to the existing memalloc_noio_* and
    memalloc_nofs_* helpers. Convert existing setting/clearing of
    PF_MEMALLOC within mm to the new helpers.

    There are no known issues with the converted code, but the change makes
    it more robust.

    Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andrey Ryabinin
    Cc: Boris Brezillon
    Cc: Chris Leech
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Josef Bacik
    Cc: Lee Duncan
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

04 May, 2017

7 commits

  • The memory controllers stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We only ever count single events, drop the @nr parameter. Rename the
    function accordingly. Remove low-information kerneldoc.

    Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty bad
    by it, with refault/majfault rates doubling and tripling during cache
    transitions, and the machines sustaining half-hour periods of 100% IO
    utilization, where they'd previously have sub-minute peaks at 60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL, so it's suitable for
    a boolean return value. This patch changes it accordingly.

    Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In 2002, [1] introduced SWAP_AGAIN. At that time, try_to_unmap_one used
    spin_trylock(&mm->page_table_lock), so it was really easy to contend
    and fail to take the lock, and returning SWAP_AGAIN to keep the page's
    LRU status made sense.

    However, the lock has since been changed to a mutex-based one that can
    block instead of skipping the pte, so there is only a small window in
    which SWAP_AGAIN could be returned. Remove SWAP_AGAIN and just return
    SWAP_FAIL.

    [1] c48c43e, minimal rmap

    Link: http://lkml.kernel.org/r/1489555493-14659-7-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • ttu doesn't need to return SWAP_MLOCK. Instead, just return SWAP_FAIL
    because it means the page is not swappable, so it should move to
    another LRU list (active or unevictable). The putback functions will
    move it to the right list depending on the page's LRU flags.

    Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim