31 Aug, 2022

1 commit

  • commit f87904c075515f3e1d8f4a7115869d3b914674fd upstream.

    When a disk is removed, bdi_unregister gets called to stop further
    writeback and wait for associated delayed work to complete. However,
    wb_inode_writeback_end() may schedule the bandwidth-estimation dwork after
    this has completed, which can result in the timer attempting to access the
    just-freed bdi_writeback.

    Fix this by checking if the bdi_writeback is alive, similar to when
    scheduling writeback work.

    Since this requires wb->work_lock, and wb_inode_writeback_end() may get
    called from interrupt context, switch wb->work_lock to an irq-safe lock.
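
    A minimal sketch of the resulting shape, assuming the helper and field
    names used elsewhere in this log (wb->writeback_inodes, wb->bw_dwork,
    WB_registered, bdi_wq, BANDWIDTH_INTERVAL); not necessarily the
    verbatim patch:

        /* only queue the bandwidth-estimation dwork while the wb is
         * still registered, under an irq-safe work_lock since this can
         * be reached from interrupt context */
        static void wb_inode_writeback_end(struct bdi_writeback *wb)
        {
                unsigned long flags;

                atomic_dec(&wb->writeback_inodes);
                spin_lock_irqsave(&wb->work_lock, flags);
                if (test_bit(WB_registered, &wb->state))
                        queue_delayed_work(bdi_wq, &wb->bw_dwork,
                                           BANDWIDTH_INTERVAL);
                spin_unlock_irqrestore(&wb->work_lock, flags);
        }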

    Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
    Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload")
    Signed-off-by: Khazhismel Kumykov
    Reviewed-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Khazhismel Kumykov

04 Sep, 2021

6 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton: (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu.c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
  • We do some unlocked reads of writeback statistics like
    avg_write_bandwidth, dirty_ratelimit, or bw_time_stamp. Generally we
    are fine with getting somewhat out-of-date values, but getting
    different values in different parts of a function, because the
    compiler decided to reload the value from its original memory
    location, could confuse calculations. Use READ_ONCE for these
    unlocked accesses and WRITE_ONCE for the updates, to be on the safe
    side.
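
    In simplified form, the two annotations pin a single load or store per
    access site; the real kernel macros handle more cases, this is only an
    illustration of the pattern:

        #define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
        #define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

        /* one load, used consistently through the whole calculation */
        unsigned long bw = READ_ONCE(wb->avg_write_bandwidth);

        /* paired annotation on the writer side */
        WRITE_ONCE(wb->dirty_ratelimit, dirty_ratelimit);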

    Link: https://lkml.kernel.org/r/20210713104716.22868-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
  • Rename domain_update_bandwidth() to domain_update_dirty_limit(). The
    original name is a misnomer: the function has nothing to do with
    bandwidth, it updates dirty limits.

    Link: https://lkml.kernel.org/r/20210713104716.22868-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
  • Michael Stapelberg has reported that for workloads with short big
    spikes of writes (the GCC linker seems to trigger this frequently) the
    write throughput is heavily underestimated and tends to steadily sink
    until it reaches zero. This has a rather bad impact on writeback
    throttling (causing
    stalls). The problem is that writeback throughput estimate gets updated
    at most once per 200 ms. One update happens early after we submit pages
    for writeback (at that point writeout of only small fraction of pages is
    completed and thus observed throughput is tiny). Next update happens only
    during the next write spike (updates happen only from inode writeback and
    dirty throttling code) and if that is more than 1s after previous spike,
    we decide system was idle and just ignore whatever was written until this
    moment.

    Fix the problem by making sure writeback throughput estimate is also
    updated shortly after writeback completes to get reasonable estimate of
    throughput for spiky workloads.

    [jack@suse.cz: avoid division by 0 in wb_update_dirty_ratelimit()]

    Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
    Link: https://lkml.kernel.org/r/20210713104716.22868-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Reported-by: Michael Stapelberg
    Tested-by: Michael Stapelberg
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
  • Currently we trigger writeback bandwidth estimation from
    balance_dirty_pages() and from wb_writeback(). However neither of these
    need to trigger when the system is relatively idle and writeback is
    triggered e.g. from fsync(2). Make sure writeback estimates happen
    reliably by triggering them from do_writepages().
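
    A hedged sketch of the trigger (helper and field names as used
    elsewhere in this series; the exact placement inside do_writepages()
    may differ):

        /* at the end of do_writepages(), roughly: */
        struct bdi_writeback *wb = inode_to_wb(mapping->host);

        /* refresh the estimate if it is stale, even when nothing else
         * (balance_dirty_pages, wb_writeback) is running */
        if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) +
                                   BANDWIDTH_INTERVAL))
                wb_update_bandwidth(wb);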

    Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
  • Patch series "writeback: Fix bandwidth estimates", v4.

    Fix the estimate of writeback throughput when the device is not fully
    busy doing writeback. Michael Stapelberg has reported that such a
    workload (e.g. generated by linking) tends to push the estimated
    throughput down to 0 and as a result writeback on the device is
    practically stalled.

    The first three patches fix the reported issue, the remaining two patches
    are unrelated cleanups of problems I've noticed when reading the code.

    This patch (of 4):

    Track the number of inodes under writeback for each bdi_writeback
    structure. We will use this to decide whether a wb does any IO and so
    we can estimate its writeback throughput. In principle we could use
    the number of pages under writeback (the WB_WRITEBACK counter) for
    this; however, normal percpu counter reads are too inaccurate for our
    purposes and summing the counter is too expensive.
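
    A sketch of that tracking; an exact atomic counter is cheap to read,
    unlike summing a percpu counter (field name assumed from the
    description):

        static void wb_inode_writeback_start(struct bdi_writeback *wb)
        {
                atomic_inc(&wb->writeback_inodes);
        }

        static void wb_inode_writeback_end(struct bdi_writeback *wb)
        {
                atomic_dec(&wb->writeback_inodes);
        }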

    Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz
    Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Wu Fengguang
    Cc: Michael Stapelberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara

10 Aug, 2021

1 commit

  • Don't leak the details of the timer into the block layer, instead
    initialize the timer in bdi_alloc and delete it in bdi_unregister.
    Note that this means the timer is initialized (but not armed) for
    non-block queues as well now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig

01 Jul, 2021

1 commit

  • Pull core block updates from Jens Axboe:

    - disk events cleanup (Christoph)

    - gendisk and request queue allocation simplifications (Christoph)

    - bdev_disk_changed cleanups (Christoph)

    - IO priority improvements (Bart)

    - Chained bio completion trace fix (Edward)

    - blk-wbt fixes (Jan)

    - blk-wbt enable/disable fix (Zhang)

    - Scheduler dispatch improvements (Jan, Ming)

    - Shared tagset scheduler improvements (John)

    - BFQ updates (Paolo, Luca, Pietro)

    - BFQ lock inversion fix (Jan)

    - Documentation improvements (Kir)

    - CLONE_IO block cgroup fix (Tejun)

    - Removal of the ancient and deprecated block dump feature (zhangyi)

    - Discard merge fix (Ming)

    - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
    Yang)

    * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
    block: fix discard request merge
    block/mq-deadline: Remove a WARN_ON_ONCE() call
    blk-mq: update hctx->dispatch_busy in case of real scheduler
    blk: Fix lock inversion between ioc lock and bfqd lock
    bfq: Remove merged request already in bfq_requests_merged()
    block: pass a gendisk to bdev_disk_changed
    block: move bdev_disk_changed
    block: add the events* attributes to disk_attrs
    block: move the disk events code to a separate file
    block: fix trace completion for chained bio
    block/partitions/msdos: Fix typo inidicator -> indicator
    block, bfq: reset waker pointer with shared queues
    block, bfq: check waker only for queues with no in-flight I/O
    block, bfq: avoid delayed merge of async queues
    block, bfq: boost throughput by extending queue-merging times
    block, bfq: consider also creation time in delayed stable merge
    block, bfq: fix delayed stable merge check
    block, bfq: let also stably merged queues enjoy weight raising
    blk-wbt: make sure throttle is enabled properly
    blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
    ...

    Linus Torvalds

30 Jun, 2021

8 commits

  • Use __set_page_dirty_no_writeback() instead. This will set the dirty bit
    on the page, which will be used to avoid calling set_page_dirty() in the
    future. It will have no effect on actually writing the page back, as the
    pages are not on any LRU lists.

    [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]

    Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
  • This is fundamentally the same code, so just call it instead of
    duplicating it.

    Link: https://lkml.kernel.org/r/20210615162342.1669332-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
  • Patch series "Further set_page_dirty cleanups".

    Prompted by Christoph's recent patches, here are some more patches to
    improve the state of set_page_dirty(). They're all from the folio tree,
    so they've been tested to a certain extent.

    This patch (of 6):

    Nothing in __set_page_dirty() is specific to buffer_head, so move it to
    mm/page-writeback.c. That removes the only caller of
    account_page_dirtied() outside of page-writeback.c, so make it static.

    Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Cc: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
  • Remove the CONFIG_BLOCK default to __set_page_dirty_buffers and just wire
    that method up for the missing instances.
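
    That is, each affected filesystem now sets the method explicitly in
    its address_space_operations; example_aops below is a hypothetical
    instance, only the ->set_page_dirty line is the point:

        static const struct address_space_operations example_aops = {
                /* was the implicit CONFIG_BLOCK default */
                .set_page_dirty = __set_page_dirty_buffers,
        };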

    [hch@lst.de: ecryptfs: add a ->set_page_dirty kludge]
    Link: https://lkml.kernel.org/r/20210624125250.536369-1-hch@lst.de

    Link: https://lkml.kernel.org/r/20210614061512.3966143-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Jan Kara
    Cc: Al Viro
    Cc: Matthew Wilcox (Oracle)
    Cc: Tyler Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
  • As account_page_dirtied() is always protected by xa_lock_irqsave(),
    using __this_cpu_inc() is better.
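
    The general rule being applied, in sketch form (a generic percpu
    counter, not the patched lines themselves):

        DEFINE_PER_CPU(unsigned long, example_counter);  /* hypothetical */

        /* caller holds xa_lock_irqsave(): IRQs are off, so the cheaper
         * non-irq-safe variant cannot be torn by an interrupt */
        __this_cpu_inc(example_counter);   /* instead of this_cpu_inc() */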

    Link: https://lkml.kernel.org/r/20210512144742.4764-1-wuchi.zero@gmail.com
    Signed-off-by: Chi Wu
    Reviewed-by: Jan Kara
    Cc: Howard Cochran
    Cc: Miklos Szeredi
    Cc: Sedat Dilek
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chi Wu
  • As the value of pos_ratio_polynom() clamps between 0 and 2LL <<
    RATELIMIT_CALC_SHIFT, the global control line should be consistent with
    it.

    Link: https://lkml.kernel.org/r/20210511103606.3732-1-wuchi.zero@gmail.com
    Signed-off-by: Chi Wu
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Howard Cochran
    Cc: Miklos Szeredi
    Cc: Sedat Dilek
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chi Wu
  • Fix performance when BDI's share of ratio is 0.

    The issue is similar to commit 74d369443325 ("writeback: Fix
    performance regression in wb_over_bg_thresh()").

    balance_dirty_pages() and the writeback worker will also disagree on
    whether to write back when a BDI uses BDI_CAP_STRICTLIMIT and the
    BDI's share of the thresh ratio is zero.

    For example, a thread on cpu0 writes 32 pages and then calls
    balance_dirty_pages(); it wakes up background writeback and pauses
    because wb_dirty > wb->wb_thresh = 0 (the share of the thresh ratio is
    zero). The thread may run on cpu0 again because the scheduler prefers
    prev_cpu, while the writeback worker may run on other cpus (1, 2, ...),
    in which case the value of wb_stat(wb, WB_RECLAIMABLE) read in
    wb_over_bg_thresh() is 0 and the worker returns without writing back.

    Thus, balance_dirty_pages() keeps looping, sleeping and then waking up
    a worker that does nothing. It remains stuck in this state until the
    writeback worker hits the right dirty cpu or the dirty pages expire.

    The fix is to read the exact wb_stat_sum() when the threshold is low.
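
    A sketch of that shape, following the precedent of commit 74d369443325
    cited above (helper names from memory, hedged):

        unsigned long reclaimable;

        /* percpu counters are only accurate to within a per-cpu batch
         * error; once the threshold is comparably small, pay for the
         * exact sum */
        if (wb_thresh < 2 * wb_stat_error())
                reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
        else
                reclaimable = wb_stat(wb, WB_RECLAIMABLE);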

    Link: https://lkml.kernel.org/r/20210428225046.16301-1-wuchi.zero@gmail.com
    Signed-off-by: Chi Wu
    Reviewed-by: Jan Kara
    Cc: Tejun Heo
    Cc: Miklos Szeredi
    Cc: Sedat Dilek
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chi Wu
  • get_writeback_state() has been gone since 2006; kill the related comments.

    Link: https://lkml.kernel.org/r/20210508125026.56600-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang

24 May, 2021

1 commit

  • We have already deleted the block_dump feature in mark_inode_dirty()
    because it can be replaced by tracepoints; now we also remove the part
    in submit_bio() for the same reason. That part of the block_dump
    feature dumps the writing process, write region and sectors on the
    target disk into the kernel log. It can be replaced by the
    block_bio_queue tracepoint in submit_bio_checks(), so we do not need
    block_dump anymore; remove the whole feature.

    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    zhangyi (F)

07 May, 2021

1 commit

  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar

01 May, 2021

1 commit

  • Page writeback doesn't hold a page reference, which allows truncate to
    free a page the second PageWriteback is cleared. This used to require
    special attention in test_clear_page_writeback(), where we had to be
    careful not to rely on the unstable page->memcg binding and look up all
    the necessary information before clearing the writeback flag.

    Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
    BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an
    explicit reference on the page, and this dance is no longer needed.

    Use unlock_page_memcg() and dec_lruvec_page_state() directly.

    This removes the last user of the lock_page_memcg() return value, so
    change it to void. Touch up the comments in there as well. This also
    removes the last extern user of __unlock_page_memcg(), so make it
    static. Further, it removes the last user of dec_lruvec_state();
    delete it, along with a few other unused helpers.

    Link: https://lkml.kernel.org/r/YCQbYAWg4nvBFL6h@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner

24 Mar, 2021

1 commit

  • This is the killable version of wait_on_page_writeback.
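
    A sketch of the killable variant: the same loop as
    wait_on_page_writeback(), but the wait can be cut short by a fatal
    signal:

        int wait_on_page_writeback_killable(struct page *page)
        {
                while (PageWriteback(page))
                        if (wait_on_page_bit_killable(page, PG_writeback))
                                return -EINTR;
                return 0;
        }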

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Signed-off-by: David Howells
    Tested-by: kafs-testing@auristor.com
    cc: linux-afs@lists.infradead.org
    cc: linux-mm@kvack.org
    Link: https://lore.kernel.org/r/20210320054104.1300774-3-willy@infradead.org

    Matthew Wilcox (Oracle)

06 Jan, 2021

1 commit

  • Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
    logic") we've had some very occasional reports of BUG_ON(PageWriteback)
    in write_cache_pages(), which we thought we already fixed in commit
    073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

    But syzbot just reported another one, even with that commit in place.

    And it turns out that there's a simpler way to trigger the BUG_ON() than
    the one Hugh found with page re-use. It all boils down to the fact that
    the page writeback is ostensibly serialized by the page lock, but that
    isn't actually really true.

    Yes, the people _setting_ writeback all do so under the page lock, but
    the actual clearing of the bit - and waking up any waiters - happens
    without any page lock.

    This gives us this fairly simple race condition:

    CPU1 = end previous writeback
    CPU2 = start new writeback under page lock
    CPU3 = write_cache_pages()

    CPU1                        CPU2                        CPU3
    ----                        ----                        ----

    end_page_writeback()
      test_clear_page_writeback(page)
      ... delayed...

                                lock_page();
                                set_page_writeback()
                                unlock_page()

                                                            lock_page()
                                                            wait_on_page_writeback();

      wake_up_page(page, PG_writeback);
      .. wakes up CPU3 ..

                                                            BUG_ON(PageWriteback(page));

    where the BUG_ON() happens because we woke up the PG_writeback bit
    because of the _previous_ writeback, but a new one had already been
    started because the clearing of the bit wasn't actually atomic wrt the
    actual wakeup or serialized by the page lock.

    The reason this didn't use to happen was that the old logic in waiting
    on a page bit would just loop if it ever saw the bit set again.

    The nice proper fix would probably be to get rid of the whole "wait for
    writeback to clear, and then set it" logic in the writeback path, and
    replace it with an atomic "wait-to-set" (ie the same as we have for page
    locking: we set the page lock bit with a single "lock_page()", not with
    "wait for lock bit to clear and then set it").

    However, our current model for writeback is that the waiting for the
    writeback bit is done by the generic VFS code (ie write_cache_pages()),
    but the actual setting of the writeback bit is done much later by the
    filesystem ".writepages()" function.

    IOW, to make the writeback bit have that same kind of "wait-to-set"
    behavior as we have for page locking, we'd have to change our roughly
    ~50 different writeback functions. Painful.

    Instead, just make "wait_on_page_writeback()" loop on the very unlikely
    situation that the PG_writeback bit is still set, basically re-instating
    the old behavior. This is very non-optimal in case of contention, but
    since we only ever set the bit under the page lock, that situation is
    controlled.
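
    In essence (the helper was an inline in pagemap.h at the time):

        static inline void wait_on_page_writeback(struct page *page)
        {
                /* re-check after each wakeup: the wakeup may belong to a
                 * previous writeback, and a new one may already be
                 * running */
                while (PageWriteback(page))     /* was: if (...) */
                        wait_on_page_bit(page, PG_writeback);
        }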

    Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Acked-by: Hugh Dickins
    Cc: Andrew Morton
    Cc: Matthew Wilcox
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds

25 Nov, 2020

1 commit

  • Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
    on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
    end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
    no longer an ext4 page at all.

    The problem is that PageWriteback is not accompanied by a page reference
    (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
    soon as TestClearPageWriteback has been done, that page could be removed
    from page cache, freed, and reused for something else by the time that
    wake_up_page() is reached.

    https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
    Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
    check; but I'm paranoid about even looking at an unreferenced struct page,
    lest its memory might itself have already been reused or hotremoved (and
    wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

    Then on crashing a second time, realized there's a stronger reason against
    that approach. If my testing just occasionally crashes on that check,
    when the page is reused for part of a compound page, wouldn't it be much
    more common for the page to get reused as an order-0 page before reaching
    wake_up_page()? And on rare occasions, might that reused page already be
    marked PageWriteback by its new user, and already be waited upon? What
    would that look like?

    It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
    in write_cache_pages() (though I have never seen that crash myself).

    Matthew Wilcox explaining this to himself:
    "page is allocated, added to page cache, dirtied, writeback starts,

    --- thread A ---
    filesystem calls end_page_writeback()
    test_clear_page_writeback()
    --- context switch to thread B ---
    truncate_inode_pages_range() finds the page, it doesn't have writeback set,
    we delete it from the page cache. Page gets reallocated, dirtied, writeback
    starts again. Then we call write_cache_pages(), see
    PageWriteback() set, call wait_on_page_writeback()
    --- context switch back to thread A ---
    wake_up_page(page, PG_writeback);
    ... thread B is woken, but because the wakeup was for the old use of
    the page, PageWriteback is still set.

    Devious"

    And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    this would have been much less likely: before that, wake_page_function()'s
    non-exclusive case would stop walking and not wake if it found Writeback
    already set again; whereas now the non-exclusive case proceeds to wake.

    I have not thought of a fix that does not add a little overhead: the
    simplest fix is for end_page_writeback() to get_page() before calling
    test_clear_page_writeback(), then put_page() after wake_up_page().
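
    That fix, sketched (eliding end_page_writeback()'s PageReclaim
    handling):

        void end_page_writeback(struct page *page)
        {
                /* pin the page so it cannot be freed and reused between
                 * clearing the bit and issuing the wakeup */
                get_page(page);
                test_clear_page_writeback(page);
                smp_mb__after_atomic();
                wake_up_page(page, PG_writeback);
                put_page(page);
        }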

    Was there a chance of missed wakeups before, since a page freed before
    reaching wake_up_page() would have PageWaiters cleared? I think not,
    because each waiter does hold a reference on the page. This bug comes
    when the old use of the page, the one we do TestClearPageWriteback on,
    had *no* waiters, so no additional page reference beyond the page cache
    (and whoever racily freed it). The reuse of the page has a waiter
    holding a reference, and its own PageWriteback set; but the belated
    wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

    Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
    Reported-by: Qian Cai
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Linus Torvalds

    Hugh Dickins

25 Sep, 2020

3 commits

  • Replace the two negative flags that are always used together with a
    single positive flag that indicates the writeback capability instead
    of two related non-capabilities. Also remove the pointless wrappers
    to just check the flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
    make the checks more obvious. Also remove the pointless
    bdi_cap_account_writeback wrapper that just obfuscates the check.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
    backing_dev_info shared between the block drivers and the writeback code.
    To help untangling the dependency replace it with a queue flag and a
    superblock flag derived from it. This also helps with the case of e.g.
    a file system requiring stable writes due to its own checksumming, but
    not forcing it on other users of the block device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute
    which is also writable, for easier testing.
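
    A sketch of the two derived flags (names as merged around this time;
    q and sb are the usual queue and superblock in context):

        /* block driver side: this queue needs stable pages */
        blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);

        /* filesystem side: e.g. a checksumming fs can require stable
         * writes for itself without forcing them on the whole device */
        sb->s_iflags |= SB_I_STABLE_WRITES;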

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig

08 Aug, 2020

1 commit

  • The global variable "vm_total_pages" is a relic from older days. There is
    only a single user that reads the variable - build_all_zonelists() - and
    the first thing it does is update it.

    Use a local variable in build_all_zonelists() instead and remove the
    global variable.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Huang Ying
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand

04 Jun, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds

03 Jun, 2020

3 commits

  • After an NFS page has been written it is considered "unstable" until a
    COMMIT request succeeds. If the COMMIT fails, the page will be
    re-written.

    These "unstable" pages are currently accounted as "reclaimable", either
    in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
    'reclaimable' count. This might have made sense when sending the COMMIT
    required a separate action by the VFS/MM (e.g. releasepage() used to
    send a COMMIT). However now that all writes generated by ->writepages()
    will automatically be followed by a COMMIT (since commit 919e3bd9a875
    ("NFS: Ensure we commit after writeback is complete")) it makes more
    sense to treat them as writeback pages.

    So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
    NR_WRITEBACK and WB_WRITEBACK.

    A particular effect of this change is that when
    wb_check_background_flush() calls wb_over_bg_thresh(), the latter
    will report 'true' a lot less often as the 'unstable' pages are no
    longer considered 'dirty' (as there is nothing that writeback can do
    about them anyway).

    Currently wb_check_background_flush() will trigger writeback to NFS even
    when there are relatively few dirty pages (if there are lots of unstable
    pages), this can result in small writes going to the server (10s of
    Kilobytes rather than a Megabyte) which hurts throughput. With this
    patch, there are fewer writes which are each larger on average.

    Where the NR_UNSTABLE_NFS count was included in statistics
    virtual-files, the entry is retained, but the value is hard-coded as
    zero. Static trace points and warning printks which mentioned this
    counter no longer report it.

    [akpm@linux-foundation.org: re-layout comment]
    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Acked-by: Trond Myklebust
    Acked-by: Michal Hocko [mm]
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
    Signed-off-by: Linus Torvalds

    NeilBrown
  • PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
    loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
    daemon needs to write to one bdi (the final bdi) in order to free up
    writes queued to another bdi (the client bdi).

    The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
    pages, so that it can still dirty pages after other processes have
    been throttled. The purpose of this is to avoid deadlocks that happen
    when the PF_LESS_THROTTLE process must write for any dirty pages to be
    freed, but it is being throttled and cannot write.

    This approach was designed when all threads were blocked equally,
    independently of which device they were writing to, or how fast it was.
    Since that time the writeback algorithm has changed substantially with
    different threads getting different allowances based on non-trivial
    heuristics. This means the simple "add 25%" heuristic is no longer
    reliable.

    The important issue is not that the daemon needs a *larger* dirty page
    allowance, but that it needs a *private* dirty page allowance, so that
    dirty pages for the "client" bdi that it is helping to clear (the bdi
    for an NFS filesystem or loop block device etc) do not affect the
    throttling of the daemon writing to the "final" bdi.

    This patch changes the heuristic so that the task is not throttled when
    the bdi it is writing to has a dirty page count below (or equal
    to) the free-run threshold for that bdi. This ensures it will always be
    able to have some pages in flight, and so will not deadlock.

    In a steady state, it is expected that PF_LOCAL_THROTTLE tasks might
    still be throttled by the global threshold, but that is acceptable as
    it is only the deadlock state that is interesting for this flag.

    This approach of "only throttle when target bdi is busy" is consistent
    with the other use of PF_LESS_THROTTLE in current_may_throttle(),
    where it causes attention to be focused only on the target bdi.

    So this patch
    - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
    - removes the 25% bonus that the flag gives, and
    - if PF_LOCAL_THROTTLE is set, does not delay at all unless both the
    global and the local free-run thresholds are exceeded (see the sketch
    below).
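
    Roughly (a simplified sketch of the new check in
    balance_dirty_pages(); the real condition tests the global and the
    wb-local limits together):

        if ((current->flags & PF_LOCAL_THROTTLE) &&
            dirty <= dirty_freerun_ceiling(thresh, bg_thresh))
                break;  /* flusher below its free-run ceiling: no delay */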

    Note that previously realtime threads were treated the same as
    PF_LESS_THROTTLE threads. This patch does *not* change the behaviour
    for real-time threads, so it is now different from the behaviour of nfsd
    and loop tasks. I don't know what is wanted for realtime.

    [akpm@linux-foundation.org: coding style fixes]
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Acked-by: Chuck Lever [nfsd]
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Cc: Trond Myklebust
    Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
    Signed-off-by: Linus Torvalds

    NeilBrown
  • Commit 64081362e8ff ("mm/page-writeback.c: fix range_cyclic writeback
    vs writepages deadlock") left an unused variable; remove it.

    Signed-off-by: Chao Yu
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200528033740.17269-1-yuchao0@huawei.com
    Signed-off-by: Linus Torvalds

    Chao Yu

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical.

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig

03 Apr, 2020

3 commits

  • With the introduction of protected KVM guests on s390 there is now a
    concept of inaccessible pages. These pages need to be made accessible
    before the host can access them.

    While cpu accesses will trigger a fault that can be resolved, I/O accesses
    will just fail. We need to add a callback into architecture code for
    places that will do I/O, namely when writeback is started or when a page
    reference is taken.

    This is not only to enable paging, file backing etc.; it is also necessary
    to protect the host against a malicious user space. For example a bad
    QEMU could simply start direct I/O on such protected memory. We do not
    want userspace to be able to trigger I/O errors and thus the logic is
    "whenever somebody accesses that page (gup) or does I/O, make sure that
    this page can be accessed". When the guest tries to access that page we
    will wait in the page fault handler for writeback to have finished and for
    the page_ref to be the expected value.

    On s390x the function is not supposed to fail, so it is ok to use a
    WARN_ON on failure. If we ever need some more fine-grained handling we can
    tackle this when we know the details.
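
    A sketch of the hook at a writeback call site (hook name as merged;
    exact placement hedged):

        /* before starting I/O on the page: a no-op on most
         * architectures, a real conversion for protected guest pages
         * on s390 */
        int ret = arch_make_page_accessible(page);

        WARN_ON_ONCE(ret);      /* not expected to fail on s390 */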

    Signed-off-by: Claudio Imbrenda
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Christian Borntraeger
    Reviewed-by: John Hubbard
    Acked-by: Will Deacon
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ira Weiny
    Cc: Jérôme Glisse
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
  • Dumping the page information in this circumstance helps for debugging.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
  • There used to be a 'retry' label in between the two (identical) checks
    when first introduced in commit f446daaea9d4 ("mm: implement writeback
    livelock avoidance using page tagging"), and later modified/updated in
    commit 6e6938b6d313 ("writeback: introduce .tagged_writepages for the
    WB_SYNC_NONE sync stage").

    The label has been removed in commit 64081362e8ff ("mm/page-writeback.c:
    fix range_cyclic writeback vs writepages deadlock"), and the (identical)
    checks are now present / performed immediately one after another.

    So, remove/deduplicate the latter check, moving tag_pages_for_writeback()
    into the former check before the 'tag' variable assignment, so it's clear
    that it's not used in this (similarly-named) function call but only later
    in pagevec_lookup_range_tag().
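
    The resulting shape described above:

        if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) {
                tag_pages_for_writeback(mapping, index, end);
                tag = PAGECACHE_TAG_TOWRITE;
        } else {
                tag = PAGECACHE_TAG_DIRTY;
        }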

    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: Andrew Morton
    Cc: Jan Kara
    Link: http://lkml.kernel.org/r/20200218221716.1648-1-mfo@canonical.com
    Signed-off-by: Linus Torvalds

    Mauricio Faria de Oliveira

14 Jan, 2020

3 commits

  • Use div64_ul() instead of do_div() if the divisor is unsigned long, to
    avoid truncation to 32-bit on 64-bit platforms.

    Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • The two variables 'numerator' and 'denominator', though declared as
    long, should actually be unsigned long (according to the
    implementation of the fprop_fraction_percpu() function).

    And do_div() does a 64-by-32 division, while the divisor 'denominator'
    is unsigned long, thus 64-bit on 64-bit platforms. Hence the proper
    function to call is div64_ul().

    Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
  • Patch series "use div64_ul() instead of div_u64() if the divisor is
    unsigned long".

    We were first inspired by commit b0ab99e7736a ("sched: Fix possible
    divide by zero in avg_atom() calculation"); then, looking at the
    recently analyzed mm code, we found this suspicious place.

    201         if (min) {
    202                 min *= this_bw;
    203                 do_div(min, tot_bw);
    204         }

    And we also disassembled and confirmed it:

    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
    0xffffffff811c37da : xor %r10d,%r10d
    0xffffffff811c37dd : test %rax,%rax
    0xffffffff811c37e0 : je 0xffffffff811c3800
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
    0xffffffff811c37e2 : imul %r8,%rax
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
    0xffffffff811c37e6 : mov %r9d,%r10d ---> truncates it to 32 bits here
    0xffffffff811c37e9 : xor %edx,%edx
    0xffffffff811c37eb : div %r10
    0xffffffff811c37ee : imul %rbx,%rax
    0xffffffff811c37f2 : shr $0x2,%rax
    0xffffffff811c37f6 : mul %rcx
    0xffffffff811c37f9 : shr $0x2,%rdx
    0xffffffff811c37fd : mov %rdx,%r10

    This series uses div64_ul() instead of div_u64() if the divisor is
    unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

    This patch (of 3):

    The variables 'min' and 'max' are unsigned long and do_div() truncates
    them to 32 bits, which means a value can test as non-zero yet be
    truncated to zero for the division. Fix this issue by using
    div64_ul() instead.
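
    The corrected pattern for the snippet quoted above:

        /* do_div() is a 64-by-32 division and truncates a 64-bit
         * divisor; div64_ul() does a full 64-by-64 division */
        if (min) {
                min *= this_bw;
                min = div64_ul(min, tot_bw);    /* was: do_div(min, tot_bw) */
        }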

    Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
    Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang