28 Nov, 2020

1 commit

  • trace_sched_blocked_trace in CFS is really useful for debugging via
    trace because it tells where the process was stuck in the callstack.

    For example,
    -6143 ( 6136) [005] d..2 50.278987: sched_blocked_reason: pid=6136 iowait=0 caller=SyS_mprotect+0x88/0x208
    -6136 ( 6136) [005] d..2 50.278990: sched_blocked_reason: pid=6142 iowait=0 caller=do_page_fault+0x1f4/0x3b0
    -6142 ( 6136) [006] d..2 50.278996: sched_blocked_reason: pid=6144 iowait=0 caller=SyS_prctl+0x52c/0xb58
    -6144 ( 6136) [006] d..2 50.279007: sched_blocked_reason: pid=6136 iowait=0 caller=vm_mmap_pgoff+0x74/0x104

    However, it sometimes gives pointless information like this:
    RenderThread-2322 ( 1805) [006] d.s3 50.319046: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220
    logd.writer-594 ( 587) [002] d.s3 50.334011: sched_blocked_reason: pid=6126 iowait=1 caller=wait_on_page_bit+0x194/0x208
    kworker/u16:13-333 ( 333) [007] d.s4 50.343161: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220

    Entries such as wait_on_page_bit and __lock_page_killable are pointless
    because they don't carry higher-level information to identify the callstack.

    The reason is that page_lock and waitqueue are special synchronization
    methods, unlike other normal locks (mutex, spinlock).
    Let's mark them as "__sched" so that get_wchan, which is used in
    trace_sched_blocked_trace, can detect and skip them. That produces more
    meaningful callstack functions like this:

    -2867 ( 1068) [002] d.h4 124.209701: sched_blocked_reason: pid=329 iowait=0 caller=worker_thread+0x378/0x470
    -2867 ( 1068) [002] d.s3 124.209763: sched_blocked_reason: pid=8454 iowait=1 caller=__filemap_fdatawait_range+0xa0/0x104
    -2867 ( 1068) [002] d.s4 124.209803: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
    ScreenDecoratio-2364 ( 1867) [002] d.s3 124.209973: sched_blocked_reason: pid=8454 iowait=1 caller=f2fs_wait_on_page_writeback+0x84/0xcc
    ScreenDecoratio-2364 ( 1867) [002] d.s4 124.209986: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
    -329 ( 329) [000] d..3 124.210435: sched_blocked_reason: pid=538 iowait=0 caller=worker_thread+0x378/0x470
    kworker/u16:13-538 ( 538) [007] d..3 124.210450: sched_blocked_reason: pid=6 iowait=0 caller=worker_thread+0x378/0x470
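
    As an illustration, the annotation would look something like this (a
    sketch; the exact wait primitives the patch touches are assumed):

        /* Marking the wait primitive __sched places it in the
         * .sched.text section, so get_wchan() skips over it and
         * reports the real blocking caller instead. */
        void __sched wait_on_page_bit(struct page *page, int bit_nr)
        {
                wait_queue_head_t *q = page_waitqueue(page);
                wait_on_page_bit_common(q, page, bit_nr,
                                        TASK_UNINTERRUPTIBLE, SHARED);
        }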

    Test: build pass and boot to home.
    Bug: 144961676
    Bug: 144713689
    Bug: 172212772
    Signed-off-by: Minchan Kim
    Signed-off-by: Jimmy Shiu
    Change-Id: I9c738802a16941ca767dcc37ae4463070b3fabf4
    (cherry picked from commit 1e4de875d9e0cfaccf5131bcc709ae8646cdc168)
    Signed-off-by: Will McVicker

    Jimmy Shiu
     

25 Nov, 2020

1 commit

  • Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
    on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
    end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
    no longer an ext4 page at all.

    The problem is that PageWriteback is not accompanied by a page reference
    (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
    soon as TestClearPageWriteback has been done, that page could be removed
    from page cache, freed, and reused for something else by the time that
    wake_up_page() is reached.

    https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
    Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
    check; but I'm paranoid about even looking at an unreferenced struct page,
    lest its memory might itself have already been reused or hotremoved (and
    wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

    Then on crashing a second time, realized there's a stronger reason against
    that approach. If my testing just occasionally crashes on that check,
    when the page is reused for part of a compound page, wouldn't it be much
    more common for the page to get reused as an order-0 page before reaching
    wake_up_page()? And on rare occasions, might that reused page already be
    marked PageWriteback by its new user, and already be waited upon? What
    would that look like?

    It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
    in write_cache_pages() (though I have never seen that crash myself).

    Matthew Wilcox explaining this to himself:
    "page is allocated, added to page cache, dirtied, writeback starts,

    --- thread A ---
    filesystem calls end_page_writeback()
    test_clear_page_writeback()
    --- context switch to thread B ---
    truncate_inode_pages_range() finds the page, it doesn't have writeback set,
    we delete it from the page cache. Page gets reallocated, dirtied, writeback
    starts again. Then we call write_cache_pages(), see
    PageWriteback() set, call wait_on_page_writeback()
    --- context switch back to thread A ---
    wake_up_page(page, PG_writeback);
    ... thread B is woken, but because the wakeup was for the old use of
    the page, PageWriteback is still set.

    Devious"

    And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    this would have been much less likely: before that, wake_page_function()'s
    non-exclusive case would stop walking and not wake if it found Writeback
    already set again; whereas now the non-exclusive case proceeds to wake.

    I have not thought of a fix that does not add a little overhead: the
    simplest fix is for end_page_writeback() to get_page() before calling
    test_clear_page_writeback(), then put_page() after wake_up_page().
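
    A minimal sketch of that fix, assuming it lands in end_page_writeback()
    as described:

        void end_page_writeback(struct page *page)
        {
                /* ... PageReclaim/rotate_reclaimable_page() handling ... */

                /* Pin the page so it cannot be freed and reused between
                 * clearing Writeback and waking any waiters. */
                get_page(page);
                if (!test_clear_page_writeback(page))
                        BUG();

                smp_mb__after_atomic();
                wake_up_page(page, PG_writeback);
                put_page(page);
        }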

    Was there a chance of missed wakeups before, since a page freed before
    reaching wake_up_page() would have PageWaiters cleared? I think not,
    because each waiter does hold a reference on the page. This bug comes
    when the old use of the page, the one we do TestClearPageWriteback on,
    had *no* waiters, so no additional page reference beyond the page cache
    (and whoever racily freed it). The reuse of the page has a waiter
    holding a reference, and its own PageWriteback set; but the belated
    wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

    Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
    Reported-by: Qian Cai
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Nov, 2020

1 commit

  • We catch the case where we enter generic_file_buffered_read() with data
    already transferred, but we also need to be careful not to allow an async
    page lock if we're looping transferring data. If not, we could be
    returning -EIOCBQUEUED instead of the transferred amount, and it could
    result in double waitqueue additions as well.
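
    A sketch of the implied guard, at the point where the read loop would
    otherwise async-lock the next page (assumed shape, not the literal
    patch):

        /* Never async-lock another page once data has been copied;
         * return the transferred amount and let the caller retry. */
        if (iocb->ki_flags & IOCB_WAITQ) {
                if (written) {
                        put_page(page);
                        goto out;
                }
                error = lock_page_async(page, iocb->ki_waitq);
        } else {
                error = lock_page_killable(page);
        }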

    Cc: stable@vger.kernel.org # v5.9
    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Oct, 2020

1 commit

  • Pull clone/dedupe/remap code refactoring from Darrick Wong:
    "Move the generic file range remap (aka reflink and dedupe) functions
    out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to
    reduce clutter in the first two files"

    * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: move the generic write and copy checks out of mm
    vfs: move the remap range helpers to remap_range.c
    vfs: move generic_remap_checks out of mm

    Linus Torvalds
     

18 Oct, 2020

1 commit

  • Once we've copied some data for an iocb that is marked with IOCB_WAITQ,
    we should no longer attempt to async lock a new page. Instead make sure
    we return the copied amount, and let the caller retry, instead of
    returning -EIOCBQUEUED for a new page.

    This should only be possible with read-ahead disabled on the underlying
    device, and multiple threads racing on the same file. It hasn't been
    reproducible on anything else.
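
    The fix suggested by this description presumably has this shape (an
    assumption, not a quote of the patch):

        /* Once some data has been copied for an IOCB_WAITQ iocb, we can
         * no longer safely return -EIOCBQUEUED; force NOWAIT so further
         * blocking returns the copied amount instead. */
        if (written && (iocb->ki_flags & IOCB_WAITQ))
                iocb->ki_flags |= IOCB_NOWAIT;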

    Cc: stable@vger.kernel.org # v5.9
    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Reported-by: Kent Overstreet
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Oct, 2020

5 commits

  • Merge more updates from Andrew Morton:
    "155 patches.

    Subsystems affected by this patch series: mm (dax, debug, thp,
    readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
    core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
    binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
    romfs, and fault-injection"

    * emailed patches from Andrew Morton : (155 commits)
    lib, uaccess: add failure injection to usercopy functions
    lib, include/linux: add usercopy failure capability
    ROMFS: support inode blocks calculation
    ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
    sched.h: drop in_ubsan field when UBSAN is in trap mode
    scripts/gdb/tasks: add headers and improve spacing format
    scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
    kernel/relay.c: drop unneeded initialization
    panic: dump registers on panic_on_warn
    rapidio: fix the missed put_device() for rio_mport_add_riodev
    rapidio: fix error handling path
    nilfs2: fix some kernel-doc warnings for nilfs2
    autofs: harden ioctl table
    ramfs: fix nommu mmap with gaps in the page cache
    mm: remove the now-unnecessary mmget_still_valid() hack
    mm/gup: take mmap_lock in get_dump_page()
    binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
    coredump: rework elf/elf_fdpic vma_dump_size() into common helper
    coredump: refactor page range dumping into common helper
    coredump: let dump_emit() bail out on short writes
    ...

    Linus Torvalds
     
  • Fix some broken comments, including typos, grammar errors and wrong
    function names.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200913095456.54873-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Fold ra_submit() into its last remaining user and pass the
    readahead_control struct to both do_page_cache_ra() and
    page_cache_sync_ra().

    Signed-off-by: David Howells
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-9-willy@infradead.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Patch series "Remove assumptions of THP size".

    There are a number of places in the VM which assume that a THP is a PMD in
    size. That's true today, and remains true after this patch series, but
    this is a prerequisite for switching to arbitrary-sized THPs.
    thp_nr_pages() still returns either HPAGE_PMD_NR or 1, but will be changed
    later.

    This patch (of 11):

    page_cache_free_page() assumes THPs are PMD_SIZE; fix that assumption.
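
    A plausible shape of the change, assuming the PMD-size constant is
    replaced with the per-page count:

        static void page_cache_free_page(struct address_space *mapping,
                                        struct page *page)
        {
                void (*freepage)(struct page *);

                freepage = mapping->a_ops->freepage;
                if (freepage)
                        freepage(page);

                if (PageTransHuge(page) && !PageHuge(page)) {
                        /* was: page_ref_sub(page, HPAGE_PMD_NR); */
                        page_ref_sub(page, thp_nr_pages(page));
                        VM_BUG_ON_PAGE(page_count(page) <= 0, page);
                } else {
                        put_page(page);
                }
        }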

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20200908195539.25896-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • When a THP is removed from the page cache by reclaim, we replace it with a
    shadow entry that occupies all slots of the XArray previously occupied by
    the THP. If the user then accesses that page again, we only allocate a
    single page, but storing it into the shadow entry replaces all entries
    with that one page. That leads to bugs like

    page dumped because: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
    ------------[ cut here ]------------
    kernel BUG at mm/filemap.c:2529!

    https://bugzilla.kernel.org/show_bug.cgi?id=206569

    This is hard to reproduce with mainline, but happens regularly with the
    THP patchset (as so many more THPs are created). This solution is taken
    from the THP patchset. It splits the shadow entry into order-0 pieces at
    the time that we bring a new page into cache.
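
    Conceptually, at insertion time (a sketch using the XArray split
    helpers introduced alongside this fix; the details here are assumed):

        /* If the slot still holds a multi-order shadow entry, split it
         * into order-0 entries first, so storing the new page replaces
         * only this one index. */
        if (xa_is_value(entry))
                xas_split_alloc(&xas, entry, order, gfp);
        xas_lock_irq(&xas);
        if (entry)
                xas_split(&xas, entry, order);
        xas_store(&xas, page);
        xas_unlock_irq(&xas);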

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Song Liu
    Cc: "Kirill A . Shutemov"
    Cc: Qian Cai
    Link: https://lkml.kernel.org/r/20200903183029.14930-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

16 Oct, 2020

2 commits

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing to limit
    stack traversal in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multi-path TCP (MPTCP) support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require much offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings, reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     
  • The generic write check helpers also don't have much to do with the page
    cache, so move them to the vfs.

    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     

15 Oct, 2020

1 commit

  • I would like to move all the generic helpers for the vfs remap range
    functionality (aka clonerange and dedupe) into a separate file so that
    they won't be scattered across the vfs and the mm subsystems. The
    eventual goal is to be able to deselect remap_range.c if none of the
    filesystems need that code, but the tricky part here is picking a
    stable(ish) part of the merge window to rearrange code.

    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     

14 Oct, 2020

6 commits

  • We dereference page->mapping and page->index directly after calling
    find_subpage() and these fields are not valid for tail pages. While
    commit 4101196b19d7 ("mm: page cache: store only head pages in i_pages")
    introduced the call to find_subpage(), the problem existed prior to this;
    I'm going to suggest all the way back to when THPs first existed.

    The user-visible effects of this are almost negligible. To hit it, you
    have to mmap a tmpfs file at an unaligned address and then it's only a
    disabled optimisation causing page faults to happen more frequently than
    they otherwise would.

    Fix this by keeping both head and page pointers and checking the
    appropriate one. We could use page_mapping() and page_to_index(), but
    that's higher overhead.
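
    A sketch of the head/page distinction in the fault path (assumed
    shape):

        struct page *head, *page;
        ...
        /* head's mapping/index/flags are the valid ones; page is what
         * actually gets mapped at this index. */
        page = find_subpage(head, xas.xa_index);
        if (!PageUptodate(head) || PageReadahead(page))
                goto skip;
        if (!trylock_page(head))
                goto skip;
        if (head->mapping != mapping || !PageUptodate(head))
                goto unlock;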

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200911012532.24761-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Add a new FGP_HEAD flag which avoids calling find_subpage() and add a
    convenience wrapper for it.
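
    The wrapper plausibly looks like this (a sketch; find_lock_head() is
    the assumed name):

        /* Return the locked head page at this index rather than the
         * subpage, by combining FGP_LOCK with the new FGP_HEAD flag. */
        static inline struct page *find_lock_head(struct address_space *mapping,
                                                pgoff_t index)
        {
                return pagecache_get_page(mapping, index,
                                FGP_LOCK | FGP_HEAD, 0);
        }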

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-9-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Convert shmem_getpage_gfp() (the only remaining caller of
    find_lock_entry()) to cope with a head page being returned instead of
    the subpage for the index.

    [willy@infradead.org: fix BUG()s]
    Link: https://lore.kernel.org/linux-mm/20200912032042.GA6583@casper.infradead.org/

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • There are only four callers remaining of find_get_entry().
    get_shadow_from_swap_cache() only wants to see shadow entries and doesn't
    care about which page is returned. Push the find_subpage() call into
    find_lock_entry(), find_get_incore_page() and pagecache_get_page().

    [willy@infradead.org: fix oops]
    Link: https://lkml.kernel.org/r/20200914112738.GM6583@casper.infradead.org

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • i915 does not want to see value entries. Switch it to use
    find_lock_page() instead, and remove the export of find_lock_entry().
    Move find_lock_entry() and find_get_entry() to mm/internal.h to discourage
    any future use.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

03 Oct, 2020

1 commit

  • Pull io_uring fixes from Jens Axboe:

    - fix for async buffered reads if read-ahead is fully disabled (Hao)

    - double poll match fix

    - ->show_fdinfo() potential ABBA deadlock complaint fix

    * tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block:
    io_uring: fix async buffered reads when readahead is disabled
    io_uring: fix potential ABBA deadlock in ->show_fdinfo()
    io_uring: always delete double poll wait entry on match

    Linus Torvalds
     

29 Sep, 2020

1 commit

  • The async buffered reads feature is not working when readahead is
    turned off. There are two things of concern:

    - when doing a retry in io_read, not only the IOCB_WAITQ flag but also
    the IOCB_NOWAIT flag is still set, which makes it go to the would_block
    phase in generic_file_buffered_read() and then return -EAGAIN. After
    that, the io-wq thread work is queued, and the async reads are later
    done the old way.

    - even if we remove IOCB_NOWAIT when doing the retry, the feature still
    doesn't run properly, since generic_file_buffered_read() goes to
    lock_page_killable() after calling mapping->a_ops->readpage() to do the
    IO, thus causing the process to sleep.
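
    A sketch of the two corresponding changes (assumed shape):

        /* io_uring side: when arming the retry, keep IOCB_WAITQ but drop
         * IOCB_NOWAIT, so the buffered read can wait via the async
         * waitqueue instead of bailing out with -EAGAIN. */
        kiocb->ki_flags |= IOCB_WAITQ;
        kiocb->ki_flags &= ~IOCB_NOWAIT;

        /* mm side: after ->readpage(), wait for the page using the async
         * waitqueue rather than sleeping in lock_page_killable(). */
        if (iocb->ki_flags & IOCB_WAITQ)
                error = lock_page_async(page, iocb->ki_waitq);
        else
                error = lock_page_killable(page);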

    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Fixes: 3b2a4439e0ae ("io_uring: get rid of kiocb_wait_page_queue_init()")
    Signed-off-by: Hao Xu
    Signed-off-by: Jens Axboe

    Hao Xu
     

23 Sep, 2020

1 commit

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing its
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Sep, 2020

1 commit

  • Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
    the page locking entirely fair, in that if a waiter came in while the
    lock was held, the lock would be transferred to the lockers strictly in
    order.

    That was intended to finally get rid of the long-reported watchdog
    failures that involved the page lock under extreme load, where a process
    could end up waiting essentially forever, as other page lockers stole
    the lock from under it.

    It also improved some benchmarks, but it ended up causing huge
    performance regressions on others, simply because fair lock behavior
    doesn't end up giving out the lock as aggressively, causing better
    worst-case latency, but potentially much worse average latencies and
    throughput.

    Instead of reverting that change entirely, this introduces a controlled
    amount of unfairness, with a sysctl knob to tune it if somebody needs
    to. But the default value should hopefully be good for any normal load,
    allowing a few rounds of lock stealing, but enforcing the strict
    ordering before the lock has been stolen too many times.

    There is also a hint from Matthieu Baerts that the fair page coloring
    may end up exposing an ABBA deadlock that is hidden by the usual
    optimistic lock stealing, and while the unfairness doesn't fix the
    fundamental issue (and I'm still looking at that), it avoids it in
    practice.

    The amount of unfairness can be modified by writing a new value to the
    'sysctl_page_lock_unfairness' variable (default value of 5, exposed
    through /proc/sys/vm/page_lock_unfairness), but that is hopefully
    something we'd use mainly for debugging rather than being necessary for
    any deep system tuning.

    This whole issue has exposed just how critical the page lock can be, and
    how contended it gets under certain loads. And the main contention
    doesn't really seem to be anything related to IO (which was the origin
    of this lock), but for things like just verifying that the page file
    mapping is stable while faulting in the page into a page table.
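
    Conceptually, each waiter now behaves like this (an illustration, not
    the kernel code; wait_for_page_lock_handoff() is hypothetical):

        int unfairness = READ_ONCE(sysctl_page_lock_unfairness);

        for (;;) {
                if (trylock_page(page))
                        break;                  /* stole the lock */
                if (--unfairness < 0) {
                        /* Too many steals: queue as an exclusive waiter
                         * and insist on a direct, fair handoff. */
                        wait_for_page_lock_handoff(page);
                        break;
                }
                wait_on_page_locked(page);      /* then race to steal */
        }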

    Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
    Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
    Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
    Reported-and-tested-by: Michael Larabel
    Tested-by: Matthieu Baerts
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: Amir Goldstein
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Aug, 2020

1 commit

  • 'static' and 'static noinline' function attributes make no guarantees that
    gcc/clang won't optimize them. The compiler may decide to inline a 'static'
    function, and in such a case ALLOW_ERROR_INJECTION becomes meaningless. The
    compiler could have inlined __add_to_page_cache_locked() in one callsite and
    not inlined it in another. In that case injecting errors into it would cause
    unpredictable behavior. It's worse with 'static noinline', which won't be
    inlined but can still be optimized: the compiler may decide to remove an
    argument or constant-propagate a value depending on the callsite.

    To avoid such issues make sure that these functions are global noinline.
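
    Presumably the change has this shape (an assumption based on the
    description above):

        /* Global and noinline: the compiler can no longer inline, clone,
         * or constant-propagate into the error-injection target. */
        -static int __add_to_page_cache_locked(struct page *page,
        +noinline int __add_to_page_cache_locked(struct page *page,
                                                 struct address_space *mapping,
                                                 pgoff_t offset, gfp_t gfp_mask,
                                                 void **shadowp)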

    Fixes: af3b854492f3 ("mm/page_alloc.c: allow error injection")
    Fixes: cfcbfb1382db ("mm/filemap.c: enable error injection at add_to_page_cache()")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Josef Bacik
    Link: https://lore.kernel.org/bpf/20200827220114.69225-2-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

15 Aug, 2020

2 commits

  • struct file_ra_state ra.mmap_miss could be accessed concurrently during
    page faults as noticed by KCSAN,

    BUG: KCSAN: data-race in filemap_fault / filemap_map_pages

    write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
    filemap_fault+0x920/0xfc0
    do_sync_mmap_readahead at mm/filemap.c:2384
    (inlined by) filemap_fault at mm/filemap.c:2486
    __xfs_filemap_fault+0x112/0x3e0 [xfs]
    xfs_filemap_fault+0x74/0x90 [xfs]
    __do_fault+0x9e/0x220
    do_fault+0x4a0/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
    filemap_map_pages+0xc2e/0xd80
    filemap_map_pages at mm/filemap.c:2625
    do_fault+0x3da/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G W L 5.5.0-next-20200210+ #1
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    ra.mmap_miss is used to contribute to the readahead decisions, so a data
    race could be undesirable. Both the read and the write are only under a
    non-exclusive mmap_sem; two concurrent writers could even underflow the
    counter. Fix the underflow by writing to a local variable before
    committing the final store to ra.mmap_miss, given that a small inaccuracy
    of the counter should be acceptable.
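
    Sketched, the fix described above looks like this (assumed shape):

        unsigned int mmap_miss;

        /* Snapshot, adjust locally, publish once: two racing faults can
         * no longer underflow the counter. */
        mmap_miss = READ_ONCE(ra->mmap_miss);
        if (mmap_miss)
                WRITE_ONCE(ra->mmap_miss, --mmap_miss);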

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Tested-by: Qian Cai
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

1 commit

  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-3-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

08 Aug, 2020

2 commits

  • FGP_{WRITE|NOFS|NOWAIT} were missed in pagecache_get_page's kerneldoc
    comment.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Cc: Gang Deng
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1593031747-4249-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Since commit bbddabe2e436aa ("mm: filemap: only do access activations on
    reads"), mark_page_accessed() is called for reads only. But the idle flag
    is cleared by mark_page_accessed(), so the idle flag won't get cleared if
    the page is only write-accessed.

    Basically idle page tracking is used to estimate workingset size of
    workload, noticeable size of workingset might be missed if the idle flag
    is not maintained correctly.

    It seems good enough to just clear idle flag for write operations.
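
    A plausible shape of the fix, assuming it lands in the FGP_WRITE path
    of pagecache_get_page():

        if (fgp_flags & FGP_WRITE) {
                /* Writes indicate activity too: clear the idle flag so
                 * idle page tracking also sees write-only access. */
                if (page_is_idle(page))
                        clear_page_idle(page);
        }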

    Fixes: bbddabe2e436 ("mm: filemap: only do access activations on reads")
    Reported-by: Gang Deng
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1593020612-13051-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     

04 Aug, 2020

1 commit

  • Pull io_uring updates from Jens Axboe:
    "Lots of cleanups in here, hardening the code and/or making it easier
    to read and fixing bugs, but a core feature/change too adding support
    for real async buffered reads. With the latter in place, we just need
    buffered write async support and we're done relying on kthreads for
    the fast path. In detail:

    - Cleanup how memory accounting is done on ring setup/free (Bijan)

    - sq array offset calculation fixup (Dmitry)

    - Consistently handle blocking off O_DIRECT submission path (me)

    - Support proper async buffered reads, instead of relying on kthread
    offload for that. This uses the page waitqueue to drive retries
    from task_work, like we handle poll based retry. (me)

    - IO completion optimizations (me)

    - Fix race with accounting and ring fd install (me)

    - Support EPOLLEXCLUSIVE (Jiufei)

    - Get rid of the io_kiocb unionizing, made possible by shrinking
    other bits (Pavel)

    - Completion side cleanups (Pavel)

    - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

    - Request environment grabbing cleanups (Pavel)

    - File and socket read/write cleanups (Pavel)

    - Improve kiocb_set_rw_flags() (Pavel)

    - Tons of fixes and cleanups (Pavel)

    - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

    * tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
    io_uring: flip if handling after io_setup_async_rw
    fs: optimise kiocb_set_rw_flags()
    io_uring: don't touch 'ctx' after installing file descriptor
    io_uring: get rid of atomic FAA for cq_timeouts
    io_uring: consolidate *_check_overflow accounting
    io_uring: fix stalled deferred requests
    io_uring: fix racy overflow count reporting
    io_uring: deduplicate __io_complete_rw()
    io_uring: de-unionise io_kiocb
    io-wq: update hash bits
    io_uring: fix missing io_queue_linked_timeout()
    io_uring: mark ->work uninitialised after cleanup
    io_uring: deduplicate io_grab_files() calls
    io_uring: don't do opcode prep twice
    io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
    io_uring: batch put_task_struct()
    tasks: add put_task_struct_many()
    io_uring: return locked and pinned page accounting
    io_uring: don't miscount pinned memory
    io_uring: don't open-code recv kbuf managment
    ...

    Linus Torvalds
     

03 Aug, 2020

2 commits

  • That gives us ordering guarantees around the pair.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • It turns out that wait_on_page_bit_common() had several problems,
    ranging from unfair behavior due to re-queueing at the end of the
    wait queue when re-trying, to an outright bug that could result in
    missed wakeups (but probably never happened in practice).

    This rewrites the whole logic to avoid both issues, by simply moving the
    logic to check (and possibly take) the bit lock into the wakeup path
    instead.

    That makes everything much more straightforward, and means that we never
    need to re-queue the wait entry: if we get woken up, we'll be notified
    through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
    and everything will have been done for us.
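
    The waiter side then reduces to essentially this (a simplified sketch
    of the described logic):

        for (;;) {
                set_current_state(state);
                /* The waker has already done the work: checked (and, for
                 * exclusive waiters, taken) the bit, removed the wait
                 * entry, and set WQ_FLAG_WOKEN. Just notice it. */
                if (READ_ONCE(wait->flags) & WQ_FLAG_WOKEN)
                        break;
                io_schedule();
        }
        __set_current_state(TASK_RUNNING);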

    Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/
    Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
    Reported-by: Oleg Nesterov
    Reported-by: Hugh Dickins
    Cc: Michal Hocko
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Jul, 2020

1 commit

  • Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
    shouldn't trigger any filesystem I/O for the actual request or for
    readahead. This makes it possible to do tentative reads out of the page
    cache, as some filesystems allow, and to take the appropriate locks and
    retry the reads only if the requested pages are not cached.
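
    A sketch of the intended usage pattern (assumed; the filesystem
    locking is schematic):

        struct kiocb iocb;

        init_sync_kiocb(&iocb, file);
        iocb.ki_flags |= IOCB_NOIO;     /* page-cache only, no I/O */
        ret = generic_file_read_iter(&iocb, to);
        if (ret == -EAGAIN) {
                /* Pages were not cached: take the filesystem locks,
                 * clear IOCB_NOIO, and retry with I/O allowed. */
        }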

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher