Eric Lee / smarc-fsl-linux-kernel

15 Jan, 2020

25 commits

ab88dca31 nfs: get rid of mount_info ->fill_super() ... Browse Code »

The only possible values are nfs_fill_super and nfs_clone_super. The
latter is used only when crossing into a submount and it is almost
identical to the former; the only differences are
* ->s_time_gran unconditionally set to 1 (even for v2 mounts).
Regression dating back to 2012, actually.
* ->s_blocksize/->s_blocksize_bits set to that of parent.

Rather than messing with the method, stash ->s_blocksize_bits in
mount_info in submount case and after the (now unconditional)
call of nfs_fill_super() override ->s_blocksize/->s_blocksize_bits
if that has been set.

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
0c38f2131 nfs: don't pass nfs_subversion to ->create_server() ... Browse Code »

pick it from mount_info

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
1bc3a2cbf nfs: unexport nfs_fs_mount_common() ... Browse Code »

Make it static, even. And remove a stale extern of (long-gone)
nfs_xdev_mount_common() from internal.h, while we are at it.

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
82eaed2be nfs: merge xdev and remote file_system_type ... Browse Code »

they are identical now...

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
a55d3297b nfs: don't bother passing nfs_subversion to ->try_mount() and nfs_fs_mount_common() ... Browse Code »

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
6a3f7a399 nfs: stash nfs_subversion reference into nfs_mount_info ... Browse Code »

That will allow to get rid of passing those references around in
quite a few places. Moreover, that will allow to merge xdev and
remote file_system_type.

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
250d69f6a nfs: lift setting mount_info from nfs_xdev_mount() ... Browse Code »

Do it in nfs_do_submount() instead. As a side benefit, nfs_clone_data
doesn't need ->fh and ->fattr anymore.

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
4e357761b nfs4: fold nfs_do_root_mount/nfs_follow_remote_path ... Browse Code »

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
6654f8e24 nfs: don't bother setting/restoring export_path around do_nfs_root_mount() ... Browse Code »

nothing in it will be looking at that thing anyway

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
15a9c4eff nfs: fold nfs4_remote_fs_type and nfs4_remote_referral_fs_type ... Browse Code »

They are identical now.

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
7643c12e9 nfs: lift setting mount_info from nfs4_remote{,_referral}_mount ... Browse Code »

Do that (fhandle allocation, setting struct server up) in
nfs4_referral_mount() and nfs4_try_mount() resp. and pass the
server and pointer to mount_info into nfs_do_root_mount() so that
nfs4_remote_referral_mount()/nfs_remote_mount() could be merged.

Since we are moving stuff from ->mount() instances to the points
prior to vfs_kern_mount() that would trigger those, we need to
make sure that do_nfs_root_mount() will do the corresponding
cleanup itself if it doesn't trigger those ->mount() instances.

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
d0b779d47 nfs: stash server into struct nfs_mount_info ... Browse Code »

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
444a52960 saner calling conventions for nfs_fs_mount_common() ... Browse Code »

Allow it to take ERR_PTR() for server and return ERR_CAST() of it in
such case. All callers used to open-code that...

Reviewed-by: David Howells
Signed-off-by: Al Viro
Signed-off-by: Anna Schumaker

Al Viro
2020-01-15 23:15:16 +0800
95e20af9f Merge tag 'nfs-for-5.5-2' of git://git.linux-nfs.org/projects/anna/linux-nfs ... Browse Code »

Pull NFS client bugfixes from Anna Schumaker:
"Three NFS over RDMA fixes for bugs Chuck found that can be hit during
device removal:

- Fix create_qp crash on device unload

- Fix completion wait during device removal

- Fix oops in receive handler after device removal"

* tag 'nfs-for-5.5-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
xprtrdma: Fix oops in Receive handler after device removal
xprtrdma: Fix completion wait during device removal
xprtrdma: Fix create_qp crash on device unload

Linus Torvalds
2020-01-15 05:33:14 +0800
671c450b6 xprtrdma: Fix oops in Receive handler after device removal ... Browse Code »

Since v5.4, a device removal occasionally triggered this oops:

Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219
Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode
Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page
Dec 2 17:13:53 manet kernel: PGD 0 P4D 0
Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP
Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883
Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma]
Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0
Dec 2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246
Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048
Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001
Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000
Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428
Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010
Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0
Dec 2 17:13:53 manet kernel: Call Trace:
Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core]
Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core]
Dec 2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd
Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a
Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9
Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74
Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30

The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that
is still pointing to the old ib_device, which has been freed. The
only way that is possible is if this rpcrdma_rep was not destroyed
by rpcrdma_ia_remove.

Debugging showed that was indeed the case: this rpcrdma_rep was
still in use by a completing RPC at the time of the device removal,
and thus wasn't on the rep free list. So, it was not found by
rpcrdma_reps_destroy().

The fix is to introduce a list of all rpcrdma_reps so that they all
can be found when a device is removed. That list is used to perform
only regbuf DMA unmapping, replacing that call to
rpcrdma_reps_destroy().

Meanwhile, to prevent corruption of this list, I've moved the
destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs().
rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is
not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus
protecting the rb_all_reps list.

Fixes: b0b227f071a0 ("xprtrdma: Use an llist to manage free rpcrdma_reps")
Signed-off-by: Chuck Lever
Signed-off-by: Anna Schumaker

Chuck Lever
2020-01-15 02:30:24 +0800
13cb886c5 xprtrdma: Fix completion wait during device removal ... Browse Code »

I've found that on occasion, "rmmod " will hang while if an NFS
is under load.

Ensure that ri_remove_done is initialized only just before the
transport is woken up to force a close. This avoids the completion
possibly getting initialized again while the CM event handler is
waiting for a wake-up.

Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount")
Signed-off-by: Chuck Lever
Signed-off-by: Anna Schumaker

Chuck Lever
2020-01-15 02:30:24 +0800
b32b9ed49 xprtrdma: Fix create_qp crash on device unload ... Browse Code »

On device re-insertion, the RDMA device driver crashes trying to set
up a new QP:

Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0
Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode
Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page
Nov 27 16:32:06 manet kernel: PGD 0 P4D 0
Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP
Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G W 5.4.0 #852
Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12
Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00
Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046
Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000
Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0
Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000
Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8
Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000
Nov 27 16:32:06 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000
Nov 27 16:32:06 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0
Nov 27 16:32:06 manet kernel: Call Trace:
Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a
Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib]
Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a
Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139
Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib]

The fix is to copy the qp_init_attr struct that was just created by
rpcrdma_ep_create() instead of using the one from the previous
connection instance.

Fixes: 98ef77d1aaa7 ("xprtrdma: Send Queue size grows after a reconnect")
Signed-off-by: Chuck Lever
Signed-off-by: Anna Schumaker

Chuck Lever
2020-01-15 02:30:24 +0800
452424cdc Merge branch 'parisc-5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux ... Browse Code »

Pull parisc fixes from Helge Deller:
"A boot crash fix by Mike Rapoport and a printk fix by Krzysztof
Kozlowski"

* 'parisc-5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: fix map_pages() to actually populate upper directory
parisc: Use proper printk format for resource_size_t

Linus Torvalds
2020-01-15 02:22:10 +0800
67373994d Merge tag 'asm-generic-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground ... Browse Code »

Pull asm-generic fixes from Arnd Bergmann:
"Here are two bugfixes from Mike Rapoport, both fixing compile-time
errors for the nds32 architecture that were recently introduced"

* tag 'asm-generic-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
nds32: fix build failure caused by page table folding updates
asm-generic/nds32: don't redefine cacheflush primitives

Linus Torvalds
2020-01-15 02:17:15 +0800
c21ed4d9a Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi ... Browse Code »

Pull SCSI fixes from James Bottomley:
"Two simple fixes in the upper drivers (so both fairly core), one in
enclosures, which fixes replugging a device into an enclosure slot and
one in the disk driver which fixes revalidating a drive with
protection information (PI) to make it a non-PI drive ... previously
we were still remembering the old PI state.

Both fixed issues are quite rare in the field"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: enclosure: Fix stale device oops with hot replug
scsi: sd: Clear sdkp->protection_type if disk is reformatted without PI

Linus Torvalds
2020-01-15 02:14:06 +0800
e033e7d4a Merge branch 'dhowells' (patches from DavidH) ... Browse Code »

Merge misc fixes from David Howells.

Two afs fixes and a key refcounting fix.

* dhowells:
afs: Fix afs_lookup() to not clobber the version on a new dentry
afs: Fix use-after-loss-of-ref
keys: Fix request_key() cache

Linus Torvalds
2020-01-15 01:56:31 +0800
f52b83b0b afs: Fix afs_lookup() to not clobber the version on a new dentry ... Browse Code »

Fix afs_lookup() to not clobber the version set on a new dentry by
afs_do_lookup() - especially as it's using the wrong version of the
version (we need to use the one given to us by whatever op the dir
contents correspond to rather than what's in the afs_vnode).

Fixes: 9dd0b82ef530 ("afs: Fix missing dentry data version updating")
Signed-off-by: David Howells
Signed-off-by: Linus Torvalds

David Howells
2020-01-15 01:40:06 +0800
40a708bd6 afs: Fix use-after-loss-of-ref ... Browse Code »

afs_lookup() has a tracepoint to indicate the outcome of
d_splice_alias(), passing it the inode to retrieve the fid from.
However, the function gave up its ref on that inode when it called
d_splice_alias(), which may have failed and dropped the inode.

Fix this by caching the fid.

Fixes: 80548b03991f ("afs: Add more tracepoints")
Reported-by: Al Viro
Signed-off-by: David Howells
Signed-off-by: Linus Torvalds

David Howells
2020-01-15 01:40:06 +0800
8379bb84b keys: Fix request_key() cache ... Browse Code »

When the key cached by request_key() and co. is cleaned up on exit(),
the code looks in the wrong task_struct, and so clears the wrong cache.
This leads to anomalies in key refcounting when doing, say, a kernel
build on an afs volume, that then trigger kasan to report a
use-after-free when the key is viewed in /proc/keys.

Fix this by making exit_creds() look in the passed-in task_struct rather
than in current (the task_struct cleanup code is deferred by RCU and
potentially run in another task).

Fixes: 7743c48e54ee ("keys: Cache result of request_key*() temporarily in task_struct")
Signed-off-by: David Howells
Signed-off-by: Linus Torvalds

David Howells
2020-01-15 01:40:06 +0800
3f1f9a9b7 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge misc fixes from Andrew Morton:
"11 mm fixes"

* emailed patches from Andrew Morton :
mm: khugepaged: add trace status description for SCAN_PAGE_HAS_PRIVATE
mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid
mm/page-writeback.c: improve arithmetic divisions
mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide
mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio()
mm, debug_pagealloc: don't rely on static keys too early
mm: memcg/slab: fix percpu slab vmstats flushing
mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment
mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment
mm/memory_hotplug: don't free usage map when removing a re-added early section
mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations

Linus Torvalds
2020-01-15 01:22:51 +0800

14 Jan, 2020

14 commits

8b7f938e0 parisc: fix map_pages() to actually populate upper directory ... Browse Code »

The commit d96885e277b5 ("parisc: use pgtable-nopXd instead of
4level-fixup") converted PA-RISC to use folded page tables, but it missed
the conversion of pgd_populate() to pud_populate() in maps_pages()
function. This caused the upper page table directory to remain empty and
the system would crash as a result.

Using pud_populate() that actually populates the page table instead of
dummy pgd_populate() fixes the issue.

Fixes: d96885e277b5 ("parisc: use pgtable-nopXd instead of 4level-fixup")
Reported-by: Meelis Roos
Reported-by: Jeroen Roovers
Reported-by: Mikulas Patocka
Tested-by: Jeroen Roovers
Tested-by: Mikulas Patocka
Signed-off-by: Mike Rapoport
Signed-off-by: Helge Deller

Mike Rapoport
2020-01-14 16:18:39 +0800
4f80b70e1 parisc: Use proper printk format for resource_size_t ... Browse Code »

resource_size_t should be printed with its own size-independent format
to fix warnings when compiling on 64-bit platform (e.g. with
COMPILE_TEST):

arch/parisc/kernel/drivers.c: In function 'print_parisc_device':
arch/parisc/kernel/drivers.c:892:9: warning:
format '%p' expects argument of type 'void *',
but argument 4 has type 'resource_size_t {aka unsigned int}' [-Wformat=]

Signed-off-by: Krzysztof Kozlowski
Signed-off-by: Helge Deller

Krzysztof Kozlowski
2020-01-14 16:17:59 +0800
63d264fe0 Merge tag 'Intel-CVE-2019-14615' from bundle by Akeem Abodunrin. ... Browse Code »

Merge Intel Gen9 graphics fix from Akeem Abodunrin:
"Insufficient control flow in certain data structures for some Intel
Processors with Intel Processor Graphics may allow an unauthenticated
user to potentially enable information disclosure via local access

This provides mitigation for Gen9 hardware. Note that Gen8 is not
impacted due to a previously implemented workaround.

The mitigation involves using an existing hardware feature to forcibly
clear down all EU state at each context switch"

* tag 'Intel-CVE-2019-14615' of emailed bundle from Akeem G Abodunrin :
drm/i915/gen9: Clear residual context state on context switch

Linus Torvalds
2020-01-14 10:40:57 +0800
554913f60 mm: khugepaged: add trace status description for SCAN_PAGE_HAS_PRIVATE ... Browse Code »

Commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem)
FS") introduced a new khugepaged scan result: SCAN_PAGE_HAS_PRIVATE, but
the corresponding description for trace events were not added.

Link: http://lkml.kernel.org/r/1574793844-2914-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Yang Shi
Cc: Song Liu
Cc: Kirill A. Shutemov
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yang Shi
2020-01-14 10:19:02 +0800
2fe20210f mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid ... Browse Code »

When booting with amd_iommu=off, the following WARNING message
appears:

AMD-Vi: AMD IOMMU disabled on kernel command-line
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
RIP: 0010:flush_workqueue+0x42e/0x450
Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
Call Trace:
kmem_cache_destroy+0x69/0x260
iommu_go_to_state+0x40c/0x5ab
amd_iommu_prepare+0x16/0x2a
irq_remapping_prepare+0x36/0x5f
enable_IR_x2apic+0x21/0x172
default_setup_apic_routing+0x12/0x6f
apic_intr_mode_init+0x1a1/0x1f1
x86_late_time_init+0x17/0x1c
start_kernel+0x480/0x53f
secondary_startup_64+0xb6/0xc0
---[ end trace 30894107c3749449 ]---
x2apic: IRQ remapping doesn't support X2APIC mode
x2apic disabled

The warning is caused by the calling of 'kmem_cache_destroy()'
in free_iommu_resources(). Here is the call path:

free_iommu_resources
kmem_cache_destroy
flush_memcg_workqueue
flush_workqueue

The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, which the variable 'wq_online' is still 'false'. This leads
to the statement 'if (WARN_ON(!wq_online))' in flush_workqueue() is
'true'.

Since the variable 'memcg_kmem_cache_wq' is not allocated during the
time, it is unnecessary to call flush_memcg_workqueue(). This prevents
the WARNING message triggered by flush_workqueue().

Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang
Reported-by: Xiaochun Lee
Reviewed-by: Shakeel Butt
Cc: Joerg Roedel
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Huang
2020-01-14 10:19:02 +0800
0a5d1a7f6 mm/page-writeback.c: improve arithmetic divisions ... Browse Code »

Use div64_ul() instead of do_div() if the divisor is unsigned long, to
avoid truncation to 32-bit on 64-bit platforms.

Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
Signed-off-by: Wen Yang
Reviewed-by: Andrew Morton
Cc: Qian Cai
Cc: Tejun Heo
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Yang
2020-01-14 10:19:02 +0800
d3ac946ec mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide ... Browse Code »

The two variables 'numerator' and 'denominator', though they are
declared as long, they should actually be unsigned long (according to
the implementation of the fprop_fraction_percpu() function)

And do_div() does a 64-by-32 division, while the divisor 'denominator'
is unsigned long, thus 64-bit on 64-bit platforms. Hence the proper
function to call is div64_ul().

Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
Signed-off-by: Wen Yang
Reviewed-by: Andrew Morton
Cc: Qian Cai
Cc: Tejun Heo
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Yang
2020-01-14 10:19:02 +0800
6d9e8c651 mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio() ... Browse Code »

Patch series "use div64_ul() instead of div_u64() if the divisor is
unsigned long".

We were first inspired by commit b0ab99e7736a ("sched: Fix possible divide
by zero in avg_atom () calculation"), then refer to the recently analyzed
mm code, we found this suspicious place.

201 if (min) {
202 min *= this_bw;
203 do_div(min, tot_bw);
204 }

And we also disassembled and confirmed it:

/usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
0xffffffff811c37da : xor %r10d,%r10d
0xffffffff811c37dd : test %rax,%rax
0xffffffff811c37e0 : je 0xffffffff811c3800
/usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
0xffffffff811c37e2 : imul %r8,%rax
/usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
0xffffffff811c37e6 : mov %r9d,%r10d ---> truncates it to 32 bits here
0xffffffff811c37e9 : xor %edx,%edx
0xffffffff811c37eb : div %r10
0xffffffff811c37ee : imul %rbx,%rax
0xffffffff811c37f2 : shr $0x2,%rax
0xffffffff811c37f6 : mul %rcx
0xffffffff811c37f9 : shr $0x2,%rdx
0xffffffff811c37fd : mov %rdx,%r10

This series uses div64_ul() instead of div_u64() if the divisor is
unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

This patch (of 3):

The variables 'min' and 'max' are unsigned long and do_div truncates
them to 32 bits, which means it can test non-zero and be truncated to
zero for division. Fix this issue by using div64_ul() instead.

Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
Signed-off-by: Wen Yang
Reviewed-by: Andrew Morton
Cc: Qian Cai
Cc: Tejun Heo
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Yang
2020-01-14 10:19:02 +0800
8e57f8acb mm, debug_pagealloc: don't rely on static keys too early ... Browse Code »

Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled. It relied on the
assumption that jump_label_init() is called before parse_early_param()
as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
it is safe to enable the static key.

However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch(). x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g. ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.

To fix this without tricky changes to init code of multiple
architectures, this patch partially reverts the static key conversion
from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
code) of debug_pagealloc_enabled() will again test a simple bool
variable. Fastpath mm code is converted to a new
debug_pagealloc_enabled_static() variant that relies on the static key,
which is enabled in a well-defined point in mm_init() where it's
guaranteed that jump_label_init() has been called, regardless of
architecture.

[sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka
Signed-off-by: Stephen Rothwell
Cc: Joonsoo Kim
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Matthew Wilcox
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Borislav Petkov
Cc: Qian Cai
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vlastimil Babka
2020-01-14 10:19:02 +0800
4a87e2a25 mm: memcg/slab: fix percpu slab vmstats flushing ... Browse Code »

Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure. Each time percpu
counters are summed, added to the atomic counterparts and propagated up
by the cgroup tree.

The second flushing is required due to how recursive vmstats are
implemented: counters are batched in percpu variables on a local level,
and once a percpu value is crossing some predefined threshold, it spills
over to atomic values on the local and each ascendant levels. It means
that without flushing some numbers cached in percpu variables will be
dropped on floor each time a cgroup is destroyed. And with uptime the
error on upper levels might become noticeable.

The first flushing aims to make counters on ancestor levels more
precise. Dying cgroups may resume in the dying state for a long time.
After kmem_cache reparenting which is performed during the offlining
slab counters of the dying cgroup don't have any chances to be updated,
because any slab operations will be performed on the parent level. It
means that the inaccuracy caused by percpu batching will not decrease up
to the final destruction of the cgroup. By the original idea flushing
slab counters during the offlining should minimize the visible
inaccuracy of slab counters on the parent level.

The problem is that percpu counters are not zeroed after the first
flushing. So every cached percpu value is summed twice. It creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on parent cgroup level. After creating and destroying of thousands of
child cgroups, slab counter on parent level can be way off the real
value.

For now, let's just stop flushing slab counters on memcg offlining. It
can't be done correctly without scheduling a work on each cpu: reading
and zeroing it during css offlining can race with an asynchronous
update, which doesn't expect values to be changed underneath.

With this change, slab counters on parent level will become eventually
consistent. Once all dying children are gone, values are correct. And
if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.

It's not perfect, as slab are reparented, so any updates after the
reparenting will happen on the parent level. It means that if a slab
page was allocated, a counter on child level was bumped, then the page
was reparented and freed, the annihilation of positive and negative
counter values will not happen until the child cgroup is released. It
makes slab counters different from others, and it might want us to
implement flushing in a correct form again. But it's also a question of
performance: scheduling a work on each cpu isn't free, and it's an open
question if the benefit of having more accurate counters is worth it.

We might also consider flushing all counters on offlining, not only slab
counters.

So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely the number of created and destroyed cgroups). And think about
the accuracy of counters separately.

Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roman Gushchin
2020-01-14 10:19:02 +0800
991589974 mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment ... Browse Code »

Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
enabled. But it doesn't work well with above-47bit hint address.

Normally, the kernel doesn't create userspace mappings above 47-bit,
even if the machine allows this (such as with 5-level paging on x86-64).
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information.

Userspace can ask for allocation from full address space by specifying
hint address (with or without MAP_FIXED) above 47-bits. If the
application doesn't need a particular address, but wants to allocate
from whole address space it can specify -1 as a hint address.

Unfortunately, this trick breaks THP alignment in shmem/tmp:
shmem_get_unmapped_area() would not try to allocate PMD-aligned area if
*any* hint address specified.

This can be fixed by requesting the aligned area if the we failed to
allocated at user-specified hint address. The request with inflated
length will also take the user-specified hint address. This way we will
not lose an allocation request from the full address space.

[kirill@shutemov.name: fold in a fixup]
Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov
Cc: "Willhalm, Thomas"
Cc: Dan Williams
Cc: "Bruggeman, Otto G"
Cc: "Aneesh Kumar K . V"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2020-01-14 10:19:01 +0800
97d3d0f9a mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment ... Browse Code »

Patch series "Fix two above-47bit hint address vs. THP bugs".

The two get_unmapped_area() implementations have to be fixed to provide
THP-friendly mappings if above-47bit hint address is specified.

This patch (of 2):

Filesystems use thp_get_unmapped_area() to provide THP-friendly
mappings. For DAX in particular.

Normally, the kernel doesn't create userspace mappings above 47-bit,
even if the machine allows this (such as with 5-level paging on x86-64).
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information.

Userspace can ask for allocation from full address space by specifying
hint address (with or without MAP_FIXED) above 47-bits. If the
application doesn't need a particular address, but wants to allocate
from whole address space it can specify -1 as a hint address.

Unfortunately, this trick breaks thp_get_unmapped_area(): the function
would not try to allocate PMD-aligned area if *any* hint address
specified.

Modify the routine to handle it correctly:

- Try to allocate the space at the specified hint address with length
padding required for PMD alignment.
- If failed, retry without length padding (but with the same hint
address);
- If the returned address matches the hint address return it.
- Otherwise, align the address as required for THP and return.

The user specified hint address is passed down to get_unmapped_area() so
above-47bit hint address will be taken into account without breaking
alignment requirements.

Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov
Reported-by: Thomas Willhalm
Tested-by: Dan Williams
Cc: "Aneesh Kumar K . V"
Cc: "Bruggeman, Otto G"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2020-01-14 10:19:01 +0800
8068df3b6 mm/memory_hotplug: don't free usage map when removing a re-added early section ... Browse Code »

When we remove an early section, we don't free the usage map, as the
usage maps of other sections are placed into the same page. Once the
section is removed, it is no longer an early section (especially, the
memmap is freed). When we re-add that section, the usage map is reused,
however, it is no longer an early section. When removing that section
again, we try to kfree() a usage map that was allocated during early
boot - bad.

Let's check against PageReserved() to see if we are dealing with an
usage map that was allocated during boot. We could also check against
!(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
cleaner.

Can be triggered using memtrace under ppc64/powernv:

$ mount -t debugfs none /sys/kernel/debug/
$ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
$ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
------------[ cut here ]------------
kernel BUG at mm/slub.c:3969!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=3D64K MMU=3DHash SMP NR_CPUS=3D2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
NIP kfree+0x338/0x3b0
LR section_deactivate+0x138/0x200
Call Trace:
section_deactivate+0x138/0x200
__remove_pages+0x114/0x150
arch_remove_memory+0x3c/0x160
try_remove_memory+0x114/0x1a0
__remove_memory+0x20/0x40
memtrace_enable_set+0x254/0x850
simple_attr_write+0x138/0x160
full_proxy_write+0x8c/0x110
__vfs_write+0x38/0x70
vfs_write+0x11c/0x2a0
ksys_write+0x84/0x140
system_call+0x5c/0x68
---[ end trace 4b053cbd84e0db62 ]---

The first invocation will offline+remove memory blocks. The second
invocation will first add+online them again, in order to offline+remove
them again (usually we are lucky and the exact same memory blocks will
get "reallocated").

Tested on powernv with boot memory: The usage map will not get freed.
Tested on x86-64 with DIMMs: The usage map will get freed.

Using Dynamic Memory under a Power DLAPR can trigger it easily.

Triggering removal (I assume after previously removed+re-added) of
memory from the HMC GUI can crash the kernel with the same call trace
and is fixed by this patch.

Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
Signed-off-by: David Hildenbrand
Tested-by: Pingfan Liu
Cc: Dan Williams
Cc: Oscar Salvador
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Hildenbrand
2020-01-14 10:19:01 +0800
cc638f329 mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations ... Browse Code »

THP page faults now attempt a __GFP_THISNODE allocation first, which
should only compact existing free memory, followed by another attempt
that can allocate from any node using reclaim/compaction effort
specified by global defrag setting and madvise.

This patch makes the following changes to the scheme:

- Before the patch, the first allocation relies on a check for
pageblock order and __GFP_IO to prevent excessive reclaim. This
however affects also the second attempt, which is not limited to
single node.

Instead of that, reuse the existing check for costly order
__GFP_NORETRY allocations, and make sure the first THP attempt uses
__GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
allocations will bail out if compaction needs reclaim, while
previously they only bailed out when compaction was deferred due to
previous failures.

This should be still acceptable within the __GFP_NORETRY semantics.

- Before the patch, the second allocation attempt (on all nodes) was
passing __GFP_NORETRY. This is redundant as the check for pageblock
order (discussed above) was stronger. It's also contrary to
madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
requested.

After this patch, the second attempt doesn't pass __GFP_THISNODE nor
__GFP_NORETRY.

To sum up, THP page faults now try the following attempts:

1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
allocation from any node with effort determined by global defrag setting
and VMA madvise
3. fallback to base pages on any node

Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Linus Torvalds
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: "Kirill A. Shutemov"
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vlastimil Babka
2020-01-14 10:19:01 +0800

13 Jan, 2020

1 commit

b3a987b02 Linux 5.5-rc6 Browse Code »

Linus Torvalds
2020-01-13 08:55:08 +0800