13 Mar, 2020

1 commit

  • Pull networking fixes from David Miller:
    "It looks like a decent sized set of fixes, but a lot of these are one
    liner off-by-one and similar type changes:

    1) Fix netlink header pointer used to calculate the bad attribute offset
    reported to user. From Pablo Neira Ayuso.

    2) Don't double clear PHY interrupts when ->did_interrupt is set,
    from Heiner Kallweit.

    3) Add missing validation of various (devlink, nl802154, fib, etc.)
    attributes, from Jakub Kicinski.

    4) Missing *pos increments in various netfilter seq_next ops, from
    Vasily Averin.

    5) Missing break in of_mdiobus_register() loop, from Dajun Jin.

    6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.

    7) Work around FMAN erratum A050385, from Madalin Bucur.

    8) Make sure ARP header is pulled early enough in bonding driver,
    from Eric Dumazet.

    9) Do a cond_resched() during multicast processing of ipvlan and
    macvlan, from Mahesh Bandewar.

    10) Don't attach cgroups to unrelated sockets when in interrupt
    context, from Shakeel Butt.

    11) Fix tpacket ring state management when encountering unknown GSO
    types. From Willem de Bruijn.

    12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
    only in the suspend context. From Heiner Kallweit"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
    net: systemport: fix index check to avoid an array out of bounds access
    tc-testing: add ETS scheduler to tdc build configuration
    net: phy: fix MDIO bus PM PHY resuming
    net: hns3: clear port base VLAN when unload PF
    net: hns3: fix RMW issue for VLAN filter switch
    net: hns3: fix VF VLAN table entries inconsistent issue
    net: hns3: fix "tc qdisc del" failed issue
    taprio: Fix sending packets without dequeueing them
    net: mvmdio: avoid error message for optional IRQ
    net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
    net: memcg: fix lockdep splat in inet_csk_accept()
    s390/qeth: implement smarter resizing of the RX buffer pool
    s390/qeth: refactor buffer pool code
    s390/qeth: use page pointers to manage RX buffer pool
    seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
    net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
    net/packet: tpacket_rcv: do not increment ring index on drop
    sxgbe: Fix off by one in samsung driver strncpy size arg
    net: caif: Add lockdep expression to RCU traversal primitive
    MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
    ...

    Linus Torvalds
     

11 Mar, 2020

2 commits

  • If a TCP socket is allocated in IRQ context, or cloned from an unassociated
    (i.e. not associated to a memcg) socket in IRQ context, then it will remain
    unassociated for its whole life. Almost half of the TCP sockets created on
    the system are created in IRQ context, so memory used by such sockets will
    not be accounted by the memcg.

    This issue is more widespread in cgroup v1 where network memory
    accounting is opt-in but it can happen in cgroup v2 if the source socket
    for the cloning was created in root memcg.

    To fix the issue, just do the association of the sockets at the accept()
    time in the process context and then force charge the memory buffer
    already used and reserved by the socket.
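    A minimal sketch of the idea, assuming the association is performed in
    inet_csk_accept() once the new socket is handed back to process context
    (exact placement and locals are illustrative, not the literal patch):

        if (mem_cgroup_sockets_enabled && !newsk->sk_memcg) {
                int amt;

                /* safe here: we run in the acceptor's process context */
                mem_cgroup_sk_alloc(newsk);
                if (newsk->sk_memcg) {
                        /* force-charge what the socket already holds */
                        amt = sk_mem_pages(newsk->sk_forward_alloc +
                                           atomic_read(&newsk->sk_rmem_alloc));
                        if (amt)
                                mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
                }
        }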

    Signed-off-by: Shakeel Butt
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shakeel Butt
     
  • We are testing network memory accounting in our setup and noticed
    inconsistent network memory usage, and often an unrelated cgroup's network
    usage correlates with the testing workload. On further inspection, it
    seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq
    context, especially for cgroup v1.

    mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
    and kind of assume that this can only happen from sk_clone_lock() and that
    the source sock object already has an associated cgroup. However, in
    cgroup v1, where network memory accounting is opt-in, the source sock can
    be unassociated with any cgroup, and the new cloned sock can get
    associated with the unrelated cgroup of the interrupted process.

    Cgroup v2 can also suffer if the source sock object was created by a
    process in the root cgroup or if sk_alloc() is called in irq context.
    The fix is to simply do nothing in interrupt context.
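    A sketch of that bail-out, assuming it sits at the top of both
    mem_cgroup_sk_alloc() and cgroup_sk_alloc():

        void mem_cgroup_sk_alloc(struct sock *sk)
        {
                /* don't associate the socket with whatever cgroup happens to
                 * be current when called from IRQ/softirq context */
                if (in_interrupt())
                        return;

                /* ... existing association logic follows ... */
        }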

    WARNING: Please note that about half of the TCP sockets are allocated
    from the IRQ context, so, memory used by such sockets will not be
    accounted by the memcg.

    The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

    CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
    Hardware name: ...
    Call Trace:

    dump_stack+0x57/0x75
    mem_cgroup_sk_alloc+0xe9/0xf0
    sk_clone_lock+0x2a7/0x420
    inet_csk_clone_lock+0x1b/0x110
    tcp_create_openreq_child+0x23/0x3b0
    tcp_v6_syn_recv_sock+0x88/0x730
    tcp_check_req+0x429/0x560
    tcp_v6_rcv+0x72d/0xa40
    ip6_protocol_deliver_rcu+0xc9/0x400
    ip6_input+0x44/0xd0
    ? ip6_protocol_deliver_rcu+0x400/0x400
    ip6_rcv_finish+0x71/0x80
    ipv6_rcv+0x5b/0xe0
    ? ip6_sublist_rcv+0x2e0/0x2e0
    process_backlog+0x108/0x1e0
    net_rx_action+0x26b/0x460
    __do_softirq+0x104/0x2a6
    do_softirq_own_stack+0x2a/0x40

    do_softirq.part.19+0x40/0x50
    __local_bh_enable_ip+0x51/0x60
    ip6_finish_output2+0x23d/0x520
    ? ip6table_mangle_hook+0x55/0x160
    __ip6_finish_output+0xa1/0x100
    ip6_finish_output+0x30/0xd0
    ip6_output+0x73/0x120
    ? __ip6_finish_output+0x100/0x100
    ip6_xmit+0x2e3/0x600
    ? ipv6_anycast_cleanup+0x50/0x50
    ? inet6_csk_route_socket+0x136/0x1e0
    ? skb_free_head+0x1e/0x30
    inet6_csk_xmit+0x95/0xf0
    __tcp_transmit_skb+0x5b4/0xb20
    __tcp_send_ack.part.60+0xa3/0x110
    tcp_send_ack+0x1d/0x20
    tcp_rcv_state_process+0xe64/0xe80
    ? tcp_v6_connect+0x5d1/0x5f0
    tcp_v6_do_rcv+0x1b1/0x3f0
    ? tcp_v6_do_rcv+0x1b1/0x3f0
    __release_sock+0x7f/0xd0
    release_sock+0x30/0xa0
    __inet_stream_connect+0x1c3/0x3b0
    ? prepare_to_wait+0xb0/0xb0
    inet_stream_connect+0x3b/0x60
    __sys_connect+0x101/0x120
    ? __sys_getsockopt+0x11b/0x140
    __x64_sys_connect+0x1a/0x20
    do_syscall_64+0x51/0x200
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
    Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Signed-off-by: David S. Miller

    Shakeel Butt
     

06 Mar, 2020

5 commits

  • Commit cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
    fixed memory hotplug with debug_pagealloc enabled, where onlining a page
    goes through page freeing, which removes the direct mapping. Some arches
    don't like when the page is not mapped in the first place, so
    generic_online_page() maps it first. This is somewhat wasteful, but
    better than special casing page freeing fast paths.

    The commit however missed that having DEBUG_PAGEALLOC configured doesn't
    mean it's actually enabled. One has to test debug_pagealloc_enabled() since
    031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime
    configurable"), or alternatively debug_pagealloc_enabled_static() since
    8e57f8acbbd1 ("mm, debug_pagealloc: don't rely on static keys too early"),
    but this is not done.

    As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
    will crash:

    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 0000000000000000 TEID: 0000000000000483
    Fault in home space mode while using kernel ASCE.
    AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
    Oops: 0004 ilc:2 [#1] SMP
    CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
    Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
    0000000000000001 0000000000000000 0000000000000002 0000000000000100
    0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
    000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
    Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
    0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
    >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
    0000001ecd281ba2: 41505008 la %r5,8(%r5)
    0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
    0000001ecd281bac: 1a07 ar %r0,%r7
    0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
    Call Trace:
    [] __kernel_map_pages+0x166/0x188
    [] online_pages_range+0xf6/0x128
    [] walk_system_ram_range+0x7e/0xd8
    [] online_pages+0x2fe/0x3f0
    [] memory_subsys_online+0x8e/0xc0
    [] device_online+0x5a/0xc8
    [] state_store+0x88/0x118
    [] kernfs_fop_write+0xc2/0x200
    [] vfs_write+0x176/0x1e0
    [] ksys_write+0xa2/0x100
    [] system_call+0xd8/0x2c8

    Fix this by checking debug_pagealloc_enabled_static() before calling
    kernel_map_pages(). Backports for kernels before 5.5 should use
    debug_pagealloc_enabled() instead. Also add comments.
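    A sketch of the resulting check in the page-onlining path (call site
    assumed to be generic_online_page()):

        /* only re-map the page if debug_pagealloc actually unmapped it */
        if (debug_pagealloc_enabled_static())
                kernel_map_pages(page, 1 << order, 1);
        __free_pages_core(page, order);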

    Fixes: cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
    Reported-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Vlastimil Babka
    Reviewed-by: David Hildenbrand
    Cc:
    Cc: Joonsoo Kim
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • rwlock.h should not be included directly; linux/spinlock.h should be
    included instead. Including rwlock.h directly breaks the RT build, among
    other things.

    Signed-off-by: Andrew Morton
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Vitaly Wool
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200224133631.1510569-1-bigeasy@linutronix.de
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Jeff Moyer has reported that one of xfstests triggers a warning when run
    on DAX-enabled filesystem:

    WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
    ...
    wp_page_copy+0x98c/0xd50 (unreliable)
    do_wp_page+0xd8/0xad0
    __handle_mm_fault+0x748/0x1b90
    handle_mm_fault+0x120/0x1f0
    __do_page_fault+0x240/0xd70
    do_page_fault+0x38/0xd0
    handle_page_fault+0x10/0x30

    The warning happens on failed __copy_from_user_inatomic() which tries to
    copy data into a CoW page.

    This happens because of a race between MADV_DONTNEED and a CoW page fault:

    CPU0                                    CPU1
    handle_mm_fault()
      do_wp_page()
        wp_page_copy()
          do_wp_page()
                                            madvise(MADV_DONTNEED)
                                              zap_page_range()
                                                zap_pte_range()
                                                  ptep_get_and_clear_full()

          __copy_from_user_inatomic()
          sees empty PTE and fails
          WARN_ON_ONCE(1)
          clear_page()

    The solution is to re-try __copy_from_user_inatomic() under the PTL after
    checking that the PTE matches orig_pte.

    The second copy attempt can still fail, for example due to a non-readable
    PTE, but there's nothing reasonable we can do about it except clearing the
    CoW page.
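    A minimal sketch of the retry, assuming it sits in the helper that copies
    the old page contents for wp_page_copy() (locals named for illustration):

        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
                /* first copy failed: re-validate the PTE under its lock */
                vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
                if (!pte_same(*vmf->pte, vmf->orig_pte)) {
                        /* PTE changed under us (e.g. MADV_DONTNEED): give up */
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                        return false;
                }
                if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                        /* still unreadable: clearing the CoW page is all
                         * that's left to do */
                        clear_page(kaddr);
                pte_unmap_unlock(vmf->pte, vmf->ptl);
        }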

    Reported-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Jeff Moyer
    Cc:
    Cc: Justin He
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • In set_pmd_migration_entry(), pmdp_invalidate() is used to change PMD
    atomically. But the PMD is read before that with an ordinary memory
    reading. If the THP (transparent huge page) is written between the PMD
    reading and pmdp_invalidate(), the PMD dirty bit may be lost, causing data
    corruption. The race window is quite small, but still possible in theory,
    so it needs to be fixed.

    The race is fixed via using the return value of pmdp_invalidate() to get
    the original content of PMD, which is a read/modify/write atomic
    operation. So no THP writing can occur in between.
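    A sketch of the change in set_pmd_migration_entry() (surrounding names
    follow mainline; treat the details as illustrative):

        /* one atomic read/modify/write: no THP write can slip in between */
        pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
        if (pmd_dirty(pmdval))
                set_page_dirty(page);
        /* ... the migration entry is then built from pmdval ... */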

    The race was introduced when THP migration support was added in commit
    616b8371539a ("mm: thp: enable thp migration in generic path"). But this
    fix depends on commit d52605d7cb30 ("mm: do not lose dirty and accessed
    bits in pmdp_invalidate()"), so it can easily be backported to kernels
    after v4.16. The race window is really small, though, so it may be fine
    not to backport the fix at all.

    Signed-off-by: Andrew Morton
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Zi Yan
    Reviewed-by: William Kucharski
    Acked-by: Kirill A. Shutemov
    Cc:
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/20200220075220.2327056-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • : A user reported a bug against a distribution kernel while running a
    : proprietary workload described as "memory intensive that is not swapping"
    : that is expected to apply to mainline kernels. The workload is
    : read/write/modifying ranges of memory and checking the contents. They
    : reported that within a few hours that a bad PMD would be reported followed
    : by a memory corruption where expected data was all zeros. A partial
    : report of the bad PMD looked like
    :
    : [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
    : [ 5195.341184] ------------[ cut here ]------------
    : [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
    : ....
    : [ 5195.410033] Call Trace:
    : [ 5195.410471] [] change_protection_range+0x7dd/0x930
    : [ 5195.410716] [] change_prot_numa+0x18/0x30
    : [ 5195.410918] [] task_numa_work+0x1fe/0x310
    : [ 5195.411200] [] task_work_run+0x72/0x90
    : [ 5195.411246] [] exit_to_usermode_loop+0x91/0xc2
    : [ 5195.411494] [] prepare_exit_to_usermode+0x31/0x40
    : [ 5195.411739] [] retint_user+0x8/0x10
    :
    : Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
    : was a false detection. The bug does not trigger if automatic NUMA
    : balancing or transparent huge pages is disabled.
    :
    : The bug is due to a race in change_pmd_range between a pmd_trans_huge and
    : pmd_none_or_clear_bad check without any locks held. During the
    : pmd_trans_huge check, a parallel protection update under lock can have
    : cleared the PMD and filled it with a prot_numa entry between the transhuge
    : check and the pmd_none_or_clear_bad check.
    :
    : While this could be fixed with heavy locking, it's only necessary to make
    : a copy of the PMD on the stack during change_pmd_range and avoid races. A
    : new helper is created for this as the check is quite subtle and the
    : existing similar helper is not suitable. This passed 154 hours of
    : testing (usually triggers between 20 minutes and 24 hours) without
    : detecting bad PMDs or corruption. A basic test of an autonuma-intensive
    : workload showed no significant change in behaviour.

    Although Mel withdrew the patch in the face of the LKML comment
    https://lkml.org/lkml/2017/4/10/922, the aforementioned race window is
    still open, and we have reports of the Linpack test reporting bad residuals
    after the bad PMD warning is observed. In addition to that, bad
    rss-counter and non-zero pgtables assertions are triggered on mm teardown
    for the task hitting the bad PMD.

    host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
    ....
    host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
    host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

    The issue is observed on a v4.18-based distribution kernel, but the race
    window is expected to be applicable to mainline kernels, as well.
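    A sketch of the stack-copy check (helper name as in the upstream fix;
    treat the body as illustrative):

        /* check a stable local snapshot of the pmd, so a parallel protection
         * update cannot change it between the checks */
        static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
        {
                pmd_t pmdval = pmd_read_atomic(pmd);

                if (pmd_none(pmdval))
                        return 1;
                if (pmd_trans_huge(pmdval))
                        return 0;       /* caller handles huge/prot_numa entries */
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }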

    [akpm@linux-foundation.org: fix comment typo, per Rafael]
    Signed-off-by: Andrew Morton
    Signed-off-by: Rafael Aquini
    Signed-off-by: Mel Gorman
    Cc:
    Cc: Zi Yan
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Feb, 2020

1 commit


22 Feb, 2020

5 commits

  • Pull arm64 fixes from Will Deacon:
    "It's all straightforward apart from the changes to mmap()/mremap() in
    relation to their handling of address arguments from userspace with
    non-zero tag bits in the upper byte.

    The change to brk() is necessary to fix a nasty user-visible
    regression in malloc(), but we tightened up mmap() and mremap() at the
    same time because they also allow the user to create virtual aliases
    by accident. It's much less likely than brk() to matter in practice,
    but enforcing the principle of "don't permit the creation of mappings
    using tagged addresses" leads to a straightforward ABI without having
    to worry about the "but what if a crazy program did foo?" aspect of
    things.

    Summary:

    - Fix regression in malloc() caused by ignored address tags in brk()

    - Add missing brackets around argument to untagged_addr() macro

    - Fix clang build when using binutils assembler

    - Fix silly typo in virtual memory map documentation"

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()
    docs: arm64: fix trivial spelling enought to enough in memory.rst
    arm64: memory: Add missing brackets to untagged_addr() macro
    arm64: lse: Fix LSE atomics with LLVM

    Linus Torvalds
     
  • When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
    doesn't work before sparse_init_one_section() is called.

    This leads to a crash when hotplug memory:

    BUG: unable to handle page fault for address: 0000000006400000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:__memset+0x24/0x30
    Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
    RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
    RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
    RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
    R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
    FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    sparse_add_section+0x1c9/0x26a
    __add_pages+0xbf/0x150
    add_pages+0x12/0x60
    add_memory_resource+0xc8/0x210
    __add_memory+0x62/0xb0
    acpi_memory_device_add+0x13f/0x300
    acpi_bus_attach+0xf6/0x200
    acpi_bus_scan+0x43/0x90
    acpi_device_hotplug+0x275/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    We should use the memmap pointer directly, as the code did before.

    On x86 the impact is limited to x86_32 builds, or x86_64 configurations
    that override the default setting for SPARSEMEM_VMEMMAP.

    Other memory hotplug archs (arm64, ia64, and ppc) also default to
    SPARSEMEM_VMEMMAP=y.
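    A sketch of the fix, poisoning through the memmap pointer the hotplug path
    already has rather than via pfn_to_page() (call site assumed to be
    sparse_add_section()):

        /* pfn_to_page() is not valid before sparse_init_one_section() when
         * plain SPARSEMEM is in use, so poison via memmap directly */
        page_init_poison(memmap, sizeof(struct page) * nr_pages);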

    [dan.j.williams@intel.com: changelog update]
    [rppt@linux.ibm.com: changelog update]
    Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Baoquan He
    Acked-by: David Hildenbrand
    Reviewed-by: Baoquan He
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Commit 68600f623d69 ("mm: don't miss the last page because of round-off
    error") makes the scan size round up to @denominator regardless of the
    memory cgroup's state, online or offline. This affects the overall
    reclaiming behavior: the corresponding LRU list is eligible for
    reclaiming only when its size logically right shifted by @sc->priority
    is bigger than zero in the former formula.

    For example, the inactive anonymous LRU list should have at least 0x4000
    pages to be eligible for reclaiming when we have 60/12 for
    swappiness/priority and without taking scan/rotation ratio into account.

    After the roundup is applied, the inactive anonymous LRU list becomes
    eligible for reclaiming when its size is bigger than or equal to 0x1000
    in the same condition.

    (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
    ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1

    aarch64 has 512MB huge page size when the base page size is 64KB. The
    memory cgroup that has a huge page is always eligible for reclaiming in
    that case.

    The reclaiming is likely to stop after the huge page is reclaimed,
    meaning the further iteration on @sc->priority and the sibling and child
    memory cgroups will be skipped. The overall behaviour has been changed.
    This fixes the issue by applying the roundup to offlined memory cgroups
    only, giving more preference to reclaiming memory from offlined memory
    cgroups. That sounds reasonable, as their memory is unlikely to be used
    by anyone.
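    A sketch of the resulting scan-size computation (assumed to live in
    get_scan_count() in mm/vmscan.c; variable names follow mainline):

        /* round up only for offline memcgs, so their last partial chunk is
         * still reclaimed; online memcgs keep the pre-roundup behaviour */
        scan = mem_cgroup_online(memcg) ?
               div64_u64(scan * fraction[file], denominator) :
               DIV64_U64_ROUND_UP(scan * fraction[file], denominator);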

    The issue was found by starting up 8 VMs on an Ampere Mustang machine,
    which has 8 CPUs and 16GB of memory. Each VM is given 2 vCPUs and
    2GB of memory. It took 264 seconds for all VMs to be completely up, and
    784MB of swap was consumed after that. With this patch applied, it took
    236 seconds and 60MB of swap to do the same thing. So there is a 10%
    performance improvement in my case. Note that KSM is disabled while THP
    is enabled in the testing.

                   total        used        free      shared  buff/cache   available
    Mem:           16196       10065        2049          16        4081        3749
    Swap:           8175         784        7391

                   total        used        free      shared  buff/cache   available
    Mem:           16196       11324        3656          24        1215        2936
    Swap:           8175          60        8115

    Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
    Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
    Signed-off-by: Gavin Shan
    Acked-by: Roman Gushchin
    Cc: [4.20+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • for_each_mem_cgroup() increases the css reference counter of each memory
    cgroup it visits and requires mem_cgroup_iter_break() to be used if the
    walk is cancelled.
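    A hedged example of the required pattern (the cancellation condition is
    illustrative):

        for_each_mem_cgroup(memcg) {
                if (cancel_condition) {
                        /* drop the css reference held by the iterator */
                        mem_cgroup_iter_break(NULL, memcg);
                        break;
                }
                /* ... per-memcg work ... */
        }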

    Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
    Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
    Signed-off-by: Vasily Averin
    Acked-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • claim_swapfile now always takes i_rwsem.

    Link: http://lkml.kernel.org/r/20200114161225.309792-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

20 Feb, 2020

1 commit

  • Currently the arm64 kernel ignores the top address byte passed to brk(),
    mmap() and mremap(). When the user is not aware of the 56-bit address
    limit or relies on the kernel to return an error, untagging such
    pointers has the potential to create address aliases in user-space.
    Passing a tagged address to munmap() or madvise() is still permitted since
    the tagged pointer is expected to be inside an existing mapping.

    The current behaviour breaks the existing glibc malloc() implementation,
    which relies on a brk() call with an address beyond the 56-bit limit being
    rejected by the kernel.

    Remove untagging in the above functions by partially reverting commit
    ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
    addition, update the arm64 tagged-address-abi.rst document accordingly.
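    As a sketch, the brk() side of the partial revert amounts to dropping the
    untagging line that ce18d171cb73 added (mmap() and mremap() get the same
    treatment); the file placement below is the mainline sys_brk() location
    and is otherwise illustrative:

        @@ mm/mmap.c: SYSCALL_DEFINE1(brk, unsigned long, brk)
        -	brk = untagged_addr(brk);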

    Link: https://bugzilla.redhat.com/1797052
    Fixes: ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk")
    Cc: # 5.4.x-
    Cc: Florian Weimer
    Reviewed-by: Andrew Morton
    Reported-by: Victor Stinner
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon

    Catalin Marinas
     

19 Feb, 2020

1 commit


09 Feb, 2020

2 commits

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:

    - bmap series from cmaiolino

    - getting rid of convolutions in copy_mount_options() (use a couple of
    copy_from_user() instead of the __get_user() crap)

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    saner copy_mount_options()
    fibmap: Reject negative block numbers
    fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
    ecryptfs: drop direct calls to ->bmap
    cachefiles: drop direct usage of ->bmap method.
    fs: Enable bmap() function to properly return errors

    Linus Torvalds
     

08 Feb, 2020

3 commits


07 Feb, 2020

2 commits


04 Feb, 2020

17 commits

  • Merge more updates from Andrew Morton:
    "The rest of MM and the rest of everything else: hotfixes, ipc, misc,
    procfs, lib, cleanups, arm"

    * emailed patches from Andrew Morton : (67 commits)
    ARM: dma-api: fix max_pfn off-by-one error in __dma_supported()
    treewide: remove redundant IS_ERR() before error code check
    include/linux/cpumask.h: don't calculate length of the input string
    lib: new testcases for bitmap_parse{_user}
    lib: rework bitmap_parse()
    lib: make bitmap_parse_user a wrapper on bitmap_parse
    lib: add test for bitmap_parse()
    bitops: more BITS_TO_* macros
    lib/string: add strnchrnul()
    proc: convert everything to "struct proc_ops"
    proc: decouple proc from VFS with "struct proc_ops"
    asm-generic/tlb: provide MMU_GATHER_TABLE_FREE
    asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER
    asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE
    asm-generic/tlb: rename HAVE_RCU_TABLE_FREE
    asm-generic/tlb: add missing CONFIG symbol
    asm-gemeric/tlb: remove stray function declarations
    asm-generic/tlb: avoid potential double flush
    mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
    powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
    ...

    Linus Torvalds
     
  • Pull drm ttm/mm updates from Dave Airlie:
    "Thomas Hellstrom has some more changes to the TTM layer that needed a
    patch to the mm subsystem.

    This adds a new mm API vmf_insert_mixed_prot to avoid an ugly hack
    that has limitations in the TTM layer"

    * tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm:
    mm, drm/ttm: Fix vm page protection handling
    mm: Add a vmf_insert_mixed_prot() function

    Linus Torvalds
     
  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • As described in the comment, the correct order for freeing pages is:

    1) unhook page
    2) TLB invalidate page
    3) free page

    This order equally applies to page directories.

    Currently there are two correct options:

    - use tlb_remove_page(), when all page directories are full pages and
    there are no further constraints placed by things like software
    walkers (HAVE_FAST_GUP).

    - use MMU_GATHER_RCU_TABLE_FREE and tlb_remove_table() when the
    architecture does not do IPI based TLB invalidate and has
    HAVE_FAST_GUP (or software TLB fill).

    This however leaves architectures that don't have page based directories
    but don't need RCU in a bind. For those, provide MMU_GATHER_TABLE_FREE,
    which provides the independent batching for directories without the
    additional RCU freeing.

    Link: http://lkml.kernel.org/r/20200116064531.483522-10-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    Link: http://lkml.kernel.org/r/20200116064531.483522-9-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    Link: http://lkml.kernel.org/r/20200116064531.483522-8-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    [akpm@linux-foundation.org: fix sparc64 Kconfig]
    Link: http://lkml.kernel.org/r/20200116064531.483522-7-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Architectures for which we have hardware walkers of the Linux page table
    should flush the TLB on mmu gather batch allocation failures and batch
    flush. Some architectures, like POWER, support multiple translation modes
    (hash and radix), and in the case of POWER only the radix translation mode
    needs the above TLBI. This is because for hash translation mode the kernel
    wants to avoid this extra flush, since there are no hardware walkers of
    the Linux page table. With radix translation, the hardware also walks the
    Linux page table, and with that, the kernel needs to make sure to TLB
    invalidate the page walk cache before page table pages are freed.

    More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
    TLB caches for RCU_TABLE_FREE")

    The changes to sparc are to make sure we keep the old behavior since we
    are now removing HAVE_RCU_TABLE_NO_INVALIDATE. The default value for
    tlb_needs_table_invalidate is to always force an invalidate and sparc can
    avoid the table invalidate. Hence we define tlb_needs_table_invalidate to
    false for sparc architecture.
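    A sketch of the resulting knob: an asm-generic default that always forces
    the invalidate, plus the sparc override described above (placement of the
    defines is illustrative):

        /* asm-generic default: always invalidate the page walk cache */
        #ifndef tlb_needs_table_invalidate
        #define tlb_needs_table_invalidate() (true)
        #endif

        /* sparc: no hardware walkers of the Linux page tables, so the extra
         * invalidate can be skipped */
        #define tlb_needs_table_invalidate() (false)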

    Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
    Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Michael Ellerman [powerpc]
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • struct mm_struct is quite large (~1664 bytes) and so allocating on the
    stack may cause problems as the kernel stack size is small.

    Since ptdump_walk_pgd_level_core() was only allocating the structure so
    that it could modify the pgd argument we can instead introduce a pgd
    override in struct mm_walk and pass this down the call stack to where it
    is needed.

    Since the correct mm_struct is now being passed down, it is now also
    unnecessary to take the mmap_sem semaphore because ptdump_walk_pgd() will
    now take the semaphore on the real mm.

    [steven.price@arm.com: restore missed arm64 changes]
    Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
    Signed-off-by: Steven Price
    Reported-by: Stephen Rothwell
    Cc: Catalin Marinas
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Rather than having to increment the 'depth' number by 1 in ptdump_hole(),
    let's change the meaning of 'level' in note_page() since that makes the
    code simpler.

    Note that for x86, the level numbers were previously increased by 1 in
    commit 45dcd2091363 ("x86/mm/dump_pagetables: Fix printout of p4d level")
    and the comment "Bit 7 has a different meaning" was not updated, so this
    change also makes the code match the comment again.

    Link: http://lkml.kernel.org/r/20191218162402.45610-24-steven.price@arm.com
    Signed-off-by: Steven Price
    Reviewed-by: Catalin Marinas
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Add a generic version of page table dumping that architectures can opt-in
    to.

    Link: http://lkml.kernel.org/r/20191218162402.45610-20-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • The pte_hole() callback is called at multiple levels of the page tables.
    Code dumping the kernel page tables needs to know at what depth the
    missing entry is. Add this as an extra parameter to pte_hole(). When the
    depth isn't known (e.g. when processing a vma) then -1 is passed.

    The depth that is reported is the actual level where the entry is missing
    (ignoring any folding that is in place), i.e. any levels where
    PTRS_PER_P?D is set to 1 are ignored.

    Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
    natural numbers as levels 2/3/4.
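    The callback in struct mm_walk_ops then takes the depth as a new
    parameter:

        int (*pte_hole)(unsigned long addr, unsigned long next,
                        int depth, struct mm_walk *walk);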

    Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
    Signed-off-by: Steven Price
    Tested-by: Zong Li
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • If walk_pte_range() is called with a 'end' argument that is beyond the
    last page of memory (e.g. ~0UL) then the comparison between 'addr' and
    'end' will always fail and the loop will be infinite. Instead change the
    comparison to >= while accounting for overflow.
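    A sketch of the reworked loop in walk_pte_range() (details illustrative):

        for (;;) {
                err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
                if (err)
                        break;
                /* compare against end - PAGE_SIZE so an 'end' at the very top
                 * of the address space (e.g. ~0UL) cannot wrap and loop forever */
                if (addr >= end - PAGE_SIZE)
                        break;
                addr += PAGE_SIZE;
                pte++;
        }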

    Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • walk_page_range_novma() can be used to walk the page tables of the kernel
    or of firmware.
    by a struct page and so it isn't (in general) possible to take the PTE
    lock for the pte_entry() callback. So update walk_pte_range() to only
    take the lock when no_vma==false by splitting out the inner loop to a
    separate function and add a comment explaining the difference to
    walk_page_range_novma().
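    A sketch of the resulting split (an inner helper doing the per-PTE loop,
    with the lock taken only when a vma is present; helper name as merged
    upstream, details illustrative):

        if (walk->no_vma) {
                pte = pte_offset_map(pmd, addr);
                err = walk_pte_range_inner(pte, addr, end, walk);
                pte_unmap(pte);
        } else {
                pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
                err = walk_pte_range_inner(pte, addr, end, walk);
                pte_unmap_unlock(pte, ptl);
        }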

    Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range for
    vma(VM_PFNMAP)", page_table_walk() will report any kernel area as a hole,
    because it lacks a vma.

    This means each arch has re-implemented page table walking when needed,
    for example in the per-arch ptdump walker.

    Remove the requirement to have a vma in the generic code and add a new
    function walk_page_range_novma() which ignores the VMAs and simply walks
    the page tables.
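    After the series (including the later pgd-override change), the new entry
    point ends up with roughly this prototype; treat the exact argument list
    as an assumption based on the description here:

        int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
                                  unsigned long end,
                                  const struct mm_walk_ops *ops,
                                  pgd_t *pgd, void *private);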

    Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
    ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
    users. We're about to add users so reintroduce them, along with
    p4d_entry() as we now have 5 levels of tables.

    Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized
    transparent hugepages") already re-added pud_entry() but with different
    semantics to the other callbacks. This commit reverts the semantics back
    to match the other callbacks.

    To support hmm.c which now uses the new semantics of pud_entry() a new
    member ('action') of struct mm_walk is added which allows the callbacks to
    either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
    repeat the callback (ACTION_AGAIN). hmm.c is then updated to call
    pud_trans_huge_lock() itself and make use of the splitting/retry logic of
    the core code.

    After this change pud_entry() is called for all entries, not just
    transparent huge pages.
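    A hedged illustration of the new control flow (the callback name is made
    up; the enum values are the ones described above):

        enum page_walk_action {
                ACTION_SUBTREE = 0,     /* descend to the next level (default) */
                ACTION_CONTINUE,        /* skip the subtree under this entry */
                ACTION_AGAIN,           /* re-run the callback, e.g. after a split */
        };

        static int my_pud_entry(pud_t *pud, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
        {
                if (pud_trans_huge(*pud)) {
                        /* huge entry handled here: don't descend further */
                        walk->action = ACTION_CONTINUE;
                }
                return 0;
        }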

    [arnd@arndb.de: fix unused variable warning]
    Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
    Signed-off-by: Steven Price
    Signed-off-by: Arnd Bergmann
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Since 5.5-rc1 the last user of this function is gone, so remove the
    functionality.

    See commit
    2ad9d7747c10 ("netfilter: conntrack: free extension area immediately")
    for details.

    Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.de
    Signed-off-by: Florian Westphal
    Acked-by: Andrew Morton
    Acked-by: David Rientjes
    Reviewed-by: David Hildenbrand
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Florian Westphal