04 Jan, 2021

1 commit

  • This is the 5.10.4 stable release

    * tag 'v5.10.4': (717 commits)
    Linux 5.10.4
    x86/CPU/AMD: Save AMD NodeId as cpu_die_id
    drm/edid: fix objtool warning in drm_cvt_modes()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/gpu/drm/imx/dcss/dcss-plane.c
    drivers/media/i2c/ov5640.c

    Jason Liu
     

30 Dec, 2020

2 commits

  • commit 17179aeb9d34cc81e1a4ae3f85e5b12b13a1f8d0 upstream.

    MMU_FTR_TYPE_44x cannot be checked by cpu_has_feature().

    Use mmu_has_feature() instead.
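
    The distinction can be sketched in plain C: CPU features and MMU
    features live in separate masks, so testing an MMU feature bit with
    cpu_has_feature() compares it against the wrong word. Bit values and
    mask names below are illustrative, not the kernel's real layout:

```c
#include <stdbool.h>

/* Illustrative feature masks; the real kernel keeps CPU and MMU
 * features in separate per-CPU-spec words. */
#define MMU_FTR_TYPE_44x (1UL << 3)           /* made-up bit position */

static unsigned long cpu_features = 0;         /* bit 3 means something else here */
static unsigned long mmu_features = MMU_FTR_TYPE_44x;

static bool cpu_has_feature(unsigned long feature)
{
    return cpu_features & feature;             /* wrong mask for MMU features */
}

static bool mmu_has_feature(unsigned long feature)
{
    return mmu_features & feature;             /* correct mask */
}
```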

    Fixes: 23eb7f560a2a ("powerpc: Convert flush_icache_range & friends to C")
    Cc: stable@vger.kernel.org
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/ceede82fadf37f3b8275e61fcf8cf29a3e2ec7fe.1602351011.git.christophe.leroy@csgroup.eu
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit 7ceb40027e19567a0a066e3b380cc034cdd9a124 ]

    The verification and message introduced by commit 374f3f5979f9
    ("powerpc/mm/hash: Handle user access of kernel address gracefully")
    applies to all platforms, it should not be limited to BOOK3S.

    Make the BOOK3S version of sanity_check_fault() the one for all,
    and bail out earlier if not BOOK3S.
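
    A hedged sketch of the reshuffled check (the TASK_SIZE value and the
    IS_BOOK3S toggle are illustrative; the real function also inspects
    DSISR bits on BOOK3S):

```c
#include <stdbool.h>
#include <stdio.h>

#define TASK_SIZE 0x0000400000000000UL  /* illustrative user/kernel boundary */
#define IS_BOOK3S 0                     /* pretend this build is not BOOK3S */

/* The user-access-of-kernel-address check now runs on every platform;
 * non-BOOK3S platforms bail out before the BOOK3S-specific checks. */
static bool sanity_check_fault(bool is_user, unsigned long address)
{
    if (is_user && address >= TASK_SIZE) {
        printf("User access of kernel address (%#lx)\n", address);
        return false;
    }
    if (!IS_BOOK3S)
        return true;                    /* bail out earlier if not BOOK3S */
    /* ... BOOK3S-only DSISR sanity checks would follow here ... */
    return true;
}
```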

    Fixes: 374f3f5979f9 ("powerpc/mm/hash: Handle user access of kernel address gracefully")
    Signed-off-by: Christophe Leroy
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/fe199d5af3578d3bf80035d203a94d742a7a28af.1607491748.git.christophe.leroy@csgroup.eu
    Signed-off-by: Sasha Levin

    Christophe Leroy
     


08 Dec, 2020

1 commit

  • Since commit c33165253492 ("powerpc: use non-set_fs based maccess
    routines"), userspace access is no longer granted when using
    copy_from_kernel_nofault().

    However, kthread_probe_data() uses copy_from_kernel_nofault()
    to check validity of pointers. When the pointer is NULL,
    it points to userspace, leading to a KUAP fault and triggering
    the following big hammer warning many times when you request
    a sysrq "show task":

    [ 1117.202054] ------------[ cut here ]------------
    [ 1117.202102] Bug: fault blocked by AP register !
    [ 1117.202261] WARNING: CPU: 0 PID: 377 at arch/powerpc/include/asm/nohash/32/kup-8xx.h:66 do_page_fault+0x4a8/0x5ec
    [ 1117.202310] Modules linked in:
    [ 1117.202428] CPU: 0 PID: 377 Comm: sh Tainted: G W 5.10.0-rc5-01340-g83f53be2de31-dirty #4175
    [ 1117.202499] NIP: c0012048 LR: c0012048 CTR: 00000000
    [ 1117.202573] REGS: cacdbb88 TRAP: 0700 Tainted: G W (5.10.0-rc5-01340-g83f53be2de31-dirty)
    [ 1117.202625] MSR: 00021032 CR: 24082222 XER: 20000000
    [ 1117.202899]
    [ 1117.202899] GPR00: c0012048 cacdbc40 c2929290 00000023 c092e554 00000001 c09865e8 c092e640
    [ 1117.202899] GPR08: 00001032 00000000 00000000 00014efc 28082224 100d166a 100a0920 00000000
    [ 1117.202899] GPR16: 100cac0c 100b0000 1080c3fc 1080d685 100d0000 100d0000 00000000 100a0900
    [ 1117.202899] GPR24: 100d0000 c07892ec 00000000 c0921510 c21f4440 0000005c c0000000 cacdbc80
    [ 1117.204362] NIP [c0012048] do_page_fault+0x4a8/0x5ec
    [ 1117.204461] LR [c0012048] do_page_fault+0x4a8/0x5ec
    [ 1117.204509] Call Trace:
    [ 1117.204609] [cacdbc40] [c0012048] do_page_fault+0x4a8/0x5ec (unreliable)
    [ 1117.204771] [cacdbc70] [c00112f0] handle_page_fault+0x8/0x34
    [ 1117.204911] --- interrupt: 301 at copy_from_kernel_nofault+0x70/0x1c0
    [ 1117.204979] NIP: c010dbec LR: c010dbac CTR: 00000001
    [ 1117.205053] REGS: cacdbc80 TRAP: 0301 Tainted: G W (5.10.0-rc5-01340-g83f53be2de31-dirty)
    [ 1117.205104] MSR: 00009032 CR: 28082224 XER: 00000000
    [ 1117.205416] DAR: 0000005c DSISR: c0000000
    [ 1117.205416] GPR00: c0045948 cacdbd38 c2929290 00000001 00000017 00000017 00000027 0000000f
    [ 1117.205416] GPR08: c09926ec 00000000 00000000 3ffff000 24082224
    [ 1117.206106] NIP [c010dbec] copy_from_kernel_nofault+0x70/0x1c0
    [ 1117.206202] LR [c010dbac] copy_from_kernel_nofault+0x30/0x1c0
    [ 1117.206258] --- interrupt: 301
    [ 1117.206372] [cacdbd38] [c004bbb0] kthread_probe_data+0x44/0x70 (unreliable)
    [ 1117.206561] [cacdbd58] [c0045948] print_worker_info+0xe0/0x194
    [ 1117.206717] [cacdbdb8] [c00548ac] sched_show_task+0x134/0x168
    [ 1117.206851] [cacdbdd8] [c005a268] show_state_filter+0x70/0x100
    [ 1117.206989] [cacdbe08] [c039baa0] sysrq_handle_showstate+0x14/0x24
    [ 1117.207122] [cacdbe18] [c039bf18] __handle_sysrq+0xac/0x1d0
    [ 1117.207257] [cacdbe48] [c039c0c0] write_sysrq_trigger+0x4c/0x74
    [ 1117.207407] [cacdbe68] [c01fba48] proc_reg_write+0xb4/0x114
    [ 1117.207550] [cacdbe88] [c0179968] vfs_write+0x12c/0x478
    [ 1117.207686] [cacdbf08] [c0179e60] ksys_write+0x78/0x128
    [ 1117.207826] [cacdbf38] [c00110d0] ret_from_syscall+0x0/0x34
    [ 1117.207938] --- interrupt: c01 at 0xfd4e784
    [ 1117.208008] NIP: 0fd4e784 LR: 0fe0f244 CTR: 10048d38
    [ 1117.208083] REGS: cacdbf48 TRAP: 0c01 Tainted: G W (5.10.0-rc5-01340-g83f53be2de31-dirty)
    [ 1117.208134] MSR: 0000d032 CR: 44002222 XER: 00000000
    [ 1117.208470]
    [ 1117.208470] GPR00: 00000004 7fc34090 77bfb4e0 00000001 1080fa40 00000002 7400000f fefefeff
    [ 1117.208470] GPR08: 7f7f7f7f 10048d38 1080c414 7fc343c0 00000000
    [ 1117.209104] NIP [0fd4e784] 0xfd4e784
    [ 1117.209180] LR [0fe0f244] 0xfe0f244
    [ 1117.209236] --- interrupt: c01
    [ 1117.209274] Instruction dump:
    [ 1117.209353] 714a4000 418200f0 73ca0001 40820084 73ca0032 408200f8 73c90040 4082ff60
    [ 1117.209727] 0fe00000 3c60c082 386399f4 48013b65 80010034 3860000b 7c0803a6
    [ 1117.210102] ---[ end trace 1927c0323393af3e ]---

    To avoid that, copy_from_kernel_nofault_allowed() is used to check
    whether the address is a valid kernel address. But the default
    version of it returns true for any address.

    Provide a powerpc version of copy_from_kernel_nofault_allowed()
    that returns false when the address is below TASK_USER_MAX,
    so that copy_from_kernel_nofault() will return -ERANGE.
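
    The idea can be sketched in userspace C. Here TASK_USER_MAX is a toy
    constant and memcpy() stands in for the kernel's fault-tolerant loads;
    only the shape of the predicate and the -ERANGE path matches the
    description above:

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define TASK_USER_MAX 0x1000UL   /* illustrative boundary, not the real value */

/* Reject addresses that fall in the user range, NULL included. */
static bool copy_from_kernel_nofault_allowed(const void *src, size_t size)
{
    (void)size;
    return (unsigned long)src >= TASK_USER_MAX;
}

static long copy_from_kernel_nofault(void *dst, const void *src, size_t n)
{
    if (!copy_from_kernel_nofault_allowed(src, n))
        return -ERANGE;          /* callers like kthread_probe_data() handle this */
    memcpy(dst, src, n);         /* the kernel uses fault-tolerant accessors here */
    return 0;
}
```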

    Fixes: c33165253492 ("powerpc: use non-set_fs based maccess routines")
    Reported-by: Qian Cai
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/18bcb456d32a3e74f5ae241fd6f1580c092d07f5.1607360230.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     

06 Dec, 2020

1 commit

  • Pull powerpc fixes from Michael Ellerman:
    "Some more powerpc fixes for 5.10:

    - Three commits fixing possible missed TLB invalidations for
    multi-threaded processes when CPUs are hotplugged in and out.

    - A fix for a host crash triggerable by host userspace (qemu) in KVM
    on Power9.

    - A fix for a host crash in machine check handling when running HPT
    guests on a HPT host.

    - One commit fixing potential missed TLB invalidations when using the
    hash MMU on Power9 or later.

    - A regression fix for machines with CPUs on node 0 but no memory.

    Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
    Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
    Dronamraju"

    * tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
    KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
    powerpc/numa: Fix a regression on memoryless node 0
    powerpc/64s: Trim offlined CPUs from mm_cpumasks
    kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
    powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
    powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation

    Linus Torvalds
     

27 Nov, 2020

1 commit

  • Commit e75130f20b1f ("powerpc/numa: Offline memoryless cpuless node 0")
    offlines node 0 and expects nodes to be subsequently onlined when CPUs
    or nodes are detected.

    Commit 6398eaa26816 ("powerpc/numa: Prefer node id queried from vphn")
    skips onlining node 0 when CPUs are associated with node 0.

    On systems with node 0 having CPUs but no memory, this causes node 0
    to be marked offline. This causes issues at boot time when trying to
    set the memory node for online CPUs while building the zonelist.

    0:mon> t
    [link register ] c000000000400354 __build_all_zonelists+0x164/0x280
    [c00000000161bda0] c0000000016533c8 node_states+0x20/0xa0 (unreliable)
    [c00000000161bdc0] c000000000400384 __build_all_zonelists+0x194/0x280
    [c00000000161be30] c000000001041800 build_all_zonelists_init+0x4c/0x118
    [c00000000161be80] c0000000004020d0 build_all_zonelists+0x190/0x1b0
    [c00000000161bef0] c000000001003cf8 start_kernel+0x18c/0x6a8
    [c00000000161bf90] c00000000000adb4 start_here_common+0x1c/0x3e8
    0:mon> r
    R00 = c000000000400354 R16 = 000000000b57a0e8
    R01 = c00000000161bda0 R17 = 000000000b57a6b0
    R02 = c00000000161ce00 R18 = 000000000b5afee8
    R03 = 0000000000000000 R19 = 000000000b6448a0
    R04 = 0000000000000000 R20 = fffffffffffffffd
    R05 = 0000000000000000 R21 = 0000000001400000
    R06 = 0000000000000000 R22 = 000000001ec00000
    R07 = 0000000000000001 R23 = c000000001175580
    R08 = 0000000000000000 R24 = c000000001651ed8
    R09 = c0000000017e84d8 R25 = c000000001652480
    R10 = 0000000000000000 R26 = c000000001175584
    R11 = c000000c7fac0d10 R27 = c0000000019568d0
    R12 = c000000000400180 R28 = 0000000000000000
    R13 = c000000002200000 R29 = c00000000164dd78
    R14 = 000000000b579f78 R30 = 0000000000000000
    R15 = 000000000b57a2b8 R31 = c000000001175584
    pc = c000000000400194 local_memory_node+0x24/0x80
    cfar= c000000000074334 mcount+0xc/0x10
    lr = c000000000400354 __build_all_zonelists+0x164/0x280
    msr = 8000000002001033 cr = 44002284
    ctr = c000000000400180 xer = 0000000000000001 trap = 380
    dar = 0000000000001388 dsisr = c00000000161bc90
    0:mon>

    Fix this by setting node to be online while onlining CPUs that belong to
    node 0.

    Fixes: e75130f20b1f ("powerpc/numa: Offline memoryless cpuless node 0")
    Fixes: 6398eaa26816 ("powerpc/numa: Prefer node id queried from vphn")
    Reported-by: Milan Mohanty
    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201127053738.10085-1-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     

26 Nov, 2020

3 commits

  • When offlining a CPU, powerpc/64s does not flush TLBs, rather it just
    leaves the CPU set in mm_cpumasks, so it continues to receive TLBIEs
    to manage its TLBs.

    However the exit_flush_lazy_tlbs() function expects that after
    returning, all CPUs (except self) have flushed TLBs for that mm, in
    which case TLBIEL can be used for this flush. This breaks for offline
    CPUs because they don't get the IPI to flush their TLB. This can lead
    to stale translations.

    Fix this by clearing the CPU from mm_cpumasks, then flushing all TLBs
    before going offline.

    These offlined CPU bits stuck in the cpumask also prevent the cpumask
    from being trimmed back to local mode, which means continual broadcast
    IPIs or TLBIEs are needed for TLB flushing. This patch prevents that
    situation too.
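
    The ordering matters: the CPU must drop itself from each mm_cpumask
    first, and then flush its local TLB, so a later exit_flush_lazy_tlbs()
    neither misses the dying CPU nor leaves it with stale entries. A toy
    single-word cpumask makes the sequence concrete (all names here are
    illustrative, not the kernel's):

```c
#include <stdbool.h>

static unsigned long mm_cpumask = 0xfUL;  /* CPUs 0-3 active in this mm */
static bool local_tlb_flushed[4];

/* Step 1: stop being a broadcast-TLBIE target; step 2: flush local
 * TLBs; only then may the CPU go offline. */
static void cpu_offline_mm_cleanup(int cpu)
{
    mm_cpumask &= ~(1UL << cpu);          /* clear self from mm_cpumasks */
    local_tlb_flushed[cpu] = true;        /* then flush, before going offline */
}
```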

    A cast of many were involved in working this out, but in particular
    Milton, Aneesh, Paul made key discoveries.

    Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Debugged-by: Milton Miller
    Debugged-by: Aneesh Kumar K.V
    Debugged-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201126102530.691335-5-npiggin@gmail.com

    Nicholas Piggin
     
  • tlbiel_all() is presently not usable in !HVMODE when running hash;
    remove the HV privileged flushes when running in a guest to make it
    usable.

    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201126102530.691335-3-npiggin@gmail.com

    Nicholas Piggin
     
  • A typo has the R field of the instruction assigned by lucky dip a la
    register allocator.

    Fixes: d4748276ae14c ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201126102530.691335-2-npiggin@gmail.com

    Nicholas Piggin
     

23 Nov, 2020

1 commit

  • The core-mm has a default __weak implementation of phys_to_target_node()
    to mirror the weak definition of memory_add_physaddr_to_nid(). That
    symbol is exported for modules. However, while the export in
    mm/memory_hotplug.c exported the symbol in the configuration cases of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=y

    ...and:

    CONFIG_NUMA_KEEP_MEMINFO=n
    CONFIG_MEMORY_HOTPLUG=y

    ...it failed to export the symbol in the case of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=n

    Not only is that broken, but Christoph points out that the kernel should
    not be exporting any __weak symbol, which means that the
    memory_add_physaddr_to_nid() example that phys_to_target_node() copied
    is broken too.

    Rework the definition of phys_to_target_node() and
    memory_add_physaddr_to_nid() to not require weak symbols. Move to the
    common arch override design-pattern of an asm header defining a symbol
    to replace the default implementation.
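
    The arch-override pattern the commit adopts can be shown in a single
    file: the "arch" header defines the symbol and announces it with a
    #define of the same name, and the generic header supplies its default
    only when no override is visible. The node numbering below is a toy:

```c
/* "arch" side, e.g. asm/sparsemem.h: real implementation plus marker. */
static int arch_phys_to_target_node(unsigned long start)
{
    return start >= 0x100000000UL ? 1 : 0;   /* toy: high addresses on node 1 */
}
#define phys_to_target_node arch_phys_to_target_node

/* generic side: default stub compiled only when no arch override exists. */
#ifndef phys_to_target_node
static inline int phys_to_target_node(unsigned long start)
{
    (void)start;
    return 0;                                /* default node */
}
#endif
```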

    The only common header that all memory_add_physaddr_to_nid() producing
    architectures implement is asm/sparsemem.h. In fact, powerpc already
    defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
    Double-down on that observation and define phys_to_target_node() where
    necessary in asm/sparsemem.h. An alternate consideration that was
    discarded was to put this override in asm/numa.h, but that entangles
    with the definition of MAX_NUMNODES relative to the inclusion of
    linux/nodemask.h, and requires powerpc to grow a new header.

    The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
    now that the symbol is properly exported / stubbed in all combinations
    of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.

    [dan.j.williams@intel.com: v4]
    Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
    [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
    Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com

    Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
    Reported-by: Randy Dunlap
    Reported-by: Thomas Gleixner
    Reported-by: kernel test robot
    Reported-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Tested-by: Randy Dunlap
    Tested-by: Thomas Gleixner
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Christoph Hellwig
    Cc: Joao Martins
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Vishal Verma
    Cc: Stephen Rothwell
    Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

17 Oct, 2020

2 commits

  • Pull powerpc updates from Michael Ellerman:

    - A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
    it for powerpc, as well as a related fix for sparc.

    - Remove support for PowerPC 601.

    - Some fixes for watchpoints & addition of a new ptrace flag for
    detecting ISA v3.1 (Power10) watchpoint features.

    - A fix for kernels using 4K pages and the hash MMU on bare metal
    Power9 systems with > 16TB of RAM, or RAM on the 2nd node.

    - A basic idle driver for shallow stop states on Power10.

    - Tweaks to our sched domains code to better inform the scheduler about
    the hardware topology on Power9/10, where two SMT4 cores can be
    presented by firmware as an SMT8 core.

    - A series doing further reworks & cleanups of our EEH code.

    - Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
    to prevent root from overwriting kernel memory.

    - Other smaller features, fixes & cleanups.

    Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
    Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
    Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
    Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
    Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
    Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
    Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
    Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
    Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
    Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
    Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
    Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
    Yingliang, zhengbin.

    * tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
    Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
    selftests/powerpc: Fix eeh-basic.sh exit codes
    cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
    powerpc/time: Make get_tb() common to PPC32 and PPC64
    powerpc/time: Make get_tbl() common to PPC32 and PPC64
    powerpc/time: Remove get_tbu()
    powerpc/time: Avoid using get_tbl() and get_tbu() internally
    powerpc/time: Make mftb() common to PPC32 and PPC64
    powerpc/time: Rename mftbl() to mftb()
    powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
    powerpc/32s: Rename head_32.S to head_book3s_32.S
    powerpc/32s: Setup the early hash table at all time.
    powerpc/time: Remove ifdef in get_dec() and set_dec()
    powerpc: Remove get_tb_or_rtc()
    powerpc: Remove __USE_RTC()
    powerpc: Tidy up a bit after removal of PowerPC 601.
    powerpc: Remove support for PowerPC 601
    powerpc: Remove PowerPC 601
    powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
    powerpc: Remove CONFIG_PPC601_SYNC_FIX
    ...

    Linus Torvalds
     
  • powerpc used to set the pte specific flags in set_pte_at(). This is
    different from other architectures. To be consistent with other
    architectures, update pfn_pte to set _PAGE_PTE on ppc64. Also, drop
    the now-unused pte_mkpte.

    We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without
    setting _PAGE_PTE bit. We will remove that after a few releases.

    With respect to huge pmd entries, pmd_mkhuge() takes care of adding the
    _PAGE_PTE bit.
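
    A minimal sketch of the change, with illustrative bit positions and
    page shift rather than ppc64's real layout: pfn_pte() now stamps
    _PAGE_PTE itself instead of relying on set_pte_at() to do it, and the
    added VM_WARN_ON() amounts to the validity check shown:

```c
#include <stdbool.h>

#define PAGE_SHIFT 12UL
#define _PAGE_PTE  (1UL << 62)   /* illustrative bit position */

typedef unsigned long pte_t;

/* After the change, every pte built from a pfn carries _PAGE_PTE. */
static pte_t pfn_pte(unsigned long pfn, unsigned long pgprot)
{
    return (pfn << PAGE_SHIFT) | pgprot | _PAGE_PTE;
}

/* The VM_WARN_ON() the commit adds boils down to this check. */
static bool pte_looks_valid(pte_t pte)
{
    return (pte & _PAGE_PTE) != 0;
}
```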

    [akpm@linux-foundation.org: whitespace fix, per Christophe]

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Reviewed-by: Christophe Leroy
    Cc: Anshuman Khandual
    Cc: Michael Ellerman
    Link: https://lkml.kernel.org/r/20200902114222.181353-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

16 Oct, 2020

1 commit

  • Pull dma-mapping updates from Christoph Hellwig:

    - rework the non-coherent DMA allocator

    - move private definitions out of

    - lower CMA_ALIGNMENT (Paul Cercueil)

    - remove the omap1 dma address translation in favor of the common code

    - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)

    - support per-node DMA CMA areas (Barry Song)

    - increase the default seg boundary limit (Nicolin Chen)

    - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)

    - various cleanups

    * tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
    ARM/ixp4xx: add a missing include of dma-map-ops.h
    dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
    dma-direct: factor out a dma_direct_alloc_from_pool helper
    dma-direct check for highmem pages in dma_direct_alloc_pages
    dma-mapping: merge into
    dma-mapping: move large parts of to kernel/dma
    dma-mapping: move dma-debug.h to kernel/dma/
    dma-mapping: remove
    dma-mapping: merge into
    dma-contiguous: remove dma_contiguous_set_default
    dma-contiguous: remove dev_set_cma_area
    dma-contiguous: remove dma_declare_contiguous
    dma-mapping: split
    cma: decrease CMA_ALIGNMENT lower limit to 2
    firewire-ohci: use dma_alloc_pages
    dma-iommu: implement ->alloc_noncoherent
    dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
    dma-mapping: add a new dma_alloc_pages API
    dma-mapping: remove dma_cache_sync
    53c700: convert to dma_alloc_noncoherent
    ...

    Linus Torvalds
     

14 Oct, 2020

2 commits

  • There are several occurrences of the following pattern:

    for_each_memblock(memory, reg) {
            start = __pfn_to_phys(memblock_region_memory_base_pfn(reg));
            end = __pfn_to_phys(memblock_region_memory_end_pfn(reg));

            /* do something with start and end */
    }

    Using for_each_mem_range() iterator is more appropriate in such cases and
    allows simpler and cleaner code.
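
    With the iterator the loop collapses to the shape sketched below. To
    keep the sketch standalone, memblock is mocked with a two-entry region
    table and a stand-in macro; the real for_each_mem_range() yields
    physical start/end addresses straight from memblock.memory:

```c
#include <stddef.h>

struct region { unsigned long base, size; };

/* Toy stand-in for memblock.memory. */
static const struct region memory[] = {
    { 0x0000, 0x1000 },
    { 0x2000, 0x1000 },
};

#define NUM_REGIONS (sizeof(memory) / sizeof(memory[0]))

/* Shape of the rewritten loop: start/end come straight from the iterator. */
#define for_each_mem_range(i, p_start, p_end)                    \
    for ((i) = 0;                                                \
         (i) < NUM_REGIONS &&                                    \
         (*(p_start) = memory[i].base,                           \
          *(p_end)   = memory[i].base + memory[i].size, 1);      \
         (i)++)

static unsigned long total_memory(void)
{
    unsigned long start, end, total = 0;
    size_t i;

    for_each_mem_range(i, &start, &end)
        total += end - start;    /* do something with start and end */

    return total;
}
```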

    [akpm@linux-foundation.org: fix arch/arm/mm/pmsa-v7.c build]
    [rppt@linux.ibm.com: mips: fix cavium-octeon build caused by memblock refactoring]
    Link: http://lkml.kernel.org/r/20200827124549.GD167163@linux.ibm.com

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Miguel Ojeda
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-13-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • There are several occurrences of the following pattern:

    for_each_memblock(memory, reg) {
            start_pfn = memblock_region_memory_base_pfn(reg);
            end_pfn = memblock_region_memory_end_pfn(reg);

            /* do something with start_pfn and end_pfn */
    }

    Rather than iterate over all memblock.memory regions and each time query
    for their start and end PFNs, use for_each_mem_pfn_range() iterator to get
    simpler and clearer code.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Miguel Ojeda [.clang-format]
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Oct, 2020

6 commits

  • At the time being, an early hash table is set up when
    CONFIG_KASAN is selected.

    There is nothing wrong with setting such an early hash table
    all the time, even if it is not used. This is a statically
    allocated 256 kB table which lies in the init data section.

    This makes the code simpler and may in the future allow setting up
    early IO mappings with fixmap instead of hard-coding BATs.

    Put create_hpte() and flush_hash_pages() in the .ref.text section
    in order to avoid warning for the reference to early_hash[]. This
    reference is removed by MMU_init_hw_patch() before init memory is
    freed.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/b8f8101c368b8a6451844a58d7bd7d83c14cf2aa.1601566529.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • The removal of the 601 left some standalone blocks from
    former if/else. Drop the { } and re-indent.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/31c4cd093963f22831bf388449056ee045533d3b.1601362098.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • PowerPC 601 has been retired.

    Remove all associated specific code.

    CPU_FTRS_PPC601 has CPU_FTR_COHERENT_ICACHE and CPU_FTR_COMMON.

    CPU_FTR_COMMON is already present via other CPU_FTRS.
    None of the remaining CPUs selects CPU_FTR_COHERENT_ICACHE.

    So CPU_FTRS_PPC601 can be removed from the possible features,
    hence can be removed completely.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/60b725d55e21beec3335175c20b77903ff98284f.1601362098.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • Those macros are now empty at all times. Drop them.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/7990bb63fc53e460bfa94f8040184881d9e6fbc3.1601362098.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • Make it consistent with other usages.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201007114836.282468-5-aneesh.kumar@linux.ibm.com

    Aneesh Kumar K.V
     
  • Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block panic")
    make sure different variables tracking lmb_size are updated to be 64 bit.

    Fixes: af9d00e93a4f ("powerpc/mm/radix: Create separate mappings for hot-plugged memory")
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201007114836.282468-4-aneesh.kumar@linux.ibm.com

    Aneesh Kumar K.V
     

06 Oct, 2020

3 commits

  • During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
    to determine which node id (nid) to use when later calling __add_memory().

    This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an
    appropriate nid for a given address by looking up the LMB containing the
    address and then passing that LMB to of_drconf_to_nid_single() to get the
    nid. In dlpar_add_lmb() we get this address from the LMB itself.

    In short, we have a pointer to an LMB and then we are searching for
    that LMB *again* in order to find its nid.

    If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
    can skip the redundant lookup. The only error handling we need to
    duplicate from memory_add_physaddr_to_nid() is the fallback to the
    default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
    an invalid nid.
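
    The duplicated fallback amounts to only a few lines. In this sketch
    of_drconf_to_nid_single() is stubbed and the helper name lmb_to_nid()
    is invented for illustration; only the fallback logic follows the
    description above:

```c
#define NUMA_NO_NODE  (-1)
#define MAX_NUMNODES  8

struct drmem_lmb { unsigned long base_addr; };

/* Stub: pretend the device tree maps everything below 1 GiB to node 1
 * and knows nothing about other addresses. */
static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
{
    return lmb->base_addr < 0x40000000UL ? 1 : NUMA_NO_NODE;
}

static int first_online_node = 0;

/* The fallback dlpar_add_lmb() must now carry itself. */
static int lmb_to_nid(struct drmem_lmb *lmb)
{
    int nid = of_drconf_to_nid_single(lmb);

    if (nid < 0 || nid >= MAX_NUMNODES)
        nid = first_online_node;   /* as memory_add_physaddr_to_nid() does */

    return nid;
}
```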

    Skipping the extra lookup makes hot-add operations faster, especially
    on machines with many LMBs.

    Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000
    LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
    completed the same operation in ~2 hours:

    Unpatched (12450 seconds):
    Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
    Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
    [...]
    Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

    Patched (7065 seconds):
    Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
    Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
    [...]
    Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

    It should be noted that the speedup grows more substantial when
    hot-adding LMBs at the end of the drconf range. This is because we
    are skipping a linear LMB search.

    To see the distinction, consider smaller hot-add test on the same
    LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096
    LMBs completed less than 1 second faster on a patched kernel:

    Unpatched:
    Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

    104,753.42 msec task-clock # 0.992 CPUs utilized ( +- 0.55% )
    4,708 context-switches # 0.045 K/sec ( +- 0.69% )
    2,444 cpu-migrations # 0.023 K/sec ( +- 1.25% )
    394 page-faults # 0.004 K/sec ( +- 0.22% )
    445,902,503,057 cycles # 4.257 GHz ( +- 0.55% ) (66.67%)
    8,558,376,740 stalled-cycles-frontend # 1.92% frontend cycles idle ( +- 0.88% ) (49.99%)
    300,346,181,651 stalled-cycles-backend # 67.36% backend cycles idle ( +- 0.76% ) (50.01%)
    258,091,488,691 instructions # 0.58 insn per cycle
    # 1.16 stalled cycles per insn ( +- 0.22% ) (66.67%)
    70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%)
    3,100,725,426 branch-misses # 4.39% of all branches ( +- 0.20% ) (49.99%)

    105.583 +- 0.589 seconds time elapsed ( +- 0.56% )

    Patched:
    Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

    104,055.69 msec task-clock # 0.993 CPUs utilized ( +- 0.32% )
    4,606 context-switches # 0.044 K/sec ( +- 0.20% )
    2,463 cpu-migrations # 0.024 K/sec ( +- 0.93% )
    394 page-faults # 0.004 K/sec ( +- 0.25% )
    442,951,129,921 cycles # 4.257 GHz ( +- 0.32% ) (66.66%)
    8,710,413,329 stalled-cycles-frontend # 1.97% frontend cycles idle ( +- 0.47% ) (50.06%)
    299,656,905,836 stalled-cycles-backend # 67.65% backend cycles idle ( +- 0.39% ) (50.02%)
    252,731,168,193 instructions # 0.57 insn per cycle
    # 1.19 stalled cycles per insn ( +- 0.20% ) (66.66%)
    68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%)
    3,100,242,882 branch-misses # 4.50% of all branches ( +- 0.15% ) (49.98%)

    104.829 +- 0.325 seconds time elapsed ( +- 0.31% )

    This is consistent. An add-by-count hot-add operation adds LMBs
    greedily, so LMBs near the start of the drconf range are considered
    first. On an otherwise idle LPAR with so many LMBs we would expect to
    find the LMBs we need near the start of the drconf range, hence the
    smaller speedup.

    Signed-off-by: Scott Cheloha
    Reviewed-by: Laurent Dufour
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com

    Scott Cheloha
     
  • The copy buffer is implemented as a real address in the nest which is
    translated from EA by copy, and used for memory access by paste. This
    requires that it be invalidated by TLB invalidation.

    TLBIE does invalidate the copy buffer, but TLBIEL does not. Add
    cp_abort to the tlbiel sequence.

    Signed-off-by: Nicholas Piggin
    [mpe: Fixup whitespace and comment formatting]
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200916030234.4110379-2-npiggin@gmail.com

    Nicholas Piggin
     
  • Move more nitty gritty DMA implementation details into the common
    internal header.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

18 Sep, 2020

2 commits


16 Sep, 2020

8 commits

  • Look up the coregroup id from the associativity array.

    If unable to detect the coregroup id, fall back on the core id. This
    way, we ensure the sched_domain degenerates and an extra sched domain
    is not created.

    Ideally this function should have been implemented in
    arch/powerpc/kernel/smp.c. However, if it is implemented in mm/numa.c,
    we don't need to find the primary domain again.

    If the device-tree mentions more than one coregroup, then the kernel
    implements only the last, i.e. the smallest, coregroup, which
    currently corresponds to the penultimate domain in the device-tree.

    Signed-off-by: Srikar Dronamraju
    Reviewed-by: Gautham R. Shenoy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • Add percpu coregroup maps and masks to create the coregroup domain.
    If a coregroup doesn't exist, the coregroup domain will be degenerated
    in favour of the SMT/CACHE domain. Note that this patch only creates
    stubs for cpu_to_coregroup_id; the actual implementation comes in a
    subsequent patch.

    Signed-off-by: Srikar Dronamraju
    Reviewed-by: Gautham R. Shenoy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • Add support for grouping cores based on the device-tree classification.
    - The last domain in the associativity domains always refers to the
    core.
    - If the primary reference domain happens to be the penultimate domain
    in the associativity domains device-tree property, then there are no
    coregroups. However, if it is not the penultimate domain, then there
    are coregroups. There can be more than one coregroup. For now we are
    only interested in the last, i.e. the smallest, coregroup: one
    sub-group per DIE.

    Currently no firmware exposes this grouping, so keep the basis for
    grouping abstract. Once firmware starts using this grouping, code will
    be added to detect the type of grouping and adjust the sd domain flags
    accordingly.
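    The rule above (the last domain always refers to the core, so
    coregroups exist only when the primary reference domain precedes the
    penultimate one) can be captured in a minimal sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* nr_domains: number of associativity domains in the device-tree
 * property. primary_index: 0-based index of the primary reference
 * domain. The last domain (nr_domains - 1) always refers to the core,
 * so the penultimate domain is nr_domains - 2. Coregroups exist only
 * when the primary reference domain sits before the penultimate one. */
bool has_coregroups(int nr_domains, int primary_index)
{
    return primary_index < nr_domains - 2;
}
```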

    Signed-off-by: Srikar Dronamraju
    Reviewed-by: Gautham R. Shenoy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
    possible nodes marks node 0 as online at boot. However, in practice,
    there are systems which have node 0 as memoryless and cpuless.

    This can cause numa_balancing to be enabled on systems with only one
    node with memory and CPUs. The existence of this dummy node, which is
    cpuless and memoryless, can confuse users/scripts looking at the
    output of lscpu / numactl.

    By marking node 0 as offline, let's stop assuming that node 0 is
    always online. If node 0 has CPUs or memory that are online, node 0
    will again be set as online.

    v5.8
    available: 2 nodes (0,2)
    node 0 cpus:
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 2 cpus: 0 1 2 3 4 5 6 7
    node 2 size: 32625 MB
    node 2 free: 31490 MB
    node distances:
    node   0   2
      0:  10  20
      2:  20  10

    proc and sys files
    ------------------
    /sys/devices/system/node/online: 0,2
    /proc/sys/kernel/numa_balancing: 1
    /sys/devices/system/node/has_cpu: 2
    /sys/devices/system/node/has_memory: 2
    /sys/devices/system/node/has_normal_memory: 2
    /sys/devices/system/node/possible: 0-31

    v5.8 + patch
    ------------------
    available: 1 nodes (2)
    node 2 cpus: 0 1 2 3 4 5 6 7
    node 2 size: 32625 MB
    node 2 free: 31487 MB
    node distances:
    node   2
      2:  10

    proc and sys files
    ------------------
    /sys/devices/system/node/online: 2
    /proc/sys/kernel/numa_balancing: 0
    /sys/devices/system/node/has_cpu: 2
    /sys/devices/system/node/has_memory: 2
    /sys/devices/system/node/has_normal_memory: 2
    /sys/devices/system/node/possible: 0-31

    Example of a node with online CPUs/memory on node 0.
    (Same o/p with and without patch)
    numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
    node 0 size: 32482 MB
    node 0 free: 22994 MB
    node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 1 size: 0 MB
    node 1 free: 0 MB
    node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
    node 2 size: 0 MB
    node 2 free: 0 MB
    node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
    node 3 size: 0 MB
    node 3 free: 0 MB
    node distances:
    node   0   1   2   3
      0:  10  20  40  40
      1:  20  10  40  40
      2:  40  40  10  20
      3:  40  40  20  10

    Note: On Powerpc, cpu_to_node of possible but not present cpus would
    previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
    numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
    queried from vphn"). Without the 2 commits, Powerpc system might crash.

    1. User space applications like numactl and lscpu that parse sysfs
    tend to believe there is an extra online node. This tends to confuse
    users and applications. Other user space applications may conclude
    that the system was unable to use all of its resources (i.e. resources
    are missing) or that the system was not set up correctly.

    2. The existence of the dummy node also leads to inconsistent
    information: the number of online nodes disagrees with the information
    in the device-tree and resource dumps.

    3. When the dummy node is present, single-node non-NUMA systems end up
    appearing as NUMA systems and numa_balancing gets enabled. This means
    we take the hit from unnecessary numa hinting faults.

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • The node id queried from the static device tree may not be correct.
    For example, it may always show 0 on a shared processor. Hence prefer
    the node id queried from vphn, and fall back on the device-tree-based
    node id if the vphn query fails.
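    A minimal sketch of the preference order; NUMA_NO_NODE stands in for a
    failed VPHN query, and the function name is illustrative rather than
    the kernel's:

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Prefer the node id queried from VPHN; the static device tree may
 * report 0 for every cpu on a shared processor, so only fall back on
 * the device-tree node id when the VPHN query has failed. */
int preferred_node_id(int vphn_nid, int device_tree_nid)
{
    return vphn_nid != NUMA_NO_NODE ? vphn_nid : device_tree_nid;
}
```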

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • A Powerpc system with multiple possible nodes and with CONFIG_NUMA
    enabled always used to have a node 0, even if node 0 does not have any
    cpus or memory attached to it. As per PAPR, the node affinity of a cpu
    is only available once it is present / online. For all cpus that are
    possible but not present, cpu_to_node() would point to node 0.

    To ensure a cpuless, memoryless dummy node is not online, powerpc
    needs to make sure cpu_to_node() for all possible but not present cpus
    is set to a proper node.

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
    Abstraction Services (RTAS) Node" available at:
    https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf

    ... there are 2 device tree properties:

    "ibm,max-associativity-domains"
    which defines the maximum number of domains that the firmware
    (i.e. PowerVM) can support.

    and:

    "ibm,current-associativity-domains"
    which defines the maximum number of domains that the current
    platform can support.

    The value of the "ibm,max-associativity-domains" property is always
    greater than or equal to that of "ibm,current-associativity-domains".
    If the latter property is not available, use
    "ibm,max-associativity-domains" as a fallback. In this
    yet-to-be-released LoPAPR, "ibm,current-associativity-domains" is
    mentioned on page 833 / B.5.3, which falls under the
    "Appendix B. System Binding" section.

    Currently powerpc uses the "ibm,max-associativity-domains" property
    while setting the possible number of nodes. This is currently set at
    32. However the possible number of nodes for a platform may be
    significantly less. Hence set the possible number of nodes based on
    "ibm,current-associativity-domains" property.

    Nathan Lynch had raised a valid concern that post LPM (Live Partition
    Migration), a user could DLPAR add processors and memory after LPM
    with "new" associativity properties:
    https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u

    He also pointed out that "ibm,max-associativity-domains" has the same
    contents on all currently available PowerVM systems, unlike
    "ibm,current-associativity-domains" and hence may be better able to
    handle the new NUMA associativity properties.

    However with the recent commit dbce45628085 ("powerpc/numa: Limit
    possible nodes to within num_possible_nodes"), all new NUMA
    associativity properties are capped to initially set nr_node_ids.
    Hence this commit should be safe with any new DLPAR add post LPM.

    $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
    /proc/device-tree/rtas/ibm,current-associativity-domains
    00000005 00000001 00000002 00000002 00000002 00000010
    /proc/device-tree/rtas/ibm,max-associativity-domains
    00000005 00000001 00000008 00000020 00000020 00000100

    $ cat /sys/devices/system/node/possible ##Before patch
    0-31

    $ cat /sys/devices/system/node/possible ##After patch
    0-1

    Note that the maximum number of nodes this platform can support is
    only 2, but the possible nodes were set to 32.
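    The fallback between the two properties can be sketched as follows,
    using the cell values from the lsprop output earlier; which cell of
    the property actually bounds the node count depends on
    min_common_depth in the real code, so domain_index here is an
    assumption:

```c
#include <assert.h>
#include <stddef.h>

/* Pick the node-count bound from "ibm,current-associativity-domains",
 * falling back to "ibm,max-associativity-domains" when the former is
 * absent (current_doms == NULL). domain_index selects the cell that
 * bounds the number of NUMA nodes; in the real code that index is
 * derived from min_common_depth. */
unsigned int possible_nodes(const unsigned int *current_doms,
                            const unsigned int *max_doms,
                            size_t domain_index)
{
    const unsigned int *prop = current_doms ? current_doms : max_doms;

    return prop[domain_index];
}
```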

    This is important because a lot of kernel and user space code
    allocates structures for all possible nodes, leading to a lot of
    memory that is allocated but never used.

    I ran a simple experiment creating and destroying 100 memory cgroups
    at boot on an 8-node machine (Power8 Alpine).

    Before patch:
    free -k at boot
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4106816  518820608    22272      570752  516606720
    Swap:         4194240         0    4194240

    free -k after creating 100 memory cgroups
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4628416  518246464    22336      623296  516058688
    Swap:         4194240         0    4194240

    free -k after destroying 100 memory cgroups
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4697408  518173760    22400      627008  515987904
    Swap:         4194240         0    4194240

    After patch:
    free -k at boot
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   3969472  518933888    22272      594816  516731776
    Swap:         4194240         0    4194240

    free -k after creating 100 memory cgroups
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4181888  518676096    22208      640192  516496448
    Swap:         4194240         0    4194240

    free -k after destroying 100 memory cgroups
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4232320  518619904    22272      645952  516443264
    Swap:         4194240         0    4194240

    Observations:
    Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
    Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.

    Signed-off-by: Srikar Dronamraju
    [mpe: Reformat change log a bit for readability]
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
    single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
    a process under certain conditions. One of the assumptions is that
    mm_users would not be incremented via a reference outside the process
    context with mmget_not_zero() then go on to kthread_use_mm() via that
    reference.

    That invariant was broken by io_uring code (see previous sparc64 fix),
    but I'll point Fixes: to the original powerpc commit because we are
    changing that assumption going forward, so this will make backports
    match up.

    Fix this by no longer relying on that assumption, but by having each CPU
    check the mm is not being used, and clearing their own bit from the mask
    only if it hasn't been switched-to by the time the IPI is processed.

    This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
    invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
    switch sequences.

    Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Michael Ellerman
    Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200914045219.3736466-5-npiggin@gmail.com

    Nicholas Piggin
     

15 Sep, 2020

4 commits

  • This ensures we don't do a partial mapping of memory. With nvdimm, when
    creating namespaces with size not aligned to 16MB, the kernel ends up partially
    mapping the pages. This can result in kernel adding multiple hash page table
    entries for the same range. A new namespace will result in
    create_section_mapping() with start and end overlapping an already existing
    bolted hash page table entry.

    commit: 6acd7d5ef264 ("libnvdimm/namespace: Enforce memremap_compat_align()")
    made sure that we always create namespaces aligned to 16MB. But we can do
    better by avoiding mapping pages that are not aligned. This helps to catch
    access to these partially mapped pages early.
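    The alignment check being described can be sketched as follows; the
    function name and signature are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_16M (16UL * 1024 * 1024)
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Refuse to map a range unless both ends sit on the 16MB boundary the
 * bolted hash page table entries cover; mapping a partially aligned
 * range would let a later namespace's create_section_mapping() overlap
 * an already existing bolted entry. */
bool range_can_be_mapped(uint64_t start, uint64_t end)
{
    return IS_ALIGNED(start, SZ_16M) && IS_ALIGNED(end, SZ_16M);
}
```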

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200907072539.67310-1-aneesh.kumar@linux.ibm.com

    Aneesh Kumar K.V
     
  • Before the commit identified below, page table allocation was
    performed after the allocation of the final shadow area for linear
    memory. But that commit switched the order, so page tables are
    already allocated by the time the 8xx's kasan_init_shadow_8M() is
    called. Because of this, kasan_init_shadow_8M() doesn't map the
    needed shadow entries, as page tables are already present.

    kasan_init_shadow_8M() installs huge PMD entries instead of page
    tables. We could at that time free the page tables, but there is no
    point in creating page tables that get freed before being used.

    Only book3s/32 hash needs early allocation of page tables. For other
    variants, we can keep the initial order and create remaining page
    tables after the allocation of final shadow memory for linear mem.

    Move back the allocation of shadow page tables for
    CONFIG_KASAN_VMALLOC into kasan_init() after the loop which creates
    final shadow memory for linear mem.

    Fixes: 41ea93cf7ba4 ("powerpc/kasan: Fix shadow pages allocation failure")
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/8ae4554357da4882612644a74387ae05525b2aaa.1599800716.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • The 8xx has 4 page sizes: 4k, 16k, 512k and 8M

    4k and 16k can be selected at build time as standard page sizes,
    and 512k and 8M are hugepages.

    When 4k standard pages are selected, 16k pages are not available.

    Allow 16k pages as hugepages when 4k pages are used.

    To allow that, implement arch_make_huge_pte() which receives
    the necessary arguments to allow setting the PTE in accordance
    with the page size:
    - 512 k pages must have _PAGE_HUGE and _PAGE_SPS. They are set
    by pte_mkhuge(). arch_make_huge_pte() does nothing.
    - 16 k pages must have only _PAGE_SPS. arch_make_huge_pte() clears
    _PAGE_HUGE.
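    A sketch of the behaviour described above; the flag values are
    illustrative placeholders, not the real 8xx PTE bit positions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative PTE flag values, not the real 8xx bit positions. */
#define _PAGE_SPS  0x1u
#define _PAGE_HUGE 0x2u

#define SZ_16K  (16 * 1024)
#define SZ_512K (512 * 1024)

/* pte_mkhuge() sets both _PAGE_HUGE and _PAGE_SPS. 512k pages keep
 * both, while 16k hugepages must carry only _PAGE_SPS, so the
 * arch_make_huge_pte() hook clears _PAGE_HUGE for them. */
uint32_t arch_make_huge_pte_sketch(uint32_t pte, unsigned long page_size)
{
    if (page_size == SZ_16K)
        pte &= ~_PAGE_HUGE;
    return pte;
}
```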

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/a518abc29266a708dfbccc8fce9ae6694fe4c2c6.1598862623.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • On 8xx, the number of entries occupied by a PTE in the page tables
    depends on the size of the page. At present, this calculation is done
    in two places: in pte_update() and in set_huge_pte_at().

    Refactor this calculation into a helper called
    number_of_cells_per_pte(). For the time being, the val param is
    unused. It will be used by a following patch.

    Instead of opencoding is_hugepd(), use hugepd_ok() with a forward
    declaration.
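    The cell count can be illustrated with a deliberately simplified
    sketch (the real number_of_cells_per_pte() also takes the pmd and val
    arguments, and the 4k base size is an assumption here):

```c
#include <assert.h>
#include <stddef.h>

#define SZ_4K (4 * 1024)

/* Illustrative simplification: a PTE covering a page larger than the
 * base page size occupies page_size / base_size consecutive entries,
 * e.g. a 16k SPS page takes four 4k entries. */
size_t number_of_cells_per_pte_sketch(size_t page_size)
{
    if (page_size <= SZ_4K)
        return 1;
    return page_size / SZ_4K;
}
```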

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/f6ea2483c2c389567b007945948f704d18cfaeea.1598862623.git.christophe.leroy@csgroup.eu

    Christophe Leroy