04 Jan, 2021

1 commit

  • This is the 5.10.4 stable release

    * tag 'v5.10.4': (717 commits)
    Linux 5.10.4
    x86/CPU/AMD: Save AMD NodeId as cpu_die_id
    drm/edid: fix objtool warning in drm_cvt_modes()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/gpu/drm/imx/dcss/dcss-plane.c
    drivers/media/i2c/ov5640.c

    Jason Liu
     

30 Dec, 2020

2 commits

  • commit 17179aeb9d34cc81e1a4ae3f85e5b12b13a1f8d0 upstream.

    MMU_FTR_TYPE_44x cannot be checked by cpu_has_feature().

    Use mmu_has_feature() instead.
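
    The distinction can be sketched in plain C: CPU features and MMU
    features live in separate masks, so testing an MMU feature bit with
    cpu_has_feature() compares it against the wrong word. Bit values and
    mask names below are illustrative, not the kernel's real layout:

```c
#include <stdbool.h>

/* Illustrative feature masks; the real kernel keeps CPU and MMU
 * features in separate per-CPU-spec words. */
#define MMU_FTR_TYPE_44x (1UL << 3)           /* made-up bit position */

static unsigned long cpu_features = 0;         /* bit 3 means something else here */
static unsigned long mmu_features = MMU_FTR_TYPE_44x;

static bool cpu_has_feature(unsigned long feature)
{
    return cpu_features & feature;             /* wrong mask for MMU features */
}

static bool mmu_has_feature(unsigned long feature)
{
    return mmu_features & feature;             /* correct mask */
}
```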

    Fixes: 23eb7f560a2a ("powerpc: Convert flush_icache_range & friends to C")
    Cc: stable@vger.kernel.org
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/ceede82fadf37f3b8275e61fcf8cf29a3e2ec7fe.1602351011.git.christophe.leroy@csgroup.eu
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit 7ceb40027e19567a0a066e3b380cc034cdd9a124 ]

    The verification and message introduced by commit 374f3f5979f9
    ("powerpc/mm/hash: Handle user access of kernel address gracefully")
    applies to all platforms, it should not be limited to BOOK3S.

    Make the BOOK3S version of sanity_check_fault() the one for all,
    and bail out earlier if not BOOK3S.
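
    A hedged sketch of the reshuffled check (the TASK_SIZE value and the
    IS_BOOK3S toggle are illustrative; the real function also inspects
    DSISR bits on BOOK3S):

```c
#include <stdbool.h>
#include <stdio.h>

#define TASK_SIZE 0x0000400000000000UL  /* illustrative user/kernel boundary */
#define IS_BOOK3S 0                     /* pretend this build is not BOOK3S */

/* The user-access-of-kernel-address check now runs on every platform;
 * non-BOOK3S platforms bail out before the BOOK3S-specific checks. */
static bool sanity_check_fault(bool is_user, unsigned long address)
{
    if (is_user && address >= TASK_SIZE) {
        printf("User access of kernel address (%#lx)\n", address);
        return false;
    }
    if (!IS_BOOK3S)
        return true;                    /* bail out earlier if not BOOK3S */
    /* ... BOOK3S-only DSISR sanity checks would follow here ... */
    return true;
}
```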

    Fixes: 374f3f5979f9 ("powerpc/mm/hash: Handle user access of kernel address gracefully")
    Signed-off-by: Christophe Leroy
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/fe199d5af3578d3bf80035d203a94d742a7a28af.1607491748.git.christophe.leroy@csgroup.eu
    Signed-off-by: Sasha Levin

    Christophe Leroy
     


08 Dec, 2020

1 commit

  • Since commit c33165253492 ("powerpc: use non-set_fs based maccess
    routines"), userspace access is no longer granted when using
    copy_from_kernel_nofault().

    However, kthread_probe_data() uses copy_from_kernel_nofault()
    to check validity of pointers. When the pointer is NULL,
    it points to userspace, leading to a KUAP fault and triggering
    the following big hammer warning many times when you request
    a sysrq "show task":

    [ 1117.202054] ------------[ cut here ]------------
    [ 1117.202102] Bug: fault blocked by AP register !
    [ 1117.202261] WARNING: CPU: 0 PID: 377 at arch/powerpc/include/asm/nohash/32/kup-8xx.h:66 do_page_fault+0x4a8/0x5ec
    [ 1117.202310] Modules linked in:
    [ 1117.202428] CPU: 0 PID: 377 Comm: sh Tainted: G W 5.10.0-rc5-01340-g83f53be2de31-dirty #4175
    [ 1117.202499] NIP: c0012048 LR: c0012048 CTR: 00000000
    [ 1117.202573] REGS: cacdbb88 TRAP: 0700 Tainted: G W (5.10.0-rc5-01340-g83f53be2de31-dirty)
    [ 1117.202625] MSR: 00021032 CR: 24082222 XER: 20000000
    [ 1117.202899]
    [ 1117.202899] GPR00: c0012048 cacdbc40 c2929290 00000023 c092e554 00000001 c09865e8 c092e640
    [ 1117.202899] GPR08: 00001032 00000000 00000000 00014efc 28082224 100d166a 100a0920 00000000
    [ 1117.202899] GPR16: 100cac0c 100b0000 1080c3fc 1080d685 100d0000 100d0000 00000000 100a0900
    [ 1117.202899] GPR24: 100d0000 c07892ec 00000000 c0921510 c21f4440 0000005c c0000000 cacdbc80
    [ 1117.204362] NIP [c0012048] do_page_fault+0x4a8/0x5ec
    [ 1117.204461] LR [c0012048] do_page_fault+0x4a8/0x5ec
    [ 1117.204509] Call Trace:
    [ 1117.204609] [cacdbc40] [c0012048] do_page_fault+0x4a8/0x5ec (unreliable)
    [ 1117.204771] [cacdbc70] [c00112f0] handle_page_fault+0x8/0x34
    [ 1117.204911] --- interrupt: 301 at copy_from_kernel_nofault+0x70/0x1c0
    [ 1117.204979] NIP: c010dbec LR: c010dbac CTR: 00000001
    [ 1117.205053] REGS: cacdbc80 TRAP: 0301 Tainted: G W (5.10.0-rc5-01340-g83f53be2de31-dirty)
    [ 1117.205104] MSR: 00009032 CR: 28082224 XER: 00000000
    [ 1117.205416] DAR: 0000005c DSISR: c0000000
    [ 1117.205416] GPR00: c0045948 cacdbd38 c2929290 00000001 00000017 00000017 00000027 0000000f
    [ 1117.205416] GPR08: c09926ec 00000000 00000000 3ffff000 24082224
    [ 1117.206106] NIP [c010dbec] copy_from_kernel_nofault+0x70/0x1c0
    [ 1117.206202] LR [c010dbac] copy_from_kernel_nofault+0x30/0x1c0
    [ 1117.206258] --- interrupt: 301
    [ 1117.206372] [cacdbd38] [c004bbb0] kthread_probe_data+0x44/0x70 (unreliable)
    [ 1117.206561] [cacdbd58] [c0045948] print_worker_info+0xe0/0x194
    [ 1117.206717] [cacdbdb8] [c00548ac] sched_show_task+0x134/0x168
    [ 1117.206851] [cacdbdd8] [c005a268] show_state_filter+0x70/0x100
    [ 1117.206989] [cacdbe08] [c039baa0] sysrq_handle_showstate+0x14/0x24
    [ 1117.207122] [cacdbe18] [c039bf18] __handle_sysrq+0xac/0x1d0
    [ 1117.207257] [cacdbe48] [c039c0c0] write_sysrq_trigger+0x4c/0x74
    [ 1117.207407] [cacdbe68] [c01fba48] proc_reg_write+0xb4/0x114
    [ 1117.207550] [cacdbe88] [c0179968] vfs_write+0x12c/0x478
    [ 1117.207686] [cacdbf08] [c0179e60] ksys_write+0x78/0x128
    [ 1117.207826] [cacdbf38] [c00110d0] ret_from_syscall+0x0/0x34
    [ 1117.207938] --- interrupt: c01 at 0xfd4e784
    [ 1117.208008] NIP: 0fd4e784 LR: 0fe0f244 CTR: 10048d38
    [ 1117.208083] REGS: cacdbf48 TRAP: 0c01 Tainted: G W (5.10.0-rc5-01340-g83f53be2de31-dirty)
    [ 1117.208134] MSR: 0000d032 CR: 44002222 XER: 00000000
    [ 1117.208470]
    [ 1117.208470] GPR00: 00000004 7fc34090 77bfb4e0 00000001 1080fa40 00000002 7400000f fefefeff
    [ 1117.208470] GPR08: 7f7f7f7f 10048d38 1080c414 7fc343c0 00000000
    [ 1117.209104] NIP [0fd4e784] 0xfd4e784
    [ 1117.209180] LR [0fe0f244] 0xfe0f244
    [ 1117.209236] --- interrupt: c01
    [ 1117.209274] Instruction dump:
    [ 1117.209353] 714a4000 418200f0 73ca0001 40820084 73ca0032 408200f8 73c90040 4082ff60
    [ 1117.209727] 0fe00000 3c60c082 386399f4 48013b65 80010034 3860000b 7c0803a6
    [ 1117.210102] ---[ end trace 1927c0323393af3e ]---

    To avoid that, copy_from_kernel_nofault_allowed() is used to check
    whether the address is a valid kernel address. But the default
    version of it returns true for any address.

    Provide a powerpc version of copy_from_kernel_nofault_allowed()
    that returns false when the address is below TASK_USER_MAX,
    so that copy_from_kernel_nofault() will return -ERANGE.
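
    The idea can be sketched in userspace C. Here TASK_USER_MAX is a toy
    constant and memcpy() stands in for the kernel's fault-tolerant loads;
    only the shape of the predicate and the -ERANGE path matches the
    description above:

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define TASK_USER_MAX 0x1000UL   /* illustrative boundary, not the real value */

/* Reject addresses that fall in the user range, NULL included. */
static bool copy_from_kernel_nofault_allowed(const void *src, size_t size)
{
    (void)size;
    return (unsigned long)src >= TASK_USER_MAX;
}

static long copy_from_kernel_nofault(void *dst, const void *src, size_t n)
{
    if (!copy_from_kernel_nofault_allowed(src, n))
        return -ERANGE;          /* callers like kthread_probe_data() handle this */
    memcpy(dst, src, n);         /* the kernel uses fault-tolerant accessors here */
    return 0;
}
```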

    Fixes: c33165253492 ("powerpc: use non-set_fs based maccess routines")
    Reported-by: Qian Cai
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/18bcb456d32a3e74f5ae241fd6f1580c092d07f5.1607360230.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     

06 Dec, 2020

1 commit

  • Pull powerpc fixes from Michael Ellerman:
    "Some more powerpc fixes for 5.10:

    - Three commits fixing possible missed TLB invalidations for
    multi-threaded processes when CPUs are hotplugged in and out.

    - A fix for a host crash triggerable by host userspace (qemu) in KVM
    on Power9.

    - A fix for a host crash in machine check handling when running HPT
    guests on a HPT host.

    - One commit fixing potential missed TLB invalidations when using the
    hash MMU on Power9 or later.

    - A regression fix for machines with CPUs on node 0 but no memory.

    Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
    Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
    Dronamraju"

    * tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
    KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
    powerpc/numa: Fix a regression on memoryless node 0
    powerpc/64s: Trim offlined CPUs from mm_cpumasks
    kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
    powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
    powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation

    Linus Torvalds
     

27 Nov, 2020

1 commit

  • Commit e75130f20b1f ("powerpc/numa: Offline memoryless cpuless node 0")
    offlines node 0 and expects nodes to be subsequently onlined when CPUs
    or nodes are detected.

    Commit 6398eaa26816 ("powerpc/numa: Prefer node id queried from vphn")
    skips onlining node 0 when CPUs are associated with node 0.

    On systems with node 0 having CPUs but no memory, this causes node 0
    to be marked offline. This causes issues at boot time when trying to
    set the memory node for online CPUs while building the zonelist.

    0:mon> t
    [link register ] c000000000400354 __build_all_zonelists+0x164/0x280
    [c00000000161bda0] c0000000016533c8 node_states+0x20/0xa0 (unreliable)
    [c00000000161bdc0] c000000000400384 __build_all_zonelists+0x194/0x280
    [c00000000161be30] c000000001041800 build_all_zonelists_init+0x4c/0x118
    [c00000000161be80] c0000000004020d0 build_all_zonelists+0x190/0x1b0
    [c00000000161bef0] c000000001003cf8 start_kernel+0x18c/0x6a8
    [c00000000161bf90] c00000000000adb4 start_here_common+0x1c/0x3e8
    0:mon> r
    R00 = c000000000400354 R16 = 000000000b57a0e8
    R01 = c00000000161bda0 R17 = 000000000b57a6b0
    R02 = c00000000161ce00 R18 = 000000000b5afee8
    R03 = 0000000000000000 R19 = 000000000b6448a0
    R04 = 0000000000000000 R20 = fffffffffffffffd
    R05 = 0000000000000000 R21 = 0000000001400000
    R06 = 0000000000000000 R22 = 000000001ec00000
    R07 = 0000000000000001 R23 = c000000001175580
    R08 = 0000000000000000 R24 = c000000001651ed8
    R09 = c0000000017e84d8 R25 = c000000001652480
    R10 = 0000000000000000 R26 = c000000001175584
    R11 = c000000c7fac0d10 R27 = c0000000019568d0
    R12 = c000000000400180 R28 = 0000000000000000
    R13 = c000000002200000 R29 = c00000000164dd78
    R14 = 000000000b579f78 R30 = 0000000000000000
    R15 = 000000000b57a2b8 R31 = c000000001175584
    pc = c000000000400194 local_memory_node+0x24/0x80
    cfar= c000000000074334 mcount+0xc/0x10
    lr = c000000000400354 __build_all_zonelists+0x164/0x280
    msr = 8000000002001033 cr = 44002284
    ctr = c000000000400180 xer = 0000000000000001 trap = 380
    dar = 0000000000001388 dsisr = c00000000161bc90
    0:mon>

    Fix this by setting node to be online while onlining CPUs that belong to
    node 0.

    Fixes: e75130f20b1f ("powerpc/numa: Offline memoryless cpuless node 0")
    Fixes: 6398eaa26816 ("powerpc/numa: Prefer node id queried from vphn")
    Reported-by: Milan Mohanty
    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201127053738.10085-1-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     

26 Nov, 2020

3 commits

  • When offlining a CPU, powerpc/64s does not flush TLBs, rather it just
    leaves the CPU set in mm_cpumasks, so it continues to receive TLBIEs
    to manage its TLBs.

    However the exit_flush_lazy_tlbs() function expects that after
    returning, all CPUs (except self) have flushed TLBs for that mm, in
    which case TLBIEL can be used for this flush. This breaks for offline
    CPUs because they don't get the IPI to flush their TLB. This can lead
    to stale translations.

    Fix this by clearing the CPU from mm_cpumasks, then flushing all TLBs
    before going offline.

    These offlined CPU bits stuck in the cpumask also prevent the cpumask
    from being trimmed back to local mode, which means continual broadcast
    IPIs or TLBIEs are needed for TLB flushing. This patch prevents that
    situation too.
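
    The ordering matters: the CPU must drop itself from each mm_cpumask
    first, and then flush its local TLB, so a later exit_flush_lazy_tlbs()
    neither misses the dying CPU nor leaves it with stale entries. A toy
    single-word cpumask makes the sequence concrete (all names here are
    illustrative, not the kernel's):

```c
#include <stdbool.h>

static unsigned long mm_cpumask = 0xfUL;  /* CPUs 0-3 active in this mm */
static bool local_tlb_flushed[4];

/* Step 1: stop being a broadcast-TLBIE target; step 2: flush local
 * TLBs; only then may the CPU go offline. */
static void cpu_offline_mm_cleanup(int cpu)
{
    mm_cpumask &= ~(1UL << cpu);          /* clear self from mm_cpumasks */
    local_tlb_flushed[cpu] = true;        /* then flush, before going offline */
}
```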

    A cast of many were involved in working this out, but in particular
    Milton, Aneesh, Paul made key discoveries.

    Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Debugged-by: Milton Miller
    Debugged-by: Aneesh Kumar K.V
    Debugged-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201126102530.691335-5-npiggin@gmail.com

    Nicholas Piggin
     
  • tlbiel_all() is presently not usable in !HVMODE when running hash;
    remove the HV privileged flushes when running in a guest to make it
    usable.

    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201126102530.691335-3-npiggin@gmail.com

    Nicholas Piggin
     
  • A typo has the R field of the instruction assigned by lucky dip a la
    register allocator.

    Fixes: d4748276ae14c ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201126102530.691335-2-npiggin@gmail.com

    Nicholas Piggin
     

23 Nov, 2020

1 commit

  • The core-mm has a default __weak implementation of phys_to_target_node()
    to mirror the weak definition of memory_add_physaddr_to_nid(). That
    symbol is exported for modules. However, while the export in
    mm/memory_hotplug.c exported the symbol in the configuration cases of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=y

    ...and:

    CONFIG_NUMA_KEEP_MEMINFO=n
    CONFIG_MEMORY_HOTPLUG=y

    ...it failed to export the symbol in the case of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=n

    Not only is that broken, but Christoph points out that the kernel should
    not be exporting any __weak symbol, which means that the
    memory_add_physaddr_to_nid() example that phys_to_target_node() copied
    is broken too.

    Rework the definition of phys_to_target_node() and
    memory_add_physaddr_to_nid() to not require weak symbols. Move to the
    common arch override design-pattern of an asm header defining a symbol
    to replace the default implementation.
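
    The arch-override pattern the commit adopts can be shown in a single
    file: the "arch" header defines the symbol and announces it with a
    #define of the same name, and the generic header supplies its default
    only when no override is visible. The node numbering below is a toy:

```c
/* "arch" side, e.g. asm/sparsemem.h: real implementation plus marker. */
static int arch_phys_to_target_node(unsigned long start)
{
    return start >= 0x100000000UL ? 1 : 0;   /* toy: high addresses on node 1 */
}
#define phys_to_target_node arch_phys_to_target_node

/* generic side: default stub compiled only when no arch override exists. */
#ifndef phys_to_target_node
static inline int phys_to_target_node(unsigned long start)
{
    (void)start;
    return 0;                                /* default node */
}
#endif
```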

    The only common header that all memory_add_physaddr_to_nid() producing
    architectures implement is asm/sparsemem.h. In fact, powerpc already
    defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
    Double-down on that observation and define phys_to_target_node() where
    necessary in asm/sparsemem.h. An alternate consideration that was
    discarded was to put this override in asm/numa.h, but that entangles
    with the definition of MAX_NUMNODES relative to the inclusion of
    linux/nodemask.h, and requires powerpc to grow a new header.

    The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
    now that the symbol is properly exported / stubbed in all combinations
    of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.

    [dan.j.williams@intel.com: v4]
    Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
    [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
    Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com

    Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
    Reported-by: Randy Dunlap
    Reported-by: Thomas Gleixner
    Reported-by: kernel test robot
    Reported-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Tested-by: Randy Dunlap
    Tested-by: Thomas Gleixner
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Christoph Hellwig
    Cc: Joao Martins
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Vishal Verma
    Cc: Stephen Rothwell
    Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

17 Oct, 2020

2 commits

  • Pull powerpc updates from Michael Ellerman:

    - A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
    it for powerpc, as well as a related fix for sparc.

    - Remove support for PowerPC 601.

    - Some fixes for watchpoints & addition of a new ptrace flag for
    detecting ISA v3.1 (Power10) watchpoint features.

    - A fix for kernels using 4K pages and the hash MMU on bare metal
    Power9 systems with > 16TB of RAM, or RAM on the 2nd node.

    - A basic idle driver for shallow stop states on Power10.

    - Tweaks to our sched domains code to better inform the scheduler about
    the hardware topology on Power9/10, where two SMT4 cores can be
    presented by firmware as an SMT8 core.

    - A series doing further reworks & cleanups of our EEH code.

    - Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
    to prevent root from overwriting kernel memory.

    - Other smaller features, fixes & cleanups.

    Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
    Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
    Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
    Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
    Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
    Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
    Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
    Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
    Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
    Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
    Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
    Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
    Yingliang, zhengbin.

    * tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
    Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
    selftests/powerpc: Fix eeh-basic.sh exit codes
    cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
    powerpc/time: Make get_tb() common to PPC32 and PPC64
    powerpc/time: Make get_tbl() common to PPC32 and PPC64
    powerpc/time: Remove get_tbu()
    powerpc/time: Avoid using get_tbl() and get_tbu() internally
    powerpc/time: Make mftb() common to PPC32 and PPC64
    powerpc/time: Rename mftbl() to mftb()
    powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
    powerpc/32s: Rename head_32.S to head_book3s_32.S
    powerpc/32s: Setup the early hash table at all time.
    powerpc/time: Remove ifdef in get_dec() and set_dec()
    powerpc: Remove get_tb_or_rtc()
    powerpc: Remove __USE_RTC()
    powerpc: Tidy up a bit after removal of PowerPC 601.
    powerpc: Remove support for PowerPC 601
    powerpc: Remove PowerPC 601
    powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
    powerpc: Remove CONFIG_PPC601_SYNC_FIX
    ...

    Linus Torvalds
     
  • powerpc used to set the pte specific flags in set_pte_at(). This is
    different from other architectures. To be consistent with other
    architectures, update pfn_pte to set _PAGE_PTE on ppc64. Also, drop
    the now-unused pte_mkpte.

    We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without
    setting _PAGE_PTE bit. We will remove that after a few releases.

    With respect to huge pmd entries, pmd_mkhuge() takes care of adding the
    _PAGE_PTE bit.
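
    A minimal sketch of the change, with illustrative bit positions and
    page shift rather than ppc64's real layout: pfn_pte() now stamps
    _PAGE_PTE itself instead of relying on set_pte_at() to do it, and the
    added VM_WARN_ON() amounts to the validity check shown:

```c
#include <stdbool.h>

#define PAGE_SHIFT 12UL
#define _PAGE_PTE  (1UL << 62)   /* illustrative bit position */

typedef unsigned long pte_t;

/* After the change, every pte built from a pfn carries _PAGE_PTE. */
static pte_t pfn_pte(unsigned long pfn, unsigned long pgprot)
{
    return (pfn << PAGE_SHIFT) | pgprot | _PAGE_PTE;
}

/* The VM_WARN_ON() the commit adds boils down to this check. */
static bool pte_looks_valid(pte_t pte)
{
    return (pte & _PAGE_PTE) != 0;
}
```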

    [akpm@linux-foundation.org: whitespace fix, per Christophe]

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Reviewed-by: Christophe Leroy
    Cc: Anshuman Khandual
    Cc: Michael Ellerman
    Link: https://lkml.kernel.org/r/20200902114222.181353-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

16 Oct, 2020

1 commit

  • Pull dma-mapping updates from Christoph Hellwig:

    - rework the non-coherent DMA allocator

    - move private definitions out of

    - lower CMA_ALIGNMENT (Paul Cercueil)

    - remove the omap1 dma address translation in favor of the common code

    - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)

    - support per-node DMA CMA areas (Barry Song)

    - increase the default seg boundary limit (Nicolin Chen)

    - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)

    - various cleanups

    * tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
    ARM/ixp4xx: add a missing include of dma-map-ops.h
    dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
    dma-direct: factor out a dma_direct_alloc_from_pool helper
    dma-direct check for highmem pages in dma_direct_alloc_pages
    dma-mapping: merge into
    dma-mapping: move large parts of to kernel/dma
    dma-mapping: move dma-debug.h to kernel/dma/
    dma-mapping: remove
    dma-mapping: merge into
    dma-contiguous: remove dma_contiguous_set_default
    dma-contiguous: remove dev_set_cma_area
    dma-contiguous: remove dma_declare_contiguous
    dma-mapping: split
    cma: decrease CMA_ALIGNMENT lower limit to 2
    firewire-ohci: use dma_alloc_pages
    dma-iommu: implement ->alloc_noncoherent
    dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
    dma-mapping: add a new dma_alloc_pages API
    dma-mapping: remove dma_cache_sync
    53c700: convert to dma_alloc_noncoherent
    ...

    Linus Torvalds
     

14 Oct, 2020

2 commits

  • There are several occurrences of the following pattern:

    for_each_memblock(memory, reg) {
            start = __pfn_to_phys(memblock_region_memory_base_pfn(reg));
            end = __pfn_to_phys(memblock_region_memory_end_pfn(reg));

            /* do something with start and end */
    }

    Using for_each_mem_range() iterator is more appropriate in such cases and
    allows simpler and cleaner code.
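
    With the iterator the loop collapses to the shape sketched below. To
    keep the sketch standalone, memblock is mocked with a two-entry region
    table and a stand-in macro; the real for_each_mem_range() yields
    physical start/end addresses straight from memblock.memory:

```c
#include <stddef.h>

struct region { unsigned long base, size; };

/* Toy stand-in for memblock.memory. */
static const struct region memory[] = {
    { 0x0000, 0x1000 },
    { 0x2000, 0x1000 },
};

#define NUM_REGIONS (sizeof(memory) / sizeof(memory[0]))

/* Shape of the rewritten loop: start/end come straight from the iterator. */
#define for_each_mem_range(i, p_start, p_end)                    \
    for ((i) = 0;                                                \
         (i) < NUM_REGIONS &&                                    \
         (*(p_start) = memory[i].base,                           \
          *(p_end)   = memory[i].base + memory[i].size, 1);      \
         (i)++)

static unsigned long total_memory(void)
{
    unsigned long start, end, total = 0;
    size_t i;

    for_each_mem_range(i, &start, &end)
        total += end - start;    /* do something with start and end */

    return total;
}
```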

    [akpm@linux-foundation.org: fix arch/arm/mm/pmsa-v7.c build]
    [rppt@linux.ibm.com: mips: fix cavium-octeon build caused by memblock refactoring]
    Link: http://lkml.kernel.org/r/20200827124549.GD167163@linux.ibm.com

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Miguel Ojeda
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-13-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • There are several occurrences of the following pattern:

    for_each_memblock(memory, reg) {
            start_pfn = memblock_region_memory_base_pfn(reg);
            end_pfn = memblock_region_memory_end_pfn(reg);

            /* do something with start_pfn and end_pfn */
    }

    Rather than iterate over all memblock.memory regions and each time query
    for their start and end PFNs, use for_each_mem_pfn_range() iterator to get
    simpler and clearer code.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Miguel Ojeda [.clang-format]
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Oct, 2020

6 commits

  • At the time being, an early hash table is set up when
    CONFIG_KASAN is selected.

    There is nothing wrong with setting such an early hash table
    all the time, even if it is not used. This is a statically
    allocated 256 kB table which lies in the init data section.

    This makes the code simpler and may in the future allow setting up
    early IO mappings with fixmap instead of hard-coding BATs.

    Put create_hpte() and flush_hash_pages() in the .ref.text section
    in order to avoid warning for the reference to early_hash[]. This
    reference is removed by MMU_init_hw_patch() before init memory is
    freed.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/b8f8101c368b8a6451844a58d7bd7d83c14cf2aa.1601566529.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • The removal of the 601 left some standalone blocks from
    former if/else. Drop the { } and re-indent.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/31c4cd093963f22831bf388449056ee045533d3b.1601362098.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • PowerPC 601 has been retired.

    Remove all associated specific code.

    CPU_FTRS_PPC601 has CPU_FTR_COHERENT_ICACHE and CPU_FTR_COMMON.

    CPU_FTR_COMMON is already present via other CPU_FTRS.
    None of the remaining CPUs selects CPU_FTR_COHERENT_ICACHE.

    So CPU_FTRS_PPC601 can be removed from the possible features,
    hence can be removed completely.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/60b725d55e21beec3335175c20b77903ff98284f.1601362098.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • Those macros are now empty at all times. Drop them.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/7990bb63fc53e460bfa94f8040184881d9e6fbc3.1601362098.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • Make it consistent with other usages.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201007114836.282468-5-aneesh.kumar@linux.ibm.com

    Aneesh Kumar K.V
     
  • Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block panic")
    make sure different variables tracking lmb_size are updated to be 64 bit.

    Fixes: af9d00e93a4f ("powerpc/mm/radix: Create separate mappings for hot-plugged memory")
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201007114836.282468-4-aneesh.kumar@linux.ibm.com

    Aneesh Kumar K.V
     

06 Oct, 2020

3 commits

  • During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
    to determine which node id (nid) to use when later calling __add_memory().

    This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an
    appropriate nid for a given address by looking up the LMB containing the
    address and then passing that LMB to of_drconf_to_nid_single() to get the
    nid. In dlpar_add_lmb() we get this address from the LMB itself.

    In short, we have a pointer to an LMB and then we are searching for
    that LMB *again* in order to find its nid.

    If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
    can skip the redundant lookup. The only error handling we need to
    duplicate from memory_add_physaddr_to_nid() is the fallback to the
    default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
    an invalid nid.
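
    The duplicated fallback amounts to only a few lines. In this sketch
    of_drconf_to_nid_single() is stubbed and the helper name lmb_to_nid()
    is invented for illustration; only the fallback logic follows the
    description above:

```c
#define NUMA_NO_NODE  (-1)
#define MAX_NUMNODES  8

struct drmem_lmb { unsigned long base_addr; };

/* Stub: pretend the device tree maps everything below 1 GiB to node 1
 * and knows nothing about other addresses. */
static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
{
    return lmb->base_addr < 0x40000000UL ? 1 : NUMA_NO_NODE;
}

static int first_online_node = 0;

/* The fallback dlpar_add_lmb() must now carry itself. */
static int lmb_to_nid(struct drmem_lmb *lmb)
{
    int nid = of_drconf_to_nid_single(lmb);

    if (nid < 0 || nid >= MAX_NUMNODES)
        nid = first_online_node;   /* as memory_add_physaddr_to_nid() does */

    return nid;
}
```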

    Skipping the extra lookup makes hot-add operations faster, especially
    on machines with many LMBs.

    Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000
    LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
    completed the same operation in ~2 hours:

    Unpatched (12450 seconds):
    Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
    Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
    [...]
    Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

    Patched (7065 seconds):
    Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
    Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
    [...]
    Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

    It should be noted that the speedup grows more substantial when
    hot-adding LMBs at the end of the drconf range. This is because we
    are skipping a linear LMB search.

    To see the distinction, consider smaller hot-add test on the same
    LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096
    LMBs completed less than 1 second faster on a patched kernel:

    Unpatched:
    Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

    104,753.42 msec task-clock # 0.992 CPUs utilized ( +- 0.55% )
    4,708 context-switches # 0.045 K/sec ( +- 0.69% )
    2,444 cpu-migrations # 0.023 K/sec ( +- 1.25% )
    394 page-faults # 0.004 K/sec ( +- 0.22% )
    445,902,503,057 cycles # 4.257 GHz ( +- 0.55% ) (66.67%)
    8,558,376,740 stalled-cycles-frontend # 1.92% frontend cycles idle ( +- 0.88% ) (49.99%)
    300,346,181,651 stalled-cycles-backend # 67.36% backend cycles idle ( +- 0.76% ) (50.01%)
    258,091,488,691 instructions # 0.58 insn per cycle
    # 1.16 stalled cycles per insn ( +- 0.22% ) (66.67%)
    70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%)
    3,100,725,426 branch-misses # 4.39% of all branches ( +- 0.20% ) (49.99%)

    105.583 +- 0.589 seconds time elapsed ( +- 0.56% )

    Patched:
    Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

    104,055.69 msec task-clock # 0.993 CPUs utilized ( +- 0.32% )
    4,606 context-switches # 0.044 K/sec ( +- 0.20% )
    2,463 cpu-migrations # 0.024 K/sec ( +- 0.93% )
    394 page-faults # 0.004 K/sec ( +- 0.25% )
    442,951,129,921 cycles # 4.257 GHz ( +- 0.32% ) (66.66%)
    8,710,413,329 stalled-cycles-frontend # 1.97% frontend cycles idle ( +- 0.47% ) (50.06%)
    299,656,905,836 stalled-cycles-backend # 67.65% backend cycles idle ( +- 0.39% ) (50.02%)
    252,731,168,193 instructions # 0.57 insn per cycle
    # 1.19 stalled cycles per insn ( +- 0.20% ) (66.66%)
    68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%)
    3,100,242,882 branch-misses # 4.50% of all branches ( +- 0.15% ) (49.98%)

    104.829 +- 0.325 seconds time elapsed ( +- 0.31% )

    This is consistent. An add-by-count hot-add operation adds LMBs
    greedily, so LMBs near the start of the drconf range are considered
    first. On an otherwise idle LPAR with so many LMBs we would expect to
    find the LMBs we need near the start of the drconf range, hence the
    smaller speedup.

    Signed-off-by: Scott Cheloha
    Reviewed-by: Laurent Dufour
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com

    Scott Cheloha
     
  • The copy buffer is implemented as a real address in the nest which is
    translated from EA by copy, and used for memory access by paste. This
    requires that it be invalidated by TLB invalidation.

    TLBIE does invalidate the copy buffer, but TLBIEL does not. Add
    cp_abort to the tlbiel sequence.

    Signed-off-by: Nicholas Piggin
    [mpe: Fixup whitespace and comment formatting]
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200916030234.4110379-2-npiggin@gmail.com

    Nicholas Piggin
     
  • Move more nitty gritty DMA implementation details into the common
    internal header.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

18 Sep, 2020

2 commits


16 Sep, 2020

8 commits

  • Look up the coregroup id from the associativity array.

    If unable to detect the coregroup id, fall back on the core id. This
    way, we ensure the sched_domain degenerates and an extra sched domain
    is not created.

    Ideally this function should have been implemented in
    arch/powerpc/kernel/smp.c. However, if it is implemented in mm/numa.c,
    we don't need to find the primary domain again.

    If the device-tree mentions more than one coregroup, then the kernel
    implements only the last, i.e. the smallest, coregroup, which
    currently corresponds to the penultimate domain in the device-tree.

    Signed-off-by: Srikar Dronamraju
    Reviewed-by: Gautham R. Shenoy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • Add percpu coregroup maps and masks to create the coregroup domain.
    If a coregroup doesn't exist, the coregroup domain will be degenerated
    in favour of the SMT/CACHE domain. Note that this patch only creates
    stubs for cpu_to_coregroup_id; the actual implementation comes in a
    subsequent patch.

    Signed-off-by: Srikar Dronamraju
    Reviewed-by: Gautham R. Shenoy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • Add support for grouping cores based on the device-tree classification.
    - The last domain in the associativity domains always refers to the
    core.
    - If the primary reference domain happens to be the penultimate domain
    in the associativity domains device-tree property, then there are no
    coregroups. However, if it is not the penultimate domain, then there
    are coregroups. There can be more than one coregroup. For now we are
    only interested in the last, i.e. the smallest, coregroup: one
    sub-group per DIE.

    Currently no firmware exposes this grouping, so keep the basis for
    grouping abstract. Once firmware starts using this grouping, code will
    be added to detect the type of grouping and adjust the sd domain flags
    accordingly.
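    The rule above (the last domain always refers to the core, so
    coregroups exist only when the primary reference domain precedes the
    penultimate one) can be captured in a minimal sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* nr_domains: number of associativity domains in the device-tree
 * property. primary_index: 0-based index of the primary reference
 * domain. The last domain (nr_domains - 1) always refers to the core,
 * so the penultimate domain is nr_domains - 2. Coregroups exist only
 * when the primary reference domain sits before the penultimate one. */
bool has_coregroups(int nr_domains, int primary_index)
{
    return primary_index < nr_domains - 2;
}
```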

    Signed-off-by: Srikar Dronamraju
    Reviewed-by: Gautham R. Shenoy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
    possible nodes marks node 0 as online at boot. However, in practice,
    there are systems which have node 0 as memoryless and cpuless.

    This can cause numa_balancing to be enabled on systems with only one
    node with memory and CPUs. The existence of this dummy node, which is
    cpuless and memoryless, can confuse users/scripts looking at the
    output of lscpu / numactl.

    By marking node 0 as offline, let's stop assuming that node 0 is
    always online. If node 0 has CPUs or memory that are online, node 0
    will again be set as online.

    v5.8
    available: 2 nodes (0,2)
    node 0 cpus:
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 2 cpus: 0 1 2 3 4 5 6 7
    node 2 size: 32625 MB
    node 2 free: 31490 MB
    node distances:
    node   0   2
      0:  10  20
      2:  20  10

    proc and sys files
    ------------------
    /sys/devices/system/node/online: 0,2
    /proc/sys/kernel/numa_balancing: 1
    /sys/devices/system/node/has_cpu: 2
    /sys/devices/system/node/has_memory: 2
    /sys/devices/system/node/has_normal_memory: 2
    /sys/devices/system/node/possible: 0-31

    v5.8 + patch
    ------------------
    available: 1 nodes (2)
    node 2 cpus: 0 1 2 3 4 5 6 7
    node 2 size: 32625 MB
    node 2 free: 31487 MB
    node distances:
    node   2
      2:  10

    proc and sys files
    ------------------
    /sys/devices/system/node/online: 2
    /proc/sys/kernel/numa_balancing: 0
    /sys/devices/system/node/has_cpu: 2
    /sys/devices/system/node/has_memory: 2
    /sys/devices/system/node/has_normal_memory: 2
    /sys/devices/system/node/possible: 0-31

    Example of a node with online CPUs/memory on node 0.
    (Same o/p with and without patch)
    numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
    node 0 size: 32482 MB
    node 0 free: 22994 MB
    node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 1 size: 0 MB
    node 1 free: 0 MB
    node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
    node 2 size: 0 MB
    node 2 free: 0 MB
    node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
    node 3 size: 0 MB
    node 3 free: 0 MB
    node distances:
    node   0   1   2   3
      0:  10  20  40  40
      1:  20  10  40  40
      2:  40  40  10  20
      3:  40  40  20  10

    Note: On Powerpc, cpu_to_node of possible but not present cpus would
    previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
    numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
    queried from vphn"). Without the 2 commits, Powerpc system might crash.

    1. User space applications like numactl and lscpu that parse sysfs
    tend to believe there is an extra online node. This tends to confuse
    users and applications. Other user space applications may conclude
    that the system was unable to use all of its resources (i.e. resources
    are missing) or that the system was not set up correctly.

    2. The existence of the dummy node also leads to inconsistent
    information: the number of online nodes disagrees with the information
    in the device-tree and resource dumps.

    3. When the dummy node is present, single-node non-NUMA systems end up
    appearing as NUMA systems and numa_balancing gets enabled. This means
    we take the hit from unnecessary numa hinting faults.

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • The node id queried from the static device tree may not be correct.
    For example, it may always show 0 on a shared processor. Hence prefer
    the node id queried from vphn, and fall back on the device-tree-based
    node id if the vphn query fails.
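    A minimal sketch of the preference order; NUMA_NO_NODE stands in for a
    failed VPHN query, and the function name is illustrative rather than
    the kernel's:

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Prefer the node id queried from VPHN; the static device tree may
 * report 0 for every cpu on a shared processor, so only fall back on
 * the device-tree node id when the VPHN query has failed. */
int preferred_node_id(int vphn_nid, int device_tree_nid)
{
    return vphn_nid != NUMA_NO_NODE ? vphn_nid : device_tree_nid;
}
```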

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • A Powerpc system with multiple possible nodes and with CONFIG_NUMA
    enabled always used to have a node 0, even if node 0 does not have any
    cpus or memory attached to it. As per PAPR, the node affinity of a cpu
    is only available once it is present / online. For all cpus that are
    possible but not present, cpu_to_node() would point to node 0.

    To ensure a cpuless, memoryless dummy node is not online, powerpc
    needs to make sure cpu_to_node() for all possible but not present cpus
    is set to a proper node.

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
    Abstraction Services (RTAS) Node" available at:
    https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf

    ... there are 2 device tree properties:

    "ibm,max-associativity-domains"
    which defines the maximum number of domains that the firmware
    (i.e. PowerVM) can support.

    and:

    "ibm,current-associativity-domains"
    which defines the maximum number of domains that the current
    platform can support.

    The value of the "ibm,max-associativity-domains" property is always
    greater than or equal to that of "ibm,current-associativity-domains".
    If the latter property is not available, use
    "ibm,max-associativity-domains" as a fallback. In this
    yet-to-be-released LoPAPR, "ibm,current-associativity-domains" is
    mentioned on page 833 / B.5.3, which falls under the
    "Appendix B. System Binding" section.

    Currently powerpc uses the "ibm,max-associativity-domains" property
    while setting the possible number of nodes. This is currently set at
    32. However the possible number of nodes for a platform may be
    significantly less. Hence set the possible number of nodes based on
    "ibm,current-associativity-domains" property.

    Nathan Lynch had raised a valid concern that post LPM (Live Partition
    Migration), a user could DLPAR add processors and memory after LPM
    with "new" associativity properties:
    https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u

    He also pointed out that "ibm,max-associativity-domains" has the same
    contents on all currently available PowerVM systems, unlike
    "ibm,current-associativity-domains" and hence may be better able to
    handle the new NUMA associativity properties.

    However with the recent commit dbce45628085 ("powerpc/numa: Limit
    possible nodes to within num_possible_nodes"), all new NUMA
    associativity properties are capped to initially set nr_node_ids.
    Hence this commit should be safe with any new DLPAR add post LPM.

    $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
    /proc/device-tree/rtas/ibm,current-associativity-domains
    00000005 00000001 00000002 00000002 00000002 00000010
    /proc/device-tree/rtas/ibm,max-associativity-domains
    00000005 00000001 00000008 00000020 00000020 00000100

    $ cat /sys/devices/system/node/possible ##Before patch
    0-31

    $ cat /sys/devices/system/node/possible ##After patch
    0-1

    Note that the maximum number of nodes this platform can support is
    only 2, but the possible nodes were set to 32.
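    The fallback between the two properties can be sketched as follows,
    using the cell values from the lsprop output earlier; which cell of
    the property actually bounds the node count depends on
    min_common_depth in the real code, so domain_index here is an
    assumption:

```c
#include <assert.h>
#include <stddef.h>

/* Pick the node-count bound from "ibm,current-associativity-domains",
 * falling back to "ibm,max-associativity-domains" when the former is
 * absent (current_doms == NULL). domain_index selects the cell that
 * bounds the number of NUMA nodes; in the real code that index is
 * derived from min_common_depth. */
unsigned int possible_nodes(const unsigned int *current_doms,
                            const unsigned int *max_doms,
                            size_t domain_index)
{
    const unsigned int *prop = current_doms ? current_doms : max_doms;

    return prop[domain_index];
}
```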

    This is important because a lot of kernel and user space code
    allocates structures for all possible nodes, leading to a lot of
    memory that is allocated but never used.

    I ran a simple experiment creating and destroying 100 memory cgroups
    at boot on an 8-node machine (Power8 Alpine).

    Before patch:
    free -k at boot
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4106816  518820608    22272      570752  516606720
    Swap:         4194240         0    4194240

    free -k after creating 100 memory cgroups
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4628416  518246464    22336      623296  516058688
    Swap:         4194240         0    4194240

    free -k after destroying 100 memory cgroups
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4697408  518173760    22400      627008  515987904
    Swap:         4194240         0    4194240

    After patch:
    free -k at boot
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   3969472  518933888    22272      594816  516731776
    Swap:         4194240         0    4194240

    free -k after creating 100 memory cgroups
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4181888  518676096    22208      640192  516496448
    Swap:         4194240         0    4194240

    free -k after destroying 100 memory cgroups
                    total      used       free   shared  buff/cache  available
    Mem:        523498176   4232320  518619904    22272      645952  516443264
    Swap:         4194240         0    4194240

    Observations:
    Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
    Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.

    Signed-off-by: Srikar Dronamraju
    [mpe: Reformat change log a bit for readability]
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com

    Srikar Dronamraju
     
  • Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
    single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
    a process under certain conditions. One of the assumptions is that
    mm_users would not be incremented via a reference outside the process
    context with mmget_not_zero() then go on to kthread_use_mm() via that
    reference.

    That invariant was broken by io_uring code (see previous sparc64 fix),
    but I'll point Fixes: to the original powerpc commit because we are
    changing that assumption going forward, so this will make backports
    match up.

    Fix this by no longer relying on that assumption, but by having each CPU
    check the mm is not being used, and clearing their own bit from the mask
    only if it hasn't been switched-to by the time the IPI is processed.

    This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
    invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
    switch sequences.

    Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Michael Ellerman
    Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200914045219.3736466-5-npiggin@gmail.com

    Nicholas Piggin
     

15 Sep, 2020

4 commits

  • This ensures we don't do a partial mapping of memory. With nvdimm, when
    creating namespaces with size not aligned to 16MB, the kernel ends up partially
    mapping the pages. This can result in kernel adding multiple hash page table
    entries for the same range. A new namespace will result in
    create_section_mapping() with start and end overlapping an already existing
    bolted hash page table entry.

    commit: 6acd7d5ef264 ("libnvdimm/namespace: Enforce memremap_compat_align()")
    made sure that we always create namespaces aligned to 16MB. But we can do
    better by avoiding mapping pages that are not aligned. This helps to catch
    access to these partially mapped pages early.
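    The alignment check being described can be sketched as follows; the
    function name and signature are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_16M (16UL * 1024 * 1024)
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Refuse to map a range unless both ends sit on the 16MB boundary the
 * bolted hash page table entries cover; mapping a partially aligned
 * range would let a later namespace's create_section_mapping() overlap
 * an already existing bolted entry. */
bool range_can_be_mapped(uint64_t start, uint64_t end)
{
    return IS_ALIGNED(start, SZ_16M) && IS_ALIGNED(end, SZ_16M);
}
```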

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200907072539.67310-1-aneesh.kumar@linux.ibm.com

    Aneesh Kumar K.V
     
  • Before the commit identified below, page table allocation was
    performed after the allocation of the final shadow area for linear
    memory. But that commit switched the order, so page tables are
    already allocated by the time the 8xx's kasan_init_shadow_8M() is
    called. Because of this, kasan_init_shadow_8M() doesn't map the
    needed shadow entries, as page tables are already present.

    kasan_init_shadow_8M() installs huge PMD entries instead of page
    tables. We could at that time free the page tables, but there is no
    point in creating page tables that get freed before being used.

    Only book3s/32 hash needs early allocation of page tables. For other
    variants, we can keep the initial order and create remaining page
    tables after the allocation of final shadow memory for linear mem.

    Move back the allocation of shadow page tables for
    CONFIG_KASAN_VMALLOC into kasan_init() after the loop which creates
    final shadow memory for linear mem.

    Fixes: 41ea93cf7ba4 ("powerpc/kasan: Fix shadow pages allocation failure")
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/8ae4554357da4882612644a74387ae05525b2aaa.1599800716.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • The 8xx has 4 page sizes: 4k, 16k, 512k and 8M

    4k and 16k can be selected at build time as standard page sizes,
    and 512k and 8M are hugepages.

    When 4k standard pages are selected, 16k pages are not available.

    Allow 16k pages as hugepages when 4k pages are used.

    To allow that, implement arch_make_huge_pte() which receives
    the necessary arguments to allow setting the PTE in accordance
    with the page size:
    - 512 k pages must have _PAGE_HUGE and _PAGE_SPS. They are set
    by pte_mkhuge(). arch_make_huge_pte() does nothing.
    - 16 k pages must have only _PAGE_SPS. arch_make_huge_pte() clears
    _PAGE_HUGE.
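    A sketch of the behaviour described above; the flag values are
    illustrative placeholders, not the real 8xx PTE bit positions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative PTE flag values, not the real 8xx bit positions. */
#define _PAGE_SPS  0x1u
#define _PAGE_HUGE 0x2u

#define SZ_16K  (16 * 1024)
#define SZ_512K (512 * 1024)

/* pte_mkhuge() sets both _PAGE_HUGE and _PAGE_SPS. 512k pages keep
 * both, while 16k hugepages must carry only _PAGE_SPS, so the
 * arch_make_huge_pte() hook clears _PAGE_HUGE for them. */
uint32_t arch_make_huge_pte_sketch(uint32_t pte, unsigned long page_size)
{
    if (page_size == SZ_16K)
        pte &= ~_PAGE_HUGE;
    return pte;
}
```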

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/a518abc29266a708dfbccc8fce9ae6694fe4c2c6.1598862623.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
  • On 8xx, the number of entries occupied by a PTE in the page tables
    depends on the size of the page. At present, this calculation is done
    in two places: in pte_update() and in set_huge_pte_at().

    Refactor this calculation into a helper called
    number_of_cells_per_pte(). For the time being, the val param is
    unused. It will be used by a following patch.

    Instead of opencoding is_hugepd(), use hugepd_ok() with a forward
    declaration.
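    The cell count can be illustrated with a deliberately simplified
    sketch (the real number_of_cells_per_pte() also takes the pmd and val
    arguments, and the 4k base size is an assumption here):

```c
#include <assert.h>
#include <stddef.h>

#define SZ_4K (4 * 1024)

/* Illustrative simplification: a PTE covering a page larger than the
 * base page size occupies page_size / base_size consecutive entries,
 * e.g. a 16k SPS page takes four 4k entries. */
size_t number_of_cells_per_pte_sketch(size_t page_size)
{
    if (page_size <= SZ_4K)
        return 1;
    return page_size / SZ_4K;
}
```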

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/f6ea2483c2c389567b007945948f704d18cfaeea.1598862623.git.christophe.leroy@csgroup.eu

    Christophe Leroy