20 Jul, 2019
11 commits
-
This fell into disrepair a while ago, and the majority of hits to the
snapshots were from bots, so it's more trouble to keep running than it's worth.Signed-off-by: Dave Jones
Signed-off-by: Linus Torvalds -
Pull tracing fix from Steven Rostedt:
"Eiichi Tsukata found a small bug from the fixup of the stack codeRemoving ULONG_MAX as the marker for the user stack trace end, made
the tracing code not know where the end is. The end is now marked with
a zero (NULL) pointer. Eiichi fixed this in the tracing code"* tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix user stack trace "??" output -
Pull arch/csky pupdates from Guo Ren:
"This round of csky subsystem gives two features (ASID algorithm
update, Perf pmu record support) and some fixups.ASID updates:
- Revert mmu ASID mechanism
- Add new asid lib code from arm
- Use generic asid algorithm to implement switch_mm
- Improve tlb operation with help of asidPerf pmu record support:
- Init pmu as a device
- Add count-width property for csky pmu
- Add pmu interrupt support
- Fix perf record in kernel/user space
- dt-bindings: Add csky PMU bindingsFixes:
- Fixup no panic in kernel for some traps
- Fixup some error count in 810 & 860.
- Fixup abiv1 memset error"* tag 'csky-for-linus-5.3-rc1' of git://github.com/c-sky/csky-linux:
csky: Fixup abiv1 memset error
csky: Improve tlb operation with help of asid
csky: Use generic asid algorithm to implement switch_mm
csky: Add new asid lib code from arm
csky: Revert mmu ASID mechanism
dt-bindings: csky: Add csky PMU bindings
dt-bindings: interrupt-controller: Update csky mpintc
csky: Fixup some error count in 810 & 860.
csky: Fix perf record in kernel/user space
csky: Add pmu interrupt support
csky: Add count-width property for csky pmu
csky: Init pmu as a device
csky: Fixup no panic in kernel for some traps
csky: Select intc & timer drivers -
Pull xen updates from Juergen Gross:
"Fixes and features:- A series to introduce a common command line parameter for disabling
paravirtual extensions when running as a guest in virtualized
environment- A fix for int3 handling in Xen pv guests
- Removal of the Xen-specific tmem driver as support of tmem in Xen
has been dropped (and it was experimental only)- A security fix for running as Xen dom0 (XSA-300)
- A fix for IRQ handling when offlining cpus in Xen guests
- Some small cleanups"
* tag 'for-linus-5.3a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: let alloc_xenballooned_pages() fail if not enough memory free
xen/pv: Fix a boot up hang revealed by int3 self test
x86/xen: Add "nopv" support for HVM guest
x86/paravirt: Remove const mark from x86_hyper_xen_hvm variable
xen: Map "xen_nopv" parameter to "nopv" and mark it obsolete
x86: Add "nopv" parameter to disable PV extensions
x86/xen: Mark xen_hvm_need_lapic() and xen_x2apic_para_available() as __init
xen: remove tmem driver
Revert "x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized"
xen/events: fix binding user event channels to cpus -
Pull iomap split/cleanup from Darrick Wong:
"As promised, here's the second part of the iomap merge for 5.3, in
which we break up iomap.c into smaller files grouped by functional
area so that it'll be easier in the long run to maintain cohesiveness
of code units and to review incoming patches. There are no functional
changes and fs/iomap.c split cleanly.Summary:
- Regroup the fs/iomap.c code by major functional area so that we can
start development for 5.4 from a more stable base"* tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: move internal declarations into fs/iomap/
iomap: move the main iteration code into a separate file
iomap: move the buffered IO code into a separate file
iomap: move the direct IO code into a separate file
iomap: move the SEEK_HOLE code into a separate file
iomap: move the file mapping reporting code into a separate file
iomap: move the swapfile code into a separate file
iomap: start moving code to fs/iomap/ -
Pull misc vfs updates from Al Viro:
"Assorted stuff"* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
perf_event_get(): don't bother with fget_raw()
vfs: update d_make_root() description -
Pull adfs updates from Al Viro:
"More ADFS patches from Russell King"* 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/adfs: add time stamp and file type helpers
fs/adfs: super: limit idlen according to directory type
fs/adfs: super: fix use-after-free bug
fs/adfs: super: safely update options on remount
fs/adfs: super: correct superblock flags
fs/adfs: clean up indirect disc addresses and fragment IDs
fs/adfs: clean up error message printing
fs/adfs: use %pV for error messages
fs/adfs: use format_version from disc_record
fs/adfs: add helper to get filesystem size
fs/adfs: add helper to get discrecord from map
fs/adfs: correct disc record structure -
Pull vfs mount updates from Al Viro:
"The first part of mount updates.Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
... -
Pull networking fixes from David Miller:
1) Fix AF_XDP cq entry leak, from Ilya Maximets.
2) Fix handling of PHY power-down on RTL8411B, from Heiner Kallweit.
3) Add some new PCI IDs to iwlwifi, from Ihab Zhaika.
4) Fix handling of neigh timers wrt. entries added by userspace, from
Lorenzo Bianconi.5) Various cases of missing of_node_put(), from Nishka Dasgupta.
6) The new NET_ACT_CT needs to depend upon NF_NAT, from Yue Haibing.
7) Various RDS layer fixes, from Gerd Rausch.
8) Fix some more fallout from TCQ_F_CAN_BYPASS generalization, from
Cong Wang.9) Fix FIB source validation checks over loopback, also from Cong Wang.
10) Use promisc for unsupported number of filters, from Justin Chen.
11) Missing sibling route unlink on failure in ipv6, from Ido Schimmel.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
tcp: fix tcp_set_congestion_control() use from bpf hook
ag71xx: fix return value check in ag71xx_probe()
ag71xx: fix error return code in ag71xx_probe()
usb: qmi_wwan: add D-Link DWM-222 A2 device ID
bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
net: dsa: sja1105: Fix missing unlock on error in sk_buff()
gve: replace kfree with kvfree
selftests/bpf: fix test_xdp_noinline on s390
selftests/bpf: fix "valid read map access into a read-only array 1" on s390
net/mlx5: Replace kfree with kvfree
MAINTAINERS: update netsec driver
ipv6: Unlink sibling route in case of failure
liquidio: Replace vmalloc + memset with vzalloc
udp: Fix typo in net/ipv4/udp.c
net: bcmgenet: use promisc for unsupported filters
ipv6: rt6_check should return NULL if 'from' is NULL
tipc: initialize 'validated' field of received packets
selftests: add a test case for rp_filter
fib: relax source validation check for loopback packets
mlxsw: spectrum: Do not process learned records with a dummy FID
... -
Merge yet more updates from Andrew Morton:
"The rest of MM and a kernel-wide procfs cleanup.Summary of the more significant patches:
- Patch series "mm/memory_hotplug: Factor out memory block
devicehandling", v3. David Hildenbrand.Some spring-cleaning of the memory hotplug code, notably in
drivers/base/memory.c- "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
Shi.Fix /proc/pid/smaps output for THP pages used in shmem.
- "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.
Bugfix and speedup for kernel/resource.c
- Patch series "mm: Further memory block device cleanups", David
Hildenbrand.More spring-cleaning of the memory hotplug code.
- Patch series "mm: Sub-section memory hotplug support". Dan
Williams.Generalise the memory hotplug code so that pmem can use it more
completely. Then remove the hacks from the libnvdimm code which
were there to work around the memory-hotplug code's constraints.- "proc/sysctl: add shared variables for range check", Matteo Croce.
We have about 250 instances of
int zero;
...
.extra1 = &zero,in the tree. This is a tree-wide sweep to make all those private
"zero"s and "one"s use global variables.Alas, it isn't practical to make those two global integers const"
* emailed patches from Andrew Morton : (38 commits)
proc/sysctl: add shared variables for range check
mm: migrate: remove unused mode argument
mm/sparsemem: cleanup 'section number' data types
libnvdimm/pfn: stop padding pmem namespaces to section alignment
libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
mm/devm_memremap_pages: enable sub-section remap
mm: document ZONE_DEVICE memory-model implications
mm/sparsemem: support sub-section hotplug
mm/sparsemem: prepare for sub-section ranges
mm: kill is_dev_zone() helper
mm/hotplug: kill is_dev_zone() usage in __remove_pages()
mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
mm/sparsemem: add helpers track active portions of a section at boot
mm/sparsemem: introduce a SECTION_IS_EARLY flag
mm/sparsemem: introduce struct mem_section_usage
drivers/base/memory.c: get rid of find_memory_block_hinted()
mm/memory_hotplug: move and simplify walk_memory_blocks()
mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
mm: make register_mem_sect_under_node() static
... -
Commit c5c27a0a5838 ("x86/stacktrace: Remove the pointless ULONG_MAX
marker") removes ULONG_MAX marker from user stack trace entries but
trace_user_stack_print() still uses the marker and it outputs unnecessary
"??".For example:
less-1911 [001] d..2 34.758944:
=>
=> ??
=> ??
=> ??
=> ??
=> ??
=> ??
=> ??The user stack trace code zeroes the storage before saving the stack, so if
the trace is shorter than the maximum number of entries it can terminate
the print loop if a zero entry is detected.Link: http://lkml.kernel.org/r/20190630085438.25545-1-devel@etsukata.com
Cc: stable@vger.kernel.org
Fixes: 4285f2fcef80 ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata
Signed-off-by: Steven Rostedt (VMware)
19 Jul, 2019
29 commits
-
Current memset implementation in abiv1 is wrong and it'll cause unalign
access. Just remove it and use the generic one. This patch will cause
performance degradation and we will improve it with a new design in next
patchset.Signed-off-by: Guo Ren
Cc: Arnd Bergmann -
There are two generations of tlb operation instruction for C-SKY.
First generation is use mcr register and it need software do more
things, second generation is use specific instructions, eg:
tlbi.va, tlbi.vas, tlbi.allsWe implemented the following functions:
- flush_tlb_range (a range of entries)
- flush_tlb_page (one entry)Above functions use asid from vma->mm to invalid tlb entries and
we could use tlbi.vas instruction for newest generation csky cpu.- flush_tlb_kernel_range
- flush_tlb_oneAbove functions don't care asid and it invalid the tlb entries only
with vpn and we could use tlbi.vaas instruction for newest generat-
ion csky cpu.Signed-off-by: Guo Ren
Cc: Arnd Bergmann -
Use linux generic asid/vmid algorithm to implement csky
switch_mm function. The algorithm is from arm and it could
work with SMP system. It'll help reduce tlb flush for
switch_mm in task/vm switch.Signed-off-by: Guo Ren
Cc: Arnd Bergmann -
This patch only contains asid help code from arm for next patch to
use.The asid allocator use five level check to reduce the cost of
switch_mm.1. Check if the asid version is the same (it's general)
2. Check reserved_asid which is set in rollover flush_context()
and key point is to keep the same bit position with the current
asid version instead of input version.
3. Check if the position of bitmap is free then it could be set &
used directly.
4. find_next_zero_bit() (a little performance cost)
5. flush_context (this is the worst cost with increase current asid
version)Check is level by level and cost is also higher with the next level.
The reserved_asid and bitmap mechanism prevent unnecessary
find_next_zero_bit().The atomic 64 bit asid is also suitable for 32-bit system and it
won't cost a lot in 1th 2th 3th level check.The operation of set/clear mm_cpumask was removed in arm64 compared to
arm32. It seems no side effect on current arm64 system, but from
software meaning it's wrong. Although csky also needn't it, we add it
back for csky.The asid_per_ctxt is no use for csky and it reserves the lowest bits for
other use, maybe: trust zone ? Ok, just keep it in csky copy.Seems it also could be used by other archs and it's worth to move asid
code to generic in future.Signed-off-by: Guo Ren
Cc: Arnd Bergmann
Cc: Julien Grall -
Current C-SKY ASID mechanism is from mips and it doesn't work well
with multi-cores. ASID per core mechanism is not suitable for C-SKY
SMP tlb maintain operations, eg: tlbi.vas need share the same asid
in all processors and it'll invalid the tlb entry in all cores with
the same asid.This patch is prepare for new ASID mechanism.
Signed-off-by: Guo Ren
Cc: Arnd Bergmann -
This patch adds the documentation to describe that how to add pmu node in
dts.Signed-off-by: Mao Han
Signed-off-by: Guo Ren
Cc: Rob Herring -
Add trigger type setting for csky,mpintc. The driver also could
support #interrupt-cells and it wouldn't invalidate existing
DTs. Here we only show the complete format.Signed-off-by: Guo Ren
Reviewed-by: Rob Herring
Cc: Marc Zyngier -
CK810 pmu only support event with index 0-8 and 0xd; CK860 only
support event 1~4, 0xa~0x1b. So do not register unsupport event
to hardware cache event, which may leader to unknown behavior.Signed-off-by: Mao Han
Signed-off-by: Guo Ren -
csky_pmu_event_init is called several times during the perf record
initialzation. After configure the event counter in either kernel
space or user space, csky_pmu_event_init is called twice with no
attr specified. Configuration will be overwritten with sampling in
both kernel space and user space. --all-kernel/--all-user is
useless without this patch applied.Signed-off-by: Mao Han
Signed-off-by: Guo Ren -
This patch add interrupt request and handler for csky pmu.
perf can record on hardware event with this patch applied.Signed-off-by: Mao Han
Signed-off-by: Guo Ren -
The csky pmu counter may have different io width. When the counter is
smaller then 64 bits and counter value is smaller than the old value, it
will result to a extremely large delta value. So the sampled value should
be extend to 64 bits to avoid this, the extension bits base on the
count-width property from dts.Signed-off-by: Mao Han
Signed-off-by: Guo Ren -
This patch change the csky pmu initialization from arch init to
device init. The pmu can be configued with information from
device tree(pmu device name, irq number and etc.).Signed-off-by: Mao Han
Signed-off-by: Guo Ren -
These traps couldn't be hanppen in kernel and we must panic there not
send a signal to userspace.Signed-off-by: Guo Ren
Cc: Arnd Bergmann -
Let arch help to select interrupt controller's and timer's drivers
instead of people using menuconfig to select. This help the mini system
boot up.Signed-off-by: Guo Ren
Cc: Arnd Bergmann -
Neal reported incorrect use of ns_capable() from bpf hook.
bpf_setsockopt(...TCP_CONGESTION...)
-> tcp_set_congestion_control()
-> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
-> ns_capable_common()
-> current_cred()
-> rcu_dereference_protected(current->cred, 1)Accessing 'current' in bpf context makes no sense, since packets
are processed from softirq context.As Neal stated : The capability check in tcp_set_congestion_control()
was written assuming a system call context, and then was reused from
a BPF call site.The fix is to add a new parameter to tcp_set_congestion_control(),
so that the ns_capable() call is only performed under the right
context.Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
Reported-by: Neal Cardwell
Acked-by: Neal Cardwell
Acked-by: Lawrence Brakmo
Signed-off-by: David S. Miller -
In case of error, the function of_get_mac_address() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().Fixes: d51b6ce441d3 ("net: ethernet: add ag71xx driver")
Signed-off-by: Wei Yongjun
Reviewed-by: Oleksij Rempel
Signed-off-by: David S. Miller -
Fix to return error code -ENOMEM from the dmam_alloc_coherent() error
handling case instead of 0, as done elsewhere in this function.Fixes: d51b6ce441d3 ("net: ethernet: add ag71xx driver")
Signed-off-by: Wei Yongjun
Reviewed-by: Oleksij Rempel
Signed-off-by: David S. Miller -
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce
Signed-off-by: Arnd Bergmann
Acked-by: Kees Cook
Reviewed-by: Aaron Tomlin
Cc: Matthew Wilcox
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
migrate_page_move_mapping() doesn't use the mode argument. Remove it
and update callers accordingly.Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
Signed-off-by: Keith Busch
Reviewed-by: Zi Yan
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
David points out that there is a mixture of 'int' and 'unsigned long'
usage for section number data types. Update the memory hotplug path to
use 'unsigned long' consistently for section numbers.[akpm@linux-foundation.org: fix printk format]
Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reported-by: David Hildenbrand
Reviewed-by: David Hildenbrand
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
memory, we no longer need to add padding at pfn/dax device creation
time. The kernel will still honor padding established by older kernels.Link: http://lkml.kernel.org/r/156092356588.979959.6793371748950931916.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reported-by: Jeff Moyer
Tested-by: Aneesh Kumar K.V [ppc64]
Cc: David Hildenbrand
Cc: Jane Chu
Cc: Jérôme Glisse
Cc: Jonathan Corbet
Cc: Logan Gunthorpe
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Oscar Salvador
Cc: Pavel Tatashin
Cc: Toshi Kani
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
At namespace creation time there is the potential for the "expected to
be zero" fields of a 'pfn' info-block to be filled with indeterminate
data. While the kernel buffer is zeroed on allocation it is immediately
overwritten by nd_pfn_validate() filling it with the current contents of
the on-media info-block location. For fields like, 'flags' and the
'padding' it potentially means that future implementations can not rely on
those fields being zero.In preparation to stop using the 'start_pad' and 'end_trunc' fields for
section alignment, arrange for fields that are not explicitly
initialized to be guaranteed zero. Bump the minor version to indicate
it is safe to assume the 'padding' and 'flags' are zero. Otherwise,
this corruption is expected to benign since all other critical fields
are explicitly initialized.Note The cc: stable is about spreading this new policy to as many
kernels as possible not fixing an issue in those kernels. It is not
until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
section alignment" where this improper initialization becomes a problem.
So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
namespaces to section alignment" (which is not tagged for stable), make
sure this pre-requisite is flagged.Link: http://lkml.kernel.org/r/156092356065.979959.6681003754765958296.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
Signed-off-by: Dan Williams
Tested-by: Aneesh Kumar K.V [ppc64]
Cc:
Cc: David Hildenbrand
Cc: Jane Chu
Cc: Jeff Moyer
Cc: Jérôme Glisse
Cc: Jonathan Corbet
Cc: Logan Gunthorpe
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Oscar Salvador
Cc: Pavel Tatashin
Cc: Toshi Kani
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Teach devm_memremap_pages() about the new sub-section capabilities of
arch_{add,remove}_memory(). Effectively, just replace all usage of
align_start, align_end, and align_size with res->start, res->end, and
resource_size(res). The existing sanity check will still make sure that
the two separate remap attempts do not collide within a sub-section (2MB
on x86).Link: http://lkml.kernel.org/r/156092355542.979959.10060071713397030576.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Tested-by: Aneesh Kumar K.V [ppc64]
Cc: Michal Hocko
Cc: Toshi Kani
Cc: Jérôme Glisse
Cc: Logan Gunthorpe
Cc: Oscar Salvador
Cc: Pavel Tatashin
Cc: David Hildenbrand
Cc: Jane Chu
Cc: Jeff Moyer
Cc: Jonathan Corbet
Cc: Mike Rapoport
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Explain the general mechanisms of 'ZONE_DEVICE' pages and list the users
of 'devm_memremap_pages()'.[dan.j.williams@intel.com: update ZONE_DEVICE memory model documentation]
Link: http://lkml.kernel.org/r/156109575458.1409767.1885676287099277666.stgit@dwillia2-desk3.amr.corp.intel.com
Link: http://lkml.kernel.org/r/156092354985.979959.15763234410543451710.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reported-by: Mike Rapoport
Reviewed-by: Mike Rapoport
Tested-by: Aneesh Kumar K.V [ppc64]
Cc: Jonathan Corbet
Cc: David Hildenbrand
Cc: Jane Chu
Cc: Jeff Moyer
Cc: Jérôme Glisse
Cc: Logan Gunthorpe
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Pavel Tatashin
Cc: Toshi Kani
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity.For example the following backtrace is emitted when attempting
arch_add_memory() with physical address ranges that intersect 'System
RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:# cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
100000000-1ffffffff : System RAM
200000000-303ffffff : Persistent Memory (legacy)
304000000-43fffffff : System RAM
440000000-23ffffffff : Persistent Memory
2400000000-43bfffffff : Persistent Memory
2400000000-43bfffffff : namespace2.0WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
[..]
RIP: 0010:add_pages+0x5c/0x60
[..]
Call Trace:
devm_memremap_pages+0x460/0x6e0
pmem_attach_disk+0x29e/0x680 [nd_pmem]
? nd_dax_probe+0xfc/0x120 [libnvdimm]
nvdimm_bus_probe+0x66/0x160 [libnvdimm]It was discovered that the problem goes beyond RAM vs PMEM collisions as
some platform produce PMEM vs PMEM collisions within a given section.
The libnvdimm workaround for that case revealed that the libnvdimm
section-alignment-padding implementation has been broken for a long
while.A fix for that long-standing breakage introduces as many problems as it
solves as it would require a backward-incompatible change to the
namespace metadata interpretation. Instead of that dubious route [1],
address the root problem in the memory-hotplug implementation.Note that EEXIST is no longer treated as success as that is how
sparse_add_section() reports subsection collisions, it was also obviated
by recent changes to perform the request_region() for 'System RAM'
before arch_add_memory() in the add_memory() sequence.[1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[osalvador@suse.de: fix deactivate_section for early sections]
Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Signed-off-by: Oscar Salvador
Tested-by: Aneesh Kumar K.V [ppc64]
Reviewed-by: Oscar Salvador
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Logan Gunthorpe
Cc: Pavel Tatashin
Cc: David Hildenbrand
Cc: Jane Chu
Cc: Jeff Moyer
Cc: Jérôme Glisse
Cc: Jonathan Corbet
Cc: Mike Rapoport
Cc: Toshi Kani
Cc: Wei Yang
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Prepare the memory hot-{add,remove} paths for handling sub-section
ranges by plumbing the starting page frame and number of pages being
handled through arch_{add,remove}_memory() to
sparse_{add,remove}_one_section().This is simply plumbing, small cleanups, and some identifier renames.
No intended functional changes.Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reviewed-by: Pavel Tatashin
Tested-by: Aneesh Kumar K.V [ppc64]
Reviewed-by: Oscar Salvador
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Logan Gunthorpe
Cc: David Hildenbrand
Cc: Jane Chu
Cc: Jeff Moyer
Cc: Jérôme Glisse
Cc: Jonathan Corbet
Cc: Mike Rapoport
Cc: Toshi Kani
Cc: Wei Yang
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Given there are no more usages of is_dev_zone() outside of 'ifdef
CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reviewed-by: Oscar Salvador
Reviewed-by: Pavel Tatashin
Reviewed-by: Wei Yang
Acked-by: David Hildenbrand
Tested-by: Aneesh Kumar K.V [ppc64]
Cc: Michal Hocko
Cc: Logan Gunthorpe
Cc: Jane Chu
Cc: Jeff Moyer
Cc: Jérôme Glisse
Cc: Jonathan Corbet
Cc: Mike Rapoport
Cc: Toshi Kani
Cc: Vlastimil Babka
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The zone type check was a leftover from the cleanup that plumbed altmap
through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
vmem_altmap to arch_remove_memory and __remove_pages".Link: http://lkml.kernel.org/r/156092352642.979959.6664333788149363039.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reviewed-by: David Hildenbrand
Reviewed-by: Oscar Salvador
Tested-by: Aneesh Kumar K.V [ppc64]
Cc: Michal Hocko
Cc: Logan Gunthorpe
Cc: Pavel Tatashin
Cc: Jane Chu
Cc: Jeff Moyer
Cc: Jérôme Glisse
Cc: Jonathan Corbet
Cc: Mike Rapoport
Cc: Toshi Kani
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Allow sub-section sized ranges to be added to the memmap.
populate_section_memmap() takes an explict pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmmemap_populate(). There should be no sub-section usage in
current deployments. New warnings are added to clarify which memmap
allocation paths are sub-section capable.Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reviewed-by: Pavel Tatashin
Tested-by: Aneesh Kumar K.V [ppc64]
Reviewed-by: Oscar Salvador
Cc: Michal Hocko
Cc: David Hildenbrand
Cc: Logan Gunthorpe
Cc: Jane Chu
Cc: Jeff Moyer
Cc: Jérôme Glisse
Cc: Jonathan Corbet
Cc: Mike Rapoport
Cc: Toshi Kani
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds