20 Jul, 2019

11 commits

  • This fell into disrepair a while ago, and the majority of hits to the
    snapshots were from bots, so it's more trouble to keep running than it's worth.

    Signed-off-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Pull tracing fix from Steven Rostedt:
    "Eiichi Tsukata found a small bug from the fixup of the stack code.

    Removing ULONG_MAX as the marker for the end of the user stack trace
    left the tracing code unable to tell where the trace ends. The end is
    now marked with a zero (NULL) pointer. Eiichi fixed this in the tracing
    code."

    * tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix user stack trace "??" output

    Linus Torvalds
     
  • Pull arch/csky updates from Guo Ren:
    "This round of csky subsystem gives two features (ASID algorithm
    update, Perf pmu record support) and some fixups.

    ASID updates:
    - Revert mmu ASID mechanism
    - Add new asid lib code from arm
    - Use generic asid algorithm to implement switch_mm
    - Improve tlb operation with help of asid

    Perf pmu record support:
    - Init pmu as a device
    - Add count-width property for csky pmu
    - Add pmu interrupt support
    - Fix perf record in kernel/user space
    - dt-bindings: Add csky PMU bindings

    Fixes:
    - Fixup no panic in kernel for some traps
    - Fixup some error count in 810 & 860.
    - Fixup abiv1 memset error"

    * tag 'csky-for-linus-5.3-rc1' of git://github.com/c-sky/csky-linux:
    csky: Fixup abiv1 memset error
    csky: Improve tlb operation with help of asid
    csky: Use generic asid algorithm to implement switch_mm
    csky: Add new asid lib code from arm
    csky: Revert mmu ASID mechanism
    dt-bindings: csky: Add csky PMU bindings
    dt-bindings: interrupt-controller: Update csky mpintc
    csky: Fixup some error count in 810 & 860.
    csky: Fix perf record in kernel/user space
    csky: Add pmu interrupt support
    csky: Add count-width property for csky pmu
    csky: Init pmu as a device
    csky: Fixup no panic in kernel for some traps
    csky: Select intc & timer drivers

    Linus Torvalds
     
  • Pull xen updates from Juergen Gross:
    "Fixes and features:

    - A series to introduce a common command line parameter for disabling
    paravirtual extensions when running as a guest in virtualized
    environment

    - A fix for int3 handling in Xen pv guests

    - Removal of the Xen-specific tmem driver as support of tmem in Xen
    has been dropped (and it was experimental only)

    - A security fix for running as Xen dom0 (XSA-300)

    - A fix for IRQ handling when offlining cpus in Xen guests

    - Some small cleanups"

    * tag 'for-linus-5.3a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen: let alloc_xenballooned_pages() fail if not enough memory free
    xen/pv: Fix a boot up hang revealed by int3 self test
    x86/xen: Add "nopv" support for HVM guest
    x86/paravirt: Remove const mark from x86_hyper_xen_hvm variable
    xen: Map "xen_nopv" parameter to "nopv" and mark it obsolete
    x86: Add "nopv" parameter to disable PV extensions
    x86/xen: Mark xen_hvm_need_lapic() and xen_x2apic_para_available() as __init
    xen: remove tmem driver
    Revert "x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized"
    xen/events: fix binding user event channels to cpus

    Linus Torvalds
     
  • Pull iomap split/cleanup from Darrick Wong:
    "As promised, here's the second part of the iomap merge for 5.3, in
    which we break up iomap.c into smaller files grouped by functional
    area so that it'll be easier in the long run to maintain cohesiveness
    of code units and to review incoming patches. There are no functional
    changes and fs/iomap.c split cleanly.

    Summary:

    - Regroup the fs/iomap.c code by major functional area so that we can
    start development for 5.4 from a more stable base"

    * tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    iomap: move internal declarations into fs/iomap/
    iomap: move the main iteration code into a separate file
    iomap: move the buffered IO code into a separate file
    iomap: move the direct IO code into a separate file
    iomap: move the SEEK_HOLE code into a separate file
    iomap: move the file mapping reporting code into a separate file
    iomap: move the swapfile code into a separate file
    iomap: start moving code to fs/iomap/

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted stuff"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    perf_event_get(): don't bother with fget_raw()
    vfs: update d_make_root() description

    Linus Torvalds
     
  • Pull adfs updates from Al Viro:
    "More ADFS patches from Russell King"

    * 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs/adfs: add time stamp and file type helpers
    fs/adfs: super: limit idlen according to directory type
    fs/adfs: super: fix use-after-free bug
    fs/adfs: super: safely update options on remount
    fs/adfs: super: correct superblock flags
    fs/adfs: clean up indirect disc addresses and fragment IDs
    fs/adfs: clean up error message printing
    fs/adfs: use %pV for error messages
    fs/adfs: use format_version from disc_record
    fs/adfs: add helper to get filesystem size
    fs/adfs: add helper to get discrecord from map
    fs/adfs: correct disc record structure

    Linus Torvalds
     
  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Fix AF_XDP cq entry leak, from Ilya Maximets.

    2) Fix handling of PHY power-down on RTL8411B, from Heiner Kallweit.

    3) Add some new PCI IDs to iwlwifi, from Ihab Zhaika.

    4) Fix handling of neigh timers wrt. entries added by userspace, from
    Lorenzo Bianconi.

    5) Various cases of missing of_node_put(), from Nishka Dasgupta.

    6) The new NET_ACT_CT needs to depend upon NF_NAT, from Yue Haibing.

    7) Various RDS layer fixes, from Gerd Rausch.

    8) Fix some more fallout from TCQ_F_CAN_BYPASS generalization, from
    Cong Wang.

    9) Fix FIB source validation checks over loopback, also from Cong Wang.

    10) Use promisc for unsupported number of filters, from Justin Chen.

    11) Missing sibling route unlink on failure in ipv6, from Ido Schimmel.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
    tcp: fix tcp_set_congestion_control() use from bpf hook
    ag71xx: fix return value check in ag71xx_probe()
    ag71xx: fix error return code in ag71xx_probe()
    usb: qmi_wwan: add D-Link DWM-222 A2 device ID
    bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
    net: dsa: sja1105: Fix missing unlock on error in sk_buff()
    gve: replace kfree with kvfree
    selftests/bpf: fix test_xdp_noinline on s390
    selftests/bpf: fix "valid read map access into a read-only array 1" on s390
    net/mlx5: Replace kfree with kvfree
    MAINTAINERS: update netsec driver
    ipv6: Unlink sibling route in case of failure
    liquidio: Replace vmalloc + memset with vzalloc
    udp: Fix typo in net/ipv4/udp.c
    net: bcmgenet: use promisc for unsupported filters
    ipv6: rt6_check should return NULL if 'from' is NULL
    tipc: initialize 'validated' field of received packets
    selftests: add a test case for rp_filter
    fib: relax source validation check for loopback packets
    mlxsw: spectrum: Do not process learned records with a dummy FID
    ...

    Linus Torvalds
     
  • Merge yet more updates from Andrew Morton:
    "The rest of MM and a kernel-wide procfs cleanup.

    Summary of the more significant patches:

    - Patch series "mm/memory_hotplug: Factor out memory block
    device handling", v3. David Hildenbrand.

    Some spring-cleaning of the memory hotplug code, notably in
    drivers/base/memory.c

    - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
    Shi.

    Fix /proc/pid/smaps output for THP pages used in shmem.

    - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.

    Bugfix and speedup for kernel/resource.c

    - Patch series "mm: Further memory block device cleanups", David
    Hildenbrand.

    More spring-cleaning of the memory hotplug code.

    - Patch series "mm: Sub-section memory hotplug support". Dan
    Williams.

    Generalise the memory hotplug code so that pmem can use it more
    completely. Then remove the hacks from the libnvdimm code which
    were there to work around the memory-hotplug code's constraints.

    - "proc/sysctl: add shared variables for range check", Matteo Croce.

    We have about 250 instances of

    int zero;
    ...
    .extra1 = &zero,

    in the tree. This is a tree-wide sweep to make all those private
    "zero"s and "one"s use global variables.

    Alas, it isn't practical to make those two global integers const"

    * emailed patches from Andrew Morton: (38 commits)
    proc/sysctl: add shared variables for range check
    mm: migrate: remove unused mode argument
    mm/sparsemem: cleanup 'section number' data types
    libnvdimm/pfn: stop padding pmem namespaces to section alignment
    libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
    mm/devm_memremap_pages: enable sub-section remap
    mm: document ZONE_DEVICE memory-model implications
    mm/sparsemem: support sub-section hotplug
    mm/sparsemem: prepare for sub-section ranges
    mm: kill is_dev_zone() helper
    mm/hotplug: kill is_dev_zone() usage in __remove_pages()
    mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
    mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
    mm/sparsemem: add helpers track active portions of a section at boot
    mm/sparsemem: introduce a SECTION_IS_EARLY flag
    mm/sparsemem: introduce struct mem_section_usage
    drivers/base/memory.c: get rid of find_memory_block_hinted()
    mm/memory_hotplug: move and simplify walk_memory_blocks()
    mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
    mm: make register_mem_sect_under_node() static
    ...

    Linus Torvalds
     
  • Commit c5c27a0a5838 ("x86/stacktrace: Remove the pointless ULONG_MAX
    marker") removes the ULONG_MAX marker from user stack trace entries, but
    trace_user_stack_print() still uses the marker and outputs unnecessary
    "??".

    For example:

    less-1911 [001] d..2 34.758944:
    =>
    => ??
    => ??
    => ??
    => ??
    => ??
    => ??
    => ??

    The user stack trace code zeroes the storage before saving the stack, so if
    the trace is shorter than the maximum number of entries, the print loop can
    terminate as soon as it detects a zero entry.
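
    The termination rule described above can be sketched in user-space C.
    This is only an illustration, not the kernel's trace_user_stack_print():
    the function name, its signature and the output format are assumptions,
    and the buffer is assumed to have been zeroed before the trace was saved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the print loop. Prints each saved address and
 * returns how many entries were printed.
 */
size_t print_user_stack(const unsigned long *entries, size_t max_entries,
                        FILE *out)
{
    size_t printed = 0;

    assert(entries != NULL);
    for (size_t i = 0; i < max_entries; i++) {
        /* A zero entry marks the end of a trace shorter than the buffer. */
        if (entries[i] == 0)
            break;
        fprintf(out, " => <%016lx>\n", entries[i]);
        printed++;
    }
    return printed;
}
```

    A trace that fills the whole buffer has no zero entry, so the loop then
    stops at max_entries instead.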

    Link: http://lkml.kernel.org/r/20190630085438.25545-1-devel@etsukata.com

    Cc: stable@vger.kernel.org
    Fixes: 4285f2fcef80 ("tracing: Remove the ULONG_MAX stack trace hackery")
    Signed-off-by: Eiichi Tsukata
    Signed-off-by: Steven Rostedt (VMware)

    Eiichi Tsukata
     

19 Jul, 2019

29 commits

  • The current memset implementation in abiv1 is wrong and causes unaligned
    accesses. Remove it and use the generic one. This patch will degrade
    performance; we will improve it with a new design in the next
    patchset.

    Signed-off-by: Guo Ren
    Cc: Arnd Bergmann

    Guo Ren
     
  • There are two generations of tlb operation instructions for C-SKY.
    The first generation uses the mcr register and needs software to do
    more of the work; the second generation uses dedicated instructions, e.g.
    tlbi.va, tlbi.vas, tlbi.alls

    We implemented the following functions:

    - flush_tlb_range (a range of entries)
    - flush_tlb_page (one entry)

    The functions above use the asid from vma->mm to invalidate tlb entries,
    and we can use the tlbi.vas instruction on the newest generation of csky cpu.

    - flush_tlb_kernel_range
    - flush_tlb_one

    The functions above ignore the asid and invalidate the tlb entries by
    vpn only, and we can use the tlbi.vaas instruction on the newest
    generation of csky cpu.

    Signed-off-by: Guo Ren
    Cc: Arnd Bergmann

    Guo Ren
     
  • Use the generic Linux asid/vmid algorithm to implement the csky
    switch_mm function. The algorithm comes from arm and works on
    SMP systems. It helps reduce tlb flushes for
    switch_mm on task/vm switch.

    Signed-off-by: Guo Ren
    Cc: Arnd Bergmann

    Guo Ren
     
  • This patch only contains the asid helper code from arm for the next patch
    to use.

    The asid allocator uses a five-level check to reduce the cost of
    switch_mm.

    1. Check if the asid version is the same (the common case)
    2. Check reserved_asid, which is set in the rollover flush_context();
    the key point is to keep the same bit position with the current
    asid version instead of the input version.
    3. Check if the bitmap position is free, in which case it can be set and
    used directly.
    4. find_next_zero_bit() (a small performance cost)
    5. flush_context() (the worst cost, and it increases the current asid
    version)

    The checks run level by level, and each level costs more than the last.
    The reserved_asid and bitmap mechanisms prevent unnecessary
    find_next_zero_bit() calls.
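
    The five levels can be sketched as a deliberately simplified user-space
    model. It assumes a single CPU, uses no atomics or locking, and does no
    real tlb flush. All names (asid_allocator, asid_switch and so on) and
    the tiny 3-bit asid space are assumptions made for illustration; this
    is not the arm or csky implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ASID_BITS     3u                     /* tiny on purpose, so rollover is easy to reach */
#define NUM_ASIDS     (1u << ASID_BITS)
#define IDX_MASK      ((uint64_t)NUM_ASIDS - 1)
#define FIRST_VERSION ((uint64_t)NUM_ASIDS)  /* version lives above the index bits */

struct asid_allocator {
    uint64_t generation;        /* current asid version */
    uint64_t active;            /* asid running on the single modelled CPU */
    uint64_t reserved;          /* asid carried across the last rollover */
    bool in_use[NUM_ASIDS];     /* stand-in for the allocation bitmap */
    unsigned int cur_idx;       /* where the next linear search starts */
    unsigned int rollovers;     /* counts level-5 events, for illustration */
};

void asid_init(struct asid_allocator *a)
{
    memset(a, 0, sizeof(*a));
    a->generation = FIRST_VERSION;
    a->cur_idx = 1;             /* index 0 is never handed out: asid 0 means "none yet" */
}

/* Level 4 helper: linear search for a free index, NUM_ASIDS if none. */
unsigned int find_free(const struct asid_allocator *a, unsigned int start)
{
    for (unsigned int i = start; i < NUM_ASIDS; i++)
        if (!a->in_use[i])
            return i;
    return NUM_ASIDS;
}

/* Level 5: start a new version, keeping only the running asid reserved. */
void flush_context(struct asid_allocator *a)
{
    uint64_t asid = a->active ? a->active : a->reserved;

    memset(a->in_use, 0, sizeof(a->in_use));
    a->active = 0;
    a->in_use[asid & IDX_MASK] = true;
    a->reserved = asid;
    a->generation += FIRST_VERSION;
    a->rollovers++;
}

uint64_t new_context(struct asid_allocator *a, uint64_t asid)
{
    unsigned int idx;

    if (asid != 0) {
        uint64_t newasid = a->generation | (asid & IDX_MASK);

        /* Level 2: the asid was reserved at rollover, keep its index. */
        if (a->reserved == asid) {
            a->reserved = newasid;
            return newasid;
        }
        /* Level 3: the old index is still free in this version. */
        if (!a->in_use[asid & IDX_MASK]) {
            a->in_use[asid & IDX_MASK] = true;
            return newasid;
        }
    }
    idx = find_free(a, a->cur_idx);          /* level 4 */
    if (idx == NUM_ASIDS) {
        flush_context(a);                    /* level 5 */
        idx = find_free(a, 1);
    }
    a->in_use[idx] = true;
    a->cur_idx = idx;
    return a->generation | idx;
}

/* What a switch_mm-like caller would do with the asid stored per mm. */
uint64_t asid_switch(struct asid_allocator *a, uint64_t *mm_asid)
{
    uint64_t asid = *mm_asid;

    /* Level 1: same version, nothing to allocate. */
    if (asid == 0 || (asid & ~IDX_MASK) != a->generation) {
        asid = new_context(a, asid);
        *mm_asid = asid;
    }
    a->active = asid;
    return asid;
}
```

    With seven usable indexes, switching to an eighth address space forces a
    rollover. An address space that was running at that moment keeps its
    index through the reserved slot, and the others fall through to the
    bitmap test or the linear search.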

    The atomic 64-bit asid is also suitable for 32-bit systems, and it
    costs little in the first, second and third level checks.

    The set/clear of mm_cpumask was removed in arm64 compared to
    arm32. It seems to have no side effect on current arm64 systems, but
    semantically it is wrong. Although csky doesn't need it either, we add it
    back for csky.

    asid_per_ctxt is of no use for csky; it reserves the lowest bits for
    other uses, maybe trust zone? OK, just keep it in the csky copy.

    It seems this could also be used by other archs, and it is worth moving
    the asid code to generic code in the future.

    Signed-off-by: Guo Ren
    Cc: Arnd Bergmann
    Cc: Julien Grall

    Guo Ren
     
  • The current C-SKY ASID mechanism comes from mips and doesn't work well
    with multiple cores. A per-core ASID mechanism is not suitable for C-SKY
    SMP tlb maintenance operations, e.g. tlbi.vas needs the same asid to be
    shared by all processors, and it invalidates the tlb entry on all cores
    with that asid.

    This patch prepares for the new ASID mechanism.

    Signed-off-by: Guo Ren
    Cc: Arnd Bergmann

    Guo Ren
     
  • This patch adds documentation describing how to add a pmu node in
    dts.

    Signed-off-by: Mao Han
    Signed-off-by: Guo Ren
    Cc: Rob Herring

    Mao Han
     
  • Add trigger type setting for csky,mpintc. The driver could also
    support #interrupt-cells, and that wouldn't invalidate existing
    DTs. Here we only show the complete format.

    Signed-off-by: Guo Ren
    Reviewed-by: Rob Herring
    Cc: Marc Zyngier

    Guo Ren
     
  • The CK810 pmu only supports events with index 0-8 and 0xd; the CK860
    only supports events 1~4 and 0xa~0x1b. So do not register unsupported
    events as hardware cache events, which may lead to unknown behavior.
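
    The supported ranges above can be written as a small predicate. The
    enum and the function name are hypothetical; only the index ranges come
    from this commit message.

```c
#include <stdbool.h>

/* Hypothetical identifiers for the two cores named in the message. */
enum csky_core { CORE_CK810, CORE_CK860 };

/* Returns true if the event index is one the core's pmu supports. */
bool pmu_event_supported(enum csky_core core, unsigned int idx)
{
    switch (core) {
    case CORE_CK810:
        /* index 0-8 and 0xd */
        return idx <= 8 || idx == 0xd;
    case CORE_CK860:
        /* index 1-4 and 0xa-0x1b */
        return (idx >= 1 && idx <= 4) || (idx >= 0xa && idx <= 0x1b);
    }
    return false;
}
```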

    Signed-off-by: Mao Han
    Signed-off-by: Guo Ren

    Guo Ren
     
  • csky_pmu_event_init is called several times during perf record
    initialization. After the event counter is configured in either kernel
    space or user space, csky_pmu_event_init is called twice with no
    attr specified. The configuration is then overwritten with sampling in
    both kernel space and user space. --all-kernel/--all-user is
    useless without this patch applied.

    Signed-off-by: Mao Han
    Signed-off-by: Guo Ren

    Mao Han
     
  • This patch adds the interrupt request and handler for the csky pmu.
    perf can record hardware events with this patch applied.

    Signed-off-by: Mao Han
    Signed-off-by: Guo Ren

    Mao Han
     
  • The csky pmu counters may have different io widths. When a counter is
    narrower than 64 bits and the counter value is smaller than the old
    value, the result is an extremely large delta value. So the sampled value
    should be extended to 64 bits to avoid this; the number of extension bits
    is based on the count-width property from dts.
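
    A minimal sketch of the width extension, assuming the raw counter value
    is sign-extended from count-width bits before the delta is taken. The
    function names are hypothetical. The sketch only covers a counter that
    wraps from its top value to a small one; a crossing of the sign-bit
    midpoint between two samples is not handled here.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `width` bits of a raw counter value to 64 bits. */
int64_t extend_counter(uint64_t raw, unsigned int width)
{
    assert(width >= 1 && width <= 64);

    uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
    uint64_t sign_bit = 1ULL << (width - 1);

    raw &= mask;
    /* Flip the sign bit, then subtract it: values at or above it go negative. */
    return (int64_t)((raw ^ sign_bit) - sign_bit);
}

/* Delta between two samples of a `width`-bit counter. */
int64_t counter_delta(uint64_t prev_raw, uint64_t new_raw, unsigned int width)
{
    /* Subtract in unsigned arithmetic to avoid signed overflow. */
    return (int64_t)((uint64_t)extend_counter(new_raw, width) -
                     (uint64_t)extend_counter(prev_raw, width));
}
```

    For a 32-bit counter that moves from 0xFFFFFFF0 to 0x10, subtracting the
    raw values as 64-bit numbers gives a bogus result, while the extended
    values differ by the 32 events that were actually counted.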

    Signed-off-by: Mao Han
    Signed-off-by: Guo Ren

    Mao Han
     
  • This patch changes the csky pmu initialization from arch init to
    device init. The pmu can be configured with information from the
    device tree (pmu device name, irq number, etc.).

    Signed-off-by: Mao Han
    Signed-off-by: Guo Ren

    Mao Han
     
  • These traps should not happen in the kernel; if they do, we must panic
    rather than send a signal to userspace.

    Signed-off-by: Guo Ren
    Cc: Arnd Bergmann

    Guo Ren
     
  • Let the arch select the interrupt controller and timer drivers
    instead of relying on people to select them in menuconfig. This helps a
    minimal system boot up.

    Signed-off-by: Guo Ren
    Cc: Arnd Bergmann

    Guo Ren
     
  • Neal reported incorrect use of ns_capable() from bpf hook.

    bpf_setsockopt(...TCP_CONGESTION...)
    -> tcp_set_congestion_control()
    -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
    -> ns_capable_common()
    -> current_cred()
    -> rcu_dereference_protected(current->cred, 1)

    Accessing 'current' in bpf context makes no sense, since packets
    are processed from softirq context.

    As Neal stated : The capability check in tcp_set_congestion_control()
    was written assuming a system call context, and then was reused from
    a BPF call site.

    The fix is to add a new parameter to tcp_set_congestion_control(),
    so that the ns_capable() call is only performed under the right
    context.
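
    The shape of that fix can be sketched in user-space C. Everything here
    is a stand-in: the struct, the function names, the "restricted"
    algorithm name and the call counter are assumptions for illustration.
    What the BPF-side caller passes for the new parameter is also an
    assumption of this sketch. The point shown is only that the credential
    check runs in the caller that has a meaningful task context.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct sock_sketch {
    char ca_name[16];
};

int capability_checks;   /* how often the credential check ran */
bool caller_is_admin;    /* stands in for the calling task's credentials */

/* Stand-in for a check that only makes sense in system call context. */
bool check_net_admin(void)
{
    capability_checks++;
    return caller_is_admin;
}

/* Hypothetical rule: one algorithm name needs the capability. */
bool ca_is_restricted(const char *name)
{
    return strcmp(name, "restricted") == 0;
}

/* The capability result is now a parameter instead of being looked up here. */
int set_congestion_control(struct sock_sketch *sk, const char *name,
                           bool cap_net_admin)
{
    if (ca_is_restricted(name) && !cap_net_admin)
        return -EPERM;
    snprintf(sk->ca_name, sizeof(sk->ca_name), "%s", name);
    return 0;
}

/* System call path: a current task exists, so the check is evaluated here. */
int setsockopt_path(struct sock_sketch *sk, const char *name)
{
    return set_congestion_control(sk, name, check_net_admin());
}

/* Softirq-like path: no task credentials are consulted at all. */
int bpf_path(struct sock_sketch *sk, const char *name)
{
    return set_congestion_control(sk, name, true);
}
```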

    Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    Signed-off-by: Eric Dumazet
    Cc: Lawrence Brakmo
    Reported-by: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In case of error, the function of_get_mac_address() returns ERR_PTR()
    and never returns NULL. The NULL test in the return value check should
    be replaced with IS_ERR().
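
    To show why a NULL test misses this kind of failure, here is a
    user-space imitation of the error-pointer convention. The lower-case
    helpers and get_mac_address_sketch() are hypothetical stand-ins written
    for this example, not the kernel's ERR_PTR()/IS_ERR() macros or
    of_get_mac_address() itself.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the top MAX_ERRNO addresses. */
bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lookup that fails with an encoded error, never with NULL. */
const unsigned char *get_mac_address_sketch(bool present)
{
    static const unsigned char mac[6] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 };

    if (!present)
        return (const unsigned char *)err_ptr(-ENODEV);
    return mac;
}
```

    A caller that only tests the result against NULL would treat the failed
    lookup as success, because the encoded error is a non-NULL pointer.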

    Fixes: d51b6ce441d3 ("net: ethernet: add ag71xx driver")
    Signed-off-by: Wei Yongjun
    Reviewed-by: Oleksij Rempel
    Signed-off-by: David S. Miller

    Wei Yongjun
     
  • Fix to return error code -ENOMEM from the dmam_alloc_coherent() error
    handling case instead of 0, as done elsewhere in this function.

    Fixes: d51b6ce441d3 ("net: ethernet: add ag71xx driver")
    Signed-off-by: Wei Yongjun
    Reviewed-by: Oleksij Rempel
    Signed-off-by: David S. Miller

    Wei Yongjun
     
  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that a user-supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members of struct ctl_table as the
    minimum and maximum allowed values.

    Where sysctl handlers are declared, every source file has some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so that the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
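
    The idea can be sketched in user-space C: one shared constant array,
    macros naming its members, and table entries that point into it instead
    of at per-file variables. The names below (shared_vals, SHARED_ZERO,
    struct ctl_entry, ctl_write) are hypothetical and the struct is a
    simplification, not struct ctl_table.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* One shared array instead of a private zero/one/int_max per source file. */
const int shared_vals[] = { 0, 1, INT_MAX };

#define SHARED_ZERO    (&shared_vals[0])
#define SHARED_ONE     (&shared_vals[1])
#define SHARED_INT_MAX (&shared_vals[2])

/* Simplified table entry; min and max play the role of extra1 and extra2. */
struct ctl_entry {
    const char *name;
    int *data;
    const int *min;
    const int *max;
};

/* Store the value only if it lies within the entry's range. */
int ctl_write(const struct ctl_entry *e, int value)
{
    if ((e->min && value < *e->min) || (e->max && value > *e->max))
        return -EINVAL;
    *e->data = value;
    return 0;
}
```

    Two entries that both use SHARED_ZERO as their lower bound point at the
    same integer, which is where the size reduction comes from.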

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data old new delta
    sysctl_vals - 12 +12
    __kstrtab_sysctl_vals - 12 +12
    max 14 10 -4
    int_max 16 - -16
    one 68 - -68
    zero 128 28 -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     
  • migrate_page_move_mapping() doesn't use the mode argument. Remove it
    and update callers accordingly.

    Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
    Signed-off-by: Keith Busch
    Reviewed-by: Zi Yan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keith Busch
     
  • David points out that there is a mixture of 'int' and 'unsigned long'
    usage for section number data types. Update the memory hotplug path to
    use 'unsigned long' consistently for section numbers.

    [akpm@linux-foundation.org: fix printk format]
    Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: David Hildenbrand
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
    memory, we no longer need to add padding at pfn/dax device creation
    time. The kernel will still honor padding established by older kernels.

    Link: http://lkml.kernel.org/r/156092356588.979959.6793371748950931916.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Jeff Moyer
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Logan Gunthorpe
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • At namespace creation time there is the potential for the "expected to
    be zero" fields of a 'pfn' info-block to be filled with indeterminate
    data. While the kernel buffer is zeroed on allocation it is immediately
    overwritten by nd_pfn_validate() filling it with the current contents of
    the on-media info-block location. For fields like 'flags' and
    'padding', it potentially means that future implementations cannot rely on
    those fields being zero.

    In preparation to stop using the 'start_pad' and 'end_trunc' fields for
    section alignment, arrange for fields that are not explicitly
    initialized to be guaranteed zero. Bump the minor version to indicate
    it is safe to assume the 'padding' and 'flags' are zero. Otherwise,
    this corruption is expected to be benign since all other critical fields
    are explicitly initialized.

    Note: The cc: stable is about spreading this new policy to as many
    kernels as possible, not fixing an issue in those kernels. It is not
    until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
    section alignment" where this improper initialization becomes a problem.
    So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
    namespaces to section alignment" (which is not tagged for stable), make
    sure this pre-requisite is flagged.

    Link: http://lkml.kernel.org/r/156092356065.979959.6681003754765958296.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
    Signed-off-by: Dan Williams
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc:
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Logan Gunthorpe
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Teach devm_memremap_pages() about the new sub-section capabilities of
    arch_{add,remove}_memory(). Effectively, just replace all usage of
    align_start, align_end, and align_size with res->start, res->end, and
    resource_size(res). The existing sanity check will still make sure that
    the two separate remap attempts do not collide within a sub-section (2MB
    on x86).

    Link: http://lkml.kernel.org/r/156092355542.979959.10060071713397030576.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Toshi Kani
    Cc: Jérôme Glisse
    Cc: Logan Gunthorpe
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Explain the general mechanisms of 'ZONE_DEVICE' pages and list the users
    of 'devm_memremap_pages()'.

    [dan.j.williams@intel.com: update ZONE_DEVICE memory model documentation]
    Link: http://lkml.kernel.org/r/156109575458.1409767.1885676287099277666.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: http://lkml.kernel.org/r/156092354985.979959.15763234410543451710.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Mike Rapoport
    Reviewed-by: Mike Rapoport
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Jonathan Corbet
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Logan Gunthorpe
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The libnvdimm sub-system has suffered a series of hacks and broken
    workarounds for the memory-hotplug implementation's awkward
    section-aligned (128MB) granularity.

    For example the following backtrace is emitted when attempting
    arch_add_memory() with physical address ranges that intersect 'System
    RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
    2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
    devm_memremap_pages+0x460/0x6e0
    pmem_attach_disk+0x29e/0x680 [nd_pmem]
    ? nd_dax_probe+0xfc/0x120 [libnvdimm]
    nvdimm_bus_probe+0x66/0x160 [libnvdimm]

    It was discovered that the problem goes beyond RAM vs PMEM collisions as
    some platforms produce PMEM vs PMEM collisions within a given section.
    The libnvdimm workaround for that case revealed that the libnvdimm
    section-alignment-padding implementation has been broken for a long
    while.

    A fix for that long-standing breakage introduces as many problems as it
    solves as it would require a backward-incompatible change to the
    namespace metadata interpretation. Instead of that dubious route [1],
    address the root problem in the memory-hotplug implementation.

    Note that EEXIST is no longer treated as success, as that is how
    sparse_add_section() reports subsection collisions. It was also obviated
    by recent changes to perform the request_region() for 'System RAM'
    before arch_add_memory() in the add_memory() sequence.

    [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

    [osalvador@suse.de: fix deactivate_section for early sections]
    Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare the memory hot-{add,remove} paths for handling sub-section
    ranges by plumbing the starting page frame and number of pages being
    handled through arch_{add,remove}_memory() to
    sparse_{add,remove}_one_section().

    This is simply plumbing, small cleanups, and some identifier renames.
    No intended functional changes.

    Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Given there are no more usages of is_dev_zone() outside of 'ifdef
    CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.

    Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Wei Yang
    Acked-by: David Hildenbrand
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The zone type check was a leftover from the cleanup that plumbed altmap
    through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
    vmem_altmap to arch_remove_memory and __remove_pages".

    Link: http://lkml.kernel.org/r/156092352642.979959.6664333788149363039.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Allow sub-section sized ranges to be added to the memmap.

    populate_section_memmap() takes an explicit pfn range rather than
    assuming a full section, and those parameters are plumbed all the way
    through to vmemmap_populate(). There should be no sub-section usage in
    current deployments. New warnings are added to clarify which memmap
    allocation paths are sub-section capable.
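
    As an illustration of what an explicit pfn range means at sub-section
    granularity, the sketch below computes which sub-sections of a section
    a range touches. It assumes 4KB pages together with the 128MB section
    and 2MB sub-section sizes quoted for x86 elsewhere in this log; the
    macro and function names are hypothetical and this is not the kernel's
    own helper.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT            12   /* 4KB pages (assumed) */
#define SK_SECTION_SHIFT         27   /* 128MB sections */
#define SK_SUBSECTION_SHIFT      21   /* 2MB sub-sections */
#define SK_PAGES_PER_SECTION     (1UL << (SK_SECTION_SHIFT - SK_PAGE_SHIFT))
#define SK_PAGES_PER_SUBSECTION  (1UL << (SK_SUBSECTION_SHIFT - SK_PAGE_SHIFT))

/*
 * Return a 64-bit mask with one bit per sub-section touched by the range
 * [pfn, pfn + nr_pages). The range must not cross a section boundary.
 */
uint64_t subsection_mask(unsigned long pfn, unsigned long nr_pages)
{
    unsigned long offset = pfn % SK_PAGES_PER_SECTION;

    assert(nr_pages > 0);
    assert(offset + nr_pages <= SK_PAGES_PER_SECTION);

    unsigned long first = offset / SK_PAGES_PER_SUBSECTION;
    unsigned long last = (offset + nr_pages - 1) / SK_PAGES_PER_SUBSECTION;
    unsigned long count = last - first + 1;

    /* Shifting a 64-bit value by 64 is undefined, so handle a full section. */
    uint64_t bits = (count == 64) ? ~0ULL : ((1ULL << count) - 1);

    return bits << first;
}
```

    With these sizes a section holds 64 sub-sections, so one 64-bit word is
    enough to record which parts of a section are populated.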

    Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Logan Gunthorpe
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams