13 Mar, 2020

1 commit

  • Pull networking fixes from David Miller:
    "It looks like a decent sized set of fixes, but a lot of these are one
    liner off-by-one and similar type changes:

    1) Fix netlink header pointer used to calculate the bad attribute offset
    reported to user. From Pablo Neira Ayuso.

    2) Don't double clear PHY interrupts when ->did_interrupt is set,
    from Heiner Kallweit.

    3) Add missing validation of various (devlink, nl802154, fib, etc.)
    attributes, from Jakub Kicinski.

    4) Missing *pos increments in various netfilter seq_next ops, from
    Vasily Averin.

    5) Missing break in of_mdiobus_register() loop, from Dajun Jin.

    6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.

    7) Work around FMAN erratum A050385, from Madalin Bucur.

    8) Make sure ARP header is pulled early enough in bonding driver,
    from Eric Dumazet.

    9) Do a cond_resched() during multicast processing of ipvlan and
    macvlan, from Mahesh Bandewar.

    10) Don't attach cgroups to unrelated sockets when in interrupt
    context, from Shakeel Butt.

    11) Fix tpacket ring state management when encountering unknown GSO
    types. From Willem de Bruijn.

    12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
    only in the suspend context. From Heiner Kallweit"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
    net: systemport: fix index check to avoid an array out of bounds access
    tc-testing: add ETS scheduler to tdc build configuration
    net: phy: fix MDIO bus PM PHY resuming
    net: hns3: clear port base VLAN when unload PF
    net: hns3: fix RMW issue for VLAN filter switch
    net: hns3: fix VF VLAN table entries inconsistent issue
    net: hns3: fix "tc qdisc del" failed issue
    taprio: Fix sending packets without dequeueing them
    net: mvmdio: avoid error message for optional IRQ
    net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
    net: memcg: fix lockdep splat in inet_csk_accept()
    s390/qeth: implement smarter resizing of the RX buffer pool
    s390/qeth: refactor buffer pool code
    s390/qeth: use page pointers to manage RX buffer pool
    seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
    net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
    net/packet: tpacket_rcv: do not increment ring index on drop
    sxgbe: Fix off by one in samsung driver strncpy size arg
    net: caif: Add lockdep expression to RCU traversal primitive
    MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
    ...

    Linus Torvalds
     

11 Mar, 2020

2 commits

  • If a TCP socket is allocated in IRQ context, or cloned from an unassociated
    (i.e. not associated to a memcg) socket in IRQ context, then it will remain
    unassociated for its whole life. Almost half of the TCP sockets created on
    the system are created in IRQ context, so memory used by such sockets will
    not be accounted by the memcg.

    This issue is more widespread in cgroup v1 where network memory
    accounting is opt-in but it can happen in cgroup v2 if the source socket
    for the cloning was created in root memcg.

    To fix the issue, just do the association of the sockets at the accept()
    time in the process context and then force charge the memory buffer
    already used and reserved by the socket.
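    A minimal sketch of the idea, assuming the association is performed in
    inet_csk_accept() once the new socket is handed back to process context
    (exact placement and locals are illustrative, not the literal patch):

        if (mem_cgroup_sockets_enabled && !newsk->sk_memcg) {
                int amt;

                /* safe here: we run in the acceptor's process context */
                mem_cgroup_sk_alloc(newsk);
                if (newsk->sk_memcg) {
                        /* force-charge what the socket already holds */
                        amt = sk_mem_pages(newsk->sk_forward_alloc +
                                           atomic_read(&newsk->sk_rmem_alloc));
                        if (amt)
                                mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
                }
        }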

    Signed-off-by: Shakeel Butt
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shakeel Butt
     
  • We are testing network memory accounting in our setup and noticed
    inconsistent network memory usage, and often an unrelated cgroup's network
    usage correlates with the testing workload. On further inspection, it
    seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq
    context, especially for cgroup v1.

    mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
    and kind of assume that this can only happen from sk_clone_lock() and that
    the source sock object already has an associated cgroup. However, in
    cgroup v1, where network memory accounting is opt-in, the source sock can
    be unassociated with any cgroup, and the new cloned sock can get
    associated with the unrelated cgroup of the interrupted process.

    Cgroup v2 can also suffer if the source sock object was created by a
    process in the root cgroup or if sk_alloc() is called in irq context.
    The fix is to simply do nothing in interrupt context.
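    A sketch of that bail-out, assuming it sits at the top of both
    mem_cgroup_sk_alloc() and cgroup_sk_alloc():

        void mem_cgroup_sk_alloc(struct sock *sk)
        {
                /* don't associate the socket with whatever cgroup happens to
                 * be current when called from IRQ/softirq context */
                if (in_interrupt())
                        return;

                /* ... existing association logic follows ... */
        }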

    WARNING: Please note that about half of the TCP sockets are allocated
    from the IRQ context, so, memory used by such sockets will not be
    accounted by the memcg.

    The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

    CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
    Hardware name: ...
    Call Trace:

    dump_stack+0x57/0x75
    mem_cgroup_sk_alloc+0xe9/0xf0
    sk_clone_lock+0x2a7/0x420
    inet_csk_clone_lock+0x1b/0x110
    tcp_create_openreq_child+0x23/0x3b0
    tcp_v6_syn_recv_sock+0x88/0x730
    tcp_check_req+0x429/0x560
    tcp_v6_rcv+0x72d/0xa40
    ip6_protocol_deliver_rcu+0xc9/0x400
    ip6_input+0x44/0xd0
    ? ip6_protocol_deliver_rcu+0x400/0x400
    ip6_rcv_finish+0x71/0x80
    ipv6_rcv+0x5b/0xe0
    ? ip6_sublist_rcv+0x2e0/0x2e0
    process_backlog+0x108/0x1e0
    net_rx_action+0x26b/0x460
    __do_softirq+0x104/0x2a6
    do_softirq_own_stack+0x2a/0x40

    do_softirq.part.19+0x40/0x50
    __local_bh_enable_ip+0x51/0x60
    ip6_finish_output2+0x23d/0x520
    ? ip6table_mangle_hook+0x55/0x160
    __ip6_finish_output+0xa1/0x100
    ip6_finish_output+0x30/0xd0
    ip6_output+0x73/0x120
    ? __ip6_finish_output+0x100/0x100
    ip6_xmit+0x2e3/0x600
    ? ipv6_anycast_cleanup+0x50/0x50
    ? inet6_csk_route_socket+0x136/0x1e0
    ? skb_free_head+0x1e/0x30
    inet6_csk_xmit+0x95/0xf0
    __tcp_transmit_skb+0x5b4/0xb20
    __tcp_send_ack.part.60+0xa3/0x110
    tcp_send_ack+0x1d/0x20
    tcp_rcv_state_process+0xe64/0xe80
    ? tcp_v6_connect+0x5d1/0x5f0
    tcp_v6_do_rcv+0x1b1/0x3f0
    ? tcp_v6_do_rcv+0x1b1/0x3f0
    __release_sock+0x7f/0xd0
    release_sock+0x30/0xa0
    __inet_stream_connect+0x1c3/0x3b0
    ? prepare_to_wait+0xb0/0xb0
    inet_stream_connect+0x3b/0x60
    __sys_connect+0x101/0x120
    ? __sys_getsockopt+0x11b/0x140
    __x64_sys_connect+0x1a/0x20
    do_syscall_64+0x51/0x200
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
    Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Signed-off-by: David S. Miller

    Shakeel Butt
     

06 Mar, 2020

5 commits

  • Commit cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
    fixed memory hotplug with debug_pagealloc enabled, where onlining a page
    goes through page freeing, which removes the direct mapping. Some arches
    don't like when the page is not mapped in the first place, so
    generic_online_page() maps it first. This is somewhat wasteful, but
    better than special casing page freeing fast paths.

    The commit however missed that having DEBUG_PAGEALLOC configured doesn't
    mean it's actually enabled. One has to test debug_pagealloc_enabled() since
    031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime
    configurable"), or alternatively debug_pagealloc_enabled_static() since
    8e57f8acbbd1 ("mm, debug_pagealloc: don't rely on static keys too early"),
    but this is not done.

    As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
    will crash:

    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 0000000000000000 TEID: 0000000000000483
    Fault in home space mode while using kernel ASCE.
    AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
    Oops: 0004 ilc:2 [#1] SMP
    CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
    Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
    0000000000000001 0000000000000000 0000000000000002 0000000000000100
    0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
    000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
    Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
    0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
    >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
    0000001ecd281ba2: 41505008 la %r5,8(%r5)
    0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
    0000001ecd281bac: 1a07 ar %r0,%r7
    0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
    Call Trace:
    [] __kernel_map_pages+0x166/0x188
    [] online_pages_range+0xf6/0x128
    [] walk_system_ram_range+0x7e/0xd8
    [] online_pages+0x2fe/0x3f0
    [] memory_subsys_online+0x8e/0xc0
    [] device_online+0x5a/0xc8
    [] state_store+0x88/0x118
    [] kernfs_fop_write+0xc2/0x200
    [] vfs_write+0x176/0x1e0
    [] ksys_write+0xa2/0x100
    [] system_call+0xd8/0x2c8

    Fix this by checking debug_pagealloc_enabled_static() before calling
    kernel_map_pages(). Backports for kernels before 5.5 should use
    debug_pagealloc_enabled() instead. Also add comments.
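    A sketch of the resulting check in the page-onlining path (call site
    assumed to be generic_online_page()):

        /* only re-map the page if debug_pagealloc actually unmapped it */
        if (debug_pagealloc_enabled_static())
                kernel_map_pages(page, 1 << order, 1);
        __free_pages_core(page, order);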

    Fixes: cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
    Reported-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Vlastimil Babka
    Reviewed-by: David Hildenbrand
    Cc:
    Cc: Joonsoo Kim
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • rwlock.h should not be included directly; linux/spinlock.h should be
    included instead. Including rwlock.h directly breaks the RT build, among
    other things.

    Signed-off-by: Andrew Morton
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Vitaly Wool
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200224133631.1510569-1-bigeasy@linutronix.de
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Jeff Moyer has reported that one of xfstests triggers a warning when run
    on DAX-enabled filesystem:

    WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
    ...
    wp_page_copy+0x98c/0xd50 (unreliable)
    do_wp_page+0xd8/0xad0
    __handle_mm_fault+0x748/0x1b90
    handle_mm_fault+0x120/0x1f0
    __do_page_fault+0x240/0xd70
    do_page_fault+0x38/0xd0
    handle_page_fault+0x10/0x30

    The warning happens on failed __copy_from_user_inatomic() which tries to
    copy data into a CoW page.

    This happens because of a race between MADV_DONTNEED and a CoW page fault:

    CPU0                                    CPU1
    handle_mm_fault()
      do_wp_page()
        wp_page_copy()
          do_wp_page()
                                            madvise(MADV_DONTNEED)
                                              zap_page_range()
                                                zap_pte_range()
                                                  ptep_get_and_clear_full()

          __copy_from_user_inatomic()
          sees empty PTE and fails
          WARN_ON_ONCE(1)
          clear_page()

    The solution is to re-try __copy_from_user_inatomic() under the PTL after
    checking that the PTE matches orig_pte.

    The second copy attempt can still fail, for example due to a non-readable
    PTE, but there's nothing reasonable we can do about it except clearing the
    CoW page.
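    A minimal sketch of the retry, assuming it sits in the helper that copies
    the old page contents for wp_page_copy() (locals named for illustration):

        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
                /* first copy failed: re-validate the PTE under its lock */
                vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
                if (!pte_same(*vmf->pte, vmf->orig_pte)) {
                        /* PTE changed under us (e.g. MADV_DONTNEED): give up */
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                        return false;
                }
                if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                        /* still unreadable: clearing the CoW page is all
                         * that's left to do */
                        clear_page(kaddr);
                pte_unmap_unlock(vmf->pte, vmf->ptl);
        }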

    Reported-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Jeff Moyer
    Cc:
    Cc: Justin He
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • In set_pmd_migration_entry(), pmdp_invalidate() is used to change PMD
    atomically. But the PMD is read before that with an ordinary memory
    reading. If the THP (transparent huge page) is written between the PMD
    reading and pmdp_invalidate(), the PMD dirty bit may be lost, causing data
    corruption. The race window is quite small, but still possible in theory,
    so it needs to be fixed.

    The race is fixed via using the return value of pmdp_invalidate() to get
    the original content of PMD, which is a read/modify/write atomic
    operation. So no THP writing can occur in between.
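    A sketch of the change in set_pmd_migration_entry() (surrounding names
    follow mainline; treat the details as illustrative):

        /* one atomic read/modify/write: no THP write can slip in between */
        pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
        if (pmd_dirty(pmdval))
                set_page_dirty(page);
        /* ... the migration entry is then built from pmdval ... */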

    The race was introduced when THP migration support was added in commit
    616b8371539a ("mm: thp: enable thp migration in generic path"). But this
    fix depends on commit d52605d7cb30 ("mm: do not lose dirty and accessed
    bits in pmdp_invalidate()"), so it can easily be backported to kernels
    after v4.16. The race window is really small, though, so it may be fine
    not to backport the fix at all.

    Signed-off-by: Andrew Morton
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Zi Yan
    Reviewed-by: William Kucharski
    Acked-by: Kirill A. Shutemov
    Cc:
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/20200220075220.2327056-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • : A user reported a bug against a distribution kernel while running a
    : proprietary workload described as "memory intensive that is not swapping"
    : that is expected to apply to mainline kernels. The workload is
    : read/write/modifying ranges of memory and checking the contents. They
    : reported that within a few hours that a bad PMD would be reported followed
    : by a memory corruption where expected data was all zeros. A partial
    : report of the bad PMD looked like
    :
    : [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
    : [ 5195.341184] ------------[ cut here ]------------
    : [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
    : ....
    : [ 5195.410033] Call Trace:
    : [ 5195.410471] [] change_protection_range+0x7dd/0x930
    : [ 5195.410716] [] change_prot_numa+0x18/0x30
    : [ 5195.410918] [] task_numa_work+0x1fe/0x310
    : [ 5195.411200] [] task_work_run+0x72/0x90
    : [ 5195.411246] [] exit_to_usermode_loop+0x91/0xc2
    : [ 5195.411494] [] prepare_exit_to_usermode+0x31/0x40
    : [ 5195.411739] [] retint_user+0x8/0x10
    :
    : Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
    : was a false detection. The bug does not trigger if automatic NUMA
    : balancing or transparent huge pages is disabled.
    :
    : The bug is due to a race in change_pmd_range between a pmd_trans_huge and
    : pmd_none_or_clear_bad check without any locks held. During the
    : pmd_trans_huge check, a parallel protection update under lock can have
    : cleared the PMD and filled it with a prot_numa entry between the transhuge
    : check and the pmd_none_or_clear_bad check.
    :
    : While this could be fixed with heavy locking, it's only necessary to make
    : a copy of the PMD on the stack during change_pmd_range and avoid races. A
    : new helper is created for this as the check is quite subtle and the
    : existing similar helper is not suitable. This passed 154 hours of
    : testing (usually triggers between 20 minutes and 24 hours) without
    : detecting bad PMDs or corruption. A basic test of an autonuma-intensive
    : workload showed no significant change in behaviour.

    Although Mel withdrew the patch in the face of the LKML comment
    https://lkml.org/lkml/2017/4/10/922, the aforementioned race window is
    still open, and we have reports of the Linpack test reporting bad residuals
    after the bad PMD warning is observed. In addition to that, bad
    rss-counter and non-zero pgtables assertions are triggered on mm teardown
    for the task hitting the bad PMD.

    host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
    ....
    host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
    host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

    The issue is observed on a v4.18-based distribution kernel, but the race
    window is expected to be applicable to mainline kernels, as well.
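    A sketch of the stack-copy check (helper name as in the upstream fix;
    treat the body as illustrative):

        /* check a stable local snapshot of the pmd, so a parallel protection
         * update cannot change it between the checks */
        static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
        {
                pmd_t pmdval = pmd_read_atomic(pmd);

                if (pmd_none(pmdval))
                        return 1;
                if (pmd_trans_huge(pmdval))
                        return 0;       /* caller handles huge/prot_numa entries */
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }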

    [akpm@linux-foundation.org: fix comment typo, per Rafael]
    Signed-off-by: Andrew Morton
    Signed-off-by: Rafael Aquini
    Signed-off-by: Mel Gorman
    Cc:
    Cc: Zi Yan
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Feb, 2020

1 commit


22 Feb, 2020

5 commits

  • Pull arm64 fixes from Will Deacon:
    "It's all straightforward apart from the changes to mmap()/mremap() in
    relation to their handling of address arguments from userspace with
    non-zero tag bits in the upper byte.

    The change to brk() is necessary to fix a nasty user-visible
    regression in malloc(), but we tightened up mmap() and mremap() at the
    same time because they also allow the user to create virtual aliases
    by accident. It's much less likely than brk() to matter in practice,
    but enforcing the principle of "don't permit the creation of mappings
    using tagged addresses" leads to a straightforward ABI without having
    to worry about the "but what if a crazy program did foo?" aspect of
    things.

    Summary:

    - Fix regression in malloc() caused by ignored address tags in brk()

    - Add missing brackets around argument to untagged_addr() macro

    - Fix clang build when using binutils assembler

    - Fix silly typo in virtual memory map documentation"

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()
    docs: arm64: fix trivial spelling enought to enough in memory.rst
    arm64: memory: Add missing brackets to untagged_addr() macro
    arm64: lse: Fix LSE atomics with LLVM

    Linus Torvalds
     
  • When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
    doesn't work before sparse_init_one_section() is called.

    This leads to a crash when hotplug memory:

    BUG: unable to handle page fault for address: 0000000006400000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:__memset+0x24/0x30
    Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
    RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
    RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
    RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
    R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
    FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    sparse_add_section+0x1c9/0x26a
    __add_pages+0xbf/0x150
    add_pages+0x12/0x60
    add_memory_resource+0xc8/0x210
    __add_memory+0x62/0xb0
    acpi_memory_device_add+0x13f/0x300
    acpi_bus_attach+0xf6/0x200
    acpi_bus_scan+0x43/0x90
    acpi_device_hotplug+0x275/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    We should use the memmap pointer directly, as the code did before.

    On x86 the impact is limited to x86_32 builds, or x86_64 configurations
    that override the default setting for SPARSEMEM_VMEMMAP.

    Other memory hotplug archs (arm64, ia64, and ppc) also default to
    SPARSEMEM_VMEMMAP=y.
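    A sketch of the fix, poisoning through the memmap pointer the hotplug path
    already has rather than via pfn_to_page() (call site assumed to be
    sparse_add_section()):

        /* pfn_to_page() is not valid before sparse_init_one_section() when
         * plain SPARSEMEM is in use, so poison via memmap directly */
        page_init_poison(memmap, sizeof(struct page) * nr_pages);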

    [dan.j.williams@intel.com: changelog update]
    [rppt@linux.ibm.com: changelog update]
    Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Baoquan He
    Acked-by: David Hildenbrand
    Reviewed-by: Baoquan He
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Commit 68600f623d69 ("mm: don't miss the last page because of round-off
    error") makes the scan size round up to @denominator regardless of the
    memory cgroup's state, online or offline. This affects the overall
    reclaiming behavior: the corresponding LRU list is eligible for
    reclaiming only when its size logically right shifted by @sc->priority
    is bigger than zero in the former formula.

    For example, the inactive anonymous LRU list should have at least 0x4000
    pages to be eligible for reclaiming when we have 60/12 for
    swappiness/priority and without taking scan/rotation ratio into account.

    After the roundup is applied, the inactive anonymous LRU list becomes
    eligible for reclaiming when its size is bigger than or equal to 0x1000
    in the same condition.

    (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
    ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1

    aarch64 has 512MB huge page size when the base page size is 64KB. The
    memory cgroup that has a huge page is always eligible for reclaiming in
    that case.

    The reclaiming is likely to stop after the huge page is reclaimed,
    meaning the further iteration on @sc->priority and the sibling and child
    memory cgroups will be skipped. The overall behaviour has been changed.
    This fixes the issue by applying the roundup to offlined memory cgroups
    only, giving more preference to reclaiming memory from offlined memory
    cgroups. That sounds reasonable, as their memory is unlikely to be used
    by anyone.
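    A sketch of the resulting scan-size computation (assumed to live in
    get_scan_count() in mm/vmscan.c; variable names follow mainline):

        /* round up only for offline memcgs, so their last partial chunk is
         * still reclaimed; online memcgs keep the pre-roundup behaviour */
        scan = mem_cgroup_online(memcg) ?
               div64_u64(scan * fraction[file], denominator) :
               DIV64_U64_ROUND_UP(scan * fraction[file], denominator);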

    The issue was found by starting up 8 VMs on an Ampere Mustang machine,
    which has 8 CPUs and 16GB of memory. Each VM is given 2 vCPUs and
    2GB of memory. It took 264 seconds for all VMs to be completely up, and
    784MB of swap was consumed after that. With this patch applied, it took
    236 seconds and 60MB of swap to do the same thing. So there is a 10%
    performance improvement in my case. Note that KSM is disabled while THP
    is enabled in the testing.

                   total        used        free      shared  buff/cache   available
    Mem:           16196       10065        2049          16        4081        3749
    Swap:           8175         784        7391

                   total        used        free      shared  buff/cache   available
    Mem:           16196       11324        3656          24        1215        2936
    Swap:           8175          60        8115

    Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
    Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
    Signed-off-by: Gavin Shan
    Acked-by: Roman Gushchin
    Cc: [4.20+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • for_each_mem_cgroup() increases the css reference counter of each memory
    cgroup it visits and requires mem_cgroup_iter_break() to be used if the
    walk is cancelled.
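    A hedged example of the required pattern (the cancellation condition is
    illustrative):

        for_each_mem_cgroup(memcg) {
                if (cancel_condition) {
                        /* drop the css reference held by the iterator */
                        mem_cgroup_iter_break(NULL, memcg);
                        break;
                }
                /* ... per-memcg work ... */
        }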

    Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
    Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
    Signed-off-by: Vasily Averin
    Acked-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • claim_swapfile now always takes i_rwsem.

    Link: http://lkml.kernel.org/r/20200114161225.309792-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

20 Feb, 2020

1 commit

  • Currently the arm64 kernel ignores the top address byte passed to brk(),
    mmap() and mremap(). When the user is not aware of the 56-bit address
    limit or relies on the kernel to return an error, untagging such
    pointers has the potential to create address aliases in user-space.
    Passing a tagged address to munmap() or madvise() is still permitted since
    the tagged pointer is expected to be inside an existing mapping.

    The current behaviour breaks the existing glibc malloc() implementation,
    which relies on a brk() call with an address beyond the 56-bit limit being
    rejected by the kernel.

    Remove untagging in the above functions by partially reverting commit
    ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
    addition, update the arm64 tagged-address-abi.rst document accordingly.
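    As a sketch, the brk() side of the partial revert amounts to dropping the
    untagging line that ce18d171cb73 added (mmap() and mremap() get the same
    treatment); the file placement below is the mainline sys_brk() location
    and is otherwise illustrative:

        @@ mm/mmap.c: SYSCALL_DEFINE1(brk, unsigned long, brk)
        -	brk = untagged_addr(brk);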

    Link: https://bugzilla.redhat.com/1797052
    Fixes: ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk")
    Cc: # 5.4.x-
    Cc: Florian Weimer
    Reviewed-by: Andrew Morton
    Reported-by: Victor Stinner
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon

    Catalin Marinas
     

19 Feb, 2020

1 commit


09 Feb, 2020

2 commits

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:

    - bmap series from cmaiolino

    - getting rid of convolutions in copy_mount_options() (use a couple of
    copy_from_user() instead of the __get_user() crap)

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    saner copy_mount_options()
    fibmap: Reject negative block numbers
    fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
    ecryptfs: drop direct calls to ->bmap
    cachefiles: drop direct usage of ->bmap method.
    fs: Enable bmap() function to properly return errors

    Linus Torvalds
     

08 Feb, 2020

3 commits


07 Feb, 2020

2 commits


04 Feb, 2020

17 commits

  • Merge more updates from Andrew Morton:
    "The rest of MM and the rest of everything else: hotfixes, ipc, misc,
    procfs, lib, cleanups, arm"

    * emailed patches from Andrew Morton : (67 commits)
    ARM: dma-api: fix max_pfn off-by-one error in __dma_supported()
    treewide: remove redundant IS_ERR() before error code check
    include/linux/cpumask.h: don't calculate length of the input string
    lib: new testcases for bitmap_parse{_user}
    lib: rework bitmap_parse()
    lib: make bitmap_parse_user a wrapper on bitmap_parse
    lib: add test for bitmap_parse()
    bitops: more BITS_TO_* macros
    lib/string: add strnchrnul()
    proc: convert everything to "struct proc_ops"
    proc: decouple proc from VFS with "struct proc_ops"
    asm-generic/tlb: provide MMU_GATHER_TABLE_FREE
    asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER
    asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE
    asm-generic/tlb: rename HAVE_RCU_TABLE_FREE
    asm-generic/tlb: add missing CONFIG symbol
    asm-gemeric/tlb: remove stray function declarations
    asm-generic/tlb: avoid potential double flush
    mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
    powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
    ...

    Linus Torvalds
     
  • Pull drm ttm/mm updates from Dave Airlie:
    "Thomas Hellstrom has some more changes to the TTM layer that needed a
    patch to the mm subsystem.

    This adds a new mm API vmf_insert_mixed_prot to avoid an ugly hack
    that has limitations in the TTM layer"

    * tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm:
    mm, drm/ttm: Fix vm page protection handling
    mm: Add a vmf_insert_mixed_prot() function

    Linus Torvalds
     
  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • As described in the comment, the correct order for freeing pages is:

    1) unhook page
    2) TLB invalidate page
    3) free page

    This order equally applies to page directories.

    Currently there are two correct options:

    - use tlb_remove_page(), when all page directories are full pages and
    there are no further constraints placed by things like software
    walkers (HAVE_FAST_GUP).

    - use MMU_GATHER_RCU_TABLE_FREE and tlb_remove_table() when the
    architecture does not do IPI based TLB invalidate and has
    HAVE_FAST_GUP (or software TLB fill).

    This however leaves architectures that don't have page based directories
    but don't need RCU in a bind. For those, provide MMU_GATHER_TABLE_FREE,
    which provides the independent batching for directories without the
    additional RCU freeing.

    Link: http://lkml.kernel.org/r/20200116064531.483522-10-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    Link: http://lkml.kernel.org/r/20200116064531.483522-9-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    Link: http://lkml.kernel.org/r/20200116064531.483522-8-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    [akpm@linux-foundation.org: fix sparc64 Kconfig]
    Link: http://lkml.kernel.org/r/20200116064531.483522-7-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Architectures for which we have hardware walkers of the Linux page table
    should flush the TLB on mmu gather batch allocation failures and batch
    flush. Some architectures, like POWER, support multiple translation modes
    (hash and radix), and in the case of POWER only the radix translation mode
    needs the above TLBI. This is because for hash translation mode the kernel
    wants to avoid this extra flush, since there are no hardware walkers of
    the Linux page table. With radix translation, the hardware also walks the
    Linux page table, and with that, the kernel needs to make sure to TLB
    invalidate the page walk cache before page table pages are freed.

    More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
    TLB caches for RCU_TABLE_FREE")

    The changes to sparc are to make sure we keep the old behavior since we
    are now removing HAVE_RCU_TABLE_NO_INVALIDATE. The default value for
    tlb_needs_table_invalidate is to always force an invalidate and sparc can
    avoid the table invalidate. Hence we define tlb_needs_table_invalidate to
    false for sparc architecture.
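    A sketch of the resulting knob: an asm-generic default that always forces
    the invalidate, plus the sparc override described above (placement of the
    defines is illustrative):

        /* asm-generic default: always invalidate the page walk cache */
        #ifndef tlb_needs_table_invalidate
        #define tlb_needs_table_invalidate() (true)
        #endif

        /* sparc: no hardware walkers of the Linux page tables, so the extra
         * invalidate can be skipped */
        #define tlb_needs_table_invalidate() (false)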

    Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
    Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Michael Ellerman [powerpc]
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • struct mm_struct is quite large (~1664 bytes) and so allocating on the
    stack may cause problems as the kernel stack size is small.

    Since ptdump_walk_pgd_level_core() was only allocating the structure so
    that it could modify the pgd argument we can instead introduce a pgd
    override in struct mm_walk and pass this down the call stack to where it
    is needed.

    Since the correct mm_struct is now being passed down, it is now also
    unnecessary to take the mmap_sem semaphore because ptdump_walk_pgd() will
    now take the semaphore on the real mm.

    [steven.price@arm.com: restore missed arm64 changes]
    Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
    Signed-off-by: Steven Price
    Reported-by: Stephen Rothwell
    Cc: Catalin Marinas
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Rather than having to increment the 'depth' number by 1 in ptdump_hole(),
    let's change the meaning of 'level' in note_page() since that makes the
    code simpler.

    Note that for x86, the level numbers were previously increased by 1 in
    commit 45dcd2091363 ("x86/mm/dump_pagetables: Fix printout of p4d level")
    and the comment "Bit 7 has a different meaning" was not updated, so this
    change also makes the code match the comment again.

    Link: http://lkml.kernel.org/r/20191218162402.45610-24-steven.price@arm.com
    Signed-off-by: Steven Price
    Reviewed-by: Catalin Marinas
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Add a generic version of page table dumping that architectures can opt-in
    to.

    Link: http://lkml.kernel.org/r/20191218162402.45610-20-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • The pte_hole() callback is called at multiple levels of the page tables.
    Code dumping the kernel page tables needs to know at what depth the
    missing entry is. Add this as an extra parameter to pte_hole(). When the
    depth isn't known (e.g. when processing a vma) then -1 is passed.

    The depth that is reported is the actual level where the entry is missing
    (ignoring any folding that is in place), i.e. any levels where
    PTRS_PER_P?D is set to 1 are ignored.

    Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
    natural numbers as levels 2/3/4.
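    The callback in struct mm_walk_ops then takes the depth as a new
    parameter:

        int (*pte_hole)(unsigned long addr, unsigned long next,
                        int depth, struct mm_walk *walk);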

    Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
    Signed-off-by: Steven Price
    Tested-by: Zong Li
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • If walk_pte_range() is called with a 'end' argument that is beyond the
    last page of memory (e.g. ~0UL) then the comparison between 'addr' and
    'end' will always fail and the loop will be infinite. Instead change the
    comparison to >= while accounting for overflow.
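    A sketch of the reworked loop in walk_pte_range() (details illustrative):

        for (;;) {
                err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
                if (err)
                        break;
                /* compare against end - PAGE_SIZE so an 'end' at the very top
                 * of the address space (e.g. ~0UL) cannot wrap and loop forever */
                if (addr >= end - PAGE_SIZE)
                        break;
                addr += PAGE_SIZE;
                pte++;
        }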

    Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • walk_page_range_novma() can be used to walk the page tables of the kernel
    or of firmware.
    by a struct page and so it isn't (in general) possible to take the PTE
    lock for the pte_entry() callback. So update walk_pte_range() to only
    take the lock when no_vma==false by splitting out the inner loop to a
    separate function and add a comment explaining the difference to
    walk_page_range_novma().
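    A sketch of the resulting split (an inner helper doing the per-PTE loop,
    with the lock taken only when a vma is present; helper name as merged
    upstream, details illustrative):

        if (walk->no_vma) {
                pte = pte_offset_map(pmd, addr);
                err = walk_pte_range_inner(pte, addr, end, walk);
                pte_unmap(pte);
        } else {
                pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
                err = walk_pte_range_inner(pte, addr, end, walk);
                pte_unmap_unlock(pte, ptl);
        }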

    Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range for
    vma(VM_PFNMAP)", page_table_walk() will report any kernel area as a hole,
    because it lacks a vma.

    This means each arch has re-implemented page table walking when needed,
    for example in the per-arch ptdump walker.

    Remove the requirement to have a vma in the generic code and add a new
    function walk_page_range_novma() which ignores the VMAs and simply walks
    the page tables.
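    After the series (including the later pgd-override change), the new entry
    point ends up with roughly this prototype; treat the exact argument list
    as an assumption based on the description here:

        int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
                                  unsigned long end,
                                  const struct mm_walk_ops *ops,
                                  pgd_t *pgd, void *private);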

    Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
    ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
    users. We're about to add users so reintroduce them, along with
    p4d_entry() as we now have 5 levels of tables.

    Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized
    transparent hugepages") already re-added pud_entry() but with different
    semantics to the other callbacks. This commit reverts the semantics back
    to match the other callbacks.

    To support hmm.c which now uses the new semantics of pud_entry() a new
    member ('action') of struct mm_walk is added which allows the callbacks to
    either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
    repeat the callback (ACTION_AGAIN). hmm.c is then updated to call
    pud_trans_huge_lock() itself and make use of the splitting/retry logic of
    the core code.

    After this change pud_entry() is called for all entries, not just
    transparent huge pages.
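    A hedged illustration of the new control flow (the callback name is made
    up; the enum values are the ones described above):

        enum page_walk_action {
                ACTION_SUBTREE = 0,     /* descend to the next level (default) */
                ACTION_CONTINUE,        /* skip the subtree under this entry */
                ACTION_AGAIN,           /* re-run the callback, e.g. after a split */
        };

        static int my_pud_entry(pud_t *pud, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
        {
                if (pud_trans_huge(*pud)) {
                        /* huge entry handled here: don't descend further */
                        walk->action = ACTION_CONTINUE;
                }
                return 0;
        }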

    [arnd@arndb.de: fix unused variable warning]
    Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
    Signed-off-by: Steven Price
    Signed-off-by: Arnd Bergmann
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Since 5.5-rc1 the last user of this function is gone, so remove the
    functionality.

    See commit
    2ad9d7747c10 ("netfilter: conntrack: free extension area immediately")
    for details.

    Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.de
    Signed-off-by: Florian Westphal
    Acked-by: Andrew Morton
    Acked-by: David Rientjes
    Reviewed-by: David Hildenbrand
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Florian Westphal