01 Apr, 2020

25 commits

  • [ Upstream commit 872307abbd0d9afd72171929806c2fa33dc34179 ]

    Check clk_prepare_enable() return value.

    Fixes: 2c7230446bc9 ("net: phy: Add pm support to Broadcom iProc mdio mux driver")
    Signed-off-by: Rayagonda Kokatanur
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Rayagonda Kokatanur
     
  • [ Upstream commit c312c7818b86b663d32ec5d4b512abf06b23899a ]

    The DT binding for this PHY describes an *optional* clock property.
    Due to a bug in the error handling logic, we are actually ignoring this
    clock *all* of the time so far.

    Fix this by using devm_clk_get_optional() to handle this clock properly.

    Fixes: b78ac6ecd1b6b ("net: phy: mdio-bcm-unimac: Allow configuring MDIO clock divider")
    Signed-off-by: Andre Przywara
    Reviewed-by: Andrew Lunn
    Acked-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andre Przywara
     
  • [ Upstream commit 749f6f6843115b424680f1aada3c0dd613ad807c ]

    When the DP83867 PHY is strapped to enable the Fast Link Drop (FLD) feature
    STRAP_STS2.STRAP_FLD (reg 0x006F bit 10), the Energy Lost Threshold for
    FLD Energy Lost Mode FLD_THR_CFG.ENERGY_LOST_FLD_THR (reg 0x002e bits 2:0)
    will be defaulted to 0x2. This may cause the phy link to be unstable. The
    new DP83867 DM recommends always restoring ENERGY_LOST_FLD_THR to 0x1.

    Hence, restore default value of FLD_THR_CFG.ENERGY_LOST_FLD_THR to 0x1 when
    FLD is enabled by bootstrapping as recommended by DM.
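
The register handling described above can be sketched as a pure bit-manipulation helper. This is only an illustration: the bit positions and the 0x1 value come from the message, while the macro names and the function dp83867_fix_fld_thr() are hypothetical and are not the driver's actual code.

```c
#include <stdint.h>

/* Bit positions taken from the commit message; the names are illustrative. */
#define STRAP_STS2_STRAP_FLD          (1u << 10) /* reg 0x006F, bit 10 */
#define FLD_THR_CFG_ENERGY_LOST_MASK  0x7u       /* reg 0x002e, bits 2:0 */
#define ENERGY_LOST_FLD_THR_VALUE     0x1u       /* value recommended by the DM */

/*
 * Hypothetical helper: given the current register contents, return the
 * FLD_THR_CFG value that should be written back.
 */
uint16_t dp83867_fix_fld_thr(uint16_t strap_sts2, uint16_t fld_thr_cfg)
{
    /* FLD not enabled by strapping: leave the threshold untouched. */
    if (!(strap_sts2 & STRAP_STS2_STRAP_FLD))
        return fld_thr_cfg;

    /* Clear bits 2:0, then set the recommended threshold. */
    fld_thr_cfg &= (uint16_t)~FLD_THR_CFG_ENERGY_LOST_MASK;
    fld_thr_cfg |= ENERGY_LOST_FLD_THR_VALUE;
    return fld_thr_cfg;
}
```

Only bits 2:0 are modified, so any other fields in FLD_THR_CFG keep their values.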

    Signed-off-by: Grygorii Strashko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Grygorii Strashko
     
  • [ Upstream commit 61fad6816fc10fb8793a925d5c1256d1c3db0cd2 ]

    PACKET_RX_RING can cause multiple writers to access the same slot if a
    fast writer wraps the ring while a slow writer is still copying. This
    is particularly likely with few, large, slots (e.g., GSO packets).

    Synchronize kernel thread ownership of rx ring slots with a bitmap.

    Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
    while holding the sk receive queue lock. They release this lock before
    copying and set tp_status to TP_STATUS_USER to release to userspace
    when done. During copying, another writer may take the lock, also see
    TP_STATUS_KERNEL, and start writing to the same slot.

    Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
    slot, test and set with the lock held. To release race-free, update
    tp_status and owner bit as a transaction, so take the lock again.
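
A minimal userspace sketch of the acquire/release protocol described above, assuming a ring of at most 64 slots and a caller that already holds the receive queue lock around both functions. The struct and function names are hypothetical; only rx_owner_map, tp_status, TP_STATUS_KERNEL and TP_STATUS_USER come from the message.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 64u

/* Same numeric values as the uapi constants of the same name. */
enum { TP_STATUS_KERNEL = 0, TP_STATUS_USER = 1 };

struct rx_ring {
    uint32_t tp_status[RING_SLOTS]; /* visible to userspace */
    uint64_t rx_owner_map;          /* kernel-private, one bit per slot */
};

/* Acquire: test and set the owner bit. Caller holds the queue lock. */
bool ring_try_acquire(struct rx_ring *r, unsigned int slot)
{
    uint64_t bit = UINT64_C(1) << slot;

    if (r->tp_status[slot] != TP_STATUS_KERNEL)
        return false;               /* slot still owned by userspace */
    if (r->rx_owner_map & bit)
        return false;               /* another writer is still copying */
    r->rx_owner_map |= bit;
    return true;
}

/* Release: update tp_status and the owner bit together. Caller re-takes the lock. */
void ring_release_to_user(struct rx_ring *r, unsigned int slot)
{
    r->tp_status[slot] = TP_STATUS_USER;
    r->rx_owner_map &= ~(UINT64_C(1) << slot);
}
```

The second check in ring_try_acquire() is what closes the race: a slot whose tp_status still reads TP_STATUS_KERNEL is refused while its owner bit is set.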

    This is one of a variety of discussed options (see Link below):

    * instead of a shadow ring, embed the data in the slot itself, such as
    in tp_padding. But any test for this field may match a value left by
    userspace, causing deadlock.

    * avoid the lock on release. This leaves a small race if releasing the
    shadow slot before setting TP_STATUS_USER. The below reproducer showed
    that this race is not academic. If releasing the slot after tp_status,
    the race is more subtle. See the first link for details.

    * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
    of two fields. But legacy applications may interpret all non-zero
    tp_status as owned by the user, as libpcap does. So this is possible
    only as an opt-in for newer processes. It can be added as an optional mode.

    * embed the struct at the tail of pg_vec to avoid extra allocation.
    The implementation proved no less complex than a separate field.

    The additional locking cost on release adds contention, no different
    than scaling on multicore or multiqueue h/w. In practice, neither the
    reproducer below nor small packet tcpdump showed a noticeable change in
    perf report in cycles spent in spinlock. Where contention is
    problematic, packet sockets support mitigation through PACKET_FANOUT.
    And we can consider adding opt-in state TP_KERNEL_OWNED.

    Easy to reproduce by running multiple netperf or similar TCP_STREAM
    flows concurrently with `tcpdump -B 129 -n greater 60000`.

    Based on an earlier patchset by Jon Rosen. See links below.

    I believe this issue goes back to the introduction of tpacket_rcv,
    which predates git history.

    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
    Suggested-by: Jon Rosen
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Jon Rosen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit 065fd83e1be2e1ba0d446a257fd86a3cc7bddb51 ]

    For the case where the last mvneta_poll did not process all
    RX packets, we need to xor the pp->cause_rx_tx or port->cause_rx_tx
    before calculating the rx_queue.

    Fixes: 2dcf75e2793c ("net: mvneta: Associate RX queues with each CPU")
    Signed-off-by: Jisheng Zhang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jisheng Zhang
     
  • [ Upstream commit 428c491332bca498c8eb2127669af51506c346c7 ]

    Currently ENA only provides the PCI remove() handler, used during rmmod
    for example. This is not called on shutdown/kexec path; we are potentially
    creating a failure scenario on kexec:

    (a) Kexec is triggered, no shutdown() / remove() handler is called for ENA;
    instead pci_device_shutdown() clears the master bit of the PCI device,
    stopping all DMA transactions;

    (b) Kexec reboot happens and the device gets enabled again, likely having
    its FW with that DMA transaction buffered; then it may trigger the (now
    invalid) memory operation in the new kernel, corrupting kernel memory area.

    This patch aims to prevent this, by implementing a shutdown() handler
    quite similar to the remove() one - the difference being the handling
    of the netdev, which is unregistered on remove(), but following the
    convention observed in other drivers, it's only detached on shutdown().

    This prevents an odd issue in AWS Nitro instances, in which after the 2nd
    kexec the next one will fail with an initrd corruption, caused by a wild
    DMA write to invalid kernel memory. The lspci output for the adapter
    present in my instance is:

    00:05.0 Ethernet controller [0200]: Amazon.com, Inc. Elastic Network
    Adapter (ENA) [1d0f:ec20]

    Suggested-by: Gavin Shan
    Signed-off-by: Guilherme G. Piccoli
    Acked-by: Sameeh Jubran
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Guilherme G. Piccoli
     
  • [ Upstream commit e80f40cbe4dd51371818e967d40da8fe305db5e4 ]

    Not only did this wheel not need reinventing, but there is also
    an issue with it: It doesn't remove the VLAN header in a way that
    preserves the L2 payload checksum when that is being provided by the DSA
    master hw. It should recalculate checksum both for the push, before
    removing the header, and for the pull afterwards. But the current
    implementation is quite dizzying, with pulls followed immediately
    afterwards by pushes, the memmove is done before the push, etc. This
    makes a DSA master with RX checksumming offload print stack traces
    with the infamous 'hw csum failure' message.

    So remove the dsa_8021q_remove_header function and replace it with
    something that actually works with inet checksumming.

    Fixes: d461933638ae ("net: dsa: tag_8021q: Create helper function for removing VLAN header")
    Signed-off-by: Vladimir Oltean
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vladimir Oltean
     
  • [ Upstream commit 22259471b51925353bd7b16f864c79fdd76e425e ]

    Andrew reported:

    After a number of network port link up/down changes, sometimes the switch
    port gets stuck in a state where it thinks it is still transmitting packets
    but the cpu port is not actually transmitting anymore. In this state you
    will see a message on the console
    "mtk_soc_eth 1e100000.ethernet eth0: transmit timed out" and the Tx counter
    in ifconfig will be incrementing on virtual port, but not incrementing on
    cpu port.

    The issue is that MAC TX/RX status has no impact on the link status or
    queue manager of the switch. So the queue manager just queues up packets
    of a disabled port and sends out pause frames when the queue is full.

    Change the LINK bit to reflect the link status.

    Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
    Reported-by: Andrew Smith
    Signed-off-by: René van Dorst
    Reviewed-by: Vivien Didelot
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    René van Dorst
     
  • [ Upstream commit 0e62f543bed03a64495bd2651d4fe1aa4bcb7fe5 ]

    When both the switch and the bridge are learning about new addresses,
    switch ports attached to the bridge would see duplicate ARP frames
    because both entities would attempt to send them.

    Fixes: 5037d532b83d ("net: dsa: add Broadcom tag RX/TX handler")
    Reported-by: Maxime Bizon
    Signed-off-by: Florian Fainelli
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Fainelli
     
  • [ Upstream commit 961d0e5b32946703125964f9f5b6321d60f4d706 ]

    Currently the software CBS does not consider the packet sending time
    when depleting the credits. It caused the throughput to be
    Idleslope[kbps] * (Port transmit rate[kbps] / |Sendslope[kbps]|) where
    Idleslope * (Port transmit rate / (Idleslope + |Sendslope|)) = Idleslope
    is expected. In order to fix the issue above, this patch takes the time
    when the packet sending completes into account by moving the anchor time
    variable "last" ahead to the send completion time upon transmission and
    adding wait when the next dequeue request comes before the send
    completion time of the previous packet.
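
As a worked example of the two formulas above, the sketch below plugs in numbers. It assumes the usual CBS relation sendslope = idleslope - port transmit rate, which the message implies but does not spell out. The function names are hypothetical, and the code only evaluates the formulas; it does not model the qdisc.

```c
#include <stdint.h>

/*
 * All rates are in kbps. Precondition: 0 < idleslope < port_rate,
 * so that sendslope is negative and non-zero.
 */

/* Throughput observed before the fix: idleslope * port_rate / |sendslope|. */
int64_t cbs_throughput_before_fix(int64_t idleslope, int64_t port_rate)
{
    int64_t sendslope = idleslope - port_rate;   /* assumed definition */

    return idleslope * port_rate / -sendslope;
}

/* Expected throughput: idleslope * port_rate / (idleslope + |sendslope|). */
int64_t cbs_throughput_expected(int64_t idleslope, int64_t port_rate)
{
    int64_t sendslope = idleslope - port_rate;   /* assumed definition */

    return idleslope * port_rate / (idleslope + -sendslope);
}
```

With idleslope = 20000 and a port transmit rate of 100000, sendslope is -80000, so the first formula gives 25000 kbps while the second gives the configured 20000 kbps.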

    changelog:
    V2->V3:
    - remove unnecessary whitespace cleanup
    - add the checks if port_rate is 0 before division

    V1->V2:
    - combine variable "send_completed" into "last"
    - add the comment for estimate of the packet sending

    Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
    Signed-off-by: Zh-yuan Ye
    Reviewed-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Zh-yuan Ye
     
  • [ Upstream commit 13d0f7b814d9b4c67e60d8c2820c86ea181e7d99 ]

    The bpfilter UMH code was recently changed to log its informative messages to
    /dev/kmsg; however, this interface doesn't yet support SEEK_CUR, which is
    used by dprintf(). As a result, dprintf() returns -EINVAL and doesn't log
    anything.

    Although there have been discussions in the past about supporting SEEK_CUR in
    the /dev/kmsg interface, they were never concluded. Since the only user of
    that from a userspace perspective inside the kernel is the bpfilter UMH
    (userspace) module, it's better to correct it here instead of waiting for a
    conclusion on the interface.

    Fixes: 36c4357c63f3 ("net: bpfilter: print umh messages to /dev/kmsg")
    Signed-off-by: Bruno Meneguele
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Bruno Meneguele
     
  • [ Upstream commit f6bf1bafdc2152bb22aff3a4e947f2441a1d49e2 ]

    list_for_each_entry_from_reverse() iterates backwards over the list from
    the current position, but in the error path we should start from the
    previous position.

    Fix this by using list_for_each_entry_continue_reverse() instead.
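
The same unwinding rule can be shown with a plain array instead of a kernel list. In this sketch the entry that failed is skipped and rollback starts from the one before it. All names here (struct route, route_add(), route_del(), routes_add_all()) are hypothetical and unrelated to the mlxsw code.

```c
#include <stddef.h>

struct route {
    int fail_on_add;  /* test knob: make route_add() fail for this entry */
    int offloaded;    /* 1 once the entry has been set up */
    int del_calls;    /* how many times route_del() ran on this entry */
};

static int route_add(struct route *r)
{
    if (r->fail_on_add)
        return -1;
    r->offloaded = 1;
    return 0;
}

static void route_del(struct route *r)
{
    r->offloaded = 0;
    r->del_calls++;
}

int routes_add_all(struct route *routes, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (route_add(&routes[i]) != 0)
            goto err_unwind;
    }
    return 0;

err_unwind:
    /*
     * routes[i] was never set up, so rollback begins at routes[i - 1].
     * This mirrors "continue" (start at the previous entry) rather than
     * "from" (start at the current entry).
     */
    while (i-- > 0)
        route_del(&routes[i]);
    return -1;
}
```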

    This suppresses the following error from coccinelle:

    drivers/net/ethernet/mellanox/mlxsw//spectrum_mr.c:655:34-38: ERROR:
    invalid reference to the index variable of the iterator on line 636

    Fixes: c011ec1bbfd6 ("mlxsw: spectrum: Add the multicast routing offloading logic")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit 6002059d7882c3512e6ac52fa82424272ddfcd5c ]

    During initialization the driver issues a software reset command and
    then waits for the system status to change back to "ready" state.

    However, before issuing the reset command the driver does not check that
    the system is actually in "ready" state. On Spectrum-{1,2} systems this
    was always the case as the hardware initialization time is very short.
    On Spectrum-3 systems this is no longer the case. This results in the
    software reset command timing-out and the driver failing to load:

    [ 6.347591] mlxsw_spectrum3 0000:06:00.0: Cmd exec timed-out (opcode=40(ACCESS_REG),opcode_mod=0,in_mod=0)
    [ 6.358382] mlxsw_spectrum3 0000:06:00.0: Reg cmd access failed (reg_id=9023(mrsr),type=write)
    [ 6.368028] mlxsw_spectrum3 0000:06:00.0: cannot register bus device
    [ 6.375274] mlxsw_spectrum3: probe of 0000:06:00.0 failed with error -110

    Fix this by waiting for the system to become ready both before issuing
    the reset command and afterwards. In case of failure, print the last
    system status to aid in debugging.
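
The ordering described above (wait, reset, wait again, report the last status on failure) can be sketched against a toy device model. Everything in this block is hypothetical, including struct fake_dev and the status values. It only illustrates the control flow, not the mlxsw register access.

```c
enum { STATUS_NOT_READY = 0, STATUS_READY = 1 };

/* Toy device: reports "not ready" for a fixed number of status reads. */
struct fake_dev {
    int polls_until_ready;
};

int dev_read_status(struct fake_dev *d)
{
    if (d->polls_until_ready > 0) {
        d->polls_until_ready--;
        return STATUS_NOT_READY;
    }
    return STATUS_READY;
}

/* The reset command fails (models a timeout) if the device is not ready. */
int dev_issue_reset(struct fake_dev *d)
{
    if (d->polls_until_ready > 0)
        return -1;
    d->polls_until_ready = 2;  /* device is busy for a while after reset */
    return 0;
}

/* Poll up to max_polls times; *last_status is kept for the error message. */
int dev_wait_ready(struct fake_dev *d, int max_polls, int *last_status)
{
    for (int i = 0; i < max_polls; i++) {
        *last_status = dev_read_status(d);
        if (*last_status == STATUS_READY)
            return 0;
    }
    return -1;
}

int dev_reset(struct fake_dev *d, int max_polls, int *last_status)
{
    if (dev_wait_ready(d, max_polls, last_status))  /* before the reset */
        return -1;
    if (dev_issue_reset(d))
        return -1;
    return dev_wait_ready(d, max_polls, last_status);  /* and afterwards */
}
```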

    Fixes: da382875c616 ("mlxsw: spectrum: Extend to support Spectrum-3 ASIC")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit b06d072ccc4b1acd0147b17914b7ad1caa1818bb ]

    Only attach macsec to ethernet devices.

    Syzbot was able to trigger a KMSAN warning in macsec_handle_frame
    by attaching to a phonet device.

    Macvlan has a similar check in macvlan_port_create.

    v1->v2
    - fix commit message typo

    Reported-by: syzbot
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit dddeb30bfc43926620f954266fd12c65a7206f07 ]

    There is a place,

    inet_dump_fib()
    fib_table_dump
    fn_trie_dump_leaf()
    hlist_for_each_entry_rcu()

    without rcu_read_lock() will trigger a warning,

    WARNING: suspicious RCU usage
    -----------------------------
    net/ipv4/fib_trie.c:2216 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by ip/1923:
    #0: ffffffff8ce76e40 (rtnl_mutex){+.+.}, at: netlink_dump+0xd6/0x840

    Call Trace:
    dump_stack+0xa1/0xea
    lockdep_rcu_suspicious+0x103/0x10d
    fn_trie_dump_leaf+0x581/0x590
    fib_table_dump+0x15f/0x220
    inet_dump_fib+0x4ad/0x5d0
    netlink_dump+0x350/0x840
    __netlink_dump_start+0x315/0x3e0
    rtnetlink_rcv_msg+0x4d1/0x720
    netlink_rcv_skb+0xf0/0x220
    rtnetlink_rcv+0x15/0x20
    netlink_unicast+0x306/0x460
    netlink_sendmsg+0x44b/0x770
    __sys_sendto+0x259/0x270
    __x64_sys_sendto+0x80/0xa0
    do_syscall_64+0x69/0xf4
    entry_SYSCALL_64_after_hwframe+0x49/0xb3

    Fixes: 18a8021a7be3 ("net/ipv4: Plumb support for filtering route dumps")
    Signed-off-by: Qian Cai
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Qian Cai
     
  • [ Upstream commit 3a303cfdd28d5f930a307c82e8a9d996394d5ebd ]

    The port->hsr is used in the hsr_handle_frame(), which is a
    callback of rx_handler.
    hsr master and slaves are initialized in hsr_add_port().
    This function initializes several pointers, including port->hsr, after
    registering rx_handler.
    So, in the rx_handler routine, an uninitialized pointer could be used.
    In order to fix this, pointers should be initialized before
    registering rx_handler.
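
The ordering problem can be reproduced in miniature. In the sketch below, registering the handler immediately delivers one frame, which models a packet arriving the moment the handler becomes visible. All types and functions are hypothetical stand-ins, not the hsr code, and a return value of -1 stands in for the NULL dereference.

```c
#include <stddef.h>

struct hsr_priv { int id; };
struct port { struct hsr_priv *hsr; };

typedef int (*rx_handler_t)(struct port *p);

/* Model: a frame may arrive as soon as the handler is registered. */
int register_rx_handler(struct port *p, rx_handler_t handler)
{
    return handler(p);
}

/* Returns the owning hsr id, or -1 where the real code would crash. */
int handle_frame(struct port *p)
{
    return p->hsr ? p->hsr->id : -1;
}

/* Buggy order: the handler can run before port->hsr is set. */
int add_port_buggy(struct hsr_priv *hsr, struct port *p)
{
    int seen = register_rx_handler(p, handle_frame);

    p->hsr = hsr;
    return seen;
}

/* Fixed order: initialize every pointer, then publish the handler. */
int add_port_fixed(struct hsr_priv *hsr, struct port *p)
{
    p->hsr = hsr;
    return register_rx_handler(p, handle_frame);
}
```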

    Test commands:
    ip netns del left
    ip netns del right
    modprobe -rv veth
    modprobe -rv hsr
    killall ping
    modprobe hsr
    ip netns add left
    ip netns add right
    ip link add veth0 type veth peer name veth1
    ip link add veth2 type veth peer name veth3
    ip link add veth4 type veth peer name veth5
    ip link set veth1 netns left
    ip link set veth3 netns right
    ip link set veth4 netns left
    ip link set veth5 netns right
    ip link set veth0 up
    ip link set veth2 up
    ip link set veth0 address fc:00:00:00:00:01
    ip link set veth2 address fc:00:00:00:00:02
    ip netns exec left ip link set veth1 up
    ip netns exec left ip link set veth4 up
    ip netns exec right ip link set veth3 up
    ip netns exec right ip link set veth5 up
    ip link add hsr0 type hsr slave1 veth0 slave2 veth2
    ip a a 192.168.100.1/24 dev hsr0
    ip link set hsr0 up
    ip netns exec left ip link add hsr1 type hsr slave1 veth1 slave2 veth4
    ip netns exec left ip a a 192.168.100.2/24 dev hsr1
    ip netns exec left ip link set hsr1 up
    ip netns exec left ip n a 192.168.100.1 dev hsr1 lladdr \
    fc:00:00:00:00:01 nud permanent
    ip netns exec left ip n r 192.168.100.1 dev hsr1 lladdr \
    fc:00:00:00:00:01 nud permanent
    for i in {1..100}
    do
    ip netns exec left ping 192.168.100.1 &
    done
    ip netns exec left hping3 192.168.100.1 -2 --flood &
    ip netns exec right ip link add hsr2 type hsr slave1 veth3 slave2 veth5
    ip netns exec right ip a a 192.168.100.3/24 dev hsr2
    ip netns exec right ip link set hsr2 up
    ip netns exec right ip n a 192.168.100.1 dev hsr2 lladdr \
    fc:00:00:00:00:02 nud permanent
    ip netns exec right ip n r 192.168.100.1 dev hsr2 lladdr \
    fc:00:00:00:00:02 nud permanent
    for i in {1..100}
    do
    ip netns exec right ping 192.168.100.1 &
    done
    ip netns exec right hping3 192.168.100.1 -2 --flood &
    while :
    do
    ip link add hsr0 type hsr slave1 veth0 slave2 veth2
    ip a a 192.168.100.1/24 dev hsr0
    ip link set hsr0 up
    ip link del hsr0
    done

    Splat looks like:
    [ 120.954938][ C0] general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1]I
    [ 120.957761][ C0] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
    [ 120.959064][ C0] CPU: 0 PID: 1511 Comm: hping3 Not tainted 5.6.0-rc5+ #460
    [ 120.960054][ C0] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    [ 120.962261][ C0] RIP: 0010:hsr_addr_is_self+0x65/0x2a0 [hsr]
    [ 120.963149][ C0] Code: 44 24 18 70 73 2f c0 48 c1 eb 03 48 8d 04 13 c7 00 f1 f1 f1 f1 c7 40 04 00 f2 f2 f2 4
    [ 120.966277][ C0] RSP: 0018:ffff8880d9c09af0 EFLAGS: 00010206
    [ 120.967293][ C0] RAX: 0000000000000006 RBX: 1ffff1101b38135f RCX: 0000000000000000
    [ 120.968516][ C0] RDX: dffffc0000000000 RSI: ffff8880d17cb208 RDI: 0000000000000000
    [ 120.969718][ C0] RBP: 0000000000000030 R08: ffffed101b3c0e3c R09: 0000000000000001
    [ 120.972203][ C0] R10: 0000000000000001 R11: ffffed101b3c0e3b R12: 0000000000000000
    [ 120.973379][ C0] R13: ffff8880aaf80100 R14: ffff8880aaf800f2 R15: ffff8880aaf80040
    [ 120.974410][ C0] FS: 00007f58e693f740(0000) GS:ffff8880d9c00000(0000) knlGS:0000000000000000
    [ 120.979794][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 120.980773][ C0] CR2: 00007ffcb8b38f29 CR3: 00000000afe8e001 CR4: 00000000000606f0
    [ 120.981945][ C0] Call Trace:
    [ 120.982411][ C0]
    [ 120.982848][ C0] ? hsr_add_node+0x8c0/0x8c0 [hsr]
    [ 120.983522][ C0] ? rcu_read_lock_held+0x90/0xa0
    [ 120.984159][ C0] ? rcu_read_lock_sched_held+0xc0/0xc0
    [ 120.984944][ C0] hsr_handle_frame+0x1db/0x4e0 [hsr]
    [ 120.985597][ C0] ? hsr_nl_nodedown+0x2b0/0x2b0 [hsr]
    [ 120.986289][ C0] __netif_receive_skb_core+0x6bf/0x3170
    [ 120.992513][ C0] ? check_chain_key+0x236/0x5d0
    [ 120.993223][ C0] ? do_xdp_generic+0x1460/0x1460
    [ 120.993875][ C0] ? register_lock_class+0x14d0/0x14d0
    [ 120.994609][ C0] ? __netif_receive_skb_one_core+0x8d/0x160
    [ 120.995377][ C0] __netif_receive_skb_one_core+0x8d/0x160
    [ 120.996204][ C0] ? __netif_receive_skb_core+0x3170/0x3170
    [ ... ]

    Reported-by: syzbot+fcf5dd39282ceb27108d@syzkaller.appspotmail.com
    Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • [ Upstream commit 0fda7600c2e174fe27e9cf02e78e345226e441fa ]

    The debug check must be done after unregister_netdevice_many() call --
    the list_del() for this is done inside .ndo_stop.

    Fixes: 2843a25348f8 ("geneve: speedup geneve tunnels dismantle")
    Reported-and-tested-by:
    Cc: Haishuang Yan
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     
  • [ Upstream commit f1f20a8666c55cb534b8f3fc1130eebf01a06155 ]

    Driver reclaims descriptors in much smaller batches, even if hardware
    indicates more to reclaim, during backpressure. So, fix the check to
    restart the Txq during backpressure, by looking at how many
    descriptors hardware had indicated to reclaim, and not on how many
    descriptors that driver had actually reclaimed. Once the Txq is
    restarted, driver will reclaim even more descriptors when Tx path
    is entered again.

    Fixes: d429005fdf2c ("cxgb4/cxgb4vf: Add support for SGE doorbell queue timer")
    Signed-off-by: Rahul Lakkireddy
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Rahul Lakkireddy
     
  • [ Upstream commit 7affd80802afb6ca92dba47d768632fbde365241 ]

    commit 7c3bebc3d868 ("cxgb4: request the TX CIDX updates to status page")
    reverted back to getting Tx CIDX updates via DMA, instead of interrupts,
    introduced by commit d429005fdf2c ("cxgb4/cxgb4vf: Add support for SGE
    doorbell queue timer")

    However, it missed reverting back several code changes where Tx CIDX
    updates are not explicitly requested during backpressure when using
    interrupt mode. These missed changes cause slow recovery during
    backpressure because the corresponding interrupt no longer comes and
    hence results in Tx throughput drop.

    So, revert these missed code changes as well, which will allow
    explicitly requesting Tx CIDX updates when backpressure happens.
    This enables the corresponding interrupt with Tx CIDX update message
    to get generated and hence speeds up recovery and restores
    throughput.

    Fixes: 7c3bebc3d868 ("cxgb4: request the TX CIDX updates to status page")
    Fixes: d429005fdf2c ("cxgb4/cxgb4vf: Add support for SGE doorbell queue timer")
    Signed-off-by: Rahul Lakkireddy
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Rahul Lakkireddy
     
  • commit 024aa8732acb7d2503eae43c3fe3504d0a8646d0 upstream.

    Note that the EC GPE processing need not be synchronized in
    acpi_s2idle_wake() after invoking acpi_ec_dispatch_gpe(), because
    that function checks the GPE status and dispatches its handler if
    need be and the SCI action handler is not going to run anyway at
    that point.

    Moreover, it is better to drain all of the pending ACPI events
    before restoring the working-state configuration of GPEs in
    acpi_s2idle_restore(), because those events are likely to be related
    to system wakeup, in which case they will not be relevant going
    forward.

    Rework the code to take these observations into account.

    Tested-by: Kenneth R. Crudup
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     
  • [ Upstream commit d2f8bfa4bff5028bc40ed56b4497c32e05b0178f ]

    It has turned out that the sdhci-tegra controller requires the R1B response,
    for commands that have this response associated with them. So, converting
    from an R1B to an R1 response for a CMD6 for example, leads to problems
    with the HW busy detection support.

    Fix this by informing the mmc core about the requirement, via setting the
    host cap, MMC_CAP_NEED_RSP_BUSY.

    Reported-by: Bitan Biswas
    Reported-by: Peter Geis
    Suggested-by: Sowjanya Komatineni
    Cc:
    Tested-by: Sowjanya Komatineni
    Tested-By: Peter Geis
    Signed-off-by: Ulf Hansson
    Signed-off-by: Sasha Levin

    Ulf Hansson
     
  • [ Upstream commit 055e04830d4544c57f2a5192a26c9e25915c29c0 ]

    It has turned out that the sdhci-omap controller requires the R1B response,
    for commands that have this response associated with them. So, converting
    from an R1B to an R1 response for a CMD6 for example, leads to problems
    with the HW busy detection support.

    Fix this by informing the mmc core about the requirement, via setting the
    host cap, MMC_CAP_NEED_RSP_BUSY.

    Reported-by: Naresh Kamboju
    Reported-by: Anders Roxell
    Reported-by: Faiz Abbas
    Cc:
    Tested-by: Anders Roxell
    Tested-by: Faiz Abbas
    Signed-off-by: Ulf Hansson
    Signed-off-by: Sasha Levin

    Ulf Hansson
     
  • [ Upstream commit 18d200460cd73636d4f20674085c39e32b4e0097 ]

    The busy timeout for the CMD5 to put the eMMC into sleep state, is specific
    to the card. Potentially the timeout may exceed the host->max_busy_timeout.
    If that becomes the case, mmc_sleep() converts from using an R1B response
    to an R1 response, as to prevent the host from doing HW busy detection.

    However, it has turned out that some hosts require an R1B response no
    matter what, so let's respect that via checking MMC_CAP_NEED_RSP_BUSY. Note
    that, if the R1B gets enforced, the host becomes fully responsible for
    managing the needed busy timeout, in one way or the other.

    Suggested-by: Sowjanya Komatineni
    Cc:
    Link: https://lore.kernel.org/r/20200311092036.16084-1-ulf.hansson@linaro.org
    Signed-off-by: Ulf Hansson
    Signed-off-by: Sasha Levin

    Ulf Hansson
     
  • [ Upstream commit 43cc64e5221cc6741252b64bc4531dd1eefb733d ]

    The busy timeout that is computed for each erase/trim/discard operation,
    can become quite long and may thus exceed the host->max_busy_timeout. If
    that becomes the case, mmc_do_erase() converts from using an R1B response
    to an R1 response, as to prevent the host from doing HW busy detection.

    However, it has turned out that some hosts require an R1B response no
    matter what, so let's respect that via checking MMC_CAP_NEED_RSP_BUSY. Note
    that, if the R1B gets enforced, the host becomes fully responsible for
    managing the needed busy timeout, in one way or the other.

    Suggested-by: Sowjanya Komatineni
    Cc:
    Tested-by: Anders Roxell
    Tested-by: Sowjanya Komatineni
    Tested-by: Faiz Abbas
    Tested-By: Peter Geis
    Signed-off-by: Ulf Hansson
    Signed-off-by: Sasha Levin

    Ulf Hansson
     
  • [ Upstream commit 1292e3efb149ee21d8d33d725eeed4e6b1ade963 ]

    It has turned out that some host controllers can't use R1B for CMD6 and
    other commands that have R1B associated with them. Therefore invent a new
    host cap, MMC_CAP_NEED_RSP_BUSY to let them specify this.

    In __mmc_switch(), let's check the flag and use it to prevent R1B responses
    from being converted into R1. Note that this also means that the host is
    on its own when it comes to managing the busy timeout.
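
The decision described above reduces to a small predicate. The sketch below is a userspace approximation: the struct, the flag value CAP_NEED_RSP_BUSY and the function use_r1b_response() are hypothetical stand-ins for the mmc core's types, and it assumes a max busy timeout of 0 means the host did not specify a limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for MMC_CAP_NEED_RSP_BUSY; the bit value here is arbitrary. */
#define CAP_NEED_RSP_BUSY (1u << 0)

struct host {
    uint32_t caps;
    unsigned int max_busy_timeout_ms; /* assumed: 0 means "not specified" */
};

/* Decide whether a command keeps its R1B response or is downgraded to R1. */
bool use_r1b_response(const struct host *h, unsigned int timeout_ms)
{
    /* The host insists on R1B and must manage the busy timeout itself. */
    if (h->caps & CAP_NEED_RSP_BUSY)
        return true;

    /* Timeout exceeds what HW busy detection can cover: fall back to R1. */
    if (h->max_busy_timeout_ms && timeout_ms > h->max_busy_timeout_ms)
        return false;

    return true;
}
```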

    Suggested-by: Sowjanya Komatineni
    Cc:
    Tested-by: Anders Roxell
    Tested-by: Sowjanya Komatineni
    Tested-by: Faiz Abbas
    Tested-By: Peter Geis
    Signed-off-by: Ulf Hansson
    Signed-off-by: Sasha Levin

    Ulf Hansson
     

25 Mar, 2020

15 commits

  • Greg Kroah-Hartman
     
  • commit ae62cf5eb2792d9a818c2d93728ed92119357017 upstream.

    Newer GCC warns about possible truncations of two generated path names as
    we're concatenating the configurable sysfs and debugfs path prefixes
    with a filename and placing the results in buffers of the same size as
    the maximum length of the prefixes.

    snprintf(d->name, MAX_STR_LEN, "gb_loopback%u", dev_id);

    snprintf(d->sysfs_entry, MAX_SYSFS_PATH, "%s%s/",
    t->sysfs_prefix, d->name);

    snprintf(d->debugfs_entry, MAX_SYSFS_PATH, "%sraw_latency_%s",
    t->debugfs_prefix, d->name);

    Fix this by separating the maximum path length from the maximum prefix
    length and reducing the latter enough to fit the generated strings.

    Note that we also need to reduce the device-name buffer size as GCC
    isn't smart enough to figure out that we ever only used MAX_STR_LEN
    bytes of it.
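
A small sketch of the sizing idea: keep the prefix limit smaller than the path buffer so that prefix plus name always fits, and check snprintf()'s return value to detect truncation at run time. The constants and the helper build_sysfs_entry() are hypothetical and do not match the values used in loopback_test.c.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sizes: the path buffer has room for prefix + name + "/". */
#define MAX_NAME_LEN    32
#define MAX_PREFIX_LEN  64
#define MAX_PATH_LEN    (MAX_PREFIX_LEN + MAX_NAME_LEN + 2)

/*
 * Build "<prefix><name>/" into out. Returns false if the result would not
 * fit, in which case out holds a truncated, NUL-terminated string.
 */
bool build_sysfs_entry(char *out, size_t out_len,
                       const char *prefix, const char *name)
{
    int n = snprintf(out, out_len, "%s%s/", prefix, name);

    /* snprintf returns the length it wanted to write, excluding the NUL. */
    return n >= 0 && (size_t)n < out_len;
}
```

Checking the return value complements the compile-time sizing: the sizing silences the warning, while the check catches any input that still exceeds the limits.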

    Fixes: 6b0658f68786 ("greybus: tools: Add tools directory to greybus repo and add loopback")
    Signed-off-by: Johan Hovold
    Link: https://lore.kernel.org/r/20200312110151.22028-4-johan@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Johan Hovold
     
  • commit f16023834863932f95dfad13fac3fc47f77d2f29 upstream.

    Newer GCC warns about a possible truncation of a generated sysfs path
    name as we're concatenating a directory path with a file name and
    placing the result in a buffer that is half the size of the maximum
    length of the directory path (which is user controlled).

    loopback_test.c: In function 'open_poll_files':
    loopback_test.c:651:31: warning: '%s' directive output may be truncated writing up to 511 bytes into a region of size 255 [-Wformat-truncation=]
    651 | snprintf(buf, sizeof(buf), "%s%s", dev->sysfs_entry, "iteration_count");
    | ^~
    loopback_test.c:651:3: note: 'snprintf' output between 16 and 527 bytes into a destination of size 255
    651 | snprintf(buf, sizeof(buf), "%s%s", dev->sysfs_entry, "iteration_count");
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Fix this by making sure the buffer is large enough for the concatenated
    strings.

    Fixes: 6b0658f68786 ("greybus: tools: Add tools directory to greybus repo and add loopback")
    Fixes: 9250c0ee2626 ("greybus: Loopback_test: use poll instead of inotify")
    Signed-off-by: Johan Hovold
    Link: https://lore.kernel.org/r/20200312110151.22028-3-johan@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Johan Hovold
     
  • commit e8dca30f7118461d47e1c3510d0e31b277439151 upstream.

    CTA-861-F explicitly states that for RGB colorspace colorimetry should
    be set to "none". Fix that.

    Acked-by: Laurent Pinchart
    Fixes: def23aa7e982 ("drm: bridge: dw-hdmi: Switch to V4L bus format and encodings")
    Signed-off-by: Jernej Skrabec
    Link: https://patchwork.freedesktop.org/patch/msgid/20200304232512.51616-2-jernej.skrabec@siol.net
    Signed-off-by: Greg Kroah-Hartman

    Jernej Skrabec
     
  • commit 98fd5c723730f560e5bea919a64ac5b83d45eb72 upstream.

    When we send PDU data, we want to optimize the tcp stack
    operation if we have more data to send. So we set MSG_MORE
    when:
    - We have more fragments coming in the batch, or
    - We have more data to send in this PDU
    - We don't have a data digest trailer
    - We optimize with the SUCCESS flag and omit the NVMe completion
    (used if sq_head pointer update is disabled)

    This addresses a regression in QD=1 with SUCCESS flag optimization
    as we unconditionally set MSG_MORE when we didn't actually have
    more data to send.

    Fixes: 70583295388a ("nvmet-tcp: implement C2HData SUCCESS optimization")
    Reported-by: Mark Wunderlich
    Tested-by: Mark Wunderlich
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Keith Busch
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     
  • commit f50b7dacccbab2b9e3ef18f52a6dcc18ed2050b9 upstream.

    On a system configured to trigger a crash_kexec() reboot, when only one CPU
    is online and another CPU panics while starting-up, crash_smp_send_stop()
    will fail to send any STOP message to the other already online core,
    resulting in a failure to freeze and registers not being properly saved.

    Moreover, even if the proper messages are sent (case CPUs > 2),
    it will similarly fail to account for the booting CPU when executing
    the final stop wait-loop, potentially resulting in some CPU not
    being waited on for shutdown before rebooting.

    A tangible effect of this behaviour can be observed when, after a panic
    with kexec enabled and loaded, on the following reboot triggered by kexec,
    the cpu that could not be successfully stopped fails to come back online:

    [ 362.291022] ------------[ cut here ]------------
    [ 362.291525] kernel BUG at arch/arm64/kernel/cpufeature.c:886!
    [ 362.292023] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    [ 362.292400] Modules linked in:
    [ 362.292970] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 5.6.0-rc4-00003-gc780b890948a #105
    [ 362.293136] Hardware name: Foundation-v8A (DT)
    [ 362.293382] pstate: 200001c5 (nzCv dAIF -PAN -UAO)
    [ 362.294063] pc : has_cpuid_feature+0xf0/0x348
    [ 362.294177] lr : verify_local_elf_hwcaps+0x84/0xe8
    [ 362.294280] sp : ffff800011b1bf60
    [ 362.294362] x29: ffff800011b1bf60 x28: 0000000000000000
    [ 362.294534] x27: 0000000000000000 x26: 0000000000000000
    [ 362.294631] x25: 0000000000000000 x24: ffff80001189a25c
    [ 362.294718] x23: 0000000000000000 x22: 0000000000000000
    [ 362.294803] x21: ffff8000114aa018 x20: ffff800011156a00
    [ 362.294897] x19: ffff800010c944a0 x18: 0000000000000004
    [ 362.294987] x17: 0000000000000000 x16: 0000000000000000
    [ 362.295073] x15: 00004e53b831ae3c x14: 00004e53b831ae3c
    [ 362.295165] x13: 0000000000000384 x12: 0000000000000000
    [ 362.295251] x11: 0000000000000000 x10: 00400032b5503510
    [ 362.295334] x9 : 0000000000000000 x8 : ffff800010c7e204
    [ 362.295426] x7 : 00000000410fd0f0 x6 : 0000000000000001
    [ 362.295508] x5 : 00000000410fd0f0 x4 : 0000000000000000
    [ 362.295592] x3 : 0000000000000000 x2 : ffff8000100939d8
    [ 362.295683] x1 : 0000000000180420 x0 : 0000000000180480
    [ 362.296011] Call trace:
    [ 362.296257] has_cpuid_feature+0xf0/0x348
    [ 362.296350] verify_local_elf_hwcaps+0x84/0xe8
    [ 362.296424] check_local_cpu_capabilities+0x44/0x128
    [ 362.296497] secondary_start_kernel+0xf4/0x188
    [ 362.296998] Code: 52805001 72a00301 6b01001f 54000ec0 (d4210000)
    [ 362.298652] SMP: stopping secondary CPUs
    [ 362.300615] Starting crashdump kernel...
    [ 362.301168] Bye!
    [ 0.000000] Booting Linux on physical CPU 0x0000000003 [0x410fd0f0]
    [ 0.000000] Linux version 5.6.0-rc4-00003-gc780b890948a (crimar01@e120937-lin) (gcc version 8.3.0 (GNU Toolchain for the A-profile Architecture 8.3-2019.03 (arm-rel-8.36))) #105 SMP PREEMPT Fri Mar 6 17:00:42 GMT 2020
    [ 0.000000] Machine model: Foundation-v8A
    [ 0.000000] earlycon: pl11 at MMIO 0x000000001c090000 (options '')
    [ 0.000000] printk: bootconsole [pl11] enabled
    .....
    [ 0.138024] rcu: Hierarchical SRCU implementation.
    [ 0.153472] its@2f020000: unable to locate ITS domain
    [ 0.154078] its@2f020000: Unable to locate ITS domain
    [ 0.157541] EFI services will not be available.
    [ 0.175395] smp: Bringing up secondary CPUs ...
    [ 0.209182] psci: failed to boot CPU1 (-22)
    [ 0.209377] CPU1: failed to boot: -22
    [ 0.274598] Detected PIPT I-cache on CPU2
    [ 0.278707] GICv3: CPU2: found redistributor 1 region 0:0x000000002f120000
    [ 0.285212] CPU2: Booted secondary processor 0x0000000001 [0x410fd0f0]
    [ 0.369053] Detected PIPT I-cache on CPU3
    [ 0.372947] GICv3: CPU3: found redistributor 2 region 0:0x000000002f140000
    [ 0.378664] CPU3: Booted secondary processor 0x0000000002 [0x410fd0f0]
    [ 0.401707] smp: Brought up 1 node, 3 CPUs
    [ 0.404057] SMP: Total of 3 processors activated.

    Make crash_smp_send_stop() account also for the online status of the
    calling CPU while evaluating how many CPUs are effectively online: this way
    the right number of STOPs is sent and the registers of all other stopped
    cores are properly saved.

    Fixes: 78fd584cdec05 ("arm64: kdump: implement machine_crash_shutdown()")
    Acked-by: Mark Rutland
    Signed-off-by: Cristian Marussi
    Signed-off-by: Will Deacon
    Signed-off-by: Greg Kroah-Hartman

    Cristian Marussi
     
  • commit d0bab0c39e32d39a8c5cddca72e5b4a3059fe050 upstream.

    On a system with only one CPU online, when another CPU panics while
    starting-up, smp_send_stop() will fail to send any STOP message to the
    other already online core, resulting in a system still responsive and
    alive at the end of the panic procedure.

    [ 186.700083] CPU3: shutdown
    [ 187.075462] CPU2: shutdown
    [ 187.162869] CPU1: shutdown
    [ 188.689998] ------------[ cut here ]------------
    [ 188.691645] kernel BUG at arch/arm64/kernel/cpufeature.c:886!
    [ 188.692079] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    [ 188.692444] Modules linked in:
    [ 188.693031] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc4-00001-g338d25c35a98 #104
    [ 188.693175] Hardware name: Foundation-v8A (DT)
    [ 188.693492] pstate: 200001c5 (nzCv dAIF -PAN -UAO)
    [ 188.694183] pc : has_cpuid_feature+0xf0/0x348
    [ 188.694311] lr : verify_local_elf_hwcaps+0x84/0xe8
    [ 188.694410] sp : ffff800011b1bf60
    [ 188.694536] x29: ffff800011b1bf60 x28: 0000000000000000
    [ 188.694707] x27: 0000000000000000 x26: 0000000000000000
    [ 188.694801] x25: 0000000000000000 x24: ffff80001189a25c
    [ 188.694905] x23: 0000000000000000 x22: 0000000000000000
    [ 188.694996] x21: ffff8000114aa018 x20: ffff800011156a38
    [ 188.695089] x19: ffff800010c944a0 x18: 0000000000000004
    [ 188.695187] x17: 0000000000000000 x16: 0000000000000000
    [ 188.695280] x15: 0000249dbde5431e x14: 0262cbe497efa1fa
    [ 188.695371] x13: 0000000000000002 x12: 0000000000002592
    [ 188.695472] x11: 0000000000000080 x10: 00400032b5503510
    [ 188.695572] x9 : 0000000000000000 x8 : ffff800010c80204
    [ 188.695659] x7 : 00000000410fd0f0 x6 : 0000000000000001
    [ 188.695750] x5 : 00000000410fd0f0 x4 : 0000000000000000
    [ 188.695836] x3 : 0000000000000000 x2 : ffff8000100939d8
    [ 188.695919] x1 : 0000000000180420 x0 : 0000000000180480
    [ 188.696253] Call trace:
    [ 188.696410] has_cpuid_feature+0xf0/0x348
    [ 188.696504] verify_local_elf_hwcaps+0x84/0xe8
    [ 188.696591] check_local_cpu_capabilities+0x44/0x128
    [ 188.696666] secondary_start_kernel+0xf4/0x188
    [ 188.697150] Code: 52805001 72a00301 6b01001f 54000ec0 (d4210000)
    [ 188.698639] ---[ end trace 3f12ca47652f7b72 ]---
    [ 188.699160] Kernel panic - not syncing: Attempted to kill the idle task!
    [ 188.699546] Kernel Offset: disabled
    [ 188.699828] CPU features: 0x00004,20c02008
    [ 188.700012] Memory Limit: none
    [ 188.700538] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

    [root@arch ~]# echo Helo
    Helo
    [root@arch ~]# cat /proc/cpuinfo | grep proce
    processor : 0

    Make smp_send_stop() account also for the online status of the calling CPU
    while evaluating how many CPUs are effectively online: this way, the right
    number of STOPs is sent, so enforcing a proper freeze of the system at the
    end of panic even under the above conditions.
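
    The accounting described above can be sketched in user space. This is
    only an illustration under stated assumptions: cpu_bitmap,
    count_online(), num_other_online() and naive_num_other_online() are
    hypothetical names, and a 64-bit bitmap stands in for the kernel's
    online CPU mask. The point is that one is subtracted from the online
    count only when the calling CPU is itself online.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the online CPU mask: bit n set means CPU n is online. */
typedef uint64_t cpu_bitmap;

/* Count the set bits, i.e. the number of online CPUs. */
static unsigned int count_online(cpu_bitmap online)
{
    unsigned int n = 0;

    for (; online; online &= online - 1)
        n++;
    return n;
}

/*
 * Number of online CPUs other than `self`. The caller is subtracted only
 * if it is online, so a CPU that panics while still booting does not hide
 * the one CPU that is already up.
 */
static unsigned int num_other_online(cpu_bitmap online, unsigned int self)
{
    assert(self < 64);
    unsigned int self_online = (unsigned int)((online >> self) & 1u);

    return count_online(online) - self_online;
}

/* For comparison: a count that assumes the caller is always online. */
static unsigned int naive_num_other_online(cpu_bitmap online)
{
    return count_online(online) - 1;
}
```

    With only CPU0 online and a not-yet-online CPU3 calling in,
    num_other_online(0x1, 3) reports one CPU to stop, while the naive count
    reports none.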

    Fixes: 08e875c16a16c ("arm64: SMP support")
    Reported-by: Dave Martin
    Acked-by: Mark Rutland
    Signed-off-by: Cristian Marussi
    Signed-off-by: Will Deacon
    Signed-off-by: Greg Kroah-Hartman

    Cristian Marussi
     
  • commit 3b36b13d5e69d6f51ff1c55d1b404a74646c9757 upstream.

    Commit 317d9313925c ("ALSA: hda/realtek - Set default power save node to
    0") makes the ALC225 have pop noise on S3 resume and cold boot.

    So partially revert this commit for ALC225 to fix the regression.

    Fixes: 317d9313925c ("ALSA: hda/realtek - Set default power save node to 0")
    BugLink: https://bugs.launchpad.net/bugs/1866357
    Signed-off-by: Kai-Heng Feng
    Link: https://lore.kernel.org/r/20200311061328.17614-1-kai.heng.feng@canonical.com
    Signed-off-by: Takashi Iwai
    Signed-off-by: Greg Kroah-Hartman

    Kai-Heng Feng
     
  • commit 8d67743653dce5a0e7aa500fcccb237cde7ad88e upstream.

    The recent futex inode life time fix changed the ordering of the futex key
    union struct members, but forgot to adjust the hash function accordingly.

    As a result the hashing omits the leading 64 bits and even hashes beyond the
    futex key causing a bad hash distribution which led to a ~100% performance
    regression.

    Hand in the futex key pointer instead of a random struct member and make
    the size calculation based on the struct offset.
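
    A minimal sketch of that approach, assuming a hypothetical key layout:
    struct demo_key, hash_bytes() and hash_key() are illustrative names,
    not the kernel's futex key or hash function. The hash starts at the
    key pointer itself and its length comes from offsetof() of the last
    member, so reordering the leading members cannot silently shift or
    shrink the hashed range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key layout for illustration only. */
struct demo_key {
    uint64_t id;      /* leading 64-bit identifier */
    uint64_t index;
    uint32_t offset;  /* last member, used as the hash seed */
};

/* FNV-1a style byte hash, standing in for the real hash function. */
static uint32_t hash_bytes(const void *data, size_t len, uint32_t seed)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u ^ seed;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Hash everything in front of `offset`, starting at the key itself rather
 * than at some member, with the length derived from the struct layout.
 */
static uint32_t hash_key(const struct demo_key *key)
{
    assert(key != NULL);
    return hash_bytes(key, offsetof(struct demo_key, offset), key->offset);
}
```

    Because the range starts at the key, the leading 64-bit member always
    contributes to the hash, and the range ends exactly where the seed
    member begins.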

    Fixes: 8019ad13ef7f ("futex: Fix inode life-time issue")
    Reported-by: Rong Chen
    Decoded-by: Linus Torvalds
    Signed-off-by: Thomas Gleixner
    Tested-by: Rong Chen
    Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 8019ad13ef7f64be44d4f892af9c840179009254 upstream.

    As reported by Jann, ihold() does not in fact guarantee inode
    persistence. And instead of making it so, replace the usage of inode
    pointers with a per boot, machine wide, unique inode identifier.

    This sequence number is global, but shared (file backed) futexes are
    rare enough that this should not become a performance issue.

    Reported-by: Jann Horn
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.

    Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
    __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
    the vunmap() code-path. While this change was necessary to maintain
    correctness on x86-32-pae kernels, it also adds additional cycles for
    architectures that don't need it.

    Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
    severe performance regressions in micro-benchmarks because it now also
    calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
    the vmalloc_sync_all() implementation on x86-64 is only needed for newly
    created mappings.

    To avoid the unnecessary work on x86-64 and to gain the performance
    back, split up vmalloc_sync_all() into two functions:

    * vmalloc_sync_mappings(), and
    * vmalloc_sync_unmappings()

    Most call-sites to vmalloc_sync_all() only care about new mappings being
    synchronized. The only exception is the new call-site added in the
    above mentioned commit.

    Shile Zhang directed us to a report of an 80% regression in reaim
    throughput.

    Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
    Reported-by: kernel test robot
    Reported-by: Shile Zhang
    Signed-off-by: Joerg Roedel
    Signed-off-by: Andrew Morton
    Tested-by: Borislav Petkov
    Acked-by: Rafael J. Wysocki [GHES]
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc:
    Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
    Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
    Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joerg Roedel
     
  • commit d72520ad004a8ce18a6ba6cde317f0081b27365a upstream.

    Commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after swapped
    out") supported writing THP to a swap device but forgot to upgrade an
    older commit df8c94d13c7e ("page-flags: define behavior of FS/IO-related
    flags on compound pages") which could trigger a crash during THP
    swapping out with DEBUG_VM_PGFLAGS=y:

    kernel BUG at include/linux/page-flags.h:317!

    page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
    page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0
    anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked)

    end_swap_bio_write()
      SetPageError(page)
        VM_BUG_ON_PAGE(1 && PageCompound(page))


    bio_endio+0x297/0x560
    dec_pending+0x218/0x430 [dm_mod]
    clone_endio+0xe4/0x2c0 [dm_mod]
    bio_endio+0x297/0x560
    blk_update_request+0x201/0x920
    scsi_end_request+0x6b/0x4b0
    scsi_io_completion+0x509/0x7e0
    scsi_finish_command+0x1ed/0x2a0
    scsi_softirq_done+0x1c9/0x1d0
    __blk_mqnterrupt+0xf/0x20

    Fix by checking PF_NO_TAIL in those places instead.

    Fixes: bd4c82c22c36 ("mm, THP, swap: delay splitting THP after swapped out")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: "Huang, Ying"
    Acked-by: Rafael Aquini
    Cc:
    Link: http://lkml.kernel.org/r/20200310235846.1319-1-cai@lca.pw
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Qian Cai
     
  • commit 0715e6c516f106ed553828a671d30ad9a3431536 upstream.

    Sachin reports [1] a crash in SLUB __slab_alloc():

    BUG: Kernel NULL pointer dereference on read at 0x000073b0
    Faulting instruction address: 0xc0000000003d55f4
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in:
    CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
    NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
    REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest)
    MSR: 8000000000009033 CR: 24004844 XER: 00000000
    CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
    GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
    GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
    GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
    GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
    GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
    GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
    GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
    GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
    NIP ___slab_alloc+0x1f4/0x760
    LR __slab_alloc+0x34/0x60
    Call Trace:
    ___slab_alloc+0x334/0x760 (unreliable)
    __slab_alloc+0x34/0x60
    __kmalloc_node+0x110/0x490
    kvmalloc_node+0x58/0x110
    mem_cgroup_css_online+0x108/0x270
    online_css+0x48/0xd0
    cgroup_apply_control_enable+0x2ec/0x4d0
    cgroup_mkdir+0x228/0x5f0
    kernfs_iop_mkdir+0x90/0xf0
    vfs_mkdir+0x110/0x230
    do_mkdirat+0xb0/0x1a0
    system_call+0x5c/0x68

    This is a PowerPC platform with following NUMA topology:

    available: 2 nodes (0-1)
    node 0 cpus:
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
    node 1 size: 35247 MB
    node 1 free: 30907 MB
    node distances:
    node 0 1
    0: 10 40
    1: 40 10

    possible numa nodes: 0-31

    This only happens with a mmotm patch "mm/memcontrol.c: allocate
    shrinker_map on appropriate NUMA node" [2] which effectively calls
    kmalloc_node for each possible node. SLUB however only allocates
    kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
    node_to_mem_node to return such valid node for other nodes since commit
    a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating
    on memoryless node"). This is however not true in this configuration
    where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
    thus it contains zeroes and get_partial() ends up accessing
    non-allocated kmem_cache_node.

    A related issue was reported by Bharata (originally by Ramachandran) [3]
    where a similar PowerPC configuration, but with mainline kernel without
    patch [2] ends up allocating large amounts of pages by kmalloc-1k and
    kmalloc-512. This seems to have the same underlying issue with
    node_to_mem_node() not behaving as expected, and might probably also
    lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].

    This patch should fix both issues by not relying on node_to_mem_node()
    anymore and instead simply falling back to NUMA_NO_NODE, when
    kmalloc_node(node) is attempted for a node that's not online, or has no
    usable memory. The "usable memory" condition is also changed from
    node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
    the condition that SLUB uses to allocate kmem_cache_node structures.
    The check in get_partial() is removed completely, as the checks in
    ___slab_alloc() are now sufficient to prevent get_partial() being
    reached with an invalid node.
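
    The fallback can be sketched as follows. This is a user-space
    illustration under stated assumptions: DEMO_NO_NODE stands in for
    NUMA_NO_NODE, the bitmask stands in for the N_NORMAL_MEMORY node
    state, and effective_node() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NO_NODE   (-1)  /* stands in for NUMA_NO_NODE */
#define DEMO_MAX_NODES 32

/*
 * Return the node an allocation should actually target. Bit n of
 * `normal_memory_nodes` is set when node n is online and has normal
 * memory. A request for any other node drops the node constraint
 * instead of being redirected through a per-node lookup table.
 */
static int effective_node(int node, uint32_t normal_memory_nodes)
{
    assert(node >= DEMO_NO_NODE && node < DEMO_MAX_NODES);

    if (node != DEMO_NO_NODE && !((normal_memory_nodes >> node) & 1u))
        return DEMO_NO_NODE;
    return node;
}
```

    With the topology quoted above (only node 1 has memory), a request for
    node 0 or for any of the possible nodes 2-31 yields DEMO_NO_NODE, and a
    request for node 1 is left unchanged.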

    [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
    [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
    [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
    [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/

    Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
    Reported-by: Sachin Sant
    Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Tested-by: Bharata B Rao
    Reviewed-by: Srikar Dronamraju
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Christopher Lameter
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Kirill Tkhai
    Cc: Vlastimil Babka
    Cc: Nathan Lynch
    Cc:
    Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
    Debugged-by: Srikar Dronamraju
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 5076190daded2197f62fe92cf69674488be44175 upstream.

    This is just a cleanup addition to Jann's fix to properly update the
    transaction ID for the slub slowpath in commit fd4d9c7d0c71 ("mm: slub:
    add missing TID bump..").

    The transaction ID is what protects us against any concurrent accesses,
    but we should really also make sure to make the 'freelist' comparison
    itself always use the same freelist value that we then used as the new
    next free pointer.

    Jann points out that if we do all of this carefully, we could skip the
    transaction ID update for all the paths that only remove entries from
    the lists, and only update the TID when adding entries (to avoid the ABA
    issue with cmpxchg and list handling re-adding a previously seen value).

    But this patch just does the "make sure to cmpxchg the same value we
    used" rather than trying to be clever.
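
    The idea can be sketched with C11 atomics. This is an illustration
    only: struct demo_obj and demo_push() are hypothetical names, a plain
    single-word compare-and-exchange stands in for the kernel's per-CPU
    double cmpxchg, and the transaction ID is left out entirely. The list
    head is read once into a local, that local becomes the new object's
    next pointer, and the same local is the expected value of the
    compare-and-exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical free-list node. */
struct demo_obj {
    struct demo_obj *next;
};

/*
 * Push `obj` onto the list. `seen` is the single snapshot of the head:
 * it is stored as obj->next and is also what the compare-and-exchange
 * expects, so the two can never disagree. On failure the call refreshes
 * `seen` with the current head and the loop relinks before retrying.
 */
static void demo_push(_Atomic(struct demo_obj *) *freelist, struct demo_obj *obj)
{
    struct demo_obj *seen = atomic_load(freelist);

    assert(obj != NULL);
    do {
        obj->next = seen;
    } while (!atomic_compare_exchange_weak(freelist, &seen, obj));
}
```

    Without a transaction ID this sketch does nothing about the ABA issue
    mentioned above; it only shows the "compare against the value we
    linked to" discipline.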

    Acked-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit 1b53734bd0b2feed8e7761771b2e76fc9126ea0c upstream.

    This fixes a possible lost wakeup introduced by commit a218cc491420.
    Originally modifications to ep->wq were serialized by ep->wq.lock, but
    in commit a218cc491420 ("epoll: use rwlock in order to reduce
    ep_poll_callback() contention") a new rw lock was introduced in order to
    relax the fd event path, i.e. callers of the ep_poll_callback() function.

    After the change ep_modify and ep_insert (both are called on epoll_ctl()
    path) were switched to ep->lock, but ep_poll (epoll_wait) was using
    ep->wq.lock on wqueue list modification.

    The bug doesn't lead to any wqueue list corruptions, because the wake up
    path and list modifications were serialized by ep->wq.lock internally,
    but the actual waitqueue_active() check prior to the wake_up() call can be
    reordered with modifications of the ep ready list, thus a wake up can be
    lost.

    And yes, this can be healed by an explicit smp_mb():

    list_add_tail(&epi->rdlink, &ep->rdllist);
    smp_mb();
    if (waitqueue_active(&ep->wq))
            wake_up(&ep->wq);

    But let's keep it simple: the current patch replaces ep->wq.lock with
    ep->lock for wqueue modifications, so the wake up path always observes
    the activeness of the wqueue correctly.

    Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
    Reported-by: Max Neunhoeffer
    Signed-off-by: Roman Penyaev
    Signed-off-by: Andrew Morton
    Tested-by: Max Neunhoeffer
    Cc: Jakub Kicinski
    Cc: Christopher Kohlhoff
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Jes Sorensen
    Cc: [5.1+]
    Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
    References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
    Bisected-by: Max Neunhoeffer
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Penyaev