01 Apr, 2020

40 commits

  • [ Upstream commit 09e91dbea0aa32be02d8877bd50490813de56b9a ]

    The hsr module supports the list and status commands.
    (HSR_C_GET_NODE_LIST and HSR_C_GET_NODE_STATUS)
    These commands send node information to the user-space via generic netlink.
    But, in the non-init_net namespace, these commands are not allowed
    because .netnsok flag is false.
    So, there is no way to get node information in the non-init_net namespace.

    Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • [ Upstream commit ca19c70f5225771c05bcdcb832b4eb84d7271c5e ]

    hsr_get_node_list() sends node addresses to userspace.
    If there are too many nodes, it could fail because of the buffer size.
    To avoid this failure, a restart routine is added.
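
    The restart idea can be sketched in userspace C: fill a fixed-size
    reply until it is full, remember where to resume, and repeat. This is
    only an illustration; struct reply, REPLY_CAPACITY, fill_reply() and
    dump_nodes() are hypothetical names, not the hsr or netlink API.

```c
#include <stddef.h>

#define ADDR_LEN 6          /* Ethernet address length */
#define REPLY_CAPACITY 4    /* hypothetical: addresses that fit in one reply */

struct node_addr { unsigned char b[ADDR_LEN]; };

/* Hypothetical stand-in for one fixed-size reply message. */
struct reply {
    struct node_addr addrs[REPLY_CAPACITY];
    size_t count;
};

/*
 * Copy addresses into `r`, beginning at index `start`, until the reply is
 * full or the list ends. Returns the index to resume from; the dump is
 * complete when the return value equals `total`.
 */
size_t fill_reply(struct reply *r, const struct node_addr *nodes,
                  size_t total, size_t start)
{
    size_t i = start;

    r->count = 0;
    while (i < total && r->count < REPLY_CAPACITY)
        r->addrs[r->count++] = nodes[i++];
    return i;
}

/*
 * Dump the whole list, restarting with a fresh reply each time one fills
 * up. Returns how many replies were needed.
 */
size_t dump_nodes(const struct node_addr *nodes, size_t total)
{
    struct reply r;
    size_t next = 0, replies = 0;

    do {
        next = fill_reply(&r, nodes, total, next);
        replies++;              /* a real sender would transmit `r` here */
    } while (next < total);
    return replies;
}
```

    With room for four addresses per reply, ten nodes need three replies
    instead of failing on the first one.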

    Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • [ Upstream commit 173756b86803655d70af7732079b3aa935e6ab68 ]

    hsr_get_node_{list/status}() are not under rtnl_lock() because
    they are callback functions of generic netlink.
    But they use __dev_get_by_index() without rtnl_lock().
    So, it would use unsafe data.
    In order to fix it, rcu_read_lock() and dev_get_by_index_rcu()
    are used instead of __dev_get_by_index().

    Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • [ Upstream commit 32ca98feab8c9076c89c0697c5a85e46fece809d ]

    The fix referenced below causes a crash when an ERSPAN tunnel is created
    without passing IFLA_INFO_DATA. Fix by validating passed-in data in the
    same way as ipgre does.

    Fixes: e1f8f78ffe98 ("net: ip_gre: Separate ERSPAN newlink / changelink callbacks")
    Reported-by: syzbot+1b4ebf4dae4e510dd219@syzkaller.appspotmail.com
    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Petr Machata
     
  • [ Upstream commit e1f8f78ffe9854308b9e12a73ebe4e909074fc33 ]

    ERSPAN shares most of the code path with GRE and gretap code. While that
    helps keep the code compact, it is also error prone. Currently a broken
    userspace can turn a gretap tunnel into a de facto ERSPAN one by passing
    IFLA_GRE_ERSPAN_VER. There has been a similar issue in ip6gretap in the
    past.

    To prevent these problems in future, split the newlink and changelink code
    paths. Split the ERSPAN code out of ipgre_netlink_parms() into a new
    function erspan_netlink_parms(). Extract a piece of common logic from
    ipgre_newlink() and ipgre_changelink() into ipgre_newlink_encap_setup().
    Add erspan_newlink() and erspan_changelink().

    Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Petr Machata
     
  • [ Upstream commit 5d765a5e4bd7c368e564e11402bba74cf7f03ac1 ]

    If ring counts are not reset when ring reservation fails,
    bnxt_init_dflt_ring_mode() will not be called again to reinitialise
    IRQs when open() is called. This results in a system crash because NAPI
    will not be initialised either. This patch fixes it by resetting the ring
    counts.

    Fixes: 47558acd56a7 ("bnxt_en: Reserve rings at driver open if none was reserved at probe time.")
    Signed-off-by: Vasundhara Volam
    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vasundhara Volam
     
  • [ Upstream commit 62bfb932a51f6d08eb409248e69f8d6428c2cabd ]

    Other shutdown code paths will always disable PCI first to shut down DMA
    before freeing context memory. Do the same sequence in the error path
    of probe to be safe and consistent.

    Fixes: c20dc142dd7b ("bnxt_en: Disable bus master during PCI shutdown and driver unload.")
    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michael Chan
     
  • [ Upstream commit 0b5b561cea32d5bb1e0a82d65b755a3cb5212141 ]

    The current code ignores the return value from
    bnxt_hwrm_func_backing_store_cfg(), causing the driver to proceed in
    the init path even when this vital firmware call has failed. Fix it
    by propagating the error code to the caller.

    Fixes: 1b9394e5a2ad ("bnxt_en: Configure context memory on new devices.")
    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michael Chan
     
  • [ Upstream commit 62d4073e86e62e316bea2c53e77db10418fd5dd7 ]

    The allocated ieee_ets structure goes out of scope without being freed,
    leaking memory. Appropriate result codes should be returned so that
    callers do not rely on invalid data passed by reference.

    Also cache the ETS config retrieved from the device so that it doesn't
    need to be freed. The balance of the code was clearly written with the
    intent of having the results of querying the hardware cached in the
    device structure. The commensurate store was evidently missed though.

    Fixes: 7df4ae9fe855 ("bnxt_en: Implement DCBNL to support host-based DCBX.")
    Signed-off-by: Edwin Peer
    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Edwin Peer
     
  • [ Upstream commit a24ec3220f369aa0b94c863b6b310685a727151c ]

    There is an indexing bug in determining these ethtool priority
    counters. Instead of using the queue ID to index, we need to
    normalize by modulo 10 to get the index. This index is then used
    to obtain the proper CoS queue counter. Rename bp->pri2cos to
    bp->pri2cos_idx to make this more clear.
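
    A minimal sketch of the normalization described above, assuming eight
    priorities and an array of per-CoS-queue counters. NUM_PRI,
    build_pri2cos_idx() and pri_counter() are hypothetical names, not the
    driver's functions.

```c
#include <stdint.h>

#define NUM_PRI 8            /* hypothetical: number of priorities */
#define QUEUE_ID_MODULUS 10  /* the modulo named in the commit message */

/* Derive, for each priority, the index into the CoS counter array. */
void build_pri2cos_idx(const uint8_t queue_id[NUM_PRI],
                       uint8_t pri2cos_idx[NUM_PRI])
{
    for (int pri = 0; pri < NUM_PRI; pri++)
        pri2cos_idx[pri] = queue_id[pri] % QUEUE_ID_MODULUS;
}

/* Look up the counter for one priority through the normalized index. */
uint64_t pri_counter(const uint64_t *cos_counters,
                     const uint8_t pri2cos_idx[NUM_PRI], int pri)
{
    return cos_counters[pri2cos_idx[pri]];
}
```

    Indexing with the raw queue ID (for example 11) would read past a
    ten-entry counter array; the normalized index (1) stays in range.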

    Fixes: e37fed790335 ("bnxt_en: Add ethtool -S priority counters.")
    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michael Chan
     
  • [ Upstream commit 384d91c267e621e0926062cfb3f20cb72dc16928 ]

    gro_cells_init() returns an error if memory allocation fails.
    But the vxlan module doesn't check the return value of gro_cells_init().

    Fixes: 58ce31cca1ff ("vxlan: GRO support at tunnel layer")
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • [ Upstream commit 6cd6cbf593bfa3ae6fc3ed34ac21da4d35045425 ]

    When an application uses the TCP_QUEUE_SEQ socket option to
    change tp->rcv_nxt, we must also update tp->copied_seq.

    Otherwise, stuff relying on tcp_inq() being precise can
    eventually be confused.

    For example, tcp_zerocopy_receive() might crash because
    it does not expect tcp_recv_skb() to return NULL.

    We could add tests in various places to fix the issue,
    or simply make sure tcp_inq() won't return a random value,
    and leave fast path as it is.

    Note that this fixes ioctl(fd, SIOCINQ, &val) at the same
    time.
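
    Why both sequence numbers must move together can be shown with a
    reduced model, assuming the queued byte count is the difference
    between the next expected sequence number and the first unread one.
    struct rx_seq and the helper names are hypothetical; this is not the
    kernel's tcp_inq() itself.

```c
#include <stdint.h>

/* Hypothetical, heavily reduced model of two receive-side sequence numbers. */
struct rx_seq {
    uint32_t rcv_nxt;     /* next sequence number expected from the peer */
    uint32_t copied_seq;  /* first sequence number not yet read by the app */
};

/* Bytes queued for reading; unsigned subtraction handles wraparound. */
uint32_t bytes_in_queue(const struct rx_seq *s)
{
    return s->rcv_nxt - s->copied_seq;
}

/* Buggy repair: only rcv_nxt moves, so the queue length becomes arbitrary. */
void set_queue_seq_buggy(struct rx_seq *s, uint32_t seq)
{
    s->rcv_nxt = seq;
}

/* Fixed repair: both numbers move together, so an empty queue stays empty. */
void set_queue_seq_fixed(struct rx_seq *s, uint32_t seq)
{
    s->rcv_nxt = seq;
    s->copied_seq = seq;
}
```

    Starting from an empty queue at sequence 1000 and repairing to 5000,
    the buggy variant reports 4000 queued bytes that do not exist, while
    the fixed variant still reports 0.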

    Fixes: ee9952831cfd ("tcp: Initial repair mode")
    Fixes: 05255b823a61 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit b738a185beaab8728943acdb3e67371b8a88185e ]

    skb->rbnode is sharing three skb fields : next, prev, dev

    When a packet is sent, TCP keeps the original skb (master)
    in a rtx queue, which was converted to rbtree a while back.

    __tcp_transmit_skb() is responsible for cloning the master skb
    and adding the TCP header to the clone before sending it
    to the network layer.

    skb_clone() already clears skb->next and skb->prev, but copies
    the master oskb->dev into the clone.

    We need to clear skb->dev, otherwise lower layers could interpret
    the value as a pointer to a netdev.
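
    The aliasing can be illustrated with a reduced union, assuming that
    pointers and unsigned long have the same size so that the third word
    of the tree node overlays dev. struct mini_pkt and struct mini_rb_node
    are hypothetical stand-ins, not the real sk_buff layout.

```c
#include <stddef.h>

/* Hypothetical reduced rb-tree node: three pointer-sized words. */
struct mini_rb_node {
    unsigned long parent_color;
    struct mini_rb_node *right;
    struct mini_rb_node *left;
};

/* Hypothetical reduced packet: list view and tree view share storage. */
struct mini_pkt {
    union {
        struct {
            struct mini_pkt *next;
            struct mini_pkt *prev;
            void *dev;
        } list;
        struct mini_rb_node rbnode;
    } u;
};

/* Copy the original like a clone that resets only next and prev. */
struct mini_pkt clone_keep_dev(const struct mini_pkt *orig)
{
    struct mini_pkt c = *orig;

    c.u.list.next = NULL;
    c.u.list.prev = NULL;
    return c;
}

/* Same, and also reset dev before the clone leaves the sender. */
struct mini_pkt clone_clear_dev(const struct mini_pkt *orig)
{
    struct mini_pkt c = clone_keep_dev(orig);

    c.u.list.dev = NULL;
    return c;
}
```

    If the original sits in a tree, its left-child pointer shows up in the
    first clone as a bogus dev value; the second clone carries NULL.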

    This old bug surfaced recently when commit 28f8bfd1ac94
    ("netfilter: Support iif matches in POSTROUTING") was merged.

    Before this netfilter commit, skb->dev value was ignored and
    changed before reaching dev_queue_xmit()

    Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
    Fixes: 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING")
    Signed-off-by: Eric Dumazet
    Reported-by: Martin Zaharinov
    Cc: Florian Westphal
    Cc: Pablo Neira Ayuso
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 07f8e4d0fddbf2f87e4cefb551278abc38db8cdd ]

    In rare cases retransmit logic will make a full skb copy, which will not
    trigger the zeroing added in recent change
    b738a185beaa ("tcp: ensure skb->dev is NULL before leaving TCP stack").

    Cc: Eric Dumazet
    Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
    Fixes: 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING")
    Signed-off-by: Florian Westphal
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     
  • [ Upstream commit 2091a3d42b4f339eaeed11228e0cbe9d4f92f558 ]

    As the comment above netdev_run_todo() describes, we cannot call
    free_netdev() before rtnl_unlock(). Fix it by reordering the code.

    This patch is a 1:1 copy of upstream slip.c commit f596c87005f7
    ("slip: not call free_netdev before rtnl_unlock in slip_open").

    Reported-by: yangerkun
    Signed-off-by: Oliver Hartkopp
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Oliver Hartkopp
     
  • [ Upstream commit f13bc68131b0c0d67a77fb43444e109828a983bf ]

    The original change fixed an issue on RTL8168b by mimicking the vendor
    driver behavior to disable MSI on chip versions before RTL8168d.
    This however now caused an issue on a system with RTL8168c, see [0].
    Therefore leave MSI disabled on RTL8168b, but re-enable it on RTL8168c.

    [0] https://bugzilla.redhat.com/show_bug.cgi?id=1792839

    Fixes: 003bd5b4a7b4 ("r8169: don't use MSI before RTL8168d")
    Signed-off-by: Heiner Kallweit
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Heiner Kallweit
     
  • [ Upstream commit 0dcdf9f64028ec3b75db6b691560f8286f3898bf ]

    The nci_conn_max_data_pkt_payload_size() function sometimes returns
    -EPROTO so "max_size" needs to be signed for the error handling to
    work. We can make "payload_size" an int as well.
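
    The effect of the variable's signedness can be reproduced in a few
    lines. query_max_size() is a hypothetical stand-in for a helper that
    returns either a size or a negative errno value; the two checkers
    differ only in the type used to store the result.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: returns a payload size, or a negative errno value. */
int query_max_size(int conn_valid)
{
    return conn_valid ? 255 : -EPROTO;
}

/*
 * Unsigned storage: -EPROTO wraps to a huge positive value, so the error
 * check never fires. Returns 1 if an error was detected, 0 otherwise.
 */
int detects_error_unsigned(int conn_valid)
{
    size_t max_size = (size_t)query_max_size(conn_valid);

    return max_size <= 0;
}

/* Signed storage: the negative value survives and the check works. */
int detects_error_signed(int conn_valid)
{
    int max_size = query_max_size(conn_valid);

    return max_size <= 0;
}
```

    For an invalid connection, detects_error_unsigned() returns 0 (the
    error is missed) and detects_error_signed() returns 1.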

    Fixes: a06347c04c13 ("NFC: Add Intel Fields Peak NFC solution driver")
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • [ Upstream commit 9de9aa487daff7a5c73434c24269b44ed6a428e6 ]

    Make sure we clean up devicetree related configuration
    also when clock init fails.

    Fixes: fecd4d7eef8b ("net: stmmac: dwmac-rk: Add integrated PHY support")
    Signed-off-by: Emil Renner Berthing
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Emil Renner Berthing
     
  • [ Upstream commit 0d1c3530e1bd38382edef72591b78e877e0edcd3 ]

    In commit 599be01ee567 ("net_sched: fix an OOB access in cls_tcindex")
    I moved cp->hash calculation before the first
    tcindex_alloc_perfect_hash(), but cp->alloc_hash is left untouched.
    This difference could lead to another out of bound access.

    cp->alloc_hash should always be the size allocated, we should
    update it after this tcindex_alloc_perfect_hash().

    Reported-and-tested-by: syzbot+dcc34d54d68ef7d2d53d@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+c72da7b9ed57cde6fca2@syzkaller.appspotmail.com
    Fixes: 599be01ee567 ("net_sched: fix an OOB access in cls_tcindex")
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit b1be2e8cd290f620777bfdb8aa00890cd2fa02b5 ]

    syzbot reported a use-after-free in tcindex_dump(). This is due to
    the lack of RTNL in the deferred rcu work. We queue this work with
    RTNL in tcindex_change(), later, tcindex_dump() is called:

    fh = tp->ops->get(tp, t->tcm_handle);
    ...
    err = tp->ops->change(..., &fh, ...);
    tfilter_notify(..., fh, ...);

    but there is nothing to serialize the pending
    tcindex_partial_destroy_work() with tcindex_dump().

    Fix this by simply holding RTNL in tcindex_partial_destroy_work(),
    so that it won't be called until RTNL is released after
    tc_new_tfilter() is completed.

    Reported-and-tested-by: syzbot+653090db2562495901dc@syzkaller.appspotmail.com
    Fixes: 3d210534cc93 ("net_sched: fix a race condition in tcindex_destroy()")
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit ef299cc3fa1a9e1288665a9fdc8bff55629fd359 ]

    route4_change() allocates a new filter and copies values from
    the old one. After the new filter is inserted into the hash
    table, the old filter should be removed and freed, as the final
    step of the update.

    However, the current code mistakenly removes the new one. This
    looks apparently wrong to me, and it causes double "free" and
    use-after-free too, as reported by syzbot.

    Reported-and-tested-by: syzbot+f9b32aaacd60305d9687@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+2f8c233f131943d6056d@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+9c2df9fd5e9445b74e01@syzkaller.appspotmail.com
    Fixes: 1109c00547fc ("net: sched: RCU cls_route")
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Cc: John Fastabend
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit dd2af10402684cb5840a127caec9e7cdcff6d167 ]

    Currently, on replace, the previous action instance params
    is swapped with a newly allocated params. The old params is
    only freed (via kfree_rcu), without releasing the allocated
    ct zone template related to it.

    Call tcf_ct_params_free (via call_rcu) for the old params,
    so it will release it.

    Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
    Signed-off-by: Paul Blakey
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paul Blakey
     
  • [ Upstream commit 12a5ba5a1994568d4ceaff9e78c6b0329d953386 ]

    ASKEY WWHC050 is an mPCIe LTE modem.
    The OEM configuration states:

    T: Bus=01 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#= 2 Spd=480 MxCh= 0
    D: Ver= 2.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1
    P: Vendor=1690 ProdID=7588 Rev=ff.ff
    S: Manufacturer=Android
    S: Product=Android
    S: SerialNumber=813f0eef6e6e
    C:* #Ifs= 6 Cfg#= 1 Atr=80 MxPwr=500mA
    I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
    E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=(none)
    E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    E: Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option
    E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=32ms
    E: Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    E: Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    I:* If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option
    E: Ad=86(I) Atr=03(Int.) MxPS= 10 Ivl=32ms
    E: Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    E: Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    I:* If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
    E: Ad=88(I) Atr=03(Int.) MxPS= 8 Ivl=32ms
    E: Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    E: Ad=05(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    I:* If#= 5 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=(none)
    E: Ad=89(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
    E: Ad=06(O) Atr=02(Bulk) MxPS= 512 Ivl=125us

    Tested on openwrt distribution.

    Signed-off-by: Cezary Jackiewicz
    Signed-off-by: Pawel Dembicki
    Acked-by: Bjørn Mork
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Pawel Dembicki
     
  • [ Upstream commit 872307abbd0d9afd72171929806c2fa33dc34179 ]

    Check clk_prepare_enable() return value.

    Fixes: 2c7230446bc9 ("net: phy: Add pm support to Broadcom iProc mdio mux driver")
    Signed-off-by: Rayagonda Kokatanur
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Rayagonda Kokatanur
     
  • [ Upstream commit c312c7818b86b663d32ec5d4b512abf06b23899a ]

    The DT binding for this PHY describes an *optional* clock property.
    Due to a bug in the error handling logic, we are actually ignoring this
    clock *all* of the time so far.

    Fix this by using devm_clk_get_optional() to handle this clock properly.

    Fixes: b78ac6ecd1b6b ("net: phy: mdio-bcm-unimac: Allow configuring MDIO clock divider")
    Signed-off-by: Andre Przywara
    Reviewed-by: Andrew Lunn
    Acked-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andre Przywara
     
  • [ Upstream commit 749f6f6843115b424680f1aada3c0dd613ad807c ]

    When the DP83867 PHY is strapped to enable the Fast Link Drop (FLD) feature
    STRAP_STS2.STRAP_FLD (reg 0x006F bit 10), the Energy Lost Threshold for
    FLD Energy Lost Mode FLD_THR_CFG.ENERGY_LOST_FLD_THR (reg 0x002e bits 2:0)
    will be defaulted to 0x2. This may cause the phy link to be unstable. The
    new DP83867 DM recommends to always restore ENERGY_LOST_FLD_THR to 0x1.

    Hence, restore default value of FLD_THR_CFG.ENERGY_LOST_FLD_THR to 0x1 when
    FLD is enabled by bootstrapping as recommended by DM.

    Signed-off-by: Grygorii Strashko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Grygorii Strashko
     
  • [ Upstream commit 61fad6816fc10fb8793a925d5c1256d1c3db0cd2 ]

    PACKET_RX_RING can cause multiple writers to access the same slot if a
    fast writer wraps the ring while a slow writer is still copying. This
    is particularly likely with few, large, slots (e.g., GSO packets).

    Synchronize kernel thread ownership of rx ring slots with a bitmap.

    Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
    while holding the sk receive queue lock. They release this lock before
    copying and set tp_status to TP_STATUS_USER to release to userspace
    when done. During copying, another writer may take the lock, also see
    TP_STATUS_KERNEL, and start writing to the same slot.

    Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
    slot, test and set with the lock held. To release race-free, update
    tp_status and owner bit as a transaction, so take the lock again.
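
    A minimal userspace sketch of the acquire and release protocol just
    described, assuming a pthread mutex in place of the receive queue lock
    and a 32-bit word as the owner bitmap. struct rx_ring, slot_acquire()
    and slot_release() are hypothetical names, not the af_packet code.

```c
#include <pthread.h>
#include <stdint.h>

#define RING_SLOTS 8
enum { STATUS_KERNEL = 0, STATUS_USER = 1 };

/* Hypothetical reduced ring: one status word per slot plus an owner bitmap. */
struct rx_ring {
    pthread_mutex_t lock;
    uint32_t owner_map;          /* bit i set: a writer is filling slot i */
    int status[RING_SLOTS];
};

void ring_init(struct rx_ring *r)
{
    pthread_mutex_init(&r->lock, NULL);
    r->owner_map = 0;
    for (int i = 0; i < RING_SLOTS; i++)
        r->status[i] = STATUS_KERNEL;
}

/*
 * Claim a slot: check the status and test-and-set the owner bit while
 * holding the lock. Returns 0 on success, -1 if the slot is unavailable.
 */
int slot_acquire(struct rx_ring *r, unsigned int slot)
{
    int ret = -1;

    pthread_mutex_lock(&r->lock);
    if (r->status[slot] == STATUS_KERNEL &&
        !(r->owner_map & (1u << slot))) {
        r->owner_map |= 1u << slot;
        ret = 0;
    }
    pthread_mutex_unlock(&r->lock);
    return ret;
}

/* Hand the slot to the reader: status and owner bit change together. */
void slot_release(struct rx_ring *r, unsigned int slot)
{
    pthread_mutex_lock(&r->lock);
    r->status[slot] = STATUS_USER;
    r->owner_map &= ~(1u << slot);
    pthread_mutex_unlock(&r->lock);
}
```

    The copy into the slot happens between slot_acquire() and
    slot_release() with the lock dropped; a second writer that wraps the
    ring in the meantime finds the owner bit set and cannot take the slot.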

    This is one of a variety of discussed options (see Link below):

    * instead of a shadow ring, embed the data in the slot itself, such as
    in tp_padding. But any test for this field may match a value left by
    userspace, causing deadlock.

    * avoid the lock on release. This leaves a small race if releasing the
    shadow slot before setting TP_STATUS_USER. The below reproducer showed
    that this race is not academic. If releasing the slot after tp_status,
    the race is more subtle. See the first link for details.

    * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
    of two fields. But, legacy applications may interpret all non-zero
    tp_status as owned by the user. As libpcap does. So this is possible
    only opt-in by newer processes. It can be added as an optional mode.

    * embed the struct at the tail of pg_vec to avoid extra allocation.
    The implementation proved no less complex than a separate field.

    The additional locking cost on release adds contention, no different
    than scaling on multicore or multiqueue h/w. In practice, neither the
    reproducer below nor small-packet tcpdump showed a noticeable change in
    perf report in cycles spent in spinlock. Where contention is
    problematic, packet sockets support mitigation through PACKET_FANOUT.
    And we can consider adding opt-in state TP_KERNEL_OWNED.

    Easy to reproduce by running multiple netperf or similar TCP_STREAM
    flows concurrently with `tcpdump -B 129 -n greater 60000`.

    Based on an earlier patchset by Jon Rosen. See links below.

    I believe this issue goes back to the introduction of tpacket_rcv,
    which predates git history.

    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
    Suggested-by: Jon Rosen
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Jon Rosen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit 065fd83e1be2e1ba0d446a257fd86a3cc7bddb51 ]

    For the case where the last mvneta_poll did not process all
    RX packets, we need to xor the pp->cause_rx_tx or port->cause_rx_tx
    before calculating the rx_queue.

    Fixes: 2dcf75e2793c ("net: mvneta: Associate RX queues with each CPU")
    Signed-off-by: Jisheng Zhang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jisheng Zhang
     
  • [ Upstream commit 428c491332bca498c8eb2127669af51506c346c7 ]

    Currently ENA only provides the PCI remove() handler, used during rmmod
    for example. This is not called on shutdown/kexec path; we are potentially
    creating a failure scenario on kexec:

    (a) Kexec is triggered, no shutdown() / remove() handler is called for ENA;
    instead pci_device_shutdown() clears the master bit of the PCI device,
    stopping all DMA transactions;

    (b) Kexec reboot happens and the device gets enabled again, likely having
    its FW with that DMA transaction buffered; then it may trigger the (now
    invalid) memory operation in the new kernel, corrupting kernel memory area.

    This patch aims to prevent this, by implementing a shutdown() handler
    quite similar to the remove() one - the difference being the handling
    of the netdev, which is unregistered on remove(), but following the
    convention observed in other drivers, it's only detached on shutdown().

    This prevents an odd issue in AWS Nitro instances, in which after the 2nd
    kexec the next one will fail with an initrd corruption, caused by a wild
    DMA write to invalid kernel memory. The lspci output for the adapter
    present in my instance is:

    00:05.0 Ethernet controller [0200]: Amazon.com, Inc. Elastic Network
    Adapter (ENA) [1d0f:ec20]

    Suggested-by: Gavin Shan
    Signed-off-by: Guilherme G. Piccoli
    Acked-by: Sameeh Jubran
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Guilherme G. Piccoli
     
  • [ Upstream commit e80f40cbe4dd51371818e967d40da8fe305db5e4 ]

    Not only did this wheel not need reinventing, but there is also
    an issue with it: It doesn't remove the VLAN header in a way that
    preserves the L2 payload checksum when that is being provided by the DSA
    master hw. It should recalculate checksum both for the push, before
    removing the header, and for the pull afterwards. But the current
    implementation is quite dizzying, with pulls followed immediately
    afterwards by pushes, the memmove is done before the push, etc. This
    makes a DSA master with RX checksumming offload print stack traces
    with the infamous 'hw csum failure' message.

    So remove the dsa_8021q_remove_header function and replace it with
    something that actually works with inet checksumming.

    Fixes: d461933638ae ("net: dsa: tag_8021q: Create helper function for removing VLAN header")
    Signed-off-by: Vladimir Oltean
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vladimir Oltean
     
  • [ Upstream commit 22259471b51925353bd7b16f864c79fdd76e425e ]

    Andrew reported:

    After a number of network port link up/down changes, sometimes the switch
    port gets stuck in a state where it thinks it is still transmitting packets
    but the cpu port is not actually transmitting anymore. In this state you
    will see a message on the console
    "mtk_soc_eth 1e100000.ethernet eth0: transmit timed out" and the Tx counter
    in ifconfig will be incrementing on virtual port, but not incrementing on
    cpu port.

    The issue is that MAC TX/RX status has no impact on the link status or
    queue manager of the switch. So the queue manager just queues up packets
    of a disabled port and sends out pause frames when the queue is full.

    Change the LINK bit to reflect the link status.

    Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
    Reported-by: Andrew Smith
    Signed-off-by: René van Dorst
    Reviewed-by: Vivien Didelot
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    René van Dorst
     
  • [ Upstream commit 0e62f543bed03a64495bd2651d4fe1aa4bcb7fe5 ]

    When both the switch and the bridge are learning about new addresses,
    switch ports attached to the bridge would see duplicate ARP frames
    because both entities would attempt to send them.

    Fixes: 5037d532b83d ("net: dsa: add Broadcom tag RX/TX handler")
    Reported-by: Maxime Bizon
    Signed-off-by: Florian Fainelli
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Fainelli
     
  • [ Upstream commit 961d0e5b32946703125964f9f5b6321d60f4d706 ]

    Currently the software CBS does not consider the packet sending time
    when depleting the credits. It caused the throughput to be
    Idleslope[kbps] * (Port transmit rate[kbps] / |Sendslope[kbps]|) where
    Idleslope * (Port transmit rate / (Idleslope + |Sendslope|)) = Idleslope
    is expected. In order to fix the issue above, this patch takes the time
    when the packet sending completes into account by moving the anchor time
    variable "last" ahead to the send completion time upon transmission and
    adding wait when the next dequeue request comes before the send
    completion time of the previous packet.
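
    The two throughput expressions can be checked numerically. The sketch
    assumes the usual CBS relation sendslope = idleslope - port rate, so
    that idleslope + |sendslope| equals the port rate. The function names
    are hypothetical and all rates are in kbps.

```c
#include <stdint.h>

static int64_t abs64(int64_t v) { return v < 0 ? -v : v; }

/*
 * Throughput when credits are depleted without accounting for send time:
 * idleslope * port_rate / |sendslope|. Returns 0 if sendslope is 0.
 */
int64_t cbs_rate_without_send_time(int64_t idleslope, int64_t sendslope,
                                   int64_t port_rate)
{
    if (sendslope == 0)
        return 0;
    return idleslope * port_rate / abs64(sendslope);
}

/*
 * Expected throughput: idleslope * port_rate / (idleslope + |sendslope|).
 * Returns 0 if the denominator is 0.
 */
int64_t cbs_rate_expected(int64_t idleslope, int64_t sendslope,
                          int64_t port_rate)
{
    int64_t denom = idleslope + abs64(sendslope);

    if (denom == 0)
        return 0;
    return idleslope * port_rate / denom;
}
```

    For a 1,000,000 kbps port with idleslope 200,000 and sendslope
    -800,000, the first expression gives 250,000 kbps while the second
    gives the configured 200,000 kbps.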

    changelog:
    V2->V3:
    - remove unnecessary whitespace cleanup
    - add the checks if port_rate is 0 before division

    V1->V2:
    - combine variable "send_completed" into "last"
    - add the comment for estimate of the packet sending

    Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
    Signed-off-by: Zh-yuan Ye
    Reviewed-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Zh-yuan Ye
     
  • [ Upstream commit 13d0f7b814d9b4c67e60d8c2820c86ea181e7d99 ]

    The bpfilter UMH code was recently changed to log its informative messages to
    /dev/kmsg, however this interface doesn't support SEEK_CUR yet, used by
    dprintf(). As result dprintf() returns -EINVAL and doesn't log anything.

    Although supporting SEEK_CUR in the /dev/kmsg interface has been discussed
    in the past, those discussions were never concluded. Since the only
    in-kernel user of that interface from a userspace perspective is the
    bpfilter UMH (userspace) module, it is better to correct it here instead
    of waiting for a conclusion on the interface.
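
    One way to log to a descriptor without any seek is to format into a
    local buffer and emit it with a single write(). This is a sketch of
    that general technique, not necessarily the change made by this patch;
    log_line() is a hypothetical helper and the 256-byte buffer is an
    arbitrary choice.

```c
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Format into a local buffer, then emit the whole line with one write().
 * No seek is issued on fd. Returns the bytes written, or -1 on error.
 */
int log_line(int fd, const char *fmt, ...)
{
    char buf[256];
    va_list ap;
    int len;

    va_start(ap, fmt);
    len = vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    if (len < 0)
        return -1;
    if ((size_t)len >= sizeof(buf))
        len = (int)sizeof(buf) - 1;     /* output was truncated */
    return (int)write(fd, buf, (size_t)len);
}
```

    A caller would open the log device once and pass its descriptor, for
    example log_line(fd, "Started bpfilter\n").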

    Fixes: 36c4357c63f3 ("net: bpfilter: print umh messages to /dev/kmsg")
    Signed-off-by: Bruno Meneguele
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Bruno Meneguele
     
  • [ Upstream commit f6bf1bafdc2152bb22aff3a4e947f2441a1d49e2 ]

    list_for_each_entry_from_reverse() iterates backwards over the list from
    the current position, but in the error path we should start from the
    previous position.

    Fix this by using list_for_each_entry_continue_reverse() instead.
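
    The same rule, start the rollback at the previous position, can be
    shown with array indices instead of the kernel list macros. struct
    entry, apply_all() and the helpers are hypothetical names used only
    for this illustration.

```c
#include <stddef.h>

struct entry {
    int applied;      /* 1 once the entry has been set up */
    int should_fail;  /* test knob: make setup of this entry fail */
    int undo_calls;   /* how often the entry was torn down */
};

static int apply_one(struct entry *e)
{
    if (e->should_fail)
        return -1;
    e->applied = 1;
    return 0;
}

static void undo_one(struct entry *e)
{
    e->applied = 0;
    e->undo_calls++;
}

/* Set up all entries; on failure undo only those that were set up. */
int apply_all(struct entry *entries, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (apply_one(&entries[i]) != 0)
            goto rollback;
    return 0;

rollback:
    /* Start from the previous position: entry i was never applied. */
    while (i-- > 0)
        undo_one(&entries[i]);
    return -1;
}
```

    Starting the rollback at index i instead would tear down an entry
    whose setup failed, which is the mistake the iterator change avoids.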

    This suppresses the following error from coccinelle:

    drivers/net/ethernet/mellanox/mlxsw//spectrum_mr.c:655:34-38: ERROR:
    invalid reference to the index variable of the iterator on line 636

    Fixes: c011ec1bbfd6 ("mlxsw: spectrum: Add the multicast routing offloading logic")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit 6002059d7882c3512e6ac52fa82424272ddfcd5c ]

    During initialization the driver issues a software reset command and
    then waits for the system status to change back to "ready" state.

    However, before issuing the reset command the driver does not check that
    the system is actually in "ready" state. On Spectrum-{1,2} systems this
    was always the case as the hardware initialization time is very short.
    On Spectrum-3 systems this is no longer the case. This results in the
    software reset command timing-out and the driver failing to load:

    [ 6.347591] mlxsw_spectrum3 0000:06:00.0: Cmd exec timed-out (opcode=40(ACCESS_REG),opcode_mod=0,in_mod=0)
    [ 6.358382] mlxsw_spectrum3 0000:06:00.0: Reg cmd access failed (reg_id=9023(mrsr),type=write)
    [ 6.368028] mlxsw_spectrum3 0000:06:00.0: cannot register bus device
    [ 6.375274] mlxsw_spectrum3: probe of 0000:06:00.0 failed with error -110

    Fix this by waiting for the system to become ready both before issuing
    the reset command and afterwards. In case of failure, print the last
    system status to aid in debugging.
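
    The wait, reset, wait sequence can be sketched against a simulated
    device. struct fake_dev, the status values and the poll counts are
    hypothetical; a real driver would read a hardware register and sleep
    between polls.

```c
#include <stdbool.h>

enum { STATUS_BUSY = 0, STATUS_READY = 1 };

/* Hypothetical simulated device that needs some polls to become ready. */
struct fake_dev {
    int polls_until_ready;
    int resets;
    bool reset_while_busy;  /* set if a reset was issued too early */
};

static int read_status(struct fake_dev *d)
{
    if (d->polls_until_ready > 0) {
        d->polls_until_ready--;
        return STATUS_BUSY;
    }
    return STATUS_READY;
}

/* Poll up to max_polls times; report the last status seen to the caller. */
static int wait_ready(struct fake_dev *d, int max_polls, int *last_status)
{
    *last_status = STATUS_BUSY;
    for (int i = 0; i < max_polls; i++) {
        *last_status = read_status(d);
        if (*last_status == STATUS_READY)
            return 0;
    }
    return -1;
}

static void issue_reset(struct fake_dev *d)
{
    if (d->polls_until_ready > 0)
        d->reset_while_busy = true;
    d->resets++;
    d->polls_until_ready = 2;   /* the device is busy again after a reset */
}

/* Wait for ready, reset, then wait for ready again. */
int reset_device(struct fake_dev *d, int max_polls, int *last_status)
{
    if (wait_ready(d, max_polls, last_status) != 0)
        return -1;              /* caller can log *last_status */
    issue_reset(d);
    return wait_ready(d, max_polls, last_status);
}
```

    A device that is still initializing is given time to settle before the
    reset is issued, and a device that never becomes ready is reported
    together with its last status instead of being reset blindly.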

    Fixes: da382875c616 ("mlxsw: spectrum: Extend to support Spectrum-3 ASIC")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit b06d072ccc4b1acd0147b17914b7ad1caa1818bb ]

    Only attach macsec to ethernet devices.

    Syzbot was able to trigger a KMSAN warning in macsec_handle_frame
    by attaching to a phonet device.

    Macvlan has a similar check in macvlan_port_create.

    v1->v2
    - fix commit message typo

    Reported-by: syzbot
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit dddeb30bfc43926620f954266fd12c65a7206f07 ]

    The call chain

    inet_dump_fib()
      fib_table_dump()
        fn_trie_dump_leaf()
          hlist_for_each_entry_rcu()

    runs without rcu_read_lock(), which triggers a warning:

    WARNING: suspicious RCU usage
    -----------------------------
    net/ipv4/fib_trie.c:2216 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by ip/1923:
    #0: ffffffff8ce76e40 (rtnl_mutex){+.+.}, at: netlink_dump+0xd6/0x840

    Call Trace:
    dump_stack+0xa1/0xea
    lockdep_rcu_suspicious+0x103/0x10d
    fn_trie_dump_leaf+0x581/0x590
    fib_table_dump+0x15f/0x220
    inet_dump_fib+0x4ad/0x5d0
    netlink_dump+0x350/0x840
    __netlink_dump_start+0x315/0x3e0
    rtnetlink_rcv_msg+0x4d1/0x720
    netlink_rcv_skb+0xf0/0x220
    rtnetlink_rcv+0x15/0x20
    netlink_unicast+0x306/0x460
    netlink_sendmsg+0x44b/0x770
    __sys_sendto+0x259/0x270
    __x64_sys_sendto+0x80/0xa0
    do_syscall_64+0x69/0xf4
    entry_SYSCALL_64_after_hwframe+0x49/0xb3

    Fixes: 18a8021a7be3 ("net/ipv4: Plumb support for filtering route dumps")
    Signed-off-by: Qian Cai
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Qian Cai
     
  • [ Upstream commit 3a303cfdd28d5f930a307c82e8a9d996394d5ebd ]

    port->hsr is used in hsr_handle_frame(), which is the rx_handler
    callback.
    The hsr master and slave ports are initialized in hsr_add_port().
    This function initializes several pointers, including port->hsr, only
    after registering the rx_handler.
    So the rx_handler routine can run while port->hsr is still
    uninitialized.
    To fix this, the pointers are initialized before the rx_handler is
    registered.
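
    The ordering described above can be sketched in a few lines of plain C. This is a single-threaded userspace model with hypothetical `sketch_` names, so it shows only the order of the assignments, not the kernel's registration or synchronization machinery.

```c
#include <stddef.h>

struct sketch_hsr { int id; };

struct sketch_port {
    struct sketch_hsr *hsr;                         /* read by the handler */
    int (*rx_handler)(const struct sketch_port *);  /* NULL until registered */
};

/* Models hsr_handle_frame(): it dereferences port->hsr. */
static int sketch_handle_frame(const struct sketch_port *port)
{
    return port->hsr ? port->hsr->id : -1;  /* -1 marks the buggy case */
}

/* Fixed ordering: set every field first, register the handler last. */
static void sketch_add_port(struct sketch_port *port, struct sketch_hsr *hsr)
{
    port->hsr = hsr;
    port->rx_handler = sketch_handle_frame;
}
```

    Once the handler pointer is visible, every field it reads has already been set, so a frame arriving right after registration cannot observe a NULL port->hsr.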

    Test commands:
    ip netns del left
    ip netns del right
    modprobe -rv veth
    modprobe -rv hsr
    killall ping
    modprobe hsr
    ip netns add left
    ip netns add right
    ip link add veth0 type veth peer name veth1
    ip link add veth2 type veth peer name veth3
    ip link add veth4 type veth peer name veth5
    ip link set veth1 netns left
    ip link set veth3 netns right
    ip link set veth4 netns left
    ip link set veth5 netns right
    ip link set veth0 up
    ip link set veth2 up
    ip link set veth0 address fc:00:00:00:00:01
    ip link set veth2 address fc:00:00:00:00:02
    ip netns exec left ip link set veth1 up
    ip netns exec left ip link set veth4 up
    ip netns exec right ip link set veth3 up
    ip netns exec right ip link set veth5 up
    ip link add hsr0 type hsr slave1 veth0 slave2 veth2
    ip a a 192.168.100.1/24 dev hsr0
    ip link set hsr0 up
    ip netns exec left ip link add hsr1 type hsr slave1 veth1 slave2 veth4
    ip netns exec left ip a a 192.168.100.2/24 dev hsr1
    ip netns exec left ip link set hsr1 up
    ip netns exec left ip n a 192.168.100.1 dev hsr1 lladdr \
    fc:00:00:00:00:01 nud permanent
    ip netns exec left ip n r 192.168.100.1 dev hsr1 lladdr \
    fc:00:00:00:00:01 nud permanent
    for i in {1..100}
    do
    ip netns exec left ping 192.168.100.1 &
    done
    ip netns exec left hping3 192.168.100.1 -2 --flood &
    ip netns exec right ip link add hsr2 type hsr slave1 veth3 slave2 veth5
    ip netns exec right ip a a 192.168.100.3/24 dev hsr2
    ip netns exec right ip link set hsr2 up
    ip netns exec right ip n a 192.168.100.1 dev hsr2 lladdr \
    fc:00:00:00:00:02 nud permanent
    ip netns exec right ip n r 192.168.100.1 dev hsr2 lladdr \
    fc:00:00:00:00:02 nud permanent
    for i in {1..100}
    do
    ip netns exec right ping 192.168.100.1 &
    done
    ip netns exec right hping3 192.168.100.1 -2 --flood &
    while :
    do
    ip link add hsr0 type hsr slave1 veth0 slave2 veth2
    ip a a 192.168.100.1/24 dev hsr0
    ip link set hsr0 up
    ip link del hsr0
    done

    Splat looks like:
    [ 120.954938][ C0] general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1]I
    [ 120.957761][ C0] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
    [ 120.959064][ C0] CPU: 0 PID: 1511 Comm: hping3 Not tainted 5.6.0-rc5+ #460
    [ 120.960054][ C0] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    [ 120.962261][ C0] RIP: 0010:hsr_addr_is_self+0x65/0x2a0 [hsr]
    [ 120.963149][ C0] Code: 44 24 18 70 73 2f c0 48 c1 eb 03 48 8d 04 13 c7 00 f1 f1 f1 f1 c7 40 04 00 f2 f2 f2 4
    [ 120.966277][ C0] RSP: 0018:ffff8880d9c09af0 EFLAGS: 00010206
    [ 120.967293][ C0] RAX: 0000000000000006 RBX: 1ffff1101b38135f RCX: 0000000000000000
    [ 120.968516][ C0] RDX: dffffc0000000000 RSI: ffff8880d17cb208 RDI: 0000000000000000
    [ 120.969718][ C0] RBP: 0000000000000030 R08: ffffed101b3c0e3c R09: 0000000000000001
    [ 120.972203][ C0] R10: 0000000000000001 R11: ffffed101b3c0e3b R12: 0000000000000000
    [ 120.973379][ C0] R13: ffff8880aaf80100 R14: ffff8880aaf800f2 R15: ffff8880aaf80040
    [ 120.974410][ C0] FS: 00007f58e693f740(0000) GS:ffff8880d9c00000(0000) knlGS:0000000000000000
    [ 120.979794][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 120.980773][ C0] CR2: 00007ffcb8b38f29 CR3: 00000000afe8e001 CR4: 00000000000606f0
    [ 120.981945][ C0] Call Trace:
    [ 120.982411][ C0]
    [ 120.982848][ C0] ? hsr_add_node+0x8c0/0x8c0 [hsr]
    [ 120.983522][ C0] ? rcu_read_lock_held+0x90/0xa0
    [ 120.984159][ C0] ? rcu_read_lock_sched_held+0xc0/0xc0
    [ 120.984944][ C0] hsr_handle_frame+0x1db/0x4e0 [hsr]
    [ 120.985597][ C0] ? hsr_nl_nodedown+0x2b0/0x2b0 [hsr]
    [ 120.986289][ C0] __netif_receive_skb_core+0x6bf/0x3170
    [ 120.992513][ C0] ? check_chain_key+0x236/0x5d0
    [ 120.993223][ C0] ? do_xdp_generic+0x1460/0x1460
    [ 120.993875][ C0] ? register_lock_class+0x14d0/0x14d0
    [ 120.994609][ C0] ? __netif_receive_skb_one_core+0x8d/0x160
    [ 120.995377][ C0] __netif_receive_skb_one_core+0x8d/0x160
    [ 120.996204][ C0] ? __netif_receive_skb_core+0x3170/0x3170
    [ ... ]

    Reported-by: syzbot+fcf5dd39282ceb27108d@syzkaller.appspotmail.com
    Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • [ Upstream commit 0fda7600c2e174fe27e9cf02e78e345226e441fa ]

    The debug check must be done after the unregister_netdevice_many() call,
    because the list_del() it depends on is done inside .ndo_stop.

    Fixes: 2843a25348f8 ("geneve: speedup geneve tunnels dismantle")
    Reported-and-tested-by:
    Cc: Haishuang Yan
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal