11 Jul, 2020

2 commits

  • Pull networking fixes from David Miller:

    1) Restore previous behavior of CAP_SYS_ADMIN wrt loading networking
    BPF programs, from Maciej Żenczykowski.

    2) Fix dropped broadcasts in mac80211 code, from Seevalamuthu
    Mariappan.

    3) Slay memory leak in nl80211 bss color attribute parsing code, from
    Luca Coelho.

    4) Get route from skb properly in ip_route_use_hint(), from Miaohe Lin.

    5) Don't allow anything other than ARPHRD_ETHER in llc code, from Eric
    Dumazet.

    6) xsk code dips too deeply into DMA mapping implementation internals.
    Add dma_need_sync and use it. From Christoph Hellwig.

    7) Enforce power-of-2 for BPF ringbuf sizes. From Andrii Nakryiko.

    8) Check for disallowed attributes when loading flow dissector BPF
    programs. From Lorenz Bauer.

    9) Correct packet injection to L3 tunnel devices via AF_PACKET, from
    Jason A. Donenfeld.

    10) Don't advertise checksum offload on ipa devices that don't support
    it. From Alex Elder.

    11) Resolve several issues in TCP MD5 signature support. Missing memory
    barriers, bogus options emitted when using syncookies, and failure
    to allow md5 key changes in established states. All from Eric
    Dumazet.

    12) Fix interface leak in hsr code, from Taehee Yoo.

    13) VF reset fixes in hns3 driver, from Huazhong Tan.

    14) Make loopback work again with ipv6 anycast, from David Ahern.

    15) Fix TX starvation under high load in fec driver, from Tobias
    Waldekranz.

    16) MLD2 payload lengths not checked properly in bridge multicast code,
    from Linus Lüssing.

    17) Packet scheduler code that wants to find the inner protocol
    currently only works for one level of VLAN encapsulation. Allow
    Q-in-Q situations to work properly here, from Toke
    Høiland-Jørgensen.

    18) Fix route leak in l2tp, from Xin Long.

    19) Resolve conflict between the sk->sk_user_data usage of bpf reuseport
    support and various protocols. From Martin KaFai Lau.

    20) Fix socket cgroup v2 reference counting in some situations, from
    Cong Wang.

    21) Cure memory leak in mlx5 connection tracking offload support, from
    Eli Britstein.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
    mlxsw: pci: Fix use-after-free in case of failed devlink reload
    mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON()
    net: macb: fix call to pm_runtime in the suspend/resume functions
    net: macb: fix macb_suspend() by removing call to netif_carrier_off()
    net: macb: fix macb_get/set_wol() when moving to phylink
    net: macb: mark device wake capable when "magic-packet" property present
    net: macb: fix wakeup test in runtime suspend/resume routines
    bnxt_en: fix NULL dereference in case SR-IOV configuration fails
    libbpf: Fix libbpf hashmap on (I)LP32 architectures
    net/mlx5e: CT: Fix memory leak in cleanup
    net/mlx5e: Fix port buffers cell size value
    net/mlx5e: Fix 50G per lane indication
    net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash
    net/mlx5e: Fix VXLAN configuration restore after function reload
    net/mlx5e: Fix usage of rcu-protected pointer
    net/mxl5e: Verify that rpriv is not NULL
    net/mlx5: E-Switch, Fix vlan or qos setting in legacy mode
    net/mlx5: Fix eeprom support for SFP module
    cgroup: Fix sock_cgroup_data on big-endian.
    selftests: bpf: Fix detach from sockmap tests
    ...

    Linus Torvalds
     
  • Pull in-kernel read and write op cleanups from Christoph Hellwig:
    "Cleanup in-kernel read and write operations

    Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure
    all users of in-kernel file I/O use them if they don't use iov_iter
    based methods already.

    The new WARN_ONs in combination with syzkaller already found a missing
    input validation in 9p. The fix should be on your way through the
    maintainer ASAP".

    [ This is prep-work for the real changes coming in 5.9 ]

    * tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc:
    fs: remove __vfs_read
    fs: implement kernel_read using __kernel_read
    integrity/ima: switch to using __kernel_read
    fs: add a __kernel_read helper
    fs: remove __vfs_write
    fs: implement kernel_write using __kernel_write
    fs: check FMODE_WRITE in __kernel_write
    fs: unexport __kernel_write
    bpfilter: switch to kernel_write
    autofs: switch to kernel_write
    cachefiles: switch to kernel_write

    Linus Torvalds
     

10 Jul, 2020

4 commits

  • Pull kallsyms fix from Kees Cook:
    "Refactor kallsyms_show_value() users for correct cred.

    I'm not delighted by the timing of getting these changes to you, but
    it does fix a handful of kernel address exposures, and no one has
    screamed yet at the patches.

    Several users of kallsyms_show_value() were performing checks not
    during "open". Refactor everything needed to gain proper checks
    against file->f_cred for modules, kprobes, and bpf"

    * tag 'kallsyms_show_value-v5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    selftests: kmod: Add module address visibility test
    bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok()
    kprobes: Do not expose probe addresses to non-CAP_SYSLOG
    module: Do not expose section addresses to non-CAP_SYSLOG
    module: Refactor section attr into bin attribute
    kallsyms: Refactor kallsyms_show_value() to take cred

    Linus Torvalds
     
  • syzkaller found its way into setsockopt with TCP_CONGESTION "cdg".
    tcp_cdg_init() does a kcalloc to store the gradients. As sk_clone_lock
    just copies all the memory, the allocated pointer will be copied as
    well, if the app called setsockopt(..., TCP_CONGESTION) on the listener.
    If the socket is then destroyed before congestion control has been
    properly initialized (through a call to tcp_init_transfer), we
    will end up freeing memory that does not belong to that particular
    socket, opening the door to a double-free:

    [ 11.413102] ==================================================================
    [ 11.414181] BUG: KASAN: double-free or invalid-free in tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.415329]
    [ 11.415560] CPU: 3 PID: 4884 Comm: syz-executor.5 Not tainted 5.8.0-rc2 #80
    [ 11.416544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
    [ 11.418148] Call Trace:
    [ 11.418534]
    [ 11.418834] dump_stack+0x7d/0xb0
    [ 11.419297] print_address_description.constprop.0+0x1a/0x210
    [ 11.422079] kasan_report_invalid_free+0x51/0x80
    [ 11.423433] __kasan_slab_free+0x15e/0x170
    [ 11.424761] kfree+0x8c/0x230
    [ 11.425157] tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.425872] tcp_v4_destroy_sock+0x57/0x5a0
    [ 11.426493] inet_csk_destroy_sock+0x153/0x2c0
    [ 11.427093] tcp_v4_syn_recv_sock+0xb29/0x1100
    [ 11.427731] tcp_get_cookie_sock+0xc3/0x4a0
    [ 11.429457] cookie_v4_check+0x13d0/0x2500
    [ 11.433189] tcp_v4_do_rcv+0x60e/0x780
    [ 11.433727] tcp_v4_rcv+0x2869/0x2e10
    [ 11.437143] ip_protocol_deliver_rcu+0x23/0x190
    [ 11.437810] ip_local_deliver+0x294/0x350
    [ 11.439566] __netif_receive_skb_one_core+0x15d/0x1a0
    [ 11.441995] process_backlog+0x1b1/0x6b0
    [ 11.443148] net_rx_action+0x37e/0xc40
    [ 11.445361] __do_softirq+0x18c/0x61a
    [ 11.445881] asm_call_on_stack+0x12/0x20
    [ 11.446409]
    [ 11.446716] do_softirq_own_stack+0x34/0x40
    [ 11.447259] do_softirq.part.0+0x26/0x30
    [ 11.447827] __local_bh_enable_ip+0x46/0x50
    [ 11.448406] ip_finish_output2+0x60f/0x1bc0
    [ 11.450109] __ip_queue_xmit+0x71c/0x1b60
    [ 11.451861] __tcp_transmit_skb+0x1727/0x3bb0
    [ 11.453789] tcp_rcv_state_process+0x3070/0x4d3a
    [ 11.456810] tcp_v4_do_rcv+0x2ad/0x780
    [ 11.457995] __release_sock+0x14b/0x2c0
    [ 11.458529] release_sock+0x4a/0x170
    [ 11.459005] __inet_stream_connect+0x467/0xc80
    [ 11.461435] inet_stream_connect+0x4e/0xa0
    [ 11.462043] __sys_connect+0x204/0x270
    [ 11.465515] __x64_sys_connect+0x6a/0xb0
    [ 11.466088] do_syscall_64+0x3e/0x70
    [ 11.466617] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 11.467341] RIP: 0033:0x7f56046dc469
    [ 11.467844] Code: Bad RIP value.
    [ 11.468282] RSP: 002b:00007f5604dccdd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
    [ 11.469326] RAX: ffffffffffffffda RBX: 000000000068bf00 RCX: 00007f56046dc469
    [ 11.470379] RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000004
    [ 11.471311] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
    [ 11.472286] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [ 11.473341] R13: 000000000041427c R14: 00007f5604dcd5c0 R15: 0000000000000003
    [ 11.474321]
    [ 11.474527] Allocated by task 4884:
    [ 11.475031] save_stack+0x1b/0x40
    [ 11.475548] __kasan_kmalloc.constprop.0+0xc2/0xd0
    [ 11.476182] tcp_cdg_init+0xf0/0x150
    [ 11.476744] tcp_init_congestion_control+0x9b/0x3a0
    [ 11.477435] tcp_set_congestion_control+0x270/0x32f
    [ 11.478088] do_tcp_setsockopt.isra.0+0x521/0x1a00
    [ 11.478744] __sys_setsockopt+0xff/0x1e0
    [ 11.479259] __x64_sys_setsockopt+0xb5/0x150
    [ 11.479895] do_syscall_64+0x3e/0x70
    [ 11.480395] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 11.481097]
    [ 11.481321] Freed by task 4872:
    [ 11.481783] save_stack+0x1b/0x40
    [ 11.482230] __kasan_slab_free+0x12c/0x170
    [ 11.482839] kfree+0x8c/0x230
    [ 11.483240] tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.483948] tcp_v4_destroy_sock+0x57/0x5a0
    [ 11.484502] inet_csk_destroy_sock+0x153/0x2c0
    [ 11.485144] tcp_close+0x932/0xfe0
    [ 11.485642] inet_release+0xc1/0x1c0
    [ 11.486131] __sock_release+0xc0/0x270
    [ 11.486697] sock_close+0xc/0x10
    [ 11.487145] __fput+0x277/0x780
    [ 11.487632] task_work_run+0xeb/0x180
    [ 11.488118] __prepare_exit_to_usermode+0x15a/0x160
    [ 11.488834] do_syscall_64+0x4a/0x70
    [ 11.489326] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Wei Wang fixed a part of these CDG-malloc issues with commit c12014440750
    ("tcp: memset ca_priv data to 0 properly").

    This patch fixes the listener scenario: we make sure that listeners
    setting the congestion control through setsockopt won't initialize it
    (thus CDG never allocates on listeners). For those who use AF_UNSPEC to
    reuse a socket, tcp_disconnect() is changed to clean up afterwards.
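
    The aliasing behind the double-free can be sketched in userspace C. This
    is a minimal model, not kernel code: struct sock_model and all function
    names are hypothetical stand-ins for the socket, the setsockopt() path and
    sk_clone_lock().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a TCP socket; not the kernel's struct sock. */
struct sock_model {
    int listening;      /* 1 if the socket is a listener */
    int *gradients;     /* private congestion-control allocation, or NULL */
};

/* Pre-fix behaviour: always initialise congestion control, even on a listener. */
static void set_cc_buggy(struct sock_model *sk)
{
    sk->gradients = calloc(8, sizeof(*sk->gradients));
}

/* Post-fix behaviour: a listener only records the choice; nothing is allocated. */
static void set_cc_fixed(struct sock_model *sk)
{
    if (!sk->listening)
        sk->gradients = calloc(8, sizeof(*sk->gradients));
}

/* Like a raw memory copy of the parent: every pointer in it becomes shared. */
static struct sock_model clone_sock(const struct sock_model *parent)
{
    struct sock_model child;

    memcpy(&child, parent, sizeof(child));
    child.listening = 0;
    return child;
}
```

    With set_cc_buggy() the clone and the listener hold the same pointer, so
    destroying both frees it twice; with set_cc_fixed() the clone starts with
    no allocation at all.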

    (The issue can be reproduced at least down to v4.4.x.)

    Cc: Wei Wang
    Cc: Eric Dumazet
    Fixes: 2b0a8c9eee81 ("tcp: add CDG congestion control")
    Signed-off-by: Christoph Paasch
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Christoph Paasch
     
  • If the genlmsg_put() call in ethnl_default_dumpit() fails, we bail out
    without checking if we already have some messages in current skb like we do
    with ethnl_default_dump_one() failure later. Therefore if existing messages
    almost fill up the buffer so that there is not enough space even for the
    netlink and genetlink headers, we lose all prepared messages and return an
    error.

    Rather than duplicating the skb->len check, move the genlmsg_put(),
    genlmsg_cancel() and genlmsg_end() calls into ethnl_default_dump_one().
    This is also more logical as all message composition will be in
    ethnl_default_dump_one() and only iteration logic will be left in
    ethnl_default_dumpit().
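
    The resulting split between composing and iterating can be sketched as
    follows. This is a userspace model under stated assumptions: buf_model,
    HDR_LEN, dump_one() and dumpit() are hypothetical and only mimic the
    header put / cancel / end sequence and the "anything already queued?"
    check described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define HDR_LEN 20  /* assumed combined netlink + genetlink header size */

struct buf_model {
    size_t len;   /* bytes already used by complete messages */
    size_t cap;   /* total capacity */
};

/* Compose one whole message: header, payload, and rollback on failure. */
static int dump_one(struct buf_model *b, size_t payload)
{
    size_t start = b->len;

    if (b->cap - b->len < HDR_LEN)
        return -EMSGSIZE;            /* header does not fit */
    b->len += HDR_LEN;
    if (b->cap - b->len < payload) {
        b->len = start;              /* cancel the partial message */
        return -EMSGSIZE;
    }
    b->len += payload;
    return 0;
}

/* Iterate only: every failure goes through the same partial-dump check. */
static int dumpit(struct buf_model *b, const size_t *payloads, size_t n,
                  size_t *pos)
{
    while (*pos < n) {
        int ret = dump_one(b, payloads[*pos]);

        if (ret < 0)
            return b->len > 0 ? (int)b->len : ret;
        (*pos)++;
    }
    return (int)b->len;
}
```

    Because the header failure and the payload failure both surface as a
    return value of dump_one(), the iteration code needs only one place that
    decides between "return what we have" and "return the error".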

    Fixes: 728480f12442 ("ethtool: default handlers for GET requests")
    Reported-by: Jakub Kicinski
    Signed-off-by: Michal Kubecek
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Michal Kubecek
     
    When tcf_block_get() fails inside atm_tc_init(),
    atm_tc_put() is called to release the qdisc p->link.q.
    But flow->ref prevents it from doing so, as flow->ref
    is still zero.

    Fix this by moving the p->link.ref initialization before
    tcf_block_get().

    Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
    Reported-and-tested-by: syzbot+d411cff6ab29cc2c311b@syzkaller.appspotmail.com
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

09 Jul, 2020

8 commits

  • When evaluating access control over kallsyms visibility, credentials at
    open() time need to be used, not the "current" creds (though in BPF's
    case, this has likely always been the same). Plumb access to associated
    file->f_cred down through bpf_dump_raw_ok() and its callers now that
    kallsyms_show_value() has been refactored to take struct cred.
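
    The idea of checking the opener's credentials instead of the reader's can
    be sketched in plain C. Everything here is a simplified, hypothetical
    model: cred_model, file_model and the three functions are illustrative
    names, and the real visibility decision depends on more than a single
    capability flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel's cred and file structs. */
struct cred_model { bool cap_syslog; };
struct file_model { const struct cred_model *f_cred; };

/* open() pins the opener's credentials on the file. */
static struct file_model open_model(const struct cred_model *opener)
{
    struct file_model f = { .f_cred = opener };

    return f;
}

/* Decide visibility from an explicit cred rather than from "current". */
static bool show_value(const struct cred_model *cred)
{
    return cred->cap_syslog;
}

/* Read path: consult the creds captured at open(), whoever is reading now. */
static bool dump_raw_ok(const struct file_model *f)
{
    return show_value(f->f_cred);
}
```

    A file opened by an unprivileged task keeps hiding addresses even if the
    descriptor is later read by a privileged task, because the decision no
    longer depends on who performs the read.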

    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: bpf@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump")
    Signed-off-by: Kees Cook

    Kees Cook
     
  • A scenario has been observed where a 'bc_init' message for a link is not
    retransmitted if it fails to be received by the peer. This leads to the
    peer never establishing the link fully and it discarding all other data
    received on the link. In this scenario the message is lost in transit to
    the peer.

    The issue is traced to the 'nxt_retr' field of the skb not being
    initialised for links that aren't a bc_sndlink. As a result, the
    comparison in tipc_link_advance_transmq() that gates whether to attempt
    retransmission of a message behaves unpredictably.
    Depending on the relative value of 'jiffies', this comparison:
    time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr)
    may return true or false given that 'nxt_retr' remains at the
    uninitialised value of 0 for non bc_sndlinks.

    This is most noticeable shortly after boot when jiffies is initialised
    to a high value (to flush out rollover bugs) and we compare a jiffies of,
    say, 4294940189 to zero. In that case time_before returns 'true' leading
    to the skb not being retransmitted.

    The fix is to ensure that all skbs have a valid 'nxt_retr' time set for
    them and this is achieved by refactoring the setting of this value into
    a central function.
    With this fix, transmission losses of 'bc_init' messages do not stall
    the link establishment forever because the 'bc_init' message is
    retransmitted and the link eventually establishes correctly.
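
    The arithmetic can be checked with a small userspace sketch. It assumes a
    32-bit jiffies counter, which matches the example value of 4294940189;
    time_before32(), skb_cb_model, set_nxt_retr() and may_retransmit() are
    hypothetical names that model the wrap-safe comparison and the central
    helper described above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * 32-bit model of a wrap-safe "a is earlier than b" test. The cast of an
 * out-of-range value to int32_t is implementation-defined in C, but wraps
 * on the usual two's-complement compilers.
 */
static int time_before32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Hypothetical per-skb control block holding the next allowed retransmit time. */
struct skb_cb_model { uint32_t nxt_retr; };

/* Central helper in the spirit of the fix: always stamp a valid deadline. */
static void set_nxt_retr(struct skb_cb_model *cb, uint32_t now, uint32_t interval)
{
    cb->nxt_retr = now + interval;   /* unsigned wraparound is well defined */
}

/* Retransmit only once the deadline has been reached. */
static int may_retransmit(const struct skb_cb_model *cb, uint32_t now)
{
    return !time_before32(now, cb->nxt_retr);
}
```

    With nxt_retr left at 0 and jiffies at 4294940189, the difference is
    negative when read as a signed 32-bit value, so the message is treated as
    "not yet due" and is never retransmitted. Once a deadline is always
    stamped, the same comparison works across the counter wrap as well.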

    Fixes: 382f598fb66b ("tipc: reduce duplicate packets for unicast traffic")
    Acked-by: Jon Maloy
    Signed-off-by: Hamish Martin
    Signed-off-by: David S. Miller

    Hamish Martin
     
  • In the tx path of l2tp, l2tp_xmit_skb() calls skb_dst_set() to set
    skb's dst. However, it will eventually call inet6_csk_xmit() or
    ip_queue_xmit() where skb's dst will be overwritten by:

    skb_dst_set_noref(skb, dst);

    without releasing the old dst in skb. Then it causes dst/dev refcnt leak:

    unregister_netdevice: waiting for eth0 to become free. Usage count = 1

    This can be reproduced by simply running:

    # modprobe l2tp_eth && modprobe l2tp_ip
    # sh ./tools/testing/selftests/net/l2tp.sh

    So before going to inet6_csk_xmit() or ip_queue_xmit(), skb's dst
    should be dropped. This patch is to fix it by removing skb_dst_set()
    from l2tp_xmit_skb() and moving skb_dst_drop() into l2tp_xmit_core().
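
    The leak can be modelled with a simple reference counter. This is a
    hedged userspace sketch: dst_model, skb_model and every function below
    are hypothetical simplifications of the real dst handling, kept only
    detailed enough to show why overwriting without a drop leaks one
    reference.

```c
#include <assert.h>
#include <stddef.h>

struct dst_model { int refcnt; };

struct skb_model {
    struct dst_model *dst;
    int dst_refcounted;     /* 1 if the skb owns a reference on dst */
};

/* Release the skb's dst, dropping the reference only if the skb owns one. */
static void skb_dst_drop_model(struct skb_model *skb)
{
    if (skb->dst && skb->dst_refcounted)
        skb->dst->refcnt--;
    skb->dst = NULL;
    skb->dst_refcounted = 0;
}

/* Attach dst and take over a reference the caller already holds. */
static void skb_dst_set_model(struct skb_model *skb, struct dst_model *dst)
{
    skb->dst = dst;
    skb->dst_refcounted = 1;
}

/* Attach dst without a reference; the previous dst is NOT released. */
static void skb_dst_set_noref_model(struct skb_model *skb, struct dst_model *dst)
{
    skb->dst = dst;
    skb->dst_refcounted = 0;
}

/* Lower layer: overwrites the dst, transmits, then frees the skb's dst. */
static void lower_layer_xmit(struct skb_model *skb, struct dst_model *dst)
{
    skb_dst_set_noref_model(skb, dst);
    skb_dst_drop_model(skb);
}

/* Pre-fix: a held reference is attached, then silently overwritten. */
static void xmit_buggy(struct skb_model *skb, struct dst_model *dst)
{
    dst->refcnt++;
    skb_dst_set_model(skb, dst);
    lower_layer_xmit(skb, dst);
}

/* Post-fix: drop whatever the skb carries and let the lower layer set dst. */
static void xmit_fixed(struct skb_model *skb, struct dst_model *dst)
{
    skb_dst_drop_model(skb);
    lower_layer_xmit(skb, dst);
}
```

    In the buggy path the counter ends one higher than it started for every
    transmitted packet, which in the kernel keeps the device pinned.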

    Fixes: 3557baabf280 ("[L2TP]: PPP over L2TP driver core")
    Reported-by: Hangbin Liu
    Signed-off-by: Xin Long
    Acked-by: James Chapman
    Tested-by: James Chapman
    Signed-off-by: David S. Miller

    Xin Long
     
  • CLC proposal messages of future SMCD versions could be larger than SMCD
    V1 CLC proposal messages.
    To let SMC V1 tolerate such messages, the reception of CLC proposal
    messages is adapted:
    * accept larger length values in CLC proposal
    * check trailing eye catcher for incoming CLC proposal with V1 length only
    * receive the whole CLC proposal even in cases it does not fit into the
    V1 buffer

    Fixes: e7b7a64a8493d ("smc: support variable CLC proposal messages")
    Signed-off-by: Ursula Braun
    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller

    Ursula Braun
     
  • The similar smc_ib_devices spinlock has been converted to a mutex.
    Protecting the smcd_dev_list by a mutex is possible as well. This
    patch converts the smcd_dev_list spinlock to a mutex.

    Fixes: c6ba7c9ba43d ("net/smc: add base infrastructure for SMC-D and ISM")
    Signed-off-by: Ursula Braun
    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller

    Ursula Braun
     
  • Tests showed this BUG:
    [572555.252867] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
    [572555.252876] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 131031, name: smcapp
    [572555.252879] INFO: lockdep is turned off.
    [572555.252883] CPU: 1 PID: 131031 Comm: smcapp Tainted: G O 5.7.0-rc3uschi+ #356
    [572555.252885] Hardware name: IBM 3906 M03 703 (LPAR)
    [572555.252887] Call Trace:
    [572555.252896] show_stack+0x94/0xe8
    [572555.252901] dump_stack+0xa0/0xe0
    [572555.252906] ___might_sleep+0x260/0x280
    [572555.252910] __mutex_lock+0x48/0x940
    [572555.252912] mutex_lock_nested+0x32/0x40
    [572555.252975] mlx5_lag_get_roce_netdev+0x30/0xc0 [mlx5_core]
    [572555.252996] mlx5_ib_get_netdev+0x3a/0xe0 [mlx5_ib]
    [572555.253007] smc_pnet_find_roce_resource+0x1d8/0x310 [smc]
    [572555.253011] __smc_connect+0x1f0/0x3e0 [smc]
    [572555.253015] smc_connect+0x154/0x190 [smc]
    [572555.253022] __sys_connect+0x94/0xd0
    [572555.253025] __s390x_sys_socketcall+0x170/0x360
    [572555.253028] system_call+0x298/0x2b8
    [572555.253030] INFO: lockdep is turned off.

    Function smc_pnet_find_rdma_dev() might be called from
    smc_pnet_find_roce_resource(). It holds the smc_ib_devices list
    spinlock while calling infiniband op get_netdev(). At least for mlx5
    the get_netdev operation wants mutex serialization, which conflicts
    with the smc_ib_devices spinlock.
    This patch switches the smc_ib_devices spinlock into a mutex to
    allow sleeping when calling get_netdev().

    Fixes: a4cf0443c414 ("smc: introduce SMC as an IB-client")
    Signed-off-by: Ursula Braun
    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller

    Ursula Braun
     
    Wait for pending sends only when smc_switch_conns() found a link to move
    the connections to. Do not wait during link freeing, as this can lead to
    permanent hangs. Also refuse to provide a new tx slot on an
    unusable link.

    Fixes: c6f02ebeea3a ("net/smc: switch connections to alternate link")
    Reviewed-by: Ursula Braun
    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller

    Karsten Graul
     
  • There might be races in scenarios where both SMC link groups are on the
    same system. Prevent that by creating separate wait queues for LLC flows
    and messages. Switch to non-interruptible versions of wait_event() and
    wake_up() for the llc flow waiter to make sure the waiters get control
    sequentially. Fine tune the llc_flow_lock to include the assignment of
    the message. Write to the system log when an unexpected message is
    dropped. Also remove an extra indirection and use the existing local
    variable lgr in smc_llc_enqueue().

    Fixes: 555da9af827d ("net/smc: add event-based llc_flow framework")
    Reviewed-by: Ursula Braun
    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller

    Karsten Graul
     

08 Jul, 2020

7 commits

  • While pipes don't really need sb_writers protection, __kernel_write is an
    interface better kept private, and the additional rw_verify_area does not
    hurt here.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Commit e57f61858b7c ("net: bridge: mcast: fix stale nsrcs pointer in
    igmp3/mld2 report handling") introduced a bug in the IPv6 header payload
    length check which would potentially lead to rejecting a valid MLD2 Report:

    The check needs to take into account the 2 bytes for the "Number of
    Sources" field in the "Multicast Address Record" before reading it,
    not the size of a pointer to this field.
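
    The difference between the two sizes is easy to demonstrate. The sketch
    below is illustrative only: the function names are hypothetical and the
    offsets used with it are arbitrary numbers, not a real MLD2 layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buggy check: sizeof(nsrcs) is the size of a pointer (typically 4 or 8). */
static int nsrcs_readable_buggy(size_t payload_len, size_t nsrcs_offset)
{
    const uint16_t *nsrcs = NULL;   /* only its type matters to sizeof */

    return nsrcs_offset + sizeof(nsrcs) <= payload_len;
}

/* Fixed check: sizeof(*nsrcs) is the 2-byte field that will be read. */
static int nsrcs_readable_fixed(size_t payload_len, size_t nsrcs_offset)
{
    const uint16_t *nsrcs = NULL;   /* only its type matters to sizeof */

    return nsrcs_offset + sizeof(*nsrcs) <= payload_len;
}
```

    For a payload that ends exactly two bytes after the field's offset, the
    fixed check accepts the read while the buggy one demands a pointer's
    worth of bytes and rejects it.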

    Fixes: e57f61858b7c ("net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling")
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: Linus Lüssing
    Signed-off-by: David S. Miller

    Linus Lüssing
     
  • When tcf_ct_act() executes, tcf_lastuse_update() should be called;
    otherwise the "used" statistic is never updated:

    filter protocol ip pref 3 flower chain 0
    filter protocol ip pref 3 flower chain 0 handle 0x1
    eth_type ipv4
    dst_ip 1.1.1.1
    ip_flags frag/firstfrag
    skip_hw
    not_in_hw
    action order 1: ct zone 1 nat pipe
    index 1 ref 1 bind 1 installed 103 sec used 103 sec
    Action statistics:
    Sent 151500 bytes 101 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    cookie 4519c04dc64a1a295787aab13b6a50fb

    Signed-off-by: wenxu
    Signed-off-by: David S. Miller

    wenxu
     
  • RFC 8684 mandates that no-data DATA FIN packets should carry
    a DSS with 0 sequence number and data len equal to 1. Currently,
    on FIN retransmission we re-use the existing mapping; if the previous
    FIN transmission was part of a partially acked data packet, we could
    end up writing a non-compliant DSS in the egress packet.

    The above will be detected by a "Bad mapping" warning on the receiver
    side.

    This change addresses the issue by explicitly checking for a 0-length
    packet when adding the DATA_FIN option.
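
    A minimal sketch of that check follows. It is a simplified model:
    dss_model and write_data_fin() are hypothetical names, and it assumes the
    "0 sequence number" above refers to the subflow sequence number of the
    mapping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified DSS mapping; field names are illustrative. */
struct dss_model {
    int      use_map;       /* 1 if a mapping is already attached */
    int      data_fin;
    uint64_t data_seq;
    uint32_t subflow_seq;
    uint16_t data_len;
};

/* Add DATA_FIN to an outgoing packet carrying payload_len bytes of data. */
static void write_data_fin(struct dss_model *dss, uint32_t payload_len,
                           uint64_t fin_seq)
{
    if (!dss->use_map || payload_len == 0) {
        /* No payload: emit the dedicated mapping, never a stale one. */
        dss->use_map = 1;
        dss->data_fin = 1;
        dss->data_seq = fin_seq;
        dss->subflow_seq = 0;
        dss->data_len = 1;
    } else if (dss->data_seq + dss->data_len == fin_seq) {
        /* Final data mapping: DATA_FIN occupies one extra byte of it. */
        dss->data_fin = 1;
        dss->data_len++;
    }
}
```

    The payload_len test is what protects a retransmitted, data-less FIN that
    still carries the mapping of an earlier, partially acked data packet.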

    Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets")
    Reported-by: syzbot+42a07faa5923cfaeb9c9@syzkaller.appspotmail.com
    Tested-by: Christoph Paasch
    Reviewed-by: Christoph Paasch
    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • IPv4 ping sockets don't set fl4.fl4_icmp_{type,code}, which leads to
    incomplete IPsec ACQUIRE messages being sent to userspace. Currently,
    both raw sockets and IPv6 ping sockets set those fields.

    Expected output of "ip xfrm monitor":
    acquire proto esp
    sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 8 code 0 dev ens4
    policy src 10.0.2.15/32 dst 8.8.8.8/32

    Currently with ping sockets:
    acquire proto esp
    sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 0 code 0 dev ens4
    policy src 10.0.2.15/32 dst 8.8.8.8/32

    The Libreswan test suite found this problem after Fedora changed the
    value for the sysctl net.ipv4.ping_group_range.

    Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
    Reported-by: Paul Wouters
    Tested-by: Paul Wouters
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     
  • When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
    copied, so the cgroup refcnt must be taken too. And, unlike the
    sk_alloc() path, sock_update_netprioidx() is not called here.
    Therefore, it is safe and necessary to grab the cgroup refcnt
    even when cgroup_sk_alloc is disabled.

    sk_clone_lock() runs in BH context anyway, so the in_interrupt() check
    would terminate this function if called there. And for sk_alloc()
    skcd->val is always zero. So it's safe to factor out the code
    to make it more readable.

    The global variable 'cgroup_sk_alloc_disabled' is used to determine
    whether to take these reference counts. It is impossible to make
    the reference counting correct unless we save this bit of information
    in skcd->val. So, add a new bit there to record whether the socket
    has already taken the reference counts. This obviously relies on
    kmalloc() to align cgroup pointers to at least 4 bytes;
    ARCH_KMALLOC_MINALIGN is certainly larger than that.
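
    Storing a flag in the low bits of an aligned pointer can be sketched as
    below. The macro and function names, and the choice of which bit means
    what, are assumptions made for illustration; they are not taken from the
    kernel's definition of skcd->val.

```c
#include <assert.h>
#include <stdint.h>

#define SKCD_FLAG_MASK  ((uintptr_t)0x3)  /* two low bits, free with >= 4-byte alignment */
#define SKCD_NO_REFCNT  ((uintptr_t)0x2)  /* hypothetical flag: references were not taken */

/* Pack a pointer and flag bits into one word. */
static uintptr_t skcd_pack(const void *cgrp, uintptr_t flags)
{
    uintptr_t p = (uintptr_t)cgrp;

    assert((p & SKCD_FLAG_MASK) == 0);  /* relies on alignment of at least 4 */
    return p | (flags & SKCD_FLAG_MASK);
}

/* Recover the original pointer by masking the flag bits off. */
static void *skcd_ptr(uintptr_t val)
{
    return (void *)(val & ~SKCD_FLAG_MASK);
}

/* Report whether the "no reference taken" bit is set. */
static int skcd_no_refcnt(uintptr_t val)
{
    return (val & SKCD_NO_REFCNT) != 0;
}
```

    The per-socket word then answers "did this socket take the references?"
    by itself, so releasing no longer depends on the current value of a
    global switch.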

    This bug seems to have been present since the beginning; commit
    d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    tried to fix it, but not completely. It seems it was not easy to trigger
    until the recent commit 090e28b229af
    ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Reported-by: Cameron Berkenpas
    Reported-by: Peter Geis
    Reported-by: Lu Fengqi
    Reported-by: Daniël Sonck
    Reported-by: Zhang Qiang
    Tested-by: Cameron Berkenpas
    Tested-by: Peter Geis
    Tested-by: Thomas Lamprecht
    Cc: Daniel Borkmann
    Cc: Zefan Li
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • Thomas reported a regression with IPv6 and anycast using the following
    reproducer:

    echo 1 > /proc/sys/net/ipv6/conf/all/forwarding
    ip -6 a add fc12::1/16 dev lo
    sleep 2
    echo "pinging lo"
    ping6 -c 2 fc12::

    The conversion of addrconf_f6i_alloc to use ip6_route_info_create missed
    the use of fib6_is_reject which checks addresses added to the loopback
    interface and sets the REJECT flag as needed. Update fib6_is_reject for
    loopback checks to handle RTF_ANYCAST addresses.

    Fixes: c7a1ce397ada ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create")
    Reported-by: thomas.gambier@nexedi.com
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

07 Jul, 2020

1 commit

  • Brian reported a crash in IPv6 code when using rpfilter with a setup
    running FRR and external nexthop objects. The root cause of the crash
    is fib6_select_path setting fib6_nh in the result to NULL because of
    an improper check for nexthop objects.

    More specifically, rpfilter invokes ip6_route_lookup with flowi6_oif
    set causing fib6_select_path to be called with have_oif_match set.
    fib6_select_path has an early check on have_oif_match and jumps to the
    out label which presumes a builtin fib6_nh. This path is invalid for
    nexthop objects; for external nexthops fib6_select_path needs to just
    return if the fib6_nh has already been set in the result; otherwise it
    returns after the call to nexthop_path_fib6_result. Update the check
    on have_oif_match to not bail on external nexthops.

    Update selftests for this problem.

    Fixes: f88d8ea67fbd ("ipv6: Plumb support for nexthop object in a fib6_info")
    Reported-by: Brian Rak
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

05 Jul, 2020

2 commits

  • To release an hsr (upper) interface, its lower interfaces must be
    released first.
    Only then can the hsr (upper) interface be released safely.
    The current error path of hsr_dev_finalize() releases the hsr
    interface before releasing a lower interface.
    This triggers a warning about the leak of lower interfaces.
    Fix the problem by changing the ordering of the error path of
    hsr_dev_finalize().

    Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add dummy2 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
    ip link add hsr1 type hsr slave1 dummy2 slave2 dummy0

    Splat looks like:
    [ 214.923127][ C2] WARNING: CPU: 2 PID: 1093 at net/core/dev.c:8992 rollback_registered_many+0x986/0xcf0
    [ 214.923129][ C2] Modules linked in: hsr dummy openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipx
    [ 214.923154][ C2] CPU: 2 PID: 1093 Comm: ip Not tainted 5.8.0-rc2+ #623
    [ 214.923156][ C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    [ 214.923157][ C2] RIP: 0010:rollback_registered_many+0x986/0xcf0
    [ 214.923160][ C2] Code: 41 8b 4e cc 45 31 c0 31 d2 4c 89 ee 48 89 df e8 e0 47 ff ff 85 c0 0f 84 cd fc ff ff 5
    [ 214.923162][ C2] RSP: 0018:ffff8880c5156f28 EFLAGS: 00010287
    [ 214.923165][ C2] RAX: ffff8880d1dad458 RBX: ffff8880bd1b9000 RCX: ffffffffb929d243
    [ 214.923167][ C2] RDX: 1ffffffff77e63f0 RSI: 0000000000000008 RDI: ffffffffbbf31f80
    [ 214.923168][ C2] RBP: dffffc0000000000 R08: fffffbfff77e63f1 R09: fffffbfff77e63f1
    [ 214.923170][ C2] R10: ffffffffbbf31f87 R11: 0000000000000001 R12: ffff8880c51570a0
    [ 214.923172][ C2] R13: ffff8880bd1b90b8 R14: ffff8880c5157048 R15: ffff8880d1dacc40
    [ 214.923174][ C2] FS: 00007fdd257a20c0(0000) GS:ffff8880da200000(0000) knlGS:0000000000000000
    [ 214.923175][ C2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 214.923177][ C2] CR2: 00007ffd78beb038 CR3: 00000000be544005 CR4: 00000000000606e0
    [ 214.923179][ C2] Call Trace:
    [ 214.923180][ C2] ? netif_set_real_num_tx_queues+0x780/0x780
    [ 214.923182][ C2] ? dev_validate_mtu+0x140/0x140
    [ 214.923183][ C2] ? synchronize_rcu.part.79+0x85/0xd0
    [ 214.923185][ C2] ? synchronize_rcu_expedited+0xbb0/0xbb0
    [ 214.923187][ C2] rollback_registered+0xc8/0x170
    [ 214.923188][ C2] ? rollback_registered_many+0xcf0/0xcf0
    [ 214.923190][ C2] unregister_netdevice_queue+0x18b/0x240
    [ 214.923191][ C2] hsr_dev_finalize+0x56e/0x6e0 [hsr]
    [ 214.923192][ C2] hsr_newlink+0x36b/0x450 [hsr]
    [ 214.923194][ C2] ? hsr_dellink+0x70/0x70 [hsr]
    [ 214.923195][ C2] ? rtnl_create_link+0x2e4/0xb00
    [ 214.923197][ C2] ? __netlink_ns_capable+0xc3/0xf0
    [ 214.923198][ C2] __rtnl_newlink+0xbdb/0x1270
    [ ... ]

    Fixes: e0a4b99773d3 ("hsr: use upper/lower device infrastructure")
    Reported-by: syzbot+7f1c020f68dab95aab59@syzkaller.appspotmail.com
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller

    Taehee Yoo
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Use kvfree() to release vmalloc()'ed areas in ipset, from Eric Dumazet.

    2) UAF in nfnetlink_queue from the nf_conntrack_update() path.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Jul, 2020

1 commit

  • There are a couple of places in net/sched/ that check skb->protocol and act
    on the value there. However, in the presence of VLAN tags, the value stored
    in skb->protocol can be inconsistent based on whether VLAN acceleration is
    enabled. The commit quoted in the Fixes tag below fixed the users of
    skb->protocol to use a helper that will always see the VLAN ethertype.

    However, most of the callers don't actually handle the VLAN ethertype, but
    expect to find the IP header type in the protocol field. This means that
    things like changing the ECN field, or parsing diffserv values, stop
    working if there's a VLAN tag, or if there are multiple nested VLAN
    tags (QinQ).

    To fix this, change the helper to take an argument that indicates whether
    the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
    make sure to skip all of them, so behaviour is consistent even in QinQ
    mode.
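
    The "skip all of them" behaviour can be sketched on a raw Ethernet frame.
    This is a userspace illustration only: frame_protocol() and rd16() are
    hypothetical, and the sketch looks only at tags present in the frame
    bytes, not at tags a NIC may have moved into packet metadata.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_8021Q   0x8100  /* 802.1Q VLAN tag */
#define ETH_P_8021AD  0x88A8  /* 802.1ad service (outer) tag */
#define ETH_HLEN      14      /* destination + source + EtherType */
#define VLAN_HLEN     4       /* tag control info + encapsulated EtherType */

/* Read a big-endian 16-bit value. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return the frame's EtherType. With skip_vlan set, look past every VLAN
 * tag, however deeply nested. Returns 0 if the frame is truncated.
 */
static uint16_t frame_protocol(const uint8_t *frame, size_t len, int skip_vlan)
{
    size_t off = ETH_HLEN - 2;   /* offset of the first EtherType field */
    uint16_t proto;

    if (len < ETH_HLEN)
        return 0;
    proto = rd16(frame + off);
    while (skip_vlan && (proto == ETH_P_8021Q || proto == ETH_P_8021AD)) {
        off += VLAN_HLEN;
        if (off + 2 > len)
            return 0;
        proto = rd16(frame + off);
    }
    return proto;
}
```

    A caller that classifies on the IP header passes skip_vlan = 1 and gets
    the same answer for untagged, single-tagged and QinQ frames; a caller
    that cares about the VLAN ethertype itself passes 0.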

    To make the helper usable from the ECN code, move it to if_vlan.h instead
    of pkt_sched.h.

    v3:
    - Remove empty lines
    - Move vlan variable definitions inside loop in skb_protocol()
    - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
    bpf_skb_ecn_set_ce()

    v2:
    - Use eth_type_vlan() helper in skb_protocol()
    - Also fix code that reads skb->protocol directly
    - Change a couple of 'if/else if' statements to switch constructs to avoid
    calling the helper twice

    Reported-by: Ilya Ponetayev
    Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     

03 Jul, 2020

3 commits

  • __nf_conntrack_update() might refresh the conntrack object that is
    attached to the skbuff, so the conntrack pointer must be refetched
    afterwards. Otherwise, using the stale pointer triggers a UAF.

    [ 633.200434] ==================================================================
    [ 633.200472] BUG: KASAN: use-after-free in nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200478] Read of size 1 at addr ffff888370804c00 by task nfqnl_test/6769

    [ 633.200487] CPU: 1 PID: 6769 Comm: nfqnl_test Not tainted 5.8.0-rc2+ #388
    [ 633.200490] Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012
    [ 633.200491] Call Trace:
    [ 633.200499] dump_stack+0x7c/0xb0
    [ 633.200526] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200532] print_address_description.constprop.6+0x1a/0x200
    [ 633.200539] ? _raw_write_lock_irqsave+0xc0/0xc0
    [ 633.200568] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200594] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200598] kasan_report.cold.9+0x1f/0x42
    [ 633.200604] ? call_rcu+0x2c0/0x390
    [ 633.200633] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200659] nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200687] ? nf_conntrack_find_get+0x30/0x30 [nf_conntrack]

    Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1436
    Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Pull nfsd fixes from Bruce Fields:
    "Fixes for a umask bug on exported filesystems lacking ACL support, a
    leak and a module unloading bug in the /proc/fs/nfsd/clients/ code,
    and a compile warning"

    * tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux:
    SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
    nfsd: fix nfsdfs inode reference count leak
    nfsd4: fix nfsdfs reference count loop
    nfsd: apply umask on fs without ACL support

    Linus Torvalds
     
  • This essentially reverts commit 721230326891 ("tcp: md5: reject TCP_MD5SIG
    or TCP_MD5SIG_EXT on established sockets")

    Mathieu reported that many vendors' BGP implementations can
    actually switch TCP MD5 on established flows.

    Quoting Mathieu:
    Here is a list of a few network vendors along with their behavior
    with respect to TCP MD5:

    - Cisco: Allows for password to be changed, but within the hold-down
    timer (~180 seconds).
    - Juniper: When password is initially set on active connection it will
    reset, but after that any subsequent password changes no network
    resets.
    - Nokia: No notes on if they flap the tcp connection or not.
    - Ericsson/RedBack: Allows for 2 password (old/new) to co-exist until
    both sides are ok with new passwords.
    - Meta-Switch: Expects the password to be set before a connection is
    attempted, but no further info on whether they reset the TCP
    connection on a change.
    - Avaya: Disable the neighbor, then set password, then re-enable.
    - Zebos: Would normally allow the change when socket connected.

    We can revert my prior change because commit 9424e2e7ad93 ("tcp: md5: fix potential
    overestimation of TCP option space") removed the leak of 4 kernel bytes to
    the wire that was the main reason for my patch.

    While investigating, I found a bug that occurs when an MD5 key is changed, leading
    to these commits, which stable teams may want to consider before backporting this revert:

    Commit 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
    Commit e6ced831ef11 ("tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers")

    Fixes: 721230326891 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets"
    Signed-off-by: Eric Dumazet
    Reported-by: Mathieu Desnoyers
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Jul, 2020

6 commits

    Whenever tcp_try_rmem_schedule() returns an error, we are in
    trouble and should make sure to wake up readers so that they
    can drain socket queues and eventually make room.

    Fixes: 03f45c883c6f ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When no full socket is available, skbs are sent over a per-netns
    control socket. Its sk_mark is temporarily adjusted to match that
    of the real (request or timewait) socket or to reflect an incoming
    skb, so that the outgoing skb inherits this in __ip_make_skb.

    Introduction of the socket cookie mark field broke this. Now the
    skb mark is set through the cookie and cork:

    # init sockc.mark from sk_mark or cmsg
    ip_append_data
      ip_setup_cork            # convert sockc.mark to cork mark
    ip_push_pending_frames
      ip_finish_skb
        __ip_make_skb          # set skb->mark to cork mark

    But I missed these special control sockets. Update all callers of
    __ip(6)_make_skb that were originally missed.

    For IPv6, the same two icmp(v6) paths are affected. The third
    case is not, as commit 92e55f412cff ("tcp: don't annotate
    mark on control socket from tcp_v6_send_response()") replaced
    the ctl_sk->sk_mark with passing the mark field directly as a
    function argument. That commit predates the commit that
    introduced the bug.

    Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg")
    Signed-off-by: Willem de Bruijn
    Reported-by: Martin KaFai Lau
    Reviewed-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
    Whenever cookie_init_timestamp() has been used to encode
    ECN, SACK, WSCALE options, we cannot remove the TS option in the SYNACK.

    Otherwise, tcp_synack_options() will still advertise options like WSCALE
    that we cannot deduce later when receiving the packet from the client
    to complete 3WHS.

    Note that modern Linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
    but we cannot know for sure that all TCP stacks have the same logic.

    Before the fix, a tcpdump would exhibit this wrong exchange:

    10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
    10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
    10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
    10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
    10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

    After this patch, the exchange looks saner:

    11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
    11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
    11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
    11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
    11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
    11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
    11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

    Of course, no SACK block will ever be added later, but nothing should break.
    Technically, we could remove the 4 nops included in MD5+TS options,
    but again some stacks could break on seeing unconventional alignment.

    Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Cc: Mathieu Desnoyers
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    In testing with mprds enabled, Oracle Cluster nodes after reboot were
    not able to communicate with other nodes and so failed to rejoin
    the cluster. Peers with a lower IP address initiated a connection, but the
    node could not respond because it chose a different path, and it could not
    initiate a connection itself because it had a higher IP address.

    With this patch, when a node sends out a packet and the selected path
    is down, all other paths are also checked and any down paths are
    re-connected.

    Reviewed-by: Ka-cheong Poon
    Reviewed-by: David Edmondson
    Signed-off-by: Somasundaram Krishnasamy
    Signed-off-by: Rao Shoaib
    Signed-off-by: David S. Miller

    Rao Shoaib
     
  • My prior fix went a bit too far, according to Herbert and Mathieu.

    Since we accept that concurrent TCP MD5 lookups might see inconsistent
    keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb().

    Clearing all key->key[] is needed to avoid possible KMSAN reports,
    if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
    using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.

    data_race() was added in linux-5.8 and will prevent KCSAN reports.
    It can safely be removed in stable backports if data_race() is
    not yet backported.

    v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()

    Fixes: 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
    Signed-off-by: Eric Dumazet
    Cc: Mathieu Desnoyers
    Cc: Herbert Xu
    Cc: Marco Elver
    Reviewed-by: Mathieu Desnoyers
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • A potential deadlock can occur during registering or unregistering a
    new generic netlink family between the main nl_table_lock and the
    cb_lock where each thread wants the lock held by the other, as
    demonstrated below.

    1) Thread 1 is performing a netlink_bind() operation on a socket. As part
    of this call, it will call netlink_lock_table(), incrementing the
    nl_table_users count to 1.
    2) Thread 2 is registering (or unregistering) a genl_family via the
    genl_(un)register_family() API. The cb_lock semaphore will be taken for
    writing.
    3) Thread 1 will call genl_bind() as part of the bind operation to handle
    subscribing to GENL multicast groups at the request of the user. It will
    attempt to take the cb_lock semaphore for reading, but it will fail and
    be scheduled away, waiting for Thread 2 to finish the write.
    4) Thread 2 will call netlink_table_grab() during the (un)registration
    call. However, as Thread 1 has incremented nl_table_users, it will not
    be able to proceed, and both threads will be stuck waiting for the
    other.

    genl_bind() is a noop, unless a genl_family implements the mcast_bind()
    function to handle setting up family-specific multicast operations. Since
    no one in-tree uses this functionality as Cong pointed out, simply removing
    the genl_bind() function will remove the possibility for deadlock, as there
    is no attempt by Thread 1 above to take the cb_lock semaphore.

    Fixes: c380d9a7afff ("genetlink: pass multicast bind/unbind to families")
    Suggested-by: Cong Wang
    Acked-by: Johannes Berg
    Reported-by: kernel test robot
    Signed-off-by: Sean Tranchetti
    Signed-off-by: David S. Miller

    Sean Tranchetti
     

01 Jul, 2020

6 commits

    This code assumes that the user passed in enough data for a
    qrtr_hdr_v1 or qrtr_hdr_v2 struct, but that is not necessarily true. If
    the buffer is too small then it will read beyond the end.
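
    The kind of check this calls for can be sketched as follows. This is a
    userspace illustration under stated assumptions, not the qrtr code: the
    demo_hdr_v1 and demo_hdr_v2 layouts, the version values 1 and 2, and
    demo_parse() are all hypothetical stand-ins. The point is only that the
    length is validated against the header size for the detected version
    before any field is read.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified header layouts for illustration only. */
struct demo_hdr_v1 {
    uint32_t version;
    uint32_t type;
    uint32_t size;
};

struct demo_hdr_v2 {
    uint8_t version;
    uint8_t type;
    uint8_t flags;
    uint8_t optlen;
    uint32_t size;
};

/*
 * Extract the packet type from a buffer of len bytes.
 * Returns 0 on success, -1 if the buffer is too short or the
 * version is unknown. *type is left untouched on failure.
 */
static int demo_parse(const uint8_t *data, size_t len, uint32_t *type)
{
    if (len < 1)
        return -1; /* cannot even read the version byte */

    switch (data[0]) {
    case 1: {
        struct demo_hdr_v1 h;

        if (len < sizeof(h))
            return -1; /* too short for a v1 header */
        memcpy(&h, data, sizeof(h));
        *type = h.type;
        return 0;
    }
    case 2: {
        struct demo_hdr_v2 h;

        if (len < sizeof(h))
            return -1; /* too short for a v2 header */
        memcpy(&h, data, sizeof(h));
        *type = h.type;
        return 0;
    }
    default:
        return -1;
    }
}
```

    Copying with memcpy() after the length check also avoids unaligned
    access to the caller's buffer.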

    Reported-by: Manivannan Sadhasivam
    Reported-by: syzbot+b8fe393f999a291a9ea6@syzkaller.appspotmail.com
    Fixes: 194ccc88297a ("net: qrtr: Support decoding incoming v2 packets")
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller

    Dan Carpenter
     
  • MD5 keys are read with RCU protection, and tcp_md5_do_add()
    might update in-place a prior key.

    Normally, typical RCU updates would allocate a new piece
    of memory. In this case only key->key and key->keylen might
    be updated, and we do not care if an incoming packet could
    see the old key, the new one, or some intermediate value,
    since changing the key on a live flow is known to be problematic
    anyway.

    We only want to make sure that in the case key->keylen
    is changed, CPUs in tcp_md5_hash_key() won't try to use
    uninitialized data, or crash because key->keylen was
    read twice to feed sg_init_one() and ahash_request_set_crypt().
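
    The "read the length once" idea can be sketched in userspace C. This is
    a minimal illustration, not the kernel change: struct demo_key and
    demo_copy_key() are hypothetical, and the volatile cast stands in for
    the kernel's single-read primitives. The length is snapshotted into a
    local, and that one value feeds both the bounds check and the copy, so
    a concurrent writer cannot make the two uses disagree.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_KEY_MAX 80 /* assumed maximum key length for this sketch */

struct demo_key {
    uint8_t keylen;
    uint8_t key[DEMO_KEY_MAX];
};

/*
 * Copy the key into out and return the number of bytes copied,
 * or 0 if the key does not fit. keylen is loaded exactly once.
 */
static size_t demo_copy_key(const struct demo_key *k, uint8_t *out,
                            size_t out_len)
{
    /* Single snapshot: the compiler may not reload keylen later. */
    uint8_t len = *(const volatile uint8_t *)&k->keylen;

    if (len > DEMO_KEY_MAX || len > out_len)
        return 0;
    memcpy(out, k->key, len); /* same value as the one just checked */
    return len;
}
```

    A reader may still observe a mix of old and new key bytes, which the
    text above accepts; the sketch only guarantees a consistent length.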

    Fixes: 9ea88a153001 ("tcp: md5: check md5 signature without socket lock")
    Signed-off-by: Eric Dumazet
    Cc: Mathieu Desnoyers
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    The flow is allocated in qrtr_tx_wait, but not freed when the qrtr node
    is released. (*slot) becomes NULL after radix_tree_iter_delete is
    called in __qrtr_node_release. The fix is to save (*slot) to a
    variable and then free it.

    This memory leak is caught when kmemleak is enabled in the kernel;
    the report looks like this:
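
    The save-then-delete pattern can be sketched with a plain array of
    slots standing in for the radix tree. This is an assumption-laden
    illustration, not the qrtr code: demo_flow, demo_slot_delete() and
    demo_release_all() are hypothetical names, and demo_slot_delete() only
    mimics the property described above, namely that the slot reads as NULL
    once the entry has been deleted.

```c
#include <stdlib.h>

struct demo_flow {
    int pending;
};

/* Mimics deletion from the tree: the slot is cleared as a side effect. */
static void demo_slot_delete(struct demo_flow **slot)
{
    *slot = NULL;
}

/*
 * Release every flow held in slots[0..n-1] and return how many were
 * freed. The pointer is saved before the slot is deleted, because
 * reading the slot afterwards would only yield NULL and leak the flow.
 */
static int demo_release_all(struct demo_flow **slots, size_t n)
{
    int freed = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        struct demo_flow *flow = slots[i]; /* save before deleting */

        if (!flow)
            continue;
        demo_slot_delete(&slots[i]);
        free(flow);
        freed++;
    }
    return freed;
}
```

    Calling free(slots[i]) after demo_slot_delete() instead would pass NULL
    to free(), which is a silent no-op, so the leak would go unnoticed
    without a tool such as kmemleak.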

    unreferenced object 0xffffa0de69e08420 (size 32):
      comm "kworker/u16:3", pid 176, jiffies 4294918275 (age 82858.876s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 28 84 e0 69 de a0 ff ff  ........(..i....
        28 84 e0 69 de a0 ff ff 03 00 00 00 00 00 00 00  (..i............
      backtrace:
        qrtr_node_enqueue+0x38e/0x400 [qrtr]
        qrtr_sendmsg+0x1e0/0x2a0 [qrtr]
        sock_sendmsg+0x5b/0x60
        qmi_send_message.isra.3+0xbe/0x110 [qmi_helpers]
        qmi_send_request+0x1c/0x20 [qmi_helpers]

    Signed-off-by: Carl Huang
    Signed-off-by: David S. Miller

    Carl Huang
     
  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2020-06-30

    The following pull-request contains BPF updates for your *net* tree.

    We've added 28 non-merge commits during the last 9 day(s) which contain
    a total of 35 files changed, 486 insertions(+), 232 deletions(-).

    The main changes are:

    1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer
    types, from Yonghong Song.

    2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various
    arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki.

    3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb
    internals and integrate it properly into DMA core, from Christoph Hellwig.

    4) Fix RCU splat from recent changes to avoid skipping ingress policy when
    kTLS is enabled, from John Fastabend.

    5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its
    position masking to work, from Andrii Nakryiko.

    6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading
    of network programs, from Maciej Żenczykowski.

    7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer.

    8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet.
    ====================
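
    Item 5 relies on a general property of position masking that a short
    userspace sketch can show. This is not the BPF ringbuf code:
    demo_is_pow2() and demo_ring_offset() are hypothetical names. Masking a
    position with (size - 1) equals the position modulo size only when size
    is a power of 2, which is why other sizes have to be rejected.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n is a non-zero power of 2 (exactly one bit set). */
static bool demo_is_pow2(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Map an ever-increasing position to an offset inside the ring.
 * Only equivalent to pos % size when size is a power of 2.
 */
static uint64_t demo_ring_offset(uint64_t pos, uint64_t size)
{
    return pos & (size - 1);
}
```

    With size 4096, position 4100 maps to offset 4, matching 4100 % 4096.
    With size 3000, the mask no longer agrees with the modulo, so offsets
    would land in the wrong place.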

    Signed-off-by: David S. Miller

    David S. Miller
     
    Add two tests for PTR_TO_BTF_ID vs. null ptr comparison,
    one for PTR_TO_BTF_ID in the ctx structure and the
    other for PTR_TO_BTF_ID after one level of pointer chasing.
    In both cases, the test ensures that the condition is not
    removed.

    For example, for this test:
        struct bpf_fentry_test_t {
                struct bpf_fentry_test_t *a;
        };
        int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
        {
                if (arg == 0)
                        test7_result = 1;
                return 0;
        }
    Before the previous verifier change, the xlated code is:
        int test7(long long unsigned int * ctx):
        ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
           0: (79) r1 = *(u64 *)(r1 +0)
        ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
           1: (b4) w0 = 0
           2: (95) exit
    After the previous verifier change, the xlated code is:
        int test7(long long unsigned int * ctx):
        ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
           0: (79) r1 = *(u64 *)(r1 +0)
        ; if (arg == 0)
           1: (55) if r1 != 0x0 goto pc+4
        ; test7_result = 1;
           2: (18) r1 = map[id:6][0]+48
           4: (b7) r2 = 1
           5: (7b) *(u64 *)(r1 +0) = r2
        ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
           6: (b4) w0 = 0
           7: (95) exit

    Signed-off-by: Yonghong Song
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200630171241.2523875-1-yhs@fb.com

    Yonghong Song
     
  • The xfrm interface uses skb->protocol to determine packet type, and
    bails out if it's not set. For AF_PACKET injection, we need to support
    its call chain of:

    packet_sendmsg -> packet_snd -> packet_parse_headers ->
    dev_parse_header_protocol -> parse_protocol

    Without a valid parse_protocol, this returns zero, and xfrmi rejects the
    skb. So, this wires up the ip_tunnel handler for layer 3 packets for
    that case.

    Reported-by: Willem de Bruijn
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: David S. Miller

    Jason A. Donenfeld