14 Jun, 2020

2 commits

  • Pull networking fixes from David Miller:

    1) Fix cfg80211 deadlock, from Johannes Berg.

    2) RXRPC fails to send notifications, from David Howells.

    3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
    Geliang Tang.

    4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

    5) The ucc_geth driver needs __netdev_watchdog_up exported, from
    Valentin Longchamp.

    6) Fix hashtable memory leak in dccp, from Wang Hai.

    7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

    8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

    9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

    10) Fix link speed reporting in iavf driver, from Brett Creeley.

    11) When a channel is used for XSK and then reused again later for XSK,
    we forget to clear out the relevant data structures in mlx5 which
    causes all kinds of problems. Fix from Maxim Mikityanskiy.

    12) Fix memory leak in genetlink, from Cong Wang.

    13) Disallow sockmap attachments to UDP sockets, it simply won't work.
    From Lorenz Bauer.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
    net: ethernet: ti: ale: fix allmulti for nu type ale
    net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
    net: atm: Remove the error message according to the atomic context
    bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
    libbpf: Support pre-initializing .bss global variables
    tools/bpftool: Fix skeleton codegen
    bpf: Fix memlock accounting for sock_hash
    bpf: sockmap: Don't attach programs to UDP sockets
    bpf: tcp: Recv() should return 0 when the peer socket is closed
    ibmvnic: Flush existing work items before device removal
    genetlink: clean up family attributes allocations
    net: ipa: header pad field only valid for AP->modem endpoint
    net: ipa: program upper nibbles of sequencer type
    net: ipa: fix modem LAN RX endpoint id
    net: ipa: program metadata mask differently
    ionic: add pcie_print_link_status
    rxrpc: Fix race between incoming ACK parser and retransmitter
    net/mlx5: E-Switch, Fix some error pointer dereferences
    net/mlx5: Don't fail driver on failure to create debugfs
    net/mlx5e: CT: Fix ipv6 nat header rewrite actions
    ...

    Linus Torvalds
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2020-06-12

    The following pull-request contains BPF updates for your *net* tree.

    We've added 26 non-merge commits during the last 10 day(s) which contain
    a total of 27 files changed, 348 insertions(+), 93 deletions(-).

    The main changes are:

    1) sock_hash accounting fix, from Andrey.

    2) libbpf fix and probe_mem sanitizing, from Andrii.

    3) sock_hash fixes, from Jakub.

    4) devmap_val fix, from Jesper.

    5) load_bytes_relative fix, from YiFei.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

13 Jun, 2020

2 commits

  • Add missed bpf_map_charge_init() in sock_hash_alloc() and
    correspondingly bpf_map_charge_finish() on ENOMEM.

    It was found accidentally while working on unrelated selftest that
    checks "map->memory.pages > 0" is true for all map types.

    Before:
    # bpftool m l
    ...
    3692: sockhash name m_sockhash flags 0x0
    key 4B value 4B max_entries 8 memlock 0B

    After:
    # bpftool m l
    ...
    84: sockmap name m_sockmap flags 0x0
    key 4B value 4B max_entries 8 memlock 4096B

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200612000857.2881453-1-rdna@fb.com

    Andrey Ignatov
     
  • The stream parser infrastructure isn't set up to deal with UDP
    sockets, so we mustn't try to attach programs to them.

    I remember making this change at some point, but I must have lost
    it while rebasing or something similar.

    Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20200611172520.327602-1-lmb@cloudflare.com

    Lorenz Bauer
     

11 Jun, 2020

2 commits

  • Added a check in the switch case on start_header that checks for
    the existence of the header; in the case that MAC is not set
    and the caller requests MAC, return -EFAULT. If the caller requests
    NET, then MAC's existence is completely ignored.

    There is no function to check NET header's existence and as far
    as cgroup_skb/egress is concerned it should always be set.

    Removed for ptr >= the start of header, considering offset is
    bounded unsigned and should always be true. len
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/76bb820ddb6a95f59a772ecbd8c8a336f646b362.1591812755.git.zhuyifei@google.com

    YiFei Zhu
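
    The header check described above can be sketched as a small userspace
    model. This is only an illustration under stated assumptions: struct pkt,
    the HDR_START_* names and load_bytes_relative() are hypothetical stand-ins,
    not the kernel helper itself.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's start_header selector values. */
enum hdr_start { HDR_START_MAC, HDR_START_NET };

/* Hypothetical, simplified packet: header offsets index into data[]. */
struct pkt {
	const uint8_t *data;
	size_t len;   /* number of valid bytes in data[] */
	int mac_off;  /* -1 when the MAC header was never set */
	int net_off;  /* assumed to be always set */
};

/* Copy len bytes found at offset from the chosen header.
 * On any failure, zero the destination and return -EFAULT. */
static int load_bytes_relative(const struct pkt *p, uint32_t offset,
			       void *to, uint32_t len, enum hdr_start start)
{
	size_t base, pos;

	if (offset > 0xffff)
		goto err_clear;

	switch (start) {
	case HDR_START_MAC:
		if (p->mac_off < 0)  /* the added existence check */
			goto err_clear;
		base = (size_t)p->mac_off;
		break;
	case HDR_START_NET:
		base = (size_t)p->net_off;  /* MAC state is ignored here */
		break;
	default:
		goto err_clear;
	}

	/* offset is unsigned and bounded, so only the upper bound is checked. */
	pos = base + offset;
	if (pos <= p->len && len <= p->len - pos) {
		memcpy(to, p->data + pos, len);
		return 0;
	}
err_clear:
	memset(to, 0, len);
	return -EFAULT;
}
```

    In this model a MAC-relative read on a packet with mac_off == -1 fails and
    clears the output buffer, while a NET-relative read on the same packet
    still succeeds.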
     
  • Pull sysctl fixes from Al Viro:
    "Fixups to regressions in sysctl series"

    * 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    sysctl: reject gigantic reads/write to sysctl files
    cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
    trace: fix an incorrect __user annotation on stack_trace_sysctl
    random: fix an incorrect __user annotation on proc_do_entropy
    net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
    net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl

    Linus Torvalds
     

10 Jun, 2020

3 commits

  • The dynamic key update for addr_list_lock still causes troubles,
    for example the following race condition still exists:

    CPU 0:                          CPU 1:
    (RCU read lock)                 (RTNL lock)
    dev_mc_seq_show()               netdev_update_lockdep_key()
                                      -> lockdep_unregister_key()
     -> netif_addr_lock_bh()

    because lockdep doesn't provide an API to update it atomically.
    Therefore, we have to move it back to static keys and use subclass
    for nest locking like before.

    In commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key
    changes"), I already reverted most parts of commit ab92d68fc22f
    ("net: core: add generic lockdep keys").

    This patch reverts the rest and also part of commit f3b0a18bb6cb
    ("net: remove unnecessary variables and callback"). After this
    patch, addr_list_lock changes back to using static keys and
    subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
    not have to change back to ->ndo_get_lock_subclass().

    And hopefully this reduces some syzbot lockdep noises too.

    Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
    Cc: Taehee Yoo
    Cc: Dmitry Vyukov
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • We can end up modifying the sockhash bucket list from two CPUs when a
    sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
    that is in the sockhash is unlinking itself from it on another CPU
    (sock_hash_delete_from_link).

    This results in accessing a list element that is in an undefined state as
    reported by KASAN:

    | ==================================================================
    | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
    | Write of size 8 at addr dead000000000122 by task kworker/2:1/95
    |
    | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    | Workqueue: events bpf_map_free_deferred
    | Call Trace:
    | dump_stack+0x97/0xe0
    | ? sock_hash_free+0x13c/0x280
    | __kasan_report.cold+0x5/0x40
    | ? mark_lock+0xbc1/0xc00
    | ? sock_hash_free+0x13c/0x280
    | kasan_report+0x38/0x50
    | ? sock_hash_free+0x152/0x280
    | sock_hash_free+0x13c/0x280
    | bpf_map_free_deferred+0xb2/0xd0
    | ? bpf_map_charge_finish+0x50/0x50
    | ? rcu_read_lock_sched_held+0x81/0xb0
    | ? rcu_read_lock_bh_held+0x90/0x90
    | process_one_work+0x59a/0xac0
    | ? lock_release+0x3b0/0x3b0
    | ? pwq_dec_nr_in_flight+0x110/0x110
    | ? rwlock_bug.part.0+0x60/0x60
    | worker_thread+0x7a/0x680
    | ? _raw_spin_unlock_irqrestore+0x4c/0x60
    | kthread+0x1cc/0x220
    | ? process_one_work+0xac0/0xac0
    | ? kthread_create_on_node+0xa0/0xa0
    | ret_from_fork+0x24/0x30
    | ==================================================================

    Fix it by reintroducing spin-lock protected critical section around the
    code that removes the elements from the bucket on sockhash free.

    To do that we also need to defer processing of removed elements, until out
    of atomic context so that we can unlink the socket from the map when
    holding the sock lock.

    Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free")
    Reported-by: Eric Dumazet
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200607205229.2389672-3-jakub@cloudflare.com

    Jakub Sitnicki
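
    The approach described above (detach the bucket's elements under the lock,
    then process and free them once the lock is dropped) can be sketched in
    userspace. This is a minimal model, assuming a pthread mutex in place of the
    bucket spin lock; struct elem, struct bucket and the bucket_*() functions are
    hypothetical names, not the kernel's sockhash types.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical element and bucket types for illustration only. */
struct elem {
	struct elem *next;
	int sock_id;
};

struct bucket {
	pthread_mutex_t lock;  /* stands in for the bucket spin lock */
	struct elem *head;
};

static void bucket_init(struct bucket *b)
{
	pthread_mutex_init(&b->lock, NULL);
	b->head = NULL;
}

static int bucket_add(struct bucket *b, int sock_id)
{
	struct elem *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->sock_id = sock_id;
	pthread_mutex_lock(&b->lock);
	e->next = b->head;
	b->head = e;
	pthread_mutex_unlock(&b->lock);
	return 0;
}

/* Empty the bucket and return the number of elements released. */
static int bucket_free_all(struct bucket *b)
{
	struct elem *unlinked, *next;
	int n = 0;

	/* Short critical section: only detach the whole list. */
	pthread_mutex_lock(&b->lock);
	unlinked = b->head;
	b->head = NULL;
	pthread_mutex_unlock(&b->lock);

	/* Lock dropped: work that may sleep goes here, then free each
	 * element so that nothing is leaked. */
	for (; unlinked; unlinked = next) {
		next = unlinked->next;
		free(unlinked);
		n++;
	}
	return n;
}
```

    A concurrent unlinker that takes the same lock sees either the full list or
    an empty one, never a half-removed element.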
     
  • When sockhash gets destroyed while sockets are still linked to it, we will
    walk the bucket lists and delete the links. However, we are not freeing the
    list elements after processing them, leaking the memory.

    The leak can be triggered by close()'ing a sockhash map when it still
    contains sockets, and observed with kmemleak:

    unreferenced object 0xffff888116e86f00 (size 64):
    comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff ...A.....i/.....
    backtrace:
    [] sock_hash_update_common+0x4ca/0x760
    [] sock_hash_update_elem+0x1d2/0x200
    [] __do_sys_bpf+0x2046/0x2990
    [] do_syscall_64+0xad/0x9a0
    [] entry_SYSCALL_64_after_hwframe+0x49/0xb3

    Fix it by freeing the list element when we're done with it.

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200607205229.2389672-2-jakub@cloudflare.com

    Jakub Sitnicki
     

08 Jun, 2020

1 commit


05 Jun, 2020

3 commits

  • Sequence counters write paths are critical sections that must never be
    preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.

    Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
    netdev name retrieval.") handled a deadlock, observed with
    CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
    infinitely spinning: it got scheduled after the seqcount write side
    blocked inside its own critical section.

    To fix that deadlock, among other issues, the commit added a
    cond_resched() inside the read side section. While this will get the
    non-preemptible kernel eventually unstuck, the seqcount reader is fully
    exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.

    The fix is also still broken: if the seqcount reader belongs to a
    real-time scheduling policy, it can spin forever and the kernel will
    livelock.

    Disabling preemption over the seqcount write side critical section will
    not work: inside it are a number of GFP_KERNEL allocations and mutex
    locking through the drivers/base/ :: device_rename() call chain.

    From all the above, replace the seqcount with a rwsem.

    Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
    Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
    Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
    Cc:
    Reported-by: kbuild test robot [ v1 missing up_read() on error exit ]
    Reported-by: Dan Carpenter [ v1 missing up_read() on error exit ]
    Signed-off-by: Ahmed S. Darwish
    Reviewed-by: Sebastian Andrzej Siewior
    Signed-off-by: David S. Miller

    Ahmed S. Darwish
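
    The reasoning above can be illustrated with a userspace analogue, assuming
    a POSIX rwlock as a rough stand-in for the kernel rwsem. The names
    DEV_NAME_LEN, rename_lock, current_name, dev_rename() and dev_get_name()
    are hypothetical. The point is only that a reader sleeps on the lock while
    a rename is in flight instead of spinning and retrying.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define DEV_NAME_LEN 16  /* hypothetical; plays the role of a name size limit */

static pthread_rwlock_t rename_lock = PTHREAD_RWLOCK_INITIALIZER;
static char current_name[DEV_NAME_LEN] = "eth0";

/* Writer: allowed to block inside the critical section. */
static int dev_rename(const char *newname)
{
	if (strlen(newname) >= DEV_NAME_LEN)
		return -1;
	pthread_rwlock_wrlock(&rename_lock);
	strcpy(current_name, newname);
	pthread_rwlock_unlock(&rename_lock);
	return 0;
}

/* Reader: blocks while a rename holds the lock; no retry loop needed. */
static void dev_get_name(char *out, size_t outlen)
{
	pthread_rwlock_rdlock(&rename_lock);
	snprintf(out, outlen, "%s", current_name);
	pthread_rwlock_unlock(&rename_lock);
}
```

    With a sequence counter the reader would loop until it observed a stable
    count; here it never consumes its time slice waiting for the writer.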
     
  • The seg6_validate_srh() is used to validate SRH for three cases:

    case1: SRH of data-plane SRv6 packets to be processed by the Linux kernel.
    case2: SRH of the netlink message received from user-space (iproute2)
    case3: SRH injected into packets through setsockopt

    In case1, the SRH can be encoded in the Reduced way (i.e., first SID is
    carried in DA only and not represented as SID in the SRH) and the
    seg6_validate_srh() now handles this case correctly.

    In case2 and case3, the SRH shouldn’t be encoded in the Reduced way
    otherwise we lose the first segment (i.e., the first hop).

    The current implementation of the seg6_validate_srh() allows SRH of case2
    and case3 to be encoded in the Reduced way. This leads to a slab-out-of-bounds
    problem.

    This patch verifies the SRH of case1, case2 and case3, allowing case1 to be
    reduced while preventing the SRH of case2 and case3 from being reduced.

    Reported-by: syzbot+e8c028b62439eac42073@syzkaller.appspotmail.com
    Reported-by: YueHaibing
    Fixes: 0cb7498f234e ("seg6: fix SRH processing to comply with RFC8754")
    Signed-off-by: Ahmed Abdelsalam
    Signed-off-by: David S. Miller

    Ahmed Abdelsalam
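
    The distinction between the reduced and non-reduced cases can be sketched
    as a bounds check on segments_left. This is a simplified model, not the
    kernel function: struct srh_sketch and srh_validate() are hypothetical, only
    three header fields are modelled, and type and TLV checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down SRH: only the fields the bounds checks read. */
struct srh_sketch {
	uint8_t hdrlen;         /* length in 8-octet units, first 8 octets excluded */
	uint8_t segments_left;
	uint8_t first_segment;  /* index of the last entry in the segment list */
};

/* reduced == true:  the first SID may live only in the destination address,
 *                   so segments_left may exceed first_segment by one.
 * reduced == false: every remaining segment must be present in the list. */
static bool srh_validate(const struct srh_sketch *srh, int len, bool reduced)
{
	int max_last_entry;

	if (((srh->hdrlen + 1) << 3) != len)
		return false;

	max_last_entry = (srh->hdrlen / 2) - 1;  /* each SID takes 16 octets */
	if (srh->first_segment > max_last_entry)
		return false;

	if (reduced)
		return srh->segments_left <= srh->first_segment + 1;
	return srh->segments_left <= srh->first_segment;
}
```

    In this model a header with one SID in the list and segments_left == 1 is
    accepted only when reduced encoding is permitted, which corresponds to
    case1 in the text.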
     
  • A recent commit added new variables only used if CONFIG_NETDEVICES is
    set. A simple fix would be to only declare these variables if the same
    condition is valid but Alexei suggested an even simpler solution:

    since CONFIG_NETDEVICES doesn't change anything in .h I think the
    best is to remove #ifdef CONFIG_NETDEVICES from net/core/filter.c
    and rely on sock_bindtoindex() returning ENOPROTOOPT in the extreme
    case of oddly configured kernels.

    Fixes: 70c58997c1e8 ("bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt")
    Suggested-by: Alexei Starovoitov
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200603190347.2310320-1-matthieu.baerts@tessares.net

    Matthieu Baerts
     

04 Jun, 2020

2 commits

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     
  • Pull thread updates from Christian Brauner:
    "We have been discussing using pidfds to attach to namespaces for quite
    a while and the patches have in one form or another already existed
    for about a year. But I wanted to wait to see how the general api
    would be received and adopted.

    This contains the changes to make it possible to use pidfds to attach
    to the namespaces of a process, i.e. they can be passed as the first
    argument to the setns() syscall.

    When only a single namespace type is specified the semantics are
    equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
    equals setns(pidfd, CLONE_NEWNET).

    However, when a pidfd is passed, multiple namespace flags can be
    specified in the second setns() argument and setns() will attach the
    caller to all the specified namespaces all at once or to none of them.

    Specifying 0 is not valid together with a pidfd. Here are just two
    obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

    Allowing to also attach subsets of namespaces supports various
    use-cases where callers setns to a subset of namespaces to retain
    privilege, perform an action and then re-attach another subset of
    namespaces.

    Apart from significantly reducing the number of syscalls needed to
    attach to all currently supported namespaces (eight "open+setns"
    sequences vs just a single "setns()"), this also allows atomic setns
    to a set of namespaces, i.e. either attaching to all namespaces
    succeeds or we fail without having changed anything.

    This is centered around a new internal struct nsset which holds all
    information necessary for a task to switch to a new set of namespaces
    atomically. Fwiw, with this change a pidfd becomes the only token
    needed to interact with a container. I'm expecting this to be
    picked-up by util-linux for nsenter rather soon.

    Associated with this change is a shiny new test-suite dedicated to
    setns() (for pidfds and nsfds alike)"

    * tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests/pidfd: add pidfd setns tests
    nsproxy: attach to namespaces via pidfds
    nsproxy: add struct nsset

    Linus Torvalds
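
    The pidfd-based setns() usage described above can be sketched as follows.
    This is a minimal example under stated assumptions: join_namespaces() is a
    hypothetical helper, raw syscalls are used so the sketch does not depend on
    libc wrappers, and the fallback syscall number 434 for pidfd_open is an
    assumption that does not hold on every architecture.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_NEW* flags */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  /* assumption: generic number, arch dependent */
#endif

/* Attach the caller to the requested namespaces of `pid`, all or nothing.
 * Returns 0 on success, -1 with errno set on failure. */
static int join_namespaces(pid_t pid, int nstypes)
{
	int pidfd, ret, saved;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	/* One call replaces a series of open()+setns() pairs on nsfds. */
	ret = (int)syscall(SYS_setns, pidfd, nstypes);

	saved = errno;
	close(pidfd);
	errno = saved;
	return ret;
}
```

    A caller with sufficient privilege on a kernel that carries this change
    might use join_namespaces(target_pid, CLONE_NEWNS | CLONE_NEWNET), where
    target_pid is whatever process owns the namespaces of interest.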
     

03 Jun, 2020

2 commits

  • Add a bpf_csum_level() helper which BPF programs can use in combination
    with bpf_skb_adjust_room() when they pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET
    flag to the latter to avoid falling back to CHECKSUM_NONE.

    The bpf_csum_level() allows to adjust CHECKSUM_UNNECESSARY skb->csum_levels
    via BPF_CSUM_LEVEL_{INC,DEC} which calls __skb_{incr,decr}_checksum_unnecessary()
    on the skb. The helper also allows a BPF_CSUM_LEVEL_RESET which sets the skb's
    csum to CHECKSUM_NONE as well as a BPF_CSUM_LEVEL_QUERY to just return the
    current level. Without this helper, there is no way to otherwise adjust the
    skb->csum_level. I did not add an extra dummy flags as there is plenty of free
    bitspace in level argument itself iff ever needed in future.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Alan Maguire
    Acked-by: Lorenz Bauer
    Link: https://lore.kernel.org/bpf/279ae3717cb3d03c0ffeb511493c93c450a01e1a.1591108731.git.daniel@iogearbox.net

    Daniel Borkmann
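
    The four operations described above can be modelled in userspace. This is a
    sketch of the semantics as I read them, not the helper itself: struct
    skb_sketch, the CSUM_* and LEVEL_* names and csum_level() are hypothetical,
    the maximum level of 3 is an assumption, and so is the error returned when
    a query is made on a packet that is not CHECKSUM_UNNECESSARY.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel constants. */
enum { CSUM_NONE = 0, CSUM_UNNECESSARY = 1 };
enum { LEVEL_QUERY, LEVEL_INC, LEVEL_DEC, LEVEL_RESET };
#define MAX_CSUM_LEVEL 3  /* assumption: csum_level is a small bounded counter */

struct skb_sketch {
	uint8_t ip_summed;
	uint8_t csum_level;
};

static int csum_level(struct skb_sketch *skb, int op)
{
	switch (op) {
	case LEVEL_INC:
		if (skb->ip_summed == CSUM_UNNECESSARY) {
			if (skb->csum_level < MAX_CSUM_LEVEL)
				skb->csum_level++;
		} else if (skb->ip_summed == CSUM_NONE) {
			skb->ip_summed = CSUM_UNNECESSARY;
			skb->csum_level = 0;
		}
		return 0;
	case LEVEL_DEC:
		if (skb->ip_summed == CSUM_UNNECESSARY) {
			if (skb->csum_level == 0)
				skb->ip_summed = CSUM_NONE;
			else
				skb->csum_level--;
		}
		return 0;
	case LEVEL_RESET:
		if (skb->ip_summed == CSUM_UNNECESSARY) {
			skb->ip_summed = CSUM_NONE;
			skb->csum_level = 0;
		}
		return 0;
	case LEVEL_QUERY:
		/* Assumption: querying a non-UNNECESSARY skb is an error. */
		return skb->ip_summed == CSUM_UNNECESSARY ?
		       skb->csum_level : -EACCES;
	default:
		return -EINVAL;
	}
}
```

    Decrementing at level 0 drops the packet to CSUM_NONE in this model, which
    forces the checksum to be verified again further up the stack.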
     
  • Lorenz recently reported:

    In our TC classifier cls_redirect [0], we use the following sequence of
    helper calls to decapsulate a GUE (basically IP + UDP + custom header)
    encapsulated packet:

    bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
    bpf_redirect(skb->ifindex, BPF_F_INGRESS)

    It seems like some checksums of the inner headers are not validated in
    this case. For example, a TCP SYN packet with invalid TCP checksum is
    still accepted by the network stack and elicits a SYN ACK. [...]

    That is, we receive the following packet from the driver:

    | ETH | IP | UDP | GUE | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

    ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading.
    On this packet we run skb_adjust_room_mac(-encap_len), and get the following:

    | ETH | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

    Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing
    into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is
    turned into a no-op due to CHECKSUM_UNNECESSARY.

    The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally,
    it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does
    not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original
    skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now
    there is no way to adjust the skb->csum_level. NICs that have checksum offload
    disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected.

    Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and
    add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users
    to opt out. Opting out is useful for the case where we don't remove/add
    full protocol headers, or for the case where a user wants to adjust the csum
    level manually e.g. through bpf_csum_level() helper that is added in subsequent
    patch.

    The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF
    bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally
    as well but doesn't change layers, only transitions between v4 to v6 and vice
    versa, therefore no adaptation is required there.

    [0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/

    Fixes: 2be7e212d541 ("bpf: add bpf_skb_adjust_room helper")
    Reported-by: Lorenz Bauer
    Reported-by: Alan Maguire
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Alan Maguire
    Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/
    Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net

    Daniel Borkmann
     

02 Jun, 2020

21 commits

  • Move functions to manage BPF programs attached to netns that are not
    specific to flow dissector to a dedicated module named
    bpf/net_namespace.c.

    The set of functions will grow with the addition of bpf_link support for
    netns attached programs. This patch prepares ground by creating a place
    for it.

    This is a code move with no functional changes intended.

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200531082846.2117903-4-jakub@cloudflare.com

    Jakub Sitnicki
     
  • In order to:

    (1) attach more than one BPF program type to netns, or
    (2) support attaching BPF programs to netns with bpf_link, or
    (3) support multi-prog attach points for netns

    we will need to keep more state per netns than a single pointer like we
    have now for BPF flow dissector program.

    Prepare for the above by extracting netns_bpf that is part of struct net,
    for storing all state related to BPF programs attached to netns.

    Turn flow dissector callbacks for querying/attaching/detaching a program
    into generic ones that operate on netns_bpf. Next patch will move the
    generic callbacks into their own module.

    This is similar to how it is organized for cgroup with cgroup_bpf.

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Cc: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com

    Jakub Sitnicki
     
  • Split out the part of attach callback that happens with attach/detach lock
    acquired. This structures the prog attach callback in a way that opens up
    doors for moving the locking out of flow_dissector and into generic
    callbacks for attaching/detaching progs to netns in subsequent patches.

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/20200531082846.2117903-2-jakub@cloudflare.com

    Jakub Sitnicki
     
  • Extend the supported sockopts in bpf_setsockopt with
    SO_BINDTODEVICE. We call sock_bindtoindex with parameter
    lock_sk = false in this context because we already own
    the socket.

    Signed-off-by: Ferenc Fejes
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/4149e304867b8d5a606a305bc59e29b063e51f49.1590871065.git.fejes@inf.elte.hu

    Ferenc Fejes
     
  • sock_bindtoindex is intended for kernel-wide usage; however,
    it locks the socket regardless of the context. This modification
    makes that locking optional: the socket is locked only when
    sock_bindtoindex is called with lock_sk = true.

    The modification is applied to all users of sock_bindtoindex.

    Signed-off-by: Ferenc Fejes
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu

    Ferenc Fejes
     
  • KTLS uses a stream parser to collect TLS messages and send them to
    the upper layer tls receive handler. This ensures the tls receiver
    has a full TLS header to parse when it is run. However, when a
    socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
    is enabled we end up with two stream parsers running on the same
    socket.

    The result is that both try to run on the same socket. First the KTLS
    stream parser runs and calls read_sock(), which will call tcp_read_sock(),
    which in turn calls tcp_rcv_skb(). This dequeues the skb from the
    sk_receive_queue. When this is done, the KTLS code then calls the
    data_ready() callback which, because we stacked KTLS on top of the bpf
    stream verdict program, has been replaced with sk_psock_start_strp(). This
    will in turn kick the stream parser again and eventually do the
    same thing KTLS did above, calling into tcp_rcv_skb() and dequeuing
    a skb from the sk_receive_queue.

    At this point the data stream is broken. Part of the stream was
    handled by the KTLS side, while some other bytes may have been handled
    by the BPF side. Generally this results in either missing data
    or more likely a "Bad Message" complaint from the kTLS receive
    handler as the BPF program steals some bytes meant to be in a
    TLS header and/or the TLS header length is no longer correct.

    We've already broken the idealized model where we can stack ULPs
    in any order with generic callbacks on the TX side to handle this.
    So in this patch we do the same thing but for RX side. We add
    a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
    program is running and add a tls_sw_has_ctx_rx() helper so BPF
    side can learn there is a TLS ULP on the socket.

    Then on BPF side we omit calling our stream parser to avoid
    breaking the data stream for the KTLS receiver. Then on the
    KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
    receiver is done with the packet but before it posts the
    msg to userspace. This gives us symmetry between the TX and
    RX halves and IMO makes it usable again. On the TX side we
    process packets in this order BPF -> TLS -> TCP and on
    the receive side in the reverse order TCP -> TLS -> BPF.

    Discovered while testing OpenSSL 3.0 Alpha2.0 release.

    Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov

    John Fastabend
     
  • We will need this block of code called from tls context shortly,
    so let's refactor the redirect logic so it's easy to use. This also
    cleans up the switch stmt so we have fewer fallthrough cases.

    No logic changes are intended.

    Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Jakub Sitnicki
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov

    John Fastabend
     
  • Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
    moment only the device is added. Other fields (queue_index)
    can be added as use cases arise.

    From a UAPI perspective, add egress_ifindex to xdp context for
    bpf programs to see the Tx device.

    Update the verifier to only allow accesses to egress_ifindex by
    XDP programs with BPF_XDP_DEVMAP expected attach type.

    Signed-off-by: David Ahern
    Signed-off-by: Alexei Starovoitov
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org
    Signed-off-by: Alexei Starovoitov

    David Ahern
     
  • Add BPF_XDP_DEVMAP attach type for use with programs associated with a
    DEVMAP entry.

    Allow DEVMAPs to associate a program with a device entry by adding
    a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
    id, so the fd and id are a union. bpf programs can get access to the
    struct via vmlinux.h.

    The program associated with the fd must have type XDP with expected
    attach type BPF_XDP_DEVMAP. When a program is associated with a device
    index, the program is run on an XDP_REDIRECT and before the buffer is
    added to the per-cpu queue. At this point rxq data is still valid; the
    next patch adds tx device information allowing the program to see both
    ingress and egress device indices.

    XDP generic is skb based and XDP programs do not work with skb's. Block
    the use case by walking maps used by a program that is to be attached
    via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with

    Block attach of BPF_XDP_DEVMAP programs to devices.

    Signed-off-by: David Ahern
    Signed-off-by: Alexei Starovoitov
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org
    Signed-off-by: Alexei Starovoitov

    David Ahern
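
    The map value layout described in the entry above can be pictured as
    follows. This is a userspace mirror for illustration only:
    devmap_val_sketch and devmap_store_and_read() are hypothetical names
    and the field types are assumptions. The authoritative definition is
    'struct bpf_devmap_val', which the message says is reachable via
    vmlinux.h.

```c
#include <stdint.h>

/* Hypothetical mirror of a devmap value: device index plus a program slot. */
struct devmap_val_sketch {
	uint32_t ifindex;		/* target device index */
	union {
		int fd;			/* written by user space: program fd */
		uint32_t id;		/* read back from the map: program id */
	} bpf_prog;
};

/*
 * Simulate the write-fd / read-id round trip on one value slot.
 * 'resolved_id' stands in for the id that corresponds to the program
 * behind 'prog_fd'; how that id is obtained is outside this sketch.
 */
uint32_t devmap_store_and_read(struct devmap_val_sketch *slot,
			       uint32_t ifindex, int prog_fd,
			       uint32_t resolved_id)
{
	slot->ifindex = ifindex;
	slot->bpf_prog.fd = prog_fd;		/* what the writer supplies */
	slot->bpf_prog.id = resolved_id;	/* what a later read reports */
	return slot->bpf_prog.id;
}
```

    Because fd and id share storage, the value stays two 32-bit words wide
    whichever member is in use.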
     
  • Add "rx_queue_mapping" to bpf_sock. This gives read access for the
    existing field (sk_rx_queue_mapping) of struct sock from bpf_sock.
    Semantics for the bpf_sock rx_queue_mapping access are similar to
    sk_rx_queue_get(), i.e., the value NO_QUEUE_MAPPING is not allowed
    and -1 is returned in that case. This is useful for transmit queue
    selection based on the received queue index which is cached in the
    socket in the receive path.

    v3: Addressed review comments to add usecase in patch description,
    and fixed default value for rx_queue_mapping.
    v2: fixed build error for CONFIG_XPS wrapping, reported by
    kbuild test robot

    Signed-off-by: Amritha Nambiar
    Signed-off-by: Alexei Starovoitov

    Amritha Nambiar
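
    The read semantics described in the entry above amount to hiding a
    sentinel value from the reader. A minimal sketch, assuming the
    sentinel is the largest 16-bit value; NO_QUEUE_MAPPING_SKETCH and
    rx_queue_mapping_read() are hypothetical names, not kernel symbols:

```c
#include <limits.h>
#include <stdint.h>

/* Assumed sentinel meaning "no receive queue has been recorded". */
#define NO_QUEUE_MAPPING_SKETCH USHRT_MAX

/*
 * Translate the stored 16-bit queue mapping into the value a reader
 * sees: the queue index, or -1 when nothing has been recorded.
 */
int rx_queue_mapping_read(uint16_t stored)
{
	if (stored == NO_QUEUE_MAPPING_SKETCH)
		return -1;
	return (int)stored;
}
```

    A program picking a transmit queue would treat -1 as "no hint
    available" and fall back to its default selection.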
     
  • Add helpers to use local socket storage.

    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/159033907577.12355.14740125020572756560.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov

    John Fastabend
     
  • Add these generic helpers that may be useful to use from sk_msg programs.
    The helpers do not depend on ctx so we can simply add them here:

    BPF_FUNC_perf_event_output
    BPF_FUNC_get_current_uid_gid
    BPF_FUNC_get_current_pid_tgid
    BPF_FUNC_get_current_cgroup_id
    BPF_FUNC_get_current_ancestor_cgroup_id
    BPF_FUNC_get_cgroup_classid

    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/159033903373.12355.15489763099696629346.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov

    John Fastabend
     
  • Pull crypto updates from Herbert Xu:
    "API:
    - Introduce crypto_shash_tfm_digest() and use it wherever possible.
    - Fix use-after-free and race in crypto_spawn_alg.
    - Add support for parallel and batch requests to crypto_engine.

    Algorithms:
    - Update jitter RNG for SP800-90B compliance.
    - Always use jitter RNG as seed in drbg.

    Drivers:
    - Add Arm CryptoCell driver cctrng.
    - Add support for SEV-ES to the PSP driver in ccp"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
    crypto: hisilicon - fix driver compatibility issue with different versions of devices
    crypto: engine - do not requeue in case of fatal error
    crypto: cavium/nitrox - Fix a typo in a comment
    crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
    crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
    crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
    crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
    crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
    crypto: hisilicon/qm - add debugfs to the QM state machine
    crypto: hisilicon/qm - add debugfs for QM
    crypto: stm32/crc32 - protect from concurrent accesses
    crypto: stm32/crc32 - don't sleep in runtime pm
    crypto: stm32/crc32 - fix multi-instance
    crypto: stm32/crc32 - fix run-time self test issue.
    crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
    crypto: hisilicon/zip - Use temporary sqe when doing work
    crypto: hisilicon - add device error report through abnormal irq
    crypto: hisilicon - remove codes of directly report device errors through MSI
    crypto: hisilicon - QM memory management optimization
    crypto: hisilicon - unify initial value assignment into QM
    ...

    Linus Torvalds
     
  • Add packet traps for packets that are sampled / trapped by ACLs, so that
    capable drivers could register them with devlink. Add documentation for
    every added packet trap and packet trap group.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Add layer 3 control packet traps such as ARP and DHCP, so that capable
    device drivers could register them with devlink. Add documentation for
    every added packet trap and packet trap group.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Add layer 2 control packet traps such as STP and IGMP query, so that
    capable device drivers could register them with devlink. Add
    documentation for every added packet trap and packet trap group.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • This type is used for traps that trap control packets such as ARP
    request and IGMP query to the CPU.

    Do not report such packets to the kernel's drop monitor as they were not
    dropped by the device nor encountered an exception during forwarding.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
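
    The reporting rule in the entry above can be sketched as a simple
    classification. The enum and function names below are hypothetical,
    and the assumption that the other two trap types are reported is
    inferred from the wording of the message rather than stated in it:

```c
#include <stdbool.h>

/* Hypothetical trap types: dropped, hit an exception, or control traffic. */
enum trap_type_sketch {
	TRAP_TYPE_DROP,
	TRAP_TYPE_EXCEPTION,
	TRAP_TYPE_CONTROL,
};

/* Decide whether a trapped packet should be reported to the drop monitor. */
bool report_to_drop_monitor(enum trap_type_sketch type)
{
	switch (type) {
	case TRAP_TYPE_DROP:
	case TRAP_TYPE_EXCEPTION:
		return true;
	case TRAP_TYPE_CONTROL:
		/* Neither dropped nor an exception: nothing to report. */
		return false;
	}
	return false;
}
```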
     
  • The action is used by control traps such as IGMP query. The packet is
    flooded by the device, but also trapped to the CPU in order for the
    software bridge to mark the receiving port as a multicast router port.
    Such packets are marked with 'skb->offload_fwd_mark = 1' in order to
    prevent the software bridge from flooding them again.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Packets that hit exceptions during layer 3 forwarding must be trapped to
    the CPU for the control plane to function properly. Create a dedicated
    group for them, so that user space could choose to assign a different
    policer for them.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Drivers do not register to netdev events to set up indirect blocks
    anymore. Remove __flow_indr_block_cb_register() and
    __flow_indr_block_cb_unregister().

    The frontends set up the callbacks through flow_indr_dev_setup_block().

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • Tunnel devices provide no dev->netdev_ops->ndo_setup_tc(...) interface.
    The tunnel device and route control plane does not provide an obvious
    way to relate tunnel and physical devices.

    This patch allows drivers to register a tunnel device offload handler
    for the tc and netfilter frontends through flow_indr_dev_register() and
    flow_indr_dev_unregister().

    The frontend calls flow_indr_dev_setup_offload() that iterates over the
    list of drivers that are offering tunnel device hardware offload
    support and it sets up the flow block for this tunnel device.

    If the driver module is removed, the indirect flow_block ends up with a
    stale callback reference. The module removal path triggers the
    dev_shutdown() path to remove the qdisc and the flow_blocks for the
    physical devices. However, this is not useful for tunnel devices, where
    the relation between the physical and the tunnel device is not explicit.

    This patch introduces a cleanup callback that is invoked when the driver
    module is removed to clean up the tunnel device flow_block. This patch
    defines struct flow_block_indr and it uses it from flow_block_cb to
    store the information that the front-end requires to perform the
    flow_block_cb cleanup on module removal.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
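
    The register / setup / cleanup flow described in the entry above can
    be modelled in userspace as a small callback registry. This is a
    sketch under stated assumptions, not the kernel implementation: every
    name below (indr_register(), indr_setup_offload(), indr_unregister(),
    the structs and the two example callbacks) is hypothetical, and
    fixed-size arrays stand in for the kernel's lists.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DRIVERS 4
#define MAX_BLOCKS  8

/* A driver callback says whether it can offload a given tunnel device. */
typedef bool (*indr_cb_t)(int tunnel_ifindex, void *cb_priv);

struct indr_driver {
	indr_cb_t cb;
	void *cb_priv;
	bool used;
};

/* Per-block record kept so the block can be torn down on driver removal. */
struct indr_block {
	int tunnel_ifindex;
	void *cb_priv;				/* identifies the owning driver */
	void (*cleanup)(struct indr_block *b);	/* front-end teardown hook */
	bool live;
};

struct indr_driver drivers[MAX_DRIVERS];
struct indr_block blocks[MAX_BLOCKS];

/* Driver side: register an offload handler. Returns 0, or -1 if full. */
int indr_register(indr_cb_t cb, void *cb_priv)
{
	for (size_t i = 0; i < MAX_DRIVERS; i++) {
		if (!drivers[i].used) {
			drivers[i] = (struct indr_driver){ cb, cb_priv, true };
			return 0;
		}
	}
	return -1;
}

/* Front end: offer the tunnel device to every registered driver. */
int indr_setup_offload(int tunnel_ifindex,
		       void (*cleanup)(struct indr_block *b))
{
	int bound = 0;

	for (size_t i = 0; i < MAX_DRIVERS; i++) {
		if (!drivers[i].used ||
		    !drivers[i].cb(tunnel_ifindex, drivers[i].cb_priv))
			continue;
		for (size_t j = 0; j < MAX_BLOCKS; j++) {
			if (!blocks[j].live) {
				blocks[j] = (struct indr_block){
					tunnel_ifindex, drivers[i].cb_priv,
					cleanup, true };
				bound++;
				break;
			}
		}
	}
	return bound;
}

/* Driver removal: run the cleanup hook for each block the driver owns. */
void indr_unregister(indr_cb_t cb, void *cb_priv)
{
	for (size_t j = 0; j < MAX_BLOCKS; j++) {
		if (blocks[j].live && blocks[j].cb_priv == cb_priv) {
			blocks[j].cleanup(&blocks[j]);
			blocks[j].live = false;
		}
	}
	for (size_t i = 0; i < MAX_DRIVERS; i++) {
		if (drivers[i].used && drivers[i].cb == cb &&
		    drivers[i].cb_priv == cb_priv)
			drivers[i].used = false;
	}
}

/* Example callbacks: accept every device, and count cleanup invocations. */
int cleanup_calls;

bool accept_any(int tunnel_ifindex, void *cb_priv)
{
	(void)tunnel_ifindex;
	(void)cb_priv;
	return true;
}

void count_cleanup(struct indr_block *b)
{
	(void)b;
	cleanup_calls++;
}
```

    Storing the cleanup hook next to each block is what lets removal tear
    the block down without knowing which physical device backs the tunnel.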
     

01 Jun, 2020

1 commit

  • xdp_umem.c had overlapping changes between the 64-bit math fix
    for the calculation of npgs and the removal of the zerocopy
    memory type which got rid of the chunk_size_nohdr member.

    The mlx5 Kconfig conflict is a case where we just take the
    net-next copy of the Kconfig entry dependency as it takes on
    the ESWITCH dependency by one level of indirection which is
    what the 'net' conflicting change is trying to ensure.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 May, 2020

1 commit

  • In commit 19e16d220f0a ("neigh: support smaller retrans_time settting")
    we added more accurate control for ARP and NS. But for ARP I forgot to
    update the latest guard in neigh_timer_handler(), so the next
    retransmit would be reset to jiffies + HZ/2 if we set the retrans_time
    to less than 500ms. Fix it by setting the time_before() check to HZ/100.

    IPv6 does not have this issue.

    Reported-by: Jianwen Ji
    Fixes: 19e16d220f0a ("neigh: support smaller retrans_time settting")
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu