25 Aug, 2020

1 commit

  • Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
    added to rtnetlink messages.

    Test commands:
    ip netns add nst
    ip link add dummy0 type dummy
    ip link add ipvlan0 link dummy0 type ipvlan
    ip link set ipvlan0 netns nst
    ip netns exec nst ip link show ipvlan0

    Result:
    ---Before---
    6: ipvlan0@if5: ...
    link/ether 82:3a:78:ab:60:50 brd ff:ff:ff:ff:ff:ff

    ---After---
    12: ipvlan0@if11: ...
    link/ether 42:b1:ad:57:4e:27 brd ff:ff:ff:ff:ff:ff link-netnsid 0
                                                       ~~~~~~~~~~~~~~
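
    A rough illustration of the effect, written as a toy Python model rather
    than kernel code: a link dump can only carry IFLA_LINK_NETNSID when the
    link type provides a get_link_net() callback and the lower device lives
    in a different namespace. The Dev class, fill_link_netnsid() and the
    nsids table are hypothetical names used only for this sketch.

```python
# Toy model (not kernel code): when does a link dump carry a link-netnsid?
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Dev:
    name: str
    netns: str                                              # namespace the device lives in
    get_link_net: Optional[Callable[["Dev"], str]] = None   # models rtnl_link_ops->get_link_net


def fill_link_netnsid(dev: Dev, nsid_of: dict) -> dict:
    """Return the extra attributes a link dump would carry for dev."""
    attrs = {}
    if dev.get_link_net is not None:
        link_net = dev.get_link_net(dev)
        if link_net != dev.netns:            # lower device sits in another namespace
            attrs["IFLA_LINK_NETNSID"] = nsid_of[link_net]
    return attrs


nsids = {"init_net": 0}                      # hypothetical id of init_net as seen from "nst"
before = Dev("ipvlan0", "nst")               # callback not assigned
after = Dev("ipvlan0", "nst", get_link_net=lambda d: "init_net")
print(fill_link_netnsid(before, nsids))      # {}
print(fill_link_netnsid(after, nsids))       # {'IFLA_LINK_NETNSID': 0}
```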

    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller

    Taehee Yoo
     

17 Aug, 2020

1 commit

  • Processing NETDEV_FEAT_CHANGE causes IPvlan links to lose
    NETIF_F_LLTX feature because of the incorrect handling of
    features in ipvlan_fix_features().

    --before--
    lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
    tx-lockless: on [fixed]
    lpaa10:~# ethtool -K ipvl0 tso off
    Cannot change tcp-segmentation-offload
    Actual changes:
    vlan-challenged: off [fixed]
    tx-lockless: off [fixed]
    lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
    tx-lockless: off [fixed]
    lpaa10:~#

    --after--
    lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
    tx-lockless: on [fixed]
    lpaa10:~# ethtool -K ipvl0 tso off
    Cannot change tcp-segmentation-offload
    Could not change any device features
    lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
    tx-lockless: on [fixed]
    lpaa10:~#

    Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
    Signed-off-by: Mahesh Bandewar
    Cc: Eric Dumazet
    Signed-off-by: David S. Miller

    Mahesh Bandewar
     

05 May, 2020

1 commit

  • This patch reverts the following commits:

    commit 064ff66e2bef84f1153087612032b5b9eab005bd
    "bonding: add missing netdev_update_lockdep_key()"

    commit 53d374979ef147ab51f5d632dfe20b14aebeccd0
    "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"

    commit 1f26c0d3d24125992ab0026b0dab16c08df947c7
    "net: fix kernel-doc warning in "

    commit ab92d68fc22f9afab480153bd82a20f6e2533769
    "net: core: add generic lockdep keys"

    but keeps the addr_list_lock_key because we still lock
    addr_list_lock nestedly on stacked devices. Unlike xmit_lock,
    this is safe because we don't take addr_list_lock on any fast
    path.

    Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
    Cc: Dmitry Vyukov
    Cc: Taehee Yoo
    Signed-off-by: Cong Wang
    Acked-by: Taehee Yoo
    Signed-off-by: David S. Miller

    Cong Wang
     

10 Mar, 2020

3 commits

  • Commit e18b353f102e ("ipvlan: add cond_resched_rcu() while
    processing muticast backlog") added a cond_resched_rcu() in a loop
    using rcu protection to iterate over slaves.

    This breaks RCU rules, so let's instead use cond_resched()
    at a point where we can reschedule.

    Fixes: e18b353f102e ("ipvlan: add cond_resched_rcu() while processing muticast backlog")
    Signed-off-by: Eric Dumazet
    Cc: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If a substantial number of slaves is created, as simulated by
    syzbot, the backlog processing can take much longer and result
    in the issue found in the syzbot report.

    INFO: rcu_sched detected stalls on CPUs/tasks:
    (detected by 1, t=10502 jiffies, g=5049, c=5048, q=752)
    All QSes seen, last rcu_sched kthread activity 10502 (4294965563-4294955061), jiffies_till_next_fqs=1, root ->qsmask 0x0
    syz-executor.1 R running task on cpu 1 10984 11210 3866 0x30020008 179034491270
    Call Trace:

    [] _sched_show_task kernel/sched/core.c:8063 [inline]
    [] _sched_show_task.cold+0x2fd/0x392 kernel/sched/core.c:8030
    [] sched_show_task+0xb/0x10 kernel/sched/core.c:8073
    [] print_other_cpu_stall kernel/rcu/tree.c:1577 [inline]
    [] check_cpu_stall kernel/rcu/tree.c:1695 [inline]
    [] __rcu_pending kernel/rcu/tree.c:3478 [inline]
    [] rcu_pending kernel/rcu/tree.c:3540 [inline]
    [] rcu_check_callbacks.cold+0xbb4/0xc29 kernel/rcu/tree.c:2876
    [] update_process_times+0x32/0x80 kernel/time/timer.c:1635
    [] tick_sched_handle+0xa0/0x180 kernel/time/tick-sched.c:161
    [] tick_sched_timer+0x44/0x130 kernel/time/tick-sched.c:1193
    [] __run_hrtimer kernel/time/hrtimer.c:1393 [inline]
    [] __hrtimer_run_queues+0x307/0xd90 kernel/time/hrtimer.c:1455
    [] hrtimer_interrupt+0x2ea/0x730 kernel/time/hrtimer.c:1513
    [] local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1031 [inline]
    [] smp_apic_timer_interrupt+0x144/0x5e0 arch/x86/kernel/apic/apic.c:1056
    [] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
    RIP: 0010:do_raw_read_lock+0x22/0x80 kernel/locking/spinlock_debug.c:153
    RSP: 0018:ffff8801dad07ab8 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff12
    RAX: 0000000000000000 RBX: ffff8801c4135680 RCX: 0000000000000000
    RDX: 1ffff10038826afe RSI: ffff88019d816bb8 RDI: ffff8801c41357f0
    RBP: ffff8801dad07ac0 R08: 0000000000004b15 R09: 0000000000310273
    R10: ffff88019d816bb8 R11: 0000000000000001 R12: ffff8801c41357e8
    R13: 0000000000000000 R14: ffff8801cfb19850 R15: ffff8801cfb198b0
    [] __raw_read_lock_bh include/linux/rwlock_api_smp.h:177 [inline]
    [] _raw_read_lock_bh+0x3e/0x50 kernel/locking/spinlock.c:240
    [] ipv6_chk_mcast_addr+0x11a/0x6f0 net/ipv6/mcast.c:1006
    [] ip6_mc_input+0x319/0x8e0 net/ipv6/ip6_input.c:482
    [] dst_input include/net/dst.h:449 [inline]
    [] ip6_rcv_finish+0x408/0x610 net/ipv6/ip6_input.c:78
    [] NF_HOOK include/linux/netfilter.h:292 [inline]
    [] NF_HOOK include/linux/netfilter.h:286 [inline]
    [] ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:278
    [] __netif_receive_skb_one_core+0x12a/0x1f0 net/core/dev.c:5303
    [] __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:5417
    [] process_backlog+0x216/0x6c0 net/core/dev.c:6243
    [] napi_poll net/core/dev.c:6680 [inline]
    [] net_rx_action+0x47b/0xfb0 net/core/dev.c:6748
    [] __do_softirq+0x2c8/0x99a kernel/softirq.c:317
    [] invoke_softirq kernel/softirq.c:399 [inline]
    [] irq_exit+0x16a/0x1a0 kernel/softirq.c:439
    [] exiting_irq arch/x86/include/asm/apic.h:561 [inline]
    [] smp_apic_timer_interrupt+0x165/0x5e0 arch/x86/kernel/apic/apic.c:1058
    [] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778

    RIP: 0010:__sanitizer_cov_trace_pc+0x26/0x50 kernel/kcov.c:102
    RSP: 0018:ffff880196033bd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
    RAX: ffff88019d8161c0 RBX: 00000000ffffffff RCX: ffffc90003501000
    RDX: 0000000000000002 RSI: ffffffff816236d1 RDI: 0000000000000005
    RBP: ffff880196033bd8 R08: ffff88019d8161c0 R09: 0000000000000000
    R10: 1ffff10032c067f0 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000
    [] do_futex+0x151/0x1d50 kernel/futex.c:3548
    [] C_SYSC_futex kernel/futex_compat.c:201 [inline]
    [] compat_SyS_futex+0x270/0x3b0 kernel/futex_compat.c:175
    [] do_syscall_32_irqs_on arch/x86/entry/common.c:353 [inline]
    [] do_fast_syscall_32+0x357/0xe1c arch/x86/entry/common.c:415
    [] entry_SYSENTER_compat+0x8b/0x9d arch/x86/entry/entry_64_compat.S:139
    RIP: 0023:0xf7f23c69
    RSP: 002b:00000000f5d1f12c EFLAGS: 00000282 ORIG_RAX: 00000000000000f0
    RAX: ffffffffffffffda RBX: 000000000816af88 RCX: 0000000000000080
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000816af8c
    RBP: 00000000f5d1f228 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    rcu_sched kthread starved for 10502 jiffies! g5049 c5048 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1
    rcu_sched R running task on cpu 1 13048 8 2 0x90000000 179099587640
    Call Trace:
    [] context_switch+0x60f/0xa60 kernel/sched/core.c:3209
    [] __schedule+0x5aa/0x1da0 kernel/sched/core.c:3934
    [] schedule+0x8f/0x1b0 kernel/sched/core.c:4011
    [] schedule_timeout+0x50d/0xee0 kernel/time/timer.c:1803
    [] rcu_gp_kthread+0xda1/0x3b50 kernel/rcu/tree.c:2327
    [] kthread+0x348/0x420 kernel/kthread.c:246
    [] ret_from_fork+0x56/0x70 arch/x86/entry/entry_64.S:393

    Fixes: ba35f8588f47 (“ipvlan: Defer multicast / broadcast processing to a work-queue”)
    Signed-off-by: Mahesh Bandewar
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Mahesh Bandewar
     
  • IPvlan in L3 mode discards outbound multicast packets, but it performs
    the check before ensuring that the Ethernet header is set. This is
    an error that Eric found through code browsing.

    Fixes: 2ad7bf363841 (“ipvlan: Initial check-in of the IPVLAN driver.”)
    Signed-off-by: Mahesh Bandewar
    Reported-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Mahesh Bandewar
     

09 Mar, 2020

1 commit

  • There is a problem when ipvlan slaves are created on a master device that
    is a vmxnet3 device (ipvlan in VMware guests). The vmxnet3 driver does not
    support unicast address filtering. When an ipvlan device is brought up in
    ipvlan_open(), the ipvlan driver calls dev_uc_add() to add the hardware
    address of the vmxnet3 master device to the unicast address list of the
    master device, phy_dev->uc. This inevitably leads to the vmxnet3 master
    device being forced into promiscuous mode by __dev_set_rx_mode().

    Promiscuous mode is switched on the master despite the fact that there is
    still only one hardware address that the master device should use for
    filtering in order for the ipvlan device to be able to receive packets.
    The comment above struct net_device describes the uc_promisc member as a
    "counter, that indicates, that promiscuous mode has been enabled due to
    the need to listen to additional unicast addresses in a device that does
    not implement ndo_set_rx_mode()". Moreover, the design of ipvlan
    guarantees that only the hardware address of a master device,
    phy_dev->dev_addr, will be used to transmit and receive all packets from
    its ipvlan slaves. Thus, the unicast address list of the master device
    should not be modified by ipvlan_open() and ipvlan_stop() in order to make
    ipvlan a workable option on masters that do not support unicast address
    filtering.
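
    The promiscuity fallback described above can be sketched with a toy
    Python model (not kernel code). The Master class and ipvlan_open()
    helper are hypothetical; supports_unicast_filter stands in for
    IFF_UNICAST_FLT, and the MAC address is made up.

```python
# Toy model (not kernel code) of the uc_promisc fallback in __dev_set_rx_mode().
class Master:
    def __init__(self, supports_unicast_filter, dev_addr):
        self.supports_unicast_filter = supports_unicast_filter  # models IFF_UNICAST_FLT
        self.dev_addr = dev_addr
        self.uc = []                 # secondary unicast address list
        self.uc_promisc = False
        self.promiscuity = 0

    def set_rx_mode(self):
        if self.supports_unicast_filter:
            return                   # hardware can filter; no fallback needed
        if self.uc and not self.uc_promisc:
            self.promiscuity += 1    # forced into promiscuous mode
            self.uc_promisc = True
        elif not self.uc and self.uc_promisc:
            self.promiscuity -= 1
            self.uc_promisc = False

    def uc_add(self, addr):
        self.uc.append(addr)
        self.set_rx_mode()


def ipvlan_open(master, add_master_addr):
    """add_master_addr=True models the old behaviour, False the fixed one."""
    if add_master_addr:
        master.uc_add(master.dev_addr)


old = Master(False, "00:0c:29:00:00:01")
ipvlan_open(old, add_master_addr=True)
new = Master(False, "00:0c:29:00:00:01")
ipvlan_open(new, add_master_addr=False)
print(old.promiscuity, new.promiscuity)  # 1 0
```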

    Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver")
    Reported-by: Per Sundstrom
    Signed-off-by: Jiri Wiesner
    Reviewed-by: Eric Dumazet
    Acked-by: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Jiri Wiesner
     

03 Nov, 2019

1 commit


25 Oct, 2019

1 commit

  • Some interface types could be nested.
    (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..)
    These interface types should set a lockdep class because, without a
    lockdep class key, lockdep always warns about nonexistent circular locking.

    In the current code, these interfaces have their own lockdep class keys
    and manage them themselves, so there is a lot of duplicated code around
    /drivers/net and /net/.
    This patch adds new generic lockdep keys and some helper functions for it.

    This patch makes the following changes:
    a) Add lockdep class keys in struct net_device
    - qdisc_running, xmit, addr_list, qdisc_busylock
    - these keys are used as dynamic lockdep keys.
    b) When net_device is being allocated, lockdep keys are registered.
    - alloc_netdev_mqs()
    c) When net_device is being freed, lockdep keys are unregistered.
    - free_netdev()
    d) Add generic lockdep key helper functions
    - netdev_register_lockdep_key()
    - netdev_unregister_lockdep_key()
    - netdev_update_lockdep_key()
    e) Remove unnecessary generic lockdep macros and functions
    f) Remove unnecessary lockdep code from each interface.

    After this patch, individual interface modules don't need to maintain
    their lockdep keys.

    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller

    Taehee Yoo
     

11 Oct, 2019

1 commit


17 Aug, 2019

1 commit

  • Allow encapsulated packets sent to tunnels layered over ipvlan to use
    offloads rather than forcing SW fallbacks.

    Since commit f21e5077010acda73a60 ("macvlan: add offload features for
    encapsulation"), macvlan has set dev->hw_enc_features to include
    everything in dev->features; do likewise in ipvlan.

    Signed-off-by: Bill Sommerfeld
    Acked-by: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Bill Sommerfeld
     

08 Jun, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.

    2) Read SFP eeprom in max 16 byte increments to avoid problems with
    some SFP modules, from Russell King.

    3) Fix UDP socket lookup wrt. VRF, from Tim Beale.

    4) Handle route invalidation properly in s390 qeth driver, from Julian
    Wiedmann.

    5) Memory leak on unload in RDS, from Zhu Yanjun.

    6) sctp_process_init leak, from Neil Horman.

    7) Fix fib_rules rule insertion semantic change that broke Android,
    from Hangbin Liu.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
    pktgen: do not sleep with the thread lock held.
    net: mvpp2: Use strscpy to handle stat strings
    net: rds: fix memory leak in rds_ib_flush_mr_pool
    ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
    ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
    Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
    net: aquantia: fix wol configuration not applied sometimes
    ethtool: fix potential userspace buffer overflow
    Fix memory leak in sctp_process_init
    net: rds: fix memory leak when unload rds_rdma
    ipv6: fix the check before getting the cookie in rt6_get_cookie
    ipv4: not do cache for local delivery if bc_forwarding is enabled
    s390/qeth: handle error when updating TX queue count
    s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
    s390/qeth: check dst entry before use
    s390/qeth: handle limited IPv4 broadcast in L3 TX path
    net: fix indirect calls helpers for ptype list hooks.
    net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
    udp: only choose unbound UDP socket for multicast when not in a VRF
    net/tls: replace the sleeping lock around RX resync with a bit lock
    ...

    Linus Torvalds
     

05 Jun, 2019

1 commit

  • There are some NICs, such as hinic, with NETIF_F_IP_CSUM and NETIF_F_TSO
    on but NETIF_F_HW_CSUM off. The ipvlan device features will then be
    NETIF_F_TSO on with NETIF_F_IP_CSUM and NETIF_F_HW_CSUM both off, as
    IPVLAN_FEATURES only cares about NETIF_F_HW_CSUM. So TSO will be
    disabled in netdev_fix_features().
    For example:
    Features for enp129s0f0:
    rx-checksumming: on
    tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on
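
    The feature arithmetic can be sketched as a toy Python model (not
    kernel code). The bit values are arbitrary, core_fix_features() only
    models the rule that TSO needs an IPv4-capable TX checksum offload,
    and new_mask assumes the fix widens the ipvlan mask to all checksum
    bits (as the Fixes tag suggests).

```python
# Toy model (not kernel code); bit values are arbitrary placeholders.
IP_CSUM, HW_CSUM, IPV6_CSUM, TSO, SG = 1, 2, 4, 8, 16
CSUM_MASK = IP_CSUM | HW_CSUM | IPV6_CSUM


def core_fix_features(features):
    """Drop TSO when neither generic nor IPv4 checksum offload is present."""
    if (features & TSO) and not (features & (HW_CSUM | IP_CSUM)):
        features &= ~TSO
    return features


def ipvlan_features(phy_features, ipvlan_mask):
    """Inherit the master's features through the driver mask, then sanitize."""
    return core_fix_features(phy_features & ipvlan_mask)


hinic_like = IP_CSUM | IPV6_CSUM | TSO | SG   # HW_CSUM off, as in the report
old_mask = SG | HW_CSUM | TSO                 # only cares about HW_CSUM
new_mask = SG | CSUM_MASK | TSO               # assumption: all checksum bits

old = ipvlan_features(hinic_like, old_mask)
new = ipvlan_features(hinic_like, new_mask)
print(bool(old & TSO), bool(new & TSO))       # False True
```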

    Fixes: a188222b6ed2 ("net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK")
    Signed-off-by: Miaohe Lin
    Signed-off-by: David S. Miller

    Miaohe Lin
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

2 commits


25 Feb, 2019

1 commit

  • Three conflicts, one of which, for marvell10g.c is non-trivial and
    requires some follow-up from Heiner or someone else.

    The issue is that Heiner converted the marvell10g driver over to
    use the generic c45 code as much as possible.

    However, in 'net' a bug fix appeared which makes sure that a new
    local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
    is cleared.

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Feb, 2019

1 commit

  • When running Docker with userns isolation e.g. --userns-remap="default"
    and spawning up some containers with CAP_NET_ADMIN under this realm, I
    noticed that link changes on ipvlan slave device inside that container
    can affect all devices from this ipvlan group which are in other net
    namespaces where the container should have no permission to make changes
    to, such as the init netns, for example.

    This effectively allows undoing ipvlan private mode and switching globally
    to bridge mode, where slaves can communicate directly without going through
    hostns, or switching the global operation mode (l2/l3/l3s) for everyone
    bound to the given ipvlan master device. The libnetwork plugin here creates
    an ipvlan master and an ipvlan slave in hostns, plus one slave per container
    that is moved into the container's netns upon creation event.

    * In hostns:

    # ip -d a
    [...]
    8: cilium_host@bond0: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    ipvlan mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.41.0.1/32 scope link cilium_host
    valid_lft forever preferred_lft forever
    [...]

    * Spawn container & change ipvlan mode setting inside of it:

    # docker run -dt --cap-add=NET_ADMIN --network cilium-net --name client -l app=test cilium/netperf
    9fff485d69dcb5ce37c9e33ca20a11ccafc236d690105aadbfb77e4f4170879c

    # docker exec -ti client ip -d a
    [...]
    10: cilium0@if4: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    ipvlan mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
    valid_lft forever preferred_lft forever

    # docker exec -ti client ip link change link cilium0 name cilium0 type ipvlan mode l2

    # docker exec -ti client ip -d a
    [...]
    10: cilium0@if4: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
    valid_lft forever preferred_lft forever

    * In hostns (mode switched to l2):

    # ip -d a
    [...]
    8: cilium_host@bond0: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.41.0.1/32 scope link cilium_host
    valid_lft forever preferred_lft forever
    [...]

    Same l3 -> l2 switch would also happen by creating another slave inside
    the container's network namespace when specifying the existing cilium0
    link to derive the actual (bond0) master:

    # docker exec -ti client ip link add link cilium0 name cilium1 type ipvlan mode l2

    # docker exec -ti client ip -d a
    [...]
    2: cilium1@if4: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    10: cilium0@if4: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
    valid_lft forever preferred_lft forever

    * In hostns:

    # ip -d a
    [...]
    8: cilium_host@bond0: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.41.0.1/32 scope link cilium_host
    valid_lft forever preferred_lft forever
    [...]

    One way to mitigate this is to check for CAP_NET_ADMIN in the
    ipvlan master device's ns, and only then allow changing the
    mode or flags for all devices bound to it. The two cases above
    are then disallowed after the patch.
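
    The mitigation can be sketched as a toy Python model (not kernel
    code): a mode change on the shared port is allowed only if the caller
    holds CAP_NET_ADMIN in the namespace of the master device. Port,
    Caller and change_mode() are hypothetical names for this sketch.

```python
# Toy model (not kernel code) of the capability check on the master's netns.
from dataclasses import dataclass, field


@dataclass
class Port:
    master_netns: str            # namespace owning the ipvlan master device
    mode: str = "l3"             # shared by every slave bound to the master


@dataclass
class Caller:
    netns_caps: dict = field(default_factory=dict)   # netns -> set of capabilities

    def ns_capable(self, netns, cap):
        return cap in self.netns_caps.get(netns, set())


def change_mode(caller, port, new_mode):
    """Refuse port-wide changes unless the caller is privileged for the master."""
    if not caller.ns_capable(port.master_netns, "CAP_NET_ADMIN"):
        raise PermissionError("no CAP_NET_ADMIN in the master's netns")
    port.mode = new_mode


port = Port(master_netns="hostns")
container = Caller({"container_ns": {"CAP_NET_ADMIN"}})
host_admin = Caller({"hostns": {"CAP_NET_ADMIN"}})

try:
    change_mode(container, port, "l2")
except PermissionError as err:
    print("denied:", err)
print(port.mode)                 # l3
change_mode(host_admin, port, "l2")
print(port.mode)                 # l2
```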

    Signed-off-by: Daniel Borkmann
    Acked-by: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Feb, 2019

2 commits

  • An ipvlan bug fix in 'net' conflicted with the abstraction away
    of the IPV6 specific support in 'net-next'.

    Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
    action conversion in 'net-next'.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Right now ipvlan has a hard dependency on CONFIG_NETFILTER and
    otherwise it cannot be built. However, the only ipvlan operation
    mode that actually depends on netfilter is l3s, everything else
    is independent of it. Break this hard dependency such that users
    are able to use ipvlan l3 mode on systems where netfilter is not
    compiled in.

    Therefore, this adds a hidden CONFIG_IPVLAN_L3S bool which is
    defaulting to y when CONFIG_NETFILTER is set in order to retain
    existing behavior for l3s. All l3s related code is refactored
    into ipvlan_l3s.c that is compiled in when enabled.

    Signed-off-by: Daniel Borkmann
    Cc: Mahesh Bandewar
    Cc: Florian Westphal
    Cc: Martynas Pumputis
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

31 Jan, 2019

1 commit

  • While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
    I ran into the issue that while l3 mode is working fine, l3s mode
    does not have any connectivity to kube-apiserver and hence all pods
    end up in Error state as well. The ipvlan master device sits on
    top of a bond device and hostns traffic to kube-apiserver (also running
    in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
    where the latter is the address of the bond0. While in l3 mode, a
    curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
    works fine from hostns, neither of them do in case of l3s. In the
    latter only a curl to https://127.0.0.1:37573 appeared to work where
    for local addresses of bond0 I saw kernel suddenly starting to emit
    ARP requests to query HW address of bond0 which remained unanswered
    and neighbor entries in INCOMPLETE state. These ARP requests only
    happen while in l3s.

    Debugging this further, I found the issue is that l3s mode is piggy-
    backing on l3 master device, and in this case local routes are using
    l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
    f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
    if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
    a loopback"). I found that reverting them back into using the
    net->loopback_dev fixed ipvlan l3s connectivity and got everything
    working for the CNI.

    Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
    l3mdev paper in [0], the sole reason why ipvlan l3s is relying
    on l3 master device is to get the l3mdev_ip_rcv() receive hook for
    setting the dst entry of the input route without adding its own
    ipvlan specific hacks into the receive path, however, any l3 domain
    semantics beyond just that are breaking l3s operation. Note that
    ipvlan also has the ability to dynamically switch its internal
    operation from l3 to l3s for all ports via ipvlan_set_port_mode()
    at runtime. In any case, l3 vs l3s solely distinguishes itself by
    'de-confusing' netfilter through switching skb->dev to ipvlan slave
    device late in NF_INET_LOCAL_IN before handing the skb to L4.

    The minimal fix taken here is to add an IFF_L3MDEV_RX_HANDLER flag which,
    if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
    without any additional l3mdev semantics on top. This should also have
    minimal impact since dev->priv_flags is already hot in cache. With
    this set, l3s mode is working fine and I also get things like
    masquerading pod traffic on the ipvlan master properly working.

    [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
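
    A toy Python model (not kernel code) of the distinction the fix relies
    on: a device can receive the L3 receive hook without being treated as
    an l3 master domain for local routes. The Dev class and both helper
    functions are hypothetical, and the sketch assumes the ipvlan master
    stops advertising itself as an l3 master once the new flag is used.

```python
# Toy model (not kernel code): rx hook ownership vs. local route device.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dev:
    name: str
    is_l3_master: bool = False            # full l3 master domain semantics
    l3_master: Optional["Dev"] = None     # set when the device is an l3 slave
    l3mdev_rx_handler: bool = False       # models IFF_L3MDEV_RX_HANDLER


def l3_rcv_hook_owner(dev):
    """Which device's L3 receive hook processes packets arriving on dev?"""
    if dev.l3_master is not None:
        return dev.l3_master
    if dev.is_l3_master or dev.l3mdev_rx_handler:
        return dev
    return None


def local_route_dev(dev):
    """Only a real l3 master domain pulls local routes away from loopback."""
    if dev.l3_master is not None:
        return dev.l3_master.name
    return dev.name if dev.is_l3_master else "lo"


old = Dev("ipvl_master", is_l3_master=True)
new = Dev("ipvl_master", l3mdev_rx_handler=True)
print(l3_rcv_hook_owner(old).name, local_route_dev(old))   # ipvl_master ipvl_master
print(l3_rcv_hook_owner(new).name, local_route_dev(new))   # ipvl_master lo
```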

    Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
    Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
    Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
    Signed-off-by: Daniel Borkmann
    Cc: Mahesh Bandewar
    Cc: David Ahern
    Cc: Florian Westphal
    Cc: Martynas Pumputis
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

14 Dec, 2018

1 commit

  • A NETDEV_CHANGEADDR event implies a change of address of each of the
    IPVLANs of this IPVLAN device. Therefore propagate NETDEV_PRE_CHANGEADDR
    to all the IPVLANs.

    Signed-off-by: Petr Machata
    Acked-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     

11 Dec, 2018

1 commit

  • Fix the following gcc warning:

    drivers/net/ipvlan/ipvlan_main.c:543:12: warning:
    comparison is always false due to limited range of data type [-Wtype-limits]

    'mode' is a u16 variable, IPVLAN_MODE_L2 is zero,
    the comparison is always false
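
    Why the comparison is dead can be shown with a toy Python model (not
    kernel code) that emulates a u16 by masking to 16 bits. as_u16() and
    the two mode_ok helpers are hypothetical, and the sketch assumes the
    fix simply drops the lower-bound test while keeping the upper bound.

```python
# Toy model (not kernel code): an unsigned 16-bit value is never below zero.
IPVLAN_MODE_L2, IPVLAN_MODE_L3, IPVLAN_MODE_L3S, IPVLAN_MODE_MAX = range(4)


def as_u16(value):
    """Emulate storing an integer into a u16 variable."""
    return value & 0xFFFF


def mode_ok_old(raw):
    mode = as_u16(raw)
    return not (mode < IPVLAN_MODE_L2 or mode >= IPVLAN_MODE_MAX)


def mode_ok_new(raw):
    return as_u16(raw) < IPVLAN_MODE_MAX


samples = range(-70000, 70000)
print(any(as_u16(v) < IPVLAN_MODE_L2 for v in samples))        # False
print(all(mode_ok_old(v) == mode_ok_new(v) for v in samples))  # True
```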

    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     

07 Dec, 2018

2 commits

  • In order to pass extack together with NETDEV_PRE_UP notifications, it's
    necessary to route the extack to __dev_open() from diverse (possibly
    indirect) callers. One prominent API through which the notification is
    invoked is dev_change_flags().

    Therefore extend dev_change_flags() with an extra extack argument and
    update all users. Most of the calls end up just encoding NULL, but
    several sites (VLAN, ipvlan, VRF, rtnetlink) do have extack available.

    Since the function declaration line is changed anyway, name the other
    function arguments to placate checkpatch.

    Signed-off-by: Petr Machata
    Acked-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Petr Machata
     
  • A follow-up patch will extend dev_change_flags() with an extack
    argument. Extend ipvlan_set_port_mode() to have that argument available
    for the conversion.

    Signed-off-by: Petr Machata
    Acked-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     

02 Jul, 2018

1 commit

  • After we change the ipvlan mode from l3 to l2, or vice versa, we only
    reset the IFF_NOARP flag but don't flush the ARP table cache, which
    causes eth->h_dest to be equal to eth->h_source in ipvlan_xmit_mode_l2().
    The packet then never leaves the host.

    Here is the reproducer on local host:

    ip link set eth1 up
    ip addr add 192.168.1.1/24 dev eth1
    ip link add link eth1 ipvlan1 type ipvlan mode l3

    ip netns add net1
    ip link set ipvlan1 netns net1
    ip netns exec net1 ip link set ipvlan1 up
    ip netns exec net1 ip addr add 192.168.2.1/24 dev ipvlan1

    ip route add 192.168.2.0/24 via 192.168.1.2
    ping 192.168.2.2 -c 2

    ip netns exec net1 ip link set ipvlan1 type ipvlan mode l2
    ping 192.168.2.2 -c 2

    Add the same configuration on remote host. After we set the mode to l2,
    we could find that the src/dst MAC addresses are the same on eth1:

    21:26:06.648565 00:b7:13:ad:d3:05 > 00:b7:13:ad:d3:05, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 58356, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.2.1 > 192.168.2.2: ICMP echo request, id 22686, seq 1, length 64

    Fix this by calling dev_change_flags(), which will call netdevice notifier
    with flag change info.
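
    The difference between a silent flag write and a notifying one can be
    sketched as a toy Python model (not kernel code). The Dev class and
    flush_on_noarp_change() are hypothetical stand-ins for the device and
    for the neighbour layer's notifier; the cached entry reuses the
    addresses from the capture above.

```python
# Toy model (not kernel code): notifying flag changes let listeners flush
# stale neighbour entries; silent flag writes leave them in place.
IFF_NOARP = 0x80


def flush_on_noarp_change(dev, changed):
    """Models the ARP notifier dropping cached entries when NOARP toggles."""
    if changed & IFF_NOARP:
        dev.neigh.clear()


class Dev:
    def __init__(self):
        self.flags = IFF_NOARP                     # l3 mode: no ARP on the slave
        # entry created while NOARP was set: destination MAC equals own MAC
        self.neigh = {"192.168.2.2": "00:b7:13:ad:d3:05"}
        self.notifiers = [flush_on_noarp_change]

    def change_flags(self, flags):
        """Models dev_change_flags(): update flags, then notify listeners."""
        changed = self.flags ^ flags
        self.flags = flags
        if changed:
            for callback in self.notifiers:
                callback(self, changed)


old = Dev()
old.flags &= ~IFF_NOARP                            # old path: silent flag write
new = Dev()
new.change_flags(new.flags & ~IFF_NOARP)           # fixed path: notifier runs
print(len(old.neigh), len(new.neigh))              # 1 0
```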

    v2:
    a) As pointed out by Wang Cong, check the return value of dev_change_flags()
    when changing dev flags.
    b) As suggested by Stefano and Sabrina, move the flags setting before l3mdev_ops,
    so we don't need to redo ipvlan_{, un}register_nf_hook() in the error path.

    Reported-by: Jianlin Shi
    Reviewed-by: Stefano Brivio
    Reviewed-by: Sabrina Dubroca
    Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver.")
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     

21 Jun, 2018

1 commit

  • Commit 296d48568042 ("ipvlan: inherit MTU from master device") adjusted
    the mtu from the master device when creating an ipvlan device, but it
    also overrides the mtu value set in rtnl_create_link(). This causes the
    IFLA_MTU param not to take effect.

    So this patch does not adjust the mtu if the IFLA_MTU param is set when
    creating an ipvlan device.
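
    The intended precedence can be sketched as a toy Python model (not
    kernel code). new_link_mtu() is a hypothetical helper, tb models the
    attribute table passed at creation time, and the 1500 default is only
    a placeholder for whatever the generic creation step leaves behind.

```python
# Toy model (not kernel code) of MTU selection when a link is created.
def new_link_mtu(master_mtu, tb, fixed=True):
    """Return the MTU the new device ends up with."""
    mtu = tb.get("IFLA_MTU", 1500)        # value left by the generic create step
    if not fixed or "IFLA_MTU" not in tb:
        mtu = master_mtu                  # inherit from the master device
    return mtu


print(new_link_mtu(9000, {"IFLA_MTU": 1400}, fixed=False))  # 9000
print(new_link_mtu(9000, {"IFLA_MTU": 1400}))               # 1400
print(new_link_mtu(9000, {}))                               # 9000
```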

    Fixes: 296d48568042 ("ipvlan: inherit MTU from master device")
    Reported-by: Jianlin Shi
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

20 Jun, 2018

1 commit

  • Similar to the fixes on team and bonding, this restores the ability
    to set an ipvlan device's mtu to anything higher than 1500.
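
    A toy Python model (not kernel code) of why the limit appeared:
    with core range checking, a device that keeps the Ethernet default
    max_mtu of 1500 rejects anything larger. validate_mtu() is a
    hypothetical helper, and the sketch assumes the fix raises max_mtu to
    ETH_MAX_MTU, as the analogous team and bonding fixes did.

```python
# Toy model (not kernel code) of core MTU range checking.
ETH_MIN_MTU = 68
ETH_DATA_LEN = 1500
ETH_MAX_MTU = 0xFFFF


def validate_mtu(new_mtu, min_mtu, max_mtu):
    """Return True if the core range check would accept new_mtu."""
    if new_mtu < min_mtu:
        return False
    if max_mtu > 0 and new_mtu > max_mtu:
        return False
    return True


print(validate_mtu(9000, ETH_MIN_MTU, ETH_DATA_LEN))  # False
print(validate_mtu(9000, ETH_MIN_MTU, ETH_MAX_MTU))   # True
```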

    Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

16 May, 2018

1 commit


28 Mar, 2018

1 commit


10 Mar, 2018

1 commit

  • Some network devices - notably ipvlan slave - are not compatible with
    any kind of rx_handler. Currently the hook can be installed but any
    configuration (bridge, bond, macsec, ...) is nonfunctional.

    This change allocates a priv_flag bit to mark such devices and explicitly
    forbids installing an rx_handler if that bit is set. The new bit is used
    by the ipvlan slave device.
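
    The registration rule can be sketched as a toy Python model (not
    kernel code). The Dev class, its no_rx_handler attribute (standing in
    for the new priv_flag bit) and rx_handler_register() are hypothetical,
    and the specific error codes are an assumption.

```python
# Toy model (not kernel code): refuse rx_handler registration on flagged devices.
import errno


class Dev:
    def __init__(self, no_rx_handler=False):
        self.no_rx_handler = no_rx_handler   # models the new priv_flag bit
        self.rx_handler = None


def rx_handler_register(dev, handler):
    """Return 0 on success or a negative errno, kernel style."""
    if dev.rx_handler is not None:
        return -errno.EBUSY                  # a handler is already installed
    if dev.no_rx_handler:
        return -errno.EINVAL                 # device cannot host any rx_handler
    dev.rx_handler = handler
    return 0


def bridge_handler(skb):
    return skb


eth = Dev()
ipvlan_slave = Dev(no_rx_handler=True)
print(rx_handler_register(eth, bridge_handler) == 0)                       # True
print(rx_handler_register(ipvlan_slave, bridge_handler) == -errno.EINVAL)  # True
```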

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

09 Mar, 2018

1 commit

  • The rx_handler field is rcu-protected, but I forgot to use the
    proper accessor while refactoring netif_is_ipvlan_port(). That
    function only checks the rx_handler value, so it is safe, but we need
    to properly read rx_handler via rcu_access_pointer() to avoid sparse
    warnings.

    Fixes: 1ec54cb44e67 ("net: unpollute priv_flags space")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

08 Mar, 2018

1 commit

  • The ipvlan device driver defines and uses 2 bits inside the priv_flags
    net_device field. These bits and the related helpers are used only
    inside the ipvlan device driver, and the core networking does not
    need to be aware of them.

    This change moves the netif_is_ipvlan* helpers into the ipvlan driver and
    re-implements them by looking for ipvlan-specific symbols instead of
    using priv_flags.

    Overall this frees two bits inside priv_flags - and moves the following
    ones to avoid gaps - without any intended functional change.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

05 Mar, 2018

2 commits

  • Currently we allow the creation of 8021q devices on top of
    ipvlan, but such devices are nonfunctional, as the underlying
    ipvlan rx_handler hook can't match the relevant traffic.

    Be explicit and forbid the creation of such nonfunctional devices.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • IPv6 does path selection for multipath routes deep in the lookup
    functions. The next patch adds L4 hash option and needs the skb
    for the forward path. To get the skb to the relevant FIB lookup
    functions it needs to go through the fib rules layer, so add a
    lookup_data argument to the fib_lookup_arg struct.

    Signed-off-by: David Ahern
    Reviewed-by: Ido Schimmel
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    David Ahern
     

01 Mar, 2018

2 commits

  • This changeset moves ipvlan addresses under RCU protection, using
    a per ipvlan device spinlock to protect list mutation and RCU
    read access to protect list traversal.

    Also explicitly use RCU read lock to traverse the per port
    ipvlans list, so that we can now perform a full address lookup
    without asserting the RTNL lock.

    Overall this allows the ipvlan driver to check fully for duplicate
    addresses - before this commit, ipv6 addresses assigned by autoconf
    via prefix delegation were accepted without any check - and avoids
    the following RTNL assertion failure, still in the same code path:

    RTNL: assertion failed at drivers/net/ipvlan/ipvlan_core.c (124)
    WARNING: CPU: 15 PID: 0 at drivers/net/ipvlan/ipvlan_core.c:124 ipvlan_addr_busy+0x97/0xa0 [ipvlan]
    Modules linked in: ipvlan(E) ixgbe
    CPU: 15 PID: 0 Comm: swapper/15 Tainted: G E 4.16.0-rc2.ipvlan+ #1782
    Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
    RIP: 0010:ipvlan_addr_busy+0x97/0xa0 [ipvlan]
    RSP: 0018:ffff881ff9e03768 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff881fdf2a9000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 00000000000000f6 RDI: 0000000000000300
    RBP: ffff881fdf2a8000 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: ffff881ff9e034c0 R12: ffff881fe07bcc00
    R13: 0000000000000001 R14: ffffffffa02002b0 R15: 0000000000000001
    FS: 0000000000000000(0000) GS:ffff881ff9e00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc5c1a4f248 CR3: 000000207e012005 CR4: 00000000001606e0
    Call Trace:

    ipvlan_addr6_event+0x6c/0xd0 [ipvlan]
    notifier_call_chain+0x49/0x90
    atomic_notifier_call_chain+0x6a/0x100
    ipv6_add_addr+0x5f9/0x720
    addrconf_prefix_rcv_add_addr+0x244/0x3c0
    addrconf_prefix_rcv+0x2f3/0x790
    ndisc_router_discovery+0x633/0xb70
    ndisc_rcv+0x155/0x180
    icmpv6_rcv+0x4ac/0x5f0
    ip6_input_finish+0x138/0x6a0
    ip6_input+0x41/0x1f0
    ipv6_rcv+0x4db/0x8d0
    __netif_receive_skb_core+0x3d5/0xe40
    netif_receive_skb_internal+0x89/0x370
    napi_gro_receive+0x14f/0x1e0
    ixgbe_clean_rx_irq+0x4ce/0x1020 [ixgbe]
    ixgbe_poll+0x31a/0x7a0 [ixgbe]
    net_rx_action+0x296/0x4f0
    __do_softirq+0xcf/0x4f5
    irq_exit+0xf5/0x110
    do_IRQ+0x62/0x110
    common_interrupt+0x91/0x91

    v1 -> v2: drop unneeded in_softirq check in ipvlan_addr6_validator_event()

    Fixes: e9997c2938b2 ("ipvlan: fix check for IP addresses in control path")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Currently, if IPv6 is enabled on top of an ipvlan device in l3
    mode, the following warning message:

    Dropped {multi|broad}cast of type= [86dd]

    is emitted every time an RS is generated, and dmesg is soon
    filled with irrelevant messages. Replace pr_warn with pr_debug,
    to preserve debuggability without scaring the sysadmin.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

28 Feb, 2018

1 commit

  • These pernet_operations unregister ipvlan net hooks.
    nf_unregister_net_hooks() removes hooks one by one
    and then frees the memory via RCU. This is similar
    to what happens when a new hook is added: a bigger
    memory region is allocated, the old content is copied,
    and the old memory is freed via RCU. So all of the net
    code should cope with this behavior. Also, at the time
    of hook unregistering there are no packets, and foreign
    net pernet_operations are not interested in others' hooks.
    So, we mark them as async.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

22 Feb, 2018

1 commit

  • IPVlan has a hard dependency on IPv6. Refactor the ipvlan code to allow
    compiling it with IPv6 disabled, move duplicate code into addr_equal()
    and refactor a series of if-else into a switch.

    Signed-off-by: Matteo Croce
    Signed-off-by: David S. Miller

    Matteo Croce
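
The shape of such a refactoring can be sketched in user space as follows. This is a minimal illustration, not the driver's addr_equal(): `struct model_addr`, `model_addr_equal()` and the `MODEL_ENABLE_IPV6` switch are hypothetical names that stand in for the driver's own types and for the kernel's IPv6 configuration option.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>

/* Hypothetical compile-time switch; build with -DMODEL_ENABLE_IPV6=0 to drop IPv6. */
#ifndef MODEL_ENABLE_IPV6
#define MODEL_ENABLE_IPV6 1
#endif

enum model_addr_type { MODEL_ADDR_V4, MODEL_ADDR_V6 };

/* One address of either family, tagged with its type. */
struct model_addr {
    enum model_addr_type type;
    union {
        struct in_addr ip4;
        struct in6_addr ip6;
    } u;
};

/* Single comparison helper replacing duplicated if-else chains. */
bool model_addr_equal(const struct model_addr *a, const struct model_addr *b)
{
    assert(a != NULL && b != NULL);

    if (a->type != b->type)
        return false;

    switch (a->type) {
    case MODEL_ADDR_V4:
        return a->u.ip4.s_addr == b->u.ip4.s_addr;
#if MODEL_ENABLE_IPV6
    case MODEL_ADDR_V6:
        return memcmp(&a->u.ip6, &b->u.ip6, sizeof(struct in6_addr)) == 0;
#endif
    default:
        /* Unknown or compiled-out family: never treated as equal. */
        return false;
    }
}
```

Keeping the family-specific comparison inside one switch means the IPv6 branch can be compiled out in a single place.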
     

16 Dec, 2017

1 commit

  • IPvlan currently scrubs packets at every location where packets may be
    crossing a namespace boundary. Though this is desirable, IPvlan currently
    does it more than necessary, e.g. packets that are going to take the
    dev_forward_skb() path will get scrubbed anyway, so there is no point in
    scrubbing them before forwarding. Another side effect of scrubbing is that
    pkt-type gets set to PACKET_HOST, which overrides what had already been
    set by the earlier path and causes erroneous delivery of the packets.

    Also, scrubbing packets just before calling dev_queue_xmit() has detrimental
    effects, since packets lose skb->sk and because of that miss prio updates,
    see incorrect socket back-pressure, and would even break TSQ.

    Fixes: b93dd49c1a35 ('ipvlan: Scrub skb before crossing the namespace boundary')
    Signed-off-by: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Mahesh Bandewar