05 Dec, 2020

1 commit

  • with the following tdc testcase:

    83be: (qdisc, fq_pie) Create FQ-PIE with invalid number of flows

    as fq_pie_init() fails, fq_pie_destroy() is called to clean up. Since the
    timer is not yet initialized, it's possible to observe a splat like this:

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 0 PID: 975 Comm: tc Not tainted 5.10.0-rc4+ #298
    Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
    Call Trace:
    dump_stack+0x99/0xcb
    register_lock_class+0x12dd/0x1750
    __lock_acquire+0xfe/0x3970
    lock_acquire+0x1c8/0x7f0
    del_timer_sync+0x49/0xd0
    fq_pie_destroy+0x3f/0x80 [sch_fq_pie]
    qdisc_create+0x916/0x1160
    tc_modify_qdisc+0x3c4/0x1630
    rtnetlink_rcv_msg+0x346/0x8e0
    netlink_unicast+0x439/0x630
    netlink_sendmsg+0x719/0xbf0
    sock_sendmsg+0xe2/0x110
    ____sys_sendmsg+0x5ba/0x890
    ___sys_sendmsg+0xe9/0x160
    __sys_sendmsg+0xd3/0x170
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [...]
    ODEBUG: assert_init not available (active state 0) object type: timer_list hint: 0x0
    WARNING: CPU: 0 PID: 975 at lib/debugobjects.c:508 debug_print_object+0x162/0x210
    [...]
    Call Trace:
    debug_object_assert_init+0x268/0x380
    try_to_del_timer_sync+0x6a/0x100
    del_timer_sync+0x9e/0xd0
    fq_pie_destroy+0x3f/0x80 [sch_fq_pie]
    qdisc_create+0x916/0x1160
    tc_modify_qdisc+0x3c4/0x1630
    rtnetlink_rcv_msg+0x346/0x8e0
    netlink_rcv_skb+0x120/0x380
    netlink_unicast+0x439/0x630
    netlink_sendmsg+0x719/0xbf0
    sock_sendmsg+0xe2/0x110
    ____sys_sendmsg+0x5ba/0x890
    ___sys_sendmsg+0xe9/0x160
    __sys_sendmsg+0xd3/0x170
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    fix it by moving timer_setup() before any failure path, as was done for
    'red' in commit 608b4adab178 ("net_sched: initialize timer earlier in
    red_init()").
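
    The ordering problem can be illustrated outside the kernel. The sketch
    below is a userspace model, not the actual patch: toy_timer, toy_sched and
    all toy_* functions are hypothetical stand-ins, and the flow-count bounds
    are only illustrative. It shows an init routine that sets up the timer
    before any early return, so the destroy routine can delete it
    unconditionally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel timer: 'initialized' models the
 * state that a timer setup call establishes and a delete call expects. */
struct toy_timer {
    bool initialized;
};

struct toy_sched {
    struct toy_timer adapt_timer;
    unsigned int flows;
};

void toy_timer_setup(struct toy_timer *t)
{
    t->initialized = true;
}

/* Returns false when asked to delete a timer that was never set up,
 * which models the situation reported in the splat. */
bool toy_timer_del(struct toy_timer *t)
{
    return t->initialized;
}

/* init: set up the timer first, then validate parameters. */
int toy_init(struct toy_sched *q, unsigned int flows)
{
    assert(q != NULL);
    toy_timer_setup(&q->adapt_timer);   /* before any failure path */

    if (flows == 0 || flows > 65535)    /* illustrative bounds only */
        return -1;                      /* caller then calls toy_destroy() */

    q->flows = flows;
    return 0;
}

/* destroy: safe after a failed init because the timer is already valid. */
bool toy_destroy(struct toy_sched *q)
{
    assert(q != NULL);
    return toy_timer_del(&q->adapt_timer);
}
```

    With this ordering, a failed toy_init() followed by toy_destroy() still
    deletes a properly initialized timer.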

    Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
    Signed-off-by: Davide Caratti
    Reviewed-by: Cong Wang
    Link: https://lore.kernel.org/r/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com
    Signed-off-by: Jakub Kicinski

    Davide Caratti
     

06 Aug, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Support 6GHz band in ath11k driver, from Rajkumar Manoharan.

    2) Support UDP segmentation in core TSO code, from Eric Dumazet.

    3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

    4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

    5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

    6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

    7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

    8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

    9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

    10) Switch qca8k driver over to phylink, from Jonathan McDowell.

    11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

    12) Several conversions over to generic power management, from Vaibhav
    Gupta.

    13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

    14) Various https url conversions, from Alexander A. Klimov.

    15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

    16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

    17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

    18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

    19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

    20) XDP support for xen-netfront, from Denis Kirjanov.

    21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

    22) Support EF100 chip in sfc driver, from Edward Cree.

    23) Add XDP support to mvpp2 driver, from Matteo Croce.

    24) Support MPTCP in sock_diag, from Paolo Abeni.

    25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

    26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

    27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

    28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

    29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

    30) Add rfc4884 support to icmp code, from Willem de Bruijn.

    31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

    32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

    33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

    34) Support TCP syncookies in MPTCP, from Florian Westphal.

    35) Fix several tricky cases of PMTU handling wrt. bridging, from Stefano
    Brivio.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
    net: thunderx: initialize VF's mailbox mutex before first usage
    usb: hso: remove bogus check for EINPROGRESS
    usb: hso: no complaint about kmalloc failure
    hso: fix bailout in error case of probe
    ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
    selftests/net: relax cpu affinity requirement in msg_zerocopy test
    mptcp: be careful on subflow creation
    selftests: rtnetlink: make kci_test_encap() return sub-test result
    selftests: rtnetlink: correct the final return value for the test
    net: dsa: sja1105: use detected device id instead of DT one on mismatch
    tipc: set ub->ifindex for local ipv6 address
    ipv6: add ipv6_dev_find()
    net: openvswitch: silence suspicious RCU usage warning
    Revert "vxlan: fix tos value before xmit"
    ptp: only allow phase values lower than 1 period
    farsync: switch from 'pci_' to 'dma_' API
    wan: wanxl: switch from 'pci_' to 'dma_' API
    hv_netvsc: do not use VF device if link is down
    dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
    net: macb: Properly handle phylink on at91sam9x
    ...

    Linus Torvalds
     

17 Jul, 2020

2 commits

  • This reverts commit aebe4426ccaa4838f36ea805cdf7d76503e65117.

    Signed-off-by: Petr Machata
    Signed-off-by: Jakub Kicinski

    Petr Machata
     
  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jul, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where they are not needed.

    [1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
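
    As an illustration of the annotation style, here is a small userspace
    sketch. The macro definition is an assumption modelled on the
    attribute-based form, with an empty-statement fallback for compilers that
    lack the attribute; steps_from() is a hypothetical function made up for
    this example.

```c
#include <assert.h>

/* Assumed definition: use the statement attribute when the compiler
 * supports it, otherwise fall back to an empty statement. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)
#endif

/* Returns how many setup steps run when starting from 'level'. */
int steps_from(int level)
{
    int steps = 0;

    assert(level >= 0);
    switch (level) {
    case 0:
        steps++;
        fallthrough;    /* deliberate: level 0 also does level 1 work */
    case 1:
        steps++;
        fallthrough;    /* deliberate: level 1 also does level 2 work */
    case 2:
        steps++;
        break;
    default:
        break;
    }
    return steps;
}
```

    Unlike a comment, the macro is visible to the compiler, so an
    unannotated fall-through can still be flagged by warnings such as
    -Wimplicit-fallthrough.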

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     

30 Jun, 2020

1 commit

  • A following patch introduces qevents, points in the qdisc algorithm where
    a packet can be processed by user-defined filters. Should this processing
    lead to a situation where a new packet is to be enqueued on the same port,
    holding the root lock would lead to deadlocks. To solve the issue, the
    qevent handler needs to unlock and relock the root lock when necessary.

    To that end, add the root lock argument to the qdisc op enqueue, and
    propagate throughout.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     

28 May, 2020

1 commit

  • this command hangs forever:

    # tc qdisc add dev eth0 root fq_pie flows 65536

    watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [tc:1028]
    [...]
    CPU: 1 PID: 1028 Comm: tc Not tainted 5.7.0-rc6+ #167
    RIP: 0010:fq_pie_init+0x60e/0x8b7 [sch_fq_pie]
    Code: 4c 89 65 50 48 89 f8 48 c1 e8 03 42 80 3c 30 00 0f 85 2a 02 00 00 48 8d 7d 10 4c 89 65 58 48 89 f8 48 c1 e8 03 42 80 3c 30 00 85 a7 01 00 00 48 8d 7d 18 48 c7 45 10 46 c3 23 00 48 89 f8 48
    RSP: 0018:ffff888138d67468 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    RAX: 1ffff9200018d2b2 RBX: ffff888139c1c400 RCX: ffffffffffffffff
    RDX: 000000000000c5e8 RSI: ffffc900000e5000 RDI: ffffc90000c69590
    RBP: ffffc90000c69580 R08: fffffbfff79a9699 R09: fffffbfff79a9699
    R10: 0000000000000700 R11: fffffbfff79a9698 R12: ffffc90000c695d0
    R13: 0000000000000000 R14: dffffc0000000000 R15: 000000002347c5e8
    FS: 00007f01e1850e40(0000) GS:ffff88814c880000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000067c340 CR3: 000000013864c000 CR4: 0000000000340ee0
    Call Trace:
    qdisc_create+0x3fd/0xeb0
    tc_modify_qdisc+0x3be/0x14a0
    rtnetlink_rcv_msg+0x5f3/0x920
    netlink_rcv_skb+0x121/0x350
    netlink_unicast+0x439/0x630
    netlink_sendmsg+0x714/0xbf0
    sock_sendmsg+0xe2/0x110
    ____sys_sendmsg+0x5b4/0x890
    ___sys_sendmsg+0xe9/0x160
    __sys_sendmsg+0xd3/0x170
    do_syscall_64+0x9a/0x370
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    we can't accept 65536 as a valid number for 'nflows', because the loop on
    'idx' in fq_pie_init() will never end. The extack message is correct, but
    it doesn't say that 0 is not a valid number for 'flows': while at it, fix
    this also. Add a tdc selftest to check correct validation of 'flows'.
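
    The never-ending loop can be reproduced in userspace. The sketch below
    assumes the index is 16 bits wide, which matches the behaviour described
    above but is not quoted from the source; count_iterations() and its 'cap'
    parameter are hypothetical and exist only so that the demonstration
    terminates.

```c
#include <assert.h>
#include <stdint.h>

/* Counts loop iterations for a 16-bit index running up to 'nflows',
 * giving up after 'cap' iterations so the demonstration terminates. */
uint32_t count_iterations(uint32_t nflows, uint32_t cap)
{
    uint32_t iterations = 0;
    uint16_t idx;   /* assumption: 16-bit index, as the description implies */

    assert(cap > 0);
    for (idx = 0; idx < nflows; idx++) {
        if (++iterations == cap)
            break;  /* idx wrapped back to 0 and would keep going */
    }
    return iterations;
}
```

    With a cap of 200000, a limit of 65535 stops after 65535 iterations,
    while a limit of 65536 only stops when the cap is hit, because a 16-bit
    index can never reach 65536.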

    CC: Ivan Vecera
    Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
    Signed-off-by: Davide Caratti
    Reviewed-by: Ivan Vecera
    Signed-off-by: David S. Miller

    Davide Caratti
     

05 Mar, 2020

1 commit

  • The variable pie_vars->accu_prob is used as an accumulator for
    probability values. Since probability values are scaled using the
    MAX_PROB macro denoting (2^64 - 1), pie_vars->accu_prob is
    likely to overflow as it is of type u64.

    The variable pie_vars->accu_prob_overflows counts the number of
    times the variable pie_vars->accu_prob overflows.

    The MAX_PROB macro needs to be equal to at least (2^39 - 1) in
    order to do precise calculations without any underflow. Thus
    MAX_PROB can be reduced to (2^56 - 1) without affecting the
    precision in calculations drastically. Doing so will eliminate
    the need for the variable pie_vars->accu_prob_overflows as the
    variable pie_vars->accu_prob will never overflow.

    Removing the variable pie_vars->accu_prob_overflows also reduces
    the size of the structure pie_vars to exactly 64 bytes.

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: Leslie Monis
    Signed-off-by: David S. Miller

    Leslie Monis
     

06 Feb, 2020

1 commit

  • The bug is that we call kfree_skb(skb) and then pass "skb" to
    qdisc_pkt_len(skb) on the next line, which is a use after free.
    Also Cong Wang points out that it's better to delay the actual
    frees until we drop the rtnl lock so we should use rtnl_kfree_skbs()
    instead of kfree_skb().
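
    The ordering rule behind the fix can be shown with a userspace sketch.
    toy_pkt, toy_stats and the toy_* functions are hypothetical stand-ins for
    the kernel structures and helpers; the deferred-free aspect suggested by
    Cong Wang is not modelled here. The point is only that the length is read
    before the buffer is released.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a packet buffer; not the kernel's sk_buff. */
struct toy_pkt {
    unsigned int len;
};

struct toy_stats {
    unsigned int dropped_bytes;
    unsigned int dropped_pkts;
};

struct toy_pkt *toy_pkt_alloc(unsigned int len)
{
    struct toy_pkt *p = malloc(sizeof(*p));

    if (p)
        p->len = len;
    return p;
}

/* Drops a packet: read everything needed from it first, free it last. */
void toy_drop(struct toy_pkt *pkt, struct toy_stats *st)
{
    unsigned int len;

    assert(pkt != NULL && st != NULL);
    len = pkt->len;     /* read while the buffer is still valid */
    free(pkt);          /* pkt must not be dereferenced after this */

    st->dropped_bytes += len;
    st->dropped_pkts++;
}
```

    Swapping the first two statements of toy_drop() would reproduce the
    use-after-free pattern described above.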

    Cc: Cong Wang
    Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
    Signed-off-by: Dan Carpenter
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Dan Carpenter
     

23 Jan, 2020

1 commit

  • Principles:
    - Packets are classified on flows.
    - This is a Stochastic model (as we use a hash, several flows might
    be hashed to the same slot)
    - Each flow has a PIE managed queue.
    - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.
    - For a given flow, packets are not reordered.
    - Drops during enqueue only.
    - ECN capability is off by default.
    - ECN threshold (if ECN is enabled) is at 10% by default.
    - Uses timestamps to calculate queue delay by default.

    Usage:
    tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
    [ target TIME ] [ tupdate TIME ]
    [ alpha NUMBER ] [ beta NUMBER ]
    [ quantum BYTES ] [ memory_limit BYTES ]
    [ ecnprob PERCENTAGE ] [ [no]ecn ]
    [ [no]bytemode ] [ [no_]dq_rate_estimator ]

    defaults:
    limit: 10240 packets, flows: 1024
    target: 15 ms, tupdate: 15 ms (in jiffies)
    alpha: 1/8, beta : 5/4
    quantum: device MTU, memory_limit: 32 MB
    ecnprob: 10%, ecn: off
    bytemode: off, dq_rate_estimator: off

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Sachin D. Patil
    Signed-off-by: V. Saicharan
    Signed-off-by: Mohit Bhasi
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani