06 Aug, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Support 6GHz band in ath11k driver, from Rajkumar Manoharan.

    2) Support UDP segmentation in core TSO code, from Eric Dumazet.

    3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

    4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

    5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

    6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

    7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

    8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

    9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

    10) Switch qca8k driver over to phylink, from Jonathan McDowell.

    11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

    12) Several conversions over to generic power management, from Vaibhav
    Gupta.

    13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

    14) Various https url conversions, from Alexander A. Klimov.

    15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

    16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

    17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

    18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

    19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

    20) XDP support for xen-netfront, from Denis Kirjanov.

    21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

    22) Support EF100 chip in sfc driver, from Edward Cree.

    23) Add XDP support to mvpp2 driver, from Matteo Croce.

    24) Support MPTCP in sock_diag, from Paolo Abeni.

    25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

    26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

    27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

    28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

    29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

    30) Add rfc4884 support to icmp code, from Willem de Bruijn.

    31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

    32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

    33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

    34) Support TCP syncookies in MPTCP, from Florian Westphal.

    35) Fix several tricky cases of PMTU handling wrt. bridging, from Stefano
    Brivio.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
    net: thunderx: initialize VF's mailbox mutex before first usage
    usb: hso: remove bogus check for EINPROGRESS
    usb: hso: no complaint about kmalloc failure
    hso: fix bailout in error case of probe
    ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
    selftests/net: relax cpu affinity requirement in msg_zerocopy test
    mptcp: be careful on subflow creation
    selftests: rtnetlink: make kci_test_encap() return sub-test result
    selftests: rtnetlink: correct the final return value for the test
    net: dsa: sja1105: use detected device id instead of DT one on mismatch
    tipc: set ub->ifindex for local ipv6 address
    ipv6: add ipv6_dev_find()
    net: openvswitch: silence suspicious RCU usage warning
    Revert "vxlan: fix tos value before xmit"
    ptp: only allow phase values lower than 1 period
    farsync: switch from 'pci_' to 'dma_' API
    wan: wanxl: switch from 'pci_' to 'dma_' API
    hv_netvsc: do not use VF device if link is down
    dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
    net: macb: Properly handle phylink on at91sam9x
    ...

    Linus Torvalds
     

17 Jul, 2020

2 commits

  • This reverts commit aebe4426ccaa4838f36ea805cdf7d76503e65117.

    Signed-off-by: Petr Machata
    Signed-off-by: Jakub Kicinski

    Petr Machata
     
  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
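For readers less familiar with perl, the two substitutions in the script above can be restated in Python. This is only an illustrative sketch: the function name `strip_uninitialized_var` is made up here, and it handles one source line at a time rather than editing files in place.

```python
import re

# Same two patterns as the perl one-liner above.
UNINIT_RE = re.compile(r'\buninitialized_var\(([^\)]+)\)')
COMMENT_RE = re.compile(r'\s*/\* (GCC be quiet|to make compiler happy) \*/$')


def strip_uninitialized_var(line: str) -> str:
    """Unwrap uninitialized_var(x) to x and drop the matching trailing comment."""
    line = UNINIT_RE.sub(r'\1', line)
    return COMMENT_RE.sub('', line)
```

Lines that contain neither pattern pass through unchanged, which is why the script could be run over every file that `git grep` reported.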

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jul, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through
    markings where they are unnecessary.

    [1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     

30 Jun, 2020

1 commit

  • A following patch introduces qevents, points in the qdisc algorithm where
    a packet can be processed by user-defined filters. Should this processing
    lead to a situation where a new packet is to be enqueued on the same port,
    holding the root lock would lead to deadlocks. To solve the issue, the
    qevent handler needs to unlock and relock the root lock when necessary.

    To that end, add the root lock argument to the qdisc op enqueue, and
    propagate throughout.
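The locking pattern described above can be modeled in a few lines of Python. This is a toy sketch under stated assumptions, not kernel code: `Port`, `xmit` and `mirror_qevent` are hypothetical names, and a `threading.Lock` stands in for the root lock.

```python
import threading


class Port:
    """Toy port whose queue is protected by a non-reentrant root lock."""

    def __init__(self):
        self.root_lock = threading.Lock()
        self.queue = []

    def enqueue(self, pkt, root_lock, qevent=None):
        # Like the qdisc enqueue op, this runs with the root lock held
        # and receives the lock so a handler can drop it.
        assert root_lock.locked()
        self.queue.append(pkt)
        if qevent is not None:
            qevent(self, pkt, root_lock)

    def xmit(self, pkt, qevent=None):
        with self.root_lock:
            self.enqueue(pkt, self.root_lock, qevent)


def mirror_qevent(port, pkt, root_lock):
    """Hypothetical handler that enqueues a copy on the same port."""
    root_lock.release()          # without this, port.xmit() would deadlock
    try:
        port.xmit("mirror:" + pkt)
    finally:
        root_lock.acquire()      # restore the caller's locking state
```

The handler can only drop the lock because `enqueue` passes it down, which is the reason the argument has to be propagated through every implementation.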

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     

27 Sep, 2019

1 commit

  • Recent changes that removed rtnl dependency from rules update path of tc
    also made tcf_block_put() function sleeping. This function is called from
    ops->destroy() of several Qdisc implementations, which in turn is called by
    qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
    which results in a sleeping-while-atomic BUG.

    Steps to reproduce for htb:

    tc qdisc add dev ens1f0 root handle 1: htb default 12
    tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps
    tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10
    tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps

    Resulting dmesg:

    [ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909
    [ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc
    [ 4791.152805] INFO: lockdep is turned off.
    [ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G W 5.3.0-rc8+ #721
    [ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
    [ 4791.155075] Call Trace:
    [ 4791.155803] dump_stack+0x85/0xc0
    [ 4791.156529] ___might_sleep.cold+0xac/0xbc
    [ 4791.157251] __mutex_lock+0x5b/0x960
    [ 4791.157966] ? console_unlock+0x363/0x5d0
    [ 4791.158676] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
    [ 4791.159395] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
    [ 4791.160103] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
    [ 4791.160815] tcf_block_put_ext.part.0+0x21/0x50
    [ 4791.161530] tcf_block_put+0x50/0x70
    [ 4791.162233] sfq_destroy+0x15/0x50 [sch_sfq]
    [ 4791.162936] qdisc_destroy+0x5f/0x160
    [ 4791.163642] htb_change_class.cold+0x5df/0x69d [sch_htb]
    [ 4791.164505] tc_ctl_tclass+0x19d/0x480
    [ 4791.165360] rtnetlink_rcv_msg+0x170/0x4b0
    [ 4791.166191] ? netlink_deliver_tap+0x95/0x400
    [ 4791.166907] ? rtnl_dellink+0x2d0/0x2d0
    [ 4791.167625] netlink_rcv_skb+0x49/0x110
    [ 4791.168345] netlink_unicast+0x171/0x200
    [ 4791.169058] netlink_sendmsg+0x224/0x3f0
    [ 4791.169771] sock_sendmsg+0x5e/0x60
    [ 4791.170475] ___sys_sendmsg+0x2ae/0x330
    [ 4791.171183] ? ___sys_recvmsg+0x159/0x1f0
    [ 4791.171894] ? do_wp_page+0x9c/0x790
    [ 4791.172595] ? __handle_mm_fault+0xcd3/0x19e0
    [ 4791.173309] __sys_sendmsg+0x59/0xa0
    [ 4791.174024] do_syscall_64+0x5c/0xb0
    [ 4791.174725] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 4791.175435] RIP: 0033:0x7f0aa41497b8
    [ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54
    [ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    [ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8
    [ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003
    [ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0
    [ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001
    [ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000

    In htb_change_class() function save parent->leaf.q to local temporary
    variable and put reference to it after sch tree lock is released in order
    not to call potentially sleeping cls API in atomic section. This is safe to
    do because Qdisc has already been reset by qdisc_purge_queue() inside sch
    tree lock critical section.
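The shape of the fix (detach under the lock, release after unlocking) can be sketched as a small Python model. All names here are illustrative assumptions; `Child.put()` merely stands in for a release path that may sleep.

```python
import threading


class Child:
    def __init__(self, tree_lock):
        self.tree_lock = tree_lock
        self.released = False

    def put(self):
        # Stand-in for a release that may sleep: it must not run
        # while the tree lock is held.
        assert not self.tree_lock.locked(), "would sleep in atomic context"
        self.released = True


class Parent:
    def __init__(self):
        self.tree_lock = threading.Lock()
        self.leaf_q = Child(self.tree_lock)

    def change_class(self):
        with self.tree_lock:
            old_q = self.leaf_q      # save the pointer in a local
            self.leaf_q = None       # detach it inside the critical section
        if old_q is not None:
            old_q.put()              # drop the reference after unlocking
        return old_q
```

Calling `put()` inside the `with` block would trip the assertion, which mirrors the sleeping-while-atomic report shown above.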

    Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex")
    Signed-off-by: Vlad Buslov
    Signed-off-by: David S. Miller

    Vlad Buslov
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 May, 2019

1 commit

  • In commit 3c75f6ee139d ("net_sched: sch_htb: add per class overlimits counter")
    we added an overlimits counter for each HTB class which could
    properly reflect how many times we use up all the bandwidth
    on each class. However, the overlimits counter in HTB qdisc
    does not; it is far larger than the sum over all HTB classes.
    In fact, this qdisc overlimits counter increases when we have
    no skb to dequeue, which happens more often than we run out of
    bandwidth.

    It makes more sense to make this qdisc overlimits counter just
    be a sum of each HTB class, in case people still get confused.

    I have verified this patch with one single HTB class, where HTB
    qdisc counters now always match HTB class counters as expected.

    Eric suggested we could fold this field into 'direct_pkts' as
    we only use its 32bit on 64bit CPU, this saves one cache line.

    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Cong Wang
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().
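To make the four options concrete, here is a toy validator in Python. It is a sketch under simplifying assumptions: attributes are plain `(type, payload)` pairs, the policy maps a type to an expected length (a missing entry plays the role of NLA_UNSPEC), and the names `Validate`, `DEPRECATED_STRICT`, `STRICT` and `validate` are illustrative rather than the kernel's identifiers.

```python
from enum import IntFlag


class Validate(IntFlag):
    LIBERAL = 0
    TRAILING = 1       # reject trailing data after the attributes
    MAXTYPE = 2        # reject attribute types above the known maximum
    UNSPEC = 4         # reject attributes without a real policy entry
    STRICT_ATTRS = 8   # require the exact attribute length


# What the old *_strict() parsers enforced, per the text above.
DEPRECATED_STRICT = Validate.TRAILING | Validate.MAXTYPE
# "Everything", the intended default for new users.
STRICT = DEPRECATED_STRICT | Validate.UNSPEC | Validate.STRICT_ATTRS


def validate(attrs, trailing, policy, flags):
    """attrs: list of (type, payload); policy: {type: expected length}."""
    maxtype = max(policy)
    for atype, payload in attrs:
        if atype > maxtype:
            if flags & Validate.MAXTYPE:
                return False
            continue
        expected = policy.get(atype)
        if expected is None:                 # no policy entry for this type
            if flags & Validate.UNSPEC:
                return False
            continue
        if len(payload) < expected:          # too short is always an error
            return False
        if len(payload) > expected and flags & Validate.STRICT_ATTRS:
            return False
    if trailing and flags & Validate.TRAILING:
        return False
    return True
```

With `Validate.LIBERAL` every check except the minimum length is skipped, which corresponds to the behavior of the parsers renamed to `*_parse_deprecated()`.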

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().
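The compatibility concern is easy to see on the type field alone. The sketch below models only `nlattr::nla_type`; the constants match the netlink UAPI header, while the `model_*` helpers are simplified stand-ins that take just an attribute type (the real kernel functions operate on an skb).

```python
NLA_F_NESTED = 1 << 15
NLA_F_NET_BYTEORDER = 1 << 14
# The type field is 16 bits wide; the top two bits are flags.
NLA_TYPE_MASK = ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER) & 0xFFFF


def model_nest_start_noflag(attrtype):
    """Old behavior: the type is emitted as-is."""
    return attrtype


def model_nest_start(attrtype):
    """New wrapper: same as above, with NLA_F_NESTED added."""
    return model_nest_start_noflag(attrtype | NLA_F_NESTED)


def model_nla_type(raw):
    """What a well-behaved parser does: mask out the flag bits."""
    return raw & NLA_TYPE_MASK
```

A userspace parser that compares the raw field against the attribute number stops matching once the flag is set, while one that masks first keeps working. That is why existing callers were moved to the `_noflag` variant instead of gaining the flag silently.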

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

02 Apr, 2019

2 commits

  • The same code to flush qdisc tree and purge the qdisc queue
    is duplicated in many places and in most cases it does not
    respect NOLOCK qdisc: the global backlog len is used and the
    per CPU values are ignored.

    This change addresses the above, factoring-out the relevant
    code and using the helpers introduced by the previous patch
    to fetch the correct backlog len.

    Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Classful qdiscs can't access directly the child qdiscs backlog
    length: if such qdisc is NOLOCK, per CPU values should be
    accounted instead.

    Most qdiscs do not respect the above. As a result, qstats fetching
    for most classful qdisc is currently incorrect: if the child qdisc is
    NOLOCK, it always reports 0 len backlog.

    This change introduces a pair of helpers to safely fetch
    both backlog and qlen and use them in stats class dumping
    functions, fixing the above issue and cleaning a bit the code.
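The idea behind such a helper can be sketched as follows. This is a toy Python model, not the kernel implementation: the `Qdisc` class and `fetch_qlen_backlog` are hypothetical, and per-CPU counters are represented by plain lists.

```python
class Qdisc:
    """Toy qdisc: NOLOCK instances count per CPU, others use one global pair."""

    def __init__(self, nolock, ncpus=4):
        self.nolock = nolock
        self.qlen = 0
        self.backlog = 0
        self.cpu_qlen = [0] * ncpus
        self.cpu_backlog = [0] * ncpus

    def enqueue(self, nbytes, cpu=0):
        if self.nolock:
            self.cpu_qlen[cpu] += 1
            self.cpu_backlog[cpu] += nbytes
        else:
            self.qlen += 1
            self.backlog += nbytes


def fetch_qlen_backlog(q):
    """Return (qlen, backlog), summing per-CPU values for NOLOCK qdiscs."""
    if q.nolock:
        return sum(q.cpu_qlen), sum(q.cpu_backlog)
    return q.qlen, q.backlog
```

Reading `q.qlen` and `q.backlog` directly on a NOLOCK child returns zero in this model, which is the symptom described above; going through the helper gives the real totals.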

    DRR needs also to access the child qdisc queue length, so it
    needs custom handling.

    Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

16 Jan, 2019

1 commit

  • Parent qdiscs may dereference the pointer to the enqueued skb after
    enqueue. However, both CAKE and TBF call consume_skb() on the original skb
    when splitting GSO packets, leading to a potential use-after-free in the
    parent. Fix this by avoiding dereferencing the skb pointer after enqueueing
    to the child.
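A minimal model of the rule "do not touch the skb after handing it to the child" is shown below. Python cannot produce a real use-after-free, so the hypothetical `Skb` class raises when read after being consumed; `child_enqueue` and `parent_enqueue` are likewise illustrative names.

```python
class Skb:
    def __init__(self, length):
        self._len = length
        self.freed = False

    @property
    def len(self):
        if self.freed:
            raise RuntimeError("use after free")
        return self._len


def child_enqueue(skb):
    """Child that splits a GSO packet and consumes the original skb."""
    skb.freed = True
    return True


def parent_enqueue(skb, stats):
    length = skb.len             # read everything needed before the handover
    if child_enqueue(skb):
        stats["backlog"] += length   # uses the saved value, not skb.len
        return True
    return False
```

Reading `skb.len` after `child_enqueue()` would raise in this model, which corresponds to the dereference the fix avoids.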

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     

26 Sep, 2018

1 commit

  • The current implementation of qdisc_destroy() decrements the Qdisc reference
    counter and only actually destroys the Qdisc if the reference counter reaches
    zero. Rename qdisc_destroy() to qdisc_put() so that the name better
    describes how this function is currently implemented and used.

    Extract code that deallocates Qdisc into new private qdisc_destroy()
    function. It is intended to be shared between regular qdisc_put() and its
    unlocked version that is introduced in next patch in this series.
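The split between "put" and "destroy" can be illustrated with a toy reference counter. This is a sketch with made-up names, not the kernel code: `put()` drops a reference and only calls the private `_destroy()` when the count reaches zero.

```python
class Qdisc:
    def __init__(self):
        self.refcnt = 1          # the creator holds the first reference
        self.destroyed = False

    def get(self):
        self.refcnt += 1


def _destroy(q):
    """Private teardown, shared by every put variant."""
    q.destroyed = True


def put(q):
    """Drop one reference; tear down only when it was the last one."""
    q.refcnt -= 1
    if q.refcnt == 0:
        _destroy(q)
```

Keeping the teardown in its own function is what lets a second, unlocked put variant reuse it without duplicating the deallocation logic.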

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     

11 Sep, 2018

2 commits


24 Jun, 2018

1 commit


02 Apr, 2018

1 commit


22 Dec, 2017

7 commits

  • This patch adds extack support for the function qdisc_create_dflt, which is
    a commonly used function in the tc subsystem. Callers which are interested
    in the resulting error can assign extack to get more detailed
    information on why qdisc_create_dflt failed. The function qdisc_create_dflt
    will also call an init callback, which can fail in any per-qdisc specific
    handling.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for the function tcf_block_get, which is
    a commonly used function in the tc subsystem. Callers which are interested
    in the resulting error can assign extack to get more detailed
    information on why tcf_block_get failed.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for the function qdisc_get_rtab, which is
    a commonly used function in the tc subsystem. Callers which are interested
    in the resulting error can assign extack to get more detailed
    information on why qdisc_get_rtab failed.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for graft callback to prepare per-qdisc
    specific changes for extack.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for block callback to prepare per-qdisc
    specific changes for extack.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for class change callback api. This prepares
    to handle extack support inside each specific class implementation.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for init callback to prepare per-qdisc
    specific changes for extack.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     

22 Oct, 2017

1 commit


17 Oct, 2017

1 commit


19 Sep, 2017

1 commit

  • HTB qdisc overlimits counter is properly increased, but we have no per
    class counter, meaning it is difficult to diagnose HTB problems.

    This patch adds this counter, visible in "tc -s class show dev eth0",
    with current iproute2.

    Signed-off-by: Eric Dumazet
    Reported-by: Denys Fedoryshchenko
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Sep, 2017

1 commit


31 Aug, 2017

1 commit

  • The commit below added a call to the ->destroy() callback for all qdiscs
    which failed in their ->init(), but some were not prepared for such
    change and can't handle partially initialized qdisc. HTB is one of them
    and if any error occurs before the qdisc watchdog timer and qdisc work are
    initialized then we can hit either a null ptr deref (timer->base) when
    canceling in ->destroy or lockdep error info about trying to register
    a non-static key and a stack dump. So to fix these two move the watchdog
    timer and workqueue init before anything that can err out.
    To reproduce userspace needs to send broken htb qdisc create request,
    tested with a modified tc (q_htb.c).
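The ordering issue can be modeled in Python as below. Everything here is a hypothetical stand-in (`Timer`, `Sched`, `create`): the point is only that `destroy()` runs after a failed `init()`, so whatever `destroy()` touches must be set up before the first step that can fail.

```python
class Timer:
    def __init__(self):
        self.active = False

    def cancel(self):
        self.active = False


class Sched:
    def __init__(self):
        self.watchdog = None     # not yet initialized

    def init(self, opt):
        self.watchdog = Timer()  # first: setup that cannot fail
        if opt is None:          # then: anything that can error out
            raise ValueError("missing options")

    def destroy(self):
        self.watchdog.cancel()   # assumes the watchdog always exists


def create(opt):
    s = Sched()
    try:
        s.init(opt)
    except ValueError:
        s.destroy()              # destroy is called on init failure too
        return None
    return s
```

If the option check came before `self.watchdog = Timer()`, the error path would call `cancel()` on `None`, the Python analogue of the NULL dereference in the trace.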

    Trace log:
    [ 2710.897602] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 2710.897977] IP: hrtimer_active+0x17/0x8a
    [ 2710.898174] PGD 58fab067
    [ 2710.898175] P4D 58fab067
    [ 2710.898353] PUD 586c0067
    [ 2710.898531] PMD 0
    [ 2710.898710]
    [ 2710.899045] Oops: 0000 [#1] SMP
    [ 2710.899232] Modules linked in:
    [ 2710.899419] CPU: 1 PID: 950 Comm: tc Not tainted 4.13.0-rc6+ #54
    [ 2710.899646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
    [ 2710.900035] task: ffff880059ed2700 task.stack: ffff88005ad4c000
    [ 2710.900262] RIP: 0010:hrtimer_active+0x17/0x8a
    [ 2710.900467] RSP: 0018:ffff88005ad4f960 EFLAGS: 00010246
    [ 2710.900684] RAX: 0000000000000000 RBX: ffff88003701e298 RCX: 0000000000000000
    [ 2710.900933] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003701e298
    [ 2710.901177] RBP: ffff88005ad4f980 R08: 0000000000000001 R09: 0000000000000001
    [ 2710.901419] R10: ffff88005ad4f800 R11: 0000000000000400 R12: 0000000000000000
    [ 2710.901663] R13: ffff88003701e298 R14: ffffffff822a4540 R15: ffff88005ad4fac0
    [ 2710.901907] FS: 00007f2f5e90f740(0000) GS:ffff88005d880000(0000) knlGS:0000000000000000
    [ 2710.902277] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 2710.902500] CR2: 0000000000000000 CR3: 0000000058ca3000 CR4: 00000000000406e0
    [ 2710.902744] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 2710.902977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 2710.903180] Call Trace:
    [ 2710.903332] hrtimer_try_to_cancel+0x1a/0x93
    [ 2710.903504] hrtimer_cancel+0x15/0x20
    [ 2710.903667] qdisc_watchdog_cancel+0x12/0x14
    [ 2710.903866] htb_destroy+0x2e/0xf7
    [ 2710.904097] qdisc_create+0x377/0x3fd
    [ 2710.904330] tc_modify_qdisc+0x4d2/0x4fd
    [ 2710.904511] rtnetlink_rcv_msg+0x188/0x197
    [ 2710.904682] ? rcu_read_unlock+0x3e/0x5f
    [ 2710.904849] ? rtnl_newlink+0x729/0x729
    [ 2710.905017] netlink_rcv_skb+0x6c/0xce
    [ 2710.905183] rtnetlink_rcv+0x23/0x2a
    [ 2710.905345] netlink_unicast+0x103/0x181
    [ 2710.905511] netlink_sendmsg+0x326/0x337
    [ 2710.905679] sock_sendmsg_nosec+0x14/0x3f
    [ 2710.905847] sock_sendmsg+0x29/0x2e
    [ 2710.906010] ___sys_sendmsg+0x209/0x28b
    [ 2710.906176] ? do_raw_spin_unlock+0xcd/0xf8
    [ 2710.906346] ? _raw_spin_unlock+0x27/0x31
    [ 2710.906514] ? __handle_mm_fault+0x651/0xdb1
    [ 2710.906685] ? check_chain_key+0xb0/0xfd
    [ 2710.906855] __sys_sendmsg+0x45/0x63
    [ 2710.907018] ? __sys_sendmsg+0x45/0x63
    [ 2710.907185] SyS_sendmsg+0x19/0x1b
    [ 2710.907344] entry_SYSCALL_64_fastpath+0x23/0xc2

    Note that probably this bug goes further back because the default qdisc
    handling always calls ->destroy on init failure too.

    Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
    Fixes: 0fbbeb1ba43b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

26 Aug, 2017

1 commit

  • For TC classes, their ->get() and ->put() are always paired, and the
    reference counting is completely useless, because:

    1) For class modification and dumping paths, we already hold RTNL lock,
    so all of these ->get(),->change(),->put() are atomic.

    2) For filter binding/unbinding, we use a different reference counter than
    this one, and they should have RTNL lock too.

    3) For ->qlen_notify(), it is special because it is called on ->enqueue()
    path, but we already hold qdisc tree lock there, and we hold this
    tree lock when graft or delete the class too, so it should not be gone
    or changed until we release the tree lock.

    Therefore, this patch removes ->get() and ->put(), but:

    1) Adds a new ->find() to find the pointer to a class by classid, no
    refcnt.

    2) Move the original class destroy upon the last refcnt into ->delete(),
    right after releasing tree lock. This is fine because the class is
    already removed from hash when holding the lock.

    For those who also use ->put() as ->unbind(), just rename them to reflect
    this change.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

17 Aug, 2017

1 commit

  • This callback is used for deactivating a class in the parent qdisc.
    It is cheaper to test the queue length right here.

    This also allows catching the draining of a screwed backlog and prevents a
    second deactivation of an already inactive parent class, which would
    crash the kernel for sure. The kernel will print a warning at destruction
    of a child qdisc that has no packets but a non-zero backlog.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     

16 Aug, 2017

1 commit

  • Traffic filters could keep direct pointers to classes in a classful qdisc,
    thus qdisc destruction first removes all filters before freeing classes.
    Class destruction methods also try to free attached filters, but now
    this isn't safe because tcf_block_put(), unlike tcf_destroy_chain(),
    cannot be called a second time.

    This patch sets class->block to NULL after the first tcf_block_put() and
    turns the second call into a no-op.
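The "clear the pointer, ignore NULL" pattern can be sketched like this. It is a toy Python model with hypothetical names (`Block`, `Class`, `block_put`, `destroy_class_filters`), where a second release of the same block would trip an assertion.

```python
class Block:
    def __init__(self):
        self.freed = False


class Class:
    def __init__(self):
        self.block = Block()


def block_put(block):
    if block is None:            # second call: nothing left to do
        return
    assert not block.freed, "double put"
    block.freed = True


def destroy_class_filters(cl):
    block_put(cl.block)
    cl.block = None              # make any later call a no-op
```

Both the qdisc-wide teardown and the per-class teardown can then call `destroy_class_filters()` without coordinating which one runs first.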

    Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     

07 Jun, 2017

1 commit

  • There is a need to instruct the HW offloaded path to push certain matched
    packets to the cpu/kernel for further analysis. So this patch introduces a
    new TRAP control action to TC.

    For kernel datapath, this action does not make much sense. So with the
    same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just
    dropped in the datapath (and virtually ejected to an upper level, which
    does not exist in case of kernel).

    Signed-off-by: Jiri Pirko
    Reviewed-by: Yotam Gigi
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Jiri Pirko
     

18 May, 2017

2 commits

  • Currently, the filter chains are directly put into the private structures
    of qdiscs. In order to be able to have multiple chains per qdisc and to
    allow filter chains sharing among qdiscs, there is a need for a common
    object that would hold the chains. This introduces such an object and calls
    it "tcf_block".

    Helpers to get and put the blocks are provided to be called from
    individual qdisc code. Also, the original filter_list pointers are left
    in qdisc privs to allow the entry into tcf_block processing without any
    added overhead of possible multiple pointer dereference on fast path.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Move tc_classify function to cls_api.c where it belongs, rename it to
    fit the namespace.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     

14 Apr, 2017

1 commit


13 Mar, 2017

1 commit

  • The original reason [1] for having hidden qdiscs (potential scalability
    issues in qdisc_match_from_root() with single linked list in case of large
    amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
    qdisc linked list to hashtable").

    This allows us to bring more clarity and determinism into the dump by
    making default pfifo qdiscs visible.

    We're not turning this on by default though, as it was deemed [2] too
    intrusive / unnecessary a change of default behavior towards userspace.
    Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
    applications to request complete qdisc hierarchy dump, including the
    ones that have always been implicit/invisible.

    Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
    about singletons would require quite some surgery with very little gain
    (seeing no qdisc or seeing noop qdisc in the dump is probably setting
    the same user expectation).

    [1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
    [2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

11 Feb, 2017

1 commit


06 Dec, 2016

1 commit

  • 1) Old code was hard to maintain, due to complex lock chains.
    (We probably will be able to remove some kfree_rcu() in callers)

    2) Using a single timer to update all estimators does not scale.

    3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
    is not supposed to work well)

    In this rewrite :

    - I removed the RB tree that had to be scanned in
    gen_estimator_active(). qdisc dumps should be much faster.

    - Each estimator has its own timer.

    - Estimations are maintained in net_rate_estimator structure,
    instead of dirtying the qdisc. Minor, but part of the simplification.

    - Reading the estimator uses RCU and a seqcount to provide proper
    support for 32bit kernels.

    - We reduce memory need when estimators are not used, since
    we store a pointer, instead of the bytes/packets counters.

    - xt_rateest_mt() no longer has to grab a spinlock.
    (In the future, xt_rateest_tg() could be switched to per cpu counters)
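The seqcount read protocol mentioned in the list above can be modeled in a few lines. This is a single-threaded Python sketch with hypothetical names; it shows the retry logic only and leaves out RCU and the memory barriers a real implementation needs.

```python
class RateEstimator:
    def __init__(self):
        self.seq = 0             # even: stable, odd: write in progress
        self.bps = 0
        self.pps = 0

    def write(self, bps, pps):
        self.seq += 1            # mark the update as in progress
        self.bps = bps
        self.pps = pps
        self.seq += 1            # publish the new pair

    def read(self):
        while True:
            start = self.seq
            if start & 1:        # writer active, try again
                continue
            bps, pps = self.bps, self.pps
            if self.seq == start:    # nothing changed while reading
                return bps, pps
```

The reader never takes a lock; it simply retries if the sequence number moved. That lets a 32-bit CPU read the two 64-bit quantities as a consistent pair, which a plain store cannot guarantee there.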

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet