06 Aug, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.

    2) Support UDP segmentation in code TSO code, from Eric Dumazet.

    3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

    4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

    5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

    6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

    7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

    8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

    9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

    10) Switch qca8k driver over to phylink, from Jonathan McDowell.

    11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

    12) Several conversions over to generic power management, from Vaibhav
    Gupta.

    13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

    14) Various https url conversions, from Alexander A. Klimov.

    15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

    16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

    17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

    18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

    19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

    20) XDP support for xen-netfront, from Denis Kirjanov.

    21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

    22) Support EF100 chip in sfc driver, from Edward Cree.

    23) Add XDP support to mvpp2 driver, from Matteo Croce.

    24) Support MPTCP in sock_diag, from Paolo Abeni.

    25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

    26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

    27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

    28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

    29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

    30) Add rfc4884 support to icmp code, from Willem de Bruijn.

    31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

    32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

    33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

    34) Support TCP syncookies in MPTCP, from Flowian Westphal.

    35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
    Brivio.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
    net: thunderx: initialize VF's mailbox mutex before first usage
    usb: hso: remove bogus check for EINPROGRESS
    usb: hso: no complaint about kmalloc failure
    hso: fix bailout in error case of probe
    ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
    selftests/net: relax cpu affinity requirement in msg_zerocopy test
    mptcp: be careful on subflow creation
    selftests: rtnetlink: make kci_test_encap() return sub-test result
    selftests: rtnetlink: correct the final return value for the test
    net: dsa: sja1105: use detected device id instead of DT one on mismatch
    tipc: set ub->ifindex for local ipv6 address
    ipv6: add ipv6_dev_find()
    net: openvswitch: silence suspicious RCU usage warning
    Revert "vxlan: fix tos value before xmit"
    ptp: only allow phase values lower than 1 period
    farsync: switch from 'pci_' to 'dma_' API
    wan: wanxl: switch from 'pci_' to 'dma_' API
    hv_netvsc: do not use VF device if link is down
    dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
    net: macb: Properly handle phylink on at91sam9x
    ...

    Linus Torvalds
     

17 Jul, 2020

2 commits

  • This reverts commit aebe4426ccaa4838f36ea805cdf7d76503e65117.

    Signed-off-by: Petr Machata
    Signed-off-by: Jakub Kicinski

    Petr Machata
     
  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jul, 2020

1 commit

  • Replace the existing /* fall through */ comments and its variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings when it is the case.

    [1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     

30 Jun, 2020

1 commit

  • A following patch introduces qevents, points in qdisc algorithm where
    packet can be processed by user-defined filters. Should this processing
    lead to a situation where a new packet is to be enqueued on the same port,
    holding the root lock would lead to deadlocks. To solve the issue, qevent
    handler needs to unlock and relock the root lock when necessary.

    To that end, add the root lock argument to the qdisc op enqueue, and
    propagate throughout.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     

01 Oct, 2019

1 commit

  • syzbot reported a crash in cbq_normalize_quanta() caused
    by an out of range cl->priority.

    iproute2 enforces this check, but malicious users do not.

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN PTI
    Modules linked in:
    CPU: 1 PID: 26447 Comm: syz-executor.1 Not tainted 5.3+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:cbq_normalize_quanta.part.0+0x1fd/0x430 net/sched/sch_cbq.c:902
    RSP: 0018:ffff8801a5c333b0 EFLAGS: 00010206
    RAX: 0000000020000003 RBX: 00000000fffffff8 RCX: ffffc9000712f000
    RDX: 00000000000043bf RSI: ffffffff83be8962 RDI: 0000000100000018
    RBP: ffff8801a5c33420 R08: 000000000000003a R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000002ef
    R13: ffff88018da95188 R14: dffffc0000000000 R15: 0000000000000015
    FS: 00007f37d26b1700(0000) GS:ffff8801dad00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000004c7cec CR3: 00000001bcd0a006 CR4: 00000000001626f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    [] cbq_normalize_quanta include/net/pkt_sched.h:27 [inline]
    [] cbq_addprio net/sched/sch_cbq.c:1097 [inline]
    [] cbq_set_wrr+0x2d7/0x450 net/sched/sch_cbq.c:1115
    [] cbq_change_class+0x987/0x225b net/sched/sch_cbq.c:1537
    [] tc_ctl_tclass+0x555/0xcd0 net/sched/sch_api.c:2329
    [] rtnetlink_rcv_msg+0x485/0xc10 net/core/rtnetlink.c:5248
    [] netlink_rcv_skb+0x17a/0x460 net/netlink/af_netlink.c:2510
    [] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5266
    [] netlink_unicast_kernel net/netlink/af_netlink.c:1324 [inline]
    [] netlink_unicast+0x536/0x720 net/netlink/af_netlink.c:1350
    [] netlink_sendmsg+0x89a/0xd50 net/netlink/af_netlink.c:1939
    [] sock_sendmsg_nosec net/socket.c:673 [inline]
    [] sock_sendmsg+0x12e/0x170 net/socket.c:684
    [] ___sys_sendmsg+0x81d/0x960 net/socket.c:2359
    [] __sys_sendmsg+0x105/0x1d0 net/socket.c:2397
    [] SYSC_sendmsg net/socket.c:2406 [inline]
    [] SyS_sendmsg+0x29/0x30 net/socket.c:2404
    [] do_syscall_64+0x528/0x770 arch/x86/entry/common.c:305
    [] entry_SYSCALL_64_after_hwframe+0x42/0xb7

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

02 Apr, 2019

2 commits

  • The same code to flush qdisc tree and purge the qdisc queue
    is duplicated in many places and in most cases it does not
    respect NOLOCK qdisc: the global backlog len is used and the
    per CPU values are ignored.

    This change addresses the above, factoring-out the relevant
    code and using the helpers introduced by the previous patch
    to fetch the correct backlog len.

    Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Classful qdiscs can't access directly the child qdiscs backlog
    length: if such qdisc is NOLOCK, per CPU values should be
    accounted instead.

    Most qdiscs no not respect the above. As a result, qstats fetching
    for most classful qdisc is currently incorrect: if the child qdisc is
    NOLOCK, it always reports 0 len backlog.

    This change introduces a pair of helpers to safely fetch
    both backlog and qlen and use them in stats class dumping
    functions, fixing the above issue and cleaning a bit the code.

    DRR needs also to access the child qdisc queue length, so it
    needs custom handling.

    Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

26 Sep, 2018

1 commit

  • Current implementation of qdisc_destroy() decrements Qdisc reference
    counter and only actually destroy Qdisc if reference counter value reached
    zero. Rename qdisc_destroy() to qdisc_put() in order for it to better
    describe the way in which this function currently implemented and used.

    Extract code that deallocates Qdisc into new private qdisc_destroy()
    function. It is intended to be shared between regular qdisc_put() and its
    unlocked version that is introduced in next patch in this series.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     

22 Dec, 2017

9 commits

  • This patch adds extack support for the cbq qdisc implementation by
    adding NL_SET_ERR_MSG in validation of user input.
    Also it serves to illustrate a use case of how the infrastructure ops
    api changes are to be used by individual qdiscs.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for the function qdisc_create_dflt which is
    a common used function in the tc subsystem. Callers which are interested
    in the receiving error can assign extack to get a more detailed
    information why qdisc_create_dflt failed. The function qdisc_create_dflt
    will also call an init callback which can fail by any per-qdisc specific
    handling.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for the function tcf_block_get which is
    a common used function in the tc subsystem. Callers which are interested
    in the receiving error can assign extack to get a more detailed
    information why tcf_block_get failed.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for the function qdisc_get_rtab which is
    a common used function in the tc subsystem. Callers which are interested
    in the receiving error can assign extack to get a more detailed
    information why qdisc_get_rtab failed.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for graft callback to prepare per-qdisc
    specific changes for extack.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for block callback to prepare per-qdisc
    specific changes for extack.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for class change callback api. This prepares
    to handle extack support inside each specific class implementation.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for init callback to prepare per-qdisc
    specific changes for extack.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch fix checkpatch issues for upcomming patches according to the
    sched api file. It changes mostly how to check on null pointer.

    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     

29 Nov, 2017

1 commit

  • q->link.block is not initialized, that leads to EINVAL when one tries to
    add filter there. So initialize it properly.

    This can be reproduced by:
    $ tc qdisc add dev eth0 root handle 1: cbq avpkt 1000 rate 1000Mbit bandwidth 1000Mbit
    $ tc filter add dev eth0 parent 1: protocol ip prio 100 u32 match ip protocol 0 0x00 flowid 1:1

    Reported-by: Jaroslav Aster
    Reported-by: Ivan Vecera
    Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
    Signed-off-by: Jiri Pirko
    Acked-by: Eelco Chaudron
    Reviewed-by: Ivan Vecera
    Signed-off-by: David S. Miller

    Jiri Pirko
     

22 Oct, 2017

1 commit


17 Oct, 2017

1 commit


02 Sep, 2017

1 commit


31 Aug, 2017

1 commit

  • CBQ can fail on ->init by wrong nl attributes or simply for missing any,
    f.e. if it's set as a default qdisc then TCA_OPTIONS (opt) will be NULL
    when it is activated. The first thing init does is parse opt but it will
    dereference a null pointer if used as a default qdisc, also since init
    failure at default qdisc invokes ->reset() which cancels all timers then
    we'll also dereference two more null pointers (timer->base) as they were
    never initialized.

    To reproduce:
    $ sysctl net.core.default_qdisc=cbq
    $ ip l set ethX up

    Crash log of the first null ptr deref:
    [44727.907454] BUG: unable to handle kernel NULL pointer dereference at (null)
    [44727.907600] IP: cbq_init+0x27/0x205
    [44727.907676] PGD 59ff4067
    [44727.907677] P4D 59ff4067
    [44727.907742] PUD 59c70067
    [44727.907807] PMD 0
    [44727.907873]
    [44727.907982] Oops: 0000 [#1] SMP
    [44727.908054] Modules linked in:
    [44727.908126] CPU: 1 PID: 21312 Comm: ip Not tainted 4.13.0-rc6+ #60
    [44727.908235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
    [44727.908477] task: ffff88005ad42700 task.stack: ffff880037214000
    [44727.908672] RIP: 0010:cbq_init+0x27/0x205
    [44727.908838] RSP: 0018:ffff8800372175f0 EFLAGS: 00010286
    [44727.909018] RAX: ffffffff816c3852 RBX: ffff880058c53800 RCX: 0000000000000000
    [44727.909222] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffff8800372175f8
    [44727.909427] RBP: ffff880037217650 R08: ffffffff81b0f380 R09: 0000000000000000
    [44727.909631] R10: ffff880037217660 R11: 0000000000000020 R12: ffffffff822a44c0
    [44727.909835] R13: ffff880058b92000 R14: 00000000ffffffff R15: 0000000000000001
    [44727.910040] FS: 00007ff8bc583740(0000) GS:ffff88005d880000(0000) knlGS:0000000000000000
    [44727.910339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [44727.910525] CR2: 0000000000000000 CR3: 00000000371e5000 CR4: 00000000000406e0
    [44727.910731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [44727.910936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [44727.911141] Call Trace:
    [44727.911291] ? lockdep_init_map+0xb6/0x1ba
    [44727.911461] ? qdisc_alloc+0x14e/0x187
    [44727.911626] qdisc_create_dflt+0x7a/0x94
    [44727.911794] ? dev_activate+0x129/0x129
    [44727.911959] attach_one_default_qdisc+0x36/0x63
    [44727.912132] netdev_for_each_tx_queue+0x3d/0x48
    [44727.912305] dev_activate+0x4b/0x129
    [44727.912468] __dev_open+0xe7/0x104
    [44727.912631] __dev_change_flags+0xc6/0x15c
    [44727.912799] dev_change_flags+0x25/0x59
    [44727.912966] do_setlink+0x30c/0xb3f
    [44727.913129] ? check_chain_key+0xb0/0xfd
    [44727.913294] ? check_chain_key+0xb0/0xfd
    [44727.913463] rtnl_newlink+0x3a4/0x729
    [44727.913626] ? rtnl_newlink+0x117/0x729
    [44727.913801] ? ns_capable_common+0xd/0xb1
    [44727.913968] ? ns_capable+0x13/0x15
    [44727.914131] rtnetlink_rcv_msg+0x188/0x197
    [44727.914300] ? rcu_read_unlock+0x3e/0x5f
    [44727.914465] ? rtnl_newlink+0x729/0x729
    [44727.914630] netlink_rcv_skb+0x6c/0xce
    [44727.914796] rtnetlink_rcv+0x23/0x2a
    [44727.914956] netlink_unicast+0x103/0x181
    [44727.915122] netlink_sendmsg+0x326/0x337
    [44727.915291] sock_sendmsg_nosec+0x14/0x3f
    [44727.915459] sock_sendmsg+0x29/0x2e
    [44727.915619] ___sys_sendmsg+0x209/0x28b
    [44727.915784] ? do_raw_spin_unlock+0xcd/0xf8
    [44727.915954] ? _raw_spin_unlock+0x27/0x31
    [44727.916121] ? __handle_mm_fault+0x651/0xdb1
    [44727.916290] ? check_chain_key+0xb0/0xfd
    [44727.916461] __sys_sendmsg+0x45/0x63
    [44727.916626] ? __sys_sendmsg+0x45/0x63
    [44727.916792] SyS_sendmsg+0x19/0x1b
    [44727.916950] entry_SYSCALL_64_fastpath+0x23/0xc2
    [44727.917125] RIP: 0033:0x7ff8bbc96690
    [44727.917286] RSP: 002b:00007ffc360991e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    [44727.917579] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007ff8bbc96690
    [44727.917783] RDX: 0000000000000000 RSI: 00007ffc36099230 RDI: 0000000000000003
    [44727.917987] RBP: ffff880037217f98 R08: 0000000000000001 R09: 0000000000000003
    [44727.918190] R10: 00007ffc36098fb0 R11: 0000000000000246 R12: 0000000000000006
    [44727.918393] R13: 000000000066f1a0 R14: 00007ffc360a12e0 R15: 0000000000000000
    [44727.918597] ? trace_hardirqs_off_caller+0xa7/0xcf
    [44727.918774] Code: 41 5f 5d c3 66 66 66 66 90 55 48 8d 56 04 45 31 c9
    49 c7 c0 80 f3 b0 81 48 89 e5 41 55 41 54 53 48 89 fb 48 8d 7d a8 48 83
    ec 48 b7 0e be 07 00 00 00 83 e9 04 e8 e6 f7 d8 ff 85 c0 0f 88 bb
    [44727.919332] RIP: cbq_init+0x27/0x205 RSP: ffff8800372175f0
    [44727.919516] CR2: 0000000000000000

    Fixes: 0fbbeb1ba43b ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

26 Aug, 2017

1 commit

  • For TC classes, their ->get() and ->put() are always paired, and the
    reference counting is completely useless, because:

    1) For class modification and dumping paths, we already hold RTNL lock,
    so all of these ->get(),->change(),->put() are atomic.

    2) For filter bindiing/unbinding, we use other reference counter than
    this one, and they should have RTNL lock too.

    3) For ->qlen_notify(), it is special because it is called on ->enqueue()
    path, but we already hold qdisc tree lock there, and we hold this
    tree lock when graft or delete the class too, so it should not be gone
    or changed until we release the tree lock.

    Therefore, this patch removes ->get() and ->put(), but:

    1) Adds a new ->find() to find the pointer to a class by classid, no
    refcnt.

    2) Move the original class destroy upon the last refcnt into ->delete(),
    right after releasing tree lock. This is fine because the class is
    already removed from hash when holding the lock.

    For those who also use ->put() as ->unbind(), just rename them to reflect
    this change.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

17 Aug, 2017

1 commit

  • This callback is used for deactivating class in parent qdisc.
    This is cheaper to test queue length right here.

    Also this allows to catch draining screwed backlog and prevent
    second deactivation of already inactive parent class which will
    crash kernel for sure. Kernel with print warning at destruction
    of child qdisc where no packets but backlog is not zero.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     

16 Aug, 2017

1 commit

  • Traffic filters could keep direct pointers to classes in classful qdisc,
    thus qdisc destruction first removes all filters before freeing classes.
    Class destruction methods also tries to free attached filters but now
    this isn't safe because tcf_block_put() unlike to tcf_destroy_chain()
    cannot be called second time.

    This patch set class->block to NULL after first tcf_block_put() and
    turn second call into no-op.

    Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     

07 Jun, 2017

1 commit

  • There is need to instruct the HW offloaded path to push certain matched
    packets to cpu/kernel for further analysis. So this patch introduces a
    new TRAP control action to TC.

    For kernel datapath, this action does not make much sense. So with the
    same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just
    dropped in the datapath (and virtually ejected to an upper level, which
    does not exist in case of kernel).

    Signed-off-by: Jiri Pirko
    Reviewed-by: Yotam Gigi
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Jiri Pirko
     

18 May, 2017

2 commits

  • Currently, the filter chains are direcly put into the private structures
    of qdiscs. In order to be able to have multiple chains per qdisc and to
    allow filter chains sharing among qdiscs, there is a need for common
    object that would hold the chains. This introduces such object and calls
    it "tcf_block".

    Helpers to get and put the blocks are provided to be called from
    individual qdisc code. Also, the original filter_list pointers are left
    in qdisc privs to allow the entry into tcf_block processing without any
    added overhead of possible multiple pointer dereference on fast path.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Move tc_classify function to cls_api.c where it belongs, rename it to
    fit the namespace.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     

14 Apr, 2017

1 commit


13 Mar, 2017

1 commit

  • The original reason [1] for having hidden qdiscs (potential scalability
    issues in qdisc_match_from_root() with single linked list in case of large
    amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
    qdisc linked list to hashtable").

    This allows us for bringing more clarity and determinism into the dump by
    making default pfifo qdiscs visible.

    We're not turning this on by default though, at it was deemed [2] too
    intrusive / unnecessary change of default behavior towards userspace.
    Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
    applications to request complete qdisc hierarchy dump, including the
    ones that have always been implicit/invisible.

    Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
    about singletons would require quite some surgery with very little gain
    (seeing no qdisc or seeing noop qdisc in the dump is probably setting
    the same user expectation).

    [1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
    [2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

11 Feb, 2017

1 commit


26 Dec, 2016

1 commit

  • ktime_set(S,N) was required for the timespec storage type and is still
    useful for situations where a Seconds and Nanoseconds part of a time value
    needs to be converted. For anything where the Seconds argument is 0, this
    is pointless and can be replaced with a simple assignment.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

06 Dec, 2016

1 commit

  • 1) Old code was hard to maintain, due to complex lock chains.
    (We probably will be able to remove some kfree_rcu() in callers)

    2) Using a single timer to update all estimators does not scale.

    3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
    is not supposed to work well)

    In this rewrite :

    - I removed the RB tree that had to be scanned in
    gen_estimator_active(). qdisc dumps should be much faster.

    - Each estimator has its own timer.

    - Estimations are maintained in net_rate_estimator structure,
    instead of dirtying the qdisc. Minor, but part of the simplification.

    - Reading the estimator uses RCU and a seqcount to provide proper
    support for 32bit kernels.

    - We reduce memory need when estimators are not used, since
    we store a pointer, instead of the bytes/packets counters.

    - xt_rateest_mt() no longer has to grab a spinlock.
    (In the future, xt_rateest_tg() could be switched to per cpu counters)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jun, 2016

1 commit

  • Qdisc performance suffers when packets are dropped at enqueue()
    time because drops (kfree_skb()) are done while qdisc lock is held,
    delaying a dequeue() draining the queue.

    Nominal throughput can be reduced by 50 % when this happens,
    at a time we would like the dequeue() to proceed as fast as possible.

    Even FQ is vulnerable to this problem, while one of FQ goals was
    to provide some flow isolation.

    This patch adds a 'struct sk_buff **to_free' parameter to all
    qdisc->enqueue(), and in qdisc_drop() helper.

    I measured a performance increase of up to 12 %, but this patch
    is a prereq so that future batches in enqueue() can fly.

    Signed-off-by: Eric Dumazet
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jun, 2016

2 commits

  • __QDISC_STATE_THROTTLED bit manipulation is rather expensive
    for HTB and few others.

    I already removed it for sch_fq in commit f2600cf02b5b
    ("net: sched: avoid costly atomic operation in fq_dequeue()")
    and so far nobody complained.

    When one ore more packets are stuck in one or more throttled
    HTB class, a htb dequeue() performs two atomic operations
    to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
    lock is held.

    Removing this pair of atomic operations bring me a 8 % performance
    increase on 200 TCP_RR tests, in presence of throttled classes.

    This patch has no side effect, since nothing actually uses
    disc_is_throttled() anymore.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • So far no qdisc ever unset the throttled bit at enqueue() time,
    so CBQ usage of qdisc_is_throttled() was flaky.

    Since __QDISC_STATE_THROTTLED set/unset is way too expensive
    considering that only CBQ was eventually caring for this status,
    it would make sense to implement a Qdisc ops ->is_throttled()
    if we find that this is needed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet