24 Oct, 2019

1 commit

  • UDP IPv6 packets auto flowlabels are using a 32bit secret
    (static u32 hashrnd in net/core/flow_dissector.c) and
    apply jhash() over fields known by the receivers.

    Attackers can easily infer the 32bit secret and use this information
    to identify a device and/or user, since this 32bit secret is only
    set at boot time.

    Really, using jhash() to generate cookies sent on the wire
    is a serious security concern.

    Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be
    a dead end. Trying to periodically change the secret (like in sch_sfq.c)
    could change paths taken in the network for long lived flows.

    Let's switch to siphash, as we did in commit df453700e8d8
    ("inet: switch IP ID generator to siphash")

    Using a cryptographically strong pseudo random function will solve this
    privacy issue and more generally remove other weak points in the stack.

    Packet schedulers using skb_get_hash_perturb() benefit from this change.

    Fixes: b56774163f99 ("ipv6: Enable auto flow labels by default")
    Fixes: 42240901f7c4 ("ipv6: Implement different admin modes for automatic flow labels")
    Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
    Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit")
    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Berger
    Reported-by: Amit Klein
    Reported-by: Benny Pinkas
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2019

1 commit

  • Recent changes that removed rtnl dependency from rules update path of tc
    also made tcf_block_put() function sleeping. This function is called from
    ops->destroy() of several Qdisc implementations, which in turn is called by
    qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
    which results sleeping-while-atomic BUG.

    Steps to reproduce for sfb:

    tc qdisc add dev ens1f0 handle 1: root sfb
    tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10
    tc qdisc change dev ens1f0 root handle 1: sfb

    Resulting dmesg:

    [ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909
    [ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc
    [ 7265.941455] INFO: lockdep is turned off.
    [ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G W 5.3.0-rc8+ #721
    [ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
    [ 7265.945396] Call Trace:
    [ 7265.946709] dump_stack+0x85/0xc0
    [ 7265.947994] ___might_sleep.cold+0xac/0xbc
    [ 7265.949282] __mutex_lock+0x5b/0x960
    [ 7265.950543] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
    [ 7265.951803] ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
    [ 7265.953022] tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
    [ 7265.954248] tcf_block_put_ext.part.0+0x21/0x50
    [ 7265.955478] tcf_block_put+0x50/0x70
    [ 7265.956694] sfq_destroy+0x15/0x50 [sch_sfq]
    [ 7265.957898] qdisc_destroy+0x5f/0x160
    [ 7265.959099] sfb_change+0x175/0x330 [sch_sfb]
    [ 7265.960304] tc_modify_qdisc+0x324/0x840
    [ 7265.961503] rtnetlink_rcv_msg+0x170/0x4b0
    [ 7265.962692] ? netlink_deliver_tap+0x95/0x400
    [ 7265.963876] ? rtnl_dellink+0x2d0/0x2d0
    [ 7265.965064] netlink_rcv_skb+0x49/0x110
    [ 7265.966251] netlink_unicast+0x171/0x200
    [ 7265.967427] netlink_sendmsg+0x224/0x3f0
    [ 7265.968595] sock_sendmsg+0x5e/0x60
    [ 7265.969753] ___sys_sendmsg+0x2ae/0x330
    [ 7265.970916] ? ___sys_recvmsg+0x159/0x1f0
    [ 7265.972074] ? do_wp_page+0x9c/0x790
    [ 7265.973233] ? __handle_mm_fault+0xcd3/0x19e0
    [ 7265.974407] __sys_sendmsg+0x59/0xa0
    [ 7265.975591] do_syscall_64+0x5c/0xb0
    [ 7265.976753] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 7265.977938] RIP: 0033:0x7f229069f7b8
    [ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5
    4
    [ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    [ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8
    [ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003
    [ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0
    [ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001
    [ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000

    In sfb_change() function use qdisc_purge_queue() instead of
    qdisc_tree_flush_backlog() to properly reset old child Qdisc and save
    pointer to it into local temporary variable. Put reference to Qdisc after
    sch tree lock is released in order not to call potentially sleeping cls API
    in atomic section. This is safe to do because Qdisc has already been reset
    by qdisc_purge_queue() inside sch tree lock critical section.

    Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com
    Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex")
    Suggested-by: Cong Wang
    Signed-off-by: Vlad Buslov
    Signed-off-by: David S. Miller

    Vlad Buslov
     

19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

02 Apr, 2019

1 commit

  • The same code to flush qdisc tree and purge the qdisc queue
    is duplicated in many places and in most cases it does not
    respect NOLOCK qdisc: the global backlog len is used and the
    per CPU values are ignored.

    This change addresses the above, factoring-out the relevant
    code and using the helpers introduced by the previous patch
    to fetch the correct backlog len.

    Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

26 Sep, 2018

1 commit

  • Current implementation of qdisc_destroy() decrements Qdisc reference
    counter and only actually destroy Qdisc if reference counter value reached
    zero. Rename qdisc_destroy() to qdisc_put() in order for it to better
    describe the way in which this function currently implemented and used.

    Extract code that deallocates Qdisc into new private qdisc_destroy()
    function. It is intended to be shared between regular qdisc_put() and its
    unlocked version that is introduced in next patch in this series.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     

22 Dec, 2017

7 commits


22 Oct, 2017

1 commit


17 Oct, 2017

1 commit


26 Aug, 2017

1 commit

  • For TC classes, their ->get() and ->put() are always paired, and the
    reference counting is completely useless, because:

    1) For class modification and dumping paths, we already hold RTNL lock,
    so all of these ->get(),->change(),->put() are atomic.

    2) For filter bindiing/unbinding, we use other reference counter than
    this one, and they should have RTNL lock too.

    3) For ->qlen_notify(), it is special because it is called on ->enqueue()
    path, but we already hold qdisc tree lock there, and we hold this
    tree lock when graft or delete the class too, so it should not be gone
    or changed until we release the tree lock.

    Therefore, this patch removes ->get() and ->put(), but:

    1) Adds a new ->find() to find the pointer to a class by classid, no
    refcnt.

    2) Move the original class destroy upon the last refcnt into ->delete(),
    right after releasing tree lock. This is fine because the class is
    already removed from hash when holding the lock.

    For those who also use ->put() as ->unbind(), just rename them to reflect
    this change.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

07 Jun, 2017

1 commit

  • There is need to instruct the HW offloaded path to push certain matched
    packets to cpu/kernel for further analysis. So this patch introduces a
    new TRAP control action to TC.

    For kernel datapath, this action does not make much sense. So with the
    same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just
    dropped in the datapath (and virtually ejected to an upper level, which
    does not exist in case of kernel).

    Signed-off-by: Jiri Pirko
    Reviewed-by: Yotam Gigi
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Jiri Pirko
     

18 May, 2017

2 commits

  • Currently, the filter chains are direcly put into the private structures
    of qdiscs. In order to be able to have multiple chains per qdisc and to
    allow filter chains sharing among qdiscs, there is a need for common
    object that would hold the chains. This introduces such object and calls
    it "tcf_block".

    Helpers to get and put the blocks are provided to be called from
    individual qdisc code. Also, the original filter_list pointers are left
    in qdisc privs to allow the entry into tcf_block processing without any
    added overhead of possible multiple pointer dereference on fast path.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Move tc_classify function to cls_api.c where it belongs, rename it to
    fit the namespace.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     

14 Apr, 2017

1 commit


13 Mar, 2017

1 commit

  • The original reason [1] for having hidden qdiscs (potential scalability
    issues in qdisc_match_from_root() with single linked list in case of large
    amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
    qdisc linked list to hashtable").

    This allows us for bringing more clarity and determinism into the dump by
    making default pfifo qdiscs visible.

    We're not turning this on by default though, at it was deemed [2] too
    intrusive / unnecessary change of default behavior towards userspace.
    Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
    applications to request complete qdisc hierarchy dump, including the
    ones that have always been implicit/invisible.

    Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
    about singletons would require quite some surgery with very little gain
    (seeing no qdisc or seeing noop qdisc in the dump is probably setting
    the same user expectation).

    [1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
    [2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

11 Feb, 2017

1 commit


23 Sep, 2016

1 commit


26 Jun, 2016

1 commit

  • Qdisc performance suffers when packets are dropped at enqueue()
    time because drops (kfree_skb()) are done while qdisc lock is held,
    delaying a dequeue() draining the queue.

    Nominal throughput can be reduced by 50 % when this happens,
    at a time we would like the dequeue() to proceed as fast as possible.

    Even FQ is vulnerable to this problem, while one of FQ goals was
    to provide some flow isolation.

    This patch adds a 'struct sk_buff **to_free' parameter to all
    qdisc->enqueue(), and in qdisc_drop() helper.

    I measured a performance increase of up to 12 %, but this patch
    is a prereq so that future batches in enqueue() can fly.

    Signed-off-by: Eric Dumazet
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Mar, 2016

2 commits

  • When the bottom qdisc decides to, for example, drop some packet,
    it calls qdisc_tree_decrease_qlen() to update the queue length
    for all its ancestors, we need to update the backlog too to
    keep the stats on root qdisc accurate.

    Cc: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     
  • Remove nearly duplicated code and prepare for the following patch.

    Cc: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

28 Aug, 2015

1 commit

  • For classifiers getting invoked via tc_classify(), we always need an
    extra function call into tc_classify_compat(), as both are being
    exported as symbols and tc_classify() itself doesn't do much except
    handling of reclassifications when tp->classify() returned with
    TC_ACT_RECLASSIFY.

    CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
    all others use tc_classify(). When tc actions are being configured
    out in the kernel, tc_classify() effectively does nothing besides
    delegating.

    We could spare this layer and consolidate both functions. pktgen on
    single CPU constantly pushing skbs directly into the netif_receive_skb()
    path with a dummy classifier on ingress qdisc attached, improves
    slightly from 22.3Mpps to 23.1Mpps.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

19 Aug, 2015

1 commit

  • Those were all workarounds for the formerly double meaning of
    tx_queue_len, which broke scheduling algorithms if untreated.

    Now that all in-tree drivers have been converted away from setting
    tx_queue_len = 0, it should be safe to drop these workarounds for
    categorically broken setups.

    Signed-off-by: Phil Sutter
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Phil Sutter
     

04 May, 2015

1 commit


30 Sep, 2014

1 commit


14 Sep, 2014

1 commit

  • rcu'ify tcf_proto this allows calling tc_classify() without holding
    any locks. Updaters are protected by RTNL.

    This patch prepares the core net_sched infrastracture for running
    the classifier/action chains without holding the qdisc lock however
    it does nothing to ensure cls_xxx and act_xxx types also work without
    locking. Additional patches are required to address the fall out.

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     

15 Jan, 2014

1 commit


12 Jul, 2012

1 commit


02 Apr, 2012

1 commit


10 Feb, 2012

1 commit

  • Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside
    of other data structures.

    This is intended to be used by IPoIB so that it can remember
    addressing information stored at hard_header_ops->create() time that
    it can fetch when the packet gets to the transmit routine.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Nov, 2011

1 commit

  • Current SFB double hashing is not fulfilling SFB theory, if two flows
    share same rxhash value.

    Using skb_flow_dissect() permits to really have better hash dispersion,
    and get tunnelling support as well.

    Double hashing point was mentioned by Florian Westphal

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Aug, 2011

1 commit


24 Feb, 2011

1 commit

  • This is the Stochastic Fair Blue scheduler, based on work from :

    W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
    Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.

    http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf

    This implementation is based on work done by Juliusz Chroboczek

    General SFB algorithm can be found in figure 14, page 15:

    B[l][n] : L x N array of bins (L levels, N bins per level)
    enqueue()
    Calculate hash function values h{0}, h{1}, .. h{L-1}
    Update bins at each level
    for i = 0 to L - 1
    if (B[i][h{i}].qlen > bin_size)
    B[i][h{i}].p_mark += p_increment;
    else if (B[i][h{i}].qlen == 0)
    B[i][h{i}].p_mark -= p_decrement;
    p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
    if (p_min == 1.0)
    ratelimit();
    else
    mark/drop with probabilty p_min;

    I did the adaptation of Juliusz code to meet current kernel standards,
    and various changes to address previous comments :

    http://thread.gmane.org/gmane.linux.network/90225
    http://thread.gmane.org/gmane.linux.network/90375

    Default flow classifier is the rxhash introduced by RPS in 2.6.35, but
    we can use an external flow classifier if wanted.

    tc qdisc add dev $DEV parent 1:11 handle 11: \
    est 0.5sec 2sec sfb limit 128

    tc filter add dev $DEV protocol ip parent 11: handle 3 \
    flow hash keys dst divisor 1024

    Notes:

    1) SFB default child qdisc is pfifo_fast. It can be changed by another
    qdisc but a child qdisc MUST not drop a packet previously queued. This
    is because SFB needs to handle a dequeued packet in order to maintain
    its virtual queue states. pfifo_head_drop or CHOKe should not be used.

    2) ECN is enabled by default, unlike RED/CHOKe/GRED

    With help from Patrick McHardy & Andi Kleen

    Signed-off-by: Eric Dumazet
    CC: Juliusz Chroboczek
    CC: Stephen Hemminger
    CC: Patrick McHardy
    CC: Andi Kleen
    CC: John W. Linville
    Signed-off-by: David S. Miller

    Eric Dumazet