05 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This avoids accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
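The overflow-to-use-after-free argument above can be sketched in plain C. This is a hypothetical userspace model of a saturating counter, not the kernel's actual refcount_t implementation (which uses atomics and emits warnings); all `model_*` names are invented for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical userspace model of a saturating reference counter.
 * The core idea mirrors refcount_t: saturate instead of wrapping, so
 * an attacker-driven overflow can never bring the counter back to
 * zero and cause a premature free (use-after-free). */
typedef struct { unsigned int refs; } model_refcount_t;

static void model_refcount_inc(model_refcount_t *r)
{
    if (r->refs == UINT_MAX)
        return;                 /* saturated: stay pinned forever */
    r->refs++;
}

static int model_refcount_dec_and_test(model_refcount_t *r)
{
    if (r->refs == UINT_MAX)
        return 0;               /* saturated object is leaked, never freed */
    return --r->refs == 0;      /* 1 means the caller may free the object */
}
```

With plain atomic_t, enough increments would wrap the counter to zero and a later decrement could free a still-referenced object; here the saturated counter simply leaks instead.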
     

01 Jul, 2017

3 commits

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This avoids accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
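The "inc, but only if not already zero" lookup pattern mentioned above can be modeled in a few lines. This is a hypothetical userspace sketch (the `model_*` names are invented), not the kernel API itself.

```c
#include <assert.h>

/* Hypothetical model of refcount_inc_not_zero() semantics: if the
 * counter has already reached zero, the object is being torn down,
 * so a concurrent lookup must not resurrect it by taking a new
 * reference. */
typedef struct { unsigned int refs; } model_refcount_t;

static int model_refcount_inc_not_zero(model_refcount_t *r)
{
    if (r->refs == 0)
        return 0;   /* object is going away; caller must not use it */
    r->refs++;
    return 1;       /* reference taken successfully */
}
```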
     
  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This avoids accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • A set of overlapping changes in macvlan and the rocker
    driver, nothing serious.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Jun, 2017

1 commit

  • When a qdisc fails to init, qdisc_create invokes the destroy callback
    to clean up. But there is no check that the callback actually exists,
    so a panic occurs for qdiscs that define no destroy callback, such as
    codel, fq, and so on.

    Take codel as an example:
    When a malicious user constructs an invalid netlink msg, it causes
    codel_init->codel_change->nla_parse_nested to fail.
    The kernel then invokes the destroy callback directly, but qdisc codel
    doesn't define one. The result is a panic.

    Now add a check for destroy to avoid the possible panic.

    Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
    Signed-off-by: Gao Feng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Gao Feng
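The error path and its fix can be sketched with a small userspace model. Everything here (`model_qdisc_ops`, the error value) is invented for illustration; the real qdisc_create works on struct Qdisc_ops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the error path described above: some ops
 * (like codel or fq in the commit) provide no destroy callback, so
 * the cleanup code must check the pointer before calling it. */
struct model_qdisc_ops {
    int  (*init)(void);
    void (*destroy)(void);   /* optional: may be NULL */
};

static int failing_init(void)
{
    return -22;              /* stands in for -EINVAL from nla_parse_nested */
}

static int model_qdisc_create(const struct model_qdisc_ops *ops)
{
    int err = ops->init();

    if (err) {
        if (ops->destroy)    /* the fix: guard the optional callback */
            ops->destroy();
        return err;
    }
    return 0;
}
```

Without the `if (ops->destroy)` guard, an ops table like codel's (init present, destroy NULL) would make the error path call a NULL function pointer, which is the panic the commit fixes.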
     

22 Jun, 2017

1 commit

  • In order to be able to retrieve the attached programs from cls_bpf
    and act_bpf, we need to expose the prog ids via netlink so that
    an application can later on get an fd based on the id through the
    BPF_PROG_GET_FD_BY_ID command, and dump related prog info via
    BPF_OBJ_GET_INFO_BY_FD command for bpf(2).

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jun, 2017

2 commits

  • Allow requesting a zero UDP checksum for encapsulated packets. The
    attribute is named "NO_CSUM" so that a missing attribute has the same
    meaning as an attribute set to 0.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
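The naming argument can be made concrete with a tiny model. The flag bit below is invented for illustration; only the negative-flag semantics are taken from the commit message.

```c
#include <assert.h>

/* Model of the attribute semantics described above: because the flag
 * is named "NO_CSUM", an absent attribute (flags == 0) means the same
 * as an explicitly zero attribute -- the UDP checksum is requested by
 * default. The bit value here is hypothetical. */
#define MODEL_TUNNEL_KEY_NO_CSUM (1u << 0)

static int model_want_udp_csum(unsigned int flags)
{
    return !(flags & MODEL_TUNNEL_KEY_NO_CSUM);
}
```

Had the flag been named "CSUM" instead, a missing attribute (0) would have silently meant "no checksum", breaking the checksum-by-default behavior the second commit below establishes.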
     
  • There's currently no way to request the (outer) UDP checksum with
    act_tunnel_key. This is a problem especially for IPv6. Right now, the
    tunnel_key action with IPv6 does not work without going through
    hassles: both sides have to have udp6zerocsumrx configured on the
    tunnel interface. This is obviously not a good universal solution.

    It makes more sense to compute the UDP checksum by default even for IPv4.
    Just set the default to request the checksum when using act_tunnel_key.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

15 Jun, 2017

3 commits

  • The conflicts were two cases of overlapping changes in
    batman-adv and the qed driver.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • I'm reviewing static checker warnings where we do ERR_PTR(0), which is
    the same as NULL. I'm pretty sure we intended to return ERR_PTR(-EINVAL)
    here. Sometimes these bugs lead to a NULL dereference but I don't
    immediately see that problem here.

    Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the conventional network headers")
    Signed-off-by: Dan Carpenter
    Acked-by: Amir Vadai
    Signed-off-by: David S. Miller

    Dan Carpenter
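Why ERR_PTR(0) is a bug becomes clear from a userspace model of the ERR_PTR()/IS_ERR() convention. The `model_*` names and the errno bound are a sketch of the kernel's include/linux/err.h scheme, not the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of the kernel's ERR_PTR()/IS_ERR() convention:
 * small negative errno values are encoded into the top page of the
 * address space. ERR_PTR(0) therefore encodes to NULL, which IS_ERR()
 * does NOT treat as an error -- so callers checking IS_ERR() would
 * happily dereference the "error" return. ERR_PTR(-EINVAL) encodes a
 * real error. */
#define MODEL_MAX_ERRNO 4095

static void *model_err_ptr(long err)
{
    return (void *)err;
}

static int model_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}
```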
     
  • Laura reported a sleep-in-atomic kernel warning inside
    tcf_act_police_init(), which calls gen_replace_estimator() with
    spinlock protection.

    The spinlock is not necessary in this case: we already hold the RTNL
    lock here, which is enough to protect against concurrent writers. The
    reader, i.e. tcf_act_police(), makes its decision based on this rate
    estimator; in the worst case we drop more or fewer packets than
    necessary while the rate is being changed in parallel, which is still
    acceptable.

    Reported-by: Laura Abbott
    Reported-by: Nick Huber
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

08 Jun, 2017

1 commit

  • We need to push the chain index down to the drivers, so they have the
    information about which chain a rule belongs to. For now, no driver
    supports multichain offload, so only chain 0 is supported. This is
    needed to prevent chains being squashed together during offload for
    now. Later this will be used to implement multichain offload.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
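The driver-side guard this enables can be sketched as follows. This is a hypothetical model (invented names and errno constant), not any specific driver's code.

```c
#include <assert.h>

/* Hypothetical sketch of the guard described above: while only chain 0
 * can be offloaded, a driver must reject rules destined for any other
 * chain instead of silently squashing all chains into one. */
#define MODEL_EOPNOTSUPP 95

static int model_offload_rule(unsigned int chain_index)
{
    if (chain_index != 0)
        return -MODEL_EOPNOTSUPP;  /* multichain offload not supported yet */
    return 0;                      /* chain 0: proceed with HW offload */
}
```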
     

07 Jun, 2017

1 commit

  • There is a need to instruct the HW offloaded path to push certain
    matched packets to the cpu/kernel for further analysis. So this patch
    introduces a new TRAP control action to TC.

    For the kernel datapath, this action does not make much sense. So,
    following the same logic as in HW, the new TRAP behaves similarly to
    STOLEN: the skb is simply dropped in the datapath (and virtually
    ejected to an upper level, which does not exist in the kernel case).

    Signed-off-by: Jiri Pirko
    Reviewed-by: Yotam Gigi
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Jiri Pirko
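The "TRAP behaves like STOLEN in software" dispatch can be modeled in a few lines. The enum values are invented; only the equivalence of the two verdicts in the kernel datapath is taken from the commit message.

```c
#include <assert.h>

/* Hypothetical model of the verdict handling described above: in the
 * kernel datapath TRAP is handled like STOLEN -- the skb is consumed
 * (dropped) rather than forwarded further. */
enum model_act { MODEL_ACT_OK, MODEL_ACT_STOLEN, MODEL_ACT_TRAP };

static int model_skb_consumed(enum model_act verdict)
{
    switch (verdict) {
    case MODEL_ACT_STOLEN:
    case MODEL_ACT_TRAP:   /* same treatment as STOLEN in software */
        return 1;
    default:
        return 0;          /* e.g. OK: packet continues on its way */
    }
}
```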
     

05 Jun, 2017

2 commits

  • It really makes no sense to have cls_act enabled without cls. In that
    case, the cls_act code is dead. So select it.

    This also fixes an issue recently reported by kbuild robot:
    [linux-next:master 1326/4151] net/sched/act_api.c:37:18: error: implicit declaration of function 'tcf_chain_get'

    Reported-by: kbuild test robot
    Fixes: db50514f9a9c ("net: sched: add termination action to allow goto chain")
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Benefit from the support of ip header fields dissection and
    allow users to set rules matching on ipv4 tos and ttl or
    ipv6 traffic-class and hoplimit.

    Signed-off-by: Or Gerlitz
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Or Gerlitz
     

26 May, 2017

2 commits


25 May, 2017

1 commit


23 May, 2017

4 commits


20 May, 2017

1 commit

  • skb->csum_not_inet carries the indication of which algorithm is
    needed to compute the checksum on the skb in the transmit path, when
    skb->ip_summed is equal to CHECKSUM_PARTIAL. If the skb carries an
    SCTP packet and the crc32c hasn't yet been written into the L4
    header, skb->csum_not_inet is set to 1; otherwise, the Internet
    Checksum is assumed to be needed and skb->csum_not_inet is set to 0.

    Suggested-by: Tom Herbert
    Signed-off-by: Davide Caratti
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Davide Caratti
     

18 May, 2017

11 commits

  • We still need to initialize err to -EINVAL for
    the case where 'opt' is NULL in dsmark_init().

    Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
    Signed-off-by: David S. Miller

    David S. Miller
     
  • Introduce a new type of termination action called "goto_chain". This
    allows the user to specify a chain to be processed. This action type
    is then processed as a return value in the tcf_classify loop in a
    similar way to "reclassify", only it does not reset to the first
    filter in the chain but rather to the first filter of the desired
    chain.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • The tp pointer will be needed by the next patch in order to get the
    chain.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Instead of having only one filter chain per block, introduce a list
    of chains for every block. Create chain 0 by default. The UAPI is
    extended so the user can specify which chain to change. If the new
    attribute is not specified, chain 0 is used, maintaining backward
    compatibility. If a chain does not exist and the user wants to
    manipulate it, a new chain is created with the specified index. Also,
    when the last filter is removed from a chain, the chain is destroyed.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
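The get-or-create lookup described above can be sketched as a simple list walk. This is a hypothetical userspace model (invented `model_*` names, plain calloc instead of kernel allocation, no locking or destroy-on-empty path).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of the chain lookup semantics described above:
 * looking up a chain by index creates it on demand, and index 0 plays
 * the role of the default chain when the user gives no index. */
struct model_chain {
    unsigned int index;
    struct model_chain *next;
};

static struct model_chain *model_chains;

static struct model_chain *model_chain_get(unsigned int index)
{
    struct model_chain *c;

    for (c = model_chains; c; c = c->next)
        if (c->index == index)
            return c;            /* existing chain is reused */

    c = calloc(1, sizeof(*c));   /* not found: create on demand */
    if (!c)
        return NULL;
    c->index = index;
    c->next = model_chains;
    model_chains = c;
    return c;
}
```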
     
  • Since there will be multiple chains to dump, push chain dumping code to
    a separate function.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Introduce struct tcf_chain object and set of helpers around it. Wraps up
    insertion, deletion and search in the filter chain.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Call the helper from the function rather than always adjusting the
    return value of the function.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • The use of the "nprio" variable in tc_ctl_tfilter is a bit cryptic
    and makes a reader wonder what is going on for a while. So help them
    understand this priority allocation dance a little better.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Make the name consistent with the rest of the helpers around.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Currently, the filter chains are directly put into the private
    structures of qdiscs. In order to be able to have multiple chains per
    qdisc and to allow filter chain sharing among qdiscs, a common object
    is needed to hold the chains. This introduces such an object and
    calls it "tcf_block".

    Helpers to get and put the blocks are provided to be called from
    individual qdisc code. Also, the original filter_list pointers are left
    in qdisc privs to allow the entry into tcf_block processing without any
    added overhead of possible multiple pointer dereference on fast path.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Move tc_classify function to cls_api.c where it belongs, rename it to
    fit the namespace.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     

17 May, 2017

1 commit

  • BBR congestion control depends on pacing, and pacing is
    currently handled by the sch_fq packet scheduler for performance
    reasons, and also because implementing pacing with FQ was convenient
    to truly avoid bursts.

    However there are many cases where this packet scheduler constraint
    is not practical:
    - Many Linux hosts are not focused on handling thousands of TCP
    flows in the most efficient way.
    - Some routers use fq_codel or another AQM, but would still like
    to use BBR for the few TCP flows they initiate/terminate.

    This patch implements an automatic fallback to internal pacing.

    Pacing is requested either by BBR or by use of the SO_MAX_PACING_RATE
    option.

    If sch_fq happens to be in the egress path, pacing is delegated to
    the qdisc, otherwise pacing is done by TCP itself.

    One advantage of pacing from the TCP stack is more precise rtt
    estimations and less work done at TX completion, since TCP Small
    Queues limits are not generally hit. Setups with a single TX queue
    but many cpus might even benefit from this.

    Note that unlike sch_fq, we do not take header sizes into account.
    Taking care of these headers would add additional complexity for
    no practical difference in behavior.

    Some performance numbers using 800 TCP_STREAM flows rate limited to
    ~48 Mbit per second on 40Gbit NIC.

    If MQ+pfifo_fast is used on the NIC :

    $ sar -n DEV 1 5 | grep eth
    14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
    14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
    14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
    14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
    14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
    Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
    4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
    0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
    1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
    1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0

    Now use MQ+FQ :

    lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
    lpaa23:~# tc qdisc replace dev eth0 root mq

    $ sar -n DEV 1 5 | grep eth
    14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
    14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
    14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
    14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
    14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
    Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
    1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
    0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
    4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
    3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0

    As expected, number of interrupts per second is very different.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 May, 2017

1 commit

  • In commit 59cc1f61f09c ("net: sched: convert qdisc linked list to
    hashtable") we missed the opportunity to considerably speed up
    tc_dump_tclass_root() if a qdisc handle is provided by user.

    Instead of iterating all the qdiscs, use qdisc_match_from_root()
    to directly get the one we look for.

    Signed-off-by: Eric Dumazet
    Cc: Jiri Kosina
    Cc: Jamal Hadi Salim
    Cc: Cong Wang
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 May, 2017

2 commits

  • fq_alloc_node, alloc_netdev_mqs and netif_alloc* open-code kmalloc
    with a vmalloc fallback. Use the kvmalloc variant instead. Keep the
    __GFP_REPEAT flag based on an explanation from Eric:

    "At the time, tests on the hardware I had in my labs showed that
    vmalloc() could deliver pages spread all over the memory and that was
    a small penalty (once memory is fragmented enough, not at boot time)"

    The way the code is constructed, however, means that we prefer to go
    and hit the OOM killer before we fall back to vmalloc for requests

    Acked-by: Vlastimil Babka
    Cc: Eric Dumazet
    Cc: David Miller
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
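The open-coded pattern being replaced can be modeled in userspace. Both allocators are stubbed with malloc() here; the `model_*`/`stub_*` names and the explicit failure knob are invented for illustration, since the real distinction (physically vs. virtually contiguous memory) only exists in the kernel.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model of the kmalloc-then-vmalloc fallback
 * that kvmalloc() centralizes: try the "physically contiguous"
 * allocator first and only fall back to the "virtually contiguous"
 * one if that fails. */
static void *stub_kmalloc(size_t size, int fail)
{
    return fail ? NULL : malloc(size);   /* fail knob simulates pressure */
}

static void *stub_vmalloc(size_t size)
{
    return malloc(size);
}

static void *model_kvmalloc(size_t size, int kmalloc_fails)
{
    void *p = stub_kmalloc(size, kmalloc_fails);

    return p ? p : stub_vmalloc(size);   /* fall back on failure */
}
```

Centralizing this in one helper is the point of the commit: each open-coded copy tends to get the allocator-interaction details (flags, fallback conditions) subtly wrong.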
     
  • There are many code paths open-coding kvmalloc. Let's use the helper
    instead. The main difference from kvmalloc is that those users are
    usually not considering all the aspects of the memory allocator.
    E.g. allocation requests
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdim
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

1 commit


03 May, 2017

1 commit

  • Jump is now the only action using a value opcode. This is going to
    change soon, so introduce helpers to work with this. Convert
    TC_ACT_JUMP.

    This also fixes the TC_ACT_JUMP check, which was incorrectly done as
    a bit check rather than a value check.

    Fixes: e0ee84ded796 ("net sched actions: Complete the JUMPX opcode")
    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
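The bit-check-versus-value-check distinction can be demonstrated with a small model. The opcode values and mask below are invented; the real TC constants differ, but the failure mode is the same.

```c
#include <assert.h>

/* Hypothetical model of the bug described above: action values pack an
 * opcode in the high bits and an operand (e.g. a jump count) in the
 * low bits. A bit test on the opcode matches ANY opcode that happens
 * to share the bit, while a value test on the masked opcode field
 * matches only the intended one. */
#define MODEL_EXT_VAL_MASK 0x0fffffffu   /* low bits: operand */
#define MODEL_ACT_JUMP     0x10000000u   /* opcode under test */
#define MODEL_ACT_OTHER    0x30000000u   /* another opcode sharing the bit */

static int buggy_is_jump(unsigned int act)
{
    return (act & MODEL_ACT_JUMP) != 0;                    /* bit check */
}

static int fixed_is_jump(unsigned int act)
{
    return (act & ~MODEL_EXT_VAL_MASK) == MODEL_ACT_JUMP;  /* value check */
}
```

The bit check works only as long as JUMP is the lone opcode using that bit; as soon as another value-style opcode is added, it starts returning false positives, which is why the helpers were introduced before the opcode space grows.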