02 Apr, 2020

1 commit

  • Although we intentionally use an ordered workqueue for all tc
    filter works, the ordering is not guaranteed by RCU work,
    given that tcf_queue_work() is essentially a call_rcu().

    This problem is demonstrated by Thomas:

    CPU 0:
    tcf_queue_work()
    tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);

    -> Migration to CPU 1

    CPU 1:
    tcf_queue_work(&p->rwork, tcindex_destroy_work);

    so the 2nd work could be queued before the 1st one, which leads
    to a use-after-free.

    Enforcing this order in RCU work is hard, as it would require changes
    to the RCU code itself. Fortunately, we can work around this problem in
    the tcindex filter by taking a temporary refcnt: we only take the refcnt
    right before we begin to destroy it. This simplifies the code a lot, as a
    full refcnt would require many more changes in tcindex_set_parms().
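
    A minimal sketch of the temporary-refcnt idea (helper names and struct
    layout are assumptions based on the description above): each destroy work
    takes a reference before it is queued and drops it when it finishes, so
    the shared tcindex data is freed only by whichever work runs last.

    struct tcindex_data {
            /* ... existing fields ... */
            refcount_t refcnt;      /* temporary, taken only once destroy starts */
    };

    static void tcindex_data_get(struct tcindex_data *p)
    {
            refcount_inc(&p->refcnt);
    }

    static void tcindex_data_put(struct tcindex_data *p)
    {
            if (refcount_dec_and_test(&p->refcnt))
                    kfree(p);
    }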

    Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
    Fixes: 3d210534cc93 ("net_sched: fix a race condition in tcindex_destroy()")
    Cc: Thomas Gleixner
    Cc: Paul E. McKenney
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Reviewed-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Cong Wang
     

31 Mar, 2020

5 commits

  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for net-next:

    1) Add support to specify a stateful expression in set definitions;
    this allows users to specify e.g. counters per set element.

    2) Flowtable software counter support.

    3) Flowtable hardware offload counter support, from wenxu.

    4) Parallelize flowtable hardware offload requests, from Paul Blakey.
    This includes a patch to add one work entry per offload command.

    5) Several patches to rework nf_queue refcount handling, from Florian
    Westphal.

    6) A few fixes for the flowtable tunnel offload: Fix crash if tunneling
    information is missing and set up indirect flow block as TC_SETUP_FT,
    patch from wenxu.

    7) Stricter netlink attribute sanity check on filters, from Romain Bellan
    and Florent Fourcot.

    8) Annotations to make sparse happy, from Jules Irenge.

    9) Improve icmp errors in debugging information, from Haishuang Yan.

    10) Fix warning in IPVS icmp error debugging, from Haishuang Yan.

    11) Fix endianness issue in tcp extension header, from Sergey Marinkevich.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Add support for TPROXY via a new bpf helper, bpf_sk_assign().

    This helper requires the BPF program to discover the socket via a call
    to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
    helper takes its own reference to the socket in addition to any existing
    reference that may or may not currently be obtained for the duration of
    BPF processing. For the destination socket to receive the traffic, the
    traffic must be routed towards that socket via local route. The
    simplest example route is below, but in practice you may want to route
    traffic more narrowly (eg by CIDR):

    $ ip route add local default dev lo

    This patch avoids trying to introduce an extra bit into the skb->sk, as
    that would require more invasive changes to all code interacting with
    the socket to ensure that the bit is handled correctly, such as all
    error-handling cases along the path from the helper in BPF through to
    the orphan path in the input. Instead, we opt to use the destructor
    variable to switch on the prefetch of the socket.
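
    A minimal sketch of how a tc BPF program might use the new helper (the
    section name, program name and the elided tuple setup are illustrative,
    not taken from the patch):

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    SEC("classifier")
    int assign_to_proxy(struct __sk_buff *skb)
    {
            struct bpf_sock_tuple tuple = {};
            struct bpf_sock *sk;

            /* Normally the tuple is filled from the parsed packet headers;
             * it is left empty here for brevity. */
            sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                   BPF_F_CURRENT_NETNS, 0);
            if (!sk)
                    return TC_ACT_OK;

            /* bpf_sk_assign() takes its own reference and marks the skb as
             * destined to sk; delivery still requires a local route. */
            bpf_sk_assign(skb, sk, 0);

            bpf_sk_release(sk);
            return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";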

    Signed-off-by: Joe Stringer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz

    Joe Stringer
     
  • It may be up to the driver (in case the ANY HW stats type is passed) to
    select which type of HW stats it is going to use. Add infrastructure to
    expose this information to the user.

    $ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop
    $ tc -s filter show dev enp3s0np1 ingress
    filter protocol ip pref 1 flower chain 0
    filter protocol ip pref 1 flower chain 0 handle 0x1
    eth_type ipv4
    dst_ip 192.168.1.1
    in_hw in_hw_count 2
    action order 1: gact action drop
    random type none pass val 0
    index 1 ref 1 bind 1 installed 10 sec used 10 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    used_hw_stats immediate <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Introduce a helper that takes a value and a selector, packs them into
    a struct, and puts them into a netlink message.
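
    A sketch of what such a helper could look like, assuming the value and
    selector are carried in struct nla_bitfield32 (the helper name here is an
    assumption, not necessarily the one used by the patch):

    static inline int nla_put_bitfield32(struct sk_buff *skb, int attrtype,
                                         __u32 value, __u32 selector)
    {
            struct nla_bitfield32 tmp = { value, selector, };

            return nla_put(skb, attrtype, sizeof(tmp), &tmp);
    }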

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

28 Mar, 2020

1 commit

  • The indirect block setup should use TC_SETUP_FT as the type instead of
    TC_SETUP_BLOCK. Adjust existing users of the indirect flow block
    infrastructure.

    Fixes: b5140a36da78 ("netfilter: flowtable: add indr block setup support")
    Signed-off-by: wenxu
    Signed-off-by: Pablo Neira Ayuso

    wenxu
     

27 Mar, 2020

5 commits


26 Mar, 2020

2 commits

  • Overlapping header include additions in macsec.c

    A bug fix in 'net' overlapping with the removal of 'version'
    string in ena_netdev.c

    Overlapping test additions in selftests Makefile

    Overlapping PCI ID table adjustments in iwlwifi driver.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
    net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
    pkt->skb->tc_redirected = 1;
    ^~
    net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
    pkt->skb->tc_from_ingress = 1;
    ^~

    To avoid a direct dependency with tc actions from netfilter, wrap the
    redirect bits around CONFIG_NET_REDIRECT and move helpers to
    include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
    only existing client of these bits in the tree.

    This patch adds skb_set_redirected(), which sets the redirected bit
    on the skbuff; it specifies whether the packet was redirected from ingress
    and resets the timestamp (the timestamp reset was originally missing in the
    netfilter bugfix).
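
    A sketch of how such a helper might look once moved to
    include/linux/skbuff.h (details assumed from the description; the bits
    only exist when the new toggle is enabled):

    static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress)
    {
    #ifdef CONFIG_NET_REDIRECT
            skb->redirected = 1;
            skb->from_ingress = from_ingress;
            if (skb->from_ingress)
                    skb->tstamp = 0;        /* the reset missing from the original bugfix */
    #endif
    }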

    Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
    Reported-by: noreply@ellerman.id.au
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

25 Mar, 2020

1 commit

  • Currently the software CBS does not consider the packet sending time
    when depleting the credits. This causes the throughput to be

        Idleslope[kbps] * (Port transmit rate[kbps] / |Sendslope[kbps]|)

    where

        Idleslope * (Port transmit rate / (Idleslope + |Sendslope|)) = Idleslope

    is expected. In order to fix the issue above, this patch takes the time
    when the packet sending completes into account, by moving the anchor time
    variable "last" ahead to the send completion time upon transmission and by
    waiting when the next dequeue request comes before the send
    completion time of the previous packet.
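
    As an illustration only (numbers chosen for the example): with Idleslope =
    20 Mbps on a 100 Mbps port, Sendslope = Idleslope - Port transmit rate =
    -80 Mbps. The buggy credit depletion yields roughly 20 * (100 / 80) =
    25 Mbps, while the expected throughput is 20 * (100 / (20 + 80)) =
    20 Mbps, i.e. the configured Idleslope.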

    changelog:
    V2->V3:
    - remove unnecessary whitespace cleanup
    - add the checks if port_rate is 0 before division

    V1->V2:
    - combine variable "send_completed" into "last"
    - add the comment for estimate of the packet sending

    Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
    Signed-off-by: Zh-yuan Ye
    Reviewed-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Zh-yuan Ye
     

24 Mar, 2020

1 commit

  • Commit 53eca1f3479f ("net: rename flow_action_hw_stats_types* ->
    flow_action_hw_stats*") renamed just the flow action types and
    helpers. For consistency rename variables, enums, struct members
    and UAPI too (note that this UAPI was not in any official release,
    yet).

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

20 Mar, 2020

1 commit

  • The skbedit action "priority" is used for adjusting SKB priority. Allow
    drivers to offload the action by introducing two new skbedit getters and a
    new flow action, and initializing appropriately in tc_setup_flow_action().
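
    A sketch of what one of the new getters could look like (the helper name
    and the params layout are assumptions based on act_skbedit's existing
    structure, not a quote of the patch):

    static inline u32 tcf_skbedit_priority(const struct tc_action *a)
    {
            u32 priority;

            rcu_read_lock();
            priority = rcu_dereference(to_skbedit(a)->params)->priority;
            rcu_read_unlock();

            return priority;
    }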

    Signed-off-by: Petr Machata
    Reviewed-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     

19 Mar, 2020

2 commits

  • In the commit referenced below, hw_stats_type of an entry is set for every
    entry that corresponds to a pedit action. However, the assignment is only
    done after the entry pointer is bumped, and therefore could overwrite
    memory outside of the entries array.

    The reason for this positioning may have been that the current entry's
    hw_stats_type is already set above, before the action-type dispatch.
    However, if there are no more actions, the assignment is wrong. And if
    there are, the next round of the for_each_action loop will make the
    assignment before the action-type dispatch anyway.

    Therefore fix this issue by simply reordering the two lines.
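
    In other words, the fix is essentially the following (simplified fragment,
    using the field naming from the description above):

    /* Before: the pointer is advanced first, so the write lands one slot
     * past the last valid entry when this is the final action. */
    entry++;
    entry->hw_stats_type = act->hw_stats_type;

    /* After: set the field on the entry that was just filled, then advance. */
    entry->hw_stats_type = act->hw_stats_type;
    entry++;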

    Fixes: 74522e7baae2 ("net: sched: set the hw_stats_type in pedit loop")
    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     
  • Currently, on replace, the previous action instance's params are
    swapped with newly allocated params. The old params are
    only freed (via kfree_rcu), without releasing the allocated
    ct zone template related to them.

    Call tcf_ct_params_free (via call_rcu) for the old params,
    so the ct zone template is released as well.

    Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
    Signed-off-by: Paul Blakey
    Signed-off-by: David S. Miller

    Paul Blakey
     

18 Mar, 2020

4 commits

  • Add a new attribute to control the fq qdisc hrtimer slack.

    Default is set to 10 usec.

    When/if packets are throttled, fq sets up an hrtimer that can
    lead to one interrupt per packet in the throttled queue.

    By using a timer slack, we allow better use of timer interrupts,
    by giving them a chance to call multiple timer callbacks
    at each hardware interrupt.

    Also, giving a slack allows FQ to dequeue batches of packets
    instead of a single one, thus increasing xmit_more efficiency.

    This has no negative effect on the rate a TCP flow can sustain,
    since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns)

    v2: added strict netlink checking (as feedback from Jakub Kicinski)

    Tested:
    1000 concurrent flows all using paced packets.
    1,000,000 packets sent per second.

    Before the patch :

    $ vmstat 2 10
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    0 0 0 60726784 23628 3485992 0 0 138 1 977 535 0 12 87 0 0
    0 0 0 60714700 23628 3485628 0 0 0 0 1568827 26462 0 22 78 0 0
    1 0 0 60716012 23628 3485656 0 0 0 0 1570034 26216 0 22 78 0 0
    0 0 0 60722420 23628 3485492 0 0 0 0 1567230 26424 0 22 78 0 0
    0 0 0 60727484 23628 3485556 0 0 0 0 1568220 26200 0 22 78 0 0
    2 0 0 60718900 23628 3485380 0 0 0 40 1564721 26630 0 22 78 0 0
    2 0 0 60718096 23628 3485332 0 0 0 0 1562593 26432 0 22 78 0 0
    0 0 0 60719608 23628 3485064 0 0 0 0 1563806 26238 0 22 78 0 0
    1 0 0 60722876 23628 3485236 0 0 0 130 1565874 26566 0 22 78 0 0
    1 0 0 60722752 23628 3484908 0 0 0 0 1567646 26247 0 22 78 0 0

    After the patch, slack of 10 usec, we can see a reduction of interrupts
    per second, and a small decrease of reported cpu usage.

    $ vmstat 2 10
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    1 0 0 60722564 23628 3484728 0 0 133 1 696 545 0 13 87 0 0
    1 0 0 60722568 23628 3484824 0 0 0 0 977278 25469 0 20 80 0 0
    0 0 0 60716396 23628 3484764 0 0 0 0 979997 25326 0 20 80 0 0
    0 0 0 60713844 23628 3484960 0 0 0 0 981394 25249 0 20 80 0 0
    2 0 0 60720468 23628 3484916 0 0 0 0 982860 25062 0 20 80 0 0
    1 0 0 60721236 23628 3484856 0 0 0 0 982867 25100 0 20 80 0 0
    1 0 0 60722400 23628 3484456 0 0 0 8 982698 25303 0 20 80 0 0
    0 0 0 60715396 23628 3484428 0 0 0 0 981777 25176 0 20 80 0 0
    0 0 0 60716520 23628 3486544 0 0 0 36 978965 27857 0 21 79 0 0
    0 0 0 60719592 23628 3486516 0 0 0 22 977318 25106 0 20 80 0 0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • qdisc_watchdog_schedule_range_ns() can use the newly added slack
    and avoid rearming the hrtimer a bit earlier than the current
    value. This patch has no effect if the delta_ns parameter
    is zero.

    Note that this means the max slack is potentially doubled.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Some packet schedulers might want to add a slack
    when programming hrtimers. This can reduce the number
    of interrupts and increase batch sizes, and thus
    give good xmit_more savings.

    This commit adds a qdisc_watchdog_schedule_range_ns()
    helper, with an extra delta_ns parameter.

    The legacy qdisc_watchdog_schedule_ns() becomes an inline
    passing a zero slack.
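
    A sketch of the resulting interface (signatures assumed from the
    description):

    void qdisc_watchdog_schedule_range_ns(struct qdisc_watchdog *wd,
                                          u64 expires, u64 delta_ns);

    /* The legacy call becomes a wrapper that requests no slack. */
    static inline void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd,
                                                  u64 expires)
    {
            qdisc_watchdog_schedule_range_ns(wd, expires, 0ULL);
    }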

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • flow_action_hw_stats_types_check() helper takes one of the
    FLOW_ACTION_HW_STATS_*_BIT values as input. If we align
    the arguments to the opening bracket of the helper there
    is no way to call this helper and stay under 80 characters.

    Remove the "types" part from the new flow_action helpers
    and enum values.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

16 Mar, 2020

2 commits

  • For a single pedit action, multiple offload entries may be used. Set the
    hw_stats_type to all of them.

    Fixes: 44f865801741 ("sched: act: allow user to specify type of HW stats for a filter")
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • route4_change() allocates a new filter and copies values from
    the old one. After the new filter is inserted into the hash
    table, the old filter should be removed and freed, as the final
    step of the update.

    However, the current code mistakenly removes the new one. This is
    clearly wrong, and it causes a double free and a
    use-after-free, as reported by syzbot.

    Reported-and-tested-by: syzbot+f9b32aaacd60305d9687@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+2f8c233f131943d6056d@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+9c2df9fd5e9445b74e01@syzkaller.appspotmail.com
    Fixes: 1109c00547fc ("net: sched: RCU cls_route")
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Cc: John Fastabend
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

15 Mar, 2020

4 commits

  • When the RED Qdisc is currently configured to enable ECN, the RED algorithm
    is used to decide whether a certain SKB should be marked. If that SKB is
    not ECN-capable, it is early-dropped.

    It is also possible to keep all traffic in the queue, and just mark the
    ECN-capable subset of it, as appropriate under the RED algorithm. Some
    switches support this mode, and some installations make use of it.

    To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is
    configured with this flag, non-ECT traffic is enqueued instead of being
    early-dropped.

    Signed-off-by: Petr Machata
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Petr Machata
     
  • The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool
    of global RED flags. These are passed in tc_red_qopt.flags. However none of
    these qdiscs validate the flag field, and just copy it over wholesale to
    internal structures, and later dump it back. (An exception is GRED, which
    does validate for VQs -- however not for the main setup.)

    A broken userspace can therefore configure a qdisc with arbitrary
    unsupported flags, and later expect to see the flags on qdisc dump. The
    current ABI therefore allows storage of several bits of custom data in
    qdisc instances of the types mentioned above. How many bits depends on
    which flags are meaningful for the qdisc in question. E.g. SFQ recognizes
    flags ECN and HARDDROP, and the rest is not interpreted.

    If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it,
    and at the same time it needs to retain the possibility to store 6 bits of
    uninterpreted data. Likewise RED, which adds a new flag later in this
    patchset.

    To that end, this patch adds a new function, red_get_flags(), to split the
    passed flags of RED-like qdiscs into qdisc flags and user bits, and
    red_validate_flags() to validate the resulting configuration. It further
    adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags.
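
    A simplified sketch of the idea (not the actual kernel signatures; the
    example_ prefix marks the names as hypothetical):

    /* Split legacy qopt flags into bits this qdisc interprets and user bits
     * it merely stores, then reject anything interpreted but unsupported. */
    static int example_red_split_flags(u32 qopt_flags, u32 historic_mask,
                                       u32 supported_mask,
                                       u32 *p_flags, u32 *p_userbits)
    {
            *p_flags = qopt_flags & historic_mask;      /* e.g. ECN, HARDDROP */
            *p_userbits = qopt_flags & ~historic_mask;  /* stored, not interpreted */

            if (*p_flags & ~supported_mask)
                    return -EOPNOTSUPP;
            return 0;
    }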

    Signed-off-by: Petr Machata
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Petr Machata
     
  • In commit 599be01ee567 ("net_sched: fix an OOB access in cls_tcindex")
    I moved cp->hash calculation before the first
    tcindex_alloc_perfect_hash(), but cp->alloc_hash is left untouched.
    This difference could lead to another out-of-bounds access.

    cp->alloc_hash should always be the size allocated; we should
    update it after this tcindex_alloc_perfect_hash().

    Reported-and-tested-by: syzbot+dcc34d54d68ef7d2d53d@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+c72da7b9ed57cde6fca2@syzkaller.appspotmail.com
    Fixes: 599be01ee567 ("net_sched: fix an OOB access in cls_tcindex")
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • syzbot reported a use-after-free in tcindex_dump(). This is due to
    the lack of RTNL in the deferred rcu work. We queue this work with
    RTNL held in tcindex_change(); later, tcindex_dump() is called:

    fh = tp->ops->get(tp, t->tcm_handle);
    ...
    err = tp->ops->change(..., &fh, ...);
    tfilter_notify(..., fh, ...);

    but there is nothing to serialize the pending
    tcindex_partial_destroy_work() with tcindex_dump().

    Fix this by simply holding RTNL in tcindex_partial_destroy_work(),
    so that it won't be called until RTNL is released after
    tc_new_tfilter() is completed.

    Reported-and-tested-by: syzbot+653090db2562495901dc@syzkaller.appspotmail.com
    Fixes: 3d210534cc93 ("net_sched: fix a race condition in tcindex_destroy()")
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

13 Mar, 2020

8 commits

  • Minor overlapping changes, nothing serious.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pass the zone's flow table instance on the flow action to the drivers.
    Thus, allowing drivers to register FT add/del/stats callbacks.

    Finally, enable hardware offload on the flow table instance.

    Signed-off-by: Paul Blakey
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Paul Blakey
     
  • If the driver deleted an FT entry, an FT entry failed to offload, or the
    driver registered to the flow table after flows were already added, we
    still get packets in software.

    For those packets, while restoring the ct state from the flow table
    entry, refresh its hardware offload.

    Signed-off-by: Paul Blakey
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Paul Blakey
     
  • Provide an API to restore the ct state pointer.

    This may be used by drivers to restore the ct state if they
    miss in a tc chain after they have already performed the hardware
    connection tracking action (the ct_metadata action).

    For example, consider the following rule on chain 0 that is in_hw,
    however chain 1 is not_in_hw:

    $ tc filter add dev ... chain 0 ... \
    flower ... action ct pipe action goto chain 1

    Packets of a flow offloaded by the driver (via nf flow table offload)
    hit this rule in hardware and are marked with the ct metadata action
    (mark, label, zone), which does the equivalent of the software ct action;
    when such a packet then jumps to hardware chain 1, there is a miss.

    CT was already processed in hardware. Therefore, the driver's miss
    handling should restore the ct state on the skb, using the provided API,
    and continue the packet processing in chain 1.

    Signed-off-by: Paul Blakey
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Paul Blakey
     
  • The NF flow table API associates a 5-tuple rule with an action list by
    calling the flow table type's action() CB to fill in the rule's actions.

    In the action CB of act_ct, populate the ct offload entry actions with a new
    ct_metadata action. Initialize the ct_metadata with the ct mark, label and
    zone information. If ct nat was performed, then also append the relevant
    packet mangle actions (e.g. ipv4/ipv6/tcp/udp header rewrites).

    Drivers that offload the ft entries may match on the 5-tuple and perform
    the action list.

    Signed-off-by: Paul Blakey
    Reviewed-by: Jiri Pirko
    Reviewed-by: Edward Cree
    Signed-off-by: David S. Miller

    Paul Blakey
     
  • David S. Miller
     
  • There was a bug that was causing packets to be sent to the driver
    without first calling dequeue() on the "child" qdisc. And the KASAN
    report below shows that sending a packet without calling dequeue()
    leads to bad results.

    The problem is that when checking the last qdisc "child" we do not set
    the returned skb to NULL, which can cause it to be sent to the driver.
    After the skb is sent it may be freed, while in some situations a
    reference to it may still be held in the child qdisc, because it was never
    dequeued.

    The crash log looks like this:

    [ 19.937538] ==================================================================
    [ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
    [ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
    [ 19.939612]
    [ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
    [ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
    [ 19.941523] Call Trace:
    [ 19.941774]
    [ 19.941985] dump_stack+0x97/0xe0
    [ 19.942323] print_address_description.constprop.0+0x3b/0x60
    [ 19.942884] ? taprio_dequeue_soft+0x620/0x780
    [ 19.943325] ? taprio_dequeue_soft+0x620/0x780
    [ 19.943767] __kasan_report.cold+0x1a/0x32
    [ 19.944173] ? taprio_dequeue_soft+0x620/0x780
    [ 19.944612] kasan_report+0xe/0x20
    [ 19.944954] taprio_dequeue_soft+0x620/0x780
    [ 19.945380] __qdisc_run+0x164/0x18d0
    [ 19.945749] net_tx_action+0x2c4/0x730
    [ 19.946124] __do_softirq+0x268/0x7bc
    [ 19.946491] irq_exit+0x17d/0x1b0
    [ 19.946824] smp_apic_timer_interrupt+0xeb/0x380
    [ 19.947280] apic_timer_interrupt+0xf/0x20
    [ 19.947687]
    [ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0
    [ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
    [ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
    [ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
    [ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
    [ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
    [ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
    [ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
    [ 19.954408] ? default_idle_call+0x2e/0x70
    [ 19.954816] ? default_idle+0x1f/0x2d0
    [ 19.955192] default_idle_call+0x5e/0x70
    [ 19.955584] do_idle+0x3d4/0x500
    [ 19.955909] ? arch_cpu_idle_exit+0x40/0x40
    [ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30
    [ 19.956829] ? trace_hardirqs_on+0x30/0x160
    [ 19.957242] cpu_startup_entry+0x19/0x20
    [ 19.957633] start_secondary+0x2a6/0x380
    [ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0
    [ 19.958486] secondary_startup_64+0xa4/0xb0
    [ 19.958921]
    [ 19.959078] Allocated by task 33:
    [ 19.959412] save_stack+0x1b/0x80
    [ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0
    [ 19.960222] kmem_cache_alloc+0xe4/0x230
    [ 19.960617] __alloc_skb+0x91/0x510
    [ 19.960967] ndisc_alloc_skb+0x133/0x330
    [ 19.961358] ndisc_send_ns+0x134/0x810
    [ 19.961735] addrconf_dad_work+0xad5/0xf80
    [ 19.962144] process_one_work+0x78e/0x13a0
    [ 19.962551] worker_thread+0x8f/0xfa0
    [ 19.962919] kthread+0x2ba/0x3b0
    [ 19.963242] ret_from_fork+0x3a/0x50
    [ 19.963596]
    [ 19.963753] Freed by task 33:
    [ 19.964055] save_stack+0x1b/0x80
    [ 19.964386] __kasan_slab_free+0x12f/0x180
    [ 19.964830] kmem_cache_free+0x80/0x290
    [ 19.965231] ip6_mc_input+0x38a/0x4d0
    [ 19.965617] ipv6_rcv+0x1a4/0x1d0
    [ 19.965948] __netif_receive_skb_one_core+0xf2/0x180
    [ 19.966437] netif_receive_skb+0x8c/0x3c0
    [ 19.966846] br_handle_frame_finish+0x779/0x1310
    [ 19.967302] br_handle_frame+0x42a/0x830
    [ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90
    [ 19.968167] __netif_receive_skb_one_core+0x96/0x180
    [ 19.968658] process_backlog+0x198/0x650
    [ 19.969047] net_rx_action+0x2fa/0xaa0
    [ 19.969420] __do_softirq+0x268/0x7bc
    [ 19.969785]
    [ 19.969940] The buggy address belongs to the object at ffff888112862840
    [ 19.969940] which belongs to the cache skbuff_head_cache of size 224
    [ 19.971202] The buggy address is located 140 bytes inside of
    [ 19.971202] 224-byte region [ffff888112862840, ffff888112862920)
    [ 19.972344] The buggy address belongs to the page:
    [ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
    [ 19.973930] flags: 0x8000000000010200(slab|head)
    [ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
    [ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
    [ 19.975915] page dumped because: kasan: bad access detected
    [ 19.976461] page_owner tracks the page as allocated
    [ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
    [ 19.978332] prep_new_page+0x24b/0x330
    [ 19.978707] get_page_from_freelist+0x2057/0x2c90
    [ 19.979170] __alloc_pages_nodemask+0x218/0x590
    [ 19.979619] new_slab+0x9d/0x300
    [ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0
    [ 19.980421] __slab_alloc.constprop.0+0x30/0x60
    [ 19.980870] kmem_cache_alloc+0x201/0x230
    [ 19.981269] __alloc_skb+0x91/0x510
    [ 19.981620] alloc_skb_with_frags+0x78/0x4a0
    [ 19.982043] sock_alloc_send_pskb+0x5eb/0x750
    [ 19.982476] unix_stream_sendmsg+0x399/0x7f0
    [ 19.982904] sock_sendmsg+0xe2/0x110
    [ 19.983262] ____sys_sendmsg+0x4de/0x6d0
    [ 19.983660] ___sys_sendmsg+0xe4/0x160
    [ 19.984032] __sys_sendmsg+0xab/0x130
    [ 19.984396] do_syscall_64+0xe7/0xae0
    [ 19.984761] page last free stack trace:
    [ 19.985142] __free_pages_ok+0x432/0xbc0
    [ 19.985533] qlist_free_all+0x56/0xc0
    [ 19.985907] quarantine_reduce+0x149/0x170
    [ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0
    [ 19.986791] kmem_cache_alloc+0xe4/0x230
    [ 19.987182] prepare_creds+0x24/0x440
    [ 19.987548] do_faccessat+0x80/0x590
    [ 19.987906] do_syscall_64+0xe7/0xae0
    [ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 19.988775]
    [ 19.988930] Memory state around the buggy address:
    [ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    [ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 19.991529] ^
    [ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
    [ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Reported-by: Michael Schmidt
    Signed-off-by: Vinicius Costa Gomes
    Acked-by: Andre Guedes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • This reverts commit 4cda75275f9f89f9485b0ca4d6950c95258a9bce
    from net-next.

    Brown bag time.

    Michal noticed that this change doesn't work at all when
    netif_set_real_num_tx_queues() gets called prior to an initial
    dev_activate(), as for instance igb does.

    Doing so dies with:

    [ 40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
    [ 40.586922] #PF: supervisor read access in kernel mode
    [ 40.592668] #PF: error_code(0x0000) - not-present page
    [ 40.598405] PGD 0 P4D 0
    [ 40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
    [ 40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G E 5.6.0-rc3-ethnl.50-default #1
    [ 40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
    [ 40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
    [ 40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
    [ 40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
    [ 40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
    [ 40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
    [ 40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
    [ 40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
    [ 40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
    [ 40.699750] FS: 00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
    [ 40.708782] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
    [ 40.723156] Call Trace:
    [ 40.725888] dev_qdisc_set_real_num_tx_queues+0x61/0x90
    [ 40.731725] netif_set_real_num_tx_queues+0x94/0x1d0
    [ 40.737286] __igb_open+0x19a/0x5d0 [igb]
    [ 40.741767] __dev_open+0xbb/0x150
    [ 40.745567] __dev_change_flags+0x157/0x1a0
    [ 40.750240] dev_change_flags+0x23/0x60

    [...]

    Fixes: 4cda75275f9f ("net: sched: make newly activated qdiscs visible")
    Reported-by: Michal Kubecek
    CC: Michal Kubecek
    CC: Eric Dumazet
    CC: Jamal Hadi Salim
    CC: Cong Wang
    CC: Jiri Pirko
    Signed-off-by: Julian Wiedmann
    Signed-off-by: David S. Miller

    Julian Wiedmann
     

12 Mar, 2020

1 commit

  • In their .attach callback, mq[prio] only add the qdiscs of the currently
    active TX queues to the device's qdisc hash list.
    If a user later increases the number of active TX queues, their qdiscs
    are not visible via e.g. 'tc qdisc show'.

    Add a hook to netif_set_real_num_tx_queues() that walks all active
    TX queues and adds those which are missing to the hash list.

    CC: Eric Dumazet
    CC: Jamal Hadi Salim
    CC: Cong Wang
    CC: Jiri Pirko
    Signed-off-by: Julian Wiedmann
    Signed-off-by: David S. Miller

    Julian Wiedmann
     

10 Mar, 2020

1 commit

  • Commit 105e808c1da2 ("pie: remove pie_vars->accu_prob_overflows")
    changes the scale of probability values in PIE from (2^64 - 1) to
    (2^56 - 1). This affects the precision of tc_pie_xstats->prob in
    user space.

    This patch ensures user space is unaffected.
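
    For illustration only: keeping the old user-visible scale just means
    converting on dump, since a probability on the (2^56 - 1) scale maps back
    onto the (2^64 - 1) scale by multiplying by 2^8, i.e. roughly
    prob_u64 = prob_u56 << 8 (an illustrative expression, not necessarily the
    exact code in the patch).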

    Suggested-by: Eric Dumazet
    Signed-off-by: Leslie Monis
    Signed-off-by: David S. Miller

    Leslie Monis
     

09 Mar, 2020

1 commit

  • Convert the zones_lock spinlock to a zones_mutex mutex,
    and struct tcf_ct_flow_table's ->ref to a refcount,
    so that the control path can use regular GFP_KERNEL allocations
    from standard process context. This is more robust
    in case of memory pressure.

    The refcount is needed because tcf_ct_flow_table_put() can
    be called from an RCU callback, thus in BH context.
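
    The resulting pattern, as an illustrative sketch only (names are
    hypothetical, not the exact act_ct code):

    static DEFINE_MUTEX(zones_mutex);       /* control path only, may sleep */

    struct zone_flow_table {
            refcount_t ref;
            /* ... per-zone nf_flowtable, hashtable node, etc. ... */
    };

    /* The get side runs in process context under zones_mutex, so allocations
     * may use GFP_KERNEL; the put side may run from BH/RCU context and
     * therefore must not take the mutex. */
    static void zone_flow_table_put(struct zone_flow_table *ft)
    {
            if (refcount_dec_and_test(&ft->ref))
                    kfree(ft);      /* real code would defer heavier teardown */
    }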

    The issue was spotted by syzbot, as rhashtable_init()
    was called with a spinlock held, which is bad since GFP_KERNEL
    allocations can sleep.

    Note to developers: Please make sure your patches are tested
    with CONFIG_DEBUG_ATOMIC_SLEEP=y

    BUG: sleeping function called from invalid context at mm/slab.h:565
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 9582, name: syz-executor610
    2 locks held by syz-executor610/9582:
    #0: ffffffff8a34eb80 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
    #0: ffffffff8a34eb80 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x3f9/0xad0 net/core/rtnetlink.c:5437
    #1: ffffffff8a3961b8 (zones_lock){+...}, at: spin_lock_bh include/linux/spinlock.h:343 [inline]
    #1: ffffffff8a3961b8 (zones_lock){+...}, at: tcf_ct_flow_table_get+0xa3/0x1700 net/sched/act_ct.c:67
    Preemption disabled at:
    [] 0x0
    CPU: 0 PID: 9582 Comm: syz-executor610 Not tainted 5.6.0-rc3-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x188/0x20d lib/dump_stack.c:118
    ___might_sleep.cold+0x1f4/0x23d kernel/sched/core.c:6798
    slab_pre_alloc_hook mm/slab.h:565 [inline]
    slab_alloc_node mm/slab.c:3227 [inline]
    kmem_cache_alloc_node_trace+0x272/0x790 mm/slab.c:3593
    __do_kmalloc_node mm/slab.c:3615 [inline]
    __kmalloc_node+0x38/0x60 mm/slab.c:3623
    kmalloc_node include/linux/slab.h:578 [inline]
    kvmalloc_node+0x61/0xf0 mm/util.c:574
    kvmalloc include/linux/mm.h:645 [inline]
    kvzalloc include/linux/mm.h:653 [inline]
    bucket_table_alloc+0x8b/0x480 lib/rhashtable.c:175
    rhashtable_init+0x3d2/0x750 lib/rhashtable.c:1054
    nf_flow_table_init+0x16d/0x310 net/netfilter/nf_flow_table_core.c:498
    tcf_ct_flow_table_get+0xe33/0x1700 net/sched/act_ct.c:82
    tcf_ct_init+0xba4/0x18a6 net/sched/act_ct.c:1050
    tcf_action_init_1+0x697/0xa20 net/sched/act_api.c:945
    tcf_action_init+0x1e9/0x2f0 net/sched/act_api.c:1001
    tcf_action_add+0xdb/0x370 net/sched/act_api.c:1411
    tc_ctl_action+0x366/0x456 net/sched/act_api.c:1466
    rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5440
    netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2478
    netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
    netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
    netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xcf/0x120 net/socket.c:672
    ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2343
    ___sys_sendmsg+0x100/0x170 net/socket.c:2397
    __sys_sendmsg+0xec/0x1b0 net/socket.c:2430
    do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4403d9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffd719af218 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004403d9
    RDX: 0000000000000000 RSI: 0000000020000300 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000005 R09: 00000000004002c8
    R10: 0000000000000008 R11: 00000000000

    Fixes: c34b961a2492 ("net/sched: act_ct: Create nf flow table per zone")
    Signed-off-by: Eric Dumazet
    Cc: Paul Blakey
    Cc: Jiri Pirko
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet