04 Feb, 2017

1 commit

  • [ Upstream commit 7be2c82cfd5d28d7adb66821a992604eb6dd112e ]

    Ashizuka reported a highmem oddity and sent a patch for the
    Freescale fec driver.

    But the root cause is that the core networking stack must ensure
    that no skb with a highmem fragment is ever sent through a device
    that does not assert NETIF_F_HIGHDMA in its features.

    We need to call illegal_highdma() from harmonize_features()
    regardless of CSUM checks.

    Fixes: ec5f06156423 ("net: Kill link between CSUM and SG features.")
    Signed-off-by: Eric Dumazet
    Cc: Pravin Shelar
    Reported-by: "Ashizuka, Yuusuke"
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
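
    A minimal sketch of the idea (illustrative only; the fragment walk
    mirrors what illegal_highdma() does upstream, details may differ):

        static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
        {
            int i;

            if (dev->features & NETIF_F_HIGHDMA)
                return 0;

            /* reject any skb carrying a fragment that lives in highmem,
             * since the device cannot DMA from such pages */
            for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
                skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

                if (PageHighMem(skb_frag_page(frag)))
                    return 1;
            }
            return 0;
        }

        /* in harmonize_features(), independently of the CSUM checks: */
        if (illegal_highdma(dev, skb))
            features &= ~NETIF_F_SG;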
     

15 Jan, 2017

2 commits

  • [ Upstream commit 7cfd5fd5a9813f1430290d20c0fead9b4582a307 ]

    On 32bit arches, (skb->end - skb->data) is not 'unsigned int',
    so we shall use min_t() instead of min() to avoid a compiler error.

    Fixes: 1272ce87fa01 ("gro: Enter slow-path if there is no tailroom")
    Reported-by: kernel test robot
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
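
    For illustration, a small userspace rendition of the type issue (the
    min_t() below is a simplified stand-in for the kernel macro, and the
    buffer names are made up):

        #include <stdio.h>

        /* simplified stand-in for the kernel's min_t(): force both operands
         * to one explicit type before comparing, so a pointer difference and
         * an unsigned int can be compared without a type clash */
        #define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

        int main(void)
        {
            unsigned char buf[256];
            unsigned char *data = buf;
            unsigned char *end = buf + sizeof(buf);
            unsigned int frag0_len = 512;

            /* (end - data) is ptrdiff_t; min() on mixed types does not build
             * in the kernel, min_t(unsigned int, ...) does */
            unsigned int len = min_t(unsigned int, frag0_len,
                                     (unsigned int)(end - data));

            printf("capped length: %u\n", len);
            return 0;
        }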
     
  • [ Upstream commit 1272ce87fa017ca4cf32920764d879656b7a005a ]

    The GRO path has a fast-path where we avoid calling pskb_may_pull
    and pskb_expand by directly accessing frag0. However, this should
    only be done if we have enough tailroom in the skb as otherwise
    we'll have to expand it later anyway.

    This patch adds the check by capping frag0_len with the skb tailroom.

    Fixes: cb18978cbf45 ("gro: Open-code final pskb_may_pull")
    Reported-by: Slava Shwartsman
    Signed-off-by: Herbert Xu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
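
    A rough sketch of the capping (illustrative; helper and field names
    follow mainline, the exact hunk may differ):

        static void skb_gro_reset_offset(struct sk_buff *skb)
        {
            const struct skb_shared_info *pinfo = skb_shinfo(skb);
            const skb_frag_t *frag0 = &pinfo->frags[0];

            NAPI_GRO_CB(skb)->frag0 = NULL;
            NAPI_GRO_CB(skb)->frag0_len = 0;

            if (skb_mac_header(skb) == skb_tail_pointer(skb) &&
                pinfo->nr_frags &&
                !PageHighMem(skb_frag_page(frag0))) {
                NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
                /* cap the fast-path length with the skb tailroom, so the
                 * slow path is taken whenever a later pull would have to
                 * expand the skb anyway */
                NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
                                                    skb_frag_size(frag0),
                                                    skb->end - skb->tail);
            }
        }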
     

13 Nov, 2016

1 commit

  • If the bpf program calls bpf_redirect(dev, 0) and dev is
    an ipip/ip6tnl, it currently includes the mac header.
    e.g. If dev is ipip, the end result is IP-EthHdr-IP instead
    of IP-IP.

    The fix is to pull the mac header. At ingress, skb_postpull_rcsum()
    is not needed because the ethhdr should have been pulled once already
    and then got pushed back just before calling the bpf_prog.
    At egress, this patch calls skb_postpull_rcsum().

    If bpf_redirect(dev, BPF_F_INGRESS) is called,
    it also fails now because it calls dev_forward_skb() which
    eventually calls eth_type_trans(skb, dev). The eth_type_trans()
    will set skb->pkt_type = PACKET_OTHERHOST because the mac address
    does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST
    will eventually cause ip_rcv() to error out. To fix this,
    ____dev_forward_skb() is added.

    Joint work with Daniel Borkmann.

    Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
    Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Martin KaFai Lau
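
    A simplified sketch of the egress-side idea (hedged: the function name
    and surrounding plumbing are illustrative, not the exact upstream
    helpers):

        static int xmit_skb_no_mac(struct net_device *dev, struct sk_buff *skb)
        {
            unsigned int mlen = skb->mac_len;

            if (mlen) {
                void *mac = skb->data;     /* MAC header sits at the head */

                __skb_pull(skb, mlen);     /* strip it for an L3 tunnel dev */
                /* keep the skb checksum coherent after the pull, as the
                 * changelog notes for the egress path */
                skb_postpull_rcsum(skb, mac, mlen);
            }

            skb->dev = dev;
            return dev_queue_xmit(skb);
        }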
     

01 Nov, 2016

1 commit

    Sending a zero checksum is ok for TCP, but not for UDP.

    A UDPv6 receiver should by default drop a frame with a 0 checksum,
    and a UDPv4 receiver would not verify the checksum and might accept
    a corrupted packet.

    Simply replace such a checksum with 0xffff, regardless of transport.

    This error was caught on SIT tunnels, but seems generic.

    Signed-off-by: Eric Dumazet
    Cc: Maciej Żenczykowski
    Cc: Willem de Bruijn
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
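
    A sketch of the mangling, inside the checksum helper once the sum has
    been computed into 'csum' at offset 'offset' (names illustrative; in
    mainline CSUM_MANGLED_0 is 0xffff, an equivalent value in one's
    complement arithmetic):

        __sum16 sum = csum_fold(csum);

        /* UDP uses 0 on the wire to mean "no checksum", so never emit a
         * computed 0: substitute 0xffff instead */
        *(__sum16 *)(skb->data + offset) = sum ?: CSUM_MANGLED_0;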
     

30 Oct, 2016

2 commits

  • Pull networking fixes from David Miller:
    "Lots of fixes, mostly drivers as is usually the case.

    1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
    Khoroshilov.

    2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
    Pedersen.

    3) Don't put aead_req crypto struct on the stack in mac80211, from
    Ard Biesheuvel.

    4) Several uninitialized variable warning fixes from Arnd Bergmann.

    5) Fix memory leak in cxgb4, from Colin Ian King.

    6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.

    7) Several VRF semantic fixes from David Ahern.

    8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.

    9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.

    10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.

    11) Fix stale link state during failover in NCSI driver, from Gavin
    Shan.

    12) Fix netdev lower adjacency list traversal, from Ido Schimmel.

    13) Provide the proper handle when emitting notifications of filter
    deletes, from Jamal Hadi Salim.

    14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.

    15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.

    16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.

    17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.

    18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
    Leitner.

    19) Revert a netns locking change that causes regressions, from Paul
    Moore.

    20) Add recursion limit to GRO handling, from Sabrina Dubroca.

    21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.

    22) Avoid accessing stale vxlan/geneve socket in data path, from
    Pravin Shelar"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
    geneve: avoid using stale geneve socket.
    vxlan: avoid using stale vxlan socket.
    qede: Fix out-of-bound fastpath memory access
    net: phy: dp83848: add dp83822 PHY support
    enic: fix rq disable
    tipc: fix broadcast link synchronization problem
    ibmvnic: Fix missing brackets in init_sub_crq_irqs
    ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
    Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
    arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
    net/mlx4_en: Save slave ethtool stats command
    net/mlx4_en: Fix potential deadlock in port statistics flow
    net/mlx4: Fix firmware command timeout during interrupt test
    net/mlx4_core: Do not access comm channel if it has not yet been initialized
    net/mlx4_en: Fix panic during reboot
    net/mlx4_en: Process all completions in RX rings after port goes up
    net/mlx4_en: Resolve dividing by zero in 32-bit system
    net/mlx4_core: Change the default value of enable_qos
    net/mlx4_core: Avoid setting ports to auto when only one port type is supported
    net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
    ...

    Linus Torvalds
     
  • When transmitting on a packet socket with PACKET_VNET_HDR and
    PACKET_QDISC_BYPASS, validate device support for features requested
    in vnet_hdr.

    Drop TSO packets sent to devices that do not support TSO or have the
    feature disabled. Note that the latter currently do process those
    packets correctly, regardless of not advertising the feature.

    Because of SKB_GSO_DODGY, it is not sufficient to test device features
    with netif_needs_gso. Full validate_xmit_skb is needed.

    Switch to software checksum for non-TSO packets that request checksum
    offload if that device feature is unsupported or disabled. Note that
    similar to the TSO case, device drivers may perform checksum offload
    correctly even when not advertising it.

    When switching to software checksum, packets hit skb_checksum_help,
    which has two BUG_ON checks requiring the checksum fields to lie in the
    linear segment. Packet sockets always allocate at least up to
    csum_start + csum_off + 2 as linear.

    Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c

    ethtool -K eth0 tso off tx on
    psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
    psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N

    ethtool -K eth0 tx off
    psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
    psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N

    v2:
    - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)

    Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option")
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

21 Oct, 2016

1 commit

  • Currently, GRO can do unlimited recursion through the gro_receive
    handlers. This was fixed for tunneling protocols by limiting tunnel GRO
    to one level with encap_mark, but both VLAN and TEB still have this
    problem. Thus, the kernel is vulnerable to a stack overflow, if we
    receive a packet composed entirely of VLAN headers.

    This patch adds a recursion counter to the GRO layer to prevent stack
    overflow. When a gro_receive function hits the recursion limit, GRO is
    aborted for this skb and it is processed normally. This recursion
    counter is put in the GRO CB, but could be turned into a percpu counter
    if we run out of space in the CB.

    Thanks to Vladimír Beneš for the initial bug report.

    Fixes: CVE-2016-7039
    Fixes: 9b174d88c257 ("net: Add Transparent Ethernet Bridging GRO support.")
    Fixes: 66e5133f19e9 ("vlan: Add GRO support for non hardware accelerated vlan")
    Signed-off-by: Sabrina Dubroca
    Reviewed-by: Jiri Benc
    Acked-by: Hannes Frederic Sowa
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Sabrina Dubroca
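
    A condensed sketch of the mechanism (hedged; the helper names and the
    limit follow the upstream patch as closely as remembered):

        #define GRO_RECURSION_LIMIT 15

        static inline int gro_recursion_inc_test(struct sk_buff *skb)
        {
            return ++NAPI_GRO_CB(skb)->recursion_counter == GRO_RECURSION_LIMIT;
        }

        /* every nested gro_receive handler is invoked through this wrapper;
         * past the limit the skb is flushed and processed normally */
        static inline struct sk_buff **call_gro_receive(gro_receive_t cb,
                                                        struct sk_buff **head,
                                                        struct sk_buff *skb)
        {
            if (unlikely(gro_recursion_inc_test(skb))) {
                NAPI_GRO_CB(skb)->flush |= 1;
                return NULL;
            }
            return cb(head, skb);
        }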
     

19 Oct, 2016

1 commit

  • Tamir reported the following trace when processing ARP requests received
    via a vlan device on top of a VLAN-aware bridge:

    NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
    [...]
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.8.0-rc7 #1
    Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
    task: ffff88017edfea40 task.stack: ffff88017ee10000
    RIP: 0010:[] [] netdev_all_lower_get_next_rcu+0x33/0x60
    [...]
    Call Trace:

    [] mlxsw_sp_port_lower_dev_hold+0x5a/0xa0 [mlxsw_spectrum]
    [] mlxsw_sp_router_netevent_event+0x80/0x150 [mlxsw_spectrum]
    [] notifier_call_chain+0x4a/0x70
    [] atomic_notifier_call_chain+0x1a/0x20
    [] call_netevent_notifiers+0x1b/0x20
    [] neigh_update+0x306/0x740
    [] neigh_event_ns+0x4e/0xb0
    [] arp_process+0x66f/0x700
    [] ? common_interrupt+0x8c/0x8c
    [] arp_rcv+0x139/0x1d0
    [] ? vlan_do_receive+0xda/0x320
    [] __netif_receive_skb_core+0x524/0xab0
    [] ? dev_queue_xmit+0x10/0x20
    [] ? br_forward_finish+0x3d/0xc0 [bridge]
    [] ? br_handle_vlan+0xf6/0x1b0 [bridge]
    [] __netif_receive_skb+0x18/0x60
    [] netif_receive_skb_internal+0x40/0xb0
    [] netif_receive_skb+0x1c/0x70
    [] br_pass_frame_up+0xc6/0x160 [bridge]
    [] ? deliver_clone+0x37/0x50 [bridge]
    [] ? br_flood+0xcc/0x160 [bridge]
    [] br_handle_frame_finish+0x224/0x4f0 [bridge]
    [] br_handle_frame+0x174/0x300 [bridge]
    [] __netif_receive_skb_core+0x329/0xab0
    [] ? find_next_bit+0x15/0x20
    [] ? cpumask_next_and+0x32/0x50
    [] ? load_balance+0x178/0x9b0
    [] __netif_receive_skb+0x18/0x60
    [] netif_receive_skb_internal+0x40/0xb0
    [] netif_receive_skb+0x1c/0x70
    [] mlxsw_sp_rx_listener_func+0x61/0xb0 [mlxsw_spectrum]
    [] mlxsw_core_skb_receive+0x187/0x200 [mlxsw_core]
    [] mlxsw_pci_cq_tasklet+0x63a/0x9b0 [mlxsw_pci]
    [] tasklet_action+0xf6/0x110
    [] __do_softirq+0xf6/0x280
    [] irq_exit+0xdf/0xf0
    [] do_IRQ+0x54/0xd0
    [] common_interrupt+0x8c/0x8c

    The problem is that netdev_all_lower_get_next_rcu() never advances the
    iterator, thereby causing the loop over the lower adjacency list to run
    forever.

    Fix this by advancing the iterator, thereby avoiding the infinite loop.

    Fixes: 7ce856aaaf13 ("mlxsw: spectrum: Add couple of lower device helper functions")
    Signed-off-by: Ido Schimmel
    Reported-by: Tamir Winetroub
    Reviewed-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
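
    The shape of the fix (illustrative sketch; the point is that the cursor
    is now moved forward on every call):

        struct net_device *netdev_all_lower_get_next_rcu(struct net_device *dev,
                                                         struct list_head **iter)
        {
            struct netdev_adjacent *lower;

            lower = list_entry_rcu((*iter)->next, struct netdev_adjacent, list);
            if (&lower->list == &dev->all_adj_list.lower)
                return NULL;

            *iter = &lower->list;   /* advance, so the walk terminates */
            return lower->dev;
        }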
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provides some level of
    latent entropy.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
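
    Usage sketch (illustrative; the variable and function names are made
    up):

        /* on an integer-typed variable: the plugin fills it with random
         * contents at build time */
        static unsigned long pool[8] __latent_entropy;

        /* on a function (typically __init or otherwise unpredictably
         * timed): the plugin instruments it to gather control-flow
         * entropy */
        static int __init __latent_entropy example_init(void)
        {
            return 0;
        }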
     

04 Oct, 2016

1 commit

  • This is a respin of a patch to fix a relatively easily reproducible kernel
    panic related to the all_adj_list handling for netdevs in recent kernels.

    The following sequence of commands will reproduce the issue:

    ip link add link eth0 name eth0.100 type vlan id 100
    ip link add link eth0 name eth0.200 type vlan id 200
    ip link add name testbr type bridge
    ip link set eth0.100 master testbr
    ip link set eth0.200 master testbr
    ip link add link testbr mac0 type macvlan
    ip link delete dev testbr

    This creates an upper/lower tree of (excuse the poor ASCII art):

                    /--- eth0.100 --- eth0
    mac0 --- testbr
                    \--- eth0.200 --- eth0

    When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
    the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
    reference to eth0 is added, so this results in a panic.

    This change adds reference count propagation so things are handled properly.

    Matthias Schiffer reported a similar crash in batman-adv:

    https://github.com/freifunk-gluon/gluon/issues/680
    https://www.open-mesh.org/issues/247

    which this patch also seems to resolve.

    Signed-off-by: Andrew Collins
    Signed-off-by: David S. Miller

    Andrew Collins
     

26 Sep, 2016

1 commit

  • Conflicts:
    net/netfilter/core.c
    net/netfilter/nf_tables_netdev.c

    Resolve two conflicts before pull request for David's net-next tree:

    1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant
    ip_hdr assignment") from the net tree and commit ddc8b6027ad0
    ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").

    2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and
    Aaron Conole's patches to replace list_head with single linked list.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     


05 Sep, 2016

1 commit

    The following few steps will crash the kernel -

    (a) Create bonding master
    > modprobe bonding miimon=50
    (b) Create macvlan bridge on eth2
    > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
    type macvlan
    (c) Now try adding eth2 into the bond
    > echo +eth2 > /sys/class/net/bond0/bonding/slaves

    Bonding does lots of things before checking whether the device being
    enslaved is busy or not.

    In this case, when the notifier call-chain sends notifications,
    bond_netdev_event() assumes that the rx_handler / rx_handler_data is
    registered, while bond_enslave() hasn't progressed far enough to
    register an rx_handler for the new slave.

    This patch adds a rx_handler check that can be performed right at the
    beginning of the enslave code to avoid getting into this situation.

    Signed-off-by: Mahesh Bandewar
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Mahesh Bandewar
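
    A sketch of the early check (hedged; it assumes a helper along the
    lines of netdev_is_rx_handler_busy(), which tests whether an rx_handler
    is already registered on the device):

        /* at the very top of bond_enslave(): bail out before any state is
         * touched if someone else already owns the slave's rx_handler */
        if (netdev_is_rx_handler_busy(slave_dev)) {
            netdev_err(bond_dev,
                       "Error: Device is in use and cannot be enslaved\n");
            return -EBUSY;
        }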
     

31 Aug, 2016

1 commit

  • After commit 145dd5f9c88f ("net: flush the softnet backlog in process
    context"), we can easily batch calls to flush_all_backlogs() for all
    devices processed in rollback_registered_many()

    Tested:

    Before patch, on an idle host.

    modprobe dummy numdummies=10000
    perf stat -e context-switches -a rmmod dummy

    Performance counter stats for 'system wide':

    1,211,798 context-switches

    1.302137465 seconds time elapsed

    After patch:

    perf stat -e context-switches -a rmmod dummy

    Performance counter stats for 'system wide':

    225,523 context-switches

    0.721623566 seconds time elapsed

    Signed-off-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Aug, 2016

2 commits

  • switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
    port netdevs so that packets being flooded by the device won't be
    flooded twice.

    It works by assigning a unique identifier (the ifindex of the first
    bridge port) to bridge ports sharing the same parent ID. This prevents
    packets from being flooded twice by the same switch, but will flood
    packets through bridge ports belonging to a different switch.

    This method is problematic when stacked devices are taken into account,
    such as VLANs. In such cases, a physical port netdev can have upper
    devices being members in two different bridges, thus requiring two
    different 'offload_fwd_mark's to be configured on the port netdev, which
    is impossible.

    The main problem is that packet and netdev marking is performed at the
    physical netdev level, whereas flooding occurs between bridge ports,
    which are not necessarily port netdevs.

    Instead, packet and netdev marking should really be done in the bridge
    driver with the switch driver only telling it which packets it already
    forwarded. The bridge driver will mark such packets using the mark
    assigned to the ingress bridge port and will prevent the packet from
    being forwarded through any bridge port sharing the same mark (i.e.
    having the same parent ID).

    Remove the current switchdev 'offload_fwd_mark' implementation and
    instead implement the proposed method. In addition, make rocker - the
    sole user of the mark - use the proposed method.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Currently in process_backlog(), the process_queue dequeuing is
    performed with local IRQ disabled, to protect against
    flush_backlog(), which runs in hard IRQ context.

    This patch moves the flush operation to a work queue and runs the
    callback with bottom half disabled to protect the process_queue
    against dequeuing.
    Since process_queue is now always manipulated in bottom half context,
    the irq disable/enable pair around the dequeue operation is removed.

    To keep the flush time as low as possible, the flush works are
    scheduled on all online CPUs simultaneously, using the high-priority
    workqueue and statically allocated, per-CPU work structs.

    Overall this change increases the time required to destroy a device,
    in exchange for a slight improvement in packet reinjection
    performance.

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
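
    A sketch of the scheduling described above (hedged; names approximate
    the upstream implementation):

        static DEFINE_PER_CPU(struct work_struct, flush_works);

        static void flush_all_backlogs(void)
        {
            unsigned int cpu;

            get_online_cpus();

            /* kick one statically allocated work per online CPU, on the
             * high-priority workqueue, then wait for all of them */
            for_each_online_cpu(cpu)
                queue_work_on(cpu, system_highpri_wq,
                              per_cpu_ptr(&flush_works, cpu));

            for_each_online_cpu(cpu)
                flush_work(per_cpu_ptr(&flush_works, cpu));

            put_online_cpus();
        }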
     


11 Aug, 2016

1 commit

  • Convert the per-device linked list into a hashtable. The primary
    motivation for this change is that currently, we're not tracking all the
    qdiscs in the hierarchy (e.g. excluding default qdiscs), as the lookup
    performed over the linked list by qdisc_match_from_root() is rather
    expensive.

    The ultimate goal is to get rid of hidden qdiscs completely, which will
    bring much more determinism in user experience.

    Reviewed-by: Cong Wang
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
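
    A rough sketch of the lookup after the conversion (hedged: the struct
    members are assumptions patterned on <linux/hashtable.h>, not the exact
    upstream layout):

        /* assume struct net_device gains  DECLARE_HASHTABLE(qdisc_hash, 4);
         * and struct Qdisc gains          struct hlist_node hash;          */

        static void qdisc_hash_add(struct Qdisc *q)
        {
            hash_add_rcu(qdisc_dev(q)->qdisc_hash, &q->hash, q->handle);
        }

        static struct Qdisc *qdisc_find(struct net_device *dev, u32 handle)
        {
            struct Qdisc *q;

            /* hash lookup by handle instead of walking the whole qdisc
             * list as qdisc_match_from_root() used to do */
            hash_for_each_possible_rcu(dev->qdisc_hash, q, hash, handle)
                if (q->handle == handle)
                    return q;
            return NULL;
        }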
     

29 Jul, 2016

1 commit

  • This changes the vfs dentry hashing to mix in the parent pointer at the
    _beginning_ of the hash, rather than at the end.

    That actually improves both the hash and the code generation, because we
    can move more of the computation to the "static" part of the dcache
    setup, and do less at lookup runtime.

    It turns out that a lot of other hash users also really wanted to mix in
    a base pointer as a 'salt' for the hash, and so the slightly extended
    interface ends up working well for other cases too.

    Users that want a string hash that is purely about the string pass in a
    'salt' pointer of NULL.

    * merge branch 'salted-string-hash':
    fs/dcache.c: Save one 32-bit multiply in dcache lookup
    vfs: make the string hashes salt the hash

    Linus Torvalds
     


10 Jul, 2016

1 commit

    An important piece of information for the napi_poll tracepoint is the
    work done (packets processed) by the napi_poll() call. Add both the
    work done and the budget, as they are related.

    Handle the trace_napi_poll() parameter change in dropwatch/drop_monitor
    and in the python perf script netdev-times.py in a backward-compatible
    way, as python fortunately supports optional parameter handling.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
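
    Call-site sketch of the extended tracepoint, inside napi_poll() with
    'n' the napi instance (names illustrative):

        work = n->poll(n, budget);
        /* report both the packets actually processed and the budget the
         * poll was given */
        trace_napi_poll(n, work, budget);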
     


05 Jul, 2016

1 commit

    Add functions that iterate over lower devices and find a port device.
    As a dependency, add the netdev_for_each_all_lower_dev and
    netdev_for_each_all_lower_dev_rcu macros with the
    netdev_all_lower_get_next and netdev_all_lower_get_next_rcu helpers.

    Also, add functions to return mlxsw struct according to lower device
    found and mlxsw_port struct with a reference to lower device.

    Signed-off-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Jiri Pirko
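
    Usage sketch for the new iteration helpers, as the body of a lookup
    helper over 'dev' (the port check is a made-up stand-in for the real
    mlxsw predicate):

        struct net_device *lower;
        struct list_head *iter;

        /* walk every lower device, at any depth, under RCU and pick out
         * the first one that is a switch port */
        netdev_for_each_all_lower_dev_rcu(dev, lower, iter) {
            if (is_mlxsw_port(lower))          /* hypothetical predicate */
                return netdev_priv(lower);
        }
        return NULL;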
     

26 Jun, 2016

1 commit

  • Qdisc performance suffers when packets are dropped at enqueue()
    time because drops (kfree_skb()) are done while qdisc lock is held,
    delaying a dequeue() draining the queue.

    Nominal throughput can be reduced by 50 % when this happens,
    at a time we would like the dequeue() to proceed as fast as possible.

    Even FQ is vulnerable to this problem, while one of FQ goals was
    to provide some flow isolation.

    This patch adds a 'struct sk_buff **to_free' parameter to all
    qdisc->enqueue() handlers and to the qdisc_drop() helper.

    I measured a performance increase of up to 12 %, but this patch
    is a prereq so that future batches in enqueue() can fly.

    Signed-off-by: Eric Dumazet
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Eric Dumazet
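
    A sketch of the new calling convention ('some_limit', 'toy_enqueue' and
    'root_lock' are placeholders):

        static int toy_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                               struct sk_buff **to_free)
        {
            if (unlikely(sch->q.qlen >= some_limit))
                /* the victim is chained onto *to_free instead of being
                 * freed here, while the qdisc lock is still held */
                return qdisc_drop(skb, sch, to_free);

            return qdisc_enqueue_tail(skb, sch);
        }

        /* caller side (sketch): collect victims, free them only after the
         * qdisc root lock has been released */
        struct sk_buff *to_free = NULL;

        rc = toy_enqueue(skb, sch, &to_free);
        spin_unlock(root_lock);
        if (unlikely(to_free))
            kfree_skb_list(to_free);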
     


11 Jun, 2016

2 commits

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
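
    A toy illustration of the idea (not the kernel's actual hash function):
    seed the hash state with the salt pointer up front instead of mixing it
    in at the end.

        #include <stdio.h>

        /* the salt (e.g. the parent dentry pointer) goes into the initial
         * state, so the per-character loop needs no extra work and the
         * final mixing step disappears */
        static unsigned long hash_str_salted(const void *salt, const char *s)
        {
            unsigned long h = (unsigned long)salt;

            while (*s)
                h = h * 31 + (unsigned char)*s++;
            return h;
        }

        int main(void)
        {
            int parent;   /* stand-in for a parent dentry */

            printf("%lx\n", hash_str_salted(&parent, "filename"));
            printf("%lx\n", hash_str_salted(NULL, "filename"));  /* no salt */
            return 0;
        }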
     
  • Respect the stack's xmit_recursion limit for calls into dev_queue_xmit().
    Currently, they are not handled by the limiter when attached to clsact's
    egress parent, for example, and a buggy program redirecting it to the
    same device again could run into stack overflow eventually. It would be
    good if we could notify an admin to give him a chance to react. We reuse
    xmit_recursion instead of having one private to eBPF, so that the stack's
    current recursion depth will be taken into account as well. Follow-up to
    commit 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper") and
    27b29f63058d ("bpf: add bpf_redirect() helper").

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
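
    A sketch of reusing the stack's counter (hedged; the per-cpu
    xmit_recursion variable and XMIT_RECURSION_LIMIT come from the stack,
    the wrapper function itself is illustrative):

        static int bpf_xmit_limited(struct sk_buff *skb, struct net_device *dev)
        {
            int ret = -ENETDOWN;

            if (likely(__this_cpu_read(xmit_recursion) <= XMIT_RECURSION_LIMIT)) {
                skb->dev = dev;
                /* count BPF-driven redirects against the same per-cpu
                 * recursion budget the rest of the stack uses */
                __this_cpu_inc(xmit_recursion);
                ret = dev_queue_xmit(skb);
                __this_cpu_dec(xmit_recursion);
            } else {
                net_crit_ratelimited("bpf: recursion limit reached, buggy bpf program?\n");
                kfree_skb(skb);
            }
            return ret;
        }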
     


08 Jun, 2016

2 commits

  • Instead of using a single bit (__QDISC___STATE_RUNNING)
    in sch->__state, use a seqcount.

    This adds lockdep support, but more importantly it will allow us
    to sample qdisc/class statistics without having to grab qdisc root lock.

    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
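
    A sketch of the seqcount-based running state (hedged; close to the
    upstream helpers but abbreviated):

        static inline bool qdisc_is_running(const struct Qdisc *qdisc)
        {
            return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
        }

        static inline bool qdisc_run_begin(struct Qdisc *qdisc)
        {
            if (qdisc_is_running(qdisc))
                return false;
            /* an odd seqcount means "running"; this gives lockdep coverage
             * and lets stats readers retry instead of taking the root lock */
            raw_write_seqcount_begin(&qdisc->running);
            return true;
        }

        static inline void qdisc_run_end(struct Qdisc *qdisc)
        {
            raw_write_seqcount_end(&qdisc->running);
        }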
     
    Note: Tom Herbert posted almost the same patch 3 months back, but for
    different reasons.

    The reasons we want to get rid of this spin_trylock() are :

    1) Under high qdisc pressure, the spin_trylock() has almost no
    chance to succeed.

    2) We loop multiple times in softirq handler, eventually reaching
    the max retry count (10), and we schedule ksoftirqd.

    Since we want to adhere more strictly to ksoftirqd being woken up in
    the future (https://lwn.net/Articles/687617/), better avoid spurious
    wakeups.

    3) calls to __netif_reschedule() dirty the cache line containing
    q->next_sched, slowing down the owner of qdisc.

    4) RT kernels can not use the spin_trylock() here.

    With the help of busylock, we get the qdisc spinlock fast enough, and
    the trylock trick brings only a performance penalty.

    Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
    performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)

    ("mpstat -I SCPU 1" is much happier now)

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 May, 2016

1 commit

  • Follow-up for 8a3a4c6e7b34 ("net: make sch_handle_ingress() drop
    monitor ready") to also make the egress side drop monitor ready.

    Also here only TC_ACT_SHOT is a clear indication that something
    went wrong. Hence don't provide false positives to drop monitors
    such as 'perf record -e skb:kfree_skb ...'.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
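
    Sketch of the distinction in the egress hook (illustrative; the switch
    variable name is made up):

        switch (tc_verdict) {
        case TC_ACT_SHOT:
            /* real policy drop: let drop monitors see it */
            kfree_skb(skb);
            return NULL;
        case TC_ACT_STOLEN:
        case TC_ACT_QUEUED:
            /* mirred/redirected and consumed, not an error: don't trigger
             * the kfree_skb tracepoint */
            consume_skb(skb);
            return NULL;
        default:
            break;
        }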
     

12 May, 2016

1 commit

  • Currently the VRF driver uses the rx_handler to switch the skb device
    to the VRF device. Switching the dev prior to the ip / ipv6 layer
    means the VRF driver has to duplicate IP/IPv6 processing which adds
    overhead and makes features such as retaining the ingress device index
    more complicated than necessary.

    This patch moves the hook to the L3 layer just after the first NF_HOOK
    for PRE_ROUTING. This location makes exposing the original ingress device
    trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
    in the future.

    dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
    with the switched device through the packet taps to maintain current
    behavior (tcpdump can be used on either the vrf device or the enslaved
    devices).

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

09 May, 2016

1 commit

  • TC_ACT_STOLEN is used when ingress traffic is mirred/redirected
    to say ifb.

    Packet is not dropped, but consumed.

    Only TC_ACT_SHOT is a clear indication something went wrong.

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Acked-by: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 May, 2016

1 commit