04 Feb, 2017

1 commit

  • [ Upstream commit 7be2c82cfd5d28d7adb66821a992604eb6dd112e ]

    Ashizuka reported a highmem oddity and sent a patch for the
    Freescale fec driver.

    But the root cause is that the core networking stack must ensure
    that no skb with a highmem fragment is ever sent through a device
    that does not assert NETIF_F_HIGHDMA in its features.

    We need to call illegal_highdma() from harmonize_features()
    regardless of CSUM checks.

    Fixes: ec5f06156423 ("net: Kill link between CSUM and SG features.")
    Signed-off-by: Eric Dumazet
    Cc: Pravin Shelar
    Reported-by: "Ashizuka, Yuusuke"
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
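
    A minimal sketch of the idea (illustrative only; the fragment walk
    mirrors what illegal_highdma() does upstream, details may differ):

        static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
        {
            int i;

            if (dev->features & NETIF_F_HIGHDMA)
                return 0;

            /* reject any skb carrying a fragment that lives in highmem,
             * since the device cannot DMA from such pages */
            for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
                skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

                if (PageHighMem(skb_frag_page(frag)))
                    return 1;
            }
            return 0;
        }

        /* in harmonize_features(), independently of the CSUM checks: */
        if (illegal_highdma(dev, skb))
            features &= ~NETIF_F_SG;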
     

15 Jan, 2017

2 commits

  • [ Upstream commit 7cfd5fd5a9813f1430290d20c0fead9b4582a307 ]

    On 32bit arches, (skb->end - skb->data) is not 'unsigned int',
    so we shall use min_t() instead of min() to avoid a compiler error.

    Fixes: 1272ce87fa01 ("gro: Enter slow-path if there is no tailroom")
    Reported-by: kernel test robot
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
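
    For illustration, a small userspace rendition of the type issue (the
    min_t() below is a simplified stand-in for the kernel macro, and the
    buffer names are made up):

        #include <stdio.h>

        /* simplified stand-in for the kernel's min_t(): force both operands
         * to one explicit type before comparing, so a pointer difference and
         * an unsigned int can be compared without a type clash */
        #define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

        int main(void)
        {
            unsigned char buf[256];
            unsigned char *data = buf;
            unsigned char *end = buf + sizeof(buf);
            unsigned int frag0_len = 512;

            /* (end - data) is ptrdiff_t; min() on mixed types does not build
             * in the kernel, min_t(unsigned int, ...) does */
            unsigned int len = min_t(unsigned int, frag0_len,
                                     (unsigned int)(end - data));

            printf("capped length: %u\n", len);
            return 0;
        }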
     
  • [ Upstream commit 1272ce87fa017ca4cf32920764d879656b7a005a ]

    The GRO path has a fast-path where we avoid calling pskb_may_pull
    and pskb_expand by directly accessing frag0. However, this should
    only be done if we have enough tailroom in the skb as otherwise
    we'll have to expand it later anyway.

    This patch adds the check by capping frag0_len with the skb tailroom.

    Fixes: cb18978cbf45 ("gro: Open-code final pskb_may_pull")
    Reported-by: Slava Shwartsman
    Signed-off-by: Herbert Xu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
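
    A rough sketch of the capping (illustrative; helper and field names
    follow mainline, the exact hunk may differ):

        static void skb_gro_reset_offset(struct sk_buff *skb)
        {
            const struct skb_shared_info *pinfo = skb_shinfo(skb);
            const skb_frag_t *frag0 = &pinfo->frags[0];

            NAPI_GRO_CB(skb)->frag0 = NULL;
            NAPI_GRO_CB(skb)->frag0_len = 0;

            if (skb_mac_header(skb) == skb_tail_pointer(skb) &&
                pinfo->nr_frags &&
                !PageHighMem(skb_frag_page(frag0))) {
                NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
                /* cap the fast-path length with the skb tailroom, so the
                 * slow path is taken whenever a later pull would have to
                 * expand the skb anyway */
                NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
                                                    skb_frag_size(frag0),
                                                    skb->end - skb->tail);
            }
        }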
     

13 Nov, 2016

1 commit

  • If the bpf program calls bpf_redirect(dev, 0) and dev is
    an ipip/ip6tnl, it currently includes the mac header.
    e.g. If dev is ipip, the end result is IP-EthHdr-IP instead
    of IP-IP.

    The fix is to pull the mac header. At ingress, skb_postpull_rcsum()
    is not needed because the ethhdr should have been pulled once already
    and then got pushed back just before calling the bpf_prog.
    At egress, this patch calls skb_postpull_rcsum().

    If bpf_redirect(dev, BPF_F_INGRESS) is called,
    it also fails now because it calls dev_forward_skb() which
    eventually calls eth_type_trans(skb, dev). The eth_type_trans()
    will set skb->pkt_type = PACKET_OTHERHOST because the mac address
    does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST
    will eventually cause ip_rcv() to error out. To fix this,
    ____dev_forward_skb() is added.

    Joint work with Daniel Borkmann.

    Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
    Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Martin KaFai Lau
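
    A simplified sketch of the egress-side idea (hedged: the function name
    and surrounding plumbing are illustrative, not the exact upstream
    helpers):

        static int xmit_skb_no_mac(struct net_device *dev, struct sk_buff *skb)
        {
            unsigned int mlen = skb->mac_len;

            if (mlen) {
                void *mac = skb->data;     /* MAC header sits at the head */

                __skb_pull(skb, mlen);     /* strip it for an L3 tunnel dev */
                /* keep the skb checksum coherent after the pull, as the
                 * changelog notes for the egress path */
                skb_postpull_rcsum(skb, mac, mlen);
            }

            skb->dev = dev;
            return dev_queue_xmit(skb);
        }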
     

01 Nov, 2016

1 commit

    Sending a zero checksum is ok for TCP, but not for UDP.

    A UDPv6 receiver should by default drop a frame with a 0 checksum,
    and a UDPv4 receiver would not verify the checksum and might accept
    a corrupted packet.

    Simply replace such a checksum with 0xffff, regardless of transport.

    This error was caught on SIT tunnels, but seems generic.

    Signed-off-by: Eric Dumazet
    Cc: Maciej Żenczykowski
    Cc: Willem de Bruijn
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
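
    A sketch of the mangling, inside the checksum helper once the sum has
    been computed into 'csum' at offset 'offset' (names illustrative; in
    mainline CSUM_MANGLED_0 is 0xffff, an equivalent value in one's
    complement arithmetic):

        __sum16 sum = csum_fold(csum);

        /* UDP uses 0 on the wire to mean "no checksum", so never emit a
         * computed 0: substitute 0xffff instead */
        *(__sum16 *)(skb->data + offset) = sum ?: CSUM_MANGLED_0;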
     

30 Oct, 2016

2 commits

  • Pull networking fixes from David Miller:
    "Lots of fixes, mostly drivers as is usually the case.

    1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
    Khoroshilov.

    2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
    Pedersen.

    3) Don't put aead_req crypto struct on the stack in mac80211, from
    Ard Biesheuvel.

    4) Several uninitialized variable warning fixes from Arnd Bergmann.

    5) Fix memory leak in cxgb4, from Colin Ian King.

    6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.

    7) Several VRF semantic fixes from David Ahern.

    8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.

    9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.

    10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.

    11) Fix stale link state during failover in NCSI driver, from Gavin
    Shan.

    12) Fix netdev lower adjacency list traversal, from Ido Schimmel.

    13) Provide the proper handle when emitting notifications of filter
    deletes, from Jamal Hadi Salim.

    14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.

    15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.

    16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.

    17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.

    18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
    Leitner.

    19) Revert a netns locking change that causes regressions, from Paul
    Moore.

    20) Add recursion limit to GRO handling, from Sabrina Dubroca.

    21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.

    22) Avoid accessing stale vxlan/geneve socket in data path, from
    Pravin Shelar"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
    geneve: avoid using stale geneve socket.
    vxlan: avoid using stale vxlan socket.
    qede: Fix out-of-bound fastpath memory access
    net: phy: dp83848: add dp83822 PHY support
    enic: fix rq disable
    tipc: fix broadcast link synchronization problem
    ibmvnic: Fix missing brackets in init_sub_crq_irqs
    ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
    Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
    arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
    net/mlx4_en: Save slave ethtool stats command
    net/mlx4_en: Fix potential deadlock in port statistics flow
    net/mlx4: Fix firmware command timeout during interrupt test
    net/mlx4_core: Do not access comm channel if it has not yet been initialized
    net/mlx4_en: Fix panic during reboot
    net/mlx4_en: Process all completions in RX rings after port goes up
    net/mlx4_en: Resolve dividing by zero in 32-bit system
    net/mlx4_core: Change the default value of enable_qos
    net/mlx4_core: Avoid setting ports to auto when only one port type is supported
    net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
    ...

    Linus Torvalds
     
  • When transmitting on a packet socket with PACKET_VNET_HDR and
    PACKET_QDISC_BYPASS, validate device support for features requested
    in vnet_hdr.

    Drop TSO packets sent to devices that do not support TSO or have the
    feature disabled. Note that the latter currently do process those
    packets correctly, regardless of not advertising the feature.

    Because of SKB_GSO_DODGY, it is not sufficient to test device features
    with netif_needs_gso. Full validate_xmit_skb is needed.

    Switch to software checksum for non-TSO packets that request checksum
    offload if that device feature is unsupported or disabled. Note that
    similar to the TSO case, device drivers may perform checksum offload
    correctly even when not advertising it.

    When switching to software checksum, packets hit skb_checksum_help,
    which has two BUG_ON checks requiring the checksum fields to lie in the
    linear segment. Packet sockets always allocate at least up to
    csum_start + csum_off + 2 as linear.

    Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c

    ethtool -K eth0 tso off tx on
    psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
    psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N

    ethtool -K eth0 tx off
    psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
    psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N

    v2:
    - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)

    Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option")
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

21 Oct, 2016

1 commit

  • Currently, GRO can do unlimited recursion through the gro_receive
    handlers. This was fixed for tunneling protocols by limiting tunnel GRO
    to one level with encap_mark, but both VLAN and TEB still have this
    problem. Thus, the kernel is vulnerable to a stack overflow, if we
    receive a packet composed entirely of VLAN headers.

    This patch adds a recursion counter to the GRO layer to prevent stack
    overflow. When a gro_receive function hits the recursion limit, GRO is
    aborted for this skb and it is processed normally. This recursion
    counter is put in the GRO CB, but could be turned into a percpu counter
    if we run out of space in the CB.

    Thanks to Vladimír Beneš for the initial bug report.

    Fixes: CVE-2016-7039
    Fixes: 9b174d88c257 ("net: Add Transparent Ethernet Bridging GRO support.")
    Fixes: 66e5133f19e9 ("vlan: Add GRO support for non hardware accelerated vlan")
    Signed-off-by: Sabrina Dubroca
    Reviewed-by: Jiri Benc
    Acked-by: Hannes Frederic Sowa
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Sabrina Dubroca
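
    A condensed sketch of the mechanism (hedged; the helper names and the
    limit follow the upstream patch as closely as remembered):

        #define GRO_RECURSION_LIMIT 15

        static inline int gro_recursion_inc_test(struct sk_buff *skb)
        {
            return ++NAPI_GRO_CB(skb)->recursion_counter == GRO_RECURSION_LIMIT;
        }

        /* every nested gro_receive handler is invoked through this wrapper;
         * past the limit the skb is flushed and processed normally */
        static inline struct sk_buff **call_gro_receive(gro_receive_t cb,
                                                        struct sk_buff **head,
                                                        struct sk_buff *skb)
        {
            if (unlikely(gro_recursion_inc_test(skb))) {
                NAPI_GRO_CB(skb)->flush |= 1;
                return NULL;
            }
            return cb(head, skb);
        }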
     

19 Oct, 2016

1 commit

  • Tamir reported the following trace when processing ARP requests received
    via a vlan device on top of a VLAN-aware bridge:

    NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
    [...]
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.8.0-rc7 #1
    Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
    task: ffff88017edfea40 task.stack: ffff88017ee10000
    RIP: 0010:[] [] netdev_all_lower_get_next_rcu+0x33/0x60
    [...]
    Call Trace:

    [] mlxsw_sp_port_lower_dev_hold+0x5a/0xa0 [mlxsw_spectrum]
    [] mlxsw_sp_router_netevent_event+0x80/0x150 [mlxsw_spectrum]
    [] notifier_call_chain+0x4a/0x70
    [] atomic_notifier_call_chain+0x1a/0x20
    [] call_netevent_notifiers+0x1b/0x20
    [] neigh_update+0x306/0x740
    [] neigh_event_ns+0x4e/0xb0
    [] arp_process+0x66f/0x700
    [] ? common_interrupt+0x8c/0x8c
    [] arp_rcv+0x139/0x1d0
    [] ? vlan_do_receive+0xda/0x320
    [] __netif_receive_skb_core+0x524/0xab0
    [] ? dev_queue_xmit+0x10/0x20
    [] ? br_forward_finish+0x3d/0xc0 [bridge]
    [] ? br_handle_vlan+0xf6/0x1b0 [bridge]
    [] __netif_receive_skb+0x18/0x60
    [] netif_receive_skb_internal+0x40/0xb0
    [] netif_receive_skb+0x1c/0x70
    [] br_pass_frame_up+0xc6/0x160 [bridge]
    [] ? deliver_clone+0x37/0x50 [bridge]
    [] ? br_flood+0xcc/0x160 [bridge]
    [] br_handle_frame_finish+0x224/0x4f0 [bridge]
    [] br_handle_frame+0x174/0x300 [bridge]
    [] __netif_receive_skb_core+0x329/0xab0
    [] ? find_next_bit+0x15/0x20
    [] ? cpumask_next_and+0x32/0x50
    [] ? load_balance+0x178/0x9b0
    [] __netif_receive_skb+0x18/0x60
    [] netif_receive_skb_internal+0x40/0xb0
    [] netif_receive_skb+0x1c/0x70
    [] mlxsw_sp_rx_listener_func+0x61/0xb0 [mlxsw_spectrum]
    [] mlxsw_core_skb_receive+0x187/0x200 [mlxsw_core]
    [] mlxsw_pci_cq_tasklet+0x63a/0x9b0 [mlxsw_pci]
    [] tasklet_action+0xf6/0x110
    [] __do_softirq+0xf6/0x280
    [] irq_exit+0xdf/0xf0
    [] do_IRQ+0x54/0xd0
    [] common_interrupt+0x8c/0x8c

    The problem is that netdev_all_lower_get_next_rcu() never advances the
    iterator, thereby causing the loop over the lower adjacency list to run
    forever.

    Fix this by advancing the iterator, thereby avoiding the infinite loop.

    Fixes: 7ce856aaaf13 ("mlxsw: spectrum: Add couple of lower device helper functions")
    Signed-off-by: Ido Schimmel
    Reported-by: Tamir Winetroub
    Reviewed-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
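
    The shape of the fix (illustrative sketch; the point is that the cursor
    is now moved forward on every call):

        struct net_device *netdev_all_lower_get_next_rcu(struct net_device *dev,
                                                         struct list_head **iter)
        {
            struct netdev_adjacent *lower;

            lower = list_entry_rcu((*iter)->next, struct netdev_adjacent, list);
            if (&lower->list == &dev->all_adj_list.lower)
                return NULL;

            *iter = &lower->list;   /* advance, so the walk terminates */
            return lower->dev;
        }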
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provides some level of
    latent entropy.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
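
    Usage sketch (illustrative; the variable and function names are made
    up):

        /* on an integer-typed variable: the plugin fills it with random
         * contents at build time */
        static unsigned long pool[8] __latent_entropy;

        /* on a function (typically __init or otherwise unpredictably
         * timed): the plugin instruments it to gather control-flow
         * entropy */
        static int __init __latent_entropy example_init(void)
        {
            return 0;
        }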
     

04 Oct, 2016

1 commit

  • This is a respin of a patch to fix a relatively easily reproducible kernel
    panic related to the all_adj_list handling for netdevs in recent kernels.

    The following sequence of commands will reproduce the issue:

    ip link add link eth0 name eth0.100 type vlan id 100
    ip link add link eth0 name eth0.200 type vlan id 200
    ip link add name testbr type bridge
    ip link set eth0.100 master testbr
    ip link set eth0.200 master testbr
    ip link add link testbr mac0 type macvlan
    ip link delete dev testbr

    This creates an upper/lower tree of (excuse the poor ASCII art):

                    /--- eth0.100 --- eth0
    mac0 --- testbr
                    \--- eth0.200 --- eth0

    When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
    the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
    reference to eth0 is added, so this results in a panic.

    This change adds reference count propagation so things are handled properly.

    Matthias Schiffer reported a similar crash in batman-adv:

    https://github.com/freifunk-gluon/gluon/issues/680
    https://www.open-mesh.org/issues/247

    which this patch also seems to resolve.

    Signed-off-by: Andrew Collins
    Signed-off-by: David S. Miller

    Andrew Collins
     

26 Sep, 2016

1 commit

  • Conflicts:
    net/netfilter/core.c
    net/netfilter/nf_tables_netdev.c

    Resolve two conflicts before pull request for David's net-next tree:

    1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant
    ip_hdr assignment") from the net tree and commit ddc8b6027ad0
    ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").

    2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and
    Aaron Conole's patches to replace list_head with single linked list.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     


05 Sep, 2016

1 commit

    The following few steps will crash the kernel -

    (a) Create bonding master
    > modprobe bonding miimon=50
    (b) Create macvlan bridge on eth2
    > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
    type macvlan
    (c) Now try adding eth2 into the bond
    > echo +eth2 > /sys/class/net/bond0/bonding/slaves

    Bonding does lots of things before checking whether the device being
    enslaved is busy or not.

    In this case, when the notifier call-chain sends notifications,
    bond_netdev_event() assumes that the rx_handler / rx_handler_data is
    registered, while bond_enslave() hasn't progressed far enough to
    register an rx_handler for the new slave.

    This patch adds a rx_handler check that can be performed right at the
    beginning of the enslave code to avoid getting into this situation.

    Signed-off-by: Mahesh Bandewar
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Mahesh Bandewar
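
    A sketch of the early check (hedged; it assumes a helper along the
    lines of netdev_is_rx_handler_busy(), which tests whether an rx_handler
    is already registered on the device):

        /* at the very top of bond_enslave(): bail out before any state is
         * touched if someone else already owns the slave's rx_handler */
        if (netdev_is_rx_handler_busy(slave_dev)) {
            netdev_err(bond_dev,
                       "Error: Device is in use and cannot be enslaved\n");
            return -EBUSY;
        }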
     

31 Aug, 2016

1 commit

  • After commit 145dd5f9c88f ("net: flush the softnet backlog in process
    context"), we can easily batch calls to flush_all_backlogs() for all
    devices processed in rollback_registered_many()

    Tested:

    Before patch, on an idle host.

    modprobe dummy numdummies=10000
    perf stat -e context-switches -a rmmod dummy

    Performance counter stats for 'system wide':

    1,211,798 context-switches

    1.302137465 seconds time elapsed

    After patch:

    perf stat -e context-switches -a rmmod dummy

    Performance counter stats for 'system wide':

    225,523 context-switches

    0.721623566 seconds time elapsed

    Signed-off-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Aug, 2016

2 commits

  • switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
    port netdevs so that packets being flooded by the device won't be
    flooded twice.

    It works by assigning a unique identifier (the ifindex of the first
    bridge port) to bridge ports sharing the same parent ID. This prevents
    packets from being flooded twice by the same switch, but will flood
    packets through bridge ports belonging to a different switch.

    This method is problematic when stacked devices are taken into account,
    such as VLANs. In such cases, a physical port netdev can have upper
    devices being members in two different bridges, thus requiring two
    different 'offload_fwd_mark's to be configured on the port netdev, which
    is impossible.

    The main problem is that packet and netdev marking is performed at the
    physical netdev level, whereas flooding occurs between bridge ports,
    which are not necessarily port netdevs.

    Instead, packet and netdev marking should really be done in the bridge
    driver with the switch driver only telling it which packets it already
    forwarded. The bridge driver will mark such packets using the mark
    assigned to the ingress bridge port and will prevent the packet from
    being forwarded through any bridge port sharing the same mark (i.e.
    having the same parent ID).

    Remove the current switchdev 'offload_fwd_mark' implementation and
    instead implement the proposed method. In addition, make rocker - the
    sole user of the mark - use the proposed method.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Currently in process_backlog(), the process_queue dequeuing is
    performed with local IRQ disabled, to protect against
    flush_backlog(), which runs in hard IRQ context.

    This patch moves the flush operation to a work queue and runs the
    callback with bottom half disabled to protect the process_queue
    against dequeuing.
    Since process_queue is now always manipulated in bottom half context,
    the irq disable/enable pair around the dequeue operation is removed.

    To keep the flush time as low as possible, the flush works are
    scheduled on all online CPUs simultaneously, using the high-priority
    workqueue and statically allocated, per-CPU work structs.

    Overall this change increases the time required to destroy a device,
    in exchange for a slight improvement in packet reinjection
    performance.

    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
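
    A sketch of the scheduling described above (hedged; names approximate
    the upstream implementation):

        static DEFINE_PER_CPU(struct work_struct, flush_works);

        static void flush_all_backlogs(void)
        {
            unsigned int cpu;

            get_online_cpus();

            /* kick one statically allocated work per online CPU, on the
             * high-priority workqueue, then wait for all of them */
            for_each_online_cpu(cpu)
                queue_work_on(cpu, system_highpri_wq,
                              per_cpu_ptr(&flush_works, cpu));

            for_each_online_cpu(cpu)
                flush_work(per_cpu_ptr(&flush_works, cpu));

            put_online_cpus();
        }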
     


11 Aug, 2016

1 commit

  • Convert the per-device linked list into a hashtable. The primary
    motivation for this change is that currently, we're not tracking all the
    qdiscs in the hierarchy (e.g. excluding default qdiscs), as the lookup
    performed over the linked list by qdisc_match_from_root() is rather
    expensive.

    The ultimate goal is to get rid of hidden qdiscs completely, which will
    bring much more determinism in user experience.

    Reviewed-by: Cong Wang
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
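
    A rough sketch of the lookup after the conversion (hedged: the struct
    members are assumptions patterned on <linux/hashtable.h>, not the exact
    upstream layout):

        /* assume struct net_device gains  DECLARE_HASHTABLE(qdisc_hash, 4);
         * and struct Qdisc gains          struct hlist_node hash;          */

        static void qdisc_hash_add(struct Qdisc *q)
        {
            hash_add_rcu(qdisc_dev(q)->qdisc_hash, &q->hash, q->handle);
        }

        static struct Qdisc *qdisc_find(struct net_device *dev, u32 handle)
        {
            struct Qdisc *q;

            /* hash lookup by handle instead of walking the whole qdisc
             * list as qdisc_match_from_root() used to do */
            hash_for_each_possible_rcu(dev->qdisc_hash, q, hash, handle)
                if (q->handle == handle)
                    return q;
            return NULL;
        }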
     

29 Jul, 2016

1 commit

  • This changes the vfs dentry hashing to mix in the parent pointer at the
    _beginning_ of the hash, rather than at the end.

    That actually improves both the hash and the code generation, because we
    can move more of the computation to the "static" part of the dcache
    setup, and do less at lookup runtime.

    It turns out that a lot of other hash users also really wanted to mix in
    a base pointer as a 'salt' for the hash, and so the slightly extended
    interface ends up working well for other cases too.

    Users that want a string hash that is purely about the string pass in a
    'salt' pointer of NULL.

    * merge branch 'salted-string-hash':
    fs/dcache.c: Save one 32-bit multiply in dcache lookup
    vfs: make the string hashes salt the hash

    Linus Torvalds
     


10 Jul, 2016

1 commit

    An important piece of information for the napi_poll tracepoint is the
    work done (packets processed) by the napi_poll() call. Add both the
    work done and the budget, as they are related.

    Handle the trace_napi_poll() parameter change in dropwatch/drop_monitor
    and in the python perf script netdev-times.py in a backward-compatible
    way, as python fortunately supports optional parameter handling.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
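
    Call-site sketch of the extended tracepoint, inside napi_poll() with
    'n' the napi instance (names illustrative):

        work = n->poll(n, budget);
        /* report both the packets actually processed and the budget the
         * poll was given */
        trace_napi_poll(n, work, budget);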
     


05 Jul, 2016

1 commit

    Add functions that iterate over lower devices and find a port device.
    As a dependency, add the netdev_for_each_all_lower_dev and
    netdev_for_each_all_lower_dev_rcu macros with the
    netdev_all_lower_get_next and netdev_all_lower_get_next_rcu helpers.

    Also, add functions to return mlxsw struct according to lower device
    found and mlxsw_port struct with a reference to lower device.

    Signed-off-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Jiri Pirko
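
    Usage sketch for the new iteration helpers, as the body of a lookup
    helper over 'dev' (the port check is a made-up stand-in for the real
    mlxsw predicate):

        struct net_device *lower;
        struct list_head *iter;

        /* walk every lower device, at any depth, under RCU and pick out
         * the first one that is a switch port */
        netdev_for_each_all_lower_dev_rcu(dev, lower, iter) {
            if (is_mlxsw_port(lower))          /* hypothetical predicate */
                return netdev_priv(lower);
        }
        return NULL;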
     

26 Jun, 2016

1 commit

  • Qdisc performance suffers when packets are dropped at enqueue()
    time because drops (kfree_skb()) are done while qdisc lock is held,
    delaying a dequeue() draining the queue.

    Nominal throughput can be reduced by 50 % when this happens,
    at a time we would like the dequeue() to proceed as fast as possible.

    Even FQ is vulnerable to this problem, while one of FQ goals was
    to provide some flow isolation.

    This patch adds a 'struct sk_buff **to_free' parameter to all
    qdisc->enqueue() handlers and to the qdisc_drop() helper.

    I measured a performance increase of up to 12 %, but this patch
    is a prereq so that future batches in enqueue() can fly.

    Signed-off-by: Eric Dumazet
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Eric Dumazet
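
    A sketch of the new calling convention ('some_limit', 'toy_enqueue' and
    'root_lock' are placeholders):

        static int toy_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                               struct sk_buff **to_free)
        {
            if (unlikely(sch->q.qlen >= some_limit))
                /* the victim is chained onto *to_free instead of being
                 * freed here, while the qdisc lock is still held */
                return qdisc_drop(skb, sch, to_free);

            return qdisc_enqueue_tail(skb, sch);
        }

        /* caller side (sketch): collect victims, free them only after the
         * qdisc root lock has been released */
        struct sk_buff *to_free = NULL;

        rc = toy_enqueue(skb, sch, &to_free);
        spin_unlock(root_lock);
        if (unlikely(to_free))
            kfree_skb_list(to_free);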
     


11 Jun, 2016

2 commits

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
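
    A toy illustration of the idea (not the kernel's actual hash function):
    seed the hash state with the salt pointer up front instead of mixing it
    in at the end.

        #include <stdio.h>

        /* the salt (e.g. the parent dentry pointer) goes into the initial
         * state, so the per-character loop needs no extra work and the
         * final mixing step disappears */
        static unsigned long hash_str_salted(const void *salt, const char *s)
        {
            unsigned long h = (unsigned long)salt;

            while (*s)
                h = h * 31 + (unsigned char)*s++;
            return h;
        }

        int main(void)
        {
            int parent;   /* stand-in for a parent dentry */

            printf("%lx\n", hash_str_salted(&parent, "filename"));
            printf("%lx\n", hash_str_salted(NULL, "filename"));  /* no salt */
            return 0;
        }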
     
  • Respect the stack's xmit_recursion limit for calls into dev_queue_xmit().
    Currently, they are not handled by the limiter when attached to clsact's
    egress parent, for example, and a buggy program redirecting it to the
    same device again could run into stack overflow eventually. It would be
    good if we could notify an admin to give him a chance to react. We reuse
    xmit_recursion instead of having one private to eBPF, so that the stack's
    current recursion depth will be taken into account as well. Follow-up to
    commit 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper") and
    27b29f63058d ("bpf: add bpf_redirect() helper").

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
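
    A sketch of reusing the stack's counter (hedged; the per-cpu
    xmit_recursion variable and XMIT_RECURSION_LIMIT come from the stack,
    the wrapper function itself is illustrative):

        static int bpf_xmit_limited(struct sk_buff *skb, struct net_device *dev)
        {
            int ret = -ENETDOWN;

            if (likely(__this_cpu_read(xmit_recursion) <= XMIT_RECURSION_LIMIT)) {
                skb->dev = dev;
                /* count BPF-driven redirects against the same per-cpu
                 * recursion budget the rest of the stack uses */
                __this_cpu_inc(xmit_recursion);
                ret = dev_queue_xmit(skb);
                __this_cpu_dec(xmit_recursion);
            } else {
                net_crit_ratelimited("bpf: recursion limit reached, buggy bpf program?\n");
                kfree_skb(skb);
            }
            return ret;
        }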
     


08 Jun, 2016

2 commits

  • Instead of using a single bit (__QDISC___STATE_RUNNING)
    in sch->__state, use a seqcount.

    This adds lockdep support, but more importantly it will allow us
    to sample qdisc/class statistics without having to grab qdisc root lock.

    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
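
    A sketch of the seqcount-based running state (hedged; close to the
    upstream helpers but abbreviated):

        static inline bool qdisc_is_running(const struct Qdisc *qdisc)
        {
            return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
        }

        static inline bool qdisc_run_begin(struct Qdisc *qdisc)
        {
            if (qdisc_is_running(qdisc))
                return false;
            /* an odd seqcount means "running"; this gives lockdep coverage
             * and lets stats readers retry instead of taking the root lock */
            raw_write_seqcount_begin(&qdisc->running);
            return true;
        }

        static inline void qdisc_run_end(struct Qdisc *qdisc)
        {
            raw_write_seqcount_end(&qdisc->running);
        }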
     
    Note: Tom Herbert posted almost the same patch 3 months back, but for
    different reasons.

    The reasons we want to get rid of this spin_trylock() are :

    1) Under high qdisc pressure, the spin_trylock() has almost no
    chance to succeed.

    2) We loop multiple times in softirq handler, eventually reaching
    the max retry count (10), and we schedule ksoftirqd.

    Since we want to adhere more strictly to ksoftirqd being woken up in
    the future (https://lwn.net/Articles/687617/), better avoid spurious
    wakeups.

    3) calls to __netif_reschedule() dirty the cache line containing
    q->next_sched, slowing down the owner of qdisc.

    4) RT kernels can not use the spin_trylock() here.

    With the help of busylock, we get the qdisc spinlock fast enough, and
    the trylock trick brings only a performance penalty.

    Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
    performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)

    ("mpstat -I SCPU 1" is much happier now)

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 May, 2016

1 commit

  • Follow-up for 8a3a4c6e7b34 ("net: make sch_handle_ingress() drop
    monitor ready") to also make the egress side drop monitor ready.

    Also here only TC_ACT_SHOT is a clear indication that something
    went wrong. Hence don't provide false positives to drop monitors
    such as 'perf record -e skb:kfree_skb ...'.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
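
    Sketch of the distinction in the egress hook (illustrative; the switch
    variable name is made up):

        switch (tc_verdict) {
        case TC_ACT_SHOT:
            /* real policy drop: let drop monitors see it */
            kfree_skb(skb);
            return NULL;
        case TC_ACT_STOLEN:
        case TC_ACT_QUEUED:
            /* mirred/redirected and consumed, not an error: don't trigger
             * the kfree_skb tracepoint */
            consume_skb(skb);
            return NULL;
        default:
            break;
        }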
     

12 May, 2016

1 commit

  • Currently the VRF driver uses the rx_handler to switch the skb device
    to the VRF device. Switching the dev prior to the ip / ipv6 layer
    means the VRF driver has to duplicate IP/IPv6 processing which adds
    overhead and makes features such as retaining the ingress device index
    more complicated than necessary.

    This patch moves the hook to the L3 layer just after the first NF_HOOK
    for PRE_ROUTING. This location makes exposing the original ingress device
    trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
    in the future.

    dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
    with the switched device through the packet taps to maintain current
    behavior (tcpdump can be used on either the vrf device or the enslaved
    devices).

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

09 May, 2016

1 commit

  • TC_ACT_STOLEN is used when ingress traffic is mirred/redirected
    to say ifb.

    Packet is not dropped, but consumed.

    Only TC_ACT_SHOT is a clear indication something went wrong.

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Acked-by: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 May, 2016

1 commit