Eric Lee / smarc-fsl-linux-kernel

15 Jun, 2014

1 commit

46fb51eb9 net: Fix save software checksum complete ... Browse Code »

Geert reported issues regarding checksum complete and UDP.
The logic introduced in commit 7e3cead5172927732f51fde
("net: Save software checksum complete") is not correct.

This patch:
1) Restores code in __skb_checksum_complete_header except for setting
CHECKSUM_UNNECESSARY. This function may be calculating checksum on
something less than skb->len.
2) Adds saving checksum to __skb_checksum_complete. The full packet
checksum 0..skb->len is calculated without adding in pseudo header.
This value is saved in skb->csum and then the pseudo header is added
to that to derive the checksum for validation.
3) In both __skb_checksum_complete_header and __skb_checksum_complete,
set skb->csum_valid to whether checksum of zero was computed. This
allows skb_csum_unnecessary to return true without changing to
CHECKSUM_UNNECESSARY which was done previously.
4) Copy new csum related bits in __copy_skb_header.

Reported-by: Geert Uytterhoeven
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2014-06-15 16:00:49 +0800

13 Jun, 2014

2 commits

f9da455b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:

1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.

3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.

4) BPF now has a "random" opcode, from Chema Gonzalez.

5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.

6) Support TCP fastopen over ipv6, from Daniel Lee.

7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

8) Support software TSO in fec driver too, from Nimrod Andy.

9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.

11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.

12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.

13) Support busy polling in SCTP, from Neal Horman.

14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

15) Bridge promisc mode handling improvements from Vlad Yasevich.

16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...

Linus Torvalds
2014-06-13 05:27:40 +0800
e5eca6d41 rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 ... Browse Code »

When running RHEL6 userspace on a current upstream kernel, "ip link"
fails to show VF information.

The reason is a kerneluserspace API change introduced by commit
88c5b5ce5cb57 ("rtnetlink: Call nlmsg_parse() with correct header length"),
after which the kernel does not see iproute2's IFLA_EXT_MASK attribute
in the netlink request.

iproute2 adjusted for the API change in its commit 63338dca4513
("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter").

The problem has been noticed before:
http://marc.info/?l=linux-netdev&m=136692296022182&w=2
(Subject: Re: getting VF link info seems to be broken in 3.9-rc8)

We can do better than tell those with old userspace to upgrade. We can
recognize the old iproute2 in the kernel by checking the netlink message
length. Even when including the IFLA_EXT_MASK attribute, its netlink
message is shorter than struct ifinfomsg.

With this patch "ip link" shows VF information in both old and new
iproute2 versions.

Signed-off-by: Michal Schmidt
Signed-off-by: David S. Miller

Michal Schmidt
2014-06-13 02:07:42 +0800

12 Jun, 2014

5 commits

902455e00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
net/core/rtnetlink.c
net/core/skbuff.c

Both conflicts were very simple overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2014-06-12 07:02:55 +0800
c5b461608 net/core: Add VF link state control policy ... Browse Code »

Commit 1d8faf48c7 (net/core: Add VF link state control) added VF link state
control to the netlink VF nested structure, but failed to add a proper entry
for the new structure into the VF policy table. Add the missing entry so
the table and the actual data copied into the netlink nested struct are in
sync.

Signed-off-by: Doug Ledford
Signed-off-by: David S. Miller

Doug Ledford
2014-06-12 06:51:37 +0800
7e3cead51 net: Save software checksum complete ... Browse Code »

In skb_checksum complete, if we need to compute the checksum for the
packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
Subsequent checksum verification can use this.

Also, added csum_complete_sw flag to distinguish between software and
hardware generated checksum complete, we should always be able to trust
the software computation.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2014-06-12 06:46:13 +0800
bad93e9d4 net: add __pskb_copy_fclone and pskb_copy_for_clone ... Browse Code »

There are several instances where a pskb_copy or __pskb_copy is
immediately followed by an skb_clone.

Add a couple of new functions to allow the copy skb to be allocated
from the fclone cache and thus speed up subsequent skb_clone calls.

Cc: Alexander Smirnov
Cc: Dmitry Eremin-Solenikov
Cc: Marek Lindner
Cc: Simon Wunderlich
Cc: Antonio Quartulli
Cc: Marcel Holtmann
Cc: Gustavo Padovan
Cc: Johan Hedberg
Cc: Arvid Brodin
Cc: Patrick McHardy
Cc: Pablo Neira Ayuso
Cc: Jozsef Kadlecsik
Cc: Lauro Ramos Venancio
Cc: Aloisio Almeida Jr
Cc: Samuel Ortiz
Cc: Jon Maloy
Cc: Allan Stephens
Cc: Andrew Hendry
Cc: Eric Dumazet
Reviewed-by: Christoph Paasch
Signed-off-by: Octavian Purdila
Signed-off-by: David S. Miller

Octavian Purdila
2014-06-12 06:38:02 +0800
61f83d0d5 net: filter: fix warning on 32-bit arch ... Browse Code »

fix compiler warning on 32-bit architectures:

net/core/filter.c: In function '__sk_run_filter':
net/core/filter.c:540:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
net/core/filter.c:550:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
net/core/filter.c:560:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

Reported-by: Fengguang Wu
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2014-06-12 06:12:27 +0800

11 Jun, 2014

2 commits

5882a07c7 net: fix UDP tunnel GSO of frag_list GRO packets ... Browse Code »

This patch fixes a kernel BUG_ON in skb_segment. It is hit when
testing two VMs on openvswitch with one VM acting as VXLAN gateway.

During VXLAN packet GSO, skb_segment is called with skb->data
pointing to inner TCP payload. skb_segment calls skb_network_protocol
to retrieve the inner protocol. skb_network_protocol actually expects
skb->data to point to MAC and it calls pskb_may_pull with ETH_HLEN.
This ends up pulling in ETH_HLEN data from header tail. As a result,
pskb_trim logic is skipped and BUG_ON is hit later.

Move skb_push in front of skb_network_protocol so that skb->data
lines up properly.

kernel BUG at net/core/skbuff.c:2999!
Call Trace:
[] tcp_gso_segment+0x122/0x410
[] inet_gso_segment+0x13c/0x390
[] skb_mac_gso_segment+0x9b/0x170
[] skb_udp_tunnel_segment+0xd8/0x390
[] udp4_ufo_fragment+0x120/0x140
[] inet_gso_segment+0x13c/0x390
[] ? default_wake_function+0x12/0x20
[] skb_mac_gso_segment+0x9b/0x170
[] __skb_gso_segment+0x60/0xc0
[] dev_hard_start_xmit+0x183/0x550
[] sch_direct_xmit+0xfe/0x1d0
[] __dev_queue_xmit+0x214/0x4f0
[] dev_queue_xmit+0x10/0x20
[] ip_finish_output+0x66b/0x890
[] ip_output+0x58/0x90
[] ? fib_table_lookup+0x29f/0x350
[] ip_local_out_sk+0x39/0x50
[] iptunnel_xmit+0x10d/0x130
[] vxlan_xmit_skb+0x1d0/0x330 [vxlan]
[] vxlan_tnl_send+0x129/0x1a0 [openvswitch]
[] ovs_vport_send+0x26/0xa0 [openvswitch]
[] do_output+0x2e/0x50 [openvswitch]

Signed-off-by: Wei-Chun Chao
Signed-off-by: David S. Miller

Wei-Chun Chao
2014-06-11 15:48:47 +0800
e430f34ee net: filter: cleanup A/X name usage ... Browse Code »

The macro 'A' used in internal BPF interpreter:
#define A regs[insn->a_reg]
was easily confused with the name of classic BPF register 'A', since
'A' would mean two different things depending on context.

This patch is trying to clean up the naming and clarify its usage in the
following way:

- A and X are names of two classic BPF registers

- BPF_REG_A denotes internal BPF register R0 used to map classic register A
in internal BPF programs generated from classic

- BPF_REG_X denotes internal BPF register R7 used to map classic register X
in internal BPF programs generated from classic

- internal BPF instruction format:
struct sock_filter_int {
__u8 code; /* opcode */
__u8 dst_reg:4; /* dest register */
__u8 src_reg:4; /* source register */
__s16 off; /* signed offset */
__s32 imm; /* signed immediate constant */
};

- BPF_X/BPF_K is 1 bit used to encode source operand of instruction
In classic:
BPF_X - means use register X as source operand
BPF_K - means use 32-bit immediate as source operand
In internal:
BPF_X - means use 'src_reg' register as source operand
BPF_K - means use 32-bit immediate as source operand

Suggested-by: Chema Gonzalez
Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Acked-by: Chema Gonzalez
Signed-off-by: David S. Miller

Alexei Starovoitov
2014-06-11 15:13:16 +0800

10 Jun, 2014

1 commit

14208b0ec Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchy

Documentation/cgroups/unified-hierarchy.txt

is added.

Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...

Linus Torvalds
2014-06-10 06:03:33 +0800

09 Jun, 2014

2 commits

87757a917 net: force a list_del() in unregister_netdevice_many() ... Browse Code »

unregister_netdevice_many() API is error prone and we had too
many bugs because of dangling LIST_HEAD on stacks.

See commit f87e6f47933e3e ("net: dont leave active on stack LIST_HEAD")

In fact, instead of making sure no caller leaves an active list_head,
just force a list_del() in the callee. No one seems to need to access
the list after unregister_netdevice_many()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-06-09 05:15:14 +0800
3f17ea6de Merge branch 'next' (accumulated 3.16 merge window patches) into master ... Browse Code »

Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.

* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...

Linus Torvalds
2014-06-09 02:31:16 +0800

06 Jun, 2014

3 commits

f666f87b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/xen-netback/netback.c
net/core/filter.c

A filter bug fix overlapped some cleanups and a conversion
over to some new insn generation macros.

A xen-netback bug fix overlapped the addition of multi-queue
support.

Signed-off-by: David S. Miller

David S. Miller
2014-06-06 07:22:02 +0800
0dcceabb0 net: filter: fix SKF_AD_PKTTYPE extension on big-endian ... Browse Code »

BPF classic->internal converter broke SKF_AD_PKTTYPE extension, since
pkt_type_offset() was failing to find skb->pkt_type field which is defined as:
__u8 pkt_type:3,
fclone:2,
ipvs_property:1,
peeked:1,
nf_trace:1;

Fix it by searching for 3 most significant bits and shift them by 5 at run-time

Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Tested-by: Daniel Borkmann
Signed-off-by: David S. Miller

Alexei Starovoitov
2014-06-06 06:40:38 +0800
3b392ddba MPLS: Use mpls_features to activate software MPLS GSO segmentation ... Browse Code »

If an MPLS packet requires segmentation then use mpls_features
to determine if the software implementation should be used.

As no driver advertises MPLS GSO segmentation this will always be
the case.

I had not noticed that this was necessary before as software MPLS GSO
segmentation was already being used in my test environment. I believe that
the reason for that is the skbs in question always had fragments and the
driver I used does not advertise NETIF_F_FRAGLIST (which seems to be the
case for most drivers). Thus software segmentation was activated by
skb_gso_ok().

This introduces the overhead of an extra call to skb_network_protocol()
in the case where where CONFIG_NET_MPLS_GSO is set and
skb->ip_summed == CHECKSUM_NONE.

Thanks to Jesse Gross for prompting me to investigate this.

Signed-off-by: Simon Horman
Acked-by: YAMAMOTO Takashi
Acked-by: Thomas Graf
Signed-off-by: David S. Miller

Simon Horman
2014-06-06 06:05:09 +0800

05 Jun, 2014

2 commits

4cb28970a net: use the new API kvfree() ... Browse Code »

It is available since v3.15-rc5.

Cc: Pablo Neira Ayuso
Cc: "David S. Miller"
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

WANG Cong
2014-06-05 15:49:51 +0800
7e2b10c1e net: Support for multiple checksums with gso ... Browse Code »

When creating a GSO packet segment we may need to set more than
one checksum in the packet (for instance a TCP checksum and
UDP checksum for VXLAN encapsulation). To be efficient, we want
to do checksum calculation for any part of the packet at most once.

This patch adds csum_start offset to skb_gso_cb. This tracks the
starting offset for skb->csum which is initially set in skb_segment.
When a protocol needs to compute a transport checksum it calls
gso_make_checksum which computes the checksum value from the start
of transport header to csum_start and then adds in skb->csum to get
the full checksum. skb->csum and csum_start are then updated to reflect
the checksum of the resultant packet starting from the transport header.

This patch also adds a flag to skbuff, encap_hdr_csum, which is set
in *gso_segment fucntions to indicate that a tunnel protocol needs
checksum calculation

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2014-06-05 13:46:38 +0800

04 Jun, 2014

4 commits

c99f7abf0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c

Changes in net were fixing bugs in code removed in net-next.

Signed-off-by: David S. Miller

David S. Miller
2014-06-04 14:32:12 +0800
92ff71b8f net: remove some unless free on failure in alloc_netdev_mqs() ... Browse Code »

When we jump to free_pcpu on failure in alloc_netdev_mqs()
rx and tx queues are not yet allocated, so no need to free them.

Cc: David S. Miller
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

WANG Cong
2014-06-04 10:18:58 +0800
e51fb1523 rtnetlink: fix a memory leak when ->newlink fails ... Browse Code »

It is possible that ->newlink() fails before registering
the device, in this case we should just free it, it's
safe to call free_netdev().

Fixes: commit 0e0eee2465df77bcec2 (net: correct error path in rtnl_newlink())
Cc: David S. Miller
Cc: Eric Dumazet
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2014-06-04 10:16:10 +0800
776edb593 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kern… ... Browse Code »

…el/git/tip/tip into next

Pull core locking updates from Ingo Molnar:
"The main changes in this cycle were:

- reduced/streamlined smp_mb__*() interface that allows more usecases
and makes the existing ones less buggy, especially in rarer
architectures

- add rwsem implementation comments

- bump up lockdep limits"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
rwsem: Add comments to explain the meaning of the rwsem's count field
lockdep: Increase static allocations
arch: Mass conversion of smp_mb__*()
arch,doc: Convert smp_mb__*()
arch,xtensa: Convert smp_mb__*()
arch,x86: Convert smp_mb__*()
arch,tile: Convert smp_mb__*()
arch,sparc: Convert smp_mb__*()
arch,sh: Convert smp_mb__*()
arch,score: Convert smp_mb__*()
arch,s390: Convert smp_mb__*()
arch,powerpc: Convert smp_mb__*()
arch,parisc: Convert smp_mb__*()
arch,openrisc: Convert smp_mb__*()
arch,mn10300: Convert smp_mb__*()
arch,mips: Convert smp_mb__*()
arch,metag: Convert smp_mb__*()
arch,m68k: Convert smp_mb__*()
arch,m32r: Convert smp_mb__*()
arch,ia64: Convert smp_mb__*()
...

Linus Torvalds
2014-06-04 03:57:53 +0800

03 Jun, 2014

6 commits

014b20133 Merge branch 'ethtool-rssh-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/net-next ... Browse Code »

Ben Hutchings says:

====================
Pull request: Fixes for new ethtool RSS commands

This addresses several problems I previously identified with the new
ETHTOOL_{G,S}RSSH commands:

1. Missing validation of reserved parameters
2. Vague documentation
3. Use of unnamed magic number
4. No consolidation with existing driver operations

I don't currently have access to suitable network hardware, but have
tested these changes with a dummy driver that can support various
combinations of operations and sizes, together with (a) Debian's ethtool
3.13 (b) ethtool 3.14 with the submitted patch to use ETHTOOL_{G,S}RSSH
and minor adjustment for fixes 1 and 3.

v2: Update RSS operations in vmxnet3 too
====================

Signed-off-by: David S. Miller

David S. Miller
2014-06-03 14:07:02 +0800
f062a3844 ethtool: Check that reserved fields of struct ethtool_rxfh are 0 ... Browse Code »

We should fail rather than silently ignoring use of these extensions.

Signed-off-by: Ben Hutchings

Ben Hutchings
2014-06-03 09:43:16 +0800
fe62d0013 ethtool: Replace ethtool_ops::{get,set}_rxfh_indir() with {get,set}_rxfh() ... Browse Code »

ETHTOOL_{G,S}RXFHINDIR and ETHTOOL_{G,S}RSSH should work for drivers
regardless of whether they expose the hash key, unless you try to
set a hash key for a driver that doesn't expose it.

Signed-off-by: Ben Hutchings
Acked-by: Jeff Kirsher

Ben Hutchings
2014-06-03 09:42:44 +0800
418c96ac1 net: filter: fix possible memory leak in __sk_prepare_filter() ... Browse Code »

__sk_prepare_filter() was reworked in commit bd4cf0ed3 (net: filter:
rework/optimize internal BPF interpreter's instruction set) so that it should
have uncharged memory once things went wrong. However that work isn't complete.
Error is handled only in __sk_migrate_filter() while memory can still leak in
the error path right after sk_chk_filter().

Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Leon Yu
Acked-by: Alexei Starovoitov
Tested-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Leon Yu
2014-06-03 08:49:45 +0800
73f156a6e inetpeer: get rid of ip_id_count ... Browse Code »

Ideally, we would need to generate IP ID using a per destination IP
generator.

linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.

1) each inet_peer struct consumes 192 bytes

2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.

3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.

4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())

5) We garbage collect inet_peer aggressively.

IP ID generation do not have to be 'perfect'

Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.

We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.

ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)

secure_ip_id() and secure_ipv6_id() no longer are needed.

Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-06-03 02:00:41 +0800
670e5b8ea net: Add support for device specific address syncing ... Browse Code »

This change provides a function to be used in order to break the
ndo_set_rx_mode call into a set of address add and remove calls. The code
is based on the implementation of dev_uc_sync/dev_mc_sync. Since they
essentially do the same thing but with only one dev I simply named my
functions __dev_uc_sync/__dev_mc_sync.

I also implemented an unsync version of the functions as well to allow for
cleanup on close.

Signed-off-by: Alexander Duyck
Signed-off-by: David S. Miller

Alexander Duyck
2014-06-03 01:40:54 +0800

02 Jun, 2014

3 commits

f8f6d679a net: filter: improve filter block macros ... Browse Code »

Commit 9739eef13c92 ("net: filter: make BPF conversion more readable")
started to introduce helper macros similar to BPF_STMT()/BPF_JUMP()
macros from classic BPF.

However, quite some statements in the filter conversion functions
remained in the old style which gives a mixture of block macros and
non block macros in the code. This patch makes the block macros itself
more readable by using explicit member initialization, and converts
the remaining ones where possible to remain in a more consistent state.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2014-06-02 13:16:58 +0800
348059313 net: filter: get rid of BPF_S_* enum ... Browse Code »

This patch finally allows us to get rid of the BPF_S_* enum.
Currently, the code performs unnecessary encode and decode
workarounds in seccomp and filter migration itself when a filter
is being attached in order to overcome BPF_S_* encoding which
is not used anymore by the new interpreter resp. JIT compilers.

Keeping it around would mean that also in future we would need
to extend and maintain this enum and related encoders/decoders.
We can get rid of all that and save us these operations during
filter attaching. Naturally, also JIT compilers need to be updated
by this.

Before JIT conversion is being done, each compiler checks if A
is being loaded at startup to obtain information if it needs to
emit instructions to clear A first. Since BPF extensions are a
subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
for extensions can be removed at that point. To ease and minimalize
code changes in the classic JITs, we have introduced bpf_anc_helper().

Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
arm (JIT, int), i368 (int), ppc64 (JIT, int); for sparc we
unfortunately didn't have access, but changes are analogous to
the rest.

Joint work with Alexei Starovoitov.

Signed-off-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
Cc: Benjamin Herrenschmidt
Cc: Martin Schwidefsky
Cc: Mircea Gherzan
Cc: Kees Cook
Acked-by: Chema Gonzalez
Signed-off-by: David S. Miller

Daniel Borkmann
2014-06-02 13:16:58 +0800
4b9b1cdf8 net: fix wrong mac_len calculation for vlans ... Browse Code »

After 1e785f48d29a ("net: Start with correct mac_len in
skb_network_protocol") skb->mac_len is used as a start of the
calculation in skb_network_protocol() but that is not always correct. If
skb->protocol == 8021Q/AD, usually the vlan header is already inserted
in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters
dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the
destination mac address (skb->data + VLAN_HLEN) as a type in
skb_network_protocol() and return vlan_depth == 4. In the case where TSO is
off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so
skb_network_protocol() returns a type from inside the packet and
offset == 22. Also make vlan_depth unsigned as suggested before.
As suggested by Eric Dumazet, move the while() loop in the if() so we
can avoid additional testing in fast path.

Here are few netperf tests + debug printk's to illustrate:
cat netperf.tso-on.reorder-on.bugged
- Vlan -> device (reorder on, default, this case is okay)
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 7111.54
[ 81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800
skb->mac_len 0 vlan_depth 0 type 0x800

- Vlan -> device (reorder off, bad)
cat netperf.tso-on.reorder-off.bugged
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 241.35
[ 204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100
skb->mac_len 0 vlan_depth 4 type 0x5301
0x5301 are the last two bytes of the destination mac.

And if we stop TSO, we may get even the following:
[ 83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100
skb->mac_len 18 vlan_depth 22 type 0xb84
Because mac_len already accounts for VLAN_HLEN.

After the fix:
cat netperf.tso-on.reorder-off.fixed
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.01 5001.46
[ 81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100
skb->mac_len 0 vlan_depth 18 type 0x800

CC: Vlad Yasevich
CC: Eric Dumazet
CC: Daniel Borkman
CC: David S. Miller

Fixes:1e785f48d29a ("net: Start with correct mac_len in
skb_network_protocol")
Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2014-06-02 10:39:13 +0800

31 May, 2014

1 commit

484611e53 net: tso: Export symbols for modular build ... Browse Code »

Export the symbols to fix the below errors when built as modules:
ERROR: "tso_build_data" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
ERROR: "tso_build_hdr" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
ERROR: "tso_start" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
ERROR: "tso_count_descs" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
ERROR: "tso_build_data" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
ERROR: "tso_build_hdr" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
ERROR: "tso_start" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
ERROR: "tso_count_descs" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!

Signed-off-by: Sachin Kamat
Acked-by: Ezequiel Garcia
Signed-off-by: David S. Miller

Sachin Kamat
2014-05-31 06:52:03 +0800

24 May, 2014

5 commits

54e5c4def Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c

Several cases of overlapping changes.

The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.

In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.

Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.

Signed-off-by: David S. Miller

David S. Miller
2014-05-24 12:32:30 +0800
b1fcd35cf net: filter: let unattached filters use sock_fprog_kern ... Browse Code »

The sk_unattached_filter_create() API is used by BPF filters that
are not directly attached or related to sockets, and are used in
team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
internal managment of obtaining filter blocks and thus already
have them in kernel memory and set up before calling into
sk_unattached_filter_create(). As a result, due to __user annotation
in sock_fprog, sparse triggers false positives (incorrect type in
assignment [different address space]) when filters are set up before
passing them to sk_unattached_filter_create(). Therefore, let
sk_unattached_filter_create() API use sock_fprog_kern to overcome
this issue.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2014-05-24 04:48:05 +0800
8556ce79d net: filter: remove DL macro ... Browse Code »

Lets get rid of this macro. After commit 5bcfedf06f7f ("net: filter:
simplify label names from jump-table"), labels have become more
readable due to omission of BPF_ prefix but at the same time more
generic, so that things like `git grep -n` would not find them. As
a middle path, lets get rid of the DL macro as it's not strictly
needed and would otherwise just hide the full name.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2014-05-24 04:48:05 +0800
28448b804 net: Split sk_no_check into sk_no_check_{rx,tx} ... Browse Code »

Define separate fields in the sock structure for configuring disabling
checksums in both TX and RX-- sk_no_check_tx and sk_no_check_rx.
The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
removed UDP_CSUM_* defines since they are no longer necessary.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2014-05-24 04:28:53 +0800
ed616689a net-next:v4: Add support to configure SR-IOV VF minimum and maximum Tx rate through ip tool. ... Browse Code »

o min_tx_rate puts lower limit on the VF bandwidth. VF is guaranteed
to have a bandwidth of at least this value.
max_tx_rate puts cap on the VF bandwidth. VF can have a bandwidth
of up to this value.

o A new handler set_vf_rate for attr IFLA_VF_RATE has been introduced
which takes 4 arguments:
netdev, VF number, min_tx_rate, max_tx_rate

o ndo_set_vf_rate replaces ndo_set_vf_tx_rate handler.

o Drivers that currently implement ndo_set_vf_tx_rate should now call
ndo_set_vf_rate instead and reject attempt to set a minimum bandwidth
greater than 0 for IFLA_VF_TX_RATE when IFLA_VF_RATE is not yet
implemented by driver.

o If user enters only one of either min_tx_rate or max_tx_rate, then,
userland should read back the other value from driver and set both
for IFLA_VF_RATE.
Drivers that have not yet implemented IFLA_VF_RATE should always
return min_tx_rate as 0 when read from ip tool.

o If both IFLA_VF_TX_RATE and IFLA_VF_RATE options are specified, then
IFLA_VF_RATE should override.

o Idea is to have consistent display of rate values to user.

o Usage example: -

./ip link set p4p1 vf 0 rate 900

./ip link show p4p1
32: p4p1: mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 900 (Mbps), max_tx_rate 900Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a

./ip link set p4p1 vf 0 max_tx_rate 300 min_tx_rate 200

./ip link show p4p1
32: p4p1: mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 300 (Mbps), max_tx_rate 300Mbps,
min_tx_rate 200Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a

./ip link set p4p1 vf 0 max_tx_rate 600 rate 300

./ip link show p4p1
32: p4p1: mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5, tx rate 600 (Mbps), max_tx_rate 600Mbps,
min_tx_rate 200Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a

Signed-off-by: Sucheta Chakraborty
Signed-off-by: David S. Miller

Sucheta Chakraborty
2014-05-24 03:04:02 +0800

23 May, 2014

1 commit

e876f208a net: Add a software TSO helper API ... Browse Code »

Although the implementation probably needs a lot of work, this initial API
allows to implement software TSO in mvneta and mv643xx_eth drivers in a not
so intrusive way.

Signed-off-by: Ezequiel Garcia
Signed-off-by: David S. Miller

Ezequiel Garcia
2014-05-23 02:57:15 +0800

22 May, 2014

1 commit

5fe821a9d net: filter: cleanup invocation of internal BPF ... Browse Code »

Kernel API for classic BPF socket filters is:

sk_unattached_filter_create() - validate classic BPF, convert, JIT
SK_RUN_FILTER() - run it
sk_unattached_filter_destroy() - destroy socket filter

Cleanup internal BPF kernel API as following:

sk_filter_select_runtime() - final step of internal BPF creation.
Try to JIT internal BPF program, if JIT is not available select interpreter
SK_RUN_FILTER() - run it
sk_filter_free() - free internal BPF program

Disallow direct calls to BPF interpreter. Execution of the BPF program should
be done with SK_RUN_FILTER() macro.

Example of internal BPF create, run, destroy:

struct sk_filter *fp;

fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
fp->len = prog_len;

sk_filter_select_runtime(fp);

SK_RUN_FILTER(fp, ctx);

sk_filter_free(fp);

Sockets, seccomp, testsuite, tracing are using different ways to populate
sk_filter, so first steps of program creation are not common.

Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Alexei Starovoitov
2014-05-22 05:07:17 +0800

19 May, 2014

1 commit

61d88c681 ethtool: Disallow ETHTOOL_SRSSH with both indir table and hash key unchanged ... Browse Code »

This would be a no-op, so there is no reason to request it.

This also allows conversion of the current implementations of
ethtool_ops::{get,set}_rxfh_indir to ethtool_ops::{get,set}_rxfh
with no change other than their parameters.

Signed-off-by: Ben Hutchings

Ben Hutchings
2014-05-19 08:29:42 +0800