23 Dec, 2014

1 commit

  • Make TPACKET_V3 signal poll when a block is closed rather than for every
    packet. A side effect is that poll will also be signaled when the block
    retire timer expires, which didn't previously happen. The issue was
    visible when sending packets at a very low frequency, such that all
    blocks were retired before packets were received by TPACKET_V3, causing
    avoidable packet loss. The fix ensures that the signal is sent when
    blocks are closed, which covers both the normal path where a block fills
    up and the path where the timer expires. The case where a block is
    filled without moving to the next block (i.e. all blocks are full) will
    still cause poll to be signaled.
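
    A minimal userspace sketch of the scenario this affects: a TPACKET_V3
    RX ring with a retire timer, where the reader blocks in poll(). Ring
    sizes and the timeout below are illustrative, and error handling is
    omitted.

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        int version = TPACKET_V3;
        struct tpacket_req3 req = {
            .tp_block_size     = 1 << 16,
            .tp_block_nr       = 4,
            .tp_frame_size     = 1 << 11,
            .tp_frame_nr       = ((1 << 16) / (1 << 11)) * 4,
            .tp_retire_blk_tov = 60,        /* block retire timer, in ms */
        };
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        /* With the fix, this wakes up whenever a block is closed, whether
         * it filled up or the retire timer expired; previously a block
         * retired by the timer did not wake the reader. */
        poll(&pfd, 1, -1);

        close(fd);
        return 0;
    }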

    Signed-off-by: Dan Collins
    Signed-off-by: David S. Miller

    Dan Collins
     

12 Dec, 2014

1 commit

  • Pull networking updates from David Miller:

    1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals,
    including but not limited to: Scott Feldman, Jiri Pirko, Thomas Graf,
    John Fastabend, Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli,
    and Roopa Prabhu.

    2) Start making the networking stack operate on IOV iterators instead
    of modifying iov objects in situ during transfers. Thanks to Al Viro
    and Herbert Xu.

    3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

    4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

    5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

    6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

    7) Add a variant of napi_schedule() that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

    8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

    9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets. From Alexei
    Starovoitov.

    10) Support TSO/LSO in sunvnet driver, from David L Stevens.

    11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

    12) Remote checksum offload, from Tom Herbert.

    13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

    14) Add MPLS support to openvswitch, from Simon Horman.

    15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

    16) Do GRO flushes on a per-device basis using a timer, from Eric
    Dumazet. This tries to reconcile the conflicting goals of handling
    bulk vs. RPC-like traffic.

    17) Allow userspace to ask for the CPU upon which a packet was
    received/steered, via SO_INCOMING_CPU (see the sketch after the
    commit list below). From Eric Dumazet.

    18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

    19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

    20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

    21) Add VLAN packet scheduler action, from Jiri Pirko.

    22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
    Fix race condition between vxlan_sock_add and vxlan_sock_release
    net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
    net/mlx4: Add support for A0 steering
    net/mlx4: Refactor QUERY_PORT
    net/mlx4_core: Add explicit error message when rule doesn't meet configuration
    net/mlx4: Add A0 hybrid steering
    net/mlx4: Add mlx4_bitmap zone allocator
    net/mlx4: Add a check if there are too many reserved QPs
    net/mlx4: Change QP allocation scheme
    net/mlx4_core: Use tasklet for user-space CQ completion events
    net/mlx4_core: Mask out host side virtualization features for guests
    net/mlx4_en: Set csum level for encapsulated packets
    be2net: Export tunnel offloads only when a VxLAN tunnel is created
    gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
    cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
    net: fec: only enable mdio interrupt before phy device link up
    net: fec: clear all interrupt events to support i.MX6SX
    net: fec: reset fep link status in suspend function
    net: sock: fix access via invalid file descriptor
    net: introduce helper macro for_each_cmsghdr
    ...
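
    For item 17, a small example of reading SO_INCOMING_CPU on a socket.
    The option name comes from the series above; the surrounding code and
    the fallback define are illustrative.

    #include <sys/socket.h>

    #ifndef SO_INCOMING_CPU
    #define SO_INCOMING_CPU 49      /* asm-generic value; illustrative */
    #endif

    /* Ask which CPU packets for this socket were last received/steered on. */
    static int incoming_cpu(int fd)
    {
        int cpu = -1;
        socklen_t len = sizeof(cpu);

        if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
            return -1;
        return cpu;
    }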

    Linus Torvalds
     

10 Dec, 2014

1 commit

  • Note that the code _using_ ->msg_iter at that point will be very
    unhappy with anything other than unshifted iovec-backed iov_iter.
    We still need to convert users to proper primitives.

    Signed-off-by: Al Viro

    Al Viro
     

09 Dec, 2014

1 commit


30 Nov, 2014

1 commit


25 Nov, 2014

1 commit

  • af_packet produces lots of these:
    net/packet/af_packet.c:384:39: warning: incorrect type in return expression (different modifiers)
    net/packet/af_packet.c:384:39: expected struct page [pure] *
    net/packet/af_packet.c:384:39: got struct page *

    This seems to be because sparse does not realize that __pure
    refers to the function, not the returned pointer.

    Tweak the code slightly to avoid the warning.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     

24 Nov, 2014

3 commits


22 Nov, 2014

1 commit

  • When sending packets out with PF_PACKET, SOCK_RAW, ensure that the
    packet is at least as long as the device's expected link layer header.
    This check already exists in tpacket_snd, but not in packet_snd.
    Also rate limit the warning in tpacket_snd.
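
    A hedged sketch of the kind of check this adds to packet_snd() (kernel
    fragment, simplified; not necessarily the literal patch):

    /* Refuse SOCK_RAW sends shorter than the device's link layer header;
     * tpacket_snd already had an equivalent check. */
    if (sock->type == SOCK_RAW && len < dev->hard_header_len) {
        err = -EINVAL;
        goto out_free;
    }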

    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

06 Nov, 2014

1 commit

  • This encapsulates all of the skb_copy_datagram_iovec() callers
    with call argument signature "skb, offset, msghdr->msg_iov, length".

    When we move to iov_iters in the networking, the iov_iter object will
    sit in the msghdr.

    Having a helper like this means there will be fewer places to touch
    during that transformation.
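
    The helper is essentially a thin wrapper; roughly:

    static inline int skb_copy_datagram_msg(const struct sk_buff *from,
                                            int offset, struct msghdr *msg,
                                            int size)
    {
        return skb_copy_datagram_iovec(from, offset, msg->msg_iov, size);
    }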

    Based upon descriptions and patch from Al Viro.

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Sep, 2014

2 commits


30 Aug, 2014

1 commit


25 Aug, 2014

1 commit


22 Aug, 2014

1 commit

  • af_packet can currently overwrite kernel memory via out-of-bounds
    accesses, because it assumed a [new] block can always hold one frame.

    This is not generally the case, even if most existing tools do it right.

    This patch clamps overly long frames as the API permits, and issues a
    one-time error to syslog.

    [ 394.357639] tpacket_rcv: packet too big, clamped from 5042 to 3966. macoff=82

    In this example, packet header tp_snaplen was set to 3966,
    and tp_len was set to 5042 (skb->len)
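
    A hedged sketch of the clamping in the TPACKET_V3 receive path, where
    max_frame_len stands for the per-block limit that the real code reads
    from the ring's block descriptor queue:

    if (unlikely(macoff + snaplen > max_frame_len)) {
        u32 nval = max_frame_len - macoff;

        pr_err_once("packet too big, clamped from %u to %u. macoff=%u\n",
                    snaplen, nval, macoff);
        snaplen = nval;
        if (unlikely((int)snaplen < 0)) {
            snaplen = 0;
            macoff = max_frame_len;
        }
    }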

    Signed-off-by: Eric Dumazet
    Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
    Acked-by: Daniel Borkmann
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Jul, 2014

1 commit

  • No device driver will ever return an skb_shared_info structure with
    syststamp non-zero, so remove the branch that tests for this and
    optionally marks the packet timestamp as TP_STATUS_TS_SYS_HARDWARE.

    Do not remove the definition TP_STATUS_TS_SYS_HARDWARE, as processes
    may refer to it.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

16 Jul, 2014

1 commit


25 Apr, 2014

2 commits

  • By passing a netlink socket to a more privileged executable and then
    fooling that executable into writing data that happens to be a valid
    netlink message to the socket, it is possible to make the privileged
    executable do something it did not intend to do.

    To keep this from happening, replace bare capable and ns_capable calls
    with netlink_capable, netlink_net_capable and netlink_ns_capable calls,
    which act the same as the previous calls except that they also verify
    that the opener of the socket had the desired permissions.
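
    A hedged example of the shape of the replacement; the helper names are
    from this series, the surrounding context is illustrative:

    /* Before: checks the capabilities of the current process only. */
    if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
        return -EPERM;

    /* After: also verifies that the opener of the socket had the
     * capability, so a privileged writer cannot be tricked through a
     * passed-in netlink fd. */
    if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
        return -EPERM;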

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The permission check in sock_diag_put_filterinfo is wrong, and it is so
    far removed from its sources that it is not clear why it is wrong. Move
    the computation into packet_diag_dump and pass a bool with the result
    into sock_diag_filterinfo.

    This does not yet correct the capability check, but simply moves it to
    make it clear what is going on.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

23 Apr, 2014

1 commit


12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access is potentially
    to freed up memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no actual implementation of this callback uses the
    length argument. Since nobody actually cared about its value, lots of
    call sites pass in arbitrary values such as '0' and even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.
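
    The change boils down to dropping the length argument; roughly:

    /* Before: the length may refer to an skb that was already consumed
     * and freed by the time the callback runs. */
    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    /* After: the callback takes only the socket. */
    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk);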

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Apr, 2014

2 commits

  • Currently, in packet_direct_xmit() we test the assigned netdevice queue
    for netif_xmit_frozen_or_stopped() before doing an ndo_start_xmit().

    This can have the side-effect that BQL enabled drivers which make use
    of netdev_tx_sent_queue() internally, set __QUEUE_STATE_STACK_XOFF from
    within the stack and would not fully fill the device's TX ring from
    packet sockets with PACKET_QDISC_BYPASS enabled.

    Instead, use a test without the BQL bit so that bursts can be absorbed
    into the NIC's TX ring. Fix and code suggested by Eric Dumazet, thanks!
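
    A hedged sketch of the resulting test in packet_direct_xmit(); the
    helper exists in the kernel, the context is simplified:

    /* Check only the driver/frozen state, not the BQL
     * __QUEUE_STATE_STACK_XOFF bit, before handing the skb down. */
    if (!netif_xmit_frozen_or_drv_stopped(txq))
        ret = ops->ndo_start_xmit(skb, dev);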

    Signed-off-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Since commit 015f0688f57c ("net: net: add a core netdev->tx_dropped
    counter"), we can now account for TX drops from within the core
    stack instead of drivers.

    Therefore, fix packet_direct_xmit() to increase the drop count when we
    encounter a problem before the driver's xmit function is called (we do
    not want to account for the drop twice).
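
    Roughly, the error path becomes (sketch):

    drop:
        atomic_long_inc(&dev->tx_dropped);
        kfree_skb(skb);
        return NET_XMIT_DROP;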

    Suggested-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

29 Mar, 2014

1 commit

  • Quite often it can be useful to test with dummy or similar
    devices as a blackhole sink for skbs. Such devices are only
    equipped with a single txq, but marked as NETIF_F_LLTX as
    they do not require locking their internal queues on xmit
    (or implement locking themselves). Therefore, use the
    HARD_TX_{UN,}LOCK API instead, so that NETIF_F_LLTX will be respected.
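
    A hedged sketch of the locking pattern in the direct xmit path;
    HARD_TX_{UN,}LOCK skips the txq lock for NETIF_F_LLTX devices:

    local_bh_disable();

    HARD_TX_LOCK(dev, txq, smp_processor_id());
    if (!netif_xmit_frozen_or_drv_stopped(txq))
        ret = ops->ndo_start_xmit(skb, dev);
    HARD_TX_UNLOCK(dev, txq);

    local_bh_enable();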

    trafgen mmap/TX_RING example against dummy device with config
    foo: { fill(0xff, 64) } results in the following performance
    improvements for such scenarios on an ordinary Core i7/2.80GHz:

    Before:

    Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs):

    160,975,944,159 instructions:k # 0.55 insns per cycle ( +- 0.09% )
    293,319,390,278 cycles:k # 0.000 GHz ( +- 0.35% )
    192,501,104 branch-misses:k ( +- 1.63% )
    831 context-switches:k ( +- 9.18% )
    7 cpu-migrations:k ( +- 7.40% )
    69,382 cache-misses:k # 0.010 % of all cache refs ( +- 2.18% )
    671,552,021 cache-references:k ( +- 1.29% )

    22.856401569 seconds time elapsed ( +- 0.33% )

    After:

    Performance counter stats for 'trafgen -i foo -o du0 -n100000000' (10 runs):

    133,788,739,692 instructions:k # 0.92 insns per cycle ( +- 0.06% )
    145,853,213,256 cycles:k # 0.000 GHz ( +- 0.17% )
    59,867,100 branch-misses:k ( +- 4.72% )
    384 context-switches:k ( +- 3.76% )
    6 cpu-migrations:k ( +- 6.28% )
    70,304 cache-misses:k # 0.077 % of all cache refs ( +- 1.73% )
    90,879,408 cache-references:k ( +- 1.35% )

    11.719372413 seconds time elapsed ( +- 0.24% )

    Signed-off-by: Daniel Borkmann
    Cc: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

27 Mar, 2014

1 commit

  • The packet hash can be considered a property of the packet itself, not
    just something for the RX path.

    This patch renames the rxhash and l4_rxhash skbuff fields to hash and
    l4_hash respectively. This includes changing uses of the fields in
    code which doesn't call the access functions.

    Signed-off-by: Tom Herbert
    Signed-off-by: Eric Dumazet
    Cc: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Tom Herbert
     

01 Mar, 2014

1 commit

  • Commit 57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes
    for VLAN packets.") added the possibility for non-mmaped frames to
    send an extra 4 bytes for the VLAN header, so the MTU increases from
    1500 to 1504 bytes, for example.

    Commit cbd89acb9eb2 ("af_packet: fix for sending VLAN frames via
    packet_mmap") attempted to fix that for the mmap part but was
    reverted as it caused regressions while using eth_type_trans()
    on output path.

    Let's just act analogously to 57f89bfa2140 and add similar logic
    to TX_RING. We presume size_max to be overcharged by +4 bytes and,
    later on, after the skb has been built by tpacket_fill_skb(), check
    for an ETH_P_8021Q header on packets larger than the normal MTU. This
    can easily be reproduced with a slightly modified trafgen in mmap(2)
    mode; test cases:

    { fill(0xff, 12) const16(0x8100) fill(0xff, ) }
    { fill(0xff, 12) const16(0x0806) fill(0xff, ) }

    Note that we need to do the test right after tpacket_fill_skb(), as
    sockets can have PACKET_LOSS set, in which case we would not fail but
    instead just continue to traverse the ring.
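
    A hedged sketch of the post-fill check described above (simplified):

    if (likely(tp_len >= 0) &&
        tp_len > dev->mtu + dev->hard_header_len) {
        struct ethhdr *ehdr;

        /* size_max was overcharged by +4 bytes; only frames that really
         * carry an 802.1Q header may use that slack. */
        skb_reset_mac_header(skb);
        ehdr = eth_hdr(skb);
        if (ehdr->h_proto != htons(ETH_P_8021Q))
            tp_len = -EMSGSIZE;
    }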

    Reported-by: Mathias Kretschmer
    Signed-off-by: Daniel Borkmann
    Cc: Ben Greear
    Cc: Phil Sutter
    Tested-by: Mathias Kretschmer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

19 Feb, 2014

1 commit


17 Feb, 2014

1 commit

  • Mathias reported that on an AMD Geode LX embedded board (ALiX)
    with the ath9k driver, PACKET_QDISC_BYPASS, introduced in commit
    d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket
    option"), triggers a WARN_ON() coming from the driver itself
    via 066dae93bdf ("ath9k: rework tx queue selection and fix
    queue stopping/waking").

    The reason this happened is that the ndo_select_queue() callback
    is not invoked from the direct xmit path, i.e. for the ieee80211
    subsystem, which sets the queue and TID (similar to an 802.1d tag)
    that are put into the frame through 802.11e (WMM, QoS). If that is
    not set, the pending frame counter for e.g. ath9k can get messed up.

    So the WARN_ON() in ath9k is absolutely legitimate. Generally,
    the hw queue selection in ieee80211 depends on the type of
    traffic, and priorities are set according to ieee80211_ac_numbers
    mapping; working in a similar way to DiffServ, only on a lower
    layer, so that the AP can favour frames that have "real-time"
    requirements like voice or video data frames.

    Therefore, check for presence of ndo_select_queue() in netdev
    ops and, if available, invoke it with a fallback handler to
    __packet_pick_tx_queue(), so that drivers such as bnx2x, ixgbe,
    or mlx4 can still select a hw queue for transmission in relation
    to the current CPU, while e.g. the ieee80211 subsystem can make
    its own choices.
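
    Roughly, the queue selection becomes the following; the fallback
    argument matches the ndo_select_queue() signature of that time:

    static u16 packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
    {
        const struct net_device_ops *ops = dev->netdev_ops;
        u16 queue_index;

        if (ops->ndo_select_queue) {
            /* Let e.g. ieee80211 pick the queue/TID itself, with the
             * CPU-based default as fallback for other drivers. */
            queue_index = ops->ndo_select_queue(dev, skb, NULL,
                                                __packet_pick_tx_queue);
            queue_index = netdev_cap_txqueue(dev, queue_index);
        } else {
            queue_index = __packet_pick_tx_queue(dev, skb);
        }

        return queue_index;
    }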

    Reported-by: Mathias Kretschmer
    Signed-off-by: Daniel Borkmann
    Cc: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

23 Jan, 2014

1 commit

  • This patch adds a queue mapping mode to the fanout operation of af_packet
    sockets. This allows user space af_packet users to better filter on flows
    ingressing and egressing via a specific hardware queue, and avoids the potential
    packet reordering that can occur when FANOUT_CPU is being used and irq affinity
    varies.

    Tested successfully by myself. Applies to net-next.
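
    A hedged userspace sketch of joining such a fanout group; the fanout
    argument packs the group id in the low 16 bits and the mode in the
    high 16 bits:

    #include <linux/if_packet.h>
    #include <sys/socket.h>

    /* Spread packets across group members by hardware receive queue. */
    static int join_qm_fanout(int fd, unsigned int group_id)
    {
        unsigned int arg = (group_id & 0xffff) | (PACKET_FANOUT_QM << 16);

        return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
    }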

    Signed-off-by: Neil Horman
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

22 Jan, 2014

3 commits

  • As David Laight suggests, we shouldn't necessarily call this
    reciprocal_divide() when users didn't request a reciprocal_value();
    let's keep the basic idea and call it reciprocal_scale(). More
    background information on this topic can be found in [1].

    Joint work with Hannes Frederic Sowa.

    [1] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html

    Suggested-by: David Laight
    Cc: Jakub Zawadzki
    Cc: Eric Dumazet
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Many functions have open coded a function that returns a random
    number in range [0,N-1]. Under the assumption that we have a PRNG
    such as taus113 that is well distributed in [0, ~0U] space,
    we can implement such a function as u32 t = (n*m')>>32, where
    m' is a random number obtained from the PRNG, n the right open
    interval border and t our resulting random number, with n, m', t
    in the u32 universe.

    Let's go with Joe and simply call it prandom_u32_max(), although
    technically we have a right open interval endpoint, but that we
    have documented. Other users can be migrated to the new
    prandom_u32_max() function later on; for now, we need to make sure
    to migrate the reciprocal_divide() users for the reciprocal_divide()
    follow-up fixup since their function signatures are going to change.
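
    The helper amounts to the formula above; roughly:

    static inline u32 prandom_u32_max(u32 ep_ro)
    {
        return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
    }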

    Joint work with Hannes Frederic Sowa.

    Cc: Jakub Zawadzki
    Cc: Eric Dumazet
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Doesn't bring much, but also doesn't hurt us to fix 'em:

    1) In tpacket_rcv()'s flush dcache page code, we can restrict the
    scope of start and end and remove one layer of indentation.

    2) In tpacket_destruct_skb() we can restrict the scope of ph.

    3) In alloc_one_pg_vec_page() we can remove the NULL assignment
    and change spacing a bit.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

19 Jan, 2014

1 commit

  • This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
    handler msg_name and msg_namelen logic").

    DECLARE_SOCKADDR validates that the structure we use for writing the
    name information to is not larger than the buffer which is reserved
    for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
    consistently in sendmsg code paths.
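
    A hedged example of the usage pattern in a recvmsg handler, here with
    af_packet's sockaddr_ll as the target type (context simplified):

    /* The macro checks at build time that the target type fits into the
     * 128-byte (sockaddr_storage sized) msg_name buffer. */
    DECLARE_SOCKADDR(struct sockaddr_ll *, sll, msg->msg_name);

    sll->sll_family   = AF_PACKET;
    sll->sll_protocol = skb->protocol;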

    Signed-off-by: Steffen Hurrle
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Steffen Hurrle
     

17 Jan, 2014

3 commits

  • In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
    and one atomic_dec() call in the skb destructor and use a percpu
    reference count instead in order to determine if packets are
    still pending to be sent out. Micro-benchmark with [1] that has
    been slightly modified (that is, protocol = 0 in socket(2) and
    bind(2)), example on a rather crappy testing machine; I expect
    it to scale and have even better results on bigger machines:

    ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

    With patch: 4,022,015 cyc
    Without patch: 4,812,994 cyc

    time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

    With patch:
    real 1m32.241s
    user 0m0.287s
    sys 1m29.316s

    Without patch:
    real 1m38.386s
    user 0m0.265s
    sys 1m35.572s

    In function tpacket_snd(), it is okay to use packet_read_pending()
    since in the fast path we short-circuit the condition already with
    ph != NULL, as we have further frames to process. In case we have
    MSG_DONTWAIT, we also do not execute this path as need_wait is
    false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
    okay to call packet_read_pending(), because whenever we reach
    that path, we're done processing outgoing frames anyway and only
    check whether there are skbs still outstanding to be orphaned. We
    can stay lockless in this percpu counter since it's acceptable for
    the sum to be imprecise when we reach this path, but we'll level
    out at 0 after all pending frames have reached the skb destructor
    eventually through tx reclaim. When people pin a tx process to
    particular CPUs, we expect overflows to happen in the per-CPU
    reference counters, since one CPU sees mostly increments while
    decrements are distributed across all CPUs through ksoftirqd, for
    example. As David Laight points out, since the C language doesn't
    define the result of signed int overflow (i.e. rather than wrap,
    it is allowed to saturate as a possible outcome), we have to use
    an unsigned int as the reference count. The sum over all CPUs when
    tx is complete will result in 0 again.

    We can remove the BUG_ON() in tpacket_destruct_skb() as well. It
    can _only_ be set from inside the tpacket_snd() path, and we made
    sure to increase tx_ring.pending in any case before we called
    po->xmit(skb). So testing for tx_ring.pending == 0 is not too
    useful. Instead, it would have been more useful to test whether
    lower layers failed to orphan the skb, so that we're missing ring
    slots being put back to
    TP_STATUS_AVAILABLE. But such a bug will be caught in user space
    already as we end up realizing that we do not have any
    TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.

    Btw, in case of RX_RING path, we do not make use of the pending
    member, therefore we also don't need to use up any percpu memory
    here. Also note that __alloc_percpu() already returns a zero-filled
    percpu area, so initialization is done already.

    [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
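
    A trimmed sketch of the resulting helpers; the real code hangs
    pending_refcnt off the ring buffer and leaves it NULL for RX rings:

    static void packet_inc_pending(struct packet_ring_buffer *rb)
    {
        this_cpu_inc(*rb->pending_refcnt);
    }

    static void packet_dec_pending(struct packet_ring_buffer *rb)
    {
        this_cpu_dec(*rb->pending_refcnt);
    }

    static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
    {
        unsigned int refcnt = 0;
        int cpu;

        if (rb->pending_refcnt == NULL)         /* RX ring: unused */
            return 0;

        for_each_possible_cpu(cpu)
            refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);

        return refcnt;
    }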

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In tpacket_snd(), when we've discovered a first frame that is
    not in status TP_STATUS_SEND_REQUEST, and return a NULL buffer,
    we exit the send routine in case of MSG_DONTWAIT, since we've
    finished traversing the mmaped send ring buffer and don't care
    about pending frames.

    While doing so, we still unconditionally call an expensive
    schedule() in the packet_current_frame() "error" path, which
    is unnecessary in this case since it's enough to just quit
    the function.

    Also, in case MSG_DONTWAIT is not set, we should rather test
    for need_resched() first and call schedule() only if necessary,
    since meanwhile pending frames could already have finished
    processing and called the skb destructor.
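
    The resulting loop body, roughly (sketch):

    if (unlikely(ph == NULL)) {
        /* With MSG_DONTWAIT (need_wait == false) just move on; otherwise
         * only schedule() when a reschedule is actually due. */
        if (need_wait && need_resched())
            schedule();
        continue;
    }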

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Most people acquire PF_PACKET sockets with a protocol argument in
    the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
    all its sockets. Most likely, at some point in time a subsequent
    bind() call will follow, e.g. in libpcap with ...

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = ifindex;
    sll.sll_protocol = htons(ETH_P_ALL);

    ... as arguments. What happens in the kernel is that already
    in socket() syscall, we install a proto hook via register_prot_hook()
    if our protocol argument is != 0. Yet, in bind() we do almost the
    same work again: an unregister_prot_hook() with an expensive
    synchronize_net() call in case the proto was != 0 during socket(),
    plus a follow-up register_prot_hook(), this time with a device bound
    to it, in order to limit the traffic we get.

    In the case where the protocol and the user-supplied device index
    (== 0) do not change from socket() to bind(), we can spare ourselves
    doing the same work twice. Similarly for re-binding to the same device
    and protocol. For these scenarios, we can decrease create/bind
    latency from ~7447us (sock-bind-2 case) to ~89us (sock-bind-1 case)
    with this patch.

    Alternatively, for the first case, if people care, they should
    simply create their sockets with proto == 0 argument and define
    the protocol during bind() as this saves a call to synchronize_net()
    as well (sock-bind-3 case).

    In all other cases, we're tied to user space behaviour we must not
    change, also since a bind() is not strictly required. Thus, we need
    the synchronize_net() to make sure no asynchronous packet processing
    paths still refer to the previous elements of po->prot_hook.

    In case of mmap()ed sockets, the workflow that includes bind() is
    socket() -> setsockopt() -> bind(). In that case, a pair of
    {__unregister, register}_prot_hook is being called from setsockopt()
    in order to install the new protocol receive handler. Thus, when
    we call bind and can skip a re-hook, we have already previously
    installed the new handler. For fanout, this is handled entirely
    differently, so we should be good.

    Timings on an i7-3520M machine:

    * sock-bind-1: 89 us
    * sock-bind-2: 7447 us
    * sock-bind-3: 75 us

    sock-bind-1:
    socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
    bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
    pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

    sock-bind-2:
    socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
    bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
    pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

    sock-bind-3:
    socket(PF_PACKET, SOCK_RAW, 0) = 3
    bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
    pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
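
    A hedged sketch of the idea in packet_do_bind(); variable names are
    illustrative, and the real code also handles fanout and the running
    state:

    bool need_rehook = po->prot_hook.type != proto ||
                       po->prot_hook.dev != dev;

    if (need_rehook) {
        unregister_prot_hook(sk, true);  /* implies synchronize_net() */
        po->num = proto;
        po->prot_hook.type = proto;
        po->prot_hook.dev = dev;
        register_prot_hook(sk);
    }
    /* Otherwise: same proto and device as before, nothing to re-hook. */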

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

01 Jan, 2014

1 commit


18 Dec, 2013

2 commits