Eric Lee / smarc-fsl-linux-kernel

12 Dec, 2011

1 commit

dfd56b8b3 net: use IS_ENABLED(CONFIG_IPV6) ... Browse Code »

Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-12-12 07:25:16 +0800

10 Nov, 2011

1 commit

d826eb14e ipv4: PKTINFO doesnt need dst reference ... Browse Code »

Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :

> At least, in recent kernels we dont change dst->refcnt in forwarding
> patch (usinf NOREF skb->dst)
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
>
> I have one patch somewhere that stores the information in skb->cb[] and
> avoid the atomic_{inc|dec}(dst->refcnt).
>

OK I found it, I did some extra tests and believe its ready.

[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.

We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.

We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.

This removes two atomic operations per packet, and false sharing as
well.

On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.

IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-11-10 05:36:27 +0800

24 Oct, 2011

1 commit

66b13d99d ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT ... Browse Code »

There is a long standing bug in linux tcp stack, about ACK messages sent
on behalf of TIME_WAIT sockets.

In the IP header of the ACK message, we choose to reflect TOS field of
incoming message, and this might break some setups.

Example of things that were broken :
- Routing using TOS as a selector
- Firewalls
- Trafic classification / shaping

We now remember in timewait structure the inet tos field and use it in
ACK generation, and route lookup.

Notes :
- We still reflect incoming TOS in RST messages.
- We could extend MuraliRaja Muniraju patch to report TOS value in
netlink messages for TIME_WAIT sockets.
- A patch is needed for IPv6

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-10-24 15:06:21 +0800

19 Oct, 2011

1 commit

bc416d976 macvlan: handle fragmented multicast frames ... Browse Code »
43

Fragmented multicast frames are delivered to a single macvlan port,
because ip defrag logic considers other samples are redundant.

Implement a defrag step before trying to send the multicast frame.

Reported-by: Ben Greear
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-10-19 11:22:07 +0800

06 Jul, 2011

1 commit

595fc71ba ipv4: Add ip_defrag() agent IP_DEFRAG_AF_PACKET. ... Browse Code »

Elide the ICMP on frag queue timeouts unconditionally for
this user.

Signed-off-by: David S. Miller

David S. Miller
2011-07-06 13:34:52 +0800

24 Jun, 2011

1 commit

d18cd551d net: Fix build failures due to ip_is_fragment() ... Browse Code »

It needs to be available even when CONFIG_INET is not set.

Reported-by: Stephen Rothwell
Reported-by: Randy Dunlap
Signed-off-by: David S. Miller

David S. Miller
2011-06-24 12:28:52 +0800

22 Jun, 2011

1 commit

56f8a75c1 ip: introduce ip_is_fragment helper inline function ... Browse Code »

There are enough instances of this:

iph->frag_off & htons(IP_MF | IP_OFFSET)

that a helper function is probably warranted.

Signed-off-by: Paul Gortmaker
Signed-off-by: David S. Miller

Paul Gortmaker
2011-06-22 11:33:34 +0800

09 Jun, 2011

1 commit

4b9d9be83 inetpeer: remove unused list ... Browse Code »

Andi Kleen and Tim Chen reported huge contention on inetpeer
unused_peers.lock, on memcached workload on a 40 core machine, with
disabled route cache.

It appears we constantly flip peers refcnt between 0 and 1 values, and
we must insert/remove peers from unused_peers.list, holding a contended
spinlock.

Remove this list completely and perform a garbage collection on-the-fly,
at lookup time, using the expired nodes we met during the tree
traversal.

This removes a lot of code, makes locking more standard, and obsoletes
two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
removes two pointers in inet_peer structure.

There is still a false sharing effect because refcnt is in first cache
line of object [were the links and keys used by lookups are located], we
might move it at the end of inet_peer structure to let this first cache
line mostly read by cpus.

Signed-off-by: Eric Dumazet
CC: Andi Kleen
CC: Tim Chen
Signed-off-by: David S. Miller

Eric Dumazet
2011-06-09 08:05:30 +0800

11 May, 2011

1 commit

0a5ebb800 ipv4: Pass explicit daddr arg to ip_send_reply(). ... Browse Code »

This eliminates an access to rt->rt_src.

Signed-off-by: David S. Miller

David S. Miller
2011-05-11 04:32:46 +0800

09 May, 2011

3 commits

f5fca6086 ipv4: Pass flow key down into ip_append_*(). ... Browse Code »

This way rt->rt_dst accesses are unnecessary.

Signed-off-by: David S. Miller

David S. Miller
2011-05-09 12:24:07 +0800
77968b782 ipv4: Pass flow keys down into datagram packet building engine. ... Browse Code »

This way ip_output.c no longer needs rt->rt_{src,dst}.

We already have these keys sitting, ready and waiting, on the stack or
in a socket structure.

Signed-off-by: David S. Miller

David S. Miller
2011-05-09 12:24:06 +0800
d9d8da805 inet: Pass flowi to ->queue_xmit(). ... Browse Code »

This allows us to acquire the exact route keying information from the
protocol, however that might be managed.

It handles all of the possibilities, from the simplest case of storing
the key in inet->cork.fl to the more complex setup SCTP has where
individual transports determine the flow.

Signed-off-by: David S. Miller

David S. Miller
2011-05-09 06:28:28 +0800

07 May, 2011

1 commit

bdc712b4c inet: Decrease overhead of on-stack inet_cork. ... Browse Code »

When we fast path datagram sends to avoid locking by putting
the inet_cork on the stack we use up lots of space that isn't
necessary.

This is because inet_cork contains a "struct flowi" which isn't
used in these code paths.

Split inet_cork to two parts, "inet_cork" and "inet_cork_full".
Only the latter of which has the "struct flowi" and is what is
stored in inet_sock.

Signed-off-by: David S. Miller
Acked-by: Eric Dumazet

David S. Miller
2011-05-07 06:37:57 +0800

29 Apr, 2011

1 commit

f6d8bd051 inet: add RCU protection to inet->opt ... Browse Code »

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet
Cc: Herbert Xu
Signed-off-by: David S. Miller

Eric Dumazet
2011-04-29 04:16:35 +0800

30 Mar, 2011

1 commit

93ca3bb5d net: gre: provide multicast mappings for ipv4 and ipv6 ... Browse Code »

My commit 6d55cb91a0020ac0 (gre: fix hard header destination
address checking) broke multicast.

The reason is that ip_gre used to get ipgre_header() calls with
zero destination if we have NOARP or multicast destination. Instead
the actual target was decided at ipgre_tunnel_xmit() time based on
per-protocol dissection.

Instead of allowing the "abuse" of ->header() calls with invalid
destination, this creates multicast mappings for ip_gre. This also
fixes "ip neigh show nud noarp" to display the proper multicast
mappings used by the gre device.

Reported-by: Doug Kehn
Signed-off-by: Timo Teräs
Acked-by: Doug Kehn
Signed-off-by: David S. Miller

Timo Teräs
2011-03-30 15:10:47 +0800

02 Mar, 2011

1 commit

1c32c5ad6 inet: Add ip_make_skb and ip_finish_skb ... Browse Code »

This patch adds the helper ip_make_skb which is like ip_append_data
and ip_push_pending_frames all rolled into one, except that it does
not send the skb produced. The sending part is carried out by
ip_send_skb, which the transport protocol can call after it has
tweaked the skb.

It is meant to be called in cases where corking is not used should
have a one-to-one correspondence to sendmsg.

This patch also adds the helper ip_finish_skb which is meant to
be replace ip_push_pending_frames when corking is required.
Previously the protocol stack would peek at the socket write
queue and add its header to the first packet. With ip_finish_skb,
the protocol stack can directly operate on the final skb instead,
just like the non-corking case with ip_make_skb.

Signed-off-by: Herbert Xu
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Herbert Xu
2011-03-02 04:35:03 +0800

13 Dec, 2010

1 commit

323e126f0 ipv4: Don't pre-seed hoplimit metric. ... Browse Code »

Always go through a new ip4_dst_hoplimit() helper, just like ipv6.

This allowed several simplifications:

1) The interim dst_metric_hoplimit() can go as it's no longer
userd.

2) The sysctl_ip_default_ttl entry no longer needs to use
ipv4_doint_and_flush, since the sysctl is not cached in
routing cache metrics any longer.

3) ipv4_doint_and_flush no longer needs to be exported and
therefore can be marked static.

When ipv4_doint_and_flush_strategy was removed some time ago,
the external declaration in ip.h was mistakenly left around
so kill that off too.

We have to move the sysctl_ip_default_ttl declaration into
ipv4's route cache definition header net/route.h, because
currently net/ip.h (where the declaration lives now) has
a back dependency on net/route.h

Signed-off-by: David S. Miller

David S. Miller
2010-12-13 14:08:17 +0800

26 Oct, 2010

1 commit

43a951e99 ipv4: add __rcu annotations to ip_ra_chain ... Browse Code »

Add __rcu annotations to :
(struct ip_ra_chain)->next
struct ip_ra_chain *ip_ra_chain;

And use appropriate rcu primitives.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-10-26 05:18:28 +0800

24 Sep, 2010

1 commit

a02cec215 net: return operator cleanup ... Browse Code »

Change "return (EXPR);" to "return EXPR;"

return is not a function, parentheses are not required.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-09-24 05:33:39 +0800

19 Aug, 2010

1 commit

2244d07bf net: simplify flags for tx timestamping ... Browse Code »

This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.

The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

http://marc.info/?l=linux-netdev&m=128084897415886&w=2

Signed-off-by: Oliver Hartkopp
Signed-off-by: David S. Miller

Oliver Hartkopp
2010-08-19 15:08:30 +0800

01 Jul, 2010

1 commit

4ce3c183f snmp: 64bit ipstats_mib for all arches ... Browse Code »
43

/proc/net/snmp and /proc/net/netstat expose SNMP counters.

Width of these counters is either 32 or 64 bits, depending on the size
of "unsigned long" in kernel.

This means user program parsing these files must already be prepared to
deal with 64bit values, regardless of user program being 32 or 64 bit.

This patch introduces 64bit snmp values for IPSTAT mib, where some
counters can wrap pretty fast if they are 32bit wide.

# netstat -s|egrep "InOctets|OutOctets"
InOctets: 244068329096
OutOctets: 244069348848

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-07-01 04:31:19 +0800

26 Jun, 2010

1 commit

1823e4c80 snmp: add align parameter to snmp_mib_init() ... Browse Code »

In preparation for 64bit snmp counters for some mibs,
add an 'align' parameter to snmp_mib_init(), instead
of assuming mibs only contain 'unsigned long' fields.

Callers can use __alignof__(type) to provide correct
alignment.

Signed-off-by: Eric Dumazet
CC: Herbert Xu
CC: Arnaldo Carvalho de Melo
CC: Hideaki YOSHIFUJI
CC: Vlad Yasevich
Signed-off-by: David S. Miller

Eric Dumazet
2010-06-26 12:33:17 +0800

11 Jun, 2010

1 commit

592fcb9df ip: ip_ra_control() rcu fix ... Browse Code »

commit 66018506e15b (ip: Router Alert RCU conversion) introduced RCU
lookups to ip_call_ra_chain(). It missed proper deinit phase :
When ip_ra_control() deletes an ip_ra_chain, it should make sure
ip_call_ra_chain() users can not start to use socket during the rcu
grace period. It should also delay the sock_put() after the grace
period, or we risk a premature socket freeing and corruptions, as
raw sockets are not rcu protected yet.

This delay avoids using expensive atomic_inc_not_zero() in
ip_call_ra_chain().

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-06-11 13:47:08 +0800

08 Jun, 2010

1 commit

66018506e ip: Router Alert RCU conversion ... Browse Code »

Straightforward conversion to RCU.

One rwlock becomes a spinlock, and is static.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-06-08 12:25:21 +0800

25 May, 2010

1 commit

4be929be3 kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN ... Browse Code »

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan
Acked-by: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2010-05-25 23:07:02 +0800

16 May, 2010

1 commit

e3826f1e9 net: reserve ports for applications using fixed port numbers ... Browse Code »

(Dropped the infiniband part, because Tetsuo modified the related code,
I will send a separate patch for it once this is accepted.)

This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.

The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.

Signed-off-by: Octavian Purdila
Signed-off-by: WANG Cong
Cc: Neil Horman
Cc: Eric Dumazet
Cc: Eric W. Biederman
Signed-off-by: David S. Miller

Amerigo Wang
2010-05-16 14:28:40 +0800

29 Apr, 2010

1 commit

f84af32cb net: ip_queue_rcv_skb() helper ... Browse Code »

When queueing a skb to socket, we can immediately release its dst if
target socket do not use IP_CMSG_PKTINFO.

tcp_data_queue() can drop dst too.

This to benefit from a hot cache line and avoid the receiver, possibly
on another cpu, to dirty this cache line himself.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-04-29 06:31:51 +0800

16 Apr, 2010

1 commit

4e15ed4d9 net: replace ipfragok with skb->local_df ... Browse Code »

As Herbert Xu said: we should be able to simply replace ipfragok
with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

The patch kills the ipfragok parameter of .queue_xmit().

Signed-off-by: Shan Wei
Signed-off-by: David S. Miller

Shan Wei
2010-04-16 14:36:37 +0800

17 Feb, 2010

1 commit

7d720c3e4 percpu: add __percpu sparse annotations to net ... Browse Code »

Add __percpu sparse annotations to net.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.

The macro and type tricks around snmp stats make things a bit
interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field
as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All
snmp_mib_*() users which used to cast the argument to (void **) are
updated to cast it to (void __percpu **).

Signed-off-by: Tejun Heo
Acked-by: David S. Miller
Cc: Patrick McHardy
Cc: Arnaldo Carvalho de Melo
Cc: Vlad Yasevich
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller

Tejun Heo
2010-02-17 15:05:38 +0800

16 Feb, 2010

1 commit

5d0aa2ccd netfilter: nf_conntrack: add support for "conntrack zones" ... Browse Code »

Normally, each connection needs a unique identity. Conntrack zones allow
to specify a numerical zone using the CT target, connections in different
zones can use the same identity.

Example:

iptables -t raw -A PREROUTING -i veth0 -j CT --zone 1
iptables -t raw -A OUTPUT -o veth1 -j CT --zone 1

Signed-off-by: Patrick McHardy

Patrick McHardy
2010-02-16 01:13:33 +0800

14 Jan, 2010

1 commit

cd65c3c7d net: fix build erros with CONFIG_BUG=n, CONFIG_GENERIC_BUG=n ... Browse Code »

Fixed build errors introduced by commit 7ad6848c (ip: fix mc_loop
checks for tunnels with multicast outer addresses)

Signed-off-by: Octavian Purdila
Signed-off-by: David S. Miller

Octavian Purdila
2010-01-14 10:10:36 +0800

07 Jan, 2010

1 commit

7ad6848c7 ip: fix mc_loop checks for tunnels with multicast outer addresses ... Browse Code »

When we have L3 tunnels with different inner/outer families
(i.e. IPV4/IPV6) which use a multicast address as the outer tunnel
destination address, multicast packets will be loopbacked back to the
sending socket even if IP*_MULTICAST_LOOP is set to disabled.

The mc_loop flag is present in the family specific part of the socket
(e.g. the IPv4 or IPv4 specific part). setsockopt sets the inner
family mc_loop flag. When the packet is pushed through the L3 tunnel
it will eventually be processed by the outer family which if different
will check the flag in a different part of the socket then it was set.

Signed-off-by: Octavian Purdila
Signed-off-by: David S. Miller

Octavian Purdila
2010-01-07 12:37:01 +0800

15 Dec, 2009

1 commit

8fa9ff684 netfilter: fix crashes in bridge netfilter caused by fragment jumps ... Browse Code »

When fragments from bridge netfilter are passed to IPv4 or IPv6 conntrack
and a reassembly queue with the same fragment key already exists from
reassembling a similar packet received on a different device (f.i. with
multicasted fragments), the reassembled packet might continue on a different
codepath than where the head fragment originated. This can cause crashes
in bridge netfilter when a fragment received on a non-bridge device (and
thus with skb->nf_bridge == NULL) continues through the bridge netfilter
code.

Add a new reassembly identifier for packets originating from bridge
netfilter and use it to put those packets in insolated queues.

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=14805

Reported-and-Tested-by: Chong Qiao
Signed-off-by: Patrick McHardy

Patrick McHardy
2009-12-15 23:59:59 +0800

04 Nov, 2009

1 commit

fd2c3ef76 net: cleanup include/net ... Browse Code »

This cleanup patch puts struct/union/enum opening braces,
in first line to ease grep games.

struct something
{

becomes :

struct something {

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2009-11-04 21:06:25 +0800

19 Oct, 2009

1 commit

c720c7e83 inet: rename some inet_sock fields ... Browse Code »

In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2009-10-19 09:52:53 +0800

01 Oct, 2009

1 commit

b7058842c net: Make setsockopt() optlen be unsigned. ... Browse Code »

This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.

Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.

Signed-off-by: David S. Miller

David S. Miller
2009-10-01 07:12:20 +0800

24 Sep, 2009

1 commit

8d65af789 sysctl: remove "struct file *" argument of ->proc_handler ... Browse Code »

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan
Cc: David Howells
Cc: "Eric W. Biederman"
Cc: Al Viro
Cc: Ralf Baechle
Cc: Martin Schwidefsky
Cc: Ingo Molnar
Cc: "David S. Miller"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-24 22:21:04 +0800

27 Apr, 2009

1 commit

edf391ff1 snmp: add missing counters for RFC 4293 ... Browse Code »

The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
OutMcastOctets:
http://tools.ietf.org/html/rfc4293
But it seems we don't track those in any way that easy to separate from other
protocols. This patch adds those missing counters to the stats file. Tested
successfully by me

With help from Eric Dumazet.

Signed-off-by: Neil Horman
Signed-off-by: David S. Miller

Neil Horman
2009-04-27 17:45:02 +0800

16 Feb, 2009

1 commit

51f31cabe ip: support for TX timestamps on UDP and RAW sockets ... Browse Code »

Instructions for time stamping outgoing packets are take from the
socket layer and later copied into the new skb.

Signed-off-by: Patrick Ohly
Signed-off-by: David S. Miller

Patrick Ohly
2009-02-16 14:43:38 +0800

26 Nov, 2008

1 commit

b27aeadb5 netns xfrm: per-netns sysctls ... Browse Code »

Make
net.core.xfrm_aevent_etime
net.core.xfrm_acq_expires
net.core.xfrm_aevent_rseqth
net.core.xfrm_larval_drop

sysctls per-netns.

For that make net_core_path[] global, register it to prevent two
/proc/net/core antries and change initcall position -- xfrm_init() is called
from fs_initcall, so this one should be fs_initcall at least.

Signed-off-by: Alexey Dobriyan
Signed-off-by: David S. Miller

Alexey Dobriyan
2008-11-26 10:00:48 +0800