17 May, 2016
1 commit
-
__sock_cmsg_send() might return different error codes, not only -EINVAL.
Fixes: 24025c465f77 ("ipv4: process socket-level control messages in IPv4")
Fixes: ad1e46a83716 ("ipv6: process socket-level control messages in IPv6")
Signed-off-by: Eric Dumazet
Cc: Soheil Hassas Yeganeh
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
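
A minimal userspace sketch of the difference described above, not the kernel code itself. The helper `model_sock_cmsg_send()` and its return values are hypothetical stand-ins for `__sock_cmsg_send()`; the point is only that the caller should hand back the helper's own error code instead of collapsing every failure to -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical stand-in for __sock_cmsg_send(): returns 0 on success or a
 * negative errno. The message types 1 and 2 are invented for this model. */
static int model_sock_cmsg_send(int cmsg_type)
{
	switch (cmsg_type) {
	case 1:			/* pretend: supported message */
		return 0;
	case 2:			/* pretend: caller lacks a privilege */
		return -EPERM;
	default:		/* unknown message type */
		return -EINVAL;
	}
}

/* Old behaviour: any failure of the helper is reported as -EINVAL. */
static int model_ip_cmsg_send_old(int cmsg_level, int cmsg_type)
{
	if (cmsg_level == SOL_SOCKET) {
		if (model_sock_cmsg_send(cmsg_type))
			return -EINVAL;
		return 0;
	}
	return 0;		/* IP-level handling elided in this model */
}

/* Fixed behaviour: propagate whatever error the helper returned. */
static int model_ip_cmsg_send_new(int cmsg_level, int cmsg_type)
{
	int err;

	if (cmsg_level == SOL_SOCKET) {
		err = model_sock_cmsg_send(cmsg_type);
		if (err)
			return err;
		return 0;
	}
	return 0;		/* IP-level handling elided in this model */
}
```

With this model, a privilege failure surfaces as -EPERM from the fixed variant, while the old variant hides it behind -EINVAL.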
12 May, 2016
1 commit
-
Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.

The previous patch moving the skb->dev change to L3 means nothing else
is needed for IPv6; it just works.

Signed-off-by: David Ahern
Signed-off-by: David S. Miller
26 Apr, 2016
1 commit
-
We should call consume_skb(skb) when skb is properly consumed,
or kfree_skb(skb) when skb must be dropped in error case.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
14 Apr, 2016
1 commit
-
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6dd pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.

Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checksum when reading truncated data.

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller
08 Apr, 2016
1 commit
-
The socket is locked either when we hold the slock spin_lock (for
lock_sock_fast and unlock_sock_fast) or when we own the lock
(sk_lock.owned != 0). Check for this and, at the same time, verify that
the current thread/CPU is really the one holding the lock.

Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
05 Apr, 2016
1 commit
-
Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.

Signed-off-by: Soheil Hassas Yeganeh
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller
23 Feb, 2016
1 commit
-
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller
17 Feb, 2016
1 commit
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
13 Feb, 2016
1 commit
-
Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.

Callers are responsible for the freeing.
Many thanks to Dmitry for the report and diagnostic.
Reported-by: Dmitry Vyukov
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
11 Feb, 2016
1 commit
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
05 Nov, 2015
1 commit
-
Sasha reported the following lockdep warning:
Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET);
                               lock(rtnl_mutex);
                               lock(sk_lock-AF_INET);
  lock(rtnl_mutex);

This happens because, for IP_MSFILTER and MCAST_MSFILTER, we take the
rtnl lock before the socket lock in the setsockopt() path, but take
the socket lock before the rtnl lock in the getsockopt() path. All the
remaining optnames are setsockopt()-only.

Fix this by aligning the getsockopt() path with the setsockopt()
path, so that all mcast socket paths take the locks in the same
order.

Note, the IPv6 part is different: the rtnl lock is not held there.
Fixes: 54ff9ef36bdf ("ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}")
Reported-by: Sasha Levin
Cc: Marcelo Ricardo Leitner
Signed-off-by: Cong Wang
Reviewed-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller
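
The lock-ordering rule behind this fix can be illustrated with a small userspace sketch. This is only a model under stated assumptions: the two pthread mutexes stand in for rtnl_mutex and the socket lock, and `model_setsockopt()` / `model_getsockopt()` are hypothetical names. Both paths take the coarse lock first and the socket lock second, so the inverted-order scenario from the lockdep report cannot arise between them.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins for rtnl_mutex and the per-socket lock (model only). */
static pthread_mutex_t model_rtnl = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t model_sk_lock = PTHREAD_MUTEX_INITIALIZER;
static int model_filter_state;

/* Set path: coarse lock first, then the socket lock. */
static void model_setsockopt(int value)
{
	pthread_mutex_lock(&model_rtnl);
	pthread_mutex_lock(&model_sk_lock);
	model_filter_state = value;
	pthread_mutex_unlock(&model_sk_lock);
	pthread_mutex_unlock(&model_rtnl);
}

/* Get path: same order as the set path, never socket lock before rtnl. */
static int model_getsockopt(void)
{
	int value;

	pthread_mutex_lock(&model_rtnl);
	pthread_mutex_lock(&model_sk_lock);
	value = model_filter_state;
	pthread_mutex_unlock(&model_sk_lock);
	pthread_mutex_unlock(&model_rtnl);
	return value;
}
```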
24 Jun, 2015
2 commits
-
Conflicts:
drivers/net/ethernet/mellanox/mlx4/main.c
net/packet/af_packet.c

Both conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller
-
ICMP messages can trigger ICMP and local errors. In this case
serr->port is 0 and starting from Linux 4.0 we do not return
the original target address to the error queue readers.
Add function to define which errors provide addr_offset.
With this fix my ping command is not silent anymore.

Fixes: c247f0534cc5 ("ip: fix error queue empty skb handling")
Signed-off-by: Julian Anastasov
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller
07 Jun, 2015
1 commit
-
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).

As most applications do not want to deal with already used ports, x is
often set to 0, meaning the kernel is in charge of finding an available
port.
But the kernel does not know yet whether this socket is going to be a
listener or be connected.
It has very limited choices (no full knowledge of the final 4-tuple for a
connect()).

With a limited ephemeral port range (about 32K ports), it is very easy to
fill the space.

This patch adds a new SOL_IP socket option, asking the kernel to ignore
the 0 port provided by the application in bind(IP, port=0) and only
remember the given IP address.

The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.

This new feature is available for both IPv4 and IPv6 (Thanks Neal)
Tested:
Wrote a test program and checked its behavior on IPv4 and IPv6.
strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
connect().
Also getsockname() shows that the port is still 0 right after bind()
but properly allocated after connect().

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0

IPv6 test :
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0

I was able to bind()/connect() a million concurrent IPv4 sockets,
instead of ~32000 before patch.

lpaa23:~# ulimit -n 1000010
lpaa23:~# ./bind --connect --num-flows=1000000 &
1000000 sockets

lpaa23:~# grep TCP /proc/net/sockstat
TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66

Check that a given source port is indeed used by many different
connections :

lpaa23:~# ss -t src :40000 | head -10
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983
ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983
ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983
ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983
ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983
ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983
ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983
ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983
ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
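
The bind() half of the strace sequence above can be reproduced with a short userspace sketch. It assumes a Linux kernel that carries this option; the helper name `bound_port_without_reservation()` is hypothetical, and the fallback `#define` uses the value from linux/in.h for C libraries whose headers predate the option.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24	/* value from linux/in.h */
#endif

/* Bind a TCP socket to `ip` with port 0 after setting
 * IP_BIND_ADDRESS_NO_PORT. Returns the local port reported by
 * getsockname() in host byte order, or -1 on any failure. */
static int bound_port_without_reservation(const char *ip)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int one = 1;
	int port = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(0);
	if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
		goto out;

	if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
		       &one, sizeof(one)) < 0)
		goto out;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;
	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		goto out;

	/* With the option set, no port has been reserved yet. */
	port = ntohs(addr.sin_port);
out:
	close(fd);
	return port;
}
```

On a kernel with the option, the returned port should still be 0 after bind(); the port is only picked by a later connect(), which this sketch leaves out.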
04 Apr, 2015
2 commits
-
The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.
Signed-off-by: Ian Morris
Signed-off-by: David S. Miller
-
The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.
Signed-off-by: Ian Morris
Signed-off-by: David S. Miller
19 Mar, 2015
2 commits
-
Kill ip_mc_{join,leave}_group and ipv6_sock_mc_{join,drop}
in favor of their inner __ ones, which don't grab rtnl.
As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- moves rtnl handling to callers instead while already fixing some
reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
__ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
__ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
-
There are some setsockopt operations in ipv4 and ipv6 that are grabbing
rtnl after having grabbed the socket lock. Yet this makes it impossible
to do operations that have to lock the socket when already within a rtnl
protected scope, like ndo dev_open and dev_stop.

We normally take coarse grained locks first but setsockopt inverted that.
So this patch inverts the lock logic for these operations and makes
setsockopt grab rtnl, if it will be needed, prior to grabbing the socket lock.

Signed-off-by: Marcelo Ricardo Leitner
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
09 Mar, 2015
1 commit
-
When reading from the error queue, msg_name and msg_control are only
populated for some errors. A new exception for empty timestamp skbs
added a false positive on icmp errors without payload.

`traceroute -M udpconn` only displayed gateways that return payload
with the icmp error: the embedded network headers are pulled before
sock_queue_err_skb, leaving an skb with skb->len == 0 otherwise.

Fix this regression by refining when msg_name and msg_control
branches are taken. The solutions for the two fields are independent.

msg_name only makes sense for errors that configure serr->port and
serr->addr_offset. Test the first instead of skb->len. This also fixes
another issue. saddr could hold the wrong data, as serr->addr_offset
is not initialized in some code paths, pointing to the start of the
network header. It is only valid when serr->port is set (non-zero).

msg_control support differs between IPv4 and IPv6. IPv4 only honors
requests for ICMP and timestamps with SOF_TIMESTAMPING_OPT_CMSG. The
skb->len test can simply be removed, because skb->dev is also tested
and never true for empty skbs. IPv6 honors requests for all errors
aside from local errors and timestamps on empty skbs.

In both cases, make the policy more explicit by moving this logic to
a new function that decides whether to process msg_control and that
optionally prepares the necessary fields in skb->cb[]. After this
change, the IPv4 and IPv6 paths are more similar.

The last case is rxrpc. Here, simply refine to only match timestamps.

Fixes: 49ca0d8bfaf3 ("net-timestamp: no-payload option")
Reported-by: Jan Niehusmann
Signed-off-by: Willem de Bruijn
----
Changes
v1->v2
- fix local origin test inversion in ip6_datagram_support_cmsg
- make v4 and v6 code paths more similar by introducing analogous
ipv4_datagram_support_cmsg
- fix compile bug in rxrpc
Signed-off-by: David S. Miller
03 Feb, 2015
1 commit
-
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
timestamps, this loops timestamps on top of empty packets.

Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
cmsg reception (aside from timestamps) are no longer possible. This
works together with a follow on patch that allows administrators to
only allow tx timestamping if it does not loop payload or metadata.

Signed-off-by: Willem de Bruijn
----
Changes (rfc -> v1)
- add documentation
- remove unnecessary skb->len test (thanks to Richard Cochran)
Signed-off-by: David S. Miller
28 Jan, 2015
1 commit
-
Conflicts:
arch/arm/boot/dts/imx6sx-sdb.dts
net/sched/cls_bpf.c

Two simple sets of overlapping changes.
Signed-off-by: David S. Miller
16 Jan, 2015
1 commit
-
The sockaddr is returned in IP(V6)_RECVERR as part of errhdr. That
structure is defined and allocated on the stack as

struct {
struct sock_extended_err ee;
struct sockaddr_in(6) offender;
} errhdr;

The second part is only initialized for certain SO_EE_ORIGIN values.
Always initialize it completely.

An MTU exceeded error on a SOCK_RAW/IPPROTO_RAW is one example that
would return uninitialized bytes.

Signed-off-by: Willem de Bruijn
----
Also verified that there is no padding between errhdr.ee and
errhdr.offender that could leak additional kernel data.
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
06 Jan, 2015
3 commits
-
Add ip_cmsg_recv_offset function which takes an offset argument
that indicates the starting offset in skb where data is being received
from. This will be useful for UDP when providing the checksum
to user space.

ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
zero.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Add ip_cmsg_recv_offset function which takes an offset argument
that indicates the starting offset in skb where data is being received
from. This will be useful for UDP when providing the checksum
to user space.

ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
zero.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
they can be referenced in other source files.

Restructure ip_cmsg_recv to not go through flags using shift; check
for flags by 'and' instead. This eliminates both the shift and a
conditional per flag check.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
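
The two styles of flag handling can be contrasted in a small sketch. The `MODEL_CMSG_*` bits and both function names are hypothetical, loosely modelled on the IP_CMSG_* constants; each function returns the set of flags it "handled" so the two can be compared.

```c
#include <assert.h>

/* Hypothetical flag bits, modelled on the IP_CMSG_* constants. */
#define MODEL_CMSG_PKTINFO	(1U << 0)
#define MODEL_CMSG_TTL		(1U << 1)
#define MODEL_CMSG_TOS		(1U << 2)

/* Shift style: test bit 0, shift right, stop once no bits remain. */
static unsigned int handled_by_shift(unsigned int flags)
{
	unsigned int handled = 0;

	if (flags & 1)
		handled |= MODEL_CMSG_PKTINFO;
	if ((flags >>= 1) == 0)
		return handled;

	if (flags & 1)
		handled |= MODEL_CMSG_TTL;
	if ((flags >>= 1) == 0)
		return handled;

	if (flags & 1)
		handled |= MODEL_CMSG_TOS;
	return handled;
}

/* Mask style: test each named bit with '&', clear it, stop when empty. */
static unsigned int handled_by_mask(unsigned int flags)
{
	unsigned int handled = 0;

	if (flags & MODEL_CMSG_PKTINFO) {
		handled |= MODEL_CMSG_PKTINFO;
		flags &= ~MODEL_CMSG_PKTINFO;
		if (!flags)
			return handled;
	}
	if (flags & MODEL_CMSG_TTL) {
		handled |= MODEL_CMSG_TTL;
		flags &= ~MODEL_CMSG_TTL;
		if (!flags)
			return handled;
	}
	if (flags & MODEL_CMSG_TOS)
		handled |= MODEL_CMSG_TOS;
	return handled;
}
```

Both variants handle the same flags; the mask style no longer depends on the bit positions being tested in order, which is also what makes the constants usable from other files.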
11 Dec, 2014
1 commit
-
Introduce helper macro for_each_cmsghdr as a wrapper for enumerating
the cmsghdrs of a msghdr; this is just a cleanup.

Signed-off-by: Gu Zheng
Signed-off-by: David S. Miller
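
A userspace analogue of such an iteration macro, built on the standard CMSG_FIRSTHDR/CMSG_NXTHDR macros. The name `model_for_each_cmsghdr` and the two helper functions are hypothetical; this is a sketch of the idea, not the kernel definition.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

/* Userspace analogue of the helper: walk every cmsghdr in a msghdr. */
#define model_for_each_cmsghdr(cmsg, msg)			\
	for ((cmsg) = CMSG_FIRSTHDR(msg);			\
	     (cmsg);						\
	     (cmsg) = CMSG_NXTHDR((msg), (cmsg)))

/* Count the control messages attached to msg. */
static int count_cmsgs(struct msghdr *msg)
{
	struct cmsghdr *cmsg;
	int n = 0;

	model_for_each_cmsghdr(cmsg, msg)
		n++;
	return n;
}

/* Build a control buffer holding two int-sized messages and count them.
 * The message type is arbitrary; only the headers matter for counting. */
static int demo_count_two(void)
{
	union {
		char buf[2 * CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces suitable alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int value = 1;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &value, sizeof(value));

	cmsg = CMSG_NXTHDR(&msg, cmsg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &value, sizeof(value));

	return count_cmsgs(&msg);
}
```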
09 Dec, 2014
2 commits
-
Allow reading of timestamps and cmsg at the same time on all relevant
socket families. One use is to correlate timestamps with egress
device, by asking for cmsg IP_PKTINFO.

On AF_INET sockets, call the relevant function (ip_cmsg_recv). To
avoid changing legacy expectations, only do so if the caller sets a
new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.

On AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
returned for all origins. The only change is to set ifindex, which is
not initialized for all error origins.

In both cases, only generate the pktinfo message if an ifindex is
known. This is not the case for ACK timestamps.

The difference between the protocol families is probably a historical
accident as a result of the different conditions for generating cmsg
in the relevant ip(v6)_recv_error function:

ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {

At one time, this was the same test bar for the ICMP/ICMP6
distinction. This is no longer true.

Signed-off-by: Willem de Bruijn
----
Changes
v1 -> v2
large rewrite
- integrate with existing pktinfo cmsg generation code
- on ipv4: only send with new flag, to maintain legacy behavior
- on ipv6: send at most a single pktinfo cmsg
- on ipv6: initialize fields if not yet initialized

The recv cmsg interfaces are also relevant to the discussion of
whether looping packet headers is problematic. For v6, cmsgs that
identify many headers are already returned. This patch expands
that to v4. If it sounds reasonable, I will follow with patches

1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
(http://patchwork.ozlabs.org/patch/366967/)
2. sysctl to conditionally drop all timestamps that have payload or
cmsg from users without CAP_NET_RAW.
Signed-off-by: David S. Miller
-
One line change, in response to catching an occurrence of this bug.
See also fix f4713a3dfad0 ("net-timestamp: make tcp_recvmsg call ...")

Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller
14 Nov, 2014
1 commit
-
Conflicts:
drivers/net/ethernet/chelsio/cxgb4vf/sge.c
drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c

sge.c was overlapping two changes, one to use the new
__dev_alloc_page() in net-next, and one to use s->fl_pg_order in net.

ixgbe_phy.c was a set of overlapping whitespace changes.
Signed-off-by: David S. Miller
12 Nov, 2014
1 commit
-
Use IS_ENABLED(CONFIG_IPV6), to enable this code if IPv6 is
a module.

Signed-off-by: Eric Dumazet
Fixes: c8e6ad0829a7 ("ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg")
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
06 Nov, 2014
1 commit
-
This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".

When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.

Having a helper like this means there will be fewer places to touch
during that transformation.

Based upon descriptions and patch from Al Viro.
Signed-off-by: David S. Miller
10 Sep, 2014
1 commit
-
Remove one sparse warning :
net/ipv4/ip_sockglue.c:328:22: warning: incorrect type in assignment (different address spaces)
net/ipv4/ip_sockglue.c:328:22: expected struct ip_ra_chain [noderef] *next
net/ipv4/ip_sockglue.c:328:22: got struct ip_ra_chain *[assigned] ra

And replace one rcu_assign_pointer() by RCU_INIT_POINTER() where applicable.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
02 Sep, 2014
1 commit
-
sk->sk_error_queue is dequeued in four locations. All share the
exact same logic. Deduplicate.

Also collapse the two critical sections for dequeue (at the top of
the recv handler) and signal (at the bottom).

This moves signal generation for the next packet forward, which should
be harmless.

It also changes the behavior if the recv handler exits early with an
error. Previously, a signal for follow-up packets on the errqueue
would then not be scheduled. The new behavior, to always signal, is
arguably a bug fix.

For rxrpc, the change causes the same function to be called repeatedly
for each queued packet (because the recv handler == sk_error_report).
It is likely that all packets will fail for the same reason (e.g.,
memory exhaustion).

This code runs without sk_lock held, so it is not safe to trust that
sk->sk_err is immutable in between releasing q->lock and the subsequent
test. Introduce int err just to avoid this potential race.

Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller
30 Jul, 2014
1 commit
-
Sparse warns because of implicit pointer cast.
v2: subject line correction, space between "void" and "*"
Signed-off-by: Karoly Kemeny
Signed-off-by: David S. Miller
27 Feb, 2014
1 commit
-
IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
generation of fragments if the interface mtu is exceeded, it is very
hard to make use of this option in already deployed name server software
for which I introduced this option.

This patch adds yet another new IP_MTU_DISCOVER option that does not honor
any path mtu information and does not accept new icmp notifications destined
for the socket this option is enabled on. But we allow outgoing fragmentation
in case the packet size exceeds the outgoing interface mtu.

As such this new option can be used as a drop-in replacement for
IP_PMTUDISC_DONT, which is currently in use by most name server software,
making the adoption of this option very smooth and easy.

The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
ignoring incoming path MTU updates and not honoring discovered path MTUs
in the output path.

Fixes: 482fc6094afad5 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
Cc: Florian Weimer
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
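
From user space, an IP_MTU_DISCOVER mode is selected with setsockopt(). The sketch below assumes the new mode is the one exposed as IP_PMTUDISC_OMIT in linux/in.h (the commit text above does not name it); the helper `set_and_get_pmtudisc()` is hypothetical, and the fallback `#define` covers C library headers that lack the constant.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_PMTUDISC_OMIT
#define IP_PMTUDISC_OMIT 5	/* value from linux/in.h */
#endif

/* Set an IP_MTU_DISCOVER mode on a fresh UDP socket and read it back.
 * Returns the mode reported by the kernel, or -1 on any failure. */
static int set_and_get_pmtudisc(int mode)
{
	int value = -1;
	socklen_t len = sizeof(value);
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
		       &mode, sizeof(mode)) < 0 ||
	    getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &value, &len) < 0)
		value = -1;

	close(fd);
	return value;
}
```

A name server that currently passes IP_PMTUDISC_DONT could, under these assumptions, call `set_and_get_pmtudisc(IP_PMTUDISC_OMIT)` and fall back to the old mode when the call returns -1 on an older kernel.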
20 Feb, 2014
1 commit
-
In case we decide in udp6_sendmsg to send the packet down the ipv4
udp_sendmsg path because the destination is either of family AF_INET or
the destination is an ipv4 mapped ipv6 address, we don't honor the
maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO.

We simply can check for this option in ip_cmsg_send because no calls to
ipv6 module functions are needed to do so.

Reported-by: Gert Doering
Cc: Tore Anderson
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
20 Jan, 2014
1 commit
-
We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
for IPv4 datagrams on IPv6 sockets.

This patch splits the ip6_datagram_recv_ctl into two functions, one
which handles both protocol families, AF_INET and AF_INET6, while the
ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.

ip6_datagram_recv_*_ctl never reported back any errors, so we can make
them return void. Also provide a helper for protocols which don't offer dual
personality to further use ip6_datagram_recv_ctl, which is exported to
modules.

I needed to shuffle the code for ping around a bit to make it easier to
implement dual personality for ping ipv6 sockets in future.

Reported-by: Gert Doering
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
19 Jan, 2014
1 commit
-
This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
handler msg_name and msg_namelen logic").

DECLARE_SOCKADDR validates that the structure we use for writing the
name information to is not larger than the buffer which is reserved
for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
consistently in sendmsg code paths.

Signed-off-by: Steffen Hurrle
Suggested-by: Hannes Frederic Sowa
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
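
The size check described above can be approximated in user space with a compile-time assertion. This is a sketch only: `MODEL_DECLARE_SOCKADDR` and the two helper functions are hypothetical, and struct sockaddr_storage (128 bytes on Linux) stands in for the reserved msg_name buffer.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Declare `dst` as a typed view of `src`, refusing at compile time any
 * pointee type larger than the 128-byte name buffer (model only). */
#define MODEL_DECLARE_SOCKADDR(type, dst, src)				\
	_Static_assert(sizeof(*(type)0) <=				\
		       sizeof(struct sockaddr_storage),			\
		       "sockaddr type larger than msg_name buffer");	\
	type dst = (type)(src)

/* Read the port out of a name buffer through the checked declaration. */
static unsigned short port_from_msg_name(void *msg_name)
{
	MODEL_DECLARE_SOCKADDR(struct sockaddr_in *, sin, msg_name);

	return ntohs(sin->sin_port);
}

/* Store a port in a storage-sized buffer and read it back. */
static unsigned short demo_roundtrip(unsigned short port)
{
	union {
		struct sockaddr_storage ss;
		struct sockaddr_in sin;
	} u;

	memset(&u, 0, sizeof(u));
	u.sin.sin_family = AF_INET;
	u.sin.sin_port = htons(port);
	return port_from_msg_name(&u.ss);
}
```

Using the macro with a structure bigger than the buffer would fail to compile, which is the property the commit relies on.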
11 Dec, 2013
1 commit
-
Various spelling fixes in networking stack
Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller
24 Nov, 2013
1 commit
-
Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.

As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.

This broke traceroute and such.
Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: Brad Spengler
Reported-by: Tom Labanowski
Cc: mpb
Cc: David S. Miller
Cc: Eric Dumazet
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller