17 May, 2016

1 commit

  • __sock_cmsg_send() might return different error codes, not only -EINVAL.

    Fixes: 24025c465f77 ("ipv4: process socket-level control messages in IPv4")
    Fixes: ad1e46a83716 ("ipv6: process socket-level control messages in IPv6")
    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 May, 2016

1 commit

  • Applications such as OSPF and BFD need the original ingress device not
    the VRF device; the latter can be derived from the former. To that end
    add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
    the skb control buffer similar to IPv6. From there the pktinfo can just
    pull it from cb with the PKTINFO_SKB_CB cast.

    The previous patch moving the skb->dev change to L3 means nothing else
    is needed for IPv6; it just works.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

26 Apr, 2016

1 commit


14 Apr, 2016

1 commit

  • On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
    the packet payload. Since commit e6afc8ace6dd pulled the headers,
    taking skb->data as the start of transport header is incorrect. Use
    the transport header pointer.

    Also, when peeking at an offset from the start of the packet, only
    return a checksum from the start of the peeked data. Note that the
    cmsg does not subtract a tail checkum when reading truncated data.

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

08 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • Process socket-level control messages by invoking
    __sock_cmsg_send in ip_cmsg_send for control messages on
    the SOL_SOCKET layer.

    This makes sure whenever ip_cmsg_send is called in udp, icmp,
    and raw, we also process socket-level control messages.

    Note that this commit interprets new control messages that
    were ignored before. As such, this commit does not change
    the behavior of IPv4 control messages.

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

23 Feb, 2016

1 commit


17 Feb, 2016

1 commit


13 Feb, 2016

1 commit

  • Dmitry reported memory leaks of IP options allocated in
    ip_cmsg_send() when/if this function returns an error.

    Callers are responsible for the freeing.

    Many thanks to Dmitry for the report and diagnostic.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Feb, 2016

1 commit


05 Nov, 2015

1 commit

  • Sasha reported the following lockdep warning:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(sk_lock-AF_INET);
    lock(rtnl_mutex);
    lock(sk_lock-AF_INET);
    lock(rtnl_mutex);

    This is due to that for IP_MSFILTER and MCAST_MSFILTER, we take
    rtnl lock before the socket lock in setsockopt() path, but take
    the socket lock before rtnl lock in getsockopt() path. All the
    rest optnames are setsockopt()-only.

    Fix this by aligning the getsockopt() path with the setsockopt()
    path, so that all mcast socket path would be locked in the same
    order.

    Note, IPv6 part is different where rtnl lock is not held.

    Fixes: 54ff9ef36bdf ("ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}")
    Reported-by: Sasha Levin
    Cc: Marcelo Ricardo Leitner
    Signed-off-by: Cong Wang
    Reviewed-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    WANG Cong
     

24 Jun, 2015

2 commits


07 Jun, 2015

1 commit

  • When an application needs to force a source IP on an active TCP socket
    it has to use bind(IP, port=x).

    As most applications do not want to deal with already used ports, x is
    often set to 0, meaning the kernel is in charge to find an available
    port.
    But kernel does not know yet if this socket is going to be a listener or
    be connected.
    It has very limited choices (no full knowledge of final 4-tuple for a
    connect())

    With limited ephemeral port range (about 32K ports), it is very easy to
    fill the space.

    This patch adds a new SOL_IP socket option, asking kernel to ignore
    the 0 port provided by application in bind(IP, port=0) and only
    remember the given IP address.

    The port will be automatically chosen at connect() time, in a way
    that allows sharing a source port as long as the 4-tuples are unique.

    This new feature is available for both IPv4 and IPv6 (Thanks Neal)

    Tested:

    Wrote a test program and checked its behavior on IPv4 and IPv6.

    strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
    connect().
    Also getsockname() show that the port is still 0 right after bind()
    but properly allocated after connect().

    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
    setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
    bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
    getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
    connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
    getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0

    IPv6 test :

    socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
    setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
    bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
    getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
    connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
    getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0

    I was able to bind()/connect() a million concurrent IPv4 sockets,
    instead of ~32000 before patch.

    lpaa23:~# ulimit -n 1000010
    lpaa23:~# ./bind --connect --num-flows=1000000 &
    1000000 sockets

    lpaa23:~# grep TCP /proc/net/sockstat
    TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66

    Check that a given source port is indeed used by many different
    connections :

    lpaa23:~# ss -t src :40000 | head -10
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983
    ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983
    ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983
    ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983
    ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983
    ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983
    ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983
    ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983
    ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Apr, 2015

2 commits

  • The ipv4 code uses a mixture of coding styles. In some instances check
    for non-NULL pointer is done as x != NULL and sometimes as x. x is
    preferred according to checkpatch and this patch makes the code
    consistent by adopting the latter form.

    No changes detected by objdiff.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     
  • The ipv4 code uses a mixture of coding styles. In some instances check
    for NULL pointer is done as x == NULL and sometimes as !x. !x is
    preferred according to checkpatch and this patch makes the code
    consistent by adopting the latter form.

    No changes detected by objdiff.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     

19 Mar, 2015

2 commits

  • in favor of their inner __ ones, which doesn't grab rtnl.

    As these functions need to operate on a locked socket, we can't be
    grabbing rtnl by then. It's too late and doing so causes reversed
    locking.

    So this patch:
    - move rtnl handling to callers instead while already fixing some
    reversed locking situations, like on vxlan and ipvs code.
    - renames __ ones to not have the __ mark:
    __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
    __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • There are some setsockopt operations in ipv4 and ipv6 that are grabbing
    rtnl after having grabbed the socket lock. Yet this makes it impossible
    to do operations that have to lock the socket when already within a rtnl
    protected scope, like ndo dev_open and dev_stop.

    We normally take coarse grained locks first but setsockopt inverted that.

    So this patch invert the lock logic for these operations and makes
    setsockopt grab rtnl if it will be needed prior to grabbing socket lock.

    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

09 Mar, 2015

1 commit

  • When reading from the error queue, msg_name and msg_control are only
    populated for some errors. A new exception for empty timestamp skbs
    added a false positive on icmp errors without payload.

    `traceroute -M udpconn` only displayed gateways that return payload
    with the icmp error: the embedded network headers are pulled before
    sock_queue_err_skb, leaving an skb with skb->len == 0 otherwise.

    Fix this regression by refining when msg_name and msg_control
    branches are taken. The solutions for the two fields are independent.

    msg_name only makes sense for errors that configure serr->port and
    serr->addr_offset. Test the first instead of skb->len. This also fixes
    another issue. saddr could hold the wrong data, as serr->addr_offset
    is not initialized in some code paths, pointing to the start of the
    network header. It is only valid when serr->port is set (non-zero).

    msg_control support differs between IPv4 and IPv6. IPv4 only honors
    requests for ICMP and timestamps with SOF_TIMESTAMPING_OPT_CMSG. The
    skb->len test can simply be removed, because skb->dev is also tested
    and never true for empty skbs. IPv6 honors requests for all errors
    aside from local errors and timestamps on empty skbs.

    In both cases, make the policy more explicit by moving this logic to
    a new function that decides whether to process msg_control and that
    optionally prepares the necessary fields in skb->cb[]. After this
    change, the IPv4 and IPv6 paths are more similar.

    The last case is rxrpc. Here, simply refine to only match timestamps.

    Fixes: 49ca0d8bfaf3 ("net-timestamp: no-payload option")

    Reported-by: Jan Niehusmann
    Signed-off-by: Willem de Bruijn

    ----

    Changes
    v1->v2
    - fix local origin test inversion in ip6_datagram_support_cmsg
    - make v4 and v6 code paths more similar by introducing analogous
    ipv4_datagram_support_cmsg
    - fix compile bug in rxrpc
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

03 Feb, 2015

1 commit

  • Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
    timestamps, this loops timestamps on top of empty packets.

    Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
    cmsg reception (aside from timestamps) are no longer possible. This
    works together with a follow on patch that allows administrators to
    only allow tx timestamping if it does not loop payload or metadata.

    Signed-off-by: Willem de Bruijn

    ----

    Changes (rfc -> v1)
    - add documentation
    - remove unnecessary skb->len test (thanks to Richard Cochran)
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

28 Jan, 2015

1 commit


16 Jan, 2015

1 commit

  • The sockaddr is returned in IP(V6)_RECVERR as part of errhdr. That
    structure is defined and allocated on the stack as

    struct {
    struct sock_extended_err ee;
    struct sockaddr_in(6) offender;
    } errhdr;

    The second part is only initialized for certain SO_EE_ORIGIN values.
    Always initialize it completely.

    An MTU exceeded error on a SOCK_RAW/IPPROTO_RAW is one example that
    would return uninitialized bytes.

    Signed-off-by: Willem de Bruijn

    ----

    Also verified that there is no padding between errhdr.ee and
    errhdr.offender that could leak additional kernel data.
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

06 Jan, 2015

3 commits

  • Add ip_cmsg_recv_offset function which takes an offset argument
    that indicates the starting offset in skb where data is being received
    from. This will be useful in the case of UDP and provided checksum
    to user space.

    ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
    zero.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add ip_cmsg_recv_offset function which takes an offset argument
    that indicates the starting offset in skb where data is being received
    from. This will be useful in the case of UDP and provided checksum
    to user space.

    ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
    zero.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
    they can be referenced in other source files.

    Restructure ip_cmsg_recv to not go through flags using shift, check
    for flags by 'and'. This eliminates both the shift and a conditional
    per flag check.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

11 Dec, 2014

1 commit


09 Dec, 2014

2 commits

  • Allow reading of timestamps and cmsg at the same time on all relevant
    socket families. One use is to correlate timestamps with egress
    device, by asking for cmsg IP_PKTINFO.

    on AF_INET sockets, call the relevant function (ip_cmsg_recv). To
    avoid changing legacy expectations, only do so if the caller sets a
    new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.

    on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
    returned for all origins. only change is to set ifindex, which is
    not initialized for all error origins.

    In both cases, only generate the pktinfo message if an ifindex is
    known. This is not the case for ACK timestamps.

    The difference between the protocol families is probably a historical
    accident as a result of the different conditions for generating cmsg
    in the relevant ip(v6)_recv_error function:

    ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
    ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {

    At one time, this was the same test bar for the ICMP/ICMP6
    distinction. This is no longer true.

    Signed-off-by: Willem de Bruijn

    ----

    Changes
    v1 -> v2
    large rewrite
    - integrate with existing pktinfo cmsg generation code
    - on ipv4: only send with new flag, to maintain legacy behavior
    - on ipv6: send at most a single pktinfo cmsg
    - on ipv6: initialize fields if not yet initialized

    The recv cmsg interfaces are also relevant to the discussion of
    whether looping packet headers is problematic. For v6, cmsgs that
    identify many headers are already returned. This patch expands
    that to v4. If it sounds reasonable, I will follow with patches

    1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
    (http://patchwork.ozlabs.org/patch/366967/)
    2. sysctl to conditionally drop all timestamps that have payload or
    cmsg from users without CAP_NET_RAW.
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • One line change, in response to catching an occurrence of this bug.
    See also fix f4713a3dfad0 ("net-timestamp: make tcp_recvmsg call ...")

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

14 Nov, 2014

1 commit


12 Nov, 2014

1 commit

  • Use IS_ENABLED(CONFIG_IPV6), to enable this code if IPv6 is
    a module.

    Signed-off-by: Eric Dumazet
    Fixes: c8e6ad0829a7 ("ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg")
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Nov, 2014

1 commit

  • This encapsulates all of the skb_copy_datagram_iovec() callers
    with call argument signature "skb, offset, msghdr->msg_iov, length".

    When we move to iov_iters in the networking, the iov_iter object will
    sit in the msghdr.

    Having a helper like this means there will be less places to touch
    during that transformation.

    Based upon descriptions and patch from Al Viro.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Sep, 2014

1 commit

  • Remove one sparse warning :
    net/ipv4/ip_sockglue.c:328:22: warning: incorrect type in assignment (different address spaces)
    net/ipv4/ip_sockglue.c:328:22: expected struct ip_ra_chain [noderef] *next
    net/ipv4/ip_sockglue.c:328:22: got struct ip_ra_chain *[assigned] ra

    And replace one rcu_assign_ptr() by RCU_INIT_POINTER() where applicable.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Sep, 2014

1 commit

  • sk->sk_error_queue is dequeued in four locations. All share the
    exact same logic. Deduplicate.

    Also collapse the two critical sections for dequeue (at the top of
    the recv handler) and signal (at the bottom).

    This moves signal generation for the next packet forward, which should
    be harmless.

    It also changes the behavior if the recv handler exits early with an
    error. Previously, a signal for follow-up packets on the errqueue
    would then not be scheduled. The new behavior, to always signal, is
    arguably a bug fix.

    For rxrpc, the change causes the same function to be called repeatedly
    for each queued packet (because the recv handler == sk_error_report).
    It is likely that all packets will fail for the same reason (e.g.,
    memory exhaustion).

    This code runs without sk_lock held, so it is not safe to trust that
    sk->sk_err is immutable inbetween releasing q->lock and the subsequent
    test. Introduce int err just to avoid this potential race.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

30 Jul, 2014

1 commit


27 Feb, 2014

1 commit

  • IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
    generation of fragments if the interface mtu is exceeded, it is very
    hard to make use of this option in already deployed name server software
    for which I introduced this option.

    This patch adds yet another new IP_MTU_DISCOVER option to not honor any
    path mtu information and not accepting new icmp notifications destined for
    the socket this option is enabled on. But we allow outgoing fragmentation
    in case the packet size exceeds the outgoing interface mtu.

    As such this new option can be used as a drop-in replacement for
    IP_PMTUDISC_DONT, which is currently in use by most name server software
    making the adoption of this option very smooth and easy.

    The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
    ignoring incoming path MTU updates and not honoring discovered path MTUs
    in the output path.

    Fixes: 482fc6094afad5 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
    Cc: Florian Weimer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

20 Feb, 2014

1 commit

  • In case we decide in udp6_sendmsg to send the packet down the ipv4
    udp_sendmsg path because the destination is either of family AF_INET or
    the destination is an ipv4 mapped ipv6 address, we don't honor the
    maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO.

    We simply can check for this option in ip_cmsg_send because no calls to
    ipv6 module functions are needed to do so.

    Reported-by: Gert Doering
    Cc: Tore Anderson
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

20 Jan, 2014

1 commit

  • We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
    for IPv4 datagrams on IPv6 sockets.

    This patch splits the ip6_datagram_recv_ctl into two functions, one
    which handles both protocol families, AF_INET and AF_INET6, while the
    ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.

    ip6_datagram_recv_*_ctl never reported back any errors, so we can make
    them return void. Also provide a helper for protocols which don't offer dual
    personality to further use ip6_datagram_recv_ctl, which is exported to
    modules.

    I needed to shuffle the code for ping around a bit to make it easier to
    implement dual personality for ping ipv6 sockets in future.

    Reported-by: Gert Doering
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

19 Jan, 2014

1 commit

  • This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
    handler msg_name and msg_namelen logic").

    DECLARE_SOCKADDR validates that the structure we use for writing the
    name information to is not larger than the buffer which is reserved
    for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
    consistently in sendmsg code paths.

    Signed-off-by: Steffen Hurrle
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Steffen Hurrle
     

11 Dec, 2013

1 commit


24 Nov, 2013

1 commit

  • Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage
    of uninitialized memory to user in recv syscalls") conditionally updated
    addr_len if the msg_name is written to. The recv_error and rxpmtu
    functions relied on the recvmsg functions to set up addr_len before.

    As this does not happen any more we have to pass addr_len to those
    functions as well and set it to the size of the corresponding sockaddr
    length.

    This broke traceroute and such.

    Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
    Reported-by: Brad Spengler
    Reported-by: Tom Labanowski
    Cc: mpb
    Cc: David S. Miller
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa