15 Dec, 2015

1 commit

  • [ Upstream commit 7d267278a9ece963d77eefec61630223fce08c6c ]

    Rainer Weikusat writes:
    An AF_UNIX datagram socket being the client in an n:1 association with
    some server socket is only allowed to send messages to the server if the
    receive queue of this socket contains at most sk_max_ack_backlog
    datagrams. This implies that prospective writers might be forced to go
    to sleep even though none of the messages presently enqueued on the
    server receive queue were sent by them. In order to ensure that these will be
    woken up once space becomes again available, the present unix_dgram_poll
    routine does a second sock_poll_wait call with the peer_wait wait queue
    of the server socket as queue argument (unix_dgram_recvmsg does a wake
    up on this queue after a datagram was received). This is inherently
    problematic because the server socket is only guaranteed to remain alive
    for as long as the client still holds a reference to it. If the
    connection is dissolved via connect or by the dead peer detection logic
    in unix_dgram_sendmsg, the server socket may be freed even though "the
    polling mechanism" (in particular, epoll) still has a pointer to the
    corresponding peer_wait queue. There's no way to forcibly deregister a
    wait queue with epoll.

    Based on an idea by Jason Baron, the patch below changes the code such
    that a wait_queue_t belonging to the client socket is enqueued on the
    peer_wait queue of the server whenever the peer receive queue full
    condition is detected by either a sendmsg or a poll. A wake up on the
    peer queue is then relayed to the ordinary wait queue of the client
    socket via wake function. The connection to the peer wait queue is again
    dissolved if either a wake up is about to be relayed or the client
    socket reconnects or a dead peer is detected or the client socket is
    itself closed. This enables removing the second sock_poll_wait from
    unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
    that no blocked writer sleeps forever.

    Signed-off-by: Rainer Weikusat
    Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
    Reviewed-by: Jason Baron
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Rainer Weikusat
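
    The relay described above can be modelled in user space. The sketch
    below is not kernel code: toy_sock, waiter, wait_queue and every
    function name are hypothetical stand-ins, and waking the client's own
    wait queue is reduced to incrementing a counter. It only illustrates
    that the client's entry lives on the server's peer_wait queue, removes
    itself when a wake-up is relayed, and is also removed on disconnect.

    ```c
    #include <stddef.h>

    /* Toy stand-ins for kernel wait queues; all names are hypothetical. */
    struct waiter;
    typedef void (*wake_fn)(struct waiter *w);

    struct waiter {
        wake_fn fn;            /* called when the queue it sits on is woken */
        struct waiter *next;
        void *priv;            /* owning client socket */
    };

    struct wait_queue {
        struct waiter *head;
    };

    struct toy_sock {
        struct wait_queue peer_wait;  /* woken when this socket receives */
        struct waiter relay;          /* client's entry on a peer's peer_wait */
        struct toy_sock *relay_peer;  /* peer the relay is registered with */
        int wakeups;                  /* stands in for the client's own queue */
    };

    static void wq_add(struct wait_queue *q, struct waiter *w)
    {
        w->next = q->head;
        q->head = w;
    }

    static void wq_remove(struct wait_queue *q, struct waiter *w)
    {
        struct waiter **pp = &q->head;

        while (*pp) {
            if (*pp == w) {
                *pp = w->next;
                w->next = NULL;
                return;
            }
            pp = &(*pp)->next;
        }
    }

    /* Wake function: dissolve the link to the peer queue, then relay. */
    static void relay_wake(struct waiter *w)
    {
        struct toy_sock *client = w->priv;

        wq_remove(&client->relay_peer->peer_wait, w);
        client->relay_peer = NULL;
        client->wakeups++;
    }

    /* Called when sendmsg or poll finds the peer receive queue full. */
    static void client_note_peer_full(struct toy_sock *client,
                                      struct toy_sock *server)
    {
        if (client->relay_peer)
            return;                   /* already registered */
        client->relay.fn = relay_wake;
        client->relay.priv = client;
        client->relay_peer = server;
        wq_add(&server->peer_wait, &client->relay);
    }

    /* Reconnect, dead peer detection and close all end up here. */
    static void client_disconnect(struct toy_sock *client)
    {
        if (client->relay_peer) {
            wq_remove(&client->relay_peer->peer_wait, &client->relay);
            client->relay_peer = NULL;
        }
    }

    /* The receiver wakes everyone registered on its peer_wait queue. */
    static void server_received_datagram(struct toy_sock *server)
    {
        struct waiter *w = server->peer_wait.head;

        while (w) {
            struct waiter *next = w->next;  /* fn may unlink w */
            w->fn(w);
            w = next;
        }
    }
    ```

    Because the entry is always unlinked before the client goes away, the
    server queue never holds a pointer into a freed client, which is the
    property the patch relies on.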
     

27 Oct, 2015

2 commits

  • [ Upstream commit e9193d60d363e4dff75ff6d43a48f22be26d59c7 ]

    Now recv with MSG_PEEK can return data from multiple SKBs.

    Unfortunately we take into account the peek offset for each skb,
    which is wrong. We need to apply the peek offset only once.

    In addition, the peek offset should be used only if MSG_PEEK is set.

    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Aaron Conole
    Fixes: 9f389e35674f ("af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag")
    Signed-off-by: Andrey Vagin
    Tested-by: Aaron Conole
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andrey Vagin
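
    A minimal user-space sketch of "apply the peek offset only once": the
    queued skbs are modelled as an array of chunks, and the offset is
    consumed while walking the chunks instead of being re-applied to each
    one. The names chunk and peek_chunks are hypothetical; a caller that is
    not peeking would pass an offset of 0.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical stand-in for one queued skb. */
    struct chunk {
        const char *data;
        size_t len;
    };

    /*
     * Copy up to cap bytes into out, starting peek_off bytes into the
     * logical byte stream formed by all chunks. Returns bytes copied.
     */
    static size_t peek_chunks(const struct chunk *q, size_t n,
                              size_t peek_off, char *out, size_t cap)
    {
        size_t copied = 0;
        size_t skip = peek_off;       /* remaining offset, used up once */

        for (size_t i = 0; i < n && copied < cap; i++) {
            if (skip >= q[i].len) {   /* chunk lies entirely before offset */
                skip -= q[i].len;
                continue;
            }
            size_t avail = q[i].len - skip;
            size_t take = avail < cap - copied ? avail : cap - copied;

            memcpy(out + copied, q[i].data + skip, take);
            copied += take;
            skip = 0;                 /* later chunks start at byte 0 */
        }
        return copied;
    }
    ```

    With chunks "abc" and "defg" and an offset of 1, this yields "bcdefg";
    applying the offset to every chunk would wrongly drop the "d".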
     
  • [ Upstream commit 9f389e35674f5b086edd70ed524ca0f287259725 ]

    AF_UNIX sockets now return multiple skbs from recv() when MSG_PEEK flag
    is set.

    This is referenced in kernel bugzilla #12323 @
    https://bugzilla.kernel.org/show_bug.cgi?id=12323

    As described both in the BZ and lkml thread @
    http://lkml.org/lkml/2008/1/8/444 calling recv() with MSG_PEEK on an
    AF_UNIX socket only reads a single skb, where the desired effect is
    to return as much skb data as has been queued, or as much as fits in
    the recv buffer (whichever comes first).

    The modified MSG_PEEK path will now move to the next skb in the tree
    and jump to the again: label, rather than following the natural loop
    structure. This requires duplicating some of the loop head actions.

    This was tested using the Python socketpair code attached to
    the bugzilla issue.

    Signed-off-by: Aaron Conole
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Aaron Conole
     

27 May, 2015

1 commit


28 Apr, 2015

1 commit

  • Pull networking fixes from David Miller:

    1) mlx4 doesn't check fully for supported valid RSS hash function, fix
    from Amir Vadai

    2) Off by one in ibmveth_change_mtu(), from David Gibson

    3) Prevent altera chip from reporting false error interrupts in some
    circumstances, from Chee Nouk Phoon

    4) Get rid of that stupid endless loop trying to allocate a FIN packet
    in TCP, and in the process kill deadlocks. From Eric Dumazet

    5) Fix get_rps_cpus() crash due to wrong invalid-cpu value, also from
    Eric Dumazet

    6) Fix two bugs in async rhashtable resizing, from Thomas Graf

    7) Fix topology server listener socket namespace bug in TIPC, from Ying
    Xue

    8) Add some missing HAS_DMA kconfig dependencies, from Geert
    Uytterhoeven

    9) bgmac driver intends to force re-polling but does so by returning
    the wrong value from its ->poll() handler. Fix from Rafał Miłecki

    10) When the creator of an rhashtable configures a max size for it,
    don't bark in the logs and drop insertions when that is exceeded.
    Fix from Johannes Berg

    11) Recover from out of order packets in ppp mppe properly, from Sylvain
    Rochet

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
    bnx2x: really disable TPA if 'disable_tpa' option is set
    net:treewide: Fix typo in drivers/net
    net/mlx4_en: Prevent setting invalid RSS hash function
    mdio-mux-gpio: use new gpiod_get_array and gpiod_put_array functions
    netfilter; Add some missing default cases to switch statements in nft_reject.
    ppp: mppe: discard late packet in stateless mode
    ppp: mppe: sanity error path rework
    net/bonding: Make DRV macros private
    net: rfs: fix crash in get_rps_cpus()
    altera tse: add support for fixed-links.
    pxa168: fix double deallocation of managed resources
    net: fix crash in build_skb()
    net: eth: altera: Resolve false errors from MSGDMA to TSE
    ehea: Fix memory hook reference counting crashes
    net/tg3: Release IRQs on permanent error
    net: mdio-gpio: support access that may sleep
    inet: fix possible panic in reqsk_queue_unlink()
    rhashtable: don't attempt to grow when at max_size
    bgmac: fix requests for extra polling calls from NAPI
    tcp: avoid looping in tcp_send_fin()
    ...

    Linus Torvalds
     

24 Apr, 2015

1 commit


16 Apr, 2015

2 commits


03 Mar, 2015

1 commit

  • Now that TIPC no longer depends on the iocb argument in its internal
    implementations of the sendmsg() and recvmsg() hooks defined in the
    proto structure, nothing uses the iocb argument in these hooks at all.
    We can therefore drop the redundant iocb argument completely from all
    implementations of both sendmsg() and recvmsg() in the entire
    networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

29 Jan, 2015

1 commit

  • The sock_iocb structure is allocated on the stack for each read/write-like
    operation on sockets, and contains various fields of which only the
    embedded msghdr and sometimes a pointer to the scm_cookie are ever used.
    Get rid of the sock_iocb and put a msghdr directly on the stack and pass
    the scm_cookie explicitly to netlink_mmap_sendmsg.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

18 Jan, 2015

1 commit

  • Contrary to common expectations for an "int" return, these functions
    return only a positive value -- if used correctly they cannot even
    return 0 because the message header will necessarily be in the skb.

    This makes the very common pattern of

    if (genlmsg_end(...) < 0) { ... }

    be a whole bunch of dead code. Many places also simply do

    return nlmsg_end(...);

    and the caller is expected to deal with it.

    This also commonly (at least for me) causes errors, because it is very
    common to write

    if (my_function(...))
    /* error condition */

    and if my_function() does "return nlmsg_end()" this is of course wrong.

    Additionally, there's not a single place in the kernel that actually
    needs the message length returned, and if anyone needs it later then
    it'll be very easy to just use skb->len there.

    Remove this, and make the functions void. This removes a bunch of dead
    code as described above. The patch adds lines because I did

    - return nlmsg_end(...);
    + nlmsg_end(...);
    + return 0;

    I could have preserved all the function's return values by returning
    skb->len, but instead I've audited all the places calling the affected
    functions and found that none cared. A few places actually compared
    the return value with < 0 with no change in behaviour, so I opted for the more
    efficient version.

    One instance of the error I've made numerous times now is also present
    in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
    check for
    Signed-off-by: David S. Miller

    Johannes Berg
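
    The pitfall described in this entry can be reproduced with a small
    user-space mock. Nothing here is the netlink API: toy_msg, toy_end_old,
    toy_end and the 16-byte header size are hypothetical, chosen only to
    show why "return nlmsg_end(...)" style code hands a positive value to
    callers that treat non-zero as an error.

    ```c
    #include <stddef.h>

    #define TOY_HDRLEN 16   /* assumed header size, for illustration only */

    struct toy_msg {
        size_t len;
    };

    /* Old style: finishing a message returned its length, always > 0. */
    static int toy_end_old(struct toy_msg *m, size_t payload)
    {
        m->len = TOY_HDRLEN + payload;
        return (int)m->len;
    }

    /* New style: nothing to return, the length stays available in m->len. */
    static void toy_end(struct toy_msg *m, size_t payload)
    {
        m->len = TOY_HDRLEN + payload;
    }

    /* Leaks a positive length to a caller that expects 0 on success. */
    static int fill_old(struct toy_msg *m)
    {
        return toy_end_old(m, 4);
    }

    /* Mirrors the "+ nlmsg_end(...); + return 0;" shape from the patch. */
    static int fill_new(struct toy_msg *m)
    {
        toy_end(m, 4);
        return 0;
    }
    ```

    A caller written as "if (fill_old(&m)) { /* error */ }" takes the error
    branch on every successful call, while the same test on fill_new()
    behaves as intended.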
     

10 Dec, 2014

1 commit

  • Note that the code _using_ ->msg_iter at that point will be very
    unhappy with anything other than unshifted iovec-backed iov_iter.
    We still need to convert users to proper primitives.

    Signed-off-by: Al Viro

    Al Viro
     

24 Nov, 2014

1 commit


06 Nov, 2014

1 commit

  • This encapsulates all of the skb_copy_datagram_iovec() callers
    with call argument signature "skb, offset, msghdr->msg_iov, length".

    When we move to iov_iters in the networking, the iov_iter object will
    sit in the msghdr.

    Having a helper like this means there will be fewer places to touch
    during that transformation.

    Based upon descriptions and patch from Al Viro.

    Signed-off-by: David S. Miller

    David S. Miller
     

08 Oct, 2014

1 commit


13 Jun, 2014

1 commit

  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     

17 May, 2014

1 commit


18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access is potentially
    to freed up memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no implementation of this callback actually uses
    the length argument. And since nobody actually cared about its
    value, lots of call sites pass arbitrary values in such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Mar, 2014

1 commit

  • Some applications didn't expect that recvmsg() on a non-blocking socket
    could return -EINTR. This possibility was added as a side effect
    of commit b3ca9b02b00704 ("net: fix multithreaded signal handling in
    unix recv routines").

    To hit this bug, you need to be a bit unlucky, as the u->readlock
    mutex is usually held for very small periods.

    Fixes: b3ca9b02b00704 ("net: fix multithreaded signal handling in unix recv routines")
    Signed-off-by: Eric Dumazet
    Cc: Rainer Weikusat
    Signed-off-by: David S. Miller

    Eric Dumazet
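
    Applications that may run on kernels without this fix can defend
    themselves by retrying when a receive call is interrupted. The wrapper
    below is a generic user-space sketch, not part of the patch; the name
    recv_retry_eintr is hypothetical. It retries only on EINTR and passes
    every other result, including EAGAIN, straight back to the caller.

    ```c
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Like recv(), but transparently restarts when interrupted by a signal. */
    static ssize_t recv_retry_eintr(int fd, void *buf, size_t len, int flags)
    {
        ssize_t n;

        do {
            n = recv(fd, buf, len, flags);
        } while (n < 0 && errno == EINTR);

        return n;   /* on failure, errno is whatever recv() left behind */
    }
    ```

    On an empty socket with MSG_DONTWAIT the call still fails with EAGAIN
    (or EWOULDBLOCK), so existing non-blocking logic keeps working.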
     

07 Mar, 2014

1 commit

  • The unix socket code is using the result of csum_partial to
    hash into a lookup table:

    unix_hash_fold(csum_partial(sunaddr, len, 0));

    csum_partial is only guaranteed to produce something that can be
    folded into a checksum, as its prototype explains:

    * returns a 32-bit number suitable for feeding into itself
    * or csum_tcpudp_magic

    The 32bit value should not be used directly.

    Depending on the alignment, the ppc64 csum_partial will return
    different 32bit partial checksums that will fold into the same
    16bit checksum.

    This difference causes the following testcase (courtesy of
    Gustavo) to sometimes fail:

    #include <sys/socket.h>
    #include <stdio.h>

    int main()
    {
        int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

        int i = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, 4);

        /* addrlen 2 == sizeof(sa_family_t): autobind to an abstract name */
        struct sockaddr addr;
        addr.sa_family = AF_LOCAL;
        bind(fd, &addr, 2);

        listen(fd, 128);

        /* fetch the autobound address so a second socket can connect to it */
        struct sockaddr_storage ss;
        socklen_t sslen = (socklen_t)sizeof(ss);
        getsockname(fd, (struct sockaddr*)&ss, &sslen);

        fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

        if (connect(fd, (struct sockaddr*)&ss, sslen) == -1){
            perror(NULL);
            return 1;
        }
        printf("OK\n");
        return 0;
    }

    As suggested by davem, fix this by using csum_fold to fold the
    partial 32bit checksum into a 16bit checksum before using it.

    Signed-off-by: Anton Blanchard
    Cc: stable@vger.kernel.org
    Signed-off-by: David S. Miller

    Anton Blanchard
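
    The effect is easy to see in portable C. The function below is a
    sketch of the usual end-around-carry fold followed by a one's
    complement, which is the idea behind csum_fold; the name fold_csum is
    hypothetical and this is not the kernel's per-architecture
    implementation. Two different 32-bit partial sums fold to the same
    16-bit value, so only the folded value is safe to hash on.

    ```c
    #include <stdint.h>

    /* Fold a 32-bit partial checksum into 16 bits and complement it. */
    static uint16_t fold_csum(uint32_t sum)
    {
        sum = (sum & 0xffff) + (sum >> 16);   /* add high half into low */
        sum = (sum & 0xffff) + (sum >> 16);   /* absorb a possible carry */
        return (uint16_t)~sum;
    }
    ```

    For example, 0x00010002 and 0x00000003 are distinct partial sums, yet
    both fold to 0xfffc.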
     

19 Jan, 2014

1 commit

  • This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
    handler msg_name and msg_namelen logic").

    DECLARE_SOCKADDR validates that the structure we use for writing the
    name information to is not larger than the buffer which is reserved
    for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
    consistently in sendmsg code paths.

    Signed-off-by: Steffen Hurrle
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Steffen Hurrle
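
    A user-space analogue of the size check can be written with C11
    _Static_assert. The macro name DECLARE_SOCKADDR_SKETCH and the helper
    family_of are hypothetical; the kernel macro is implemented differently.
    The sketch only shows the idea: the build fails if the address type
    being cast to is larger than the generic storage buffer.

    ```c
    #include <sys/socket.h>
    #include <sys/un.h>

    /*
     * Declare dst as a typed view of src and refuse to compile if the
     * pointed-to type would not fit in struct sockaddr_storage.
     */
    #define DECLARE_SOCKADDR_SKETCH(type, dst, src)                          \
        _Static_assert(sizeof(*(type)0) <= sizeof(struct sockaddr_storage),  \
                       "address type larger than sockaddr_storage");         \
        type dst = (type)(src)

    /* Example user: read the family through a sockaddr_un view. */
    static sa_family_t family_of(struct sockaddr_storage *name)
    {
        DECLARE_SOCKADDR_SKETCH(struct sockaddr_un *, sunaddr, name);

        return sunaddr->sun_family;
    }
    ```

    The check costs nothing at run time; an oversized address structure is
    rejected when the code is compiled.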
     

19 Dec, 2013

1 commit


18 Dec, 2013

1 commit

  • This is similar to the set_peek_off patch where calling bind while the
    socket is stuck in unix_dgram_recvmsg() will block and cause a hung task
    spew after a while.

    This is also the last place that did a straightforward mutex_lock(), so
    there shouldn't be any more of these patches.

    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
     

11 Dec, 2013

1 commit

  • unix_dgram_recvmsg() will hold the readlock of the socket until recv
    is complete.

    At the same time, we may try to setsockopt(SO_PEEK_OFF), which will hang until
    unix_dgram_recvmsg() completes (which can take a while) without allowing
    us to break out of it, triggering a hung task spew.

    Instead, allow set_peek_off to fail, this way userspace will not hang.

    Signed-off-by: Sasha Levin
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Sasha Levin
     

07 Dec, 2013

1 commit


21 Nov, 2013

1 commit


20 Oct, 2013

1 commit

  • In the case of credentials passing in unix stream sockets (dgram
    sockets seem not affected), we get a rather sparse race after
    commit 16e5726 ("af_unix: dont send SCM_CREDENTIALS by default").

    We have a stream server on receiver side that requests credential
    passing from senders (e.g. nc -U). Since we need to set SO_PASSCRED
    on each spawned/accepted socket on server side to 1 first (as it's
    not inherited), it can happen that in the time between accept() and
    setsockopt() we get interrupted, the sender is being scheduled and
    continues with passing data to our receiver. At that time SO_PASSCRED
    is neither set on sender nor receiver side, hence in cmsg's
    SCM_CREDENTIALS we get eventually pid:0, uid:65534, gid:65534
    (== overflow{u,g}id) instead of what we actually would like to see.

    On the sender side, here nc -U, the tests in maybe_add_creds()
    invoked through unix_stream_sendmsg() would fail, as at that exact
    time, as mentioned, the sender has neither SO_PASSCRED on his side
    nor sees it on the server side, and we have a valid 'other' socket
    in place. Thus, sender believes it would just look like a normal
    connection, not needing/requesting SO_PASSCRED at that time.

    As reverting 16e5726 would not be an option due to the significant
    performance regression reported when having creds always passed,
    one way/trade-off to prevent that would be to set SO_PASSCRED on
    the listener socket and allow inheriting these flags to the spawned
    socket on server side in accept(). It seems also logical to do so
    if we'd tell the listener socket to pass those flags onwards, and
    would fix the race.

    Before, strace:

    recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}],
    msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
    cmsg_type=SCM_CREDENTIALS{pid=0, uid=65534, gid=65534}},
    msg_flags=0}, 0) = 5

    After, strace:

    recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}],
    msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
    cmsg_type=SCM_CREDENTIALS{pid=11580, uid=1000, gid=1000}},
    msg_flags=0}, 0) = 5

    Signed-off-by: Daniel Borkmann
    Cc: Eric Dumazet
    Cc: Eric W. Biederman
    Signed-off-by: David S. Miller

    Daniel Borkmann
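
    From the server's point of view, the change means SO_PASSCRED can be
    set once on the listening socket, before any client can connect. The
    helper below is a user-space sketch under that assumption; the name
    passcred_listener is hypothetical, and it binds with an address length
    of sizeof(sa_family_t) so that Linux autobinds the socket to an
    abstract address, as the getsockname() testcase elsewhere in this log
    does.

    ```c
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Returns a listening AF_UNIX stream socket with SO_PASSCRED set, or -1. */
    static int passcred_listener(void)
    {
        struct sockaddr_un addr;
        int one = 1;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;

        /* Set the option before listen(), so there is no unprotected window. */
        if (setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &one, sizeof(one)) < 0)
            goto fail;

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        /* Family-only address length: autobind to an abstract name. */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(sa_family_t)) < 0)
            goto fail;

        if (listen(fd, 16) < 0)
            goto fail;

        return fd;

    fail:
        close(fd);
        return -1;
    }
    ```

    On kernels with this patch, sockets returned by accept() on such a
    listener inherit the flag; on older kernels each accepted socket still
    needs its own setsockopt() call.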
     

03 Oct, 2013

1 commit

  • When filling the netlink message we fail to wipe the pad field,
    and therefore leak one byte of heap memory to userland. Fix this by
    setting pad to 0.

    Signed-off-by: Mathias Krause
    Signed-off-by: David S. Miller

    Mathias Krause
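
    The class of bug is easy to model outside the kernel. In the sketch
    below, toy_diag_msg and fill_diag are hypothetical and only loosely
    shaped like a diag message; the point is that a message built in
    reused memory must have every byte written, including explicit
    padding fields, before it is handed to the reader.

    ```c
    #include <stdint.h>

    /* Hypothetical message layout with an explicit one-byte pad field. */
    struct toy_diag_msg {
        uint8_t  family;
        uint8_t  type;
        uint8_t  state;
        uint8_t  pad;
        uint32_t ino;
    };

    /* Fill a message in place; buf may contain stale data from earlier use. */
    static void fill_diag(struct toy_diag_msg *m, uint8_t family,
                          uint8_t type, uint8_t state, uint32_t ino)
    {
        m->family = family;
        m->type = type;
        m->state = state;
        m->pad = 0;      /* without this, one stale byte reaches the reader */
        m->ino = ino;
    }
    ```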
     

12 Aug, 2013

1 commit

  • commit e370a723632 ("af_unix: improve STREAM behavior with fragmented
    memory") added a bug on large send() because the
    skb_copy_datagram_from_iovec() call always starts from the beginning
    of the iovec.

    We must instead use the @sent variable to properly skip the
    already processed part.

    Reported-by: Hannes Frederic Sowa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Aug, 2013

2 commits

  • Adding paged frags skbs to af_unix sockets introduced a performance
    regression on large sends because of additional page allocations, even
    if each skb could carry at least 100% more payload than before.

    We can instruct sock_alloc_send_pskb() to attempt high order
    allocations.

    Most of the time, it does a single page allocation instead of 8.

    I added an additional parameter to sock_alloc_send_pskb() to
    let other users opt in to this new feature in follow-up patches.

    Tested:

    Before patch :

    $ netperf -t STREAM_STREAM
    STREAM STREAM TEST
    Recv    Send    Send
    Socket  Socket  Message  Elapsed
    Size    Size    Size     Time     Throughput
    bytes   bytes   bytes    secs.    10^6bits/sec

    2304    212992  212992   10.00    46861.15

    After patch :

    $ netperf -t STREAM_STREAM
    STREAM STREAM TEST
    Recv    Send    Send
    Socket  Socket  Message  Elapsed
    Size    Size    Size     Time     Throughput
    bytes   bytes   bytes    secs.    10^6bits/sec

    2304    212992  212992   10.00    57981.11

    Signed-off-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • unix_stream_sendmsg() currently uses order-2 allocations,
    and we had numerous reports this can fail.

    The __GFP_REPEAT flag present in sock_alloc_send_pskb() is
    not helping.

    This patch extends the work done in commit eb6a24816b247c
    ("af_unix: reduce high order page allocations") for
    datagram sockets.

    This opens the possibility of zero copy IO (splice() and
    friends).

    The trick is to not use skb_pull() anymore in recvmsg() path,
    and instead add a @consumed field in UNIXCB() to track amount
    of already read payload in the skb.

    There is a performance regression for large sends
    because of extra page allocations that will be addressed
    in a follow-up patch, allowing sock_alloc_send_pskb()
    to attempt high order page allocations.

    Signed-off-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: David S. Miller

    Eric Dumazet
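
    The @consumed idea can be sketched in a few lines of plain C. The
    names toy_skb and toy_read are hypothetical; the sketch only shows
    that partial reads advance a counter while the buffer itself is left
    untouched, instead of trimming the front of the buffer as skb_pull()
    would.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical stand-in for an skb plus its per-skb consumed counter. */
    struct toy_skb {
        const char *data;   /* never modified by readers */
        size_t len;
        size_t consumed;    /* bytes already handed to the reader */
    };

    /* Copy up to cap unread bytes into out and account for them. */
    static size_t toy_read(struct toy_skb *skb, char *out, size_t cap)
    {
        size_t left = skb->len - skb->consumed;
        size_t take = left < cap ? left : cap;

        memcpy(out, skb->data + skb->consumed, take);
        skb->consumed += take;
        return take;
    }
    ```

    Once consumed equals len, the buffer has been fully read and can be
    released.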
     

10 Jul, 2013

1 commit

  • Pull networking updates from David Miller:
    "This is a re-do of the net-next pull request for the current merge
    window. The only difference from the one I made the other day is that
    this has Eliezer's interface renames and the timeout handling changes
    made based upon your feedback, as well as a few bug fixes that have
    trickled in.

    Highlights:

    1) Low latency device polling, eliminating the cost of interrupt
    handling and context switches. Allows direct polling of a network
    device from socket operations, such as recvmsg() and poll().

    Currently ixgbe, mlx4, and bnx2x support this feature.

    Full high level description, performance numbers, and design in
    commit 0a4db187a999 ("Merge branch 'll_poll'")

    From Eliezer Tamir.

    2) With the routing cache removed, ip_check_mc_rcu() gets exercised
    more than ever before in the case where we have lots of multicast
    addresses. Use a hash table instead of a simple linked list, from
    Eric Dumazet.

    3) Add driver for Atheros QCA98xx 802.11ac wireless devices, from
    Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
    Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

    4) Support reporting the TUN device persist flag to userspace, from
    Pavel Emelyanov.

    5) Allow controlling network device VF link state using netlink, from
    Rony Efraim.

    6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

    7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
    Daniel Borkmann and Eric Dumazet.

    8) Allow controlling of TCP quickack behavior on a per-route basis,
    from Cong Wang.

    9) Several bug fixes and improvements to vxlan from Stephen
    Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
    support receiving on multiple UDP ports.

    10) Major cleanups, particularly in the area of debugging and cookie
    lifetime handling, to the SCTP protocol code. From Daniel
    Borkmann.

    11) Allow packets to cross network namespaces when traversing tunnel
    devices. From Nicolas Dichtel.

    12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
    manner akin to how we monitor real network traffic via ptype_all.
    From Daniel Borkmann.

    13) Several bug fixes and improvements for the new alx device driver,
    from Johannes Berg.

    14) Fix scalability issues in the netem packet scheduler's time queue,
    by using an rbtree. From Eric Dumazet.

    15) Several bug fixes in TCP loss recovery handling, from Yuchung
    Cheng.

    16) Add support for GSO segmentation of MPLS packets, from Simon
    Horman.

    17) Make network notifiers have a real data type for the opaque
    pointer that's passed into them. Use this to properly handle
    network device flag changes in arp_netdev_event(). From Jiri
    Pirko and Timo Teräs.

    18) Convert several drivers over to module_pci_driver(), from Peter
    Huewe.

    19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a
    O(1) calculation instead. From Eric Dumazet.

    20) Support setting of explicit tunnel peer addresses in ipv6, just
    like ipv4. From Nicolas Dichtel.

    21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

    22) Prevent a single high rate flow from overrunning an individual cpu
    during RX packet processing via selective flow shedding. From
    Willem de Bruijn.

    23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
    Dumazet.

    24) Don't just drop GSO packets which are above the TBF scheduler's
    burst limit, chop them up so they are in-bounds instead. Also
    from Eric Dumazet.

    25) VLAN offloads are missed when configured on top of a bridge, fix
    from Vlad Yasevich.

    26) Support IPV6 in ping sockets. From Lorenzo Colitti.

    27) Receive flow steering targets should be updated at poll() time
    too, from David Majnemer.

    28) Fix several corner case regressions in PMTU/redirect handling due
    to the routing cache removal, from Timo Teräs.

    29) We have to be mindful of ipv4 mapped ipv6 sockets in
    udp_v6_push_pending_frames(). From Hannes Frederic Sowa.

    30) Fix L2TP sequence number handling bugs, from James Chapman."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
    drivers/net: caif: fix wrong rtnl_is_locked() usage
    drivers/net: enic: release rtnl_lock on error-path
    vhost-net: fix use-after-free in vhost_net_flush
    net: mv643xx_eth: do not use port number as platform device id
    net: sctp: confirm route during forward progress
    virtio_net: fix race in RX VQ processing
    virtio: support unlocked queue poll
    net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
    Documentation: Fix references to defunct linux-net@vger.kernel.org
    net/fs: change busy poll time accounting
    net: rename low latency sockets functions to busy poll
    bridge: fix some kernel warning in multicast timer
    sfc: Fix memory leak when discarding scattered packets
    sit: fix tunnel update via netlink
    dt:net:stmmac: Add dt specific phy reset callback support.
    dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
    dt:net:stmmac: Allocate platform data only if its NULL.
    net:stmmac: fix memleak in the open method
    ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
    net: ipv6: fix wrong ping_v6_sendmsg return value
    ...

    Linus Torvalds
     

13 Jun, 2013

1 commit

  • Reduce the uses of this unnecessary typedef.

    Done via perl script:

    $ git grep --name-only -w ctl_table net | \
    xargs perl -p -i -e '\
    sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
    s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

    Reflow the modified lines that now exceed 80 columns.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

12 May, 2013

1 commit

  • Avoid waking up every thread sleeping in read call on an AF_UNIX
    socket during suspend and resume by calling a freezable blocking
    call. Previous patches modified the freezer to avoid sending
    wakeups to threads that are blocked in freezable blocking calls.

    This call was selected to be converted to a freezable call because
    it doesn't hold any locks or release any resources when interrupted
    that might be needed by another freezing task or a kernel driver
    during suspend, and is a common site where idle userspace tasks are
    blocked.

    Acked-by: Tejun Heo
    Signed-off-by: Colin Cross
    Signed-off-by: Rafael J. Wysocki

    Colin Cross
     

02 May, 2013

1 commit

  • Using bit fields is dangerous on ppc64/sparc64, as the compiler [1]
    uses 64bit instructions to manipulate them.
    If the 64bit word includes any atomic_t or spinlock_t, we can lose
    critical concurrent changes.

    This is happening in af_unix, where unix_sk(sk)->gc_candidate/
    gc_maybe_cycle/lock share the same 64bit word.

    This leads to fatal deadlock, as one/several cpus spin forever
    on a spinlock that will never be available again.

    A safer way would be to use a long to store flags.
    This way we are sure the compiler/arch won't do bad things.

    As we own unix_gc_lock spinlock when clearing or setting bits,
    we can use the non atomic __set_bit()/__clear_bit().

    recursion_level can share the same 64bit location with the spinlock,
    as it is set only with this spinlock held.

    [1] bug fixed in gcc-4.8.0 :
    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

    Reported-by: Ambrose Feinstein
    Signed-off-by: Eric Dumazet
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: David S. Miller

    Eric Dumazet
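
    A user-space sketch of the "use a long to store flags" approach
    follows. All names (toy_unix_sock, TOY_GC_CANDIDATE, toy_set_bit and
    friends) are hypothetical stand-ins, and the helpers are deliberately
    non-atomic, mirroring the reasoning above that a lock is already held
    whenever the bits change.

    ```c
    /* Bit numbers inside gc_flags; hypothetical names. */
    enum {
        TOY_GC_CANDIDATE,
        TOY_GC_MAYBE_CYCLE
    };

    struct toy_unix_sock {
        int lock;                /* stand-in for a spinlock_t */
        unsigned long gc_flags;  /* a full word of its own, not bit fields */
    };

    /* Non-atomic helpers: callers are assumed to hold the protecting lock. */
    static void toy_set_bit(int nr, unsigned long *addr)
    {
        *addr |= 1UL << nr;
    }

    static void toy_clear_bit(int nr, unsigned long *addr)
    {
        *addr &= ~(1UL << nr);
    }

    static int toy_test_bit(int nr, const unsigned long *addr)
    {
        return (int)((*addr >> nr) & 1UL);
    }
    ```

    Because every update reads and writes only gc_flags, a compiler has no
    reason to emit a wider read-modify-write that also covers the
    neighbouring lock word.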
     

30 Apr, 2013

2 commits


23 Apr, 2013

1 commit

  • Conflicts:
    drivers/net/ethernet/emulex/benet/be_main.c
    drivers/net/ethernet/intel/igb/igb_main.c
    drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
    include/net/scm.h
    net/batman-adv/routing.c
    net/ipv4/tcp_input.c

    The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
    cleanup in net-next to now pass cred structs around.

    The be2net driver had a bug fix in 'net' that overlapped with the VLAN
    interface changes by Patrick McHardy in net-next.

    An IGB conflict existed because in 'net' the build_skb() support was
    reverted, and in 'net-next' there was a comment style fix within that
    code.

    Several batman-adv conflicts were resolved by making sure that all
    calls to batadv_is_my_mac() are changed to have a new bat_priv first
    argument.

    Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
    rewrite in 'net-next', mostly overlapping changes.

    Thanks to Stephen Rothwell and Antonio Quartulli for help with several
    of these merge resolutions.

    Signed-off-by: David S. Miller

    David S. Miller
     

08 Apr, 2013

1 commit

  • Now that uids and gids are completely encapsulated in kuid_t
    and kgid_t we no longer need to pass struct cred which allowed
    us to test both the uid and the user namespace for equality.

    Passing struct cred potentially allows us to pass the entire group
    list as BSD does but I don't believe the cost of cache line misses
    justifies retaining code for a future potential application.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman