13 Aug, 2013

1 commit


12 Aug, 2013

1 commit

  • commit e370a723632 ("af_unix: improve STREAM behavior with fragmented
    memory") added a bug on large send() because the
    skb_copy_datagram_from_iovec() call always starts from the beginning
    of iovec.

    We must instead use the @sent variable to properly skip the
    already processed part.

    Reported-by: Hannes Frederic Sowa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
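
A minimal userspace sketch of the idea behind this fix, not the kernel code itself: when a large send is split into several chunks, each copy has to start at the running @sent offset into the iovec rather than at its beginning. The helper names copy_from_iovec_at() and send_in_chunks() are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy up to `len` bytes out of an iovec array, starting `offset` bytes
 * into the logical byte stream it describes. Returns the bytes copied.
 * Hypothetical helper; it only mimics the role of the offset argument.
 */
static size_t copy_from_iovec_at(void *dst, const struct iovec *iov,
                                 int iovcnt, size_t offset, size_t len)
{
    size_t copied = 0;

    for (int i = 0; i < iovcnt && copied < len; i++) {
        size_t seg = iov[i].iov_len;

        if (offset >= seg) {        /* segment lies entirely before offset */
            offset -= seg;
            continue;
        }
        size_t n = seg - offset;
        if (n > len - copied)
            n = len - copied;
        memcpy((char *)dst + copied,
               (const char *)iov[i].iov_base + offset, n);
        copied += n;
        offset = 0;                 /* later segments are read from their start */
    }
    return copied;
}

/* Copy `total` bytes in chunks of at most `chunk` bytes, tracking `sent`. */
static size_t send_in_chunks(char *out, const struct iovec *iov, int iovcnt,
                             size_t total, size_t chunk)
{
    size_t sent = 0;

    while (sent < total) {
        size_t want = total - sent < chunk ? total - sent : chunk;
        /* Passing `sent` as the offset skips the part already processed. */
        size_t n = copy_from_iovec_at(out + sent, iov, iovcnt, sent, want);

        if (n == 0)
            break;                  /* iovec is shorter than `total` */
        sent += n;
    }
    return sent;
}
```

Passing 0 instead of `sent` as the offset in send_in_chunks() would reproduce the kind of bug described: every chunk would repeat the first bytes of the iovec.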
     

10 Aug, 2013

11 commits

  • Adding paged frags skbs to af_unix sockets introduced a performance
    regression on large sends because of additional page allocations, even
    if each skb could carry at least 100% more payload than before.

    We can instruct sock_alloc_send_pskb() to attempt high order
    allocations.

    Most of the time, it does a single page allocation instead of 8.

    I added an additional parameter to sock_alloc_send_pskb() to
    let other users opt in to this new feature in follow-up patches.

    Tested:

    Before patch :

    $ netperf -t STREAM_STREAM
    STREAM STREAM TEST
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    2304 212992 212992 10.00 46861.15

    After patch :

    $ netperf -t STREAM_STREAM
    STREAM STREAM TEST
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    2304 212992 212992 10.00 57981.11

    Signed-off-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • unix_stream_sendmsg() currently uses order-2 allocations,
    and we had numerous reports this can fail.

    The __GFP_REPEAT flag present in sock_alloc_send_pskb() is
    not helping.

    This patch extends the work done in commit eb6a24816b247c
    ("af_unix: reduce high order page allocations") for
    datagram sockets.

    This opens the possibility of zero copy IO (splice() and
    friends).

    The trick is to not use skb_pull() anymore in the recvmsg() path,
    and instead add a @consumed field in UNIXCB() to track the amount
    of already read payload in the skb.

    There is a performance regression for large sends
    because of extra page allocations that will be addressed
    in a follow-up patch, allowing sock_alloc_send_pskb()
    to attempt high order page allocations.

    Signed-off-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: David S. Miller

    Eric Dumazet
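
The @consumed idea can be illustrated outside the kernel. The sketch below is an assumption-laden analogy, not the af_unix code: struct msg_buf and msg_read() are hypothetical names, and the buffer is plain memory rather than an skb. The reader advances an offset instead of trimming the front of the buffer, so the underlying data is never modified by partial reads.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a queued message with a read offset. */
struct msg_buf {
    const char *data;     /* payload, left untouched by reads */
    size_t      len;      /* total payload length */
    size_t      consumed; /* bytes already handed to the reader */
};

/* Read up to `want` bytes; returns the number of bytes copied. */
static size_t msg_read(struct msg_buf *m, char *dst, size_t want)
{
    size_t avail = m->len - m->consumed;
    size_t n = want < avail ? want : avail;

    memcpy(dst, m->data + m->consumed, n);
    m->consumed += n;     /* track progress instead of pulling data */
    return n;
}
```

A message is fully read once `consumed` equals `len`, at which point the caller can release it.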
     
  • Encrypt the cookie with both server and client IPv4 addresses,
    such that multi-homed server will grant different cookies
    based on both the source and destination IPs. No client change
    is needed since cookie is opaque to the client.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This reverts commit cda5f98e36576596b9230483ec52bff3cc97eb21.

    As per Vlad's request.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • John W. Linville
     
  • John W. Linville
     
  • With the restructuring of the lksctp.org site, we only allow bug
    reports through the SCTP mailing list linux-sctp@vger.kernel.org,
    not via SF, as SF is only used for web hosting and nothing more.
    While at it, also remove the obvious statement that bugs will be
    fixed and incorporated into the kernel.

    Signed-off-by: Daniel Borkmann
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Get rid of the last module parameter for SCTP and make this
    configurable via sysctl for SCTP like all the rest of SCTP's
    configuration knobs.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Adds the new procfs knobs:

    /proc/sys/net/ipv4/conf/*/igmpv2_unsolicited_report_interval
    /proc/sys/net/ipv4/conf/*/igmpv3_unsolicited_report_interval

    Which will allow userspace configuration of the IGMP unsolicited report
    interval (see below) in milliseconds. The defaults are 10000ms for IGMPv2
    and 1000ms for IGMPv3 in accordance with RFC2236 and RFC3376.

    Background:

    If an IGMP join packet is lost, you will not receive data sent to the
    multicast group, so if no data arrives from that multicast group within
    a period of time after the IGMP join, a second IGMP join is sent. The
    delay between joins is the "IGMP Unsolicited Report Interval".

    Prior to this patch this value was hard coded in the kernel to 10s for
    IGMPv2 and 1s for IGMPv3. 10s is unsuitable for some use-cases, such as
    IPTV as it can cause channel change to be slow in the presence of packet
    loss.

    This patch allows the value to be overridden from userspace for both
    IGMPv2 and IGMPv3 such that it can be tuned according to the network.

    Tested with Wireshark and a simple program to join a (non-existent)
    multicast group. The distribution of timings for the second join differs
    based upon setting the procfs knobs.

    igmpvX_unsolicited_report_interval is intended to follow the pattern
    established by force_igmp_version, and while a procfs entry has been added
    a corresponding sysctl knob has not as it is my understanding that sysctl
    is deprecated[1].

    [1]: http://lwn.net/Articles/247243/

    Signed-off-by: William Manley
    Acked-by: Hannes Frederic Sowa
    Acked-by: Benjamin LaHaise
    Signed-off-by: David S. Miller

    William Manley
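
As a rough illustration of how userspace might set one of these knobs, the sketch below builds the procfs path for an interface and writes a value in milliseconds. The function names are hypothetical, the interface name is supplied by the caller, and writing requires root on a kernel that carries this patch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build the procfs path of the unsolicited report interval knob for
 * `ifname` and IGMP `version` (2 or 3). Returns 0 on success, -1 on a
 * bad version or if `buf` is too small. Hypothetical helper.
 */
static int igmp_interval_path(char *buf, size_t size,
                              const char *ifname, int version)
{
    if (version != 2 && version != 3)
        return -1;

    int n = snprintf(buf, size,
                     "/proc/sys/net/ipv4/conf/%s/igmpv%d_unsolicited_report_interval",
                     ifname, version);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Write an interval in milliseconds to the knob; 0 on success, -1 on error. */
static int igmp_interval_write_ms(const char *path, unsigned int ms)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;              /* knob missing or insufficient privileges */

    int ok = fprintf(f, "%u\n", ms) > 0;

    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, a caller could build the path for "eth0" and version 3, then write 500 to request a shorter interval than the 1000 ms default.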
     
  • The procfs knob /proc/sys/net/ipv4/conf/*/force_igmp_version allows the
    IGMP protocol version in use to be set explicitly. As a side effect this
    caused the routing cache to be flushed as it was declared as a
    DEVINET_SYSCTL_FLUSHING_ENTRY. Flushing is unnecessary and this patch
    makes it so flushing does not occur.

    Requested by Hannes Frederic Sowa as he was reviewing other patches
    adding procfs entries.

    Suggested-by: Hannes Frederic Sowa
    Signed-off-by: William Manley
    Acked-by: Hannes Frederic Sowa
    Acked-by: Benjamin LaHaise
    Signed-off-by: David S. Miller

    William Manley
     
  • If an IGMP join packet is lost, you will not receive data sent to the
    multicast group, so if no data arrives from that multicast group within
    a period of time after the IGMP join, a second IGMP join is sent. The
    delay between joins is the "IGMP Unsolicited Report Interval".

    Previously this value was hard coded to be chosen randomly between 0-10s.
    This can be too long for some use-cases, such as IPTV as it can cause
    channel change to be slow in the presence of packet loss.

    The value 10s has come from IGMPv2 RFC2236, which was reduced to 1s in
    IGMPv3 RFC3376. This patch makes the kernel use the 1s value from the
    later RFC if we are operating in IGMPv3 mode. IGMPv2 behaviour is
    unaffected.

    Tested with Wireshark and a simple program to join a (non-existent)
    multicast group. The distribution of timings for the second join differs
    based upon setting /proc/sys/net/ipv4/conf/eth0/force_igmp_version.

    Signed-off-by: William Manley
    Acked-by: Hannes Frederic Sowa
    Acked-by: Benjamin LaHaise
    Signed-off-by: David S. Miller

    William Manley
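
The version-dependent choice described in this commit can be sketched as follows. This is an illustration only: the function names are hypothetical, the random input is passed in by the caller, and the real kernel timer logic is not reproduced here. The 10 s and 1 s bounds are the values quoted above from RFC2236 and RFC3376.

```c
/* Upper bound of the unsolicited report delay, in milliseconds. */
static unsigned int unsolicited_report_interval_ms(int igmp_version)
{
    /* IGMPv3 uses the shorter 1 s bound; IGMPv2 keeps 10 s. */
    return igmp_version >= 3 ? 1000u : 10000u;
}

/*
 * Pick a delay in [0, interval) from a caller-supplied random number.
 * Hypothetical helper; the modulo keeps the sketch simple.
 */
static unsigned int pick_report_delay_ms(int igmp_version, unsigned int rnd)
{
    return rnd % unsolicited_report_interval_ms(igmp_version);
}
```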
     

09 Aug, 2013

1 commit

  • With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
    counters do not count the number of frames, but number of aggregated
    segments.

    It's probably too late to change this now.

    This patch adds four new counters, tracking number of frames, regardless
    of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.

    Ip[6]NoECTPkts : Number of packets received with NOECT
    Ip[6]ECT1Pkts : Number of packets received with ECT(1)
    Ip[6]ECT0Pkts : Number of packets received with ECT(0)
    Ip[6]CEPkts : Number of packets received with Congestion Experienced

    lph37:~# nstat | egrep "Pkts|InReceive"
    IpInReceives 1634137 0.0
    Ip6InReceives 3714107 0.0
    Ip6InNoECTPkts 19205 0.0
    Ip6InECT0Pkts 52651828 0.0
    IpExtInNoECTPkts 33630 0.0
    IpExtInECT0Pkts 15581379 0.0
    IpExtInCEPkts 6 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
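
For readers unfamiliar with the four ECN states, the sketch below classifies a packet by the two low-order bits of the IPv4 TOS byte (or IPv6 traffic class) and bumps a per-state counter. The codepoint values follow RFC 3168; the struct and function names are hypothetical, and the `frames` argument is an assumption standing in for the number of frames one received packet represents.

```c
#include <stdint.h>

/* ECN codepoints: the two least significant bits of TOS / traffic class. */
enum ecn_codepoint {
    ECN_NOT_ECT = 0,    /* 00: not ECN-capable transport */
    ECN_ECT_1   = 1,    /* 01: ECT(1) */
    ECN_ECT_0   = 2,    /* 10: ECT(0) */
    ECN_CE      = 3,    /* 11: congestion experienced */
};

/* Hypothetical per-state counters, indexed by enum ecn_codepoint. */
struct ecn_counters {
    uint64_t pkts[4];
};

static enum ecn_codepoint ecn_from_tos(uint8_t tos)
{
    return (enum ecn_codepoint)(tos & 0x03);
}

/* Account `frames` frames under the ECN state carried in `tos`. */
static void ecn_account(struct ecn_counters *c, uint8_t tos,
                        unsigned int frames)
{
    c->pkts[ecn_from_tos(tos)] += frames;
}
```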
     

08 Aug, 2013

8 commits


05 Aug, 2013

1 commit


04 Aug, 2013

5 commits

  • Merge net into net-next to set up some infrastructure Eric
    Dumazet needs for usbnet changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:

    1) Don't ignore user initiated wireless regulatory settings on cards
    with custom regulatory domains, from Arik Nemtsov.

    2) Fix length check of bluetooth information responses, from Jaganath
    Kanakkassery.

    3) Fix misuse of PTR_ERR in btusb, from Adam Lee.

    4) Handle rfkill properly while iwlwifi devices are offline, from
    Emmanuel Grumbach.

    5) Fix r815x devices DMA'ing to stack buffers, from Hayes Wang.

    6) Kernel info leak in ATM packet scheduler, from Dan Carpenter.

    7) 8139cp doesn't check for DMA mapping errors, from Neil Horman.

    8) Fix bridge multicast code to not snoop when no querier exists,
    otherwise multicast traffic is lost. From Linus Lüssing.

    9) Avoid soft lockups in fib6_run_gc(), from Michal Kubecek.

    10) Fix races in automatic address assignment on ipv6, which can result
    in incorrect lifetime assignments. From Jiri Benc.

    11) Cure build bustage when CONFIG_NET_LL_RX_POLL is not set and rename
    it CONFIG_NET_RX_BUSY_POLL to eliminate the last reference to the
    original naming of this feature. From Cong Wang.

    12) Fix crash in TIPC when server socket creation fails, from Ying Xue.

    13) macvlan_changelink() silently succeeds when it shouldn't, from
    Michael S Tsirkin.

    14) HTB packet scheduler can crash due to sign extension, fix from
    Stephen Hemminger.

    15) With the cable unplugged, r8169 prints out a message every 10
    seconds, make it netif_dbg() instead of netif_warn(). From Peter
    Wu.

    16) Fix memory leak in rtm_to_ifaddr(), from Daniel Borkmann.

    17) sis900 gets spurious TX queue timeouts due to mismanagement of link
    carrier state, from Denis Kirjanov.

    18) Validate somaxconn sysctl to make sure it fits inside of a u16.
    From Roman Gushchin.

    19) Fix MAC address filtering on qlcnic, from Shahed Shaikh.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (68 commits)
    qlcnic: Fix for flash update failure on 83xx adapter
    qlcnic: Fix link speed and duplex display for 83xx adapter
    qlcnic: Fix link speed display for 82xx adapter
    qlcnic: Fix external loopback test.
    qlcnic: Removed adapter series name from warning messages.
    qlcnic: Free up memory in error path.
    qlcnic: Fix ingress MAC learning
    qlcnic: Fix MAC address filter issue on 82xx adapter
    net: ethernet: davinci_emac: drop IRQF_DISABLED
    netlabel: use domain based selectors when address based selectors are not available
    net: check net.core.somaxconn sysctl values
    sis900: Fix the tx queue timeout issue
    net: rtm_to_ifaddr: free ifa if ifa_cacheinfo processing fails
    r8169: remove "PHY reset until link up" log spam
    net: ethernet: cpsw: drop IRQF_DISABLED
    htb: fix sign extension bug
    macvlan: handle set_promiscuity failures
    macvlan: better mode validation
    tipc: fix oops when creating server socket fails
    net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL
    ...

    Linus Torvalds
     
  • Pull nfsd bugfixes from Bruce Fields:
    "Most of this is due to a screwup on my part -- some gss-proxy crashes
    got fixed before the merge window but somehow never made it out of a
    temporary git repo on my laptop...."

    * 'for-3.11' of git://linux-nfs.org/~bfields/linux:
    svcrpc: set cr_gss_mech from gss-proxy as well as legacy upcall
    svcrpc: fix kfree oops in gss-proxy code
    svcrpc: fix gss-proxy xdr decoding oops
    svcrpc: fix gss_rpc_upcall create error
    NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

    Linus Torvalds
     
  • This change brings the suppressor attribute names into line; it also changes
    the data types to provide a more consistent interface.

    While -1 indicates that the suppressor is not enabled, values >= 0 for
    suppress_prefixlen or suppress_ifgroup reject routing decisions violating the
    constraint.

    This changes the previously presented behaviour of suppress_prefixlen, where a
    prefix length _less_ than the attribute value was rejected. After this change,
    a prefix length less than *or* equal to the value is considered a violation of
    the rule constraint.

    It also changes the default values for default and newly added rules (disabling
    any suppression for those).

    Signed-off-by: Stefan Tomanek
    Signed-off-by: David S. Miller

    Stefan Tomanek
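
The new suppress_prefixlen semantics can be expressed as a small predicate. This is a sketch of the rule as stated in the commit message, not the kernel implementation; the function name is hypothetical.

```c
#include <stdbool.h>

/*
 * Return true if a routing decision with `route_prefixlen` violates the
 * rule constraint. A negative `suppress_prefixlen` (-1) means the
 * suppressor is disabled.
 */
static bool rule_suppresses(int suppress_prefixlen, int route_prefixlen)
{
    if (suppress_prefixlen < 0)
        return false;
    /* After this change: less than *or equal to* the value is rejected. */
    return route_prefixlen <= suppress_prefixlen;
}
```

With a value of 0, only a default route (prefix length 0) is rejected, while any more specific route is still accepted.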
     
  • This patch cleans up two points in the usage of vlan_dev_priv(dev):
    * In vlan_dev.c/vlan_dev_hard_header, we should use the var *vlan directly
    after grabbing the pointer at the beginning with
    *vlan = vlan_dev_priv(dev);
    when we need to access the fields of *vlan.
    * In vlan.c/register_vlan_device, add the var *vlan pointer
    struct vlan_dev_priv *vlan;
    to clean up the code that accesses the fields of vlan_dev_priv(new_dev).

    Signed-off-by: Wang Sheng-Hui
    Signed-off-by: David S. Miller

    Wang Sheng-Hui
     

03 Aug, 2013

12 commits

  • NetLabel has the ability to selectively assign network security labels
    to outbound traffic based on either the LSM's "domain" (different for
    each LSM), the network destination, or a combination of both. Depending
    on the type of traffic, local or forwarded, and the type of traffic
    selector, domain or address based, different hooks are used to label the
    traffic; the goal being minimal overhead.

    Unfortunately, there is a bug such that a system using NetLabel domain
    based traffic selectors does not correctly label outbound local traffic
    that is not assigned to a socket. The issue is that in these cases
    the associated NetLabel hook only looks at the address based selectors
    and not the domain based selectors. This patch corrects this by
    checking both the domain and address based selectors so that the correct
    labeling is applied, regardless of the configuration type.

    In order to accomplish this fix, this patch also simplifies some of the
    NetLabel domainhash structures to use a more common outbound traffic
    mapping type: struct netlbl_dommap_def. This simplifies some of the code
    in this patch and paves the way for further simplifications in the
    future.

    Signed-off-by: Paul Moore
    Signed-off-by: David S. Miller

    Paul Moore
     
  • dev->ndo_neigh_setup() might need some of the values of neigh_parms, so
    populate them before calling it.

    Signed-off-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    Veaceslav Falico
     
  • Variable ptr is being assigned, but never used, so just remove it.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This change adds the ability to suppress a routing decision based upon the
    interface group the selected interface belongs to. This allows it to
    exclude specific devices from a routing decision.

    Signed-off-by: Stefan Tomanek
    Signed-off-by: David S. Miller

    Stefan Tomanek
     
  • It's possible to assign an invalid value to the net.core.somaxconn
    sysctl variable, because there are no checks at all.

    The sk_max_ack_backlog field of the sock structure is defined as
    unsigned short. Therefore, the backlog argument in inet_listen()
    shouldn't exceed USHRT_MAX. The backlog argument in the listen() syscall
    is truncated to the somaxconn value. So, the somaxconn value shouldn't
    exceed 65535 (USHRT_MAX).
    Also, negative values of somaxconn are meaningless.

    before:
    $ sysctl -w net.core.somaxconn=256
    net.core.somaxconn = 256
    $ sysctl -w net.core.somaxconn=65536
    net.core.somaxconn = 65536
    $ sysctl -w net.core.somaxconn=-100
    net.core.somaxconn = -100

    after:
    $ sysctl -w net.core.somaxconn=256
    net.core.somaxconn = 256
    $ sysctl -w net.core.somaxconn=65536
    error: "Invalid argument" setting key "net.core.somaxconn"
    $ sysctl -w net.core.somaxconn=-100
    error: "Invalid argument" setting key "net.core.somaxconn"

    Based on a prior patch from Changli Gao.

    Signed-off-by: Roman Gushchin
    Reported-by: Changli Gao
    Suggested-by: Eric Dumazet
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Roman Gushchin
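
The range check described here amounts to rejecting anything outside 0..USHRT_MAX. The sketch below shows that check in isolation; set_somaxconn() is a hypothetical name and does not correspond to a kernel function.

```c
#include <errno.h>
#include <limits.h>

/*
 * Store `requested` in *somaxconn only if it fits in an unsigned short.
 * Returns 0 on success and -EINVAL otherwise, leaving *somaxconn as is.
 */
static int set_somaxconn(int *somaxconn, long requested)
{
    if (requested < 0 || requested > USHRT_MAX)
        return -EINVAL;

    *somaxconn = (int)requested;
    return 0;
}
```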
     
  • By using sizeof(_hdr), net/ipv6/raw.c:icmpv6_filter implicitly assumes
    that any valid ICMPv6 message is at least eight bytes long, i.e., that
    the message body is at least four bytes.

    The DIS message of RPL (RFC 6550 section 6.2, from the 6LoWPAN world),
    has a minimum length of only six bytes, and is thus blocked by
    icmpv6_filter.

    RFC 4443 seems to allow even a zero-sized body, making the minimum
    allowable message size four bytes.

    Signed-off-by: Werner Almesberger
    Signed-off-by: David S. Miller

    Werner Almesberger
     
  • "_hdr" should hold the ICMPv6 header while "hdr" is the pointer to it.
    This worked by accident.

    Signed-off-by: Werner Almesberger
    Signed-off-by: David S. Miller

    Werner Almesberger
     
  • For ethernet frames, eth_type_trans() already parses the header, so one
    can skip this when checking the frame size.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
     
  • Since tpacket_fill_skb() parses the protocol field in ethernet frames'
    headers, it's easy to see if any passed frame is a VLAN one and account
    for the extended size.

    But as the real protocol does not turn up before tpacket_fill_skb()
    runs which in turn also checks the frame length, move the max frame
    length calculation into the function.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
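
A simplified view of the size calculation: the permitted frame length is the MTU plus the Ethernet header, plus four more bytes when the frame's EtherType marks it as 802.1Q tagged. The sketch below works on a raw byte buffer and uses locally defined constants; it is an assumption-based illustration, not the tpacket_fill_skb() code.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HEADER_LEN   14      /* dst MAC + src MAC + EtherType */
#define VLAN_TAG_LEN     4       /* 802.1Q tag */
#define ETHERTYPE_8021Q  0x8100  /* EtherType of a VLAN-tagged frame */

/*
 * Maximum frame length allowed for `frame` on a device with `mtu`.
 * Hypothetical helper: VLAN frames get VLAN_TAG_LEN extra bytes.
 */
static size_t max_frame_len(const uint8_t *frame, size_t len, size_t mtu)
{
    size_t max = mtu + ETH_HEADER_LEN;

    if (len >= ETH_HEADER_LEN) {
        /* EtherType sits in bytes 12..13, in network byte order. */
        uint16_t proto = (uint16_t)((frame[12] << 8) | frame[13]);

        if (proto == ETHERTYPE_8021Q)
            max += VLAN_TAG_LEN;
    }
    return max;
}
```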
     
  • This may be necessary when the SKB is passed to other layers on the go,
    which check the protocol field on their own. An example is a VLAN packet
    sent out using AF_PACKET on a bridge interface. The bridging code checks
    the SKB size, accounting for any VLAN header only if the protocol field
    is set accordingly.

    Note that eth_type_trans() sets skb->dev to the passed argument, so this
    can be skipped in packet_snd() for ethernet frames, as well.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
     
  • Commit 5c766d642 ("ipv4: introduce address lifetime") leaves the ifa
    resource that was allocated via inet_alloc_ifa() unfreed when the
    function returns -EINVAL. Thus, free it first via inet_free_ifa().

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • When userspace passes a large priority value,
    the assignment of the unsigned value hopt->prio
    to signed int cl->prio causes cl->prio to become negative and the
    comparison with TC_HTB_NUMPRIO is always false.

    The result is that HTB crashes by referencing outside
    the array when processing packets. With this patch the large value
    wraps around like other values outside the normal range.

    See: https://bugzilla.kernel.org/show_bug.cgi?id=60669

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
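
The failure mode can be reproduced in a few lines of userspace C. In this sketch NUMPRIO stands in for TC_HTB_NUMPRIO with an assumed value of 8, both function names are hypothetical, and clamping to the highest index is an assumption about how out-of-range values are handled. The unsigned-to-int conversion is implementation-defined in C; on common compilers 0xFFFFFFFF becomes -1.

```c
#include <stdint.h>

#define NUMPRIO 8   /* stand-in for TC_HTB_NUMPRIO; value assumed */

/* Buggy pattern: the unsigned input lands in a signed int first. */
static int clamp_prio_signed(uint32_t user_prio)
{
    int prio = (int)user_prio;  /* a huge value typically turns negative */

    if (prio >= NUMPRIO)        /* false for negative prio, so no clamp */
        prio = NUMPRIO - 1;
    return prio;                /* may be negative: bad array index */
}

/* Fixed pattern: keep the value unsigned so the range check works. */
static uint32_t clamp_prio_unsigned(uint32_t user_prio)
{
    uint32_t prio = user_prio;

    if (prio >= NUMPRIO)
        prio = NUMPRIO - 1;
    return prio;                /* always in 0..NUMPRIO-1 */
}
```

Using the result of clamp_prio_signed() as an array index is what leads to the out-of-bounds reference described in the commit.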