02 Oct, 2009

1 commit

  • This patch against v2.6.31 adds support for route lookup using sk_mark in some
    more places. The benefits from this patch are the following.
    First, SO_MARK option now has effect on UDP sockets too.
    Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing
    lookup correctly if TCP sockets with SO_MARK were used.

    Signed-off-by: Atis Elsts
    Acked-by: Eric Dumazet

    Atis Elsts
     

15 Sep, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
    netxen: update copyright
    netxen: fix tx timeout recovery
    netxen: fix file firmware leak
    netxen: improve pci memory access
    netxen: change firmware write size
    tg3: Fix return ring size breakage
    netxen: build fix for INET=n
    cdc-phonet: autoconfigure Phonet address
    Phonet: back-end for autoconfigured addresses
    Phonet: fix netlink address dump error handling
    ipv6: Add IFA_F_DADFAILED flag
    net: Add DEVTYPE support for Ethernet based devices
    mv643xx_eth.c: remove unused txq_set_wrr()
    ucc_geth: Fix hangs after switching from full to half duplex
    ucc_geth: Rearrange some code to avoid forward declarations
    phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
    drivers/net/phy: introduce missing kfree
    drivers/net/wan: introduce missing kfree
    net: force bridge module(s) to be GPL
    Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
    ...

    Fixed up trivial conflicts:

    - arch/x86/include/asm/socket.h

    converted to in the x86 tree. The generic
    header has the same new #define's, so that works out fine.

    - drivers/net/tun.c

    fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
    switched over to using 'tun->socket.sk' instead of the redundantly
    available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
    to the TUN driver") which added a new 'tun->sk' use.

    Noted in 'next' by Stephen Rothwell.

    Linus Torvalds
     

03 Sep, 2009

1 commit

  • Christoph Lameter pointed out that packet drops at qdisc level where not
    accounted in SNMP counters. Only if application sets IP_RECVERR, drops
    are reported to user (-ENOBUFS errors) and SNMP counters updated.

    IP_RECVERR is used to enable extended reliable error message passing,
    but these are not needed to update system wide SNMP stats.

    This patch changes things a bit to allow SNMP counters to be updated,
    regardless of IP_RECVERR being set or not on the socket.

    Example after an UDP tx flood
    # netstat -s
    ...
    IP:
    1487048 outgoing packets dropped
    ...
    Udp:
    ...
    SndbufErrors: 1487048

    send() syscalls, do however still return an OK status, to not
    break applications.

    Note : send() manual page explicitly says for -ENOBUFS error :

    "The output queue for a network interface was full.
    This generally indicates that the interface has stopped sending,
    but may be caused by transient congestion.
    (Normally, this does not occur in Linux. Packets are just silently
    dropped when a device queue overflows.) "

    This is not true for IP_RECVERR enabled sockets : a send() syscall
    that hit a qdisc drop returns an ENOBUFS error.

    Many thanks to Christoph, David, and last but not least, Alexey !

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Aug, 2009

1 commit


12 Jul, 2009

1 commit

  • After commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
    (net: No more expensive sock_hold()/sock_put() on each tx)
    we do not take any more references on sk->sk_refcnt on outgoing packets.

    I forgot to delete two __sock_put() from ip_push_pending_frames()
    and ip6_push_pending_frames().

    Reported-by: Emil S Tantilov
    Signed-off-by: Eric Dumazet
    Tested-by: Emil S Tantilov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jun, 2009

1 commit

  • One of the problem with sock memory accounting is it uses
    a pair of sock_hold()/sock_put() for each transmitted packet.

    This slows down bidirectional flows because the receive path
    also needs to take a refcount on socket and might use a different
    cpu than transmit path or transmit completion path. So these
    two atomic operations also trigger cache line bounces.

    We can see this in tx or tx/rx workloads (media gateways for example),
    where sock_wfree() can be in top five functions in profiles.

    We use this sock_hold()/sock_put() so that sock freeing
    is delayed until all tx packets are completed.

    As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
    by one unit at init time, until sk_free() is called.
    Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
    to decrement initial offset and atomicaly check if any packets
    are in flight.

    skb_set_owner_w() doesnt call sock_hold() anymore

    sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
    reached 0 to perform the final freeing.

    Drawback is that a skb->truesize error could lead to unfreeable sockets, or
    even worse, prematurely calling __sk_free() on a live socket.

    Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
    on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
    contention point. 5 % speedup on a UDP transmit workload (depends
    on number of flows), lowering TX completion cpu usage.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2009

1 commit


03 Jun, 2009

2 commits

  • Define three accessors to get/set dst attached to a skb

    struct dst_entry *skb_dst(const struct sk_buff *skb)

    void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

    void skb_dst_drop(struct sk_buff *skb)
    This one should replace occurrences of :
    dst_release(skb->dst)
    skb->dst = NULL;

    Delete skb->dst field

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

    Delete skb->rtable field

    Setting rtable is not allowed, just set dst instead as rtable is an alias.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Apr, 2009

1 commit

  • The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
    OutMcastOctets:
    http://tools.ietf.org/html/rfc4293
    But it seems we don't track those in any way that easy to separate from other
    protocols. This patch adds those missing counters to the stats file. Tested
    successfully by me

    With help from Eric Dumazet.

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller

    Neil Horman
     

16 Feb, 2009

1 commit


25 Nov, 2008

2 commits

  • We can reduce pressure on dst entry refcount that slowdown UDP transmit
    path on SMP machines. This pressure is visible on RTP servers when
    delivering content to mediagateways, especially big ones, handling
    thousand of streams. Several cpus send UDP frames to the same
    destination, hence use the same dst entry.

    This patch makes ip_push_pending_frames() steal the refcount its
    callers had to take when filling inet->cork.dst.

    This doesnt avoid all refcounting, but still gives speedups on SMP,
    on UDP/RAW transmit path.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We can reduce pressure on dst entry refcount that slowdown UDP transmit
    path on SMP machines. This pressure is visible on RTP servers when
    delivering content to mediagateways, especially big ones, handling
    thousand of streams. Several cpus send UDP frames to the same
    destination, hence use the same dst entry.

    This patch makes ip_append_data() eventually steal the refcount its
    callers had to take on the dst entry.

    This doesnt avoid all refcounting, but still gives speedups on SMP,
    on UDP/RAW transmit path

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Nov, 2008

1 commit


01 Oct, 2008

1 commit


26 Jul, 2008

1 commit

  • Removes legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better to automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuively BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON() though some might actually be
    promoted to BUG_ON() but I left that to future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

17 Jul, 2008

1 commit


15 Jul, 2008

1 commit


12 Jun, 2008

1 commit


30 Apr, 2008

1 commit

  • Problem: ip_append_data() could wrongly generate a chained skb for
    devices which support UFO. When sk_write_queue is not empty
    (e.g. MSG_MORE), __instead__ of appending data into the next nr_frag
    of the queued skb, a new chained skb is created.

    I would normally assume UFO device should get data in nr_frags and not
    in frag_list. Later the udp4_hwcsum_outgoing() resets csum to NONE
    and skb_gso_segment() has oops.

    Proposal:
    1. Even length is less than mtu, employ ip_ufo_append_data()
    and append data to the __existed__ skb in the sk_write_queue.

    2. ip_ufo_append_data() is fixed due to a wrong manipulation of
    peek-ing and later enqueue-ing of the same skb. Now, enqueuing is
    always performed, because on error the further
    ip_flush_pending_frames() would release the queued skb.

    Signed-off-by: Kostya B
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Kostya B
     

26 Mar, 2008

1 commit


25 Mar, 2008

2 commits


06 Mar, 2008

1 commit


01 Feb, 2008

2 commits

  • A userspace program may wish to set the mark for each packets its send
    without using the netfilter MARK target. Changing the mark can be used
    for mark based routing without netfilter or for packet filtering.

    It requires CAP_NET_ADMIN capability.

    Signed-off-by: Laszlo Attila Toth
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Laszlo Attila Toth
     
  • When ip_fragment has to hit the slow path the value of skb->truesize
    may go out of sync because we would have updated it without changing
    the packet length. This violates the constraints on truesize.

    This patch postpones the update of skb->truesize to prevent this.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

29 Jan, 2008

6 commits


23 Jan, 2008

2 commits


07 Nov, 2007

1 commit


24 Oct, 2007

1 commit


16 Oct, 2007

1 commit

  • Now that we don't pass double skb pointers to nf_hook_slow anymore, gcc
    can generate tail calls for some of the netfilter hook okfn invocations,
    so there is no need to inline the functions anymore. This caused huge
    code bloat since we ended up with one inlined version and one out-of-line
    version since we pass the address to nf_hook_slow.

    Before:
    text data bss dec hex filename
    8997385 1016524 524652 10538561 a0ce41 vmlinux

    After:
    text data bss dec hex filename
    8994009 1016524 524652 10535185 a0c111 vmlinux
    -------------------------------------------------------
    -3376

    All cases have been verified to generate tail-calls with and without
    netfilter. The okfns in ipmr and xfrm4_input still remain inline because
    gcc can't generate tail-calls for them.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

11 Oct, 2007

2 commits

  • Since hardware header operations are part of the protocol class
    not the device instance, make them into a separate object and
    save memory.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • Background: RFC 4293 deprecates existing individual, named ICMP
    type counters to be replaced with the ICMPMsgStatsTable. This table
    includes entries for both IPv4 and IPv6, and requires counting of all
    ICMP types, whether or not the machine implements the type.

    These patches "remove" (but not really) the existing counters, and
    replace them with the ICMPMsgStats tables for v4 and v6.
    It includes the named counters in the /proc places they were, but gets the
    values for them from the new tables. It also counts packets generated
    from raw socket output (e.g., OutEchoes, MLD queries, RA's from
    radvd, etc).

    Changes:
    1) create icmpmsg_statistics mib
    2) create icmpv6msg_statistics mib
    3) modify existing counters to use these
    4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
    listed by number for easy SNMP parsing
    5) modify /proc/net/snmp printing for "Icmp" to get the named data
    from new counters.

    Signed-off-by: David L Stevens
    Signed-off-by: David S. Miller

    David L Stevens
     

14 Aug, 2007

1 commit