13 Jul, 2010

1 commit

  • A new boolean flag, no_autobind, is added to struct proto to avoid the autobind
    calls when the protocol is TCP. sock_rps_record_flow() is then called in
    TCP's sendmsg() and sendpage() paths.

    Signed-off-by: Changli Gao
    ----
    include/net/inet_common.h | 4 ++++
    include/net/sock.h | 1 +
    include/net/tcp.h | 8 ++++----
    net/ipv4/af_inet.c | 15 +++++++++------
    net/ipv4/tcp.c | 11 +++++------
    net/ipv4/tcp_ipv4.c | 3 +++
    net/ipv6/af_inet6.c | 8 ++++----
    net/ipv6/tcp_ipv6.c | 3 +++
    8 files changed, 33 insertions(+), 20 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao
     

01 Jul, 2010

1 commit

  • /proc/net/snmp and /proc/net/netstat expose SNMP counters.

    Width of these counters is either 32 or 64 bits, depending on the size
    of "unsigned long" in kernel.

    This means user program parsing these files must already be prepared to
    deal with 64bit values, regardless of user program being 32 or 64 bit.

    This patch introduces 64bit snmp values for IPSTAT mib, where some
    counters can wrap pretty fast if they are 32bit wide.

    # netstat -s|egrep "InOctets|OutOctets"
    InOctets: 244068329096
    OutOctets: 244069348848

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
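
The wrap problem described above can be made concrete with a little arithmetic. The sketch below is only an illustration: the helper names and the 10 Gbit/s line rate used in the usage note are assumptions, not part of the patch.

```c
#include <stdint.h>

/* Hypothetical helper: seconds until a byte counter that is `bits` wide
 * wraps when traffic flows at `bits_per_sec`. */
static double seconds_to_wrap(unsigned bits, double bits_per_sec)
{
    /* 2^64 does not fit in a 64-bit integer, so use a double literal. */
    double capacity_bytes = (bits >= 64) ? 18446744073709551616.0
                                         : (double)(1ULL << bits);
    return capacity_bytes / (bits_per_sec / 8.0);
}

/* Returns 1 if `value` cannot be represented in a 32-bit counter. */
static int needs_64bit(uint64_t value)
{
    return value > UINT32_MAX;
}
```

At an assumed 10 Gbit/s, seconds_to_wrap(32, 10e9) is roughly 3.4 seconds, and the InOctets value quoted in the commit message (244068329096) already exceeds what a 32-bit counter can hold.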
     

26 Jun, 2010

1 commit

  • In preparation for 64bit snmp counters for some mibs,
    add an 'align' parameter to snmp_mib_init(), instead
    of assuming mibs only contain 'unsigned long' fields.

    Callers can use __alignof__(type) to provide correct
    alignment.

    Signed-off-by: Eric Dumazet
    CC: Herbert Xu
    CC: Arnaldo Carvalho de Melo
    CC: Hideaki YOSHIFUJI
    CC: Vlad Yasevich
    Signed-off-by: David S. Miller

    Eric Dumazet
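
A userspace analogue of an allocator that takes an explicit alignment might look like the sketch below. It is not the kernel's snmp_mib_init(); the struct layouts and the mib_alloc() name are hypothetical, and C11 _Alignof stands in for GCC's __alignof__.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for two MIB layouts. */
struct mib_ulong { unsigned long counters[8]; };
struct mib_u64   { uint64_t counters[8]; };

/* Allocate a zeroed MIB with the alignment the caller asks for,
 * instead of assuming "unsigned long" alignment. */
static void *mib_alloc(size_t size, size_t align)
{
    /* aligned_alloc() expects size to be a multiple of align. */
    size_t rounded = (size + align - 1) / align * align;
    void *p = aligned_alloc(align, rounded);

    if (p)
        memset(p, 0, rounded);
    return p;
}
```

A caller would pass the alignment of the element type, for example mib_alloc(sizeof(struct mib_u64), _Alignof(struct mib_u64)).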
     

24 Jun, 2010

1 commit


11 Jun, 2010

1 commit


16 May, 2010

1 commit

  • (Dropped the infiniband part, because Tetsuo modified the related code,
    I will send a separate patch for it once this is accepted.)

    This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
    allows users to reserve ports for third-party applications.

    The reserved ports will not be used by automatic port assignments
    (e.g. when calling connect() or bind() with port number 0). Explicit
    port allocation behavior is unchanged.

    Signed-off-by: Octavian Purdila
    Signed-off-by: WANG Cong
    Cc: Neil Horman
    Cc: Eric Dumazet
    Cc: Eric W. Biederman
    Signed-off-by: David S. Miller

    Amerigo Wang
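
The effect on automatic port assignment can be sketched with a bitmap that the port picker consults. This is an illustration only: the names are hypothetical, and the linear scan is a simplification that is not claimed to match the kernel's search order.

```c
#include <stdint.h>

#define PORT_COUNT 65536

/* One bit per port; a set bit means "reserved". */
static uint8_t reserved_bitmap[PORT_COUNT / 8];

static void reserve_port(unsigned port)
{
    reserved_bitmap[port / 8] |= (uint8_t)(1u << (port % 8));
}

static int port_is_reserved(unsigned port)
{
    return (reserved_bitmap[port / 8] >> (port % 8)) & 1;
}

/* Automatic assignment: return the first non-reserved port in
 * [low, high], or 0 if every port in the range is reserved.
 * An explicit bind() to a reserved port would not go through here. */
static unsigned pick_ephemeral_port(unsigned low, unsigned high)
{
    for (unsigned port = low; port <= high && port < PORT_COUNT; port++)
        if (!port_is_reserved(port))
            return port;
    return 0;
}
```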
     

28 Apr, 2010

1 commit


21 Apr, 2010

2 commits

  • Sparse can help us find endianness bugs, but we need to make some
    cleanups to be able to more easily spot real bugs.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Define a new function to return the waitqueue of a "struct sock".

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
            return sk->sk_sleep;
    }

    Change all read occurrences of sk_sleep by a call to this function.

    Needed for a future RCU conversion. sk_sleep won't be a field directly
    available.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Apr, 2010

1 commit

  • This patch implements receive flow steering (RFS). RFS steers
    received packets for layer 3 and 4 processing to the CPU where
    the application for the corresponding flow is running. RFS is an
    extension of Receive Packet Steering (RPS).

    The basic idea of RFS is that when an application calls recvmsg
    (or sendmsg) the application's running CPU is stored in a hash
    table that is indexed by the connection's rxhash which is stored in
    the socket structure. The rxhash is passed in skb's received on
    the connection from netif_receive_skb. For each received packet,
    the associated rxhash is used to look up the CPU in the hash table,
    if a valid CPU is set then the packet is steered to that CPU using
    the RPS mechanisms.

    The complication of the simple approach is that it would potentially
    allow OOO packets. If threads are thrashing around CPUs or multiple
    threads are trying to read from the same sockets, a quickly changing
    CPU value in the hash table could cause rampant OOO packets--
    we consider this a non-starter.

    To avoid OOO packets, this solution implements two types of hash
    tables: rps_sock_flow_table and rps_dev_flow_table.

    rps_sock_flow_table is a global hash table. Each entry is just a CPU
    number and it is populated in recvmsg and sendmsg as described above.
    This table contains the "desired" CPUs for flows.

    rps_dev_flow_table is specific to each device queue. Each entry
    contains a CPU and a tail queue counter. The CPU is the "current"
    CPU for a matching flow. The tail queue counter holds the value
    of a tail queue counter for the associated CPU's backlog queue at
    the time of last enqueue for a flow matching the entry.

    Each backlog queue has a queue head counter which is incremented
    on dequeue, and so a queue tail counter is computed as queue head
    count + queue length. When a packet is enqueued on a backlog queue,
    the current value of the queue tail counter is saved in the hash
    entry of the rps_dev_flow_table.

    And now the trick: when selecting the CPU for RPS (get_rps_cpu)
    the rps_sock_flow table and the rps_dev_flow table for the RX queue
    are consulted. When the desired CPU for the flow (found in the
    rps_sock_flow table) does not match the current CPU (found in the
    rps_dev_flow table), the current CPU is changed to the desired CPU
    if one of the following is true:

    - The current CPU is unset (equal to RPS_NO_CPU)
    - The current CPU is offline
    - The current CPU's queue head counter >= queue tail counter in the
      rps_dev_flow table. This checks if the queue head has advanced
      beyond the last packet that was enqueued using this table entry.
      This guarantees that all packets queued using this entry have been
      dequeued, thus preserving in-order delivery.

    Making each queue have its own rps_dev_flow table has two advantages:
    1) the tail queue counters will be written on each receive, so
    keeping the table local to the interrupting CPU is good for locality. 2)
    this allows lockless access to the table-- the CPU number and queue
    tail counter need to be accessed together under mutual exclusion
    from netif_receive_skb, we assume that this is only called from
    device napi_poll which is non-reentrant.

    This patch implements RFS for TCP and connected UDP sockets.
    It should be usable for other flow oriented protocols.

    There are two configuration parameters for RFS. The
    "rps_flow_entries" kernel init parameter sets the number of
    entries in the rps_sock_flow_table, the per rxqueue sysfs entry
    "rps_flow_cnt" contains the number of entries in the rps_dev_flow
    table for the rxqueue. Both are rounded to power of two.

    The obvious benefit of RFS (over just RPS) is that it achieves
    CPU locality between the receive processing for a flow and the
    applications processing; this can result in increased performance
    (higher pps, lower latency).

    The benefits of RFS are dependent on cache hierarchy, application
    load, and other factors. On simple benchmarks, we don't necessarily
    see improvement and sometimes see degradation. However, for more
    complex benchmarks and for applications where cache pressure is
    much higher this technique seems to perform very well.

    Below are some benchmark results which show the potential benefit of
    this patch. The netperf test has 500 instances of netperf TCP_RR
    test with 1 byte req. and resp. The RPC test is a request/response
    test similar in structure to the netperf RR test with 100 threads on
    each host, but does more work in userspace than netperf.

    e1000e on 8 core Intel

    Config                     tps    CPU
    No RFS or RPS              104K   30%
    No RFS (best RPS config)   290K   63%
    RFS                        303K   61%

    RPC test     tps    CPU%   50/90/99% usec latency   Latency StdDev
    No RFS/RPS   103K   48%    757/900/3185             4472.35
    RPS only     174K   73%    415/993/2468             491.66
    RFS          223K   73%    379/651/1382             315.61

    Signed-off-by: Tom Herbert
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Tom Herbert
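
The CPU-switch rule described in this commit message can be sketched in a few lines of userspace C. This is a simplified model, not the kernel's get_rps_cpu(): the struct layouts, function names, and the value chosen for RPS_NO_CPU are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RPS_NO_CPU 0xffff  /* sentinel named in the commit; value assumed */

struct dev_flow {
    uint16_t cpu;             /* "current" CPU for the flow */
    unsigned int last_qtail;  /* tail counter saved at last enqueue */
};

struct backlog {
    unsigned int head;  /* incremented on each dequeue */
    unsigned int qlen;  /* packets currently queued */
    bool online;
};

/* Tail counter = head counter + queue length. */
static unsigned int queue_tail(const struct backlog *q)
{
    return q->head + q->qlen;
}

/* Move the flow to the desired CPU only when no packet queued through
 * this entry can still be pending on the current CPU. */
static uint16_t select_cpu(struct dev_flow *flow, uint16_t desired,
                           const struct backlog *queues)
{
    uint16_t cur = flow->cpu;

    if (desired != cur &&
        (cur == RPS_NO_CPU ||
         !queues[cur].online ||
         /* signed difference tolerates counter wraparound */
         (int)(queues[cur].head - flow->last_qtail) >= 0))
        flow->cpu = desired;
    return flow->cpu;
}

/* Enqueue one packet on the flow's current CPU and record the tail. */
static void enqueue(struct dev_flow *flow, struct backlog *queues)
{
    assert(flow->cpu != RPS_NO_CPU);
    struct backlog *q = &queues[flow->cpu];

    q->qlen++;
    flow->last_qtail = queue_tail(q);
}
```

In this model a flow with one packet still queued on CPU 0 stays on CPU 0 even if the application has moved to CPU 1; once that packet is dequeued (head incremented), the next select_cpu() call switches the flow.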
     

13 Apr, 2010

1 commit

  • With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
    work.

    sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

    This rwlock is readlocked for a very small amount of time, and dst
    entries are already freed after RCU grace period. This calls for RCU
    again :)

    This patch converts sk_dst_lock to a spinlock, and uses RCU for readers.

    __sk_dst_get() is supposed to be called with rcu_read_lock() or if
    socket locked by user, so use appropriate rcu_dereference_check()
    condition (rcu_read_lock_held() || sock_owned_by_user(sk))

    This patch avoids two atomic ops per tx packet on UDP connected sockets,
    for example, and permits sk_dst_lock to be much less dirtied.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Apr, 2010

1 commit


07 Apr, 2010

1 commit


06 Apr, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (37 commits)
    smc91c92_cs: fix the problem of "Unable to find hardware address"
    r8169: clean up my printk uglyness
    net: Hook up cxgb4 to Kconfig and Makefile
    cxgb4: Add main driver file and driver Makefile
    cxgb4: Add remaining driver headers and L2T management
    cxgb4: Add packet queues and packet DMA code
    cxgb4: Add HW and FW support code
    cxgb4: Add register, message, and FW definitions
    netlabel: Fix several rcu_dereference() calls used without RCU read locks
    bonding: fix potential deadlock in bond_uninit()
    net: check the length of the socket address passed to connect(2)
    stmmac: add documentation for the driver.
    stmmac: fix kconfig for crc32 build error
    be2net: fix bug in vlan rx path for big endian architecture
    be2net: fix flashing on big endian architectures
    be2net: fix a bug in flashing the redboot section
    bonding: bond_xmit_roundrobin() fix
    drivers/net: Add missing unlock
    net: gianfar - align BD ring size console messages
    net: gianfar - initialize per-queue statistics
    ...

    Linus Torvalds
     

02 Apr, 2010

1 commit

  • check the length of the socket address passed to connect(2).

    Check the length of the socket address passed to connect(2). If the
    length is invalid, -EINVAL will be returned.

    Signed-off-by: Changli Gao
    ----
    net/bluetooth/l2cap.c | 3 ++-
    net/bluetooth/rfcomm/sock.c | 3 ++-
    net/bluetooth/sco.c | 3 ++-
    net/can/bcm.c | 3 +++
    net/ieee802154/af_ieee802154.c | 3 +++
    net/ipv4/af_inet.c | 5 +++++
    net/netlink/af_netlink.c | 3 +++
    7 files changed, 20 insertions(+), 3 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao
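
A length check of this kind might look like the sketch below. The function name is hypothetical and the exact conditions are an assumption about what counts as an invalid length; the point is that the address length is validated before any field of the address is read.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical validation helper, not the kernel's code. */
static int validate_connect_addr(const struct sockaddr *uaddr,
                                 socklen_t addr_len)
{
    /* The family field must lie entirely inside the caller's buffer
     * before it is safe to read it. */
    if (addr_len < sizeof(uaddr->sa_family))
        return -EINVAL;

    /* Assumed follow-up check: an IPv4 address must be complete. */
    if (uaddr->sa_family == AF_INET &&
        addr_len < sizeof(struct sockaddr_in))
        return -EINVAL;

    return 0;
}
```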
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

22 Mar, 2010

1 commit

  • There is no point in aligning or padding mibs to cache lines; they are
    per-cpu allocated with an 8-byte alignment anyway.
    This wastes space for no gain. This patch removes __SNMP_MIB_ALIGN__

    Since SNMP mibs contain "unsigned long" fields only, we can relax the
    allocation alignment from "unsigned long long" to "unsigned long"

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to net.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.

    The macro and type tricks around snmp stats make things a bit
    interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field
    as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All
    snmp_mib_*() users which used to cast the argument to (void **) are
    updated to cast it to (void __percpu **).

    Signed-off-by: Tejun Heo
    Acked-by: David S. Miller
    Cc: Patrick McHardy
    Cc: Arnaldo Carvalho de Melo
    Cc: Vlad Yasevich
    Cc: netdev@vger.kernel.org
    Signed-off-by: David S. Miller

    Tejun Heo
     

06 Nov, 2009

3 commits

  • Before calling capable(CAP_NET_RAW) check if this operation is on behalf
    of the kernel or on behalf of userspace. Do not do the security check if
    it is on behalf of the kernel.

    Signed-off-by: Eric Paris
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Paris
     
  • The generic __sock_create function has a kern argument which allows the
    security system to make decisions based on if a socket is being created by
    the kernel or by userspace. This patch passes that flag to the
    net_proto_family specific create function, so it can do the same thing.

    Signed-off-by: Eric Paris
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Paris
     
  • struct can_proto had a capability field which wasn't ever used. It is
    dropped entirely.

    struct inet_protosw had a capability field which can be more clearly
    expressed in the code by just checking if sock->type == SOCK_RAW.

    Signed-off-by: Eric Paris
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Paris
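
Taken together, the three commits above suggest a permission check shaped roughly like the following sketch. The function name is hypothetical, and the has_net_raw parameter is a stand-in for the result of capable(CAP_NET_RAW); this is a model of the described logic, not the kernel's code.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical model: `kern` says whether the socket is created on
 * behalf of the kernel; `has_net_raw` models capable(CAP_NET_RAW). */
static int may_create_socket(int sock_type, bool kern, bool has_net_raw)
{
    /* Only raw sockets need the capability, and only when the request
     * comes from userspace. */
    if (sock_type == SOCK_RAW && !kern && !has_net_raw)
        return -EPERM;
    return 0;
}
```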
     

29 Oct, 2009

1 commit

  • proto_ops->getname implies copying protocol specific data
    into a storage unit (particularly into __kernel_sockaddr_storage).
    So when we implement new protocol support we should keep such
    a detail in mind (which is easy to forget about).

    Let's introduce a DECLARE_SOCKADDR helper which checks at build time
    that the storage unit is not overflowed.

    Eventually inet_getname is switched to use DECLARE_SOCKADDR
    (to show example of usage).

    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: David S. Miller

    Cyrill Gorcunov
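
A userspace analogue of such a build-time check can be written with C11 _Static_assert. The macro below is a hypothetical sketch named CHECKED_SOCKADDR; it is not the kernel's DECLARE_SOCKADDR definition, and it checks against struct sockaddr_storage rather than __kernel_sockaddr_storage.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: declare `dst` as a typed view of `src`, and fail
 * the build if `type` would not fit in the generic storage. */
#define CHECKED_SOCKADDR(type, dst, src)                            \
    _Static_assert(sizeof(type) <= sizeof(struct sockaddr_storage), \
                   "address type overflows sockaddr_storage");      \
    type *dst = (type *)(src)

/* Example getname-style function: fills in an IPv4 address and
 * returns the number of bytes used. Port 80 is arbitrary. */
static int fill_inet_name(struct sockaddr_storage *storage)
{
    CHECKED_SOCKADDR(struct sockaddr_in, sin, storage);

    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(80);
    return (int)sizeof(*sin);
}
```

If a new protocol's address structure grew beyond the storage size, the compile would stop at the macro instead of overflowing at run time.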
     

19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    Goal is to transfer fields used at lookup time in the first
    read-mostly cache line (inside struct sock_common) and move sk_refcnt
    to a separate cache line (only written by rx path)

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Oct, 2009

1 commit


02 Oct, 2009

1 commit

  • This patch against v2.6.31 adds support for route lookup using sk_mark in some
    more places. The benefits from this patch are the following.
    First, SO_MARK option now has effect on UDP sockets too.
    Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing
    lookup correctly if TCP sockets with SO_MARK were used.

    Signed-off-by: Atis Elsts
    Acked-by: Eric Dumazet

    Atis Elsts
     

15 Sep, 2009

1 commit


29 Aug, 2009

1 commit


13 Jul, 2009

1 commit


04 Jun, 2009

1 commit


02 Jun, 2009

1 commit

  • After some discussion offline with Christoph Lameter and David Stevens
    regarding multicast behaviour in Linux, I'm submitting a slightly
    modified patch from the one Christoph submitted earlier.

    This patch provides a new socket option IP_MULTICAST_ALL.

    In this case, default behaviour is _unchanged_ from the current
    Linux standard. The socket option is set by default to provide
    original behaviour. Sockets wishing to receive data only from
    multicast groups they join explicitly will need to clear this
    socket option.

    Signed-off-by: Nivedita Singhvi
    Signed-off-by: Christoph Lameter
    Acked-by: David Stevens
    Signed-off-by: David S. Miller

    Nivedita Singhvi
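
From userspace, clearing the option would look roughly like the sketch below. It assumes a Linux system whose headers define IP_MULTICAST_ALL; the function name is hypothetical.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a UDP socket that should only receive
 * multicast data for groups it joins explicitly.
 * Returns the descriptor, or -1 on failure. */
static int open_group_only_udp_socket(void)
{
    int off = 0;  /* the option is on by default; clear it */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_ALL,
                   &off, sizeof(off)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The socket would still need IP_ADD_MEMBERSHIP calls for each group it wants to receive.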
     

27 May, 2009

2 commits


17 Apr, 2009

1 commit

  • inet_register_protosw() function is responsible for adding a new
    inet protocol into a global table (inetsw[]) that is used with RCU rules.

    As soon as the store of the pointer is done, other cpus might see
    this new protocol in inetsw[], so we have to make sure new protocol
    is ready for use. All pending memory updates should thus be committed
    to memory before setting the pointer.
    This is correctly done using rcu_assign_pointer()

    synchronize_net() is typically used at unregister time, after
    unsetting the pointer, to make sure no other cpu is still using
    the object we want to dismantle. Using it at register time
    is only adding an artificial delay that could hide a real bug,
    and this bug could pop up if/when synchronize_rcu() can proceed
    faster than now.

    This saves about 13 ms on boot time on a HZ=1000 8 cpus machine ;)
    (4 calls to inet_register_protosw(), and about 3200 us per call)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
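
The publish-after-initialise ordering described above can be modelled in userspace with C11 atomics. This is an analogy only: a release store stands in for rcu_assign_pointer(), an acquire load stands in for the reader side, and the struct and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, much reduced protocol entry. */
struct protosw_entry {
    int type;
    int protocol;
};

/* One slot of a table that readers scan without locks. */
static _Atomic(struct protosw_entry *) slot;

/* Fill in the entry first, then publish the pointer with release
 * semantics so readers never observe a half-initialised entry. */
static void publish_entry(struct protosw_entry *p, int type, int protocol)
{
    p->type = type;
    p->protocol = protocol;
    atomic_store_explicit(&slot, p, memory_order_release);
}

/* Reader side: acquire pairs with the release store above. */
static const struct protosw_entry *lookup_entry(void)
{
    return atomic_load_explicit(&slot, memory_order_acquire);
}
```

No waiting is needed after publishing; a grace-period wait only matters when an entry is removed and about to be freed.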
     

28 Mar, 2009

1 commit


10 Mar, 2009

1 commit


20 Feb, 2009

1 commit


09 Feb, 2009

1 commit

  • As this function can be called more than half a million times for
    10GbE, it's important to optimise it as much as we can.

    This patch does some obvious changes to use 2-byte and 4-byte
    operations instead of byte-oriented ones where possible. Bit
    ops are also used to replace logical ops to reduce branching.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
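
The kind of rewrite described here can be illustrated with a header comparison. The sketch below is not the patched function; the 8-byte header layout and both function names are assumptions chosen to show wider loads and bit operations replacing per-byte branches.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte header prefix used for flow matching. */
struct hdr {
    uint8_t bytes[8];
};

/* Byte-oriented version: up to eight compares and branches. */
static int differs_bytewise(const struct hdr *a, const struct hdr *b)
{
    for (int i = 0; i < 8; i++)
        if (a->bytes[i] != b->bytes[i])
            return 1;
    return 0;
}

/* Word-oriented version: two 4-byte loads per header; XOR and OR
 * replace the per-byte != and || so only one test remains. */
static int differs_wordwise(const struct hdr *a, const struct hdr *b)
{
    uint32_t a0, a1, b0, b1;

    /* memcpy keeps the loads alignment-safe in portable C. */
    memcpy(&a0, a->bytes, 4);
    memcpy(&a1, a->bytes + 4, 4);
    memcpy(&b0, b->bytes, 4);
    memcpy(&b1, b->bytes + 4, 4);
    return ((a0 ^ b0) | (a1 ^ b1)) != 0;
}
```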
     

02 Feb, 2009

1 commit


01 Feb, 2009

1 commit


30 Jan, 2009

1 commit

  • Unfortunately simplicity isn't always the best. The fraginfo
    interface turned out to be suboptimal. The problem was quite
    obvious. For every packet, we have to copy the headers from
    the frags structure into skb->head, even though for 99% of the
    packets this part is immediately thrown away after the merge.

    LRO didn't have this problem because it directly read the headers
    from the frags structure.

    This patch attempts to address this by creating an interface
    that allows GRO to access the headers in the first frag without
    having to copy it. Because all drivers that use frags place the
    headers in the first frag this optimisation should be enough.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu