23 Dec, 2011

1 commit

  • skb->truesize might be big even for a small packet.

    It's even bigger after commit 87fb4b7b533 (net: more accurate skb
    truesize) and with a big MTU.

    We should allow queueing at least one packet per receiver, even with a
    low RCVBUF setting.

    Reported-by: Michal Simek
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
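
The "at least one packet per receiver" rule can be illustrated with a small userspace sketch. It is not the kernel code: `struct rx_account` and both helper names are hypothetical, and the sketch only assumes that the receive limit is checked against the bytes of truesize already queued.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for receive-side accounting. */
struct rx_account {
    unsigned int queued;  /* bytes of truesize already queued */
    unsigned int rcvbuf;  /* configured receive limit (RCVBUF) */
};

/*
 * Strict check: counts the incoming packet too, so a packet whose
 * truesize exceeds rcvbuf is rejected even when the queue is empty.
 */
static bool queue_full_strict(const struct rx_account *a, unsigned int truesize)
{
    return a->queued + truesize > a->rcvbuf;
}

/*
 * Relaxed check: ignores the incoming packet's truesize, so an empty
 * queue always accepts one packet, however large its truesize is.
 */
static bool queue_full_relaxed(const struct rx_account *a, unsigned int truesize)
{
    (void)truesize;
    return a->queued > a->rcvbuf;
}
```

With an empty queue and a 256 byte limit, a packet with a 2304 byte truesize is refused by the strict check and accepted by the relaxed one. Once it is queued, the relaxed check reports the queue as full for the next packet.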
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

2 commits

  • Standardize the style for compiler based printf format verification.
    Standardized the location of __printf too.

    Done via script and a little typing.

    $ grep -rPl --include=*.[ch] -w "__attribute__" * | \
    grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
    xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'

    [akpm@linux-foundation.org: revert arch bits]
    Signed-off-by: Joe Perches
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The module.h header pretty much brings in the kitchen sink along
    with it, so it should be avoided wherever reasonably possible in
    terms of being included from other commonly used header
    files, as it results in a measurable increase in compile times.

    The worst culprit was probably device.h since it is used everywhere.
    This file also had an implicit dependency/usage of mutex.h which was
    masked by module.h, and is also fixed here at the same time.

    There are over a dozen other headers that simply declare the
    struct instead of pulling in the whole file, so follow their lead
    and simply make it a few more.

    Most of the implicit dependencies on module.h being present by
    these headers pulling it in have been now weeded out, so we can
    finally make this change with hopefully minimal breakage.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4 tuple for the
    packet which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_rx_tunnel.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

26 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    fs: Merge split strings
    treewide: fix potentially dangerous trailing ';' in #defined values/expressions
    uwb: Fix misspelling of neighbourhood in comment
    net, netfilter: Remove redundant goto in ebt_ulog_packet
    trivial: don't touch files that are removed in the staging tree
    lib/vsprintf: replace link to Draft by final RFC number
    doc: Kconfig: `to be' -> `be'
    doc: Kconfig: Typo: square -> squared
    doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
    drivers/net: static should be at beginning of declaration
    drivers/media: static should be at beginning of declaration
    drivers/i2c: static should be at beginning of declaration
    XTENSA: static should be at beginning of declaration
    SH: static should be at beginning of declaration
    MIPS: static should be at beginning of declaration
    ARM: static should be at beginning of declaration
    rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
    Update my e-mail address
    PCIe ASPM: forcedly -> forcibly
    gma500: push through device driver tree
    ...

    Fix up trivial conflicts:
    - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
    - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
    - drivers/net/r8169.c (just context changes)

    Linus Torvalds
     

09 Jul, 2011

1 commit


07 Jul, 2011

1 commit


06 Jul, 2011

1 commit


28 Jun, 2011

2 commits

  • Fix 'make htmldocs' warnings:

    Warning(/include/linux/hrtimer.h:153): No description found for
    parameter 'clockid'
    Warning(/include/linux/device.h:604): Excess struct/union/enum/typedef
    member 'of_match' description in 'device'
    Warning(/include/net/sock.h:349): Excess struct/union/enum/typedef
    member 'sk_rmem_alloc' description in 'sock'

    Signed-off-by: Vitaliy Ivanov
    Acked-by: Grant Likely
    Acked-by: David S. Miller
    Acked-by: Randy Dunlap
    Signed-off-by: Jiri Kosina

    Vitaliy Ivanov
     
  • Fix 'make htmldocs' warnings:

    Warning(/include/linux/hrtimer.h:153): No description found for parameter 'clockid'
    Warning(/include/linux/device.h:604): Excess struct/union/enum/typedef member 'of_match' description in 'device'
    Warning(/include/net/sock.h:349): Excess struct/union/enum/typedef member 'sk_rmem_alloc' description in 'sock'

    Signed-off-by: Vitaliy Ivanov
    Acked-by: Grant Likely
    Acked-by: David S. Miller
    Signed-off-by: Linus Torvalds

    Vitaliy Ivanov
     

07 Jun, 2011

1 commit


12 Apr, 2011

1 commit


07 Apr, 2011

1 commit

  • Commit c6e1a0d12ca7b4f22c58e55a16beacfb7d3d8462
    (net: Allow no-cache copy from user on transmit) broke the checksum
    calculation, which may cause some TCP packets to be dropped because of
    an incorrect checksum. ssh does not work under today's net-next-2.6
    tree.

    Signed-off-by: Wei Yongjun
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Wei Yongjun
     

05 Apr, 2011

1 commit

  • This patch uses __copy_from_user_nocache on transmit to bypass data
    cache for a performance improvement. skb_add_data_nocache and
    skb_copy_to_page_nocache can be called by sendmsg functions to use
    this feature, initial support is in tcp_sendmsg. This functionality is
    configurable per device using ethtool.

    Presumably, this feature would only be useful when the driver does
    not touch the data. The feature is turned on by default if a device
    indicates that it does some form of checksum offload; it is off by
    default for devices that do no checksum offload or indicate no checksum
    is necessary. For the former case copy-checksum is probably done
    anyway, in the latter case the device is likely loopback in which case
    the no cache copy is probably not beneficial.

    This patch was tested using 200 instances of netperf TCP_RR with
    1400 byte request and one byte reply. Platform is 16 core AMD x86.

    No-cache copy disabled:
    672703 tps, 97.13% utilization
    50/90/99% latency 244.31 484.205 1028.41

    No-cache copy enabled:
    702113 tps, 96.16% utilization
    50/90/99% latency 238.56 467.56 956.955

    Using 14000 byte request and response sizes demonstrates the
    effects more dramatically:

    No-cache copy disabled:
    79571 tps, 34.34% utilization
    50/90/95% latency 1584.46 2319.59 5001.76

    No-cache copy enabled:
    83856 tps, 34.81% utilization
    50/90/95% latency 2508.42 2622.62 2735.88

    Note especially the effect on latency tail (95th percentile).

    This seems to provide a nice performance improvement and is
    consistent in the tests I ran. Presumably, this would provide
    the greatest benefits in the presence of an application workload
    stressing the cache and a lot of transmit data happening.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

31 Mar, 2011

1 commit


23 Feb, 2011

1 commit


01 Feb, 2011

1 commit


30 Jan, 2011

1 commit

  • SIOCGETSGCNT is not a unique ioctl value, as it maps to SIOCPROTOPRIVATE + 1,
    which unfortunately means the existing infrastructure for compat networking
    ioctls is insufficient. A trivial compat ioctl implementation would conflict
    with:

    SIOCAX25ADDUID
    SIOCAIPXPRISLT
    SIOCGETSGCNT_IN6
    SIOCGETSGCNT
    SIOCRSSCAUSE
    SIOCX25SSUBSCRIP
    SIOCX25SDTEFACILITIES

    To make this work I have updated the compat_ioctl decode path to mirror the
    normal ioctl decode path. I have added an ipv4 inet_compat_ioctl function
    so that I can have ipv4 specific compat ioctls. I have added a compat_ioctl
    function into struct proto so I can break out ioctls by which kind of ip socket
    I am using. I have added a compat_raw_ioctl function because SIOCGETSGCNT only
    works on raw sockets. I have added a ipmr_compat_ioctl that mirrors the normal
    ipmr_ioctl.

    This was necessary because unfortunately the struct layout for SIOCGETSGCNT
    has unsigned longs in it, so it changes between 32bit and 64bit kernels.

    This change was sufficient to run a 32bit ip multicast routing daemon on a
    64bit kernel.

    Reported-by: Bill Fenner
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
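
The layout problem can be sketched in userspace C. The struct and function names below are hypothetical and the field names are illustrative. The only point taken from the message above is that the request carries `unsigned long` counters, whose width differs between 32bit and 64bit builds, so a compat path needs a fixed-width mirror of the struct.

```c
#include <stdint.h>

/* Native view: counter width follows the build's "unsigned long". */
struct sg_req_native {
    uint32_t src;
    uint32_t grp;
    unsigned long pktcnt;
    unsigned long bytecnt;
    unsigned long wrong_if;
};

/* Compat view: what a 32bit caller expects, fixed at 32 bits per field. */
struct sg_req_compat {
    uint32_t src;
    uint32_t grp;
    uint32_t pktcnt;
    uint32_t bytecnt;
    uint32_t wrong_if;
};

/* Copy field by field; counters are truncated to 32 bits for the caller. */
static void sg_req_to_compat(const struct sg_req_native *in,
                             struct sg_req_compat *out)
{
    out->src = in->src;
    out->grp = in->grp;
    out->pktcnt = (uint32_t)in->pktcnt;
    out->bytecnt = (uint32_t)in->bytecnt;
    out->wrong_if = (uint32_t)in->wrong_if;
}
```

The compat struct is always 20 bytes, while the native struct's size depends on the platform. That is why a plain byte copy between the two is not enough.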
     

19 Jan, 2011

1 commit


10 Jan, 2011

1 commit


18 Dec, 2010

1 commit


17 Dec, 2010

1 commit

  • Special care is taken inside sk_prot_alloc to avoid overwriting
    skc_node/skc_nulls_node. We should also avoid overwriting
    skc_bind_node/skc_portaddr_node.

    The patch fixes the following crash:

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] udp4_lib_lookup2+0xad/0x370
    [] __udp4_lib_lookup+0x282/0x360
    [] __udp4_lib_rcv+0x31e/0x700
    [] ? ip_local_deliver_finish+0x65/0x190
    [] ? ip_local_deliver+0x88/0xa0
    [] udp_rcv+0x15/0x20
    [] ip_local_deliver_finish+0x65/0x190
    [] ip_local_deliver+0x88/0xa0
    [] ip_rcv_finish+0x32d/0x6f0
    [] ? netif_receive_skb+0x99c/0x11c0
    [] ip_rcv+0x2bb/0x350
    [] netif_receive_skb+0x99c/0x11c0

    Signed-off-by: Leonard Crestez
    Signed-off-by: Octavian Purdila
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Octavian Purdila
     

10 Dec, 2010

1 commit

  • Followup of commit b178bb3dfc30 (net: reorder struct sock fields)

    Optimize INET input path a bit further, by :

    1) moving sk_refcnt close to sk_lock.

    This reduces number of dirtied cache lines by one on 64bit arches (and
    64 bytes cache line size).

    2) moving inet_daddr & inet_rcv_saddr at the beginning of sk

    (same cache line as hash / family / bound_dev_if / nulls_node)

    This reduces number of accessed cache lines in lookups by one, and does not
    increase size of inet and timewait socks.
    inet and tw sockets now share same place-holder for these fields.

    Before patch :

    offsetof(struct sock, sk_refcnt) = 0x10
    offsetof(struct sock, sk_lock) = 0x40
    offsetof(struct sock, sk_receive_queue) = 0x60
    offsetof(struct inet_sock, inet_daddr) = 0x270
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

    After patch :

    offsetof(struct sock, sk_refcnt) = 0x44
    offsetof(struct sock, sk_lock) = 0x48
    offsetof(struct sock, sk_receive_queue) = 0x68
    offsetof(struct inet_sock, inet_daddr) = 0x0
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

    compute_score() (udp or tcp) now uses a single cache line per ignored
    item, instead of two.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
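
The cache line counts can be checked from the offsets quoted above. This is a minimal sketch assuming 64 byte cache lines, as the message states; the helper names are made up for illustration.

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64u

/* Index of the cache line that contains a given byte offset. */
static size_t cache_line_of(size_t offset)
{
    return offset / CACHE_LINE_SIZE;
}

/* Count the distinct cache lines covered by n field offsets (n is small). */
static size_t distinct_lines(const size_t *offsets, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        size_t j;

        /* Skip this offset if an earlier one already hit the same line. */
        for (j = 0; j < i; j++)
            if (cache_line_of(offsets[j]) == cache_line_of(offsets[i]))
                break;
        if (j == i)
            count++;
    }
    return count;
}
```

Feeding in sk_refcnt, sk_lock and sk_receive_queue gives two lines before the patch (0x10, 0x40, 0x60) and one line after it (0x44, 0x48, 0x68), matching the "by one" claim for dirtied cache lines.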
     

09 Dec, 2010

1 commit


07 Dec, 2010

1 commit

  • Pavel Emelyanov tried to fix a race between sk_filter_(de|at)tach and
    sk_clone() in commit 47e958eac280c263397

    Problem is we can have several clones sharing a common sk_filter, and
    these clones might want to sk_filter_attach() their own filters at the
    same time, and can overwrite old_filter->rcu, corrupting RCU queues.

    We can not use filter->rcu without being sure no other thread could do
    the same thing.

    Switch code to a more conventional ref-counting technique : Do the
    atomic decrement immediately and queue one rcu call back when last
    reference is released.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
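
The ref-counting scheme described above can be sketched with C11 atomics. This is a userspace illustration only: `struct filter` and its helpers are hypothetical, and the `deferred` counter merely stands in for queueing an RCU callback.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a filter shared by several socket clones. */
struct filter {
    atomic_int refcnt;
    int deferred;  /* how many times a deferred free was queued */
};

/* Take one more reference on a shared filter. */
static void filter_hold(struct filter *f)
{
    atomic_fetch_add(&f->refcnt, 1);
}

/*
 * Drop one reference immediately. Only the caller that releases the
 * last reference queues the deferred free, so it is queued exactly once.
 * Returns 1 if this call released the last reference, 0 otherwise.
 */
static int filter_release(struct filter *f)
{
    if (atomic_fetch_sub(&f->refcnt, 1) == 1) {
        f->deferred++;  /* kernel code would queue one RCU callback here */
        return 1;
    }
    return 0;
}
```

Because the decrement is atomic, two clones releasing at the same time cannot both see the count reach zero, which is the property the per-filter `rcu` field alone could not provide.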
     

03 Dec, 2010

1 commit


17 Nov, 2010

2 commits

  • Right now, fields in struct sock are not optimally ordered, because each
    path (RX softirq, TX completion, RX user, TX user) has to touch fields
    that are contained in many different cache lines.

    The really critical thing is to shrink number of cache lines that are
    used at RX softirq time : CPU handling softirqs for a device can receive
    many frames per second for many sockets. If load is too big, we can drop
    frames at NIC level. RPS or multiqueue cards can help, but better reduce
    latency if possible.

    This patch starts with UDP protocol, then additional patches will try to
    reduce latencies of other ones as well.

    At RX softirq time, fields of interest for UDP protocol are :
    (not counting ones in inet struct for the lookup)

    Read/Written:
    sk_refcnt (atomic increment/decrement)
    sk_rmem_alloc & sk_backlog.len (to check if there is room in queues)
    sk_receive_queue
    sk_backlog (if socket locked by user program)
    sk_rxhash
    sk_forward_alloc
    sk_drops

    Read only:
    sk_rcvbuf (sk_rcvqueues_full())
    sk_filter
    sk_wq
    sk_policy[0]
    sk_flags

    Additional notes :

    - sk_backlog has one hole on 64bit arches. We can fill it to save 8
    bytes.
    - sk_backlog is used only if RX softirq handler finds the socket while
    locked by user.
    - sk_rxhash is written only once per flow.
    - sk_drops is written only if queues are full

    Final layout :

    [1] One section grouping all read/write fields, but placing rxhash and
    sk_backlog at the end of this section.

    [2] One section grouping all read fields in RX handler
    (sk_filter, sk_rcvbuf, sk_wq)

    [3] Section used by other paths

    I'll post a patch on its own to put sk_refcnt at the end of struct
    sock_common so that it shares same cache line as section [1]

    New offsets on 64bit arch :

    sizeof(struct sock)=0x268
    offsetof(struct sock, sk_refcnt) =0x10
    offsetof(struct sock, sk_lock) =0x48
    offsetof(struct sock, sk_receive_queue)=0x68
    offsetof(struct sock, sk_backlog)=0x80
    offsetof(struct sock, sk_rmem_alloc)=0x80
    offsetof(struct sock, sk_forward_alloc)=0x98
    offsetof(struct sock, sk_rxhash)=0x9c
    offsetof(struct sock, sk_rcvbuf)=0xa4
    offsetof(struct sock, sk_drops) =0xa0
    offsetof(struct sock, sk_filter)=0xa8
    offsetof(struct sock, sk_wq)=0xb0
    offsetof(struct sock, sk_policy)=0xd0
    offsetof(struct sock, sk_flags) =0xe0

    Instead of :

    sizeof(struct sock)=0x270
    offsetof(struct sock, sk_refcnt) =0x10
    offsetof(struct sock, sk_lock) =0x50
    offsetof(struct sock, sk_receive_queue)=0xc0
    offsetof(struct sock, sk_backlog)=0x70
    offsetof(struct sock, sk_rmem_alloc)=0xac
    offsetof(struct sock, sk_forward_alloc)=0x10c
    offsetof(struct sock, sk_rxhash)=0x128
    offsetof(struct sock, sk_rcvbuf)=0x4c
    offsetof(struct sock, sk_drops) =0x16c
    offsetof(struct sock, sk_filter)=0x198
    offsetof(struct sock, sk_wq)=0x88
    offsetof(struct sock, sk_policy)=0x98
    offsetof(struct sock, sk_flags) =0x130

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • UDP socket refcount is usually 2, unless an incoming frame is going to
    be queued in receive or backlog queue.

    Using atomic_inc_not_zero_hint() helps reduce latency, because the
    processor issues fewer memory transactions.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
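
The idea behind a hinted increment can be sketched with C11 atomics. This is a userspace approximation, not the kernel primitive: `inc_not_zero_hint` is a hypothetical name. The sketch assumes the saving comes from attempting the compare-and-swap with the expected value directly, instead of loading the counter first.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v unless it is zero. "hint" is the value the caller
 * expects to find; when the guess is right, the first compare-and-swap
 * succeeds without a separate load of the counter.
 */
static bool inc_not_zero_hint(atomic_int *v, int hint)
{
    /* A zero hint is useless as a guess, so fall back to a real load. */
    int c = hint ? hint : atomic_load(v);

    while (c != 0) {
        /* On failure, c is updated to the value actually observed. */
        if (atomic_compare_exchange_weak(v, &c, c + 1))
            return true;
    }
    return false;  /* counter was zero: the object is going away */
}
```

A wrong hint costs one failed compare-and-swap, after which the loop retries with the observed value, so correctness does not depend on the guess.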
     

11 Nov, 2010

1 commit

  • Robin Holt tried to boot a 16TB machine and found some limits were
    reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]

    We can switch infrastructure to use "long" instead of "int", now that
    atomic_long_t primitives are available for free.

    Signed-off-by: Eric Dumazet
    Reported-by: Robin Holt
    Reviewed-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Eric Dumazet
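
A rough sense of scale, as a sketch: it assumes 4 KiB pages (not stated in the message) and does not reproduce how sysctl_tcp_mem or sysctl_udp_mem are actually derived. It only shows that quantities scaling with memory size, such as page counts, stop fitting in a 32 bit int on a 16TB machine.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define ASSUMED_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Number of whole pages in a given amount of memory. */
static uint64_t pages_in(uint64_t bytes)
{
    return bytes >> ASSUMED_PAGE_SHIFT;
}

/* True if the value can be stored in an int without overflow. */
static bool fits_in_int(uint64_t value)
{
    return value <= (uint64_t)INT_MAX;
}
```

Under that assumption, 16TB is 2^32 pages, which is just over twice INT_MAX, while 1TB (2^28 pages) still fits.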
     

26 Oct, 2010

1 commit


27 Sep, 2010

1 commit

  • SOCK_MIN_RCVBUF current value is 256 bytes

    It does not permit receiving the smallest possible frame, considering
    socket sk_rmem_alloc/sk_rcvbuf account skb truesizes. On 64bit arches,
    sizeof(struct sk_buff) is 240 bytes. Add the typical 64 bytes of
    headroom, and we go over the limit.

    With old kernels and 32bit arches, we were under the limit, if netdriver
    was doing copybreak.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
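
The arithmetic can be written out using only the numbers quoted above. The constant and helper names below are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

enum {
    OLD_SOCK_MIN_RCVBUF = 256,    /* value quoted in the message */
    SKB_STRUCT_SIZE_64BIT = 240,  /* sizeof(struct sk_buff) quoted above */
    TYPICAL_HEADROOM = 64         /* typical headroom quoted above */
};

/* Smallest truesize a frame with the given payload would be charged. */
static size_t min_truesize(size_t payload)
{
    return (size_t)SKB_STRUCT_SIZE_64BIT + TYPICAL_HEADROOM + payload;
}

/* True if a frame of this truesize fits under the receive limit. */
static bool fits(size_t truesize, size_t rcvbuf)
{
    return truesize <= rcvbuf;
}
```

Even an empty frame is charged 240 + 64 = 304 bytes, which already exceeds the 256 byte minimum.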
     

10 Sep, 2010

1 commit


09 Sep, 2010

1 commit

  • commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
    added a secondary hash on UDP, hashed on (local addr, local port).

    Problem is that following sequence :

    fd = socket(...)
    connect(fd, &remote, ...)

    not only selects remote end point (address and port), but also sets
    local address, while UDP stack stored in secondary hash table the socket
    while its local address was INADDR_ANY (or ipv6 equivalent)

    Sequence is :
    - autobind() : choose a random local port, insert socket in hash tables
    [while local address is INADDR_ANY]
    - connect() : set remote address and port, change local address to IP
    given by a route lookup.

    When an incoming UDP frame comes, if more than 10 sockets are found in
    primary hash table, we switch to secondary table, and fail to find
    socket because its local address changed.

    One solution to this problem is to rehash datagram socket if needed.

    We add a new rehash(struct socket *) method in "struct proto", and
    implement this method for UDP v4 & v6, using a common helper.

    This rehashing only takes care of secondary hash table, since primary
    hash (based on local port only) is not changed.

    Reported-by: Krzysztof Piotr Oledzki
    Signed-off-by: Eric Dumazet
    Tested-by: Krzysztof Piotr Oledzki
    Signed-off-by: David S. Miller

    Eric Dumazet
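
The rehash step can be illustrated with a toy model. Everything here is hypothetical: the struct, the bucket count and the hash function are placeholders, not the kernel's. The sketch only shows why a socket hashed while its local address was INADDR_ANY has to move when connect() assigns a real local address.

```c
#include <stdint.h>

#define NBUCKETS 16u

/* Toy datagram socket tracked in a secondary hash on (addr, port). */
struct dgram_sock {
    uint32_t local_addr;  /* 0 plays the role of INADDR_ANY */
    uint16_t local_port;
    unsigned int bucket;  /* secondary-hash bucket currently holding it */
};

/* Toy hash over (local addr, local port); not the kernel's function. */
static unsigned int hash2(uint32_t addr, uint16_t port)
{
    return (addr * 2654435761u ^ port) % NBUCKETS;
}

/* Initial insertion, as done at autobind time. */
static void sock_hash(struct dgram_sock *sk)
{
    sk->bucket = hash2(sk->local_addr, sk->local_port);
}

/* connect()-like step: the local address changes, so rehash if needed. */
static void set_local_addr(struct dgram_sock *sk, uint32_t addr)
{
    unsigned int new_bucket = hash2(addr, sk->local_port);

    sk->local_addr = addr;
    if (new_bucket != sk->bucket)
        sk->bucket = new_bucket;  /* move to the bucket lookups will probe */
}

/* A lookup only probes the bucket derived from the packet's destination. */
static int lookup_finds(const struct dgram_sock *sk, uint32_t addr, uint16_t port)
{
    return sk->bucket == hash2(addr, port) &&
           sk->local_addr == addr && sk->local_port == port;
}
```

Without the bucket update in `set_local_addr`, the socket would stay in the bucket computed for address 0, and a lookup keyed on the new address would generally probe a different bucket.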
     

19 Aug, 2010

1 commit

  • This patch removes the abstraction introduced by the union skb_shared_tx in
    the shared skb data.

    The access of the different union elements at several places led to some
    confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

    http://marc.info/?l=linux-netdev&m=128084897415886&w=2

    Signed-off-by: Oliver Hartkopp
    Signed-off-by: David S. Miller

    Oliver Hartkopp
     

10 Aug, 2010

1 commit

  • Add missing kernel-doc notation to struct sock:

    Warning(include/net/sock.h:324): No description found for parameter 'sk_peer_pid'
    Warning(include/net/sock.h:324): No description found for parameter 'sk_peer_cred'
    Warning(include/net/sock.h:324): No description found for parameter 'sk_classid'
    Warning(include/net/sock.h:324): Excess struct/union/enum/typedef member 'sk_peercred' description in 'sock'

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Randy Dunlap
     

21 Jul, 2010

1 commit


15 Jul, 2010

1 commit

  • Fix problem in reading the tx_queue recorded in a socket. In
    dev_pick_tx, the TX queue is read by doing a check with
    sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
    The problem is that there is no mutual exclusion across these
    calls in the socket, so it is possible that the queue in the
    sock can be invalidated after sk_tx_queue_recorded is called so
    that sk_tx_queue_get returns -1, which sets 65535 in queue_index
    and thus dev_pick_tx returns 65536 which is a bogus queue and
    can cause crash in dev_queue_xmit.

    We fix this by only calling sk_tx_queue_get which does the proper
    checks. The interface is that sk_tx_queue_get returns the TX queue
    if the sock argument is non-NULL and TX queue is recorded, else it
    returns -1. sk_tx_queue_recorded is no longer used so it can be
    completely removed.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
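
The interface described above can be sketched as follows. This is a single-threaded userspace model with hypothetical names (`struct fake_sock`, `tx_queue_get`, `pick_tx`). It shows the shape of the fix, one read followed by a check of the returned value, rather than the kernel implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical socket stand-in; -1 means "no TX queue recorded". */
struct fake_sock {
    int tx_queue_mapping;
};

/* Returns the recorded queue, or -1 if sk is NULL or nothing is recorded. */
static int tx_queue_get(const struct fake_sock *sk)
{
    return sk ? sk->tx_queue_mapping : -1;
}

/*
 * Read the mapping once and test the value that was read, instead of
 * asking "is it recorded?" and then reading it again in a second call.
 */
static int pick_tx(const struct fake_sock *sk, int fallback)
{
    int queue = tx_queue_get(sk);

    return queue >= 0 ? queue : fallback;
}

/* What happens if -1 is stored into a 16 bit queue index unchecked. */
static unsigned int queue_index_from(int value)
{
    return (uint16_t)value;
}
```

`queue_index_from(-1)` yields 65535, the value named in the message as ending up in queue_index when the result is not checked.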
     

13 Jul, 2010

1 commit

  • A new boolean flag no_autobind is added to structure proto to avoid the autobind
    calls when the protocol is TCP. Then sock_rps_record_flow() is called in the
    TCP's sendmsg() and sendpage() paths.

    Signed-off-by: Changli Gao
    ----
    include/net/inet_common.h | 4 ++++
    include/net/sock.h | 1 +
    include/net/tcp.h | 8 ++++----
    net/ipv4/af_inet.c | 15 +++++++++------
    net/ipv4/tcp.c | 11 +++++------
    net/ipv4/tcp_ipv4.c | 3 +++
    net/ipv6/af_inet6.c | 8 ++++----
    net/ipv6/tcp_ipv6.c | 3 +++
    8 files changed, 33 insertions(+), 20 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao
     

17 Jun, 2010

1 commit

  • Use struct pid and struct cred to store the peer credentials on struct
    sock. This gives enough information to convert the peer credential
    information to a value relative to whatever namespace the socket is in
    at the time.

    This removes nasty surprises when using SO_PEERCRED on socket
    connections where the processes on either side are in different pid and
    user namespaces.

    Signed-off-by: Eric W. Biederman
    Acked-by: Daniel Lezcano
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman