24 Dec, 2011

1 commit


23 Dec, 2011

1 commit

  • skb->truesize might be big even for a small packet.

    Its even bigger after commit 87fb4b7b533 (net: more accurate skb
    truesize) and big MTU.

    We should allow queueing at least one packet per receiver, even with a
    low RCVBUF setting.

    Reported-by: Michal Simek
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Dec, 2011

1 commit


14 Dec, 2011

1 commit


13 Dec, 2011

3 commits

  • This patch introduces memory pressure controls for the tcp
    protocol. It uses the generic socket memory pressure code
    introduced in earlier patches, and fills in the
    necessary data in cg_proto struct.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • The goal of this work is to move the memory pressure tcp
    controls to a cgroup, instead of just relying on global
    conditions.

    To avoid excessive overhead in the network fast paths,
    the code that accounts allocated memory to a cgroup is
    hidden inside a static_branch(). This branch is patched out
    until the first non-root cgroup is created. So when nobody
    is using cgroups, even if it is mounted, no significant performance
    penalty should be seen.

    This patch handles the generic part of the code, and has nothing
    tcp-specific.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Kirill A. Shutemov
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • This patch replaces all uses of struct sock fields' memory_pressure,
    memory_allocated, sockets_allocated, and sysctl_mem to acessor
    macros. Those macros can either receive a socket argument, or a mem_cgroup
    argument, depending on the context they live in.

    Since we're only doing a macro wrapping here, no performance impact at all is
    expected in the case where we don't have cgroups disabled.

    Signed-off-by: Glauber Costa
    Reviewed-by: Hiroyouki Kamezawa
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     

23 Nov, 2011

1 commit

  • This patch adds in the infrastructure code to create the network priority
    cgroup. The cgroup, in addition to the standard processes file creates two
    control files:

    1) prioidx - This is a read-only file that exports the index of this cgroup.
    This is a value that is both arbitrary and unique to a cgroup in this subsystem,
    and is used to index the per-device priority map

    2) priomap - This is a writeable file. On read it reports a table of 2-tuples
    where name is the name of a network interface and priority is
    indicates the priority assigned to frames egresessing on the named interface and
    originating from a pid in this cgroup

    This cgroup allows for skb priority to be set prior to a root qdisc getting
    selected. This is benenficial for DCB enabled systems, in that it allows for any
    application to use dcb configured priorities so without application modification

    Signed-off-by: Neil Horman
    Signed-off-by: John Fastabend
    CC: Robert Love
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

18 Nov, 2011

1 commit


17 Nov, 2011

1 commit


10 Nov, 2011

1 commit

  • The 802.1X EAPOL handshake hostapd does requires
    knowing whether the frame was ack'ed by the peer.
    Currently, we fudge this pretty badly by not even
    transmitting the frame as a normal data frame but
    injecting it with radiotap and getting the status
    out of radiotap monitor as well. This is rather
    complex, confuses users (mon.wlan0 presence) and
    doesn't work with all hardware.

    To get rid of that hack, introduce a real wifi TX
    status option for data frame transmissions.

    This works similar to the existing TX timestamping
    in that it reflects the SKB back to the socket's
    error queue with a SCM_WIFI_STATUS cmsg that has
    an int indicating ACK status (0/1).

    Since it is possible that at some point we will
    want to have TX timestamping and wifi status in a
    single errqueue SKB (there's little point in not
    doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
    to SO_EE_ORIGIN_TXSTATUS which can collect more
    than just the timestamp; keep the old constant
    as an alias of course. Currently the internal APIs
    don't make that possible, but it wouldn't be hard
    to split them up in a way that makes it possible.

    Thanks to Neil Horman for helping me figure out
    the functions that add the control messages.

    Signed-off-by: Johannes Berg
    Signed-off-by: John W. Linville

    Johannes Berg
     

09 Nov, 2011

1 commit


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

2 commits

  • Standardize the style for compiler based printf format verification.
    Standardized the location of __printf too.

    Done via script and a little typing.

    $ grep -rPl --include=*.[ch] -w "__attribute__" * | \
    grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
    xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'

    [akpm@linux-foundation.org: revert arch bits]
    Signed-off-by: Joe Perches
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The pretty much brings in the kitchen sink along
    with it, so it should be avoided wherever reasonably possible in
    terms of being included from other commonly used
    files, as it results in a measureable increase on compile times.

    The worst culprit was probably device.h since it is used everywhere.
    This file also had an implicit dependency/usage of mutex.h which was
    masked by module.h, and is also fixed here at the same time.

    There are over a dozen other headers that simply declare the
    struct instead of pulling in the whole file, so follow their lead
    and simply make it a few more.

    Most of the implicit dependencies on module.h being present by
    these headers pulling it in have been now weeded out, so we can
    finally make this change with hopefully minimal breakage.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4 tuple for the
    packet which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_rx_tunnel.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

26 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    fs: Merge split strings
    treewide: fix potentially dangerous trailing ';' in #defined values/expressions
    uwb: Fix misspelling of neighbourhood in comment
    net, netfilter: Remove redundant goto in ebt_ulog_packet
    trivial: don't touch files that are removed in the staging tree
    lib/vsprintf: replace link to Draft by final RFC number
    doc: Kconfig: `to be' -> `be'
    doc: Kconfig: Typo: square -> squared
    doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
    drivers/net: static should be at beginning of declaration
    drivers/media: static should be at beginning of declaration
    drivers/i2c: static should be at beginning of declaration
    XTENSA: static should be at beginning of declaration
    SH: static should be at beginning of declaration
    MIPS: static should be at beginning of declaration
    ARM: static should be at beginning of declaration
    rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
    Update my e-mail address
    PCIe ASPM: forcedly -> forcibly
    gma500: push through device driver tree
    ...

    Fix up trivial conflicts:
    - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
    - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
    - drivers/net/r8169.c (just context changes)

    Linus Torvalds
     

09 Jul, 2011

1 commit


07 Jul, 2011

1 commit


06 Jul, 2011

1 commit


28 Jun, 2011

2 commits

  • Fix 'make htmldocs' warnings:

    Warning(/include/linux/hrtimer.h:153): No description found for
    parameter 'clockid'
    Warning(/include/linux/device.h:604): Excess struct/union/enum/typedef
    member 'of_match' description in 'device'
    Warning(/include/net/sock.h:349): Excess struct/union/enum/typedef
    member 'sk_rmem_alloc' description in 'sock'

    Signed-off-by: Vitaliy Ivanov
    Acked-by: Grant Likely
    Acked-by: David S. Miller
    Acked-by: Randy Dunlap
    Signed-off-by: Jiri Kosina

    Vitaliy Ivanov
     
  • Fix 'make htmldocs' warnings:

    Warning(/include/linux/hrtimer.h:153): No description found for parameter 'clockid'
    Warning(/include/linux/device.h:604): Excess struct/union/enum/typedef member 'of_match' description in 'device'
    Warning(/include/net/sock.h:349): Excess struct/union/enum/typedef member 'sk_rmem_alloc' description in 'sock'

    Signed-off-by: Vitaliy Ivanov
    Acked-by: Grant Likely
    Acked-by: David S. Miller
    Signed-off-by: Linus Torvalds

    Vitaliy Ivanov
     

07 Jun, 2011

1 commit


12 Apr, 2011

1 commit


07 Apr, 2011

1 commit

  • commit c6e1a0d12ca7b4f22c58e55a16beacfb7d3d8462 broken the calc
    (net: Allow no-cache copy from user on transmit)
    of checksum, which may cause some tcp packets be dropped because
    incorrect checksum. ssh does not work under today's net-next-2.6
    tree.

    Signed-off-by: Wei Yongjun
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Wei Yongjun
     

05 Apr, 2011

1 commit

  • This patch uses __copy_from_user_nocache on transmit to bypass data
    cache for a performance improvement. skb_add_data_nocache and
    skb_copy_to_page_nocache can be called by sendmsg functions to use
    this feature, initial support is in tcp_sendmsg. This functionality is
    configurable per device using ethtool.

    Presumably, this feature would only be useful when the driver does
    not touch the data. The feature is turned on by default if a device
    indicates that it does some form of checksum offload; it is off by
    default for devices that do no checksum offload or indicate no checksum
    is necessary. For the former case copy-checksum is probably done
    anyway, in the latter case the device is likely loopback in which case
    the no cache copy is probably not beneficial.

    This patch was tested using 200 instances of netperf TCP_RR with
    1400 byte request and one byte reply. Platform is 16 core AMD x86.

    No-cache copy disabled:
    672703 tps, 97.13% utilization
    50/90/99% latency:244.31 484.205 1028.41

    No-cache copy enabled:
    702113 tps, 96.16% utilization,
    50/90/99% latency 238.56 467.56 956.955

    Using 14000 byte request and response sizes demonstrate the
    effects more dramatically:

    No-cache copy disabled:
    79571 tps, 34.34 %utlization
    50/90/95% latency 1584.46 2319.59 5001.76

    No-cache copy enabled:
    83856 tps, 34.81% utilization
    50/90/95% latency 2508.42 2622.62 2735.88

    Note especially the effect on latency tail (95th percentile).

    This seems to provide a nice performance improvement and is
    consistent in the tests I ran. Presumably, this would provide
    the greatest benfits in the presence of an application workload
    stressing the cache and a lot of transmit data happening.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

31 Mar, 2011

1 commit


23 Feb, 2011

1 commit


01 Feb, 2011

1 commit


30 Jan, 2011

1 commit

  • SIOCGETSGCNT is not a unique ioctl value as it it maps tio SIOCPROTOPRIVATE +1,
    which unfortunately means the existing infrastructure for compat networking
    ioctls is insufficient. A trivial compact ioctl implementation would conflict
    with:

    SIOCAX25ADDUID
    SIOCAIPXPRISLT
    SIOCGETSGCNT_IN6
    SIOCGETSGCNT
    SIOCRSSCAUSE
    SIOCX25SSUBSCRIP
    SIOCX25SDTEFACILITIES

    To make this work I have updated the compat_ioctl decode path to mirror the
    the normal ioctl decode path. I have added an ipv4 inet_compat_ioctl function
    so that I can have ipv4 specific compat ioctls. I have added a compat_ioctl
    function into struct proto so I can break out ioctls by which kind of ip socket
    I am using. I have added a compat_raw_ioctl function because SIOCGETSGCNT only
    works on raw sockets. I have added a ipmr_compat_ioctl that mirrors the normal
    ipmr_ioctl.

    This was necessary because unfortunately the struct layout for the SIOCGETSGCNT
    has unsigned longs in it so changes between 32bit and 64bit kernels.

    This change was sufficient to run a 32bit ip multicast routing daemon on a
    64bit kernel.

    Reported-by: Bill Fenner
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

19 Jan, 2011

1 commit


10 Jan, 2011

1 commit


18 Dec, 2010

1 commit


17 Dec, 2010

1 commit

  • Special care is taken inside sk_port_alloc to avoid overwriting
    skc_node/skc_nulls_node. We should also avoid overwriting
    skc_bind_node/skc_portaddr_node.

    The patch fixes the following crash:

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] udp4_lib_lookup2+0xad/0x370
    [] __udp4_lib_lookup+0x282/0x360
    [] __udp4_lib_rcv+0x31e/0x700
    [] ? ip_local_deliver_finish+0x65/0x190
    [] ? ip_local_deliver+0x88/0xa0
    [] udp_rcv+0x15/0x20
    [] ip_local_deliver_finish+0x65/0x190
    [] ip_local_deliver+0x88/0xa0
    [] ip_rcv_finish+0x32d/0x6f0
    [] ? netif_receive_skb+0x99c/0x11c0
    [] ip_rcv+0x2bb/0x350
    [] netif_receive_skb+0x99c/0x11c0

    Signed-off-by: Leonard Crestez
    Signed-off-by: Octavian Purdila
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Octavian Purdila
     

10 Dec, 2010

1 commit

  • Followup of commit b178bb3dfc30 (net: reorder struct sock fields)

    Optimize INET input path a bit further, by :

    1) moving sk_refcnt close to sk_lock.

    This reduces number of dirtied cache lines by one on 64bit arches (and
    64 bytes cache line size).

    2) moving inet_daddr & inet_rcv_saddr at the beginning of sk

    (same cache line than hash / family / bound_dev_if / nulls_node)

    This reduces number of accessed cache lines in lookups by one, and dont
    increase size of inet and timewait socks.
    inet and tw sockets now share same place-holder for these fields.

    Before patch :

    offsetof(struct sock, sk_refcnt) = 0x10
    offsetof(struct sock, sk_lock) = 0x40
    offsetof(struct sock, sk_receive_queue) = 0x60
    offsetof(struct inet_sock, inet_daddr) = 0x270
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

    After patch :

    offsetof(struct sock, sk_refcnt) = 0x44
    offsetof(struct sock, sk_lock) = 0x48
    offsetof(struct sock, sk_receive_queue) = 0x68
    offsetof(struct inet_sock, inet_daddr) = 0x0
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

    compute_score() (udp or tcp) now use a single cache line per ignored
    item, instead of two.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Dec, 2010

1 commit


07 Dec, 2010

1 commit

  • Pavel Emelyanov tried to fix a race between sk_filter_(de|at)tach and
    sk_clone() in commit 47e958eac280c263397

    Problem is we can have several clones sharing a common sk_filter, and
    these clones might want to sk_filter_attach() their own filters at the
    same time, and can overwrite old_filter->rcu, corrupting RCU queues.

    We can not use filter->rcu without being sure no other thread could do
    the same thing.

    Switch code to a more conventional ref-counting technique : Do the
    atomic decrement immediately and queue one rcu call back when last
    reference is released.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Dec, 2010

1 commit


17 Nov, 2010

2 commits

  • Right now, fields in struct sock are not optimally ordered, because each
    path (RX softirq, TX completion, RX user, TX user) has to touch fields
    that are contained in many different cache lines.

    The really critical thing is to shrink number of cache lines that are
    used at RX softirq time : CPU handling softirqs for a device can receive
    many frames per second for many sockets. If load is too big, we can drop
    frames at NIC level. RPS or multiqueue cards can help, but better reduce
    latency if possible.

    This patch starts with UDP protocol, then additional patches will try to
    reduce latencies of other ones as well.

    At RX softirq time, fields of interest for UDP protocol are :
    (not counting ones in inet struct for the lookup)

    Read/Written:
    sk_refcnt (atomic increment/decrement)
    sk_rmem_alloc & sk_backlog.len (to check if there is room in queues)
    sk_receive_queue
    sk_backlog (if socket locked by user program)
    sk_rxhash
    sk_forward_alloc
    sk_drops

    Read only:
    sk_rcvbuf (sk_rcvqueues_full())
    sk_filter
    sk_wq
    sk_policy[0]
    sk_flags

    Additional notes :

    - sk_backlog has one hole on 64bit arches. We can fill it to save 8
    bytes.
    - sk_backlog is used only if RX sofirq handler finds the socket while
    locked by user.
    - sk_rxhash is written only once per flow.
    - sk_drops is written only if queues are full

    Final layout :

    [1] One section grouping all read/write fields, but placing rxhash and
    sk_backlog at the end of this section.

    [2] One section grouping all read fields in RX handler
    (sk_filter, sk_rcv_buf, sk_wq)

    [3] Section used by other paths

    I'll post a patch on its own to put sk_refcnt at the end of struct
    sock_common so that it shares same cache line than section [1]

    New offsets on 64bit arch :

    sizeof(struct sock)=0x268
    offsetof(struct sock, sk_refcnt) =0x10
    offsetof(struct sock, sk_lock) =0x48
    offsetof(struct sock, sk_receive_queue)=0x68
    offsetof(struct sock, sk_backlog)=0x80
    offsetof(struct sock, sk_rmem_alloc)=0x80
    offsetof(struct sock, sk_forward_alloc)=0x98
    offsetof(struct sock, sk_rxhash)=0x9c
    offsetof(struct sock, sk_rcvbuf)=0xa4
    offsetof(struct sock, sk_drops) =0xa0
    offsetof(struct sock, sk_filter)=0xa8
    offsetof(struct sock, sk_wq)=0xb0
    offsetof(struct sock, sk_policy)=0xd0
    offsetof(struct sock, sk_flags) =0xe0

    Instead of :

    sizeof(struct sock)=0x270
    offsetof(struct sock, sk_refcnt) =0x10
    offsetof(struct sock, sk_lock) =0x50
    offsetof(struct sock, sk_receive_queue)=0xc0
    offsetof(struct sock, sk_backlog)=0x70
    offsetof(struct sock, sk_rmem_alloc)=0xac
    offsetof(struct sock, sk_forward_alloc)=0x10c
    offsetof(struct sock, sk_rxhash)=0x128
    offsetof(struct sock, sk_rcvbuf)=0x4c
    offsetof(struct sock, sk_drops) =0x16c
    offsetof(struct sock, sk_filter)=0x198
    offsetof(struct sock, sk_wq)=0x88
    offsetof(struct sock, sk_policy)=0x98
    offsetof(struct sock, sk_flags) =0x130

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • UDP sockets refcount is usually 2, unless an incoming frame is going to
    be queued in receive or backlog queue.

    Using atomic_inc_not_zero_hint() permits to reduce latency, because
    processor issues less memory transactions.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet