17 Dec, 2011

1 commit


05 Dec, 2011

1 commit

  • We discovered that the TCP stack could retransmit misaligned skbs if a
    malicious peer acknowledged a sub-MSS frame. This can currently happen
    only if the output interface is not SG enabled: if SG is enabled, TCP
    builds headless skbs (all payload is included in fragments), so the TCP
    trimming process only removes parts of skb fragments and headers stay
    aligned.

    Some arches can't handle misalignments, so force a head reallocation and
    shrink headroom to MAX_TCP_HEADER.

    Don't care about misalignments on x86 and PPC (or other arches setting
    NET_IP_ALIGN to 0).

    This patch introduces __pskb_copy(), which can specify the headroom of
    the new head, and pskb_copy() becomes a wrapper on top of __pskb_copy().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
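
The relationship between the two copy helpers can be illustrated with a small userspace model. This is only a sketch: `struct buf`, `buf_copy_headroom()`, `buf_copy()` and `buf_trim_front()` are hypothetical stand-ins, not kernel APIs, and the model covers a linear head only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb that has a linear head only. */
struct buf {
    unsigned char *head;  /* start of the allocation */
    size_t headroom;      /* offset of the payload inside head */
    size_t len;           /* payload length */
};

/* Models __pskb_copy(): the caller picks the headroom of the new head. */
static struct buf *buf_copy_headroom(const struct buf *src, size_t headroom)
{
    struct buf *n = malloc(sizeof(*n));

    if (!n)
        return NULL;
    n->head = malloc(headroom + src->len + 1); /* +1 avoids malloc(0) */
    if (!n->head) {
        free(n);
        return NULL;
    }
    n->headroom = headroom;
    n->len = src->len;
    memcpy(n->head + headroom, src->head + src->headroom, src->len);
    return n;
}

/* Models pskb_copy(): a thin wrapper that keeps the source headroom. */
static struct buf *buf_copy(const struct buf *src)
{
    return buf_copy_headroom(src, src->headroom);
}

/* Drop acknowledged bytes from the front, as a partial ACK would. */
static void buf_trim_front(struct buf *b, size_t n)
{
    assert(n <= b->len);
    b->headroom += n;
    b->len -= n;
}

static void buf_free(struct buf *b)
{
    if (b) {
        free(b->head);
        free(b);
    }
}
```

After trimming an odd number of bytes, the plain copy inherits the odd payload offset, while the copy with an explicit headroom starts the payload at a chosen, aligned offset again.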
     

27 Nov, 2011

1 commit


23 Nov, 2011

1 commit


18 Nov, 2011

1 commit


17 Nov, 2011

1 commit


15 Nov, 2011

1 commit

  • One of the things we discussed during the netdev 2011 conference was the
    idea of changing some network drivers to allocate/populate their skbs at
    RX completion time, right before feeding the skb to the network stack.

    In the old days, we allocated skbs when populating the RX ring.

    This means bringing the sk_buff and skb_shared_info cache lines into the
    cpu cache (since we clear/initialize them), then 'queueing' skb->data to
    the NIC.

    By the time the NIC fills a frame into the skb->data buffer and the host
    can process it, the cpu has probably evicted those cache lines, because a
    lot of things happened between the allocation and the final use.

    So the deal would be to allocate only the data buffer for the NIC to
    populate its RX ring buffer. And use build_skb() at RX completion to
    attach a data buffer (now filled with an ethernet frame) to a new skb,
    initialize the skb_shared_info portion, and give the hot skb to network
    stack.

    build_skb() is the function that allocates an skb, with the caller
    providing the data buffer that should be attached to it. Drivers are
    expected to call skb_reserve() right after build_skb() to adjust
    skb->data to the Ethernet frame (usually skipping NET_SKB_PAD and
    NET_IP_ALIGN, but some drivers might add a hardware provided alignment).

    Data provided to build_skb() MUST have been allocated by a prior
    kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
    skb_shared_info)) bytes at the end of the data without corrupting
    incoming frame.

    data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
                   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
                   GFP_ATOMIC);
    ...
    skb = build_skb(data);
    if (!skb) {
            recycle_data(data);
    } else {
            skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
            ...
    }

    Signed-off-by: Eric Dumazet
    CC: Eilon Greenstein
    CC: Ben Hutchings
    CC: Tom Herbert
    CC: Jamal Hadi Salim
    CC: Stephen Hemminger
    CC: Thomas Graf
    CC: Herbert Xu
    CC: Jeff Kirsher
    Signed-off-by: David S. Miller

    Eric Dumazet
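
The buffer layout that build_skb() relies on can be modelled in userspace. The following is a sketch under stated assumptions: `struct pkt`, `struct shared_info`, `build_pkt()`, `pkt_reserve()` and the constants are hypothetical stand-ins. Unlike the kernel function, which takes only the data pointer, this model takes the buffer size explicitly because userspace has no ksize().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_BYTES 64u        /* assumed cache line size */
#define RX_PAD      (64u + 2u) /* stand-in for NET_SKB_PAD + NET_IP_ALIGN */
#define DATA_ALIGN(x) \
    (((size_t)(x) + CACHE_BYTES - 1) & ~(size_t)(CACHE_BYTES - 1))

/* Hypothetical, trimmed-down skb_shared_info. */
struct shared_info {
    unsigned int dataref;
    unsigned int nr_frags;
};

/* Hypothetical, trimmed-down sk_buff. */
struct pkt {
    unsigned char *head; /* start of the data buffer */
    unsigned char *data; /* current start of the frame */
    size_t end;          /* offset of shared_info from head */
};

/* RX buffer size: aligned frame area plus room for shared_info. */
static size_t rx_buf_size(size_t frame_max)
{
    return DATA_ALIGN(RX_PAD + frame_max) +
           DATA_ALIGN(sizeof(struct shared_info));
}

static struct shared_info *pkt_shinfo(const struct pkt *p)
{
    return (struct shared_info *)(p->head + p->end);
}

/* Models build_skb(): wrap an already filled buffer in fresh metadata. */
static struct pkt *build_pkt(void *data, size_t buf_size)
{
    struct pkt *p = malloc(sizeof(*p));

    if (!p)
        return NULL; /* the caller keeps ownership of data */
    p->head = data;
    p->data = data;
    p->end = buf_size - DATA_ALIGN(sizeof(struct shared_info));
    memset(pkt_shinfo(p), 0, sizeof(struct shared_info));
    pkt_shinfo(p)->dataref = 1;
    return p;
}

/* Models skb_reserve(): skip the padding in front of the frame. */
static void pkt_reserve(struct pkt *p, size_t n)
{
    p->data += n;
}
```

Only the data buffer exists while the frame is being received; the metadata is allocated and initialized at completion time, so it is touched right before use.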
     

10 Nov, 2011

1 commit

  • The 802.1X EAPOL handshake that hostapd performs requires
    knowing whether the frame was ack'ed by the peer.
    Currently, we fudge this pretty badly by not even
    transmitting the frame as a normal data frame but
    injecting it with radiotap and getting the status
    out of radiotap monitor as well. This is rather
    complex, confuses users (mon.wlan0 presence) and
    doesn't work with all hardware.

    To get rid of that hack, introduce a real wifi TX
    status option for data frame transmissions.

    This works similarly to the existing TX timestamping
    in that it reflects the SKB back to the socket's
    error queue with a SCM_WIFI_STATUS cmsg that has
    an int indicating ACK status (0/1).

    Since it is possible that at some point we will
    want to have TX timestamping and wifi status in a
    single errqueue SKB (there's little point in not
    doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
    to SO_EE_ORIGIN_TXSTATUS which can collect more
    than just the timestamp; keep the old constant
    as an alias of course. Currently the internal APIs
    don't make that possible, but it wouldn't be hard
    to split them up in a way that makes it possible.

    Thanks to Neil Horman for helping me figure out
    the functions that add the control messages.

    Signed-off-by: Johannes Berg
    Signed-off-by: John W. Linville

    Johannes Berg
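
On the receiving side, an application would look for the control message in what it reads from the error queue. The helper below is a minimal sketch: `wifi_ack_status()` is a hypothetical name, and the fallback value 41 for SO_WIFI_STATUS is the asm-generic one, which is an assumption that does not hold on every architecture.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_WIFI_STATUS
#define SO_WIFI_STATUS 41 /* asm-generic value; some arches differ */
#endif
#ifndef SCM_WIFI_STATUS
#define SCM_WIFI_STATUS SO_WIFI_STATUS
#endif

/*
 * Scan the control messages of a received msghdr.
 * Returns 1 if the frame was acked, 0 if it was not,
 * and -1 if no wifi status message is present.
 */
static int wifi_ack_status(struct msghdr *msg)
{
    struct cmsghdr *cm;

    for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
        if (cm->cmsg_level == SOL_SOCKET &&
            cm->cmsg_type == SCM_WIFI_STATUS &&
            cm->cmsg_len >= CMSG_LEN(sizeof(int))) {
            int acked;

            memcpy(&acked, CMSG_DATA(cm), sizeof(acked));
            return acked ? 1 : 0;
        }
    }
    return -1;
}
```

In a real program the msghdr would come from recvmsg() with MSG_ERRQUEUE on a socket that has enabled the option with setsockopt(fd, SOL_SOCKET, SO_WIFI_STATUS, ...).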
     

04 Nov, 2011

1 commit

  • Commit 87fb4b7b533073eeeaed0b6bf7c2328995f6c075 (net: more
    accurate skb truesize) changed the alignment of size. This
    can cause problems at least on some machines with NFS root:

    Unhandled fault: alignment exception (0x801) at 0xc183a43a
    Internal error: : 801 [#1] PREEMPT
    Modules linked in:
    CPU: 0 Not tainted (3.1.0-08784-g5eeee4a #733)
    pc : [] lr : [] psr: 60000013
    sp : c180fef8 ip : 00000000 fp : c181f580
    r10: 00000000 r9 : c044b28c r8 : 00000001
    r7 : c183a3a0 r6 : c1835be0 r5 : c183a412 r4 : 000001f2
    r3 : 00000000 r2 : 00000000 r1 : ffffffe6 r0 : c183a43a
    Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
    Control: 0005317f Table: 10004000 DAC: 00000017
    Process swapper (pid: 1, stack limit = 0xc180e270)
    Stack: (0xc180fef8 to 0xc1810000)
    fee0: 00000024 00000000
    ff00: 00000000 c183b9c0 c183b8e0 c044b28c c0507ccc c019dfc4 c180ff2c c0503cf8
    ff20: c180ff4c c180ff4c 00000000 c1835420 c182c740 c18349c0 c05233c0 00000000
    ff40: 00000000 c00e6bb8 c180e000 00000000 c04dd82c c0507e7c c050cc18 c183b9c0
    ff60: c05233c0 00000000 00000000 c01f34f4 c0430d70 c019d364 c04dd898 c04dd898
    ff80: c04dd82c c0507e7c c180e000 00000000 c04c584c c01f4918 c04dd898 c04dd82c
    ffa0: c04ddd28 c180e000 00000000 c0008758 c181fa60 3231d82c 00000037 00000000
    ffc0: 00000000 c04dd898 c04dd82c c04ddd28 00000013 00000000 00000000 00000000
    ffe0: 00000000 c04b2224 00000000 c04b21a0 c001056c c001056c 00000000 00000000
    Function entered at [] from []
    Function entered at [] from []
    Function entered at [] from []
    Function entered at [] from []
    Function entered at [] from []
    Function entered at [] from []
    Code: e1a00005 e3a01028 ebfa7cb0 e35a0000 (e5858028)

    Here PC is at __alloc_skb and &shinfo->dataref is unaligned because
    skb->end can be unaligned without this patch.

    As explained by Eric Dumazet, this happens
    only with SLOB, and not with SLAB or SLUB:

    * Eric Dumazet [111102 15:56]:
    >
    > Your patch is absolutely needed, I completely forgot about SLOB :(
    >
    > since, kmalloc(386) on SLOB gives exactly ksize=386 bytes, not nearest
    > power of two.
    >
    > [ 60.305763] malloc(size=385)->ffff880112c11e38 ksize=386 -> nsize=2
    > [ 60.305921] malloc(size=385)->ffff88007c92ce28 ksize=386 -> nsize=2
    > [ 60.306898] malloc(size=656)->ffff88007c44ad28 ksize=656 -> nsize=272
    > [ 60.325385] malloc(size=656)->ffff88007c575868 ksize=656 -> nsize=272
    > [ 60.325531] malloc(size=656)->ffff88011c777230 ksize=656 -> nsize=272
    > [ 60.325701] malloc(size=656)->ffff880114011008 ksize=656 -> nsize=272
    > [ 60.346716] malloc(size=385)->ffff880114142008 ksize=386 -> nsize=2
    > [ 60.346900] malloc(size=385)->ffff88011c777690 ksize=386 -> nsize=2

    Signed-off-by: Tony Lindgren
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Tony Lindgren
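
The arithmetic behind the unaligned shared info can be replayed in a small model. This is a sketch under assumptions: a 64-byte cache line, a 384-byte aligned shared-info overhead (inferred from 656 - 272 in the quoted log), and an allocator that rounds requests up to 2 bytes only (inferred from 385 -> 386). All names are hypothetical, and the "aligned" variant only illustrates rounding the payload size up before adding the overhead.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BYTES  64u  /* assumed cache line size */
#define SHINFO_BYTES 384u /* assumed aligned shared-info overhead */
#define DATA_ALIGN(x) \
    (((size_t)(x) + CACHE_BYTES - 1) & ~(size_t)(CACHE_BYTES - 1))

/* SLOB-like model: the usable size is the request rounded up to 2 bytes. */
static size_t slob_ksize(size_t request)
{
    return (request + 1) & ~(size_t)1;
}

/* Offset of the shared info when the payload size is not aligned first. */
static size_t shinfo_offset_unaligned(size_t size)
{
    return slob_ksize(size + SHINFO_BYTES) - SHINFO_BYTES;
}

/* Offset of the shared info when the payload size is aligned first. */
static size_t shinfo_offset_aligned(size_t size)
{
    return slob_ksize(DATA_ALIGN(size) + SHINFO_BYTES) - SHINFO_BYTES;
}
```

With a 1-byte payload the unaligned variant places the shared info at offset 2, matching the `nsize=2` lines in the log, whereas the aligned variant places it at offset 64. An allocator that rounds to a power of two hides the problem, which fits the observation that only SLOB was affected.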
     

21 Oct, 2011

1 commit


20 Oct, 2011

1 commit

  • skb_recycle_check resets the skb if it's eligible for recycling.
    However, there are times when a driver might want to
    manipulate the skb data after it has determined eligibility
    but before resetting the skb. We allow this by splitting the
    eligibility check from the skb reset, creating two inline functions to
    accomplish that task.

    Signed-off-by: Andy Fleming
    Acked-by: David Daney
    Signed-off-by: David S. Miller

    Andy Fleming
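
The split can be pictured with a userspace model in which the combined helper is built from a pure check and a separate reset. This is only a sketch: `struct rbuf` and the three functions are hypothetical names, and the eligibility rule shown (not shared, large enough) is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down receive buffer. */
struct rbuf {
    size_t capacity; /* bytes available in the buffer */
    bool shared;     /* someone else still references it */
    size_t headroom; /* current payload offset */
    size_t len;      /* current payload length */
};

/* Eligibility check only: has no side effects on the buffer. */
static bool rbuf_is_recycleable(const struct rbuf *b, size_t need)
{
    return !b->shared && b->capacity >= need;
}

/* Reset only: puts the buffer back into its freshly allocated state. */
static void rbuf_reset(struct rbuf *b)
{
    b->headroom = 0;
    b->len = 0;
}

/* The original combined behaviour, now expressed with the two halves. */
static bool rbuf_recycle_check(struct rbuf *b, size_t need)
{
    if (!rbuf_is_recycleable(b, need))
        return false;
    rbuf_reset(b);
    return true;
}
```

A driver that needs to inspect the old contents can call the check, do its work, and call the reset afterwards.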
     

19 Oct, 2011

1 commit

  • To ease skb->truesize sanitization, it's better to be able to localize
    all references to skb frag sizes.

    Define accessors : skb_frag_size() to fetch frag size, and
    skb_frag_size_{set|add|sub}() to manipulate it.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
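
The accessor pattern can be sketched as follows. `struct frag` is a hypothetical, trimmed-down fragment descriptor and the function names mirror the ones named in the message without the skb_ prefix; the bodies are assumptions about what such helpers look like.

```c
#include <assert.h>

/* Hypothetical, trimmed-down fragment descriptor. */
struct frag {
    unsigned int page_offset;
    unsigned int size;
};

/* Read the fragment size. */
static inline unsigned int frag_size(const struct frag *f)
{
    return f->size;
}

/* Overwrite the fragment size. */
static inline void frag_size_set(struct frag *f, unsigned int size)
{
    f->size = size;
}

/* Grow the fragment by delta bytes. */
static inline void frag_size_add(struct frag *f, int delta)
{
    f->size += delta;
}

/* Shrink the fragment by delta bytes. */
static inline void frag_size_sub(struct frag *f, int delta)
{
    f->size -= delta;
}
```

Once callers go through these helpers, every read and write of the size field is easy to find when auditing truesize accounting.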
     

14 Oct, 2011

1 commit

  • skb truesize currently accounts for sk_buff struct and part of skb head.
    kmalloc() roundings are also ignored.

    Considering that skb_shared_info is larger than sk_buff, it's time to
    take it into account for better memory accounting.

    This patch introduces SKB_TRUESIZE(X) macro to centralize various
    assumptions into a single place.

    At skb alloc phase, we put skb_shared_info struct at the exact end of
    skb head, to allow a better use of memory (lowering number of
    reallocations), since kmalloc() gives us power-of-two memory blocks.

    Unless SLUB/SLAB debug is active, both skb->head and skb_shared_info are
    aligned to cache lines, as before.

    Note: This patch might trigger performance regressions because of
    misconfigured protocol stacks hitting per socket or global memory
    limits that were previously not reached. But it's a necessary step for
    more accurate memory accounting.

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
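
The effect of the new accounting can be illustrated with a model of the macro. This is a sketch: the structure sizes (232 and 320 bytes) and the 64-byte cache line are made-up stand-ins chosen only so that the shared info is the larger of the two, and `truesize_old()` assumes the previous accounting was payload size plus the unaligned metadata size.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BYTES 64u /* assumed cache line size */
#define DATA_ALIGN(x) \
    (((size_t)(x) + CACHE_BYTES - 1) & ~(size_t)(CACHE_BYTES - 1))

/* Stand-ins with made-up sizes; the real structures differ per config. */
struct meta   { char pad[232]; }; /* plays the role of sk_buff */
struct shared { char pad[320]; }; /* plays the role of skb_shared_info */

/* Model of the centralized macro: payload plus both aligned overheads. */
#define TRUESIZE(x) ((size_t)(x) + DATA_ALIGN(sizeof(struct meta)) + \
                     DATA_ALIGN(sizeof(struct shared)))

/* Assumed previous accounting: payload plus metadata struct only. */
static size_t truesize_old(size_t x)
{
    return x + sizeof(struct meta);
}

/* How many packets of a given payload fit into a memory budget. */
static size_t packets_in_budget(size_t budget, size_t per_packet)
{
    return budget / per_packet;
}
```

With these stand-in sizes, a 1500-byte payload is charged 2076 bytes instead of 1732, so a fixed socket budget admits fewer packets, which is the kind of limit the note in the message warns about.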
     

22 Sep, 2011

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/Kconfig
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
    drivers/net/ethernet/broadcom/tg3.c
    drivers/net/wireless/iwlwifi/iwl-pci.c
    drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
    drivers/net/wireless/rt2x00/rt2800usb.c
    drivers/net/wireless/wl12xx/main.c

    David S. Miller
     

16 Sep, 2011

1 commit

  • dev_forward_skb loops an skb back into the host networking
    stack, which might hang on to the memory indefinitely.
    In particular, this can happen in macvtap in bridged mode.
    Copy the userspace fragments to avoid blocking the
    sender in that case.

    As this patch makes skb_copy_ubufs extern now,
    I also added some documentation and made it automatically
    clear the SKBTX_DEV_ZEROCOPY flag instead of doing that
    in all callers. This can be made into a separate
    patch if people feel it's worth it.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     

25 Aug, 2011

1 commit


21 Aug, 2011

1 commit


18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4 tuple for the
    packet which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_rx_tunnel.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

02 Aug, 2011

1 commit


22 Jul, 2011

1 commit

  • There are two problems:
    1) "n" was allocated with alloc_skb() so we should free it with
    kfree_skb() instead of regular kfree().
    2) We return the freed pointer instead of NULL.

    Signed-off-by: Dan Carpenter
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Dan Carpenter
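
Both problems follow a general pattern that can be shown with a userspace model. This is only a sketch: `struct obj`, its allocator and destructor, and `obj_clone_checked()` with its `simulate_failure` flag are hypothetical and merely stand in for an skb, alloc_skb(), kfree_skb() and a copy step that can fail.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object that owns a second allocation, like an skb does. */
struct obj {
    unsigned char *data;
    size_t len;
};

/* Matching constructor: allocates the object and its data. */
static struct obj *obj_alloc(size_t len)
{
    struct obj *o = malloc(sizeof(*o));

    if (!o)
        return NULL;
    o->data = malloc(len ? len : 1);
    if (!o->data) {
        free(o);
        return NULL;
    }
    o->len = len;
    return o;
}

/* Matching destructor: releases the data and then the object. */
static void obj_free(struct obj *o)
{
    if (o) {
        free(o->data);
        free(o);
    }
}

/* Copy src; simulate_failure stands in for a step that can fail. */
static struct obj *obj_clone_checked(const struct obj *src, int simulate_failure)
{
    struct obj *n = obj_alloc(src->len);

    if (!n)
        return NULL;
    if (simulate_failure) {
        obj_free(n); /* plain free(n) would leak n->data */
        return NULL; /* never hand the freed pointer back */
    }
    memcpy(n->data, src->data, src->len);
    return n;
}
```

The destructor that matches the allocator releases everything the object owns, and returning NULL keeps callers from touching freed memory.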
     

09 Jul, 2011

1 commit


07 Jul, 2011

1 commit

  • This patch adds userspace buffer support in skb shared info. A new
    struct skb_ubuf_info is needed to maintain the userspace buffers'
    argument and index; a callback is used to notify userspace to release
    the buffers once the lower device has done DMA (the last reference to
    that skb has gone).

    If any userspace apps reference these userspace buffers, the buffers
    are copied into the kernel. This way we can prevent userspace apps
    from holding these userspace buffers too long.

    Use destructor_arg to point to the userspace buffer info; a new tx flag,
    SKBTX_DEV_ZEROCOPY, is added for the zero-copy buffer check.

    Signed-off-by: Shirley Ma
    Signed-off-by: David S. Miller

    Shirley Ma
     

21 May, 2011

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
    macvlan: fix panic if lowerdev in a bond
    tg3: Add braces around 5906 workaround.
    tg3: Fix NETIF_F_LOOPBACK error
    macvlan: remove one synchronize_rcu() call
    networking: NET_CLS_ROUTE4 depends on INET
    irda: Fix error propagation in ircomm_lmp_connect_response()
    irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
    irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
    rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
    be2net: Kill set but unused variable 'req' in lancer_fw_download()
    irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
    atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
    rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
    rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
    rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
    rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
    pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
    isdn: capi: Use pr_debug() instead of ifdefs.
    tg3: Update version to 3.119
    tg3: Apply rx_discards fix to 5719/5720
    ...

    Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
    as per Davem.

    Linus Torvalds
     
  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 May, 2011

1 commit

  • Commit 7fee226ad239 (add a noref bit on skb dst) forgot to use
    skb_dst_force() on packets queued in sk_error_queue.

    This triggers the following warning for applications using
    IP_CMSG_PKTINFO when they receive an error status:

    ------------[ cut here ]------------
    WARNING: at include/linux/skbuff.h:457 ip_cmsg_recv_pktinfo+0xa6/0xb0()
    Hardware name: 2669UYD
    Modules linked in: isofs vboxnetadp vboxnetflt nfsd ebtable_nat ebtables
    lib80211_crypt_ccmp uinput xcbc hdaps tp_smapi thinkpad_ec radeonfb fb_ddc
    radeon ttm drm_kms_helper drm ipw2200 intel_agp intel_gtt libipw i2c_algo_bit
    i2c_i801 agpgart rng_core cfbfillrect cfbcopyarea cfbimgblt video raid10 raid1
    raid0 linear md_mod vboxdrv
    Pid: 4697, comm: miredo Not tainted 2.6.39-rc6-00569-g5895198-dirty #22
    Call Trace:
    [] ? printk+0x1d/0x1f
    [] warn_slowpath_common+0x72/0xa0
    [] ? ip_cmsg_recv_pktinfo+0xa6/0xb0
    [] ? ip_cmsg_recv_pktinfo+0xa6/0xb0
    [] warn_slowpath_null+0x20/0x30
    [] ip_cmsg_recv_pktinfo+0xa6/0xb0
    [] ip_cmsg_recv+0x127/0x260
    [] ? skb_dequeue+0x4d/0x70
    [] ? skb_copy_datagram_iovec+0x53/0x300
    [] ? sub_preempt_count+0x24/0x50
    [] ip_recv_error+0x23d/0x270
    [] udp_recvmsg+0x264/0x2b0
    [] inet_recvmsg+0xd9/0x130
    [] sock_recvmsg+0xf2/0x120
    [] ? might_fault+0x4b/0xa0
    [] ? verify_iovec+0x4c/0xc0
    [] ? sock_recvmsg_nosec+0x100/0x100
    [] __sys_recvmsg+0x114/0x1e0
    [] ? __lock_acquire+0x365/0x780
    [] ? fget_light+0xa6/0x3e0
    [] ? fget_light+0xbf/0x3e0
    [] ? fget_light+0x2e/0x3e0
    [] sys_recvmsg+0x39/0x60

    Close bug https://bugzilla.kernel.org/show_bug.cgi?id=34622

    Reported-by: Witold Baryluk
    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Mar, 2011

1 commit


17 Mar, 2011

1 commit


02 Mar, 2011

1 commit

  • UFO doesn't really use the sk_sndmsg_* parameters so touching
    them is pointless. It can't use them anyway since the whole
    point of UFO is to use the original pages without copying.

    Signed-off-by: Herbert Xu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Herbert Xu
     

28 Jan, 2011

2 commits


27 Jan, 2011

1 commit


25 Jan, 2011

2 commits

  • Quoting Ben Hutchings: we presumably won't be defining features that
    can only be enabled on 64-bit architectures.

    Occurrences found by `grep -r` on net/, drivers/net, include/

    [ Move features and vlan_features next to each other in
    struct netdev, as per Eric Dumazet's suggestion -DaveM ]

    Signed-off-by: Michał Mirosław
    Signed-off-by: David S. Miller

    Michał Mirosław
     
  • Suppose that several linear skbs of the same flow were received by GRO. They
    were thus merged into one skb with a frag_list. Then a new skb of the same flow
    arrives, but it is a paged skb with data starting in its frags[].

    Before adding the skb to the frag_list skb_gro_receive() will of course adjust
    the skb to throw away the headers. It correctly modifies the page_offset and
    size of the frag, but it leaves incorrect information in the skb:
    ->data_len is not decreased at all.
    ->len is decreased only by headlen, as if no change were done to the frag.
    Later in a receiving process this causes skb_copy_datagram_iovec() to return
    -EFAULT and this is seen in userspace as the result of the recv() syscall.

    In practice the bug can be reproduced with the sfc driver. By default the
    driver uses an adaptive scheme when it switches between using
    napi_gro_receive() (with skbs) and napi_gro_frags() (with pages). The bug is
    reproduced when under rx load with enough successful GRO merging the driver
    decides to switch from the former to the latter.

    Manual control is also possible, so reproducing this is easy with netcat:
    - on machine1 (with sfc): nc -l 12345 > /dev/null
    - on machine2: nc machine1 12345 < /dev/zero
    - on machine1:
    echo 1 > /sys/module/sfc/parameters/rx_alloc_method # use skbs
    echo 2 > /sys/module/sfc/parameters/rx_alloc_method # use pages
    - See that nc has quit suddenly.

    [v2: Modified by Eric Dumazet to avoid advancing skb->data past the end
    and to use a temporary variable.]

    Signed-off-by: Michal Schmidt
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Michal Schmidt
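
The length bookkeeping described above can be checked with a small model. This is a sketch under simplifying assumptions: `struct pkt` is hypothetical, it tracks a single fragment, and `pkt_pull_headers()` shows one consistent way to account for header bytes that live in the fragment rather than in the linear area.

```c
#include <assert.h>

/* Hypothetical skb model with a linear area and a single page fragment. */
struct pkt {
    unsigned int len;       /* total bytes: linear + paged */
    unsigned int data_len;  /* paged bytes only */
    unsigned int frag_off;  /* offset of the fragment data in its page */
    unsigned int frag_size; /* bytes in the fragment */
};

/* Bytes held in the linear area. */
static unsigned int pkt_headlen(const struct pkt *p)
{
    return p->len - p->data_len;
}

/* Throw away `offset` header bytes, wherever they are stored. */
static void pkt_pull_headers(struct pkt *p, unsigned int offset)
{
    unsigned int headlen = pkt_headlen(p);

    assert(offset <= p->len);
    if (offset > headlen) {
        /* Part of the headers sits in the fragment: eat it there. */
        unsigned int eat = offset - headlen;

        assert(eat <= p->frag_size);
        p->frag_off += eat;
        p->frag_size -= eat;
        p->data_len -= eat; /* the step the buggy path skipped */
        p->len -= eat;
        offset = headlen;
    }
    p->len -= offset; /* pull what remains from the linear area */
}

/* With one fragment, the paged byte count must equal the fragment size. */
static int pkt_consistent(const struct pkt *p)
{
    return p->data_len == p->frag_size && p->len >= p->data_len;
}
```

If data_len is left untouched while the fragment shrinks, the counters disagree with the fragment contents, which is the kind of inconsistency a later copy to userspace trips over.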
     

13 Jan, 2011

1 commit

  • The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but
    failed to update the #ifdef stanzas guarding the defragmentation related
    fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c.

    This patch adds the required #ifdefs so that IPv6 tproxy can truly be used
    without connection tracking.

    Original report:
    http://marc.info/?l=linux-netdev&m=129010118516341&w=2

    Reported-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Signed-off-by: KOVACS Krisztian
    Signed-off-by: Pablo Neira Ayuso

    KOVACS Krisztian
     

17 Dec, 2010

1 commit


04 Dec, 2010

1 commit


24 Oct, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
    bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
    vlan: Calling vlan_hwaccel_do_receive() is always valid.
    tproxy: use the interface primary IP address as a default value for --on-ip
    tproxy: added IPv6 support to the socket match
    cxgb3: function namespace cleanup
    tproxy: added IPv6 support to the TPROXY target
    tproxy: added IPv6 socket lookup function to nf_tproxy_core
    be2net: Changes to use only priority codes allowed by f/w
    tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
    tproxy: added tproxy sockopt interface in the IPV6 layer
    tproxy: added udp6_lib_lookup function
    tproxy: added const specifiers to udp lookup functions
    tproxy: split off ipv6 defragmentation to a separate module
    l2tp: small cleanup
    nf_nat: restrict ICMP translation for embedded header
    can: mcp251x: fix generation of error frames
    can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
    can-raw: add msg_flags to distinguish local traffic
    9p: client code cleanup
    rds: make local functions/variables static
    ...

    Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
    drivers/net/wireless/ath/ath9k/debug.c as per David

    Linus Torvalds
     

17 Oct, 2010

1 commit

  • Commit b30973f877 (node-aware skb allocation) spread the bad habit of
    allocating net driver skbs on a given memory node: the one closest to
    the NIC hardware. This is wrong because as soon as we try to scale the
    network stack, we need to use many cpus to handle traffic, and we hit
    slub/slab management on cross-node allocations/frees when these cpus
    have to alloc/free skbs bound to a central node.

    skbs allocated in the RX path are ephemeral; they have a very short
    lifetime, so the extra cost of maintaining NUMA affinity is too high.
    What appeared to be a nice idea four years ago is in fact a bad one.

    In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
    and two 10Gb NICs might deliver more than 28 million packets per second,
    needing all the available cpus.

    The cost of cross-node handling in the network and vm stacks outweighs
    the small benefit the hardware had when doing its DMA transfer in its
    'local' memory node at RX time. Even trying to differentiate the two
    allocations done for one skb (the sk_buff on the local node, the data
    part on the NIC hardware node) is not enough to bring good performance.

    Signed-off-by: Eric Dumazet
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Sep, 2010

1 commit


10 Sep, 2010

1 commit