31 Dec, 2011

1 commit


28 Dec, 2011

1 commit


24 Dec, 2011

1 commit


23 Dec, 2011

1 commit

  • skb->truesize might be big even for a small packet.

    Its even bigger after commit 87fb4b7b533 (net: more accurate skb
    truesize) and big MTU.

    We should allow queueing at least one packet per receiver, even with a
    low RCVBUF setting.

    Reported-by: Michal Simek
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Nov, 2011

2 commits

  • packet: Add needed_tailroom to packet_sendmsg_spkt

    While auditing LL_ALLOCATED_SPACE I noticed that packet_sendmsg_spkt
    did not include needed_tailroom when allocating an skb. This isn't
    a fatal error as we should always tolerate inadequate tail room but
    it isn't optimal.

    This patch fixes that.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • net: Remove all uses of LL_ALLOCATED_SPACE

    The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the
    alignment to the sum of needed_headroom and needed_tailroom. As
    the amount that is then reserved for head room is needed_headroom
    with alignment, this means that the tail room left may be too small.

    This patch replaces all uses of LL_ALLOCATED_SPACE with the macro
    LL_RESERVED_SPACE and direct reference to needed_tailroom.

    This also fixes the problem with needed_headroom changing between
    allocating the skb and reserving the head room.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

04 Nov, 2011

1 commit

  • This popped some compiler errors due to mismatched prototypes. Just
    remove most manual inlines, the compiler should be able to figure out
    what makes sense to inline and not.

    net/packet/af_packet.c:252: warning: 'prb_curr_blk_in_use' declared inline after being called
    net/packet/af_packet.c:252: warning: previous declaration of 'prb_curr_blk_in_use' was here
    net/packet/af_packet.c:258: warning: 'prb_queue_frozen' declared inline after being called
    net/packet/af_packet.c:258: warning: previous declaration of 'prb_queue_frozen' was here
    net/packet/af_packet.c:248: warning: 'packet_previous_frame' declared inline after being called
    net/packet/af_packet.c:248: warning: previous declaration of 'packet_previous_frame' was here
    net/packet/af_packet.c:251: warning: 'packet_increment_head' declared inline after being called
    net/packet/af_packet.c:251: warning: previous declaration of 'packet_increment_head' was here

    Signed-off-by: Olof Johansson
    Cc: Chetan Loke
    Signed-off-by: David S. Miller

    Olof Johansson
     

19 Oct, 2011

1 commit

  • Fragmented multicast frames are delivered to a single macvlan port,
    because ip defrag logic considers other samples are redundant.

    Implement a defrag step before trying to send the multicast frame.

    Reported-by: Ben Greear
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Oct, 2011

1 commit


08 Oct, 2011

1 commit


04 Oct, 2011

1 commit

  • This is a minor change.

    Up until kernel 2.6.32, getsockopt(fd, SOL_PACKET, PACKET_STATISTICS,
    ...) would return total and dropped packets since its last invocation. The
    introduction of socket queue overflow reporting [1] changed drop
    rate calculation in the normal packet socket path, but not when using a
    packet ring. As a result, the getsockopt now returns different statistics
    depending on the reception method used. With a ring, it still returns the
    count since the last call, as counts are incremented in tpacket_rcv and
    reset in getsockopt. Without a ring, it returns 0 if no drops occurred
    since the last getsockopt and the total drops over the lifespan of
    the socket otherwise. The culprit is this line in packet_rcv, executed
    on a drop:

    drop_n_acct:
    po->stats.tp_drops = atomic_inc_return(&sk->sk_drops);

    As it shows, the new drop number it taken from the socket drop counter,
    which is not reset at getsockopt. I put together a small example
    that demonstrates the issue [2]. It runs for 10 seconds and overflows
    the queue/ring on every odd second. The reported drop rates are:
    ring: 16, 0, 16, 0, 16, ...
    non-ring: 0, 15, 0, 30, 0, 46, 0, 60, 0 , 74.

    Note how the even ring counts monotonically increase. Because the
    getsockopt adds tp_drops to tp_packets, total counts are similarly
    reported cumulatively. Long story short, reinstating the original code, as
    the below patch does, fixes the issue at the cost of additional per-packet
    cycles. Another solution that does not introduce per-packet overhead
    is be to keep the current data path, record the value of sk_drops at
    getsockopt() at call N in a new field in struct packetsock and subtract
    that when reporting at call N+1. I'll be happy to code that, instead,
    it's just more messy.

    [1] http://patchwork.ozlabs.org/patch/35665/
    [2] http://kernel.googlecode.com/files/test-packetsock-getstatistics.c

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

16 Sep, 2011

1 commit

  • This patch does several things:
    - introduces __ethtool_get_settings which is called from ethtool code and
    from drivers as well. Put ASSERT_RTNL there.
    - dev_ethtool_get_settings() is replaced by __ethtool_get_settings()
    - changes calling in drivers so rtnl locking is respected. In
    iboe_get_rate was previously ->get_settings() called unlocked. This
    fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same
    problem. Also fixed by calling __dev_get_by_index() instead of
    dev_get_by_index() and holding rtnl_lock for both calls.
    - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
    so bnx2fc_if_create() and fcoe_if_create() are called locked as they
    are from other places.
    - use __ethtool_get_settings() in bonding code

    Signed-off-by: Jiri Pirko

    v2->v3:
    -removed dev_ethtool_get_settings()
    -added ASSERT_RTNL into __ethtool_get_settings()
    -prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock
    around it and __ethtool_get_settings() call
    v1->v2:
    add missing export_symbol
    Reviewed-by: Ben Hutchings [except FCoE bits]
    Acked-by: Ralf Baechle
    Signed-off-by: David S. Miller

    Jiri Pirko
     

27 Aug, 2011

1 commit


25 Aug, 2011

1 commit

  • 1) Blocks can be configured with non-static frame-size.
    2) Read/poll is at a block-level(as opposed to packet-level).
    3) Added poll timeout to avoid indefinite user-space wait on idle links.
    4) Added user-configurable knobs:
    4.1) block::timeout.
    4.2) tpkt_hdr::sk_rxhash.

    Changes:
    C1) tpacket_rcv()
    C1.1) packet_current_frame() is replaced by packet_current_rx_frame()
    The bulk of the processing is then moved in the following chain:
    packet_current_rx_frame()
    __packet_lookup_frame_in_block
    fill_curr_block()
    or
    retire_current_block
    dispatch_next_block
    or
    return NULL(queue is plugged/paused)

    Signed-off-by: Chetan Loke
    Signed-off-by: David S. Miller

    chetan loke
     

14 Jul, 2011

1 commit

  • Currently we flush tp_status and then flush the remainder of the header+payload.
    tp_status should be flushed in the end to avoid stale data being read by user-space.

    Incorrectly re-ordered barriers in v1.

    Signed-off-by: Chetan Loke
    Signed-off-by: David S. Miller

    Chetan Loke
     

07 Jul, 2011

2 commits


06 Jul, 2011

5 commits


21 Jun, 2011

1 commit


12 Jun, 2011

1 commit

  • There's no need for the guest to validate the checksum if it have been
    validated by host nics. So this patch introduces a new flag -
    VIRTIO_NET_HDR_F_DATA_VALID which is used to bypass the checksum
    examing in guest. The backend (tap/macvtap) may set this flag when
    met skbs with CHECKSUM_UNNECESSARY to save cpu utilization.

    No feature negotiation is needed as old driver just ignore this flag.

    Iperf shows 12%-30% performance improvement for UDP traffic. For TCP,
    when gro is on no difference as it produces skb with partial
    checksum. But when gro is disabled, 20% or even higher improvement
    could be measured by netperf.

    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

07 Jun, 2011

1 commit

  • In 2.6.27, commit 393e52e33c6c2 (packet: deliver VLAN TCI to userspace)
    added a small information leak.

    Add padding field and make sure its zeroed before copy to user.

    Signed-off-by: Eric Dumazet
    CC: Patrick McHardy
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Jun, 2011

2 commits


02 Jun, 2011

1 commit

  • Currently, user-space cannot determine if a 0 tcp_vlan_tci
    means there is no VLAN tag or the VLAN ID was zero.

    Add flag to make this explicit. User-space can check for
    TP_STATUS_VLAN_VALID || tp_vlan_tci > 0, which will be backwards
    compatible. Older could would have just checked for tp_vlan_tci,
    so it will work no worse than before.

    Signed-off-by: Ben Greear
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Ben Greear
     

24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK are currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
     

28 Apr, 2011

1 commit

  • In order to speedup packet filtering, here is an implementation of a
    JIT compiler for x86_64

    It is disabled by default, and must be enabled by the admin.

    echo 1 >/proc/sys/net/core/bpf_jit_enable

    It uses module_alloc() and module_free() to get memory in the 2GB text
    kernel range since we call helpers functions from the generated code.

    EAX : BPF A accumulator
    EBX : BPF X accumulator
    RDI : pointer to skb (first argument given to JIT function)
    RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
    r9d : skb->len - skb->data_len (headlen)
    r8 : skb->data

    To get a trace of generated code, use :

    echo 2 >/proc/sys/net/core/bpf_jit_enable

    Example of generated code :

    # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24

    flen=18 proglen=147 pass=3 image=ffffffffa00b5000
    JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
    JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
    JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
    JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
    JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
    JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
    JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
    JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
    JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
    JIT code: ffffffffa00b5090: c0 c9 c3

    BPF program is 144 bytes long, so native program is almost same size ;)

    (000) ldh [12]
    (001) jeq #0x800 jt 2 jf 8
    (002) ld [26]
    (003) and #0xffffff00
    (004) jeq #0xc0a81400 jt 16 jf 5
    (005) ld [30]
    (006) and #0xffffff00
    (007) jeq #0xc0a81400 jt 16 jf 17
    (008) jeq #0x806 jt 10 jf 9
    (009) jeq #0x8035 jt 10 jf 17
    (010) ld [28]
    (011) and #0xffffff00
    (012) jeq #0xc0a81400 jt 16 jf 13
    (013) ld [38]
    (014) and #0xffffff00
    (015) jeq #0xc0a81400 jt 16 jf 17
    (016) ret #65535
    (017) ret #0

    Signed-off-by: Eric Dumazet
    Cc: Arnaldo Carvalho de Melo
    Cc: Ben Hutchings
    Cc: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Mar, 2011

1 commit


12 Feb, 2011

1 commit


20 Jan, 2011

1 commit

  • Clean up some unused macros in net/*.
    1. be left for code change. e.g. PGV_FROM_VMALLOC, PGV_FROM_VMALLOC, KMEM_SAFETYZONE.
    2. never be used since introduced to kernel.
    e.g. P9_RDMA_MAX_SGE, UTIL_CTRL_PKT_SIZE.

    Signed-off-by: Shan Wei
    Acked-by: Sjur Braendeland
    Signed-off-by: David S. Miller

    Shan Wei
     

19 Jan, 2011

1 commit


17 Dec, 2010

1 commit


11 Dec, 2010

1 commit


09 Dec, 2010

3 commits


07 Dec, 2010

1 commit