23 Dec, 2011

1 commit

  • skb->truesize might be big even for a small packet.

    It's even bigger after commit 87fb4b7b533 (net: more accurate skb
    truesize) and with a big MTU.

    We should allow queueing at least one packet per receiver, even with a
    low RCVBUF setting.
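
    A minimal sketch of the idea (the exact condition in the patch may
    differ): count only memory already charged to the socket, so the first
    packet always fits, whatever its truesize:

    /* before: a large truesize could reject even the first packet */
    if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
        (unsigned int)sk->sk_rcvbuf)
            return -ENOMEM;

    /* after: only already-queued memory counts, so at least one
     * packet per receiver is accepted even with a low RCVBUF
     */
    if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
            return -ENOMEM;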

    Reported-by: Michal Simek
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Nov, 2011

1 commit

  • This popped some compiler warnings due to mismatched prototypes. Just
    remove most manual inlines; the compiler should be able to figure out
    what makes sense to inline and what doesn't.

    net/packet/af_packet.c:252: warning: 'prb_curr_blk_in_use' declared inline after being called
    net/packet/af_packet.c:252: warning: previous declaration of 'prb_curr_blk_in_use' was here
    net/packet/af_packet.c:258: warning: 'prb_queue_frozen' declared inline after being called
    net/packet/af_packet.c:258: warning: previous declaration of 'prb_queue_frozen' was here
    net/packet/af_packet.c:248: warning: 'packet_previous_frame' declared inline after being called
    net/packet/af_packet.c:248: warning: previous declaration of 'packet_previous_frame' was here
    net/packet/af_packet.c:251: warning: 'packet_increment_head' declared inline after being called
    net/packet/af_packet.c:251: warning: previous declaration of 'packet_increment_head' was here
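
    A minimal illustration of the warning pattern (hypothetical names):

    static void foo(void);            /* forward declaration, not inline */

    static void bar(void)
    {
            foo();                    /* call goes through the prototype */
    }

    /* 'inline' first appears at the definition, after the call above, so
     * gcc warns: "'foo' declared inline after being called"
     */
    static inline void foo(void)
    {
    }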

    Signed-off-by: Olof Johansson
    Cc: Chetan Loke
    Signed-off-by: David S. Miller

    Olof Johansson
     

19 Oct, 2011

1 commit

  • Fragmented multicast frames are delivered to a single macvlan port,
    because the IP defrag logic considers the other copies redundant.

    Implement a defrag step before trying to send the multicast frame.
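
    A sketch of the fix in the macvlan receive path (the mainline fix added
    an ip_check_defrag() helper for this; placement and names here are
    approximate):

    if (is_multicast_ether_addr(eth_hdr(skb)->h_dest)) {
            /* reassemble IPv4 fragments first, so every port gets the
             * complete datagram instead of a single sample
             */
            skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN);
            if (!skb)
                    return RX_HANDLER_CONSUMED; /* fragment was queued */
    }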

    Reported-by: Ben Greear
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Oct, 2011

1 commit


08 Oct, 2011

1 commit


04 Oct, 2011

1 commit

  • This is a minor change.

    Up until kernel 2.6.32, getsockopt(fd, SOL_PACKET, PACKET_STATISTICS,
    ...) would return total and dropped packets since its last invocation. The
    introduction of socket queue overflow reporting [1] changed drop
    rate calculation in the normal packet socket path, but not when using a
    packet ring. As a result, the getsockopt now returns different statistics
    depending on the reception method used. With a ring, it still returns the
    count since the last call, as counts are incremented in tpacket_rcv and
    reset in getsockopt. Without a ring, it returns 0 if no drops occurred
    since the last getsockopt and the total drops over the lifespan of
    the socket otherwise. The culprit is this line in packet_rcv, executed
    on a drop:

    drop_n_acct:
    po->stats.tp_drops = atomic_inc_return(&sk->sk_drops);

    As this shows, the new drop number is taken from the socket drop
    counter, which is not reset at getsockopt. I put together a small example
    that demonstrates the issue [2]. It runs for 10 seconds and overflows
    the queue/ring on every odd second. The reported drop rates are:
    ring: 16, 0, 16, 0, 16, ...
    non-ring: 0, 15, 0, 30, 0, 46, 0, 60, 0, 74.

    Note how the even-numbered non-ring counts monotonically increase.
    Because getsockopt adds tp_drops to tp_packets, total counts are
    similarly reported cumulatively. Long story short, reinstating the
    original code, as the patch below does, fixes the issue at the cost of
    additional per-packet cycles. Another solution that does not introduce
    per-packet overhead is to keep the current data path, record the value
    of sk_drops at getsockopt() call N in a new field in struct
    packet_sock, and subtract that when reporting at call N+1. I'd be
    happy to code that instead; it's just messier.

    [1] http://patchwork.ozlabs.org/patch/35665/
    [2] http://kernel.googlecode.com/files/test-packetsock-getstatistics.c
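
    For reference, a minimal user-space reader of these statistics (sketch;
    assume fd is an already-bound packet socket):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_packet.h>

    struct tpacket_stats st;
    socklen_t len = sizeof(st);

    /* tp_packets as reported here already includes tp_drops */
    if (getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &st, &len) == 0)
            printf("pkts=%u drops=%u since last call\n",
                   st.tp_packets, st.tp_drops);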

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

16 Sep, 2011

1 commit

  • This patch does several things:
    - introduces __ethtool_get_settings(), which is called from ethtool
    code and from drivers as well, and puts ASSERT_RTNL there.
    - replaces dev_ethtool_get_settings() with __ethtool_get_settings()
    - changes callers in drivers so rtnl locking is respected. In
    iboe_get_rate(), ->get_settings() was previously called unlocked; this
    fixes it. prb_calc_retire_blk_tmo() in af_packet.c had the same
    problem; it is fixed by calling __dev_get_by_index() instead of
    dev_get_by_index() and holding rtnl_lock around both calls (see the
    sketch below).
    - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
    so bnx2fc_if_create() and fcoe_if_create() are called locked, as they
    are from other places.
    - uses __ethtool_get_settings() in bonding code
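
    The locked call pattern referred to above, sketched (this is roughly
    how a prb_calc_retire_blk_tmo()-style caller would now look; error
    handling is illustrative):

    struct ethtool_cmd ecmd;
    struct net_device *dev;
    int err = -ENODEV;

    rtnl_lock();
    dev = __dev_get_by_index(net, ifindex);   /* requires RTNL held */
    if (dev)
            err = __ethtool_get_settings(dev, &ecmd); /* ASSERT_RTNL inside */
    rtnl_unlock();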

    Signed-off-by: Jiri Pirko

    v2->v3:
    -removed dev_ethtool_get_settings()
    -added ASSERT_RTNL into __ethtool_get_settings()
    -prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock
    around it and __ethtool_get_settings() call
    v1->v2:
    -added missing EXPORT_SYMBOL
    Reviewed-by: Ben Hutchings [except FCoE bits]
    Acked-by: Ralf Baechle
    Signed-off-by: David S. Miller

    Jiri Pirko
     

27 Aug, 2011

1 commit


25 Aug, 2011

1 commit

  • 1) Blocks can be configured with a non-static frame size.
    2) Read/poll is at a block level (as opposed to packet level).
    3) Added a poll timeout to avoid indefinite user-space waits on idle
    links.
    4) Added user-configurable knobs:
    4.1) block::timeout.
    4.2) tpkt_hdr::sk_rxhash.

    Changes:
    C1) tpacket_rcv()
    C1.1) packet_current_frame() is replaced by packet_current_rx_frame().
    The bulk of the processing is then moved into the following chain:

    packet_current_rx_frame()
      __packet_lookup_frame_in_block()
        fill_curr_block()
        or
        retire_current_block()
        dispatch_next_block()
        or
        return NULL (queue is plugged/paused)
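
    A user-space sketch of knob 4.1 (block::timeout); values are
    illustrative and error handling is omitted:

    int ver = TPACKET_V3;
    struct tpacket_req3 req = {
            .tp_block_size     = 1 << 22,
            .tp_block_nr       = 64,
            .tp_frame_size     = 2048,               /* hint: frames vary */
            .tp_frame_nr       = (1 << 22) / 2048 * 64,
            .tp_retire_blk_tov = 60,                 /* block timeout, ms */
    };

    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));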

    Signed-off-by: Chetan Loke
    Signed-off-by: David S. Miller

    Chetan Loke
     

14 Jul, 2011

1 commit

  • Currently we flush tp_status and then flush the remainder of the
    header+payload. tp_status should be flushed last, to avoid stale data
    being read by user-space.

    v1 of this patch had the barriers incorrectly re-ordered.
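
    The intended order, sketched (hdr, frame_data and the copy length are
    illustrative):

    memcpy(frame_data, skb->data, copy_len);  /* payload first */
    hdr->tp_len     = skb->len;               /* then header fields */
    hdr->tp_snaplen = copy_len;
    smp_wmb();                                /* order the stores above... */
    hdr->tp_status  = TP_STATUS_USER;         /* ...before publishing */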

    Signed-off-by: Chetan Loke
    Signed-off-by: David S. Miller

    Chetan Loke
     

07 Jul, 2011

2 commits


06 Jul, 2011

5 commits


21 Jun, 2011

1 commit


12 Jun, 2011

1 commit

  • There's no need for the guest to validate the checksum if it has been
    validated by host NICs. So this patch introduces a new flag -
    VIRTIO_NET_HDR_F_DATA_VALID - which is used to bypass checksum
    examination in the guest. The backend (tap/macvtap) may set this flag
    when it meets skbs with CHECKSUM_UNNECESSARY, to save CPU utilization.

    No feature negotiation is needed, as old drivers just ignore this flag.

    Iperf shows a 12%-30% performance improvement for UDP traffic. For TCP,
    when GRO is on there is no difference, as GRO produces skbs with
    partial checksums. But when GRO is disabled, a 20% or even higher
    improvement can be measured with netperf.
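
    Both halves of the idea, sketched (flag and checksum states as named in
    the description; the surrounding code is illustrative):

    /* backend (tap/macvtap): the host already verified the checksum */
    if (skb->ip_summed == CHECKSUM_UNNECESSARY)
            hdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;

    /* guest driver: trust the host and skip checksum validation */
    if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID)
            skb->ip_summed = CHECKSUM_UNNECESSARY;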

    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

07 Jun, 2011

1 commit

  • In 2.6.27, commit 393e52e33c6c2 (packet: deliver VLAN TCI to userspace)
    added a small information leak.

    Add a padding field and make sure it is zeroed before the copy to
    user space.
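
    The fix pattern, sketched (illustrative; the real patch touches both
    the ring header and the tpacket_auxdata paths):

    struct tpacket_auxdata aux;

    memset(&aux, 0, sizeof(aux));  /* zeroes the padding: no stack leak */
    aux.tp_vlan_tci = vlan_tci;    /* then fill the real fields */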

    Signed-off-by: Eric Dumazet
    CC: Patrick McHardy
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Jun, 2011

2 commits


02 Jun, 2011

1 commit

  • Currently, user-space cannot determine whether a zero tp_vlan_tci
    means there is no VLAN tag or the VLAN ID was zero.

    Add a flag to make this explicit. User-space can check for
    TP_STATUS_VLAN_VALID || tp_vlan_tci > 0, which is backwards
    compatible. Older code would have just checked tp_vlan_tci, so it will
    work no worse than before.
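
    The user-space check, sketched (aux comes from a PACKET_AUXDATA control
    message):

    int has_vlan = (aux->tp_status & TP_STATUS_VLAN_VALID) ||
                   aux->tp_vlan_tci > 0;  /* old kernels: nonzero TCI only */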

    Signed-off-by: Ben Greear
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Ben Greear
     

24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK are currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.
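
    A representative conversion (an illustrative line, not a specific hunk
    from the patch):

    /* before: exposes the socket address to any /proc reader */
    seq_printf(seq, "%p %-6d\n", sk, sk->sk_type);

    /* after: honours kptr_restrict */
    seq_printf(seq, "%pK %-6d\n", sk, sk->sk_type);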

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
     

28 Apr, 2011

1 commit

  • In order to speedup packet filtering, here is an implementation of a
    JIT compiler for x86_64

    It is disabled by default, and must be enabled by the admin.

    echo 1 >/proc/sys/net/core/bpf_jit_enable

    It uses module_alloc() and module_free() to get memory in the 2GB text
    kernel range, since we call helper functions from the generated code.

    EAX : BPF A accumulator
    EBX : BPF X accumulator
    RDI : pointer to skb (first argument given to JIT function)
    RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
    r9d : skb->len - skb->data_len (headlen)
    r8 : skb->data

    To get a trace of generated code, use:

    echo 2 >/proc/sys/net/core/bpf_jit_enable

    Example of generated code:

    # tcpdump -p -n -s 0 -i eth1 net 192.168.20.0/24

    flen=18 proglen=147 pass=3 image=ffffffffa00b5000
    JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
    JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
    JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
    JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
    JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
    JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
    JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
    JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
    JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
    JIT code: ffffffffa00b5090: c0 c9 c3

    The BPF program is 144 bytes long, so the native program is almost the
    same size ;)

    (000) ldh [12]
    (001) jeq #0x800 jt 2 jf 8
    (002) ld [26]
    (003) and #0xffffff00
    (004) jeq #0xc0a81400 jt 16 jf 5
    (005) ld [30]
    (006) and #0xffffff00
    (007) jeq #0xc0a81400 jt 16 jf 17
    (008) jeq #0x806 jt 10 jf 9
    (009) jeq #0x8035 jt 10 jf 17
    (010) ld [28]
    (011) and #0xffffff00
    (012) jeq #0xc0a81400 jt 16 jf 13
    (013) ld [38]
    (014) and #0xffffff00
    (015) jeq #0xc0a81400 jt 16 jf 17
    (016) ret #65535
    (017) ret #0

    Signed-off-by: Eric Dumazet
    Cc: Arnaldo Carvalho de Melo
    Cc: Ben Hutchings
    Cc: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Mar, 2011

1 commit


12 Feb, 2011

1 commit


20 Jan, 2011

1 commit

  • Clean up some unused macros in net/*.
    1. Left over after code changes, e.g. PGV_FROM_VMALLOC, KMEM_SAFETYZONE.
    2. Never used since being introduced to the kernel,
    e.g. P9_RDMA_MAX_SGE, UTIL_CTRL_PKT_SIZE.

    Signed-off-by: Shan Wei
    Acked-by: Sjur Braendeland
    Signed-off-by: David S. Miller

    Shan Wei
     

19 Jan, 2011

1 commit


17 Dec, 2010

1 commit


11 Dec, 2010

1 commit


09 Dec, 2010

3 commits


07 Dec, 2010

2 commits

  • As we can check whether an address is a vmalloc address with
    is_vmalloc_addr(), we can remove pgv.flags. Then we may get more
    pg_vecs.

    Signed-off-by: Changli Gao
    Signed-off-by: David S. Miller

    Changli Gao
     
  • The following commit causes pgv->buffer to possibly point to memory
    returned by vmalloc(), and we can't use virt_to_page() for a vmalloc
    address.

    This patch introduces a new inline function, pgv_to_page(), which calls
    vmalloc_to_page() for a vmalloc address and virt_to_page() for a
    __get_free_pages address.

    We used to increment the page pointer to get the next page at the next
    page address; after Neil's patch this is wrong, as the physical
    addresses may not be contiguous. This patch fixes that issue as well.

    commit 0e3125c755445664f00ad036e4fc2cd32fd52877
    Author: Neil Horman
    Date: Tue Nov 16 10:26:47 2010 -0800

    packet: Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v4)
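
    Reconstructed from the description, the new helper is presumably along
    these lines:

    static inline struct page *pgv_to_page(void *addr)
    {
            if (is_vmalloc_addr(addr))
                    return vmalloc_to_page(addr);
            return virt_to_page(addr);
    }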

    Signed-off-by: Changli Gao
    Signed-off-by: David S. Miller

    Changli Gao
     

22 Nov, 2010

1 commit

  • alloc_one_pg_vec_page() is supposed to return zeroed memory, so use
    vzalloc() instead of vmalloc().
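
    The change is mechanical; the pattern, sketched (buffer and size are
    placeholders):

    buffer = vmalloc(size);   /* before: contents not guaranteed zeroed */
    buffer = vzalloc(size);   /* after: allocator zeroes the memory */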

    Signed-off-by: Eric Dumazet
    Cc: Neil Horman
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Nov, 2010

1 commit

  • Remove pc variable to avoid arithmetic to compute fentry at each filter
    instruction. Jumps directly manipulate fentry pointer.

    As the last instruction of filter[] is guaranteed to be a RETURN, and
    all jumps land before the last instruction, we don't need to check
    filter bounds (the number of instructions in the filter array) at each
    iteration, so we remove that parameter from sk_run_filter().

    On x86_32, remove the f_k var introduced in commit 57fe93b374a6b871
    (filter: make sure filters dont read uninitialized memory).

    Note : We could use a CONFIG_ARCH_HAS_{FEW|MANY}_REGISTERS in order to
    avoid too many ifdefs in this code.

    This helps the compiler use CPU registers to hold fentry and the A
    accumulator.

    On x86_32, this saves 401 bytes and, more importantly, makes
    sk_run_filter() run much faster because of reduced register pressure
    (one less conditional branch per BPF instruction).

    # size net/core/filter.o net/core/filter_pre.o
    text data bss dec hex filename
    2948 0 0 2948 b84 net/core/filter.o
    3349 0 0 3349 d15 net/core/filter_pre.o

    on x86_64 :
    # size net/core/filter.o net/core/filter_pre.o
    text data bss dec hex filename
    5173 0 0 5173 1435 net/core/filter.o
    5224 0 0 5224 1468 net/core/filter_pre.o
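
    The shape of the change inside sk_run_filter(), heavily simplified
    (most opcode cases omitted):

    const struct sock_filter *fentry = filter;  /* no pc index any more */

    for (;; fentry++) {
            switch (fentry->code) {
            case BPF_JMP | BPF_JA:
                    fentry += fentry->k;  /* jumps move the pointer itself */
                    continue;
            case BPF_RET | BPF_K:
                    return fentry->k;     /* filter[] always ends in RETURN,
                                           * so no per-insn bounds check */
            }
    }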

    Signed-off-by: Eric Dumazet
    Acked-by: Changli Gao
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Nov, 2010

1 commit

  • Version 4 of this patch.

    Change notes:
    1) Removed extra memset. Didn't think kcalloc added a GFP_ZERO the way kzalloc did :)

    Summary:
    It was shown to me recently that systems under high load were driven
    very deep into swap when tcpdump was run. The reason this happened was
    that the AF_PACKET protocol has a SET_RINGBUFFER socket option that
    allows the user-space application to specify how many entries an
    AF_PACKET socket will have and how large each entry will be. It seems
    the default setting for tcpdump is to set the ring buffer to 32 entries
    of 64 Kb each, which implies 32 order-5 allocations. That's difficult
    under good circumstances, and horrid under memory pressure.

    I thought it would be good to make that a bit more usable. I was going
    to do a simple conversion of the ring buffer from contiguous pages to
    iovecs, but unfortunately the metadata which AF_PACKET places in these
    buffers can easily span a page boundary, and given that these buffers
    get mapped into user space, and the data layout doesn't easily allow
    for a change to the padding between frames to avoid that, a simple
    iovec change would just break user-space ABI consistency.

    So I've done this: I've added a three-tiered mechanism to the af_packet
    set_ring socket option. It attempts to allocate memory in the following
    order (sketched after the list below):

    1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
    digging into swap

    2) Using vmalloc

    3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
    needed to get the memory

    The effect is that we don't disturb the system as much when we're under load,
    while still being able to conduct tcpdumps effectively.

    Tested successfully by me.
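
    A sketch of the resulting allocator (reconstructed from the three-step
    list above; the exact gfp flags in the patch may differ):

    static char *alloc_one_pg_vec_page(unsigned long order)
    {
            char *buf;
            gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_ZERO |
                        __GFP_NOWARN | __GFP_NORETRY;

            /* 1) fail fast rather than dig into swap */
            buf = (char *)__get_free_pages(gfp, order);
            if (buf)
                    return buf;

            /* 2) fall back to physically non-contiguous memory */
            buf = vmalloc((1 << order) * PAGE_SIZE);
            if (buf)
                    return buf;

            /* 3) last resort: try as hard as needed */
            return (char *)__get_free_pages(gfp & ~__GFP_NORETRY, order);
    }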

    Signed-off-by: Neil Horman
    Acked-by: Eric Dumazet
    Acked-by: Maciej Żenczykowski
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Neil Horman
     

13 Nov, 2010

1 commit