03 Aug, 2010

2 commits

  • This reverts commit 15e83ed78864d0625e87a85f09b297c0919a4797.

    As explained by Johannes Berg, the optimization made here is
    invalid. Or, at best, incomplete.

    Not only destructor invocation, but conntrack entry releasing
    must be executed outside of hw IRQ context.

    So just checking "skb->destructor" is insufficient.

    Signed-off-by: David S. Miller

  • Commit ab95bfe01f9872459c8678572ccadbf646badad0 replaces bridge and macvlan
    hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers.

    Signed-off-by: Changli Gao
    Signed-off-by: David S. Miller


01 Aug, 2010

1 commit

    If a user misconfigures ingress and causes a redirection loop, don't
    overwhelm the log. This is also an error case, so mark the branch
    unlikely(). Found by inspection, luckily not in a real system.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller


26 Jul, 2010

1 commit

    With conntrack zones and probably with different network
    namespaces, the netfilter logic needs to be re-calculated
    on packet receive. If the netfilter logic is not reset,
    it will not be recalculated properly. This patch adds
    the nf_reset logic to dev_forward_skb.

    Signed-off-by: Ben Greear
    Signed-off-by: David S. Miller


25 Jul, 2010

2 commits

    Move frags[] to the end of struct skb_shared_info, and make
    pskb_expand_head() copy only the used part of it instead of the whole
    array.

    This should avoid kmemcheck warnings and speed up pskb_expand_head()
    as well, avoiding a lot of cache misses.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

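The saving can be sketched in plain userspace C; `shinfo_sketch` and MAX_FRAGS below are illustrative stand-ins, not the real skb_shared_info:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in, NOT the real skb_shared_info: with frags[]
 * placed last, a copy only needs the leading fields plus the nr_frags
 * entries actually in use, rather than the full fixed-size array. */
#define MAX_FRAGS 18

struct shinfo_sketch {
    unsigned int nr_frags;
    /* ...other fields would sit here, ahead of the array... */
    int frags[MAX_FRAGS];
};

/* Bytes a pskb_expand_head()-style copy would actually need. */
static size_t used_copy_size(const struct shinfo_sketch *s)
{
    return offsetof(struct shinfo_sketch, frags)
           + s->nr_frags * sizeof(s->frags[0]);
}
```

For typical packets nr_frags is far below the maximum, so the partial copy touches far fewer cache lines than copying the whole array.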
  • Add addr_assign_type to struct net_device and expose it via sysfs.
    This new attribute has the purpose of giving user-space the ability to
    distinguish between different assignment types of MAC addresses.

    For example, user-space can treat NICs with randomly generated MAC
    addresses differently than NICs that have permanent (locally assigned)
    MAC addresses.
    For the former, udev could write a persistent net rule by matching the
    device path instead of the MAC address.
    There's also the case of devices that 'steal' MAC addresses from slave
    devices, in which case it is also beneficial for user-space to be aware
    of the fact.

    This patch also introduces a helper function to assist adoption of
    drivers that generate MAC addresses randomly.

    Signed-off-by: Stefan Assmann
    Signed-off-by: David S. Miller

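The contract of such a helper can be sketched as follows; this is a hedged userspace approximation (using rand() rather than the kernel's RNG), showing only the two address bits any random-MAC helper must fix up:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace approximation of a random-MAC helper: whatever the random
 * source, the result must not be multicast (bit 0 of the first octet
 * clear) and must be marked locally administered (bit 1 set), so that
 * user-space can tell it apart from a permanent, burned-in address. */
static void random_mac_sketch(uint8_t addr[6])
{
    for (int i = 0; i < 6; i++)
        addr[i] = (uint8_t)rand();
    addr[0] &= 0xfe;   /* clear the multicast bit */
    addr[0] |= 0x02;   /* set the locally-administered bit */
}
```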

23 Jul, 2010

2 commits

    csum_start should only be adjusted if ip_summed == CHECKSUM_PARTIAL.

    Signed-off-by: David S. Miller

  • Make pskb_expand_head() check ip_summed to make sure csum_start is really
    csum_start and not csum before adjusting it.

    This fixes a bug I encountered using a Sun Quad-Fast Ethernet card and VLANs.
    On my configuration, the sunhme driver produces skbs with differing amounts
    of headroom on receive depending on the packet size. See line 2030 of
    drivers/net/sunhme.c; packets smaller than RX_COPY_THRESHOLD have 52 bytes
    of headroom but packets larger than that cutoff have only 20 bytes.

    When these packets reach the VLAN driver, vlan_check_reorder_header()
    calls skb_cow(), which, if the packet has less than NET_SKB_PAD (== 32) bytes
    of headroom, uses pskb_expand_head() to make more.

    Then, pskb_expand_head() needs to adjust a lot of offsets into the skb,
    including csum_start. Since csum_start is a union with csum, if the packet
    has a valid csum value this will corrupt it, which was the effect I observed.
    The sunhme hardware computes receive checksums, so the skbs would be created
    by the driver with ip_summed == CHECKSUM_COMPLETE and a valid csum field, and
    then pskb_expand_head() would corrupt the csum field, leading to an "hw csum
    error" message later on, for example in icmp_rcv() for pings larger than the
    sunhme RX_COPY_THRESHOLD.

    On the basis of the comment at the beginning of include/linux/skbuff.h,
    I believe that the csum_start skb field is only meaningful if ip_summed
    is CHECKSUM_PARTIAL, so this patch makes pskb_expand_head() adjust it
    only in that case to avoid corrupting a valid csum value.

    Please see my more in-depth discussion of tracking down this bug for
    more details if you like:

    http://puellavulnerata.livejournal.com/112186.html
    http://puellavulnerata.livejournal.com/112567.html
    http://puellavulnerata.livejournal.com/112891.html
    http://puellavulnerata.livejournal.com/113096.html
    http://puellavulnerata.livejournal.com/113591.html

    I am not subscribed to this list, so please CC me on replies.

    Signed-off-by: Andrea Shepard
    Signed-off-by: David S. Miller

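The essence of the guard can be sketched in plain C; `mini_skb` below is a hypothetical stand-in for sk_buff, reproducing only the union that makes the unconditional adjustment dangerous:

```c
#include <assert.h>

/* Hypothetical stand-in, NOT the real sk_buff: csum and csum_start
 * share storage, so unconditionally shifting csum_start corrupts a
 * valid checksum. The guard below is the essence of the patch. */
enum { CHECKSUM_NONE, CHECKSUM_PARTIAL, CHECKSUM_COMPLETE };

struct mini_skb {
    int ip_summed;
    union {
        unsigned int csum;        /* valid when ip_summed == CHECKSUM_COMPLETE */
        unsigned int csum_start;  /* valid when ip_summed == CHECKSUM_PARTIAL  */
    };
};

/* Adjust the offset only when the union really holds csum_start. */
static void expand_head_adjust(struct mini_skb *skb, unsigned int delta)
{
    if (skb->ip_summed == CHECKSUM_PARTIAL)
        skb->csum_start += delta;
}
```

Without the ip_summed check, the same `+= delta` would silently rewrite a CHECKSUM_COMPLETE packet's csum, which is exactly the "hw csum error" observed above.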

20 Jul, 2010

2 commits

    Use the modern this_cpu_xxx() API, saving a few bytes on x86.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Since struct netdev_queue tx_bytes/tx_packets/tx_dropped are already
    protected by _xmit_lock, it's easy to convert these fields to u64
    instead of unsigned long.
    This completes 64-bit stats for devices using them (vlan, macvlan, ...).

    Strictly, we could avoid the locking in dev_txq_stats_fold() on 64-bit
    arches, but it's a slow path and we prefer to keep it simple.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


19 Jul, 2010

1 commit

  • This patch adds a new networking option to allow hardware time stamps
    from PHY devices. When enabled, likely candidates among incoming and
    outgoing network packets are offered to the PHY driver for possible
    time stamping. When accepted by the PHY driver, incoming packets are
    deferred for later delivery by the driver.

    The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl
    and callbacks for transmit and receive time stamping. Drivers may
    optionally implement these functions.

    Signed-off-by: Richard Cochran
    Signed-off-by: David S. Miller


15 Jul, 2010

3 commits

    Fix a problem in reading the tx_queue recorded in a socket. In
    dev_pick_tx, the TX queue is read by doing a check with
    sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
    The problem is that there is no mutual exclusion across these
    calls in the socket, so it is possible that the queue in the
    sock can be invalidated after sk_tx_queue_recorded is called, so
    that sk_tx_queue_get returns -1, which sets 65535 in queue_index
    and thus dev_pick_tx returns 65536, which is a bogus queue and
    can cause a crash in dev_queue_xmit.

    We fix this by only calling sk_tx_queue_get which does the proper
    checks. The interface is that sk_tx_queue_get returns the TX queue
    if the sock argument is non-NULL and TX queue is recorded, else it
    returns -1. sk_tx_queue_recorded is no longer used so it can be
    completely removed.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

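A minimal sketch of the failure mode and the fixed interface, with a plain int standing in for the socket's recorded queue (the real code operates on struct sock):

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch: sk_tx_queue_get() can find the recorded queue
 * invalidated and must return -1; storing that -1 straight into a
 * 16-bit queue_index yields 65535, a bogus queue. The fix is one call
 * that does all the checks, with the caller handling -1 explicitly. */
static int sk_tx_queue_get_sketch(const int *sk_tx_queue_mapping)
{
    if (sk_tx_queue_mapping && *sk_tx_queue_mapping >= 0)
        return *sk_tx_queue_mapping;
    return -1;   /* no valid recorded queue */
}

static uint16_t pick_tx_queue(const int *mapping, uint16_t fallback)
{
    int q = sk_tx_queue_get_sketch(mapping);
    return q >= 0 ? (uint16_t)q : fallback;   /* never store -1 */
}
```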
    When configuring DMVPN (GRE + openNHRP) with a GRE remote
    address, a kernel Oops is observed. The observed Oops is caused
    by a NULL header_ops pointer (neigh->dev->header_ops) in
    neigh_update_hhs() when
    void (*update)(struct hh_cache*, const struct net_device*, const unsigned char *)
    = neigh->dev->header_ops->cache_update;

    is executed. The dev associated with the NULL header_ops is
    the GRE interface. This patch guards against the
    possibility that header_ops is NULL.

    This Oops was first observed in kernel version 2.6.26.8.

    Signed-off-by: Doug Kehn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

  • commit fc6055a5ba31e2 (net: Introduce skb_orphan_try()) added early
    orphaning of skbs.

    This unfortunately added a performance regression in skb_tx_hash() in
    case of stacked devices (bonding, vlans, ...)

    Since skb->sk is now NULL, we cannot access sk->sk_hash anymore to
    spread tx packets to multiple NIC queues on multiqueue devices.

    skb_tx_hash() in this case only uses skb->protocol, the same value
    for all flows.

    skb_orphan_try() can copy sk->sk_hash into skb->rxhash and skb_tx_hash()
    can use this saved sk_hash value to compute its internal hash value.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

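The idea can be sketched with minimal stand-in structures (these are not the kernel's sk_buff and sock, just enough fields to show the hand-off):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins for the kernel structures. */
struct mini_sock { uint32_t sk_hash; };
struct mini_skb  { struct mini_sock *sk; uint32_t rxhash; };

/* Sketch of the fix: before orphaning drops the socket reference,
 * save sk->sk_hash into skb->rxhash so the TX hash can still spread
 * flows across queues after skb->sk becomes NULL. */
static void orphan_try_sketch(struct mini_skb *skb)
{
    if (skb->sk) {
        skb->rxhash = skb->sk->sk_hash;  /* preserve before detaching */
        skb->sk = NULL;
    }
}

/* With rxhash preserved, flows still spread across nqueues. */
static uint32_t tx_hash_sketch(const struct mini_skb *skb, uint32_t nqueues)
{
    return skb->rxhash % nqueues;
}
```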

10 Jul, 2010

2 commits

  • Document that dev_get_stats() returns the same stats pointer it was
    given. Remove const qualification from the returned pointer since the
    caller may do what it likes with that structure.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

  • In commit be1f3c2c027cc5ad735df6a45a542ed1db7ec48b "net: Enable 64-bit
    net device statistics on 32-bit architectures" I redefined struct
    net_device_stats so that it could be used in a union with struct
    rtnl_link_stats64, avoiding the need for explicit copying or
    conversion between the two. However, this is unsafe because there is
    no locking required and no lock consistently held around calls to
    dev_get_stats() and use of the statistics structure it returns.

    In commit 28172739f0a276eb8d6ca917b3974c2edb036da3 "net: fix 64 bit
    counters on 32 bit arches" Eric Dumazet dealt with that problem by
    requiring callers of dev_get_stats() to provide storage for the
    result. This means that the net_device::stats64 field and the padding
    in struct net_device_stats are now redundant, so remove them.

    Update the comment on net_device_ops::ndo_get_stats64 to reflect its
    new usage.

    Change dev_txq_stats_fold() to use struct rtnl_link_stats64, since
    that is what all its callers are really using and it is no longer
    going to be compatible with struct net_device_stats.

    Eric Dumazet suggested the separate function for the structure
    conversion.

    Signed-off-by: Ben Hutchings
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller


03 Jul, 2010

1 commit

    Reducing real_num_tx_queues needs to flush the qdisc, otherwise
    skbs with queue_mappings greater than real_num_tx_queues can
    be sent to the underlying driver.

    The flow for this is,

    dev_queue_xmit()
    dev_pick_tx()
    skb_tx_hash() => hash using real_num_tx_queues
    skb_set_queue_mapping()
    ...
    qdisc_enqueue_root() => enqueue skb on txq from hash
    ...
    dev->real_num_tx_queues -= n
    ...
    sch_direct_xmit()
    dev_hard_start_xmit()
    ndo_start_xmit(skb,dev) => skb queue set with old hash

    skbs are enqueued on the qdisc with skb->queue_mapping set in the
    range 0 <= queue_mapping < real_num_tx_queues. When the driver
    decreases real_num_tx_queues, skbs may be dequeued from the
    qdisc with a queue_mapping greater than the new real_num_tx_queues.

    This fixes a case in ixgbe where this was occurring with DCB
    and FCoE. Because the driver is using queue_mapping to map
    skbs to tx descriptor rings we can potentially map skbs to
    rings that no longer exist.

    Signed-off-by: John Fastabend
    Tested-by: Ross Brattain
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

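The stale-mapping invariant in the flow above can be sketched in a few lines; this is an illustrative model, not the kernel's code path:

```c
#include <assert.h>

/* Hedged sketch: a queue_mapping hashed against the old
 * real_num_tx_queues can index a TX ring that no longer exists once
 * the driver shrinks the count; hence the qdisc must be flushed when
 * real_num_tx_queues is reduced. */
static unsigned int tx_hash(unsigned int flow_hash,
                            unsigned int real_num_tx_queues)
{
    return flow_hash % real_num_tx_queues;
}

/* The invariant every dequeued skb must satisfy before it reaches
 * ndo_start_xmit(). */
static int mapping_valid(unsigned int queue_mapping,
                         unsigned int real_num_tx_queues)
{
    return queue_mapping < real_num_tx_queues;
}
```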

01 Jul, 2010

3 commits

  • Many NICs use an indirection table to map an RX flow hash value to one
    of an arbitrary number of queues (not necessarily a power of 2). It
    can be useful to remove some queues from this indirection table so
    that they are only used for flows that are specifically filtered
    there. It may also be useful to weight the mapping to account for
    user processes with the same CPU-affinity as the RX interrupts.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

  • ethtool_op_set_flags() does not check for unsupported flags, and has
    no way of doing so. This means it is not suitable for use as a
    default implementation of ethtool_ops::set_flags.

    Add a 'supported' parameter specifying the flags that the driver and
    hardware support, validate the requested flags against this, and
    change all current callers to pass this parameter.

    Change some other trivial implementations of ethtool_ops::set_flags to
    call ethtool_op_set_flags().

    Signed-off-by: Ben Hutchings
    Reviewed-by: Stanislaw Gruszka
    Acked-by: Jeff Garzik
    Signed-off-by: David S. Miller

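The new contract can be sketched as below; -1 stands in for the error return and the device structures are simplified away:

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch of the validated set_flags contract: the caller
 * supplies the mask of flags the driver and hardware actually support,
 * and any request outside that mask is rejected before anything is
 * applied. */
static int set_flags_sketch(uint32_t *dev_flags,
                            uint32_t requested,
                            uint32_t supported)
{
    if (requested & ~supported)
        return -1;          /* an unsupported flag was requested */
    *dev_flags = requested; /* all bits validated; apply atomically */
    return 0;
}
```

The key property is that a rejected request leaves the device flags untouched, which an unvalidated default implementation cannot guarantee.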
    This is only noticed by people that are not doing everything correctly
    in the first place.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: David S. Miller


29 Jun, 2010

3 commits

  • struct ethtool_rxnfc was originally defined in 2.6.27 for the
    ETHTOOL_{G,S}RXFH command with only the cmd, flow_type and data
    fields. It was then extended in 2.6.30 to support various additional
    commands. These commands should have been defined to use a new
    structure, but it is too late to change that now.

    Since user-space may still be using the old structure definition
    for the ETHTOOL_{G,S}RXFH commands, and since they do not need the
    additional fields, only copy the originally defined fields to and
    from user-space.

    Signed-off-by: Ben Hutchings
    Cc: stable@kernel.org
    Signed-off-by: David S. Miller

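The compatibility trick amounts to copying only a prefix of the struct. A sketch, where `rxnfc_sketch` is an illustrative layout and not the real ethtool_rxnfc:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout, NOT the real ethtool_rxnfc: for the old
 * ETHTOOL_{G,S}RXFH commands only the originally defined prefix
 * (ending at 'data') may be copied to/from user-space, since old
 * binaries allocated only that much. */
struct rxnfc_sketch {
    uint32_t cmd;
    uint32_t flow_type;
    uint64_t data;
    uint32_t rule_cnt;   /* added later; must NOT be copied for old cmds */
};

/* Bytes safe to copy for the original commands: the prefix through
 * the last originally defined field. */
static size_t rxfh_copy_size(void)
{
    return offsetof(struct rxnfc_sketch, data) + sizeof(uint64_t);
}
```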
  • On a 32-bit machine, info.rule_cnt >= 0x40000000 leads to integer
    overflow and the buffer may be smaller than needed. Since
    ETHTOOL_GRXCLSRLALL is unprivileged, this can presumably be used for at
    least denial of service.

    Signed-off-by: Ben Hutchings
    Cc: stable@kernel.org
    Signed-off-by: David S. Miller

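The class of bug fixed here is a classic count-times-size wrap. A hedged sketch of the guard (illustrative, not the exact ethtool code), using 32-bit arithmetic to model the 32-bit machine:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the overflow: a user-supplied element count multiplied by
 * the element size wraps in 32-bit arithmetic, so the allocation ends
 * up smaller than the data later written into it. The check rejects
 * any count whose product would wrap. */
static int checked_bytes32(uint32_t count, uint32_t elem, uint32_t *out)
{
    if (elem && count > UINT32_MAX / elem)
        return -1;            /* 32-bit multiply would wrap: reject */
    *out = count * elem;
    return 0;
}
```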
    Use this_cpu_ptr(p) instead of per_cpu_ptr(p, smp_processor_id()).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


26 Jun, 2010

2 commits

  • Add pr_fmt(fmt) KBUILD_MODNAME ": " fmt
    Remove "pktgen: " from formats
    Convert printks to pr_<level>
    Added func_enter() for debugging
    Moved version to end of string at module_init
    Coalesced long formats

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    gcc is currently unable to optimize the switch statement in
    sk_run_filter() because its case labels are not dense. This patch
    replaces the OR'd labels with ordered, sequential case labels. The
    sk_chk_filter() function is modified to patch/replace the original
    opcodes with an ordered but equivalent form, so gcc can now transform
    the switch statement in sk_run_filter() into a jump table of O(1)
    complexity.

    Before this patch, gcc generated a sequence of conditional branches
    (O(n)), 567 bytes of .text segment size (arch x86_64):

    7ff: 8b 06 mov (%rsi),%eax
    801: 66 83 f8 35 cmp $0x35,%ax
    805: 0f 84 d0 02 00 00 je adb
    80b: 0f 87 07 01 00 00 ja 918
    811: 66 83 f8 15 cmp $0x15,%ax
    815: 0f 84 c5 02 00 00 je ae0
    81b: 77 73 ja 890
    81d: 66 83 f8 04 cmp $0x4,%ax
    821: 0f 84 17 02 00 00 je a3e
    827: 77 29 ja 852
    829: 66 83 f8 01 cmp $0x1,%ax
    [...]

    With the modification, the compiler translates the switch statement
    into the following jump table fragment:

    7ff: 66 83 3e 2c cmpw $0x2c,(%rsi)
    803: 0f 87 1f 02 00 00 ja a28
    809: 0f b7 06 movzwl (%rsi),%eax
    80c: ff 24 c5 00 00 00 00 jmpq *0x0(,%rax,8)
    813: 44 89 e3 mov %r12d,%ebx
    816: e9 43 03 00 00 jmpq b5e
    81b: 41 89 dc mov %ebx,%r12d
    81e: e9 3b 03 00 00 jmpq b5e

    Furthermore, I reordered the instructions to reduce cache line misses
    by placing the most common instructions at the start.

    Signed-off-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

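The transformation relies on a general property of C switch statements, sketched below with a toy opcode set (none of this is sk_run_filter() itself): when case labels form a dense, ordered range, compilers can emit a single indirect jump through a table instead of a chain of compares.

```c
#include <assert.h>

/* Dense, ordered opcodes 0..3: a compiler can dispatch on these with
 * one bounds check plus one indirect jump (a jump table), rather than
 * O(n) conditional branches over scattered label values. */
enum op { OP_ADD = 0, OP_SUB = 1, OP_MUL = 2, OP_AND = 3 };

static int run_op(enum op code, int a, int b)
{
    switch (code) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_MUL: return a * b;
    case OP_AND: return a & b;
    }
    return 0;   /* unreachable for valid opcodes */
}
```

Renumbering scattered opcode constants into such a consecutive range, and rewriting the stored programs to match, is the same trade this patch makes in sk_chk_filter().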

24 Jun, 2010

1 commit

    netif_needs_gso() is checked twice in the TX path: once before
    submitting the skb to the qdisc, and once after it is dequeued
    from the qdisc just before calling ndo_start_xmit(). This opens
    a window for a user to change the gso/tso or tx checksum settings,
    which can cause netif_needs_gso() to be true in one check and false
    in the other.

    Specifically, changing the TX checksum setting may cause the
    warning in skb_gso_segment() to be triggered if the checksum is
    calculated earlier.

    This consolidates the netif_needs_gso() calls so that the stack
    only checks whether gso is needed in dev_hard_start_xmit().

    Signed-off-by: John Fastabend
    Cc: Herbert Xu
    Signed-off-by: Jeff Kirsher
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller


17 Jun, 2010

1 commit