25 Oct, 2011

13 commits


24 Oct, 2011

10 commits

  • David S. Miller
     
  • There is a long-standing bug in the Linux TCP stack concerning ACK messages
    sent on behalf of TIME_WAIT sockets.

    In the IP header of the ACK message, we choose to reflect the TOS field of
    the incoming message, and this might break some setups.

    Example of things that were broken :
    - Routing using TOS as a selector
    - Firewalls
    - Traffic classification / shaping

    We now remember in timewait structure the inet tos field and use it in
    ACK generation, and route lookup.

    Notes :
    - We still reflect incoming TOS in RST messages.
    - We could extend MuraliRaja Muniraju patch to report TOS value in
    netlink messages for TIME_WAIT sockets.
    - A patch is needed for IPv6
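A minimal sketch of the idea described above, not the kernel code itself: the TOS is copied into the timewait state when the socket enters TIME_WAIT, and later ACKs use that stored value instead of reflecting the TOS of the incoming segment. The names `tw_sketch`, `tw_sketch_init`, `ack_tos_reflected` and `ack_tos_remembered` are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the timewait structure; only the TOS is modelled. */
struct tw_sketch {
    uint8_t tos; /* TOS remembered from the original connection */
};

/* On entering TIME_WAIT, remember the TOS the full socket was using. */
void tw_sketch_init(struct tw_sketch *tw, uint8_t sk_tos)
{
    assert(tw != NULL);
    tw->tos = sk_tos;
}

/* Old behaviour: the ACK reflects whatever TOS the incoming segment carried. */
uint8_t ack_tos_reflected(uint8_t incoming_tos)
{
    return incoming_tos;
}

/* New behaviour: the ACK (and its route lookup) uses the remembered TOS. */
uint8_t ack_tos_remembered(const struct tw_sketch *tw, uint8_t incoming_tos)
{
    (void)incoming_tos; /* deliberately ignored */
    return tw->tos;
}
```

With the remembered value, a peer (or an attacker) that sends a segment with an arbitrary TOS can no longer steer how the reply is routed or classified.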

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Renato Westphal noticed that since commit a2835763e130c343ace5320c20d33c281e7097b7
    "rtnetlink: handle rtnl_link netlink notifications manually" was merged
    we no longer send a netlink message when a networking device is moved
    from one network namespace to another.

    Fix this by adding the missing manual notification in dev_change_net_namespace.

    Since all network devices that are processed by dev_change_net_namespace are
    in the initialized state, the complicated tests that guard the manual
    rtmsg_ifinfo calls in rollback_registered and register_netdevice are
    unnecessary and we can just perform a plain notification.

    Cc: stable@kernel.org
    Tested-by: Renato Westphal
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There is a bug in commit 5e2b61f (ipv4: Remove flowi from struct rtable).
    It makes xfrm4_fill_dst() modify the wrong data structure.

    Signed-off-by: Zheng Yan
    Reported-by: Kim Phillips
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yan, Zheng
     
  • If the device is down during suspend/resume, interrupts are enabled
    without a registered interrupt handler, causing a storm of
    unhandled interrupts until the IRQ is disabled because "nobody
    cared".

    Instead, check that the device is up before touching it in the
    suspend/resume code.

    Fixes https://bugzilla.kernel.org/show_bug.cgi?id=39112

    Helped-by: Adrian Chadd
    Helped-by: Mohammed Shafi
    Signed-off-by: Clemens Buchacher
    Signed-off-by: David S. Miller

    Clemens Buchacher
     
  • The commit f39925dbde7788cfb96419c0f092b086aa325c0f
    (ipv4: Cache learned redirect information in inetpeer.)
    removed some ICMP packet validations which are required by
    RFC 1122, section 3.2.2.2:
    ...
    A Redirect message SHOULD be silently discarded if the new
    gateway address it specifies is not on the same connected
    (sub-) net through which the Redirect arrived [INTRO:2,
    Appendix A], or if the source of the Redirect is not the
    current first-hop gateway for the specified destination (see
    Section 3.3.1).
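The two conditions in the quoted RFC text can be sketched as a single predicate. This is only an illustration under simplifying assumptions: addresses are plain host-byte-order integers, the interface has a single address and mask, and the names `redirect_acceptable` and `IP4` are hypothetical rather than taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build an IPv4 address in host byte order (sketch only). */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/*
 * Return true if a Redirect should be honoured, false if it should be
 * silently discarded according to the two RFC 1122 conditions quoted above.
 */
bool redirect_acceptable(uint32_t redirect_src, uint32_t new_gw,
                         uint32_t current_gw, uint32_t if_addr,
                         uint32_t if_mask)
{
    /* A zero mask would make every gateway look "on-link". */
    assert(if_mask != 0);

    /* The sender must be the current first-hop gateway for the destination. */
    if (redirect_src != current_gw)
        return false;

    /* The new gateway must be on the subnet the Redirect arrived through. */
    if ((new_gw & if_mask) != (if_addr & if_mask))
        return false;

    return true;
}
```

For a host at 192.168.1.10/24 with gateway 192.168.1.1, a Redirect from that gateway to 192.168.1.254 passes, while one pointing at 10.0.0.1, or one sent by any other host, is dropped.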

    Signed-off-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Flavio Leitner
     
  • The pair of functions,

    * skb_clone_tx_timestamp()
    * skb_complete_tx_timestamp()

    were designed to allow timestamping in PHY devices. The first
    function, called during the MAC driver's hard_xmit method, identifies
    PTP protocol packets, clones them, and gives them to the PHY device
    driver. The PHY driver may hold onto the packet and deliver it at a
    later time using the second function, which adds the packet to the
    socket's error queue.

    As pointed out by Johannes, nothing prevents the socket from
    disappearing while the cloned packet is sitting in the PHY driver
    awaiting a timestamp. This patch fixes the issue by taking a reference
    on the socket for each such packet. In addition, the comments
    regarding the usage of these functions are expanded to highlight the
    rule that PHY drivers must use skb_complete_tx_timestamp() to release
    the packet, in order to release the socket reference, too.

    These functions first appeared in v2.6.36.
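The reference-counting rule described above can be sketched with a toy model. This is not the kernel implementation: `sock_sketch`, `clone_sketch` and the two `_sketch` functions are hypothetical, and a plain integer stands in for the real socket reference count.

```c
#include <assert.h>
#include <stddef.h>

struct sock_sketch {
    int refcnt;       /* number of holders keeping the socket alive */
    int errqueue_len; /* packets delivered to the error queue */
};

struct clone_sketch {
    struct sock_sketch *sk; /* socket the clone will be delivered to */
};

/*
 * Step 1 (MAC driver transmit path): hand a clone to the PHY driver and
 * pin the socket so it cannot disappear while the clone is pending.
 */
void clone_tx_timestamp_sketch(struct sock_sketch *sk,
                               struct clone_sketch *clone)
{
    assert(sk != NULL && clone != NULL);
    sk->refcnt++;
    clone->sk = sk;
}

/*
 * Step 2 (PHY driver, later): deliver the clone to the error queue and
 * drop the reference taken in step 1.
 */
void complete_tx_timestamp_sketch(struct clone_sketch *clone)
{
    assert(clone != NULL && clone->sk != NULL);
    clone->sk->errqueue_len++;
    clone->sk->refcnt--;
    clone->sk = NULL;
}
```

The pairing is the point: a driver that frees the clone by any other path would leave the count elevated and leak the socket.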

    Reported-by: Johannes Berg
    Signed-off-by: Richard Cochran
    Cc:
    Signed-off-by: Eric Dumazet
    Reviewed-by: Johannes Berg
    Signed-off-by: David S. Miller

    Richard Cochran
     
  • Now that tcp_md5_hash_header() has a const tcphdr argument, we can add more
    const attributes to callers.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Add support for reporting ring sizes via ethtool -g to the virtio_net
    driver.

    Signed-off-by: Rick Jones
    Acked-by: Rusty Russell
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Rick Jones
     
  • tcp_md5_hash_header() writes a temporary zero value into the skb header,
    which might confuse other users of this area.

    Since tcphdr is small (20 bytes), copy it in a temporary variable and
    make the change in the copy.
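A rough sketch of the copy-then-modify approach described above. The header layout is a simplified 20-byte stand-in, and `toy_hash` (FNV-1a) is used purely as a placeholder for the real MD5 update; none of these names come from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified 20-byte TCP header layout (hypothetical, flags collapsed). */
struct tcphdr_sketch {
    uint16_t source;
    uint16_t dest;
    uint32_t seq;
    uint32_t ack_seq;
    uint16_t flags;
    uint16_t window;
    uint16_t check;
    uint16_t urg_ptr;
};

/* FNV-1a, used here only as a stand-in for the real MD5 update. */
uint32_t toy_hash(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hash the header with a zeroed checksum without touching the original. */
uint32_t hash_header_sketch(const struct tcphdr_sketch *th)
{
    struct tcphdr_sketch hdr;

    assert(th != NULL);
    memcpy(&hdr, th, sizeof(hdr)); /* small copy on the stack */
    hdr.check = 0;                 /* only the copy is modified */
    return toy_hash(&hdr, sizeof(hdr));
}
```

Because the argument is `const` and only the local copy is changed, anything else looking at the packet header at the same time never observes the temporary zero checksum.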

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Oct, 2011

4 commits

  • When I made class_attr_bonding_masters per network namespace and dynamically
    allocated, I overlooked the need for calling sysfs_attr_init. Oops.

    This fixes the following lockdep splat:

    [ 5.749651] bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    [ 5.749655] bonding: MII link monitoring set to 100 ms
    [ 5.749676] BUG: key f49a831c not in .data!
    [ 5.749677] ------------[ cut here ]------------
    [ 5.749752] WARNING: at kernel/lockdep.c:2897 lockdep_init_map+0x1c3/0x460()
    [ 5.749809] Hardware name: ProLiant BL460c G1
    [ 5.749862] Modules linked in: bonding(+)
    [ 5.749978] Pid: 3177, comm: modprobe Not tainted 3.1.0-rc9-02177-gf2d1a4e-dirty #1157
    [ 5.750066] Call Trace:
    [ 5.750120] [] ? printk+0x18/0x21
    [ 5.750176] [] warn_slowpath_common+0x6d/0xa0
    [ 5.750231] [] ? lockdep_init_map+0x1c3/0x460
    [ 5.750287] [] ? lockdep_init_map+0x1c3/0x460
    [ 5.750342] [] warn_slowpath_null+0x1d/0x20
    [ 5.750398] [] lockdep_init_map+0x1c3/0x460
    [ 5.750453] [] ? _raw_spin_unlock+0x1d/0x20
    [ 5.750510] [] ? sysfs_new_dirent+0x68/0x110
    [ 5.750565] [] sysfs_add_file_mode+0x8b/0xe0
    [ 5.750621] [] sysfs_add_file+0x13/0x20
    [ 5.750675] [] sysfs_create_file+0x1c/0x20
    [ 5.750737] [] class_create_file+0x19/0x20
    [ 5.750794] [] netdev_class_create_file+0xf/0x20
    [ 5.750853] [] bond_create_sysfs+0x44/0x90 [bonding]
    [ 5.750911] [] ? bond_create_proc_dir+0x1e/0x3e [bonding]
    [ 5.750970] [] bond_net_init+0x7e/0x87 [bonding]
    [ 5.751026] [] ? 0xf840ffff
    [ 5.751080] [] ops_init.clone.4+0xba/0x100
    [ 5.751135] [] ? register_pernet_subsys+0x12/0x30
    [ 5.751191] [] register_pernet_operations.clone.3+0x43/0x80
    [ 5.751249] [] register_pernet_subsys+0x19/0x30
    [ 5.751306] [] bonding_init+0x832/0x8a2 [bonding]
    [ 5.751363] [] do_one_initcall+0x30/0x160
    [ 5.751420] [] ? bond_net_init+0x87/0x87 [bonding]
    [ 5.751477] [] sys_init_module+0xef/0x1890
    [ 5.751533] [] sysenter_do_call+0x12/0x36
    [ 5.751588] ---[ end trace 89f492d83a7f5006 ]---

    Signed-off-by: Eric W. Biederman
    Reported-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Ari got kernel panics using tg3 NIC, and bisected to 2669069aacc9 "tg3:
    enable transmit time stamping."

    This is because tigon3_dma_hwbug_workaround() might alloc a new skb and
    free the original. We panic when skb_tx_timestamp() is called on freed
    skb.
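The hazard can be illustrated with a toy model. The sketch below shows one general way to avoid this class of bug, namely letting the helper update the caller's pointer so nothing dereferences the freed buffer afterwards. It is an assumption for illustration and not necessarily how the patch itself is written; `buf_sketch`, `workaround_sketch` and `tx_timestamp_sketch` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

struct buf_sketch {
    int len;
};

/*
 * May replace *pbuf with a fresh copy and free the original.
 * Returns 0 on success, -1 on allocation failure (original left untouched).
 */
int workaround_sketch(struct buf_sketch **pbuf)
{
    struct buf_sketch *copy;

    assert(pbuf != NULL && *pbuf != NULL);
    copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return -1;

    copy->len = (*pbuf)->len;
    free(*pbuf);  /* the caller's old pointer is now dangling ... */
    *pbuf = copy; /* ... so hand back the replacement */
    return 0;
}

/* Stand-in for a timestamp hook that dereferences the buffer. */
int tx_timestamp_sketch(const struct buf_sketch *buf)
{
    return buf->len;
}
```

If the helper took the buffer by value instead, the caller would keep using the freed pointer, which is the situation the commit message describes.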

    Reported-by: Ari Savolainen
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • INET_ECN_encapsulate() is better understood if we can read the official
    statement.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Signed-off-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Maciej Żenczykowski
     

21 Oct, 2011

13 commits

  • Due to a hardware problem, writes to the VFTA register can
    theoretically fail, although the likelihood of this is very low.
    This patch adds a shadow vfta in the adapter struct for reading
    and adds new write functions for these devices to work around the problem.

    Signed-off-by: Carolyn Wyborny
    Tested-by: Aaron Brown
    Signed-off-by: Jeff Kirsher

    Carolyn Wyborny
     
  • This patch moves the DMA Coalescing feature initialization code from
    igb_reset to a new function and replaces it with a call to the new
    function.

    Signed-off-by: Carolyn Wyborny
    Tested-by: Aaron Brown
    Signed-off-by: Jeff Kirsher

    Carolyn Wyborny
     
  • In 82580 and later devices, the alternate MAC address feature is
    completely handled by the option ROM and software does not handle
    it anymore. This patch changes the check_alt_mac_addr function to
    exit immediately if the device is 82580 or later.

    Signed-off-by: Carolyn Wyborny
    Signed-off-by: Jeff Kirsher

    Carolyn Wyborny
     
  • Signed-off-by: Mitch Williams
    Tested-by: Sibai Li
    Signed-off-by: Jeff Kirsher

    Williams, Mitch A
     
  • Update adapter identification strings to properly indicate i350 VF devices
    in the VF driver. Change the driver ID string to remove 82576-specific
    wording. Update copyright date.

    Signed-off-by: Mitch Williams
    Tested-by: Sibai Li
    Signed-off-by: Jeff Kirsher

    Williams, Mitch A
     
  • Adding const qualifiers to pointers can ease code review and spot some
    bugs. It might also allow the compiler to optimize code further.

    For example, is it legal to temporarily write a null cksum into tcphdr
    in tcp_md5_hash_header() ? I am afraid a sniffer could catch the
    temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Instead of using the dev->next chain and trying to resync at each call to
    dev_seq_start, use the name hash, keeping the bucket and the offset in
    seq->private field.
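A simplified sketch of the bucket-plus-offset idea, assuming a tiny hash table of singly linked lists. The position is packed into one integer whose high bits hold the bucket and whose low bits hold the offset within that bucket, so resuming does not require walking the whole device list. All names and sizes here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_BITS    3
#define NBUCKETS     (1u << HASH_BITS)
#define BUCKET_SPACE (32 - HASH_BITS - 1) /* low bits reserved for the offset */

struct dev_sketch {
    const char *name;
    struct dev_sketch *next; /* next device in the same hash bucket */
};

unsigned int pos_bucket(unsigned int pos)
{
    return pos >> BUCKET_SPACE;
}

unsigned int pos_offset(unsigned int pos)
{
    return pos & ((1u << BUCKET_SPACE) - 1);
}

unsigned int make_pos(unsigned int bucket, unsigned int offset)
{
    return (bucket << BUCKET_SPACE) | offset;
}

/*
 * Return the device at *pos, skipping ahead to later buckets when the
 * current one is exhausted. *pos is updated to where the device was found.
 * Returns NULL once every bucket has been visited.
 */
struct dev_sketch *dev_from_pos(struct dev_sketch *table[NBUCKETS],
                                unsigned int *pos)
{
    unsigned int bucket, offset;

    assert(table != NULL && pos != NULL);
    bucket = pos_bucket(*pos);
    offset = pos_offset(*pos);

    for (; bucket < NBUCKETS; bucket++, offset = 0) {
        unsigned int i = 0;

        for (struct dev_sketch *d = table[bucket]; d; d = d->next, i++) {
            if (i == offset) {
                *pos = make_pos(bucket, offset);
                return d;
            }
        }
    }
    return NULL;
}
```

An iterator built on this increments the position after each device and calls `dev_from_pos` again; each call only walks a single short bucket chain rather than the full list.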

    Tests revealed the following results for ifconfig > /dev/null
    * 1000 interfaces:
    * 0.114s without patch
    * 0.089s with patch
    * 3000 interfaces:
    * 0.489s without patch
    * 0.110s with patch
    * 5000 interfaces:
    * 1.363s without patch
    * 0.250s with patch
    * 128000 interfaces (other setup):
    * ~100s without patch
    * ~30s with patch

    Signed-off-by: Mihai Maruseac
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Mihai Maruseac
     
  • On systems that create and delete lots of dynamic devices the
    31bit linux ifindex fails to fit in the 16bit macvtap minor,
    resulting in unusable macvtap devices. I have systems running
    automated tests that hit this condition in just a few days.

    Use a linux idr allocator to track which macvtap minor numbers
    are available and to track the association between macvtap
    minor numbers and macvtap network devices.

    Remove the unnecessary check to see if the network
    device we have found is indeed a macvtap device. With macvtap
    specific data structures it is impossible to find any other
    kind of networking device.

    Increase the macvtap minor range from 65536 to the full 20 bits
    that is supported by linux device numbers. It doesn't solve the
    original problem but there is no penalty for a larger minor
    device range.
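To illustrate why a dedicated allocator helps, here is a toy minor-number map. It is a stand-in for the idr allocator mentioned above, not its API: a fixed array replaces the real data structure, the range is deliberately tiny, and every name is hypothetical. The property of interest is that freed minors are reused, so the minor never has to track an ever-growing ifindex.

```c
#include <assert.h>
#include <stddef.h>

#define MINOR_SLOTS 8 /* tiny on purpose; the real range is far larger */

struct minor_map_sketch {
    void *dev[MINOR_SLOTS]; /* minor number -> device, NULL when free */
};

/* Allocate the lowest free minor >= 1 for dev, or return -1 if none is left. */
int minor_alloc_sketch(struct minor_map_sketch *m, void *dev)
{
    assert(m != NULL && dev != NULL);
    for (int i = 1; i < MINOR_SLOTS; i++) {
        if (m->dev[i] == NULL) {
            m->dev[i] = dev;
            return i;
        }
    }
    return -1;
}

/* Look up the device that owns a minor; NULL if unused or out of range. */
void *minor_lookup_sketch(const struct minor_map_sketch *m, int minor)
{
    assert(m != NULL);
    if (minor < 1 || minor >= MINOR_SLOTS)
        return NULL;
    return m->dev[minor];
}

/* Release a minor so that a later allocation can reuse it. */
void minor_free_sketch(struct minor_map_sketch *m, int minor)
{
    assert(m != NULL);
    if (minor >= 1 && minor < MINOR_SLOTS)
        m->dev[minor] = NULL;
}
```

Since only this map hands out minors and only macvtap devices are stored in it, a lookup can never return some other kind of network device.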

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Place macvlan_common_newlink at the end of macvtap_newlink because
    failing in newlink after registering your network device is not
    supported.

    Move device_create into a netdevice creation notifier. The network device
    notifier is the only hook that is called after the network device has been
    registered with the device layer and before register_netdevice returns
    success.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • To avoid leaking packets in the receive queue, add a socket destructor
    that will run whenever a macvtap socket is destroyed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • To see if it is appropriate to enable the macvtap zero copy feature,
    don't test the lowerdev network device flags. Instead test the
    macvtap network device flags, which are a direct copy of the lowerdev
    flags. This is important because nothing holds a reference to lowerdev
    and on a very bad day lowerdev could be a pointer to stale memory.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There is a small window in macvtap_open between looking up a
    networking device and calling macvtap_set_queue in which
    macvtap_del_queues can be called from macvtap_dellink. After
    calling macvtap_del_queues it is totally incorrect to
    allow macvtap_set_queue to proceed so prevent success by
    reporting that all of the available queues are in use.
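The "report every queue as in use" trick can be sketched as follows. This is a single-threaded toy that ignores the locking the real code needs, and the names and the queue limit are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_QUEUES_SKETCH 16

struct vlan_sketch {
    int numvtaps; /* number of queues currently attached */
};

/* Attach one more queue, or fail with -EBUSY when none appear to be free. */
int set_queue_sketch(struct vlan_sketch *vlan)
{
    assert(vlan != NULL);
    if (vlan->numvtaps >= MAX_QUEUES_SKETCH)
        return -EBUSY;
    vlan->numvtaps++;
    return 0;
}

/*
 * Tear down all queues (release of the queues themselves is elided) and
 * mark the device as full so any late set_queue attempt is refused.
 */
void del_queues_sketch(struct vlan_sketch *vlan)
{
    assert(vlan != NULL);
    vlan->numvtaps = MAX_QUEUES_SKETCH;
}
```

Reusing the existing "all queues busy" failure path means an open that loses the race simply sees an ordinary error and needs no new special case.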

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • We must account in skb->truesize for the full size of the fragments, not
    just the used part of them.

    Doing this work is important to avoid unexpected OOM situations.
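A small worked example of the difference, assuming for illustration that each fragment occupies one 4096-byte allocation regardless of how many bytes of it carry data. The constant and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FRAG_ALLOC_SIZE 4096u /* assumed: one full allocation per fragment */

/* Accounting by bytes actually used in each fragment (the old approach). */
unsigned int truesize_used_sketch(const unsigned int *used, size_t nfrags)
{
    unsigned int total = 0;

    assert(used != NULL || nfrags == 0);
    for (size_t i = 0; i < nfrags; i++)
        total += used[i];
    return total;
}

/* Accounting by memory actually pinned by the fragments. */
unsigned int truesize_allocated_sketch(size_t nfrags)
{
    return (unsigned int)nfrags * FRAG_ALLOC_SIZE;
}
```

Under this assumption, three fragments carrying 100 bytes each would be charged 300 bytes by the first function while really pinning 12288 bytes, which is how socket memory limits can be bypassed until the system runs out of memory.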

    Signed-off-by: Eric Dumazet
    CC: Rusty Russell
    CC: "Michael S. Tsirkin"
    CC: virtualization@lists.linux-foundation.org
    CC: Krishna Kumar
    Signed-off-by: David S. Miller

    Eric Dumazet