21 Feb, 2012

1 commit

  • Assorted fixes, sat in -next for a week or so...

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ocfs2: deal with wraparounds of i_nlink in ocfs2_rename()
    vfs: fix compat_sys_stat() handling of overflows in st_nlink
    quota: Fix deadlock with suspend and quotas
    vfs: Provide function to get superblock and wait for it to thaw
    vfs: fix panic in __d_lookup() with high dentry hashtable counts
    autofs4 - fix lockdep splat in autofs
    vfs: fix d_inode_lookup() dentry ref leak

    Linus Torvalds
     

16 Feb, 2012

1 commit


15 Feb, 2012

3 commits

  • commit 5a698af53f (bond: service netpoll arp queue on master device)
    tested IFF_SLAVE flag against dev->priv_flags instead of dev->flags

    Signed-off-by: Eric Dumazet
    Cc: WANG Cong
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Eric Dumazet
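
As a rough illustration of the mistake described above, the sketch below tests a flag bit in the wrong flag word and then in the right one. `struct fake_dev`, `DEMO_IFF_SLAVE` and both helper functions are hypothetical names invented for this example; only the field names `flags` and `priv_flags` come from the commit message, and the bit value is an assumption.

```c
#include <stdbool.h>

/* Illustrative bit value; assumed for this sketch, not read from kernel headers. */
#define DEMO_IFF_SLAVE 0x800u

/* Hypothetical stand-in for the two flag words carried by a network device. */
struct fake_dev {
    unsigned int flags;      /* generic interface flags */
    unsigned int priv_flags; /* private flags, a separate bit namespace */
};

/* Buggy form: looks for the slave bit in the wrong flag word. */
static bool is_slave_buggy(const struct fake_dev *dev)
{
    return (dev->priv_flags & DEMO_IFF_SLAVE) != 0;
}

/* Fixed form: looks for the slave bit in dev->flags. */
static bool is_slave_fixed(const struct fake_dev *dev)
{
    return (dev->flags & DEMO_IFF_SLAVE) != 0;
}
```

Because the two words use independent bit assignments, the buggy test can miss a real slave device or match an unrelated private flag.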
     
  • The first parameter should be "number of elements" and the second parameter
    should be "element size".

    Signed-off-by: Axel Lin
    Acked-by: David Howells
    Signed-off-by: David S. Miller

    Axel Lin
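
The message does not name the allocator involved, so the sketch below uses the standard C `calloc(nmemb, size)` as a userspace analogue of the same argument convention. `struct item` and `alloc_items` are hypothetical names for illustration only.

```c
#include <stdlib.h>

/* Hypothetical element type for the example. */
struct item {
    int id;
    char name[28];
};

/* Allocate a zeroed array of n items: element count first, element size second. */
static struct item *alloc_items(size_t n)
{
    return calloc(n, sizeof(struct item));
}
```

Swapping the two arguments requests the same number of bytes, which is why this kind of mistake tends to go unnoticed until someone reads the call closely.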
     
  • This commit ensures that lost_cnt_hint is correctly updated in
    tcp_shifted_skb() for FACK TCP senders. The lost_cnt_hint adjustment
    in tcp_sacktag_one() only applies to non-FACK senders, so FACK senders
    need their own adjustment.

    This applies the spirit of 1e5289e121372a3494402b1b131b41bfe1cf9b7f -
    except now that the sequence range passed into tcp_sacktag_one() is
    correct we need only have a special case adjustment for FACK.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
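
A minimal sketch of why a signed loop bound fails at 2147483648 entries. It is not the kernel code: `iterations_signed` and `iterations_unsigned` are hypothetical helpers that only count how often a loop of the form `for (i = 0; i < limit; i++)` would run, and the wrap of 2^31 to `INT_MIN` is modeled explicitly as an assumption about common two's complement targets.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Iterations of "for (i = 0; i < limit; i++)" when limit is a signed int. */
static uint64_t iterations_signed(unsigned int shift)
{
    assert(shift < 32);
    int64_t wide = (int64_t)1 << shift;
    /* Assumption: storing 2^31 in an int wraps to INT_MIN. Modeled here
     * explicitly so the sketch itself has no undefined behavior. */
    int limit = (wide > INT_MAX) ? INT_MIN : (int)wide;
    return limit > 0 ? (uint64_t)limit : 0;
}

/* Same loop with an unsigned int limit. */
static uint64_t iterations_unsigned(unsigned int shift)
{
    assert(shift < 32);
    unsigned int limit = 1u << shift;
    return limit;
}
```

With a shift of 31 the signed variant runs zero times, so no bucket would be initialized, while the unsigned variant covers all 2147483648 entries.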
     

13 Feb, 2012

2 commits

  • Fix the newly-SACKed range to be the range of newly-shifted bytes.

    Previously - since 832d11c5cd076abc0aa1eaf7be96c81d1a59ce41 -
    tcp_shifted_skb() incorrectly called tcp_sacktag_one() with the start
    and end sequence numbers of the skb it passes in set to the range just
    beyond the range that is newly-SACKed.

    This commit also removes a special-case adjustment to lost_cnt_hint in
    tcp_shifted_skb() since the pre-existing adjustment of lost_cnt_hint
    in tcp_sacktag_one() now properly handles this, now that the
    correct start sequence number is passed in.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • This commit allows callers of tcp_sacktag_one() to pass in sequence
    ranges that do not align with skb boundaries, as tcp_shifted_skb()
    needs to do in an upcoming fix in this patch series.

    In fact, now tcp_sacktag_one() does not need to depend on an input skb
    at all, which makes its semantics and dependencies more clear.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

11 Feb, 2012

6 commits

  • Quoth David:

    1) GRO MAC header comparisons were ethernet specific, breaking other
    link types. This required a multi-faceted fix to cure the originally
    noted case (Infiniband), because IPoIB was lying about its actual
    hard header length. Thanks to Eric Dumazet, Roland Dreier, and
    others.

    2) Fix build failure when INET_UDP_DIAG is built in and ipv6 is modular.
    From Anisse Astier.

    3) Off by ones and other bug fixes in netprio_cgroup from Neil Horman.

    4) ipv4 TCP reset generation needs to respect any network interface
    binding from the socket, otherwise route lookups might give a
    different result than all the other segments received. From Shawn
    Lu.

    5) Fix unintended regression in ipv4 proxy ARP responses, from Thomas
    Graf.

    6) Fix SKB under-allocation bug in sh_eth, from Yoshihiro Shimoda.

    7) Revert skge PCI mapping changes that are causing crashes for some
    folks, from Stephen Hemminger.

    8) IPV4 route lookups fill in the wildcarded fields of the given flow
    lookup key passed in, which is fine most of the time as this is
    exactly what the callers want. However there are a few cases that
    want to retain the original flow key values afterwards, so handle
    those cases properly. Fix from Julian Anastasov.

    9) IGB/IXGBE VF lookup bug fixes from Greg Rose.

    10) Properly null terminate filename passed to ethtool flash device
    method, from Ben Hutchings.

    11) S3 resume fix in via-velocity from David Lv.

    12) Fix double SKB free during xmit failure in CAIF, from Dmitry
    Tarnyagin.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
    net: Don't proxy arp respond if iif == rt->dst.dev if private VLAN is disabled
    ipv4: Fix wrong order of ip_rt_get_source() and update iph->daddr.
    netprio_cgroup: fix wrong memory access when NETPRIO_CGROUP=m
    netprio_cgroup: don't allocate prio table when a device is registered
    netprio_cgroup: fix an off-by-one bug
    bna: fix error handling of bnad_get_flash_partition_by_offset()
    isdn: type bug in isdn_net_header()
    net: Make qdisc_skb_cb upper size bound explicit.
    ixgbe: ethtool: stats user buffer overrun
    ixgbe: dcb: up2tc mapping lost on disable/enable CEE DCB state
    ixgbe: do not update real num queues when netdev is going away
    ixgbe: Fix broken dependency on MAX_SKB_FRAGS being related to page size
    ixgbe: Fix case of Tx Hang in PF with 32 VFs
    ixgbe: fix vf lookup
    igb: fix vf lookup
    e1000: add dropped DMA receive enable back in for WoL
    gro: more generic L2 header check
    IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses
    zd1211rw: firmware needs duration_id set to zero for non-pspoll frames
    net: enable TC35815 for MIPS again
    ...

    Linus Torvalds
     
  • Commit 653241 (net: RFC3069, private VLAN proxy arp support) changed
    the behavior of arp proxy to send arp replies back out on the interface
    the request came in even if the private VLAN feature is disabled.

    Previously we checked rt->dst.dev != skb->dev in both scenarios: when
    proxy arp is enabled for the netdevice and also when individual proxy
    neighbour entries have been added.

    This patch adds the check back for the pneigh_lookup() scenario.

    Signed-off-by: Thomas Graf
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • This patch fixes a bug introduced by commit ac8a4810 (ipv4: Save
    nexthop address of LSRR/SSRR option to IPCB.). In that patch, we saved
    the nexthop of SRR in ip_option->nexthop and did not update iph->daddr
    until we got to ip_forward_options(), but we need to update it before
    ip_rt_get_source(), otherwise we may get a wrong src.

    Signed-off-by: Li Wei
    Signed-off-by: David S. Miller

    Li Wei
     
  • When the netprio_cgroup module is not loaded, net_prio_subsys_id
    is -1, and so sock_update_prioidx() accesses cgroup_subsys array
    with negative index subsys[-1].

    Make the code resemble the cls_cgroup code, which is bug free.

    Originally-authored-by: Li Zefan
    Signed-off-by: Li Zefan
    Signed-off-by: Neil Horman
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
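
The out-of-bounds access described above can be pictured with a generic guard on a table lookup. This is only a sketch of the failure mode, not the fix that was merged (which, per the message, mirrors the cls_cgroup code); `demo_subsys`, `DEMO_SUBSYS_COUNT` and `demo_lookup` are hypothetical names.

```c
#include <stddef.h>

#define DEMO_SUBSYS_COUNT 4

/* Hypothetical table standing in for an array indexed by subsystem id. */
static const char *demo_subsys[DEMO_SUBSYS_COUNT] = {
    "cpu", "memory", "net_cls", "net_prio"
};

/* Return the entry for id, or NULL when the id is unset (-1) or out of range. */
static const char *demo_lookup(int id)
{
    if (id < 0 || id >= DEMO_SUBSYS_COUNT)
        return NULL;
    return demo_subsys[id];
}
```

Without the range check, an id of -1 would read the slot just before the array.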
     
  • So we delay the allocation till the priority is set through cgroup,
    and this makes skb_update_priority() faster when it's not set.

    This also eliminates an off-by-one bug similar to the one fixed
    in the previous patch.

    Originally-authored-by: Li Zefan
    Signed-off-by: Li Zefan
    Signed-off-by: Neil Horman
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     
  • # mount -t cgroup xxx /mnt
    # mkdir /mnt/tmp
    # cat /mnt/tmp/net_prio.ifpriomap
    lo 0
    eth0 0
    virbr0 0
    # echo 'lo 999' > /mnt/tmp/net_prio.ifpriomap
    # cat /mnt/tmp/net_prio.ifpriomap
    lo 999
    eth0 0
    virbr0 4101267344

    We got weird output, because we exceeded the boundary of the array.
    We may even crash the kernel.

    Originally-authored-by: Li Zefan
    Signed-off-by: Li Zefan
    Signed-off-by: Neil Horman
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

10 Feb, 2012

2 commits

  • read_lock(&tpt_trig->trig.leddev_list_lock) is reached via the path
    ieee80211_open (->) ieee80211_do_open (->) ieee80211_mod_tpt_led_trig
    (->) ieee80211_start_tpt_led_trig (->) tpt_trig_timer before the lock
    is initialized.
    The initialization of this read/write lock happens via the path
    ieee80211_led_init (->) led_trigger_register, but we call
    'ieee80211_led_init' after 'ieee80211_if_add', where we
    register netdev_ops.
    So we access leddev_list_lock before initializing it, which causes the
    following bug on Chrome laptops with AR928X cards when running the
    following script:

    while true
    do
    sudo modprobe -v ath9k
    sleep 3
    sudo modprobe -r ath9k
    sleep 3
    done

    BUG: rwlock bad magic on CPU#1, wpa_supplicant/358, f5b9eccc
    Pid: 358, comm: wpa_supplicant Not tainted 3.0.13 #1
    Call Trace:

    [] rwlock_bug+0x3d/0x47
    [] do_raw_read_lock+0x19/0x29
    [] _raw_read_lock+0xd/0xf
    [] tpt_trig_timer+0xc3/0x145 [mac80211]
    [] ieee80211_mod_tpt_led_trig+0x152/0x174 [mac80211]
    [] ieee80211_do_open+0x11e/0x42e [mac80211]
    [] ? ieee80211_check_concurrent_iface+0x26/0x13c [mac80211]
    [] ieee80211_open+0x48/0x4c [mac80211]
    [] __dev_open+0x82/0xab
    [] __dev_change_flags+0x9c/0x113
    [] dev_change_flags+0x18/0x44
    [] devinet_ioctl+0x243/0x51a
    [] inet_ioctl+0x93/0xac
    [] sock_ioctl+0x1c6/0x1ea
    [] ? might_fault+0x20/0x20
    [] do_vfs_ioctl+0x46e/0x4a2
    [] ? fget_light+0x2f/0x70
    [] ? sys_recvmsg+0x3e/0x48
    [] sys_ioctl+0x46/0x69
    [] sysenter_do_call+0x12/0x2

    Cc:
    Cc: Gary Morain
    Cc: Paul Stewart
    Cc: Abhijit Pradhan
    Cc: Vasanthakumar Thiagarajan
    Cc: Rajkumar Manoharan
    Acked-by: Johannes Berg
    Tested-by: Mohammed Shafi Shajakhan
    Signed-off-by: Mohammed Shafi Shajakhan
    Signed-off-by: John W. Linville

    Mohammed Shafi Shajakhan
     
  • Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside
    of other data structures.

    This is intended to be used by IPoIB so that it can remember
    addressing information stored at hard_header_ops->create() time that
    it can fetch when the packet gets to the transmit routine.

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Feb, 2012

1 commit

  • Shlomo Pongratz reported that the GRO L2 header check was suited for
    Ethernet only, and failed on IB/ipoib traffic.

    He provided a patch faking a zeroed header to let GRO aggregate frames.

    Roland Dreier, Herbert Xu, and others suggested we change the GRO L2 header
    check to be more generic, i.e. not assuming the L2 header is 14 bytes, but
    taking into account hard_header_len.

    __napi_gro_receive() has special handling for the common case (Ethernet)
    to avoid a memcmp() call and use an inline optimized function instead.

    Signed-off-by: Eric Dumazet
    Reported-by: Shlomo Pongratz
    Cc: Roland Dreier
    Cc: Or Gerlitz
    Cc: Herbert Xu
    Tested-by: Sean Hefty
    Signed-off-by: David S. Miller

    Eric Dumazet
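
The idea of a length-aware L2 comparison with a fast path for the 14-byte Ethernet case might look roughly like the sketch below. It works on plain byte buffers rather than skbs, and `demo_eth_header_differs`, `demo_l2_header_differs` and `DEMO_ETH_HLEN` are hypothetical names, not the helpers used in `__napi_gro_receive()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ETH_HLEN 14

/* Fast path for the common 14-byte header: fixed-length inline compare. */
static inline bool demo_eth_header_differs(const unsigned char *a,
                                           const unsigned char *b)
{
    unsigned char diff = 0;

    for (size_t i = 0; i < DEMO_ETH_HLEN; i++)
        diff |= (unsigned char)(a[i] ^ b[i]);
    return diff != 0;
}

/* Generic check: compare exactly hard_header_len bytes of L2 header. */
static bool demo_l2_header_differs(const unsigned char *a,
                                   const unsigned char *b,
                                   size_t hard_header_len)
{
    if (hard_header_len == DEMO_ETH_HLEN)
        return demo_eth_header_differs(a, b);
    return memcmp(a, b, hard_header_len) != 0;
}
```

A link type with a longer header is then compared over its full length instead of only the first 14 bytes.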
     

08 Feb, 2012

1 commit


05 Feb, 2012

2 commits

  • Bind the RST packet's outgoing interface to the incoming interface
    for TCP v4 when there is no socket associated with it.
    When sk is not NULL, use sk->sk_bound_dev_if instead
    (suggested by Eric Dumazet).

    This has a few benefits:
    1. tcp_v6_send_reset already does that.
    2. This helps TCP connections with SO_BINDTODEVICE set. When the
    connection is lost, we are still able to send out the RST using
    the same interface.
    3. Since we are sending a reply, it is most likely to succeed
    if iif is used.

    Signed-off-by: Shawn Lu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shawn Lu
     
  • It was recently pointed out to me that the get_prioidx function sets a bit in
    the prioidx map prior to checking to see if the index being set is out of
    bounds. This patch corrects that, avoiding the possibility of us writing beyond
    the end of the array.

    Signed-off-by: Neil Horman
    Reported-by: Stanislaw Gruszka
    CC: Stanislaw Gruszka
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
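
The ordering problem described above (write first, validate afterwards) can be sketched with a small bitmap. This is not the `get_prioidx` code; `struct demo_map`, `demo_claim_index` and the map size are assumptions made for the example.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAP_BITS  64
#define DEMO_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS     ((DEMO_MAP_BITS + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

/* Hypothetical fixed-size bitmap of allocated indices. */
struct demo_map {
    unsigned long words[DEMO_WORDS];
};

/* Mark idx as used. Returns false, touching nothing, when idx is out of range. */
static bool demo_claim_index(struct demo_map *m, size_t idx)
{
    if (idx >= DEMO_MAP_BITS)       /* validate first ... */
        return false;
    m->words[idx / DEMO_WORD_BITS] |= 1UL << (idx % DEMO_WORD_BITS); /* ... then write */
    return true;
}
```

Doing the bounds check before the bit is set means a rejected index never modifies memory past the end of the map.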
     

04 Feb, 2012

1 commit


03 Feb, 2012

5 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    rbd: fix safety of rbd_put_client()
    rbd: fix a memory leak in rbd_get_client()
    ceph: create a new session lock to avoid lock inversion
    ceph: fix length validation in parse_reply_info()
    ceph: initialize client debugfs outside of monc->mutex
    ceph: change "ceph.layout" xattr to be "ceph.file.layout"

    Linus Torvalds
     
  • Initializing debugfs under monc->mutex introduces a lock dependency for
    sb->s_type->i_mutex_key, which (combined with several other dependencies)
    leads to an annoying lockdep warning. There's no particular reason to do
    the debugfs setup under this lock, so move it out.

    It used to be the case that our first monmap could come from the OSD; that
    is no longer the case with recent servers, so we will reliably set up the
    client entry during the initial authentication.

    We don't have to worry about racing with debugfs teardown by
    ceph_debugfs_client_cleanup() because ceph_destroy_client() calls
    ceph_msgr_flush() first, which will wait for the message dispatch work
    to complete (and the debugfs init to complete).

    Fixes: #1940
    Signed-off-by: Sage Weil

    Sage Weil
     
  • The SKB is freed twice upon a send error. The network stack consumes the
    SKB even when it returns an error code.

    Signed-off-by: Sjur Brændeland
    Signed-off-by: David S. Miller

    Dmitry Tarnyagin
     
  • Always use cfmuxl_remove_uplayer when removing an up-layer.
    cfmuxl_ctrlcmd() can be called independently and in parallel with
    cfmuxl_remove_uplayer(). The race between them could cause list_del_rcu
    to be called on a node which has already been taken out of the list.
    That led to a (rare) crash on accessing poisoned node->prev inside
    list_del_rcu.

    This fix ensures that deletions are done holding the same lock.

    Reported-by: Dmitry Tarnyagin
    Signed-off-by: Sjur Brændeland
    Signed-off-by: David S. Miller

    sjur.brandeland@stericsson.com
     
  • Commit 4acb4190 tries to fix the use of an uninitialized value
    introduced by commit 3dc43e3, but it would make the
    per-socket memory limits too small.

    This patch fixes this and also removes the redundant code
    introduced in 4acb4190.

    Signed-off-by: Jason Wang
    Acked-by: Glauber Costa
    Signed-off-by: David S. Miller

    Jason Wang
     

02 Feb, 2012

3 commits

  • The current code checks for stored_mpdu_num > 1, causing
    the reorder_timer to be triggered indefinitely, but the
    frame is never timed-out (until the next packet is received)

    Signed-off-by: Eliad Peller
    Cc:
    Acked-by: Johannes Berg
    Signed-off-by: John W. Linville

    Eliad Peller
     
  • The parameters for ETHTOOL_FLASHDEV include a filename, which ought to
    be null-terminated. Currently the only driver that implements
    ethtool_ops::flash_device attempts to add a null terminator if
    necessary, but does it wrongly. Do it in the ethtool core instead.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
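
Terminating the filename once, at the point where the request is copied in, might look like the sketch below. `struct demo_flash_req`, `demo_copy_flash_req` and the 128-byte buffer size are assumptions for illustration; they are not the ethtool core's actual structure or function.

```c
#include <string.h>

#define DEMO_FLASH_NAME_MAX 128 /* assumed buffer size for this sketch */

/* Hypothetical request carrying a fixed-size filename buffer. */
struct demo_flash_req {
    char data[DEMO_FLASH_NAME_MAX];
};

/* Copy a caller-supplied request and force the filename to be terminated. */
static void demo_copy_flash_req(struct demo_flash_req *dst,
                                const struct demo_flash_req *src)
{
    memcpy(dst, src, sizeof(*dst));
    dst->data[sizeof(dst->data) - 1] = '\0';
}
```

Doing this in one central place means individual drivers can treat the buffer as a valid C string without each re-implementing the check.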
     
  • Some of our machines were reporting:

    TCP: too many of orphaned sockets

    even when the number of orphaned sockets was well below the
    limit.

    We print a different message depending on whether we're out
    of TCP memory or there are too many orphaned sockets.

    Also move the check out of line and cleanup the messages
    that were printed.

    Signed-off-by: Arun Sharma
    Suggested-by: Mohan Srinivasan
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: David Miller
    Cc: Glauber Costa
    Cc: Ingo Molnar
    Cc: Joe Perches
    Signed-off-by: David S. Miller

    Arun Sharma
     

31 Jan, 2012

5 commits

  • 1) Setting link attributes can modify the size of the attributes that
    would be reported on a subsequent getlink netlink operation,
    therefore min_ifinfo_dump_size needs to be adjusted. From Stefan
    Gula.

    2) Resegmentation of TSO frames while trimming can violate invariants
    expected by callers, namely that the number of segments can only stay
    the same or decrease, never increase. If MSS changes, however, we
    can trim data but then end up with more segments. Fix this by only
    segmenting to the MSS already recorded in the SKB. That's the
    simplest fix for now and if we want to get more fancy in the future
    that's a more involved change.

    This probably explains some retransmit counter inaccuracies.

    From Neal Cardwell.

    3) Fix too-many-wakeups in POLL with AF_UNIX sockets, from Eric Dumazet.

    4) Fix CAIF crashes wrt. namespace handling. From Eric Dumazet and
    Eric W. Biederman.

    5) TCP port selection fixes from Flavio Leitner.

    6) More socket memory cgroup build fixes in certain randconfig
    situations. From Glauber Costa.

    7) Fix TCP memory sysctl regression reported by Ingo Molnar, also from
    Glauber Costa.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    af_unix: fix EPOLLET regression for stream sockets
    tcp: fix tcp_trim_head() to adjust segment count with skb MSS
    net/tcp: Fix tcp memory limits initialization when !CONFIG_SYSCTL
    net caif: Register properly as a pernet subsystem.
    netns: Fail conspicously if someone uses net_generic at an inappropriate time.
    net: explicitly add jump_label.h header to sock.h
    net: RTNETLINK adjusting values of min_ifinfo_dump_size
    ipv6: Fix ip_gre lockless xmits.
    xen-netfront: correct MAX_TX_TARGET calculation.
    netns: fix net_alloc_generic()
    tcp: bind() optimize port allocation
    tcp: bind() fix autoselection to share ports
    l2tp: l2tp_ip - fix possible oops on packet receive
    iwlwifi: fix PCI-E transport "inta" race
    mac80211: set bss_conf.idle when vif is connected
    mac80211: update oper_channel on ibss join

    Linus Torvalds
     
  • Commit 0884d7aa24 (AF_UNIX: Fix poll blocking problem when reading from
    a stream socket) added a regression for epoll() in Edge Triggered mode
    (EPOLLET).

    The appropriate fix is to use skb_peek()/skb_unlink() instead of
    skb_dequeue(), and only call skb_unlink() when the skb is fully consumed.

    This removes the need to requeue a partial skb into sk_receive_queue head
    and the extra sk->sk_data_ready() calls that added the regression.

    This is safe because once skb is given to sk_receive_queue, it is not
    modified by a writer, and readers are serialized by u->readlock mutex.

    This also reduces the number of spinlock acquisitions for small reads or
    MSG_PEEK users, so it should improve overall performance.

    Reported-by: Nick Mathewson
    Signed-off-by: Eric Dumazet
    Cc: Alexey Moiseytsev
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This commit fixes tcp_trim_head() to recalculate the number of
    segments in the skb with the skb's existing MSS, so trimming the head
    causes the skb segment count to be monotonically non-increasing - it
    should stay the same or go down, but not increase.

    Previously tcp_trim_head() used the current MSS of the connection. But
    if there was a decrease in MSS between original transmission and ACK
    (e.g. due to PMTUD), this could cause tcp_trim_head() to
    counter-intuitively increase the segment count when trimming bytes off
    the head of an skb. This violated assumptions in tcp_tso_acked() that
    tcp_trim_head() only decreases the packet count, so that packets_acked
    in tcp_tso_acked() could underflow, leading tcp_clean_rtx_queue() to
    pass u32 pkts_acked values as large as 0xffffffff to
    ca_ops->pkts_acked().

    As an aside, if tcp_trim_head() had really wanted the skb to reflect
    the current MSS, it should have called tcp_set_skb_tso_segs()
    unconditionally, since a decrease in MSS would mean that a
    single-packet skb should now be sliced into multiple segments.

    Signed-off-by: Neal Cardwell
    Acked-by: Nandita Dukkipati
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Neal Cardwell
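
The arithmetic behind the segment-count problem can be worked through with a rounded-up division. The helper `demo_segs` and the byte counts used below are hypothetical; the real code goes through `tcp_set_skb_tso_segs()` as mentioned in the message.

```c
/* Segments needed to carry len bytes at the given MSS, rounded up. */
static unsigned int demo_segs(unsigned int len, unsigned int mss)
{
    return (len + mss - 1) / mss;
}
```

Take an skb of 3000 bytes sent with an MSS of 1500, which is 2 segments. After 100 bytes are trimmed from the head, recomputing with a reduced connection MSS of 1000 gives `demo_segs(2900, 1000)`, which is 3, so the count goes up. Recomputing with the skb's own MSS gives `demo_segs(2900, 1500)`, which is 2, so the count cannot increase.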
     
  • sysctl_tcp_mem() initialization was moved to sysctl_tcp_ipv4.c
    in commit 3dc43e3e4d0b52197d3205214fe8f162f9e0c334, since it
    became a per-ns value.

    That code, however, will never run when CONFIG_SYSCTL is
    disabled, leading to bogus values on those fields - causing hung
    TCP sockets.

    This patch fixes it by keeping initialization code in
    tcp_init(). It will be overwritten by the first net namespace
    init if CONFIG_SYSCTL is compiled in, and do the right thing if
    it is compiled out.

    It is also named properly as tcp_init_mem(), to properly signal
    its non-sysctl side effect on TCP limits.

    Reported-by: Ingo Molnar
    Signed-off-by: Glauber Costa
    Cc: David S. Miller
    Link: http://lkml.kernel.org/r/4F22D05A.8030604@parallels.com
    [ renamed the function, tidied up the changelog a bit ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • NFS client bugfixes for Linux 3.3 (pull 3)

    * tag 'nfs-for-3.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    SUNRPC: Fix machine creds in generic_create_cred and generic_match

    Linus Torvalds
     

28 Jan, 2012

2 commits

  • caif is a subsystem and as such it needs to register with
    register_pernet_subsys instead of register_pernet_device.

    Among other problems using register_pernet_device was resulting in
    net_generic being called before the caif_net structure was allocated,
    which has been causing net_generic to fail, either with BUG_ONs or by
    returning NULL pointers.

    An uglier problem that could be caused is packets in flight while the
    subsystem is shutting down.

    To remove confusion, also remove the cruft caused by inappropriately
    trying to fix this bug.

    With the aid of the previous patch I have tested this patch and
    confirmed that using register_pernet_subsys makes the failure go away as
    it should.

    Signed-off-by: Eric W. Biederman
    Acked-by: Sjur Brændeland
    Tested-by: Sasha Levin
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • David S. Miller
     

27 Jan, 2012

3 commits

  • Setting link parameters on a netdevice changes the value
    of if_nlmsg_size(), therefore it is necessary to recalculate
    min_ifinfo_dump_size.

    Signed-off-by: Stefan Gula
    Signed-off-by: David S. Miller

    Stefan Gula
     
  • Tunnel devices set NETIF_F_LLTX to bypass HARD_TX_LOCK. Sit and
    ipip set this unconditionally in ops->setup, but gre enables it
    conditionally after parameter passing in ops->newlink. This is
    not called during tunnel setup as below, however, so GRE tunnels are
    still taking the lock.

    modprobe ip_gre
    ip tunnel add test0 mode gre remote 10.5.1.1 dev lo
    ip link set test0 up
    ip addr add 10.6.0.1 dev test0
    # cat /sys/class/net/test0/features
    # $DIR/test_tunnel_xmit 10 10.5.2.1
    ip route add 10.5.2.0/24 dev test0
    ip tunnel del test0

    The newlink callback is only called in rtnl_netlink, and only if
    the device is new, as it calls register_netdevice internally. Gre
    tunnels are created at 'ip tunnel add' with ioctl SIOCADDTUNNEL,
    which calls ipgre_tunnel_locate, which calls register_netdev.
    rtnl_newlink is called at 'ip link set', but skips ops->newlink
    and the device is up with locking still enabled. The equivalent
    ipip tunnel works fine, btw (just substitute 'mode ipip' for
    'mode gre').

    On kernels before /sys/class/net/*/features was removed [1],
    the first commented out line returns 0x6000 with mode gre,
    which indicates that NETIF_F_LLTX (0x1000) is not set. With ipip,
    it reports 0x7000. This test cannot be used on recent kernels where
    the sysfs file is removed (and ETHTOOL_GFEATURES does not currently
    work for tunnel devices, because they lack dev->ethtool_ops).

    The second commented out line calls a simple transmission test [2]
    that sends on 24 cores at maximum rate. Results of a single run:

    ipip: 19,372,306
    gre before patch: 4,839,753
    gre after patch: 19,133,873

    This patch replicates the condition check in ipgre_newlink to
    ipgre_tunnel_locate. It works for me, both with oseq on and off.
    This is the first time I looked at rtnetlink and iproute2 code,
    though, so someone more knowledgeable should probably check the
    patch. Thanks.

    The tail of both functions is now identical, by the way. To avoid
    code duplication, I'll be happy to rework this and merge the two.

    [1] http://patchwork.ozlabs.org/patch/104610/
    [2] http://kernel.googlecode.com/files/xmit_udp_parallel.c

    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
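
The feature values quoted in the message can be checked with a simple mask test. The value 0x1000 for NETIF_F_LLTX and the readings 0x6000 and 0x7000 come from the text above; the macro and function names are hypothetical.

```c
#include <stdbool.h>

/* Bit value for NETIF_F_LLTX as quoted in the commit message. */
#define DEMO_NETIF_F_LLTX 0x1000u

/* True when the lockless-transmit bit is present in a features word. */
static bool demo_has_lltx(unsigned int features)
{
    return (features & DEMO_NETIF_F_LLTX) != 0;
}
```

Applied to the reported readings, `demo_has_lltx(0x6000)` is false (the gre case before the patch) and `demo_has_lltx(0x7000)` is true (the ipip case).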
     
  • When a new net namespace is created, we should attach to it a "struct
    net_generic" with enough slots (even empty), or we can hit the following
    BUG_ON() :

    [ 200.752016] kernel BUG at include/net/netns/generic.h:40!
    ...
    [ 200.752016] [] ? get_cfcnfg+0x3a/0x180
    [ 200.752016] [] ? lockdep_rtnl_is_held+0x10/0x20
    [ 200.752016] [] caif_device_notify+0x2e/0x530
    [ 200.752016] [] notifier_call_chain+0x67/0x110
    [ 200.752016] [] raw_notifier_call_chain+0x11/0x20
    [ 200.752016] [] call_netdevice_notifiers+0x32/0x60
    [ 200.752016] [] register_netdevice+0x196/0x300
    [ 200.752016] [] register_netdev+0x19/0x30
    [ 200.752016] [] loopback_net_init+0x4a/0xa0
    [ 200.752016] [] ops_init+0x42/0x180
    [ 200.752016] [] setup_net+0x6b/0x100
    [ 200.752016] [] copy_net_ns+0x86/0x110
    [ 200.752016] [] create_new_namespaces+0xd9/0x190

    net_alloc_generic() should take into account the maximum index into the
    ptr array, as a subsystem might use net_generic() anytime.

    This also reduces number of reallocations in net_assign_generic()

    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Signed-off-by: Eric Dumazet
    Cc: Sjur Brændeland
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jan, 2012

1 commit

  • Port autoselection finds a port and then drops the lock,
    then right after that, gets the hash bucket again and locks it.

    Fix it to go direct.

    Signed-off-by: Flavio Leitner
    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Flavio Leitner