29 Sep, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
    tcp: Fix >4GB writes on 64-bit.
    net/9p: Mount only matching virtio channels
    de2104x: fix ethtool
    tproxy: check for transparent flag in ip_route_newports
    ipv6: add IPv6 to neighbour table overflow warning
    tcp: fix TSO FACK loss marking in tcp_mark_head_lost
    3c59x: fix regression from patch "Add ethtool WOL support"
    ipv6: add a missing unregister_pernet_subsys call
    s390: use free_netdev(netdev) instead of kfree()
    sgiseeq: use free_netdev(netdev) instead of kfree()
    rionet: use free_netdev(netdev) instead of kfree()
    ibm_newemac: use free_netdev(netdev) instead of kfree()
    smsc911x: Add MODULE_ALIAS()
    net: reset skb queue mapping when rx'ing over tunnel
    br2684: fix scheduling while atomic
    de2104x: fix TP link detection
    de2104x: fix power management
    de2104x: disable autonegotiation on broken hardware
    net: fix a lockdep splat
    e1000e: 82579 do not gate auto config of PHY by hardware during nominal use
    ...

    Linus Torvalds
     

28 Sep, 2010

4 commits

  • Fixes kernel bugzilla #16603

    tcp_sendmsg() truncates iov_len to an 'int', which causes a 4GB write to
    write zero bytes, for example.

    There is also the problem higher up of how verify_iovec() works. It
    wants to prevent the total length from looking like an error return
    value.

    However it does this using 'int', but syscalls return 'long' (and
    thus signed 64-bit on 64-bit machines). So it could trigger
    false-positives on 64-bit as written. So fix it to use 'long'.

    Reported-by: Olaf Bonorden
    Reported-by: Daniel Büse
    Reported-by: Andrew Morton
    Signed-off-by: David S. Miller

    David S. Miller
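
The two width problems described in this entry can be sketched in plain userspace C. This is only an illustration under stated assumptions: the function names below are hypothetical and are not taken from the kernel patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low 32 bits of a length: what survives when a 64-bit iov_len is
 * squeezed into a 32-bit 'int'. */
uint32_t len_low32(uint64_t iov_len)
{
    return (uint32_t)iov_len;
}

/* Does the total look negative (like an error return) when viewed as a
 * 32-bit signed value? Models a check done with 'int'. */
bool looks_like_error_int32(uint64_t total)
{
    return ((uint32_t)total >> 31) != 0;
}

/* Same question for a 64-bit signed value, i.e. 'long' on 64-bit
 * machines. */
bool looks_like_error_int64(uint64_t total)
{
    return (total >> 63) != 0;
}
```

A 4GB length has all-zero low 32 bits, so it truncates to zero. A 3GB total has bit 31 set, so a 32-bit check treats it as an error even though it is a valid positive 64-bit value.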
     
  • p9_virtio_create will only compare the channel's tag characters
    against the device name till the end of the channel's tag but not till
    the end of the device name. This means that if a user defines channels
    with the tags foo and foobar then he would mount foo when he requested
    foonot and may mount foo when he requested foobar.

    Thus it is necessary to check both string lengths against each other in
    case of a successful partial string match.

    Signed-off-by: Sven Eckelmann
    Signed-off-by: David S. Miller

    Sven Eckelmann
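
The difference between the partial match and the length-checked match can be sketched as follows. The helper names are hypothetical; the channel tag is assumed to be a byte string with an explicit length, and the requested device name a NUL-terminated string.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Old behaviour: only the first tag_len characters are compared, so a
 * tag matches any device name that merely starts with it. */
bool tag_matches_prefix_only(const char *tag, size_t tag_len,
                             const char *devname)
{
    return strncmp(tag, devname, tag_len) == 0;
}

/* Checked behaviour: after a successful partial match, the two lengths
 * must also agree. */
bool tag_matches_exact(const char *tag, size_t tag_len,
                       const char *devname)
{
    return strncmp(tag, devname, tag_len) == 0 &&
           strlen(devname) == tag_len;
}
```

With the tags foo and foobar defined, a request for foobar matches the channel foo under the first function but not under the second.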
     
  • IPv4 and IPv6 have separate neighbour tables, so
    the warning messages should be distinguishable.

    [ Add a suitable message prefix on the ipv4 side as well -DaveM ]

    Signed-off-by: Ulrich Weber
    Signed-off-by: David S. Miller

    Ulrich Weber
     
  • When TCP uses FACK algorithm to mark lost packets in
    tcp_mark_head_lost(), if the number of packets in the (TSO) skb is
    greater than the number of packets that should be marked lost, TCP
    incorrectly exits the loop and marks no packets lost in the skb. This
    underestimates tp->lost_out and affects the recovery/retransmission.
    This patch fragments the skb and marks the correct number of packets
    lost.

    Signed-off-by: Yuchung Cheng
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Yuchung Cheng
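
A much simplified model of the loop behaviour described in this entry is given below. It is not the kernel code: skbs are reduced to an array of packet counts, fragmenting is modelled by taking only part of a count, and both function names are hypothetical.

```c
/* pcount[i] is the number of packets carried by skb i (more than one
 * for a TSO skb). Returns how many packets end up marked lost. */
int mark_head_lost_old(const int *pcount, int n_skbs, int packets)
{
    int cnt = 0;

    for (int i = 0; i < n_skbs; i++) {
        if (cnt + pcount[i] > packets)
            break;      /* skb too large: nothing in it is marked */
        cnt += pcount[i];
    }
    return cnt;
}

/* Models the fix: when an skb carries more packets than remain to be
 * marked, only the remaining number is taken from it, which stands in
 * for fragmenting the skb. */
int mark_head_lost_new(const int *pcount, int n_skbs, int packets)
{
    int cnt = 0;

    for (int i = 0; i < n_skbs && cnt < packets; i++) {
        int room = packets - cnt;
        int take = pcount[i] < room ? pcount[i] : room;

        cnt += take;
    }
    return cnt;
}
```

For a single skb holding 8 packets with 3 to be marked, the old loop marks none while the new one marks 3.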
     

27 Sep, 2010

3 commits

  • Return -ENOMEM when kmalloc fails, and fix memory leaks on the error return paths.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Eric Van Hensbergen

    Davidlohr Bueso
     
  • Clean up a missing exit path in the ipv6 module init routines. In
    addrconf_init we call ipv6_addr_label_init which calls register_pernet_subsys
    for the ipv6_addr_label_ops structure. But if module loading fails, or if the
    ipv6 module is removed, there is no corresponding unregister_pernet_subsys call,
    which leaves a now-bogus address on the pernet_list, leading to oopses in
    subsequent registrations. This patch cleans up both the failed load path and
    the unload path. Tested by myself with good results.

    Signed-off-by: Neil Horman

    include/net/addrconf.h | 1 +
    net/ipv6/addrconf.c | 11 ++++++++---
    net/ipv6/addrlabel.c | 5 +++++
    3 files changed, 14 insertions(+), 3 deletions(-)
    Signed-off-by: David S. Miller

    Neil Horman
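
The register/unregister symmetry described in this entry follows the usual init-time unwinding pattern. A minimal userspace sketch is shown here, with a counter standing in for the pernet list; every name in it is hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

int registered_count;   /* subsystems currently on the (modelled) list */

int register_subsys(void)
{
    registered_count++;
    return 0;
}

void unregister_subsys(void)
{
    registered_count--;
}

/* Init path: if a later step fails, undo the earlier registration so
 * that nothing stale is left behind. */
int module_init_sketch(bool later_step_fails)
{
    int err = register_subsys();

    if (err)
        return err;
    if (later_step_fails) {
        err = -ENOMEM;
        goto out_unregister;
    }
    return 0;

out_unregister:
    unregister_subsys();
    return err;
}

/* Unload path: mirror the registration done at init time. */
void module_exit_sketch(void)
{
    unregister_subsys();
}
```

Both the failed-load path and the unload path leave the counter at zero, which is the property the patch restores.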
     
  • You can't call atomic_notifier_chain_unregister() while in atomic context.

    Fix this by calling un/register_atmdevice_notifier in the module __init and __exit functions.

    Bug report:
    http://comments.gmane.org/gmane.linux.network/172603

    Reported-by: Mikko Vinni
    Tested-by: Mikko Vinni
    Signed-off-by: Karl Hiramoto
    Signed-off-by: David S. Miller

    Karl Hiramoto
     

25 Sep, 2010

1 commit

  • We have for each socket :

    One spinlock (sk_slock.slock)
    One rwlock (sk_callback_lock)

    Possible scenarios are :

    (A) (this is used in net/sunrpc/xprtsock.c)
    read_lock(&sk->sk_callback_lock) (without blocking BH)

    spin_lock(&sk->sk_slock.slock);
    ...
    read_lock(&sk->sk_callback_lock);
    ...

    (B)
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)

    (C)
    spin_lock_bh(&sk->sk_slock)
    ...
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)
    spin_unlock_bh(&sk->sk_slock)

    This (C) case conflicts with (A) :

    CPU1 [A]                        CPU2 [C]
    read_lock(callback_lock)
                                    spin_lock_bh(slock)

    We have one problematic (C) use case in inet_csk_listen_stop() :

    local_bh_disable();
    bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
    WARN_ON(sock_owned_by_user(child));
    ...
    sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)

    lockdep is not happy with this, as reported by Tetsuo Handa.

    It seems the only way to deal with this is to use read_lock_bh(callbacklock)
    everywhere.

    Thanks to Jarek for pointing out a bug in my first attempt and suggesting
    this solution.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Sep, 2010

7 commits

  • otherwise ECT(1) bit will get interpreted as RTO_ONLINK
    and routing will fail with XfrmOutBundleGenError.

    Signed-off-by: Ulrich Weber
    Signed-off-by: David S. Miller

    Ulrich Weber
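
Why an ECN bit can be mistaken for a routing flag is shown in the small sketch below. The constant values are assumptions for illustration (ECT(1) and RTO_ONLINK both occupying bit 0x01, ECN using the low two bits of the TOS byte), and the helper name is hypothetical; the mask used by the actual patch may differ.

```c
#include <stdint.h>

#define SKETCH_RTO_ONLINK 0x01u  /* assumed flag bit carried in the tos field */
#define SKETCH_ECN_MASK   0x03u  /* assumed: ECN occupies the low two bits */

/* Clear the ECN bits before the value is used for a route lookup. */
uint8_t strip_ecn_bits(uint8_t tos)
{
    return (uint8_t)(tos & ~SKETCH_ECN_MASK);
}
```

A TOS byte of 0x01 (ECT(1) set) has the same bit pattern as the on-link flag; after stripping, the flag bit is clear and the remaining TOS bits are unchanged.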
     
  • We need to check for the proper socket type within the
    ipv4_conntrack_defrag function before referencing the nodefrag flag.

    For example, the tun driver receive path produces skbs with the
    AF_UNSPEC socket type, so the current code causes unwanted
    fragmented packets to go out.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jiri Olsa
     
  • Fix checksum calculation in nf_nat_snmp_basic.

    Based on patches by Clark Wang and Stephen Hemminger.

    https://bugzilla.kernel.org/show_bug.cgi?id=17622

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • As soon as rcu_read_unlock() is called, there is no guarantee that the
    current thread can safely dereference the RCU-protected t pointer.

    The fix is to copy t->alloc_size into a temporary variable.

    Signed-off-by: Eric Dumazet
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ip_route_me_harder can't create the route cache when the outdev is the same
    as the indev for skbs without a valid protocol set.

    The __mkroute_input function has this check:
        if (skb->protocol != htons(ETH_P_IP)) {
            /* Not IP (i.e. ARP). Do not create route, if it is
             * invalid for proxy arp. DNAT routes are always valid.
             *
             * Proxy arp feature have been extended to allow, ARP
             * replies back to the same interface, to support
             * Private VLAN switch technologies. See arp.c.
             */
            if (out_dev == in_dev &&
                IN_DEV_PROXY_ARP_PVLAN(in_dev) == 0) {
                err = -EINVAL;
                goto cleanup;
            }
        }

    This patch gives the new skb a valid protocol to bypass this check. In order
    to make ipt_REJECT work with bridges, you also need to enable ip_forward.

    This patch also fixes a regression. When we used skb_copy_expand(), we
    didn't have this issue stated above, as the protocol was properly set.

    Signed-off-by: Changli Gao
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Changli Gao
     
  • I initially noticed this because of the compiler warning below, but it
    does seem to be a valid concern in the case where ct_sip_get_header()
    returns 0 in the first iteration of the while loop.

    net/netfilter/nf_conntrack_sip.c: In function 'sip_help_tcp':
    net/netfilter/nf_conntrack_sip.c:1379: warning: 'ret' may be used uninitialized in this function

    Signed-off-by: Simon Horman
    [Patrick: changed NF_DROP to NF_ACCEPT]
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Simon Horman
     
  • The transparent field of a socket is either inet_twsk(sk)->tw_transparent
    for timewait sockets, or inet_sk(sk)->transparent for other sockets
    (TCP/UDP).

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2010

2 commits

  • Special care should be taken when slow path is hit in ip_fragment() :

    When walking through frags, we transfer truesize ownership from skb to
    frags. Then if we hit a slow_path condition, we must undo this or risk
    uncharging frags->truesize twice and, in the end, having a negative socket
    sk_wmem_alloc counter, or even freeing the socket sooner than expected.

    Many thanks to Nick Bowler, who provided a very clean bug report and
    test program.

    Thanks to Jarek for reviewing my first patch and providing a V2.

    While Nick's bisection pointed to commit 2b85a34e911 (net: No more
    expensive sock_hold()/sock_put() on each tx), the underlying bug is older
    (2.6.12-rc5).

    A side effect is to extend work done in commit b2722b1c3a893e
    (ip_fragment: also adjust skb->truesize for packets not owned by a
    socket) to ipv6 as well.

    Reported-and-bisected-by: Nick Bowler
    Tested-by: Nick Bowler
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    CC: Patrick McHardy
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • David S. Miller
     

21 Sep, 2010

5 commits

  • If a RST comes in immediately after checking sk->sk_err, tcp_poll will
    return POLLIN but not POLLOUT. Fix this by checking sk->sk_err at the end
    of tcp_poll. Additionally, ensure the correct order of operations on SMP
    machines with memory barriers.

    Signed-off-by: Tom Marshall
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Tom Marshall
     
  • Just use explicit casts, since we really can't change the
    types of structures exported to userspace which have been
    around for 15 years or so.

    Reported-by: Dan Rosenberg
    Signed-off-by: David S. Miller

    David S. Miller
     
  • The family parameter of xfrm_state_find is used to find a state matching a
    certain policy. This value is set to the template's family
    (encap_family) right before xfrm_state_find is called.
    The family parameter is however also used to construct a temporary state
    in xfrm_state_find itself which is wrong for inter-family scenarios
    because it produces a selector for the wrong family. Since this selector
    is included in the xfrm_user_acquire structure, user space programs
    misinterpret IPv6 addresses as IPv4 and vice versa.
    This patch splits up the original init_tempsel function into a part that
    initializes the selector and a part that initializes the props and id of
    the temporary state, to allow for differing IP address families within the state.

    Signed-off-by: Thomas Egerer
    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Thomas Egerer
     
  • When a driver doesn't fill the entire buffer, old
    heap contents may remain, and if it also doesn't
    update the length properly, this old heap content
    will be copied back to userspace.

    It is very unlikely that this happens in any of
    the drivers using private ioctls since it would
    show up as junk being reported by iwpriv, but it
    seems better to be safe here, so use kzalloc.

    Reported-by: Jeff Mahoney
    Cc: stable@kernel.org
    Signed-off-by: Johannes Berg
    Signed-off-by: John W. Linville

    Johannes Berg
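
The effect of a zeroing allocation can be sketched in userspace with calloc() standing in for kzalloc(). The driver handler below is hypothetical and deliberately fills only part of the buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver handler: fills at most the first 4 bytes and
 * leaves the rest of the buffer untouched. */
size_t driver_fill(char *buf, size_t len)
{
    size_t n = len < 4 ? len : 4;

    memset(buf, 'A', n);
    return n;
}

/* Allocate the reply buffer zeroed, so that bytes the handler never
 * wrote cannot carry old heap contents. The caller frees the result. */
char *alloc_reply(size_t len)
{
    char *buf = calloc(1, len);

    if (buf)
        driver_fill(buf, len);
    return buf;
}
```

Even if the whole buffer is later copied out, the unwritten tail contains only zero bytes.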
     
  • ipv6 can be a module, so we should test both CONFIG_IPV6 and
    CONFIG_IPV6_MODULE to enable ipv6 bits in ip_gre.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
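
The preprocessor test described in this entry can be sketched as follows. The sketch assumes a configuration where ipv6 is built as a module, which it models by defining CONFIG_IPV6_MODULE by hand; the SKETCH_ macro and both function names are hypothetical.

```c
/* Assumption for this sketch: ipv6 is configured as a module, so only
 * the _MODULE symbol is defined. */
#define CONFIG_IPV6_MODULE 1

#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
#define SKETCH_HAVE_IPV6 1
#else
#define SKETCH_HAVE_IPV6 0
#endif

/* Test covering both the built-in and the modular case. */
int ipv6_bits_enabled(void)
{
    return SKETCH_HAVE_IPV6;
}

/* Test covering only the built-in case, which misses a modular ipv6. */
int ipv6_bits_enabled_builtin_only(void)
{
#ifdef CONFIG_IPV6
    return 1;
#else
    return 0;
#endif
}
```

With ipv6 built as a module, only the combined test enables the ipv6 code.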
     

20 Sep, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits)
    dca: disable dca on IOAT ver.3.0 multiple-IOH platforms
    netpoll: Disable IRQ around RCU dereference in netpoll_rx
    sctp: Do not reset the packet during sctp_packet_config().
    net/llc: storing negative error codes in unsigned short
    MAINTAINERS: move atlx discussions to netdev
    drivers/net/cxgb3/cxgb3_main.c: prevent reading uninitialized stack memory
    drivers/net/eql.c: prevent reading uninitialized stack memory
    drivers/net/usb/hso.c: prevent reading uninitialized memory
    xfrm: dont assume rcu_read_lock in xfrm_output_one()
    r8169: Handle rxfifo errors on 8168 chips
    3c59x: Remove atomic context inside vortex_{set|get}_wol
    tcp: Prevent overzealous packetization by SWS logic.
    net: RPS needs to depend upon USE_GENERIC_SMP_HELPERS
    phylib: fix PAL state machine restart on resume
    net: use rcu_barrier() in rollback_registered_many
    bonding: correctly process non-linear skbs
    ipv4: enable getsockopt() for IP_NODEFRAG
    ipv4: force_igmp_version ignored when a IGMPv3 query received
    ppp: potential NULL dereference in ppp_mp_explode()
    net/llc: make opt unsigned in llc_ui_setsockopt()
    ...

    Linus Torvalds
     

18 Sep, 2010

1 commit

  • sctp_packet_config() is called when getting the packet ready
    for appending of chunks. The function should not touch the
    current state, since it's possible to ping-pong between two
    transports when sending, and that can result in packet corruption
    followed by an skb overflow crash.

    Reported-by: Thomas Dreibholz
    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

17 Sep, 2010

2 commits


15 Sep, 2010

3 commits

  • You cannot invoke __smp_call_function_single() unless the
    architecture sets this symbol.

    Reported-by: Daniel Hellstrom
    Signed-off-by: David S. Miller

    David S. Miller
     
  • * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies
    statfs() gives ESTALE error
    NFS: Fix a typo in nfs_sockaddr_match_ipaddr6
    sunrpc: increase MAX_HASHTABLE_BITS to 14
    gss:spkm3 miss returning error to caller when import security context
    gss:krb5 miss returning error to caller when import security context
    Remove incorrect do_vfs_lock message
    SUNRPC: cleanup state-machine ordering
    SUNRPC: Fix a race in rpc_info_open
    SUNRPC: Fix race corrupting rpc upcall
    Fix null dereference in call_allocate

    Linus Torvalds
     
  • netdev_wait_allrefs() waits until all references to a device vanish.

    It currently uses a _very_ pessimistic 250 ms delay between each probe.
    Some users reported that no more than 4 devices can be dismantled per
    second; this is a pretty serious problem for some setups.

    Most of the time, a refcount is about to be released by an RCU callback
    that is still in flight because rollback_registered_many() uses a
    synchronize_rcu() call instead of rcu_barrier(). The problem is visible if
    the number of online CPUs is one, because synchronize_rcu() is then a no-op.

    time to remove 50 ipip tunnels on a UP machine :

    before patch : real 11.910s
    after patch : real 1.250s

    Reported-by: Nicolas Dichtel
    Reported-by: Octavian Purdila
    Reported-by: Benjamin LaHaise
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Sep, 2010

3 commits

  • While integrating your man-pages patch for IP_NODEFRAG, I noticed
    that this option is settable by setsockopt(), but not gettable by
    getsockopt(). I suppose this is not intended. The (untested,
    trivial) patch below adds getsockopt() support.

    Signed-off-by: Michael Kerrisk
    Acked-by: Jiri Olsa
    Signed-off-by: David S. Miller

    Michael Kerrisk
     
  • After all these years, it turns out that the
    /proc/sys/net/ipv4/conf/*/force_igmp_version
    parameter isn't fully implemented.

    *Symptom*:
    When force_igmp_version is set to a value of 2, the kernel should only perform
    multicast IGMPv2 operations (IETF rfc2236). A host-initiated Join message
    will be sent as an IGMPv2 Join message. But if an IGMPv3 query message is
    received, the host responds with an IGMPv3 join message. Per rfc3376 and
    rfc2236, an IGMPv2 host should treat an IGMPv3 query as an IGMPv2 query and
    respond with an IGMPv2 Join message.

    *Consequences*:
    This is an issue when an IGMPv3 capable switch is the querier and will only
    issue IGMPv3 queries (which double as IGMPv2 queries) and there's an
    intermediate switch that is only IGMPv2 capable. The intermediate switch
    processes the initial v2 Join, but fails to recognize the IGMPv3 Join responses
    to the Query, resulting in a dropped connection when the intermediate v2-only
    switch times it out.

    *Identifying issue in the kernel source*:
    The issue is in this section of code (in net/ipv4/igmp.c), which is called when
    an IGMP query is received (from mainline 2.6.36-rc3 gitweb):
    ...
    An IGMPv3 query has a length >= 12 and no sources. This routine will exit after
    line 880, setting the general query timer (random timeout between 0 and query
    response time). This calls igmp_gq_timer_expire():
    ...
    .. which only sends a v3 response. So if a v3 query is received, the kernel
    always sends a v3 response.

    IGMP queries happen once every 60 sec (per vlan), so the traffic is low. An
    IGMPv3 query *is* a strict superset of an IGMPv2 query, so this patch properly
    short-circuits the v3 behaviour.

    One issue is that this does not address force_igmp_version=1. Then again, I've
    never seen any IGMPv1 multicast equipment in the wild. However there is a lot
    of v2-only equipment. If it's necessary to support the IGMPv1 case as well:

    837 if (len == 8 || IGMP_V2_SEEN(in_dev) || IGMP_V1_SEEN(in_dev)) {

    Signed-off-by: David S. Miller

    Bob Arendt
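
The decision described in this entry can be modelled in a few lines. This is a simplification: the function name is hypothetical, and the forced-version test stands in for the IGMP_V2_SEEN(in_dev) condition, which in the kernel may cover more than the sysctl alone.

```c
#include <stdbool.h>
#include <stddef.h>

/* Should an incoming query be handled as an IGMPv2 query?
 * len is the IGMP message length; force_igmp_version is the sysctl
 * value (0 means no forcing). */
bool handle_query_as_v2(size_t len, int force_igmp_version)
{
    bool v2_forced = (force_igmp_version == 2);

    /* A length of 8 is a genuine v2 query. A longer (v3) query is
     * still answered as v2 when version 2 is forced. */
    return len == 8 || v2_forced;
}
```

With version 2 forced, a 12-byte v3 query takes the v2 path instead of arming the v3 general query timer.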
     
  • The members of struct llc_sock are unsigned so if we pass a negative
    value for "opt" it can cause a sign bug. Also it can cause an integer
    overflow when we multiply "opt * HZ".

    CC: stable@kernel.org
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller

    Dan Carpenter
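
The sign problem described in this entry can be reproduced in a small sketch. The upper bound of 100 and the function names are assumptions made for illustration; the field is modelled as an 8-bit unsigned value.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_RETRY 100   /* assumed upper bound for the option */

/* Signed option value: -1 passes the upper-bound check and is then
 * stored into an unsigned field as 255. */
bool set_retry_signed(int opt, uint8_t *field)
{
    if (opt > SKETCH_MAX_RETRY)
        return false;
    *field = (uint8_t)opt;
    return true;
}

/* Unsigned option value: the same bit pattern is a huge number, so
 * the upper-bound check rejects it. */
bool set_retry_unsigned(unsigned int opt, uint8_t *field)
{
    if (opt > SKETCH_MAX_RETRY)
        return false;
    *field = (uint8_t)opt;
    return true;
}
```

Making the option unsigned lets the single upper-bound comparison also reject values that were negative on the caller's side.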
     

13 Sep, 2010

7 commits

  • Four memory leak fixes in the 9P code.

    Signed-off-by: Latchesar Ionkov
    Signed-off-by: Eric Van Hensbergen

    Latchesar Ionkov
     
  • The maximum size of the authcache is now set to 1024 (10 bits),
    but on our server we need at least 4096 (12 bits). Increase
    MAX_HASHTABLE_BITS to 14. This is a maximum of 16384 entries,
    each containing a pointer (8 bytes on x86_64). This is
    exactly the limit of kmalloc() (128K).

    Signed-off-by: Miquel van Smoorenburg
    Signed-off-by: Trond Myklebust

    Miquel van Smoorenburg
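
The size arithmetic in this entry can be checked with a one-line helper (hypothetical name; the pointer size is passed in rather than taken from the build machine):

```c
#include <stddef.h>

/* Bytes needed for a hash table of 2^bits pointer-sized buckets. */
size_t hashtable_bytes(unsigned int bits, size_t pointer_size)
{
    return ((size_t)1 << bits) * pointer_size;
}
```

For 14 bits and 8-byte pointers this gives 16384 * 8 = 131072 bytes, i.e. 128K; for the previous 10 bits it gives 8192 bytes.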
     
  • spkm3 fails to return an error to the upper layer when importing a security
    context; it may return OK even though the import has failed.

    Signed-off-by: Bian Naimeng
    Signed-off-by: Trond Myklebust

    Bian Naimeng
     
  • krb5 fails to return an error to the upper layer when importing a security
    context; it may return OK even though the import has failed.

    Signed-off-by: Bian Naimeng
    Signed-off-by: Trond Myklebust

    Bian Naimeng
     
  • This is just a minor cleanup: net/sunrpc/clnt.c clarifies the rpc client
    state machine by commenting each state and by laying out the functions
    implementing each state in the order that each state is normally
    executed (in the absence of errors).

    The previous patch "Fix null dereference in call_allocate" changed the
    order of the states. Move the functions and update the comments to
    reflect the change.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
     
  • There is a race between rpc_info_open and rpc_release_client()
    in that nothing stops a process from opening the file after
    the clnt->cl_kref goes to zero.

    Fix this by using atomic_inc_not_zero()...

    Reported-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust
    Cc: stable@kernel.org

    Trond Myklebust
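
The idea behind the fix, taking a reference only if the count has not already dropped to zero, can be sketched with C11 atomics. This is a userspace illustration, not the kernel primitive; the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment the reference count unless it is already zero.
 * Returns true if a reference was taken. */
bool ref_get_unless_zero(atomic_int *refcount)
{
    int old = atomic_load(refcount);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value and
         * the zero check is repeated. */
        if (atomic_compare_exchange_weak(refcount, &old, old + 1))
            return true;
    }
    return false;
}
```

An opener that gets false back must treat the object as already gone instead of using it.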
     
  • If rpc_queue_upcall() adds a new upcall to the rpci->pipe list just
    after rpc_pipe_release calls rpc_purge_list(), but before it calls
    gss_pipe_release (as rpci->ops->release_pipe(inode)), then the latter
    will free a message without deleting it from the rpci->pipe list.

    We will be left with a freed object on the rpci->pipe list. The most
    frequent symptoms are kernel crashes in rpc.gssd system calls on the
    pipe in question.

    Reported-by: J. Bruce Fields
    Signed-off-by: Trond Myklebust
    Cc: stable@kernel.org

    Trond Myklebust