07 Feb, 2013

1 commit

  • There are transients during normal FRTO procedure during which
    the packets_in_flight can go to zero between write_queue state
    updates and firing the resulting segments out. As FRTO processing
    occurs during that window the check must be more precise to
    not match "spuriously" :-). More specifically, e.g., when
    packets_in_flight is zero but FLAG_DATA_ACKED is true the problematic
    branch that sets cwnd to zero would not be taken and new segments
    might be sent out later.

    Signed-off-by: Ilpo Järvinen
    Tested-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

05 Feb, 2013

1 commit

  • This patch updates LINUX_MIB_LISTENDROPS in tcp_v4_conn_request() and
    tcp_v4_err(). tcp_v4_conn_request() in particular can drop SYNs for various
    reasons which are not currently tracked.

    Signed-off-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Vijay Subramanian
     

04 Feb, 2013

2 commits

  • Commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start())
    uncovered a bug in FRTO code :
    tcp_process_frto() is setting snd_cwnd to 0 if the number
    of in flight packets is 0.

    As Neal pointed out, if no packet is in flight we lost our
    chance to disambiguate whether a loss timeout was spurious.

    We should assume it was a proper loss.

    Reported-by: Pasi Kärkkäinen
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Cc: Ilpo Järvinen
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start()),
    a zero snd_cwnd triggers an infinite loop in tcp_slow_start().

    Avoid this infinite loop and log a one time error for further
    analysis. FRTO code is suspected to cause this bug.

    Reported-by: Pasi Kärkkäinen
    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
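
The hang described in the two commits above can be illustrated with a small user-space model. This is a hedged sketch, not the kernel's tcp_slow_start(): the struct and function names are hypothetical, and the loop only mimics the general shape of a credit loop that subtracts snd_cwnd on each pass. With snd_cwnd equal to 0 the subtraction makes no progress, so the sketch refuses to enter the loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified congestion state; not the kernel's struct tcp_sock. */
struct cwnd_state {
    uint32_t snd_cwnd;      /* congestion window, in segments */
    uint32_t snd_cwnd_cnt;  /* accumulated ACK credit */
};

/*
 * Grow the window by one segment for every snd_cwnd units of credit.
 * Returns false without touching the state when snd_cwnd is 0, because
 * subtracting 0 in the loop below would never terminate it.
 */
bool slow_start_step(struct cwnd_state *st, uint32_t credit)
{
    uint32_t delta = 0;

    if (st->snd_cwnd == 0)
        return false;               /* guard against the infinite loop */

    st->snd_cwnd_cnt += credit;
    while (st->snd_cwnd_cnt >= st->snd_cwnd) {
        st->snd_cwnd_cnt -= st->snd_cwnd;
        delta++;
    }
    st->snd_cwnd += delta;
    return true;
}
```

A caller that sees false can log the condition once and fall back to treating the timeout as a real loss, which is the direction the commit messages describe.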
     

01 Feb, 2013

1 commit

  • On receiving the SYN-ACK, Fast Open checks icsk_retransmit for SYN
    retransmission to detect SYN/data drops. But if F-RTO is disabled,
    icsk_retransmit is reset at step D of tcp_fastretrans_alert()
    (under tcp_ack()) before tcp_rcv_fastopen_synack(). The fix is to use
    total_retrans instead which accounts for SYN retransmission regardless
    of the use of F-RTO.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

30 Jan, 2013

1 commit

  • We drop a connection request if the accept backlog is full and there are
    sufficient packets in the syn queue to warrant starting drops. Increment the
    appropriate counters so this isn't silent, for accurate stats and help in
    debugging.

    This patch assumes LINUX_MIB_LISTENDROPS is a superset of/includes the
    counter LINUX_MIB_LISTENOVERFLOWS.

    Signed-off-by: Nivedita Singhvi
    Acked-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Nivedita Singhvi
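
The drop-and-count behaviour described above can be sketched as a toy model. Everything here is an assumption made for illustration: the struct names, the ">= backlog" fullness test and the threshold of one young SYN request are hypothetical and are not taken from the patch. The only property carried over from the commit message is that the drops counter is treated as a superset of the overflows counter, so both are bumped together.

```c
#include <stdbool.h>

/* Hypothetical counters standing in for the two MIB entries. */
struct listen_stats {
    unsigned long listen_overflows;  /* models LINUX_MIB_LISTENOVERFLOWS */
    unsigned long listen_drops;      /* models LINUX_MIB_LISTENDROPS */
};

/* Hypothetical snapshot of a listening socket's queues. */
struct listener {
    unsigned int accept_queue_len;
    unsigned int accept_backlog;
    unsigned int young_syn_requests;
};

/* Assumed threshold for "sufficient packets in the syn queue". */
#define YOUNG_SYN_THRESHOLD 1u

/* Decide whether to drop an incoming SYN, counting the drop if so. */
bool should_drop_syn(const struct listener *l, struct listen_stats *stats)
{
    bool backlog_full = l->accept_queue_len >= l->accept_backlog;

    if (backlog_full && l->young_syn_requests > YOUNG_SYN_THRESHOLD) {
        stats->listen_overflows++;
        stats->listen_drops++;   /* superset counter: bumped on every overflow */
        return true;
    }
    return false;
}
```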
     

28 Jan, 2013

1 commit

  • Due to IP_GRE GSO support, GRE can receive a non-linear skb, which
    results in a panic in case of GRE_CSUM. The following patch fixes it by
    using the correct csum API.

    Bug introduced in commit 6b78f16e4bdde3936b (gre: add GSO support)

    Signed-off-by: Pravin B Shelar
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

23 Jan, 2013

2 commits

  • git commit 9cb3a50c (ipv4: Invalidate the socket cached route on
    pmtu events if possible) introduced a refcount problem. We don't
    get a refcount on the route if we get it from __sk_dst_get(), but
    we need one if we want to reuse this route because __sk_dst_set()
    releases the refcount of the old route. This patch adds proper
    refcount handling for that case. We introduce a 'new' flag to
    indicate that we are going to use a new route and we release the
    old route only if we replace it by a new one.

    Reported-by: Julian Anastasov
    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     
  • Steffen Klassert says:

    ====================
    1) The transport header did not point to the right place after
    esp/ah processing on tunnel mode in the receive path. As a
    result, the ECN field of the inner header was not set correctly,
    fixes from Li RongQing.

    2) We did a null check too late in one of the xfrm_replay advance
    functions. This can lead to a division by zero, fix from
    Nickolai Zeldovich.

    3) The size calculation of the hash table missed the multiplication
    with the actual struct size when the hash table is freed.
    We might call the wrong free function, fix from Michal Kubecek.

    4) On IPsec pmtu events we can't access the transport headers of
    the original packet, so force a relookup for all routes
    to notify about the pmtu event.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Jan, 2013

2 commits

  • This implements a socket release callback function to check
    if the socket cached route became invalid during the time
    we owned the socket. The function is used from udp, raw
    and ping sockets.

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     
  • The route lookup in ipv4_sk_update_pmtu() might return a route
    different from the route we cached at the socket. This is because
    standard routes are per cpu, so each cpu has its own struct rtable.
    This means that we do not invalidate the socket cached route if the
    NET_RX_SOFTIRQ is not served by the same cpu that the sending socket
    uses. As a result, the cached route is reused until we disconnect.

    With this patch we invalidate the socket cached route if possible.
    If the socket is owned by the user, we can't update the cached
    route directly. A followup patch will implement socket release
    callback functions for datagram sockets to handle this case.

    Reported-by: Yurij M. Plotnikov
    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     

21 Jan, 2013

2 commits

  • On IPsec pmtu events we can't access the transport headers of
    the original packet, so we can't find the socket that sent
    the packet. The only chance to notify the socket about the
    pmtu change is to force a relookup for all routes. This
    patch implements this for the IPsec protocols.

    Signed-off-by: Steffen Klassert

    Steffen Klassert
     
  • commit 563d34d057 (tcp: dont drop MTU reduction indications)
    added an error leading to incorrect accounting of
    LINUX_MIB_LOCKDROPPEDICMPS

    If socket is owned by the user, we want to increment
    this SNMP counter, unless the message is a
    (ICMP_DEST_UNREACH,ICMP_FRAG_NEEDED) one.

    Reported-by: Maciej Żenczykowski
    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Signed-off-by: Maciej Żenczykowski
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jan, 2013

2 commits

  • Routes with locked mtu should not use learned pmtu information,
    so do not update the pmtu on these routes.

    Reported-by: Julian Anastasov
    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
     
  • The output route check was introduced with git commit 261663b0
    (ipv4: Don't use the cached pmtu informations for input routes)
    during times when we cached the pmtu information on the
    inetpeer. Now the pmtu information is back in the routes,
    so this check is obsolete. It also had some unwanted side effects,
    as reported by Timo Teras and Lukas Tribus.

    Signed-off-by: Steffen Klassert
    Acked-by: Timo Teräs
    Signed-off-by: David S. Miller

    Steffen Klassert
     

11 Jan, 2013

3 commits

  • commit c3ae62af8e755 (tcp: should drop incoming frames without ACK flag
    set) added a regression on the handling of RST messages.

    RST should be allowed to come even without ACK bit set. We validate
    the RST by checking the exact sequence, as required by RFC 793 and
    RFC 5961 section 3.2, in tcp_validate_incoming().

    Reported-by: Eric Wong
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Tested-by: Eric Wong
    Signed-off-by: David S. Miller

    Eric Dumazet
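
The RST check mentioned above follows the three-way rule of RFC 5961 section 3.2: an exact sequence match resets the connection, an in-window mismatch triggers a challenge ACK, and anything outside the window is dropped. The sketch below is a stand-alone model of that rule, not the code in tcp_validate_incoming(); the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result codes for an incoming segment with the RST bit set. */
enum rst_action { RST_DROP, RST_RESET, RST_CHALLENGE_ACK };

/*
 * Classify a RST by its sequence number.
 * The unsigned subtraction handles sequence-number wraparound: the result is
 * the distance from rcv_nxt to seg_seq modulo 2^32.
 */
enum rst_action classify_rst(uint32_t seg_seq, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
    if (seg_seq == rcv_nxt)
        return RST_RESET;            /* exact match: reset the connection */
    if ((uint32_t)(seg_seq - rcv_nxt) < rcv_wnd)
        return RST_CHALLENGE_ACK;    /* in window but not exact */
    return RST_DROP;                 /* outside the receive window */
}
```

Note that the ACK bit plays no part in this decision, which is why a RST without ACK can still be validated safely.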
     
  • Under unusual circumstances, TCP collapse can split a big GRO TCP packet
    while it's being used in a splice(socket->pipe) operation.

    skb_splice_bits() releases the socket lock before calling
    splice_to_pipe().

    [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
    [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
    [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13

    To fix this problem, we must eat skbs in tcp_recv_skb().

    Remove the inline keyword from tcp_recv_skb() definition since
    it has three call sites.

    Reported-by: Christian Becker
    Cc: Willy Tarreau
    Signed-off-by: Eric Dumazet
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers)
    added a regression.

    [ 83.843570] INFO: rcu_sched self-detected stall on CPU
    [ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
    [ 83.844582] Task dump for CPU 6:
    [ 83.844584] netperf R running task 0 8966 8952 0x0000000c
    [ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
    [ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
    [ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
    [ 83.844594] Call Trace:
    [ 83.844596] [] ? vprintk_emit+0x1c9/0x4c0
    [ 83.844601] [] ? schedule+0x29/0x70
    [ 83.844606] [] ? tcp_splice_data_recv+0x42/0x50
    [ 83.844610] [] ? tcp_read_sock+0xda/0x260
    [ 83.844613] [] ? tcp_prequeue_process+0xb0/0xb0
    [ 83.844615] [] ? tcp_splice_read+0xc0/0x250
    [ 83.844618] [] ? sock_splice_read+0x22/0x30
    [ 83.844622] [] ? do_splice_to+0x7b/0xa0
    [ 83.844627] [] ? sys_splice+0x59c/0x5d0
    [ 83.844630] [] ? putname+0x2b/0x40
    [ 83.844633] [] ? do_sys_open+0x174/0x1e0
    [ 83.844636] [] ? system_call_fastpath+0x16/0x1b

    If recv_actor() returns 0, we should stop immediately,
    because looping won't give a chance to drain the pipe.

    Signed-off-by: Eric Dumazet
    Cc: Willy Tarreau
    Signed-off-by: David S. Miller

    Eric Dumazet
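
The stall above comes from retrying a consumer that reports no progress. A minimal user-space sketch of the corrected loop shape follows; the callback type, drain_to_actor() and bounded_sink() are hypothetical names invented for this illustration and do not reproduce tcp_read_sock().

```c
#include <stddef.h>

/*
 * Hypothetical callback type: consumes up to len bytes and returns the number
 * consumed, 0 if it cannot take more right now, or a negative error.
 */
typedef int (*recv_actor_t)(void *ctx, const char *data, size_t len);

/* Feed buf to the actor until all data is consumed or the actor stops making progress. */
long drain_to_actor(const char *buf, size_t len, recv_actor_t actor, void *ctx)
{
    size_t copied = 0;

    while (copied < len) {
        int used = actor(ctx, buf + copied, len - copied);

        if (used <= 0) {
            if (copied == 0 && used < 0)
                return used;   /* report the error if nothing was consumed */
            break;             /* 0 means "full": stop instead of spinning */
        }
        copied += (size_t)used;
    }
    return (long)copied;
}

/* Example actor: a sink with limited remaining room (ctx points to a size_t). */
int bounded_sink(void *ctx, const char *data, size_t len)
{
    size_t *room = ctx;
    size_t take = len < *room ? len : *room;

    (void)data;
    if (take > 3)
        take = 3;              /* accept at most 3 bytes per call */
    *room -= take;
    return (int)take;
}
```

Breaking out on a zero return hands control back to the caller, which is the only place where the full destination can be drained.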
     

09 Jan, 2013

1 commit

  • A regression is introduced by the following commit:

    commit 4d52cfbef6266092d535237ba5a4b981458ab171
    Author: Eric Dumazet
    Date: Tue Jun 2 00:42:16 2009 -0700

    net: ipv4/ip_sockglue.c cleanups

    Pure cleanups

    but it is not a pure cleanup...

    - if (val != -1 && (val < 1 || val>255))
    + if (val != -1 && (val < 0 || val > 255))

    Since there is no reason provided to allow ttl=0, change it back.

    Reported-by: nitin padalia
    Cc: nitin padalia
    Cc: Eric Dumazet
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Cong Wang
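
The restored condition can be read as a small predicate. The sketch below inverts the check quoted in the diff above into a "value is acceptable" test; the function name is hypothetical, and treating -1 as a "use the default" sentinel is an assumption drawn from the special-casing in the condition itself.

```c
#include <stdbool.h>

/*
 * Accept -1 (assumed to mean "use the default") or a TTL in the range 1..255.
 * A TTL of 0 is rejected, matching the condition restored by the patch.
 */
bool ip_ttl_option_is_valid(int val)
{
    return val == -1 || (val >= 1 && val <= 255);
}
```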
     

08 Jan, 2013

1 commit

  • IPsec tunnel does not set ECN field to CE in inner header when
    the ECN field in the outer header is CE, and the ECN field in
    the inner header is ECT(0) or ECT(1).

    The cause is that ipip_hdr() does not return the correct address of
    the inner header since skb->transport_header is not the inner header
    after esp_input_done2() or ah_input().

    Signed-off-by: Li RongQing
    Signed-off-by: Steffen Klassert

    Li RongQing
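
The expected result of decapsulation in the case described above can be written down as a pure function on the two TOS bytes. This is a sketch covering only that one case (outer CE, inner ECT(0) or ECT(1)); the function name is hypothetical, the codepoint values are the standard ECN encodings, and a real implementation would also have to fix up the IPv4 header checksum and locate the inner header correctly, which is what the patch is about.

```c
#include <stdint.h>

/* ECN codepoints carried in the two low-order bits of the TOS / traffic class byte. */
enum {
    ECN_NOT_ECT = 0x0,
    ECN_ECT_1   = 0x1,
    ECN_ECT_0   = 0x2,
    ECN_CE      = 0x3,
    ECN_MASK    = 0x3
};

/*
 * Return the inner TOS byte after decapsulation: if the outer header carries
 * CE and the inner header is ECN-capable, the congestion mark is copied
 * inwards. The inner byte is otherwise left unchanged.
 */
uint8_t ecn_decapsulate_tos(uint8_t outer_tos, uint8_t inner_tos)
{
    uint8_t outer = outer_tos & ECN_MASK;
    uint8_t inner = inner_tos & ECN_MASK;

    if (outer == ECN_CE && (inner == ECN_ECT_0 || inner == ECN_ECT_1))
        return (uint8_t)(inner_tos | ECN_CE);
    return inner_tos;
}
```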
     

07 Jan, 2013

1 commit

  • The NULL pointer check `!ifa' should come before its first use.

    [ Bug origin : commit fd23c3b31107e2fc483301ee923d8a1db14e53f4
    (ipv4: Add hash table of interface addresses) in linux-2.6.39 ]

    Signed-off-by: Xi Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Xi Wang
     

05 Jan, 2013

1 commit


29 Dec, 2012

1 commit

  • Pablo Neira Ayuso says:

    ====================
    The following batch contains Netfilter fixes for 3.8-rc1. They are
    a mixture of old bugs that have passed unnoticed (I'll pass these to
    stable) and more fresh ones from the previous merge window, they are:

    * Fix for MAC address in 6in4 tunnels via NFLOG that results in ulogd
    showing the wrong address, from Bob Hockney.

    * Fix a comment in nf_conntrack_ipv6, from Florent Fourcot.

    * Fix a leak in an error path in ctnetlink while creating an expectation,
    from Jesper Juhl.

    * Fix missing ICMP time exceeded in the IPv6 defragmentation code, from
    Haibo Xi.

    * Fix inconsistent handling of routing changes in MASQUERADE for the
    new connections case, from Andrew Collins.

    * Fix a missing skb_reset_transport in ip[6]t_REJECT that leads to
    crashes in the ixgbe driver (since it seems to access the transport
    header with TSO enabled), from Mukund Jampala.

    * Recover obsoleted NOTRACK target by including it into the CT and emit
    a warning via printk about being obsoleted. Many people don't check the
    scheduled-for-removal file under Documentation, so we follow a
    less aggressive approach to kill this in a year or so. Spotted by Florian
    Westphal, patch from myself.

    * Fix race condition in xt_hashlimit that allows two or more
    entries to be created, from myself.

    * Fix crash if the CT is used due to the recently added facilities to
    consult the dying and unconfirmed conntrack lists, from myself.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Dec, 2012

2 commits

  • ipgre_tunnel_xmit() incorrectly sets transport header to inner payload
    instead of GRE header. It seems copy-and-pasted from ipip.c.
    So set transport header to gre header.
    (In ipip case the transport header is the inner ip header, so that's
    correct.)

    Found by inspection. In practice the incorrect transport header
    doesn't matter because the skb usually is sent to another net_device
    or socket, so the transport header isn't referenced.

    Signed-off-by: Isaku Yamahata
    Signed-off-by: David S. Miller

    Isaku Yamahata
     
  • In commit 96e0bf4b5193d (tcp: Discard segments that ack data not yet
    sent) John Dykstra enforced a check against ack sequences.

    In commit 354e4aa391ed5 (tcp: RFC 5961 5.2 Blind Data Injection Attack
    Mitigation) I added more safety tests.

    But we missed the fact that these tests are not performed if ACK bit is
    not set.

    RFC 793 3.9 mandates TCP should drop a frame without ACK flag set.

    " fifth check the ACK field,
    if the ACK bit is off drop the segment and return"

    Not doing so permits an attacker to only guess an acceptable sequence
    number, evading stronger checks.

    Many thanks to Zhiyun Qian for bringing this issue to our attention.

    See :
    http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf

    Reported-by: Zhiyun Qian
    Signed-off-by: Eric Dumazet
    Cc: Nandita Dukkipati
    Cc: Neal Cardwell
    Cc: John Dykstra
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Dec, 2012

1 commit

  • Sedat reported the following commit caused a regression:

    commit 9650388b5c56578fdccc79c57a8c82fb92b8e7f1
    Author: Eric Dumazet
    Date: Fri Dec 21 07:32:10 2012 +0000

    ipv4: arp: fix a lockdep splat in arp_solicit

    This is because the 6th parameter of arp_send() needs to be NULL
    for the broadcast case; the above commit changed it to an all-zero
    array by mistake.

    Reported-by: Sedat Dilek
    Tested-by: Sedat Dilek
    Cc: Sedat Dilek
    Cc: Eric Dumazet
    Cc: David S. Miller
    Cc: Julian Anastasov
    Signed-off-by: Cong Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Cong Wang
     

22 Dec, 2012

3 commits

  • Yan Burman reported following lockdep warning :

    =============================================
    [ INFO: possible recursive locking detected ]
    3.7.0+ #24 Not tainted
    ---------------------------------------------
    swapper/1/0 is trying to acquire lock:
    (&n->lock){++--..}, at: [] __neigh_event_send+0x2e/0x2f0

    but task is already holding lock:
    (&n->lock){++--..}, at: [] arp_solicit+0x1d4/0x280

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&n->lock);
    lock(&n->lock);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    4 locks held by swapper/1/0:
    #0: (((&n->timer))){+.-...}, at: [] call_timer_fn+0x0/0x1c0
    #1: (&n->lock){++--..}, at: [] arp_solicit+0x1d4/0x280
    #2: (rcu_read_lock_bh){.+....}, at: [] dev_queue_xmit+0x0/0x5d0
    #3: (rcu_read_lock_bh){.+....}, at: [] ip_finish_output+0x13e/0x640

    stack backtrace:
    Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
    Call Trace:
    [] validate_chain+0xdcc/0x11f0
    [] ? __lock_acquire+0x440/0xc30
    [] ? kmem_cache_free+0xe5/0x1c0
    [] __lock_acquire+0x440/0xc30
    [] ? inet_getpeer+0x40/0x600
    [] ? __lock_acquire+0x440/0xc30
    [] ? __neigh_event_send+0x2e/0x2f0
    [] lock_acquire+0x95/0x140
    [] ? __neigh_event_send+0x2e/0x2f0
    [] ? __lock_acquire+0x440/0xc30
    [] _raw_write_lock_bh+0x3b/0x50
    [] ? __neigh_event_send+0x2e/0x2f0
    [] __neigh_event_send+0x2e/0x2f0
    [] neigh_resolve_output+0x16b/0x270
    [] ip_finish_output+0x34d/0x640
    [] ? ip_finish_output+0x13e/0x640
    [] ? vxlan_xmit+0x556/0xbec [vxlan]
    [] ip_output+0x80/0xf0
    [] ip_local_out+0x28/0x80
    [] vxlan_xmit+0x66a/0xbec [vxlan]
    [] ? vxlan_xmit+0x556/0xbec [vxlan]
    [] ? skb_gso_segment+0x2b0/0x2b0
    [] ? _raw_spin_unlock_irqrestore+0x65/0x80
    [] ? dev_queue_xmit_nit+0x207/0x270
    [] dev_hard_start_xmit+0x298/0x5d0
    [] dev_queue_xmit+0x2f3/0x5d0
    [] ? dev_hard_start_xmit+0x5d0/0x5d0
    [] arp_xmit+0x58/0x60
    [] arp_send+0x3b/0x40
    [] arp_solicit+0x204/0x280
    [] ? neigh_add+0x310/0x310
    [] neigh_probe+0x45/0x70
    [] neigh_timer_handler+0x1a0/0x2a0
    [] call_timer_fn+0x7f/0x1c0
    [] ? detach_if_pending+0x120/0x120
    [] run_timer_softirq+0x238/0x2b0
    [] ? neigh_add+0x310/0x310
    [] __do_softirq+0x101/0x280
    [] call_softirq+0x1c/0x30
    [] do_softirq+0x85/0xc0
    [] irq_exit+0x9e/0xc0
    [] smp_apic_timer_interrupt+0x68/0xa0
    [] apic_timer_interrupt+0x6f/0x80
    [] ? mwait_idle+0xa4/0x1c0
    [] ? mwait_idle+0x9b/0x1c0
    [] cpu_idle+0x89/0xe0
    [] start_secondary+0x1b2/0x1b6

    Bug is from arp_solicit(), releasing the neigh lock after arp_send().
    In case of vxlan, we eventually need to write lock a neigh lock later.

    It's a false positive, but we can get rid of it without lockdep
    annotations.

    We can instead use neigh_ha_snapshot() helper.

    Reported-by: Yan Burman
    Signed-off-by: Eric Dumazet
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Once skb_realloc_headroom() is called, tiph might point to freed memory.

    Cache tiph->ttl value before the reallocation, to avoid unexpected
    behavior.

    Signed-off-by: Eric Dumazet
    Cc: Isaku Yamahata
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ipgre_tunnel_xmit() parses network header as IP unconditionally.
    But transmitted packets are not always IP packets. For example, such a packet
    can be sent by a packet socket with sockaddr_ll.sll_protocol set.
    So make the function check if skb->protocol is IP.

    Signed-off-by: Isaku Yamahata
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Isaku Yamahata
     

17 Dec, 2012

2 commits

  • Since (a0ecb85 netfilter: nf_nat: Handle routing changes in MASQUERADE
    target), the MASQUERADE target handles routing changes which affect
    the output interface of a connection, but only for ESTABLISHED
    connections. It is also possible for NEW connections which
    already have a conntrack entry to be affected by routing changes.

    This adds a check to drop entries in the NEW+conntrack state
    when the oif has changed.

    Signed-off-by: Andrew Collins
    Acked-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Andrew Collins
     
  • The problem occurs when iptables constructs the tcp reset packet.
    It doesn't initialize the pointer to the tcp header within the skb.
    When the skb is passed to the ixgbe driver for transmit, the ixgbe
    driver attempts to access the tcp header and crashes.
    Currently, other drivers (such as our 1G e1000e or igb drivers) don't
    access the tcp header on transmit unless the TSO option is turned on.

    BUG: unable to handle kernel NULL pointer dereference at 0000000d
    IP: [] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe]
    *pdpt = 0000000085e5d001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    [...]
    Pid: 0, comm: swapper Tainted: P 2.6.35.12 #1 Greencity/Thurley
    EIP: 0060:[] EFLAGS: 00010246 CPU: 16
    EIP is at ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe]
    EAX: c7628820 EBX: 00000007 ECX: 00000000 EDX: 00000000
    ESI: 00000008 EDI: c6882180 EBP: dfc6b000 ESP: ced95c48
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 0, ti=ced94000 task=ced73bd0 task.ti=ced94000)
    Stack:
    cbec7418 c779e0d8 c77cc888 c77cc8a8 0903010a 00000000 c77c0008 00000002
    cd4997c0 00000010 dfc6b000 00000000 d0d176c9 c77cc8d8 c6882180 cbec7318
    00000004 00000004 cbec7230 cbec7110 00000000 cbec70c0 c779e000 00000002
    Call Trace:
    [] ? 0xd0d176c9
    [] ? 0xd0d18a4d
    [] ? dev_hard_start_xmit+0x218/0x2d7
    [] ? sch_direct_xmit+0x4b/0x114
    [] ? __qdisc_run+0xca/0xe0
    [] ? dev_queue_xmit+0x2d1/0x3d0
    [] ? neigh_resolve_output+0x1c5/0x20f
    [] ? neigh_update+0x29c/0x330
    [] ? arp_process+0x49c/0x4cd
    [] ? nf_hook_slow+0x3f/0xac
    [] ? arp_process+0x0/0x4cd
    [] ? arp_process+0x0/0x4cd
    [] ? T.901+0x38/0x3b
    [] ? arp_rcv+0xa3/0xb4
    [] ? arp_process+0x0/0x4cd
    [] ? __netif_receive_skb+0x32b/0x346
    [] ? netif_receive_skb+0x5a/0x5f
    [] ? napi_skb_finish+0x1b/0x30
    [] ? ixgbe_xmit_frame_ring+0x1564/0x2260 [ixgbe]
    [] ? lapic_next_event+0x13/0x16
    [] ? clockevents_program_event+0xd2/0xe4
    [] ? net_rx_action+0x55/0x127
    [] ? __do_softirq+0x77/0xeb
    [] ? do_softirq+0x23/0x27
    [] ? do_IRQ+0x7d/0x8e
    [] ? common_interrupt+0x29/0x30
    [] ? mwait_idle+0x48/0x4d
    [] ? cpu_idle+0x37/0x4c
    Code: df 09 d7 0f 94 c2 0f b6 d2 e9 e7 fb ff ff 31 db 31 c0 e9 38
    ff ff ff 80 78 06 06 0f 85 3e fb ff ff 8b 7c 24 38 8b 8f b8 00 00 00
    b6 51 0d f6 c2 01 0f 85 27 fb ff ff 80 e2 02 75 0d 8b 6c 24
    EIP: [] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe] SS:ESP

    Signed-off-by: Mukund Jampala
    Signed-off-by: Pablo Neira Ayuso

    Mukund Jampala
     

15 Dec, 2012

1 commit

  • If, in either of the above functions, inet_csk_route_child_sock() or
    __inet_inherit_port() fails, the newsk will not be freed:

    unreferenced object 0xffff88022e8a92c0 (size 1592):
    comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
    hex dump (first 32 bytes):
    0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
    02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x21/0x3e
    [] kmem_cache_alloc+0xb5/0xc5
    [] sk_prot_alloc.isra.53+0x2b/0xcd
    [] sk_clone_lock+0x16/0x21e
    [] inet_csk_clone_lock+0x10/0x7b
    [] tcp_create_openreq_child+0x21/0x481
    [] tcp_v4_syn_recv_sock+0x3a/0x23b
    [] tcp_check_req+0x29f/0x416
    [] tcp_v4_do_rcv+0x161/0x2bc
    [] tcp_v4_rcv+0x6c9/0x701
    [] ip_local_deliver_finish+0x70/0xc4
    [] ip_local_deliver+0x4e/0x7f
    [] ip_rcv_finish+0x1fc/0x233
    [] ip_rcv+0x217/0x267
    [] __netif_receive_skb+0x49e/0x553
    [] netif_receive_skb+0x50/0x82

    This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
    a single sock_put() is not enough to free the memory. Additionally, things
    like xfrm, memcg, cookie_values,... may have been initialized.
    We have to free them properly.

    This is fixed by forcing a call to tcp_done(), ending up in
    inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
    because it ends up doing all the cleanup on xfrm, memcg, cookie_values,...

    Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
    force it entering inet_csk_destroy_sock. To avoid the warning in
    inet_csk_destroy_sock, inet_num has to be set to 0.
    As inet_csk_destroy_sock does a dec on orphan_count, we first have to
    increase it.

    Calling tcp_done() allows us to remove the calls to
    tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().

    A similar approach is taken for dccp by calling dccp_done().

    This is in the kernel since 093d282321 (tproxy: fix hash locking issue
    when using port redirection in __inet_inherit_port()), thus since
    version >= 2.6.37.

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
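
The reference-count arithmetic behind this leak can be shown with a toy object. The sketch below only models the statement in the commit message that the cloned socket starts with a reference count of 2; the struct and function names are hypothetical, and none of the additional cleanup done by tcp_done() is modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a freshly cloned socket. */
struct toy_sock {
    int  refcnt;
    bool freed;
};

/* Model of the clone step: the new object starts with two references. */
void toy_sock_clone_init(struct toy_sock *sk)
{
    sk->refcnt = 2;
    sk->freed = false;
}

/* Drop one reference; the object is released only when the count hits zero. */
void toy_sock_put(struct toy_sock *sk)
{
    if (--sk->refcnt == 0)
        sk->freed = true;   /* stands in for releasing the memory */
}
```

A single put on the error path leaves refcnt at 1, so the object is never released, which matches the kmemleak report above.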
     

14 Dec, 2012

1 commit

  • Pull trivial branch from Jiri Kosina:
    "Usual stuff -- comment/printk typo fixes, documentation updates, dead
    code elimination."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    HOWTO: fix double words typo
    x86 mtrr: fix comment typo in mtrr_bp_init
    propagate name change to comments in kernel source
    doc: Update the name of profiling based on sysfs
    treewide: Fix typos in various drivers
    treewide: Fix typos in various Kconfig
    wireless: mwifiex: Fix typo in wireless/mwifiex driver
    messages: i2o: Fix typo in messages/i2o
    scripts/kernel-doc: check that non-void fcts describe their return value
    Kernel-doc: Convention: Use a "Return" section to describe return values
    radeon: Fix typo and copy/paste error in comments
    doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
    various: Fix spelling of "asynchronous" in comments.
    Fix misspellings of "whether" in comments.
    eisa: Fix spelling of "asynchronous".
    various: Fix spelling of "registered" in comments.
    doc: fix quite a few typos within Documentation
    target: iscsi: fix comment typos in target/iscsi drivers
    treewide: fix typo of "suport" in various comments and Kconfig
    treewide: fix typo of "suppport" in various comments
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • Pull networking changes from David Miller:

    1) Allow to dump, monitor, and change the bridge multicast database
    using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heuristics were chosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many, driver updates, cleanups, and improvements. Too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     

11 Dec, 2012

2 commits

  • This patch replaces the obsolete simple_strto with kstrto.

    Signed-off-by: Abhijit Pawar
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Abhijit Pawar
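
The practical difference between the two families is that the older helpers stop at the first invalid character without reporting an error, while the kstrto helpers return an error code for malformed or out-of-range input. The sketch below is a user-space analogue built on strtol(), not the kernel API: parse_int_strict() is a hypothetical name, and details such as whitespace handling are only approximated.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a base-10 int strictly.
 * Returns 0 on success, -EINVAL for empty input or trailing garbage,
 * and -ERANGE when the value does not fit in an int.
 * A single trailing newline is tolerated.
 */
int parse_int_strict(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s)
        return -EINVAL;             /* no digits at all */
    if (*end == '\n')
        end++;                      /* allow one trailing newline */
    if (*end != '\0')
        return -EINVAL;             /* trailing garbage */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -ERANGE;
    *out = (int)v;
    return 0;
}
```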
     
  • ip_check_defrag() might be called from af_packet within the
    RX path where shared SKBs are used, so it must not modify
    the input SKB before it has unshared it for defragmentation.
    Use skb_copy_bits() to get the IP header and only pull in
    everything later.

    The same is true for the other caller in macvlan as it is
    called from dev->rx_handler which can also get a shared SKB.

    Reported-by: Eric Leblond
    Cc: stable@vger.kernel.org
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

10 Dec, 2012

4 commits

  • Add logic to verify that a port comparison byte code operation
    actually has the second inet_diag_bc_op from which we read the port
    for such operations.

    Previously the code blindly referenced op[1] without first checking
    whether a second inet_diag_bc_op struct could fit there. So a
    malicious user could make the kernel read 4 bytes beyond the end of
    the bytecode array by claiming to have a whole port comparison byte
    code (2 inet_diag_bc_op structs) when in fact the bytecode was not
    long enough to hold both.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
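
The missing bounds check can be sketched in isolation. The struct below is a hypothetical 4-byte operation record chosen to match the "4 bytes beyond the end" figure in the message; it is not copied from the kernel headers, and the helper names are invented. The idea is simply to confirm that two whole records fit before touching the second one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-byte bytecode operation record. */
struct bc_op {
    uint8_t  code;
    uint8_t  yes;
    uint16_t no;    /* in the second record of a port comparison: the port */
};

/* True if a port comparison starting at byte offset off has room for both records. */
bool port_op_fits(size_t bytecode_len, size_t off)
{
    size_t need = 2 * sizeof(struct bc_op);

    return off <= bytecode_len && bytecode_len - off >= need;
}

/* Read the port operand only after validating alignment and length. */
bool read_port_operand(const struct bc_op *ops, size_t bytecode_len,
                       size_t off, uint16_t *port)
{
    if (off % sizeof(struct bc_op) != 0 || !port_op_fits(bytecode_len, off))
        return false;
    *port = ops[off / sizeof(struct bc_op) + 1].no;
    return true;
}
```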
     
  • Add logic to check the address family of the user-supplied conditional
    and the address family of the connection entry. We now do not do
    prefix matching of addresses from different address families (AF_INET
    vs AF_INET6), except for the previously existing support for having an
    IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
    maintains as-is).

    This change is needed for two reasons:

    (1) The addresses are different lengths, so comparing a 128-bit IPv6
    prefix match condition to a 32-bit IPv4 connection address can cause
    us to unwittingly walk off the end of the IPv4 address and read
    garbage or oops.

    (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
    simple bit-wise comparison of the prefixes is not meaningful, and
    would lead to bogus results (except for the IPv4-mapped IPv6 case,
    which this commit maintains).

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
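
A family-aware prefix match along the lines described above might look like the following. This is a sketch under stated assumptions: the addr struct, the family enum and all function names are hypothetical, and the prefix-length bound is included only so the example never reads past an address. Mismatched families never match, except for an IPv4 condition compared against an IPv4-mapped IPv6 address.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical address families and address container. */
enum addr_family { FAM_INET, FAM_INET6 };

struct addr {
    enum addr_family family;
    uint8_t bytes[16];          /* only the first 4 bytes are used for FAM_INET */
};

/* Compare the first `bits` bits of two byte strings. */
static bool bits_match(const uint8_t *a, const uint8_t *b, unsigned int bits)
{
    unsigned int whole = bits / 8, rest = bits % 8;

    if (memcmp(a, b, whole) != 0)
        return false;
    if (rest) {
        uint8_t mask = (uint8_t)(0xFF << (8 - rest));
        return ((a[whole] ^ b[whole]) & mask) == 0;
    }
    return true;
}

/* True for addresses of the form ::ffff:a.b.c.d */
static bool is_v4_mapped(const uint8_t *v6)
{
    static const uint8_t prefix[12] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xFF, 0xFF};

    return memcmp(v6, prefix, sizeof prefix) == 0;
}

bool prefix_matches(const struct addr *cond, unsigned int prefix_len,
                    const struct addr *conn)
{
    unsigned int max_bits = cond->family == FAM_INET ? 32 : 128;

    if (prefix_len > max_bits)
        return false;           /* never compare past the end of an address */
    if (cond->family == conn->family)
        return bits_match(cond->bytes, conn->bytes, prefix_len);
    if (cond->family == FAM_INET && conn->family == FAM_INET6 &&
        is_v4_mapped(conn->bytes))
        return bits_match(cond->bytes, conn->bytes + 12, prefix_len);
    return false;               /* different families are not comparable */
}
```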
     
  • Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
    operations.

    Previously we did not validate the inet_diag_hostcond, address family,
    address length, and prefix length. So a malicious user could make the
    kernel read beyond the end of the bytecode array by claiming to have a
    whole inet_diag_hostcond when the bytecode was not long enough to
    contain a whole inet_diag_hostcond of the given address family. Or
    they could make the kernel read up to about 27 bytes beyond the end of
    a connection address by passing a prefix length that exceeded the
    length of addresses of the given family.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
    instantiated for IPv4 traffic and in the SYN-RECV state were actually
    created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
    means that for such connections inet6_rsk(req) returns a pointer to a
    random spot in memory up to roughly 64KB beyond the end of the
    request_sock.

    With this bug, for a server using AF_INET6 TCP sockets and serving
    IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
    inet_diag_fill_req() causing an oops or the export to user space of 16
    bytes of kernel memory as a garbage IPv6 address, depending on where
    the garbage inet6_rsk(req) pointed.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell