21 May, 2010

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1674 commits)
    qlcnic: adding co maintainer
    ixgbe: add support for active DA cables
    ixgbe: dcb, do not tag tc_prio_control frames
    ixgbe: fix ixgbe_tx_is_paused logic
    ixgbe: always enable vlan strip/insert when DCB is enabled
    ixgbe: remove some redundant code in setting FCoE FIP filter
    ixgbe: fix wrong offset to fc_frame_header in ixgbe_fcoe_ddp
    ixgbe: fix header len when unsplit packet overflows to data buffer
    ipv6: Never schedule DAD timer on dead address
    ipv6: Use POSTDAD state
    ipv6: Use state_lock to protect ifa state
    ipv6: Replace inet6_ifaddr->dead with state
    cxgb4: notify upper drivers if the device is already up when they load
    cxgb4: keep interrupts available when the ports are brought down
    cxgb4: fix initial addition of MAC address
    cnic: Return SPQ credit to bnx2x after ring setup and shutdown.
    cnic: Convert cnic_local_flags to atomic ops.
    can: Fix SJA1000 command register writes on SMP systems
    bridge: fix build for CONFIG_SYSFS disabled
    ARCNET: Limit com20020 PCI ID matches for SOHARD cards
    ...

    Fix up various conflicts with pcmcia tree drivers/net/
    {pcmcia/3c589_cs.c, wireless/orinoco/orinoco_cs.c and
    wireless/orinoco/spectrum_cs.c} and feature removal
    (Documentation/feature-removal-schedule.txt).

    Also fix a non-content conflict due to pm_qos_requirement getting
    renamed in the PM tree (now pm_qos_request) in net/mac80211/scan.c

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
    vlynq: make whole Kconfig-menu dependant on architecture
    add descriptive comment for TIF_MEMDIE task flag declaration.
    EEPROM: max6875: Header file cleanup
    EEPROM: 93cx6: Header file cleanup
    EEPROM: Header file cleanup
    agp: use NULL instead of 0 when pointer is needed
    rtc-v3020: make bitfield unsigned
    PCI: make bitfield unsigned
    jbd2: use NULL instead of 0 when pointer is needed
    cciss: fix shadows sparse warning
    doc: inode uses a mutex instead of a semaphore.
    uml: i386: Avoid redefinition of NR_syscalls
    fix "seperate" typos in comments
    cocbalt_lcdfb: correct sections
    doc: Change urls for sparse
    Powerpc: wii: Fix typo in comment
    i2o: cleanup some exit paths
    Documentation/: it's -> its where appropriate
    UML: Fix compiler warning due to missing task_struct declaration
    UML: add kernel.h include to signal.c
    ...

    Linus Torvalds
     

18 May, 2010

8 commits

  • This patch removes from net/ (but not any netfilter files)
    all the unnecessary return; statements that precede the
    last closing brace of void functions.

    It does not remove the returns that are immediately
    preceded by a label as gcc doesn't like that.

    Done via:
    $ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
    xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • skb rxhash should be cleared when a skb is handled by a tunnel before
    being delivered again, so that correct packet steering can take place.

    There are other cleanups and accounting that we can factorize in a new
    helper, skb_tunnel_rx()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Commit 33ad798c924b4a (tcp: options clean up) introduced a problem
    if MD5+SACK+timestamps were used in initial SYN message.

    Some stacks (old linux for example) try to negotiate MD5+SACK+TSTAMP
    sessions, but since 40 bytes of tcp options space are not enough to
    store all the bits needed, we chose to disable timestamps in this case.

    We send a SYN-ACK _without_ timestamp option, but socket has timestamps
    enabled and all further outgoing messages contain a TS block, all with
    the initial timestamp of the remote peer.

    Fix is to really disable timestamps option for the whole session.

    Reported-by: Bijay Singh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Also added an explicit break; to avoid
    a fallthrough in net/ipv4/tcp_input.c

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • TCP outgoing packets can avoid two atomic ops, and the dirtying
    of a previously highly contended cache line, by using the new refdst
    infrastructure.

    Note 1: loopback device excluded because of !IFF_XMIT_DST_RELEASE
    Note 2: UDP packet dsts are built before ip_queue_xmit().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Use ip_route_input_noref() in ip fast path, to avoid two atomic ops per
    incoming packet.

    Note: loopback is excluded from this optimization in ip_rcv_finish()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ip_route_input() is the version returning a refcounted dst, while
    ip_route_input_noref() returns a non refcounted one.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Use low order bit of skb->_skb_dst to tell dst is not refcounted.

    Change _skb_dst to _skb_refdst to make sure all uses are caught.

    skb_dst() returns the dst, regardless of noref bit set or not, but
    with a lockdep check to make sure a noref dst is not given if current
    user is not rcu protected.

    New skb_dst_set_noref() helper to set a non-refcounted dst on a skb
    (with lockdep check).

    skb_dst_drop() drops a reference only if skb dst was refcounted.

    skb_dst_force() helper is used to force a refcount on dst, when skb
    is queued and no longer RCU protected.

    Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
    !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
    sock_queue_rcv_skb(), in __nf_queue().

    Use skb_dst_force() in dev_requeue_skb().

    Note: dst_use_noref() still dirties dst, we might transform it
    later to do one dirtying per jiffies.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
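
The low-order-bit scheme described in the last commit above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct dst`, `struct buf`, `DST_NOREF` and the `buf_dst_*()` helpers are hypothetical stand-ins for the kernel's dst entry, skb and `skb_dst_*()` helpers, and the lockdep/RCU checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's dst entry and skb. */
struct dst { int refcnt; };
struct buf { uintptr_t refdst; };      /* dst pointer, flag kept in bit 0 */

#define DST_NOREF ((uintptr_t)1)       /* bit 0 set: no reference is held */

/* Return the dst whether or not the noref bit is set. */
static struct dst *buf_dst(const struct buf *b)
{
    return (struct dst *)(b->refdst & ~DST_NOREF);
}

static bool buf_dst_is_noref(const struct buf *b)
{
    return (b->refdst & DST_NOREF) && buf_dst(b) != NULL;
}

/* Attach a dst without touching its reference count. */
static void buf_dst_set_noref(struct buf *b, struct dst *d)
{
    /* The trick relies on bit 0 of an aligned pointer being zero. */
    assert(((uintptr_t)d & DST_NOREF) == 0);
    b->refdst = (uintptr_t)d | DST_NOREF;
}

/* Take a real reference, e.g. before the buffer is queued. */
static void buf_dst_force(struct buf *b)
{
    if (buf_dst_is_noref(b)) {
        struct dst *d = buf_dst(b);

        d->refcnt++;
        b->refdst = (uintptr_t)d;
    }
}

/* Drop a reference only if one was actually held. */
static void buf_dst_drop(struct buf *b)
{
    if (b->refdst && !(b->refdst & DST_NOREF))
        buf_dst(b)->refcnt--;
    b->refdst = 0;
}
```

In this sketch, attaching with `buf_dst_set_noref()` and then calling `buf_dst_drop()` leaves the count untouched, while a `buf_dst_force()` in between adds one reference that the drop gives back.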
     

17 May, 2010

1 commit


16 May, 2010

3 commits

  • TCP-MD5 sessions have intermittent failures, when route cache is
    invalidated. ip_queue_xmit() has to find a new route, calls
    sk_setup_caps(sk, &rt->u.dst), destroying the

    sk->sk_route_caps &= ~NETIF_F_GSO_MASK

    that MD5 desperately tries to apply all along its path (from
    tcp_transmit_skb() for example).

    So we send a few bad packets, and everything is fine when
    tcp_transmit_skb() is called again for this socket.

    Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
    socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.

    Reported-by: Bhaskar Dutta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
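
The idea of a per-socket "nocaps" mask from the commit above can be sketched as follows. This is a minimal user-space illustration, not the kernel code: `struct sock_caps`, the `F_*` flag values and both helpers are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; the real NETIF_F_* values differ. */
#define F_SG   0x1u
#define F_GSO  0x2u

/* Hypothetical stand-in for the two socket fields the commit mentions. */
struct sock_caps {
    uint32_t route_caps;    /* capabilities usable on the current route */
    uint32_t route_nocaps;  /* bits that must never be enabled again */
};

/* Record that some capabilities are forbidden and clear them right away. */
static void nocaps_add(struct sock_caps *sk, uint32_t flags)
{
    assert(sk != NULL);
    sk->route_nocaps |= flags;
    sk->route_caps &= ~flags;
}

/* Called whenever a new route is found: the mask survives the reset. */
static void setup_caps(struct sock_caps *sk, uint32_t dev_features)
{
    assert(sk != NULL);
    sk->route_caps = dev_features & ~sk->route_nocaps;
}
```

Because the forbidden bits live in the socket rather than being cleared once, a later `setup_caps()` after a route change cannot silently turn them back on.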
     
  • TCP MD5 support uses percpu data for temporary storage. It currently
    disables preemption so that the same storage cannot be reclaimed by another
    thread on the same cpu.

    We also have to make sure a softirq handler won't try to use the same
    context. Various bug reports demonstrated corruptions.

    Fix is to disable preemption and BH.

    Reported-by: Bhaskar Dutta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • (Dropped the infiniband part, because Tetsuo modified the related code,
    I will send a separate patch for it once this is accepted.)

    This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
    allows users to reserve ports for third-party applications.

    The reserved ports will not be used by automatic port assignments
    (e.g. when calling connect() or bind() with port number 0). Explicit
    port allocation behavior is unchanged.

    Signed-off-by: Octavian Purdila
    Signed-off-by: WANG Cong
    Cc: Neil Horman
    Cc: Eric Dumazet
    Cc: Eric W. Biederman
    Signed-off-by: David S. Miller

    Amerigo Wang
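
How automatic port assignment can skip reserved ports, as the last commit above describes, might look like the sketch below. It is an assumption-laden illustration: the bitmap, `reserve_port()`, `port_is_reserved()` and `pick_ephemeral_port()` are hypothetical names, and the check for ports already in use is omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define PORT_COUNT 65536
#define WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)

/* One bit per port; a set bit means "reserved". Hypothetical stand-in
 * for whatever backs ip_local_reserved_ports. */
static unsigned long reserved[PORT_COUNT / WORD_BITS];

static void reserve_port(unsigned int port)
{
    assert(port < PORT_COUNT);
    reserved[port / WORD_BITS] |= 1UL << (port % WORD_BITS);
}

static bool port_is_reserved(unsigned int port)
{
    assert(port < PORT_COUNT);
    return (reserved[port / WORD_BITS] >> (port % WORD_BITS)) & 1UL;
}

/* Automatic assignment: walk the range [low, high] once, starting at
 * 'start' and wrapping around, and return the first port that is not
 * reserved. Returns 0 when every port in the range is reserved. */
static unsigned int pick_ephemeral_port(unsigned int low, unsigned int high,
                                        unsigned int start)
{
    assert(low <= start && start <= high && high < PORT_COUNT);

    unsigned int span = high - low + 1;

    for (unsigned int i = 0; i < span; i++) {
        unsigned int port = low + (start - low + i) % span;

        if (!port_is_reserved(port))
            return port;
    }
    return 0;
}
```

An explicit request for a reserved port never goes through `pick_ephemeral_port()` in this sketch, which mirrors the statement that explicit port allocation behavior is unchanged.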
     

14 May, 2010

1 commit


13 May, 2010

3 commits

  • This patch removes from net/ netfilter files
    all the unnecessary return; statements that precede the
    last closing brace of void functions.

    It does not remove the returns that are immediately
    preceded by a label as gcc doesn't like that.

    Done via:
    $ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
    xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'

    Signed-off-by: Joe Perches
    [Patrick: changed to keep return statements in otherwise empty function bodies]
    Signed-off-by: Patrick McHardy

    Joe Perches
     
  • Make sure all printk messages have a severity level.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Patrick McHardy

    Stephen Hemminger
     
  • Change netfilter asserts to standard WARN_ON. This has the
    benefit of backtrace info and also causes netfilter errors
    to show up on kerneloops.org.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Patrick McHardy

    Stephen Hemminger
     

12 May, 2010

7 commits


11 May, 2010

1 commit


10 May, 2010

2 commits

  • Need to check both CONFIG_FOO and CONFIG_FOO_MODULE

    Signed-off-by: David S. Miller

    David S. Miller
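
A minimal sketch of the check the commit above refers to, assuming the usual convention that a modular build defines `CONFIG_FOO_MODULE` rather than `CONFIG_FOO`. `CONFIG_FOO` is the commit's own placeholder; `FOO_ENABLED` and the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulate a modular (=m) build: only the _MODULE symbol is defined. */
#define CONFIG_FOO_MODULE 1

/* Correct test: the feature counts as enabled for built-in or module. */
#if defined(CONFIG_FOO) || defined(CONFIG_FOO_MODULE)
#define FOO_ENABLED 1
#else
#define FOO_ENABLED 0
#endif

static_assert(FOO_ENABLED, "a modular build must count as enabled");

static bool foo_available(void)
{
    return FOO_ENABLED;
}

/* Incomplete test: misses the modular case. */
static bool foo_available_builtin_only(void)
{
#ifdef CONFIG_FOO
    return true;
#else
    return false;
#endif
}
```

With only `CONFIG_FOO_MODULE` defined, `foo_available()` reports the feature as present while the built-in-only variant does not.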
     
  • Fixes the expiration timer for unresolved multicast route entries.
    In case new multicast routing requests come in faster than the
    expiration timeout occurs (e.g. zap through multicast TV streams), the
    timer is prevented from being called on time for already existing entries.

    As the single timer is reset to default whenever a new entry is made,
    the timeouts for existing unresolved entries are missed and/or not
    updated. As a consequence new requests are denied when the limit of
    unresolved entries has been reached because old entries live longer than
    they are supposed to.

    The solution is to reset the timer only for the first unresolved entry
    in the multicast routing cache. All other timers are already set and
    updated correctly within the timer function itself by now.

    Signed-off-by: Andreas Meissner
    Signed-off-by: David S. Miller

    Andreas Meissner
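
The timer problem and its fix, as described in the commit above, can be modelled in a few lines. This is a simplified sketch: `struct unres_queue`, `UNRES_TIMEOUT` and both `queue_unresolved_*()` functions are hypothetical, and the per-entry expiry handling done by the timer function itself is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UNRES_TIMEOUT 10   /* hypothetical expiry, in ticks */

/* Hypothetical model: one shared timer for all unresolved entries. */
struct unres_queue {
    int  len;              /* number of unresolved entries */
    bool timer_armed;
    long timer_expires;    /* tick at which the timer fires */
};

static void arm_timer(struct unres_queue *q, long expires)
{
    q->timer_armed = true;
    q->timer_expires = expires;
}

/* Old behaviour: every new entry pushes the shared timer back, so a
 * steady stream of requests keeps it from ever firing on time. */
static void queue_unresolved_old(struct unres_queue *q, long now)
{
    assert(q != NULL);
    q->len++;
    arm_timer(q, now + UNRES_TIMEOUT);
}

/* Fixed behaviour: only the first unresolved entry arms the timer. */
static void queue_unresolved_fixed(struct unres_queue *q, long now)
{
    assert(q != NULL);
    q->len++;
    if (q->len == 1)
        arm_timer(q, now + UNRES_TIMEOUT);
}
```

Feeding both variants a new entry every 5 ticks shows the difference: the old one keeps postponing the expiry, the fixed one still fires 10 ticks after the first entry.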
     

08 May, 2010

1 commit

  • A while back there was a discussion regarding the rt_secret_interval timer.
    Given that we've had the ability to do emergency route cache rebuilds for a while
    now, based on a statistical analysis of the various hash chain lengths in the
    cache, the use of the flush timer is somewhat redundant. This patch removes the
    rt_secret_interval sysctl, allowing us to rely solely on the statistical
    analysis mechanism to determine the need for route cache flushes.

    Signed-off-by: Neil Horman
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     

07 May, 2010

1 commit

  • commit 2783ef23 moved the initialisation of saddr and daddr after
    pskb_may_pull() to avoid a potential data corruption. Unfortunately,
    this also placed it after the short packet and bad checksum error paths,
    where these variables are used for logging. The result is bogus
    output like

    [92238.389505] UDP: short packet: From 2.0.0.0:65535 23715/178 to 0.0.0.0:65535

    Move the saddr and daddr initialisation above the error paths, while still
    keeping it after the pskb_may_pull() to keep the fix from commit 2783ef23.

    Signed-off-by: Bjørn Mork
    Cc: stable@kernel.org
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Bjørn Mork
     

03 May, 2010

1 commit


02 May, 2010

2 commits


29 Apr, 2010

3 commits

  • When queueing a skb to a socket, we can immediately release its dst if
    the target socket does not use IP_CMSG_PKTINFO.

    tcp_data_queue() can drop dst too.

    This benefits from a hot cache line and saves the receiver, possibly
    on another cpu, from dirtying this cache line itself.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since commit 95766fff ([UDP]: Add memory accounting.),
    each received packet needs one extra lock_sock()/release_sock() pair.

    This added latency because of possible backlog handling. Then later,
    ticket spinlocks added yet another latency source in case of DDOS.

    This patch introduces lock_sock_bh() and unlock_sock_bh()
    synchronization primitives, avoiding one atomic operation and backlog
    processing.

    skb_free_datagram_locked() uses them instead of full blown
    lock_sock()/release_sock(). skb is orphaned inside locked section for
    proper socket memory reclaim, and finally freed outside of it.

    UDP receive path now takes the socket spinlock only once.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This reverts two commits:

    fda48a0d7a8412cedacda46a9c0bf8ef9cd13559
    tcp: bind() fix when many ports are bound

    and a follow-on fix for it:

    6443bb1fc2050ca2b6585a3fa77f7833b55329ed
    ipv6: Fix inet6_csk_bind_conflict()

    It causes problems with binding listening sockets when time-wait
    sockets from a previous instance still are alive.

    It's too late to keep fiddling with this so late in the -rc
    series, and we'll deal with it in net-next-2.6 instead.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Apr, 2010

4 commits

  • Current socket backlog limit is not enough to really stop DDOS attacks,
    because the user thread spends a lot of time processing a full backlog each
    round, and the user might spin heavily on the socket lock.

    We should add backlog size and receive_queue size (aka rmem_alloc) to
    pace writers, and let the user run without being slowed down too much.

    Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
    stress situations.

    Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
    receiver can now process ~200.000 pps (instead of ~100 pps before the
    patch) on an 8 core machine.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
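
The kind of check a helper like sk_rcvqueues_full() performs can be sketched as below. This is an illustration only: `struct rx_sock`, `rcvqueues_full()` and `rx_enqueue()` are hypothetical, and counting the incoming packet's size against the limit is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the socket fields involved. */
struct rx_sock {
    unsigned int backlog_len;  /* bytes waiting in the backlog */
    unsigned int rmem_alloc;   /* bytes held in the receive queue */
    unsigned int rcvbuf;       /* receive buffer limit */
};

/* True when backlog plus receive queue (plus the new packet, by
 * assumption) exceed the limit. Needs no socket lock. */
static bool rcvqueues_full(const struct rx_sock *sk, unsigned int pkt_size)
{
    assert(sk != NULL);
    return sk->backlog_len + sk->rmem_alloc + pkt_size > sk->rcvbuf;
}

/* Producer side: drop early instead of contending on the lock.
 * Returns true if the packet was accepted. */
static bool rx_enqueue(struct rx_sock *sk, unsigned int pkt_size)
{
    if (rcvqueues_full(sk, pkt_size))
        return false;          /* dropped without taking the lock */

    /* ...the real code would take the socket lock here... */
    sk->backlog_len += pkt_size;
    return true;
}
```

Dropping before the lock is taken is what lets the consumer keep running under load: writers stop competing for the lock once both queues together are over the limit.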
     
  • Idea from Eric Dumazet.

    As for placement inside of struct sock, I tried to choose a place
    that otherwise has a 32-bit hole on 64-bit systems.

    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
     
  • David S. Miller
     
  • RFC 1122 says the following:
    ...
    Keep-alive packets MUST only be sent when no data or
    acknowledgement packets have been received for the
    connection within an interval.
    ...

    The acknowledgement packet is resetting the keepalive
    timer but the data packet isn't. This patch fixes it by
    checking the timestamp of the last received data packet
    too when the keepalive timer expires.

    Signed-off-by: Flavio Leitner
    Signed-off-by: Eric Dumazet
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Flavio Leitner
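
The keepalive rule from the last commit above amounts to measuring idle time from whichever arrived most recently, an acknowledgement or a data packet. A minimal sketch, assuming tick counters that may wrap around; `keepalive_idle()`, `keepalive_due()` and their parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Idle time is measured from the most recent of the two receive
 * events. Unsigned subtraction keeps the result correct even when the
 * tick counter has wrapped around. */
static unsigned long keepalive_idle(unsigned long now,
                                    unsigned long last_ack_rx,
                                    unsigned long last_data_rx)
{
    unsigned long since_ack = now - last_ack_rx;
    unsigned long since_data = now - last_data_rx;

    return since_ack < since_data ? since_ack : since_data;
}

/* A probe is due only when neither data nor an ACK was received
 * within the interval. */
static bool keepalive_due(unsigned long now, unsigned long last_ack_rx,
                          unsigned long last_data_rx, unsigned long interval)
{
    assert(interval > 0);
    return keepalive_idle(now, last_ack_rx, last_data_rx) >= interval;
}
```

Checking only `last_ack_rx` would reproduce the behaviour the commit fixes: a connection that keeps receiving data but no pure ACKs would still be probed.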