18 Sep, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (66 commits)
    be2net: fix some cmds to use mccq instead of mbox
    atl1e: fix 2.6.31-git4 -- ATL1E 0000:03:00.0: DMA-API: device driver frees DMA
    pkt_sched: Fix qstats.qlen updating in dump_stats
    ipv6: Log the affected address when DAD failure occurs
    wl12xx: Fix print_mac() conversion.
    af_iucv: fix race when queueing skbs on the backlog queue
    af_iucv: do not call iucv_sock_kill() twice
    af_iucv: handle non-accepted sockets after resuming from suspend
    af_iucv: fix race in __iucv_sock_wait()
    iucv: use correct output register in iucv_query_maxconn()
    iucv: fix iucv_buffer_cpumask check when calling IUCV functions
    iucv: suspend/resume error msg for left over pathes
    wl12xx: switch to %pM to print the mac address
    b44: the poll handler b44_poll must not enable IRQ unconditionally
    ipv6: Ignore route option with ROUTER_PREF_INVALID
    bonding: make ab_arp select active slaves as other modes
    cfg80211: fix SME connect
    rc80211_minstrel: fix contention window calculation
    ssb/sdio: fix printk format warnings
    p54usb: add Zcomax XG-705A usbid
    ...

    Linus Torvalds
     

16 Sep, 2009

2 commits

  • I recently came across a preemption imbalance detected by:

    huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101?
    ------------[ cut here ]------------
    kernel BUG at /usr/src/linux/kernel/timer.c:664!
    invalid opcode: 0000 [1] PREEMPT SMP

    with ffffffff80644630 being inet_twdr_hangman().

    This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a
    bit, so I looked at what might have caused it.

    One thing that struck me as strange is tcp_twsk_destructor(), as it
    calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the
    detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well,
    as far as I can tell.

    Signed-off-by: Robert Varga
    Signed-off-by: David S. Miller

    Robert Varga
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
    powerpc64: convert to dynamic percpu allocator
    sparc64: use embedding percpu first chunk allocator
    percpu: kill lpage first chunk allocator
    x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
    percpu: update embedding first chunk allocator to handle sparse units
    percpu: use group information to allocate vmap areas sparsely
    vmalloc: implement pcpu_get_vm_areas()
    vmalloc: separate out insert_vmalloc_vm()
    percpu: add chunk->base_addr
    percpu: add pcpu_unit_offsets[]
    percpu: introduce pcpu_alloc_info and pcpu_group_info
    percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
    percpu: add @align to pcpu_fc_alloc_fn_t
    percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
    percpu: drop @static_size from first chunk allocators
    percpu: generalize first chunk allocator selection
    percpu: build first chunk allocators selectively
    percpu: rename 4k first chunk allocator to page
    percpu: improve boot messages
    percpu: fix pcpu_reclaim() locking
    ...

    Fix trivial conflict as by Tejun Heo in kernel/sched.c

    Linus Torvalds
     

15 Sep, 2009

4 commits

  • This patch fixes commit e36b9d16c6a6d0f59803b3ef04ff3c22c3844c10. The approach
    there is to call dev_close()/dev_open() whenever the device type is changed in
    order to remap the device IP multicast addresses to HW multicast addresses.
    This approach suffers from 2 drawbacks:

    *. It assumes that the device is UP when calling dev_close(); otherwise
    dev_close() has no effect. It is worth mentioning that initscripts (Redhat)
    and sysconfig (Suse) don't act the same in this matter.
    *. dev_close() has other side effects, like deleting entries from the routing
    table, which might be unnecessary.

    The fix here is to directly remap the IP multicast addresses to HW multicast
    addresses for a bonding device that changes its type, and nothing else.

    Reported-by: Jason Gunthorpe
    Signed-off-by: Moni Shoua
    Signed-off-by: David S. Miller

    Moni Shoua
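
To illustrate what "remapping IP multicast addresses to HW multicast addresses" involves, here is a minimal sketch of the standard IPv4-to-Ethernet mapping (fixed 01:00:5e prefix plus the low 23 bits of the group address). It only covers the Ethernet case; the function name `ipv4_mc_to_ether` is hypothetical and this is not the bonding driver's code. Other link types use a different mapping, which is why a type change requires the addresses to be mapped again.

```c
#include <stdint.h>

/*
 * Map an IPv4 multicast group (host byte order) to an Ethernet
 * multicast MAC address: 01:00:5e followed by the low 23 bits
 * of the group address. Hypothetical helper for illustration.
 */
static void ipv4_mc_to_ether(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of this byte is dropped */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

For example, 224.0.0.251 (0xE00000FB) maps to 01:00:5e:00:00:fb.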
     
  • Once upon a time, snd_ssthresh was a 16-bit quantity.
    ...That has not been true for a long time. I ran across
    some ancient comparisons which still seem to trust that legacy.
    Put all that magic into a single place; I hopefully found all
    of them.

    Compile tested, though linking of allyesconfig is ridiculous
    nowadays it seems.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
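
A minimal sketch of the idea of putting the "magic" in one place: compare the slow-start threshold against a single 32-bit sentinel instead of a leftover 16-bit constant. The names `INFINITE_SSTHRESH`, `legacy_in_initial_slowstart` and `in_initial_slowstart`, and the sentinel value, are assumptions for illustration and may not match the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel meaning "ssthresh has never been lowered". */
#define INFINITE_SSTHRESH 0x7fffffffU

/* Legacy-style check: only meaningful while ssthresh fits in 16 bits. */
static bool legacy_in_initial_slowstart(uint32_t snd_ssthresh)
{
	return snd_ssthresh >= 0xFFFF;
}

/* Single helper that compares against the 32-bit sentinel instead. */
static bool in_initial_slowstart(uint32_t snd_ssthresh)
{
	return snd_ssthresh >= INFINITE_SSTHRESH;
}
```

With a threshold of 100000 segments, the legacy check wrongly reports "still in initial slow start", while the sentinel-based helper does not.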
     
  • Remove long removed "inet_protocol_base" declaration.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
    netxen: update copyright
    netxen: fix tx timeout recovery
    netxen: fix file firmware leak
    netxen: improve pci memory access
    netxen: change firmware write size
    tg3: Fix return ring size breakage
    netxen: build fix for INET=n
    cdc-phonet: autoconfigure Phonet address
    Phonet: back-end for autoconfigured addresses
    Phonet: fix netlink address dump error handling
    ipv6: Add IFA_F_DADFAILED flag
    net: Add DEVTYPE support for Ethernet based devices
    mv643xx_eth.c: remove unused txq_set_wrr()
    ucc_geth: Fix hangs after switching from full to half duplex
    ucc_geth: Rearrange some code to avoid forward declarations
    phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
    drivers/net/phy: introduce missing kfree
    drivers/net/wan: introduce missing kfree
    net: force bridge module(s) to be GPL
    Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
    ...

    Fixed up trivial conflicts:

    - arch/x86/include/asm/socket.h

    converted to use the generic header in the x86 tree. The generic
    header has the same new #define's, so that works out fine.

    - drivers/net/tun.c

    fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
    switched over to using 'tun->socket.sk' instead of the redundantly
    available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
    to the TUN driver") which added a new 'tun->sk' use.

    Noted in 'next' by Stephen Rothwell.

    Linus Torvalds
     

11 Sep, 2009

2 commits


09 Sep, 2009

1 commit


03 Sep, 2009

2 commits

  • This fixed a lockdep warning which appeared when doing stress
    memory tests over NFS:

    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

    page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock

    mount_root => nfs_root_data => tcp_close => lock sk_lock =>
    tcp_send_fin => alloc_skb_fclone => page reclaim

    David raised a concern that if the allocation fails in tcp_send_fin(), and it's
    GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
    for the allocation to succeed.

    But the fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks
    weird, but it is no worse than the implicit sleep inside GFP_KERNEL. Both could
    loop endlessly under memory pressure.

    CC: Arnaldo Carvalho de Melo
    CC: David S. Miller
    CC: Herbert Xu
    Signed-off-by: Wu Fengguang
    Signed-off-by: David S. Miller

    Wu Fengguang
     
  • Christoph Lameter pointed out that packet drops at qdisc level were not
    accounted in SNMP counters. Only if the application sets IP_RECVERR are drops
    reported to the user (-ENOBUFS errors) and SNMP counters updated.

    IP_RECVERR is used to enable extended reliable error message passing,
    but these are not needed to update system wide SNMP stats.

    This patch changes things a bit to allow SNMP counters to be updated,
    regardless of IP_RECVERR being set or not on the socket.

    Example after a UDP tx flood:
    # netstat -s
    ...
    IP:
    1487048 outgoing packets dropped
    ...
    Udp:
    ...
    SndbufErrors: 1487048

    send() syscalls do, however, still return an OK status, so as not to
    break applications.

    Note : send() manual page explicitly says for -ENOBUFS error :

    "The output queue for a network interface was full.
    This generally indicates that the interface has stopped sending,
    but may be caused by transient congestion.
    (Normally, this does not occur in Linux. Packets are just silently
    dropped when a device queue overflows.) "

    This is not true for IP_RECVERR enabled sockets : a send() syscall
    that hits a qdisc drop returns an ENOBUFS error.

    Many thanks to Christoph, David, and last but not least, Alexey !

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
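
The behaviour described above can be modelled in a few lines: the counters are bumped on every qdisc drop, and only the value returned to the caller depends on IP_RECVERR. This is a user-space sketch; `struct snmp_stats` and `account_qdisc_drop` are hypothetical names, not the kernel's.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the system-wide SNMP counters. */
struct snmp_stats {
	unsigned long out_discards;	/* "outgoing packets dropped" */
	unsigned long sndbuf_errors;	/* "SndbufErrors" */
};

/*
 * Account for one packet dropped at qdisc level.
 * Counters are always updated; the error is only reported to
 * sockets that asked for it via IP_RECVERR.
 */
static int account_qdisc_drop(struct snmp_stats *stats, bool recverr)
{
	stats->out_discards++;
	stats->sndbuf_errors++;
	return recverr ? -ENOBUFS : 0;
}
```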
     

02 Sep, 2009

4 commits


01 Sep, 2009

4 commits

  • RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
    which may represent a number of allowed retransmissions or a timeout value.
    Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
    in number of allowed retransmissions.

    For any desired threshold R2 (by means of time) one can specify tcp_retries2
    (by means of number of retransmissions) such that TCP will not time out
    earlier than R2. This is the case, because the RTO schedule follows a fixed
    pattern, namely exponential backoff.

    However, the RTO behaviour is no longer predictable if RTO backoffs can be
    reverted, as is the case in the draft
    "Make TCP more Robust to Long Connectivity Disruptions"
    (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).

    In the worst case TCP would time out a connection after 3.2 seconds, if the
    initial RTO equaled MIN_RTO and each backoff has been reverted.

    This patch introduces a function retransmits_timed_out(N),
    which calculates the timeout of a TCP connection, assuming an initial
    RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.

    Whenever timeout decisions are made by comparing the retransmission counter
    to some value N, this function can be used, instead.

    The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
    can occur than the value indicates. However, it yields a timeout which is
    similar to the one of an unpatched, exponentially backing off TCP in the same
    scenario. As no application could rely on an RTO greater than MIN_RTO, there
    should be no risk of a regression.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
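
The arithmetic behind this message can be sketched as follows. The sketch assumes a minimum RTO of 200 ms and an RTO cap of 120 s, and counts N + 1 timer expirations (the original transmission plus N retransmissions); the constants and function names are assumptions for illustration, not the patch itself.

```c
#include <stdint.h>

#define MIN_RTO_MS 200U		/* assumed minimum RTO */
#define MAX_RTO_MS 120000U	/* assumed upper bound on the RTO */

/*
 * Total time until the connection times out when the RTO starts at
 * MIN_RTO_MS and doubles after each of n unsuccessful
 * retransmissions, capped at MAX_RTO_MS.
 */
static uint64_t backoff_timeout_ms(unsigned int n)
{
	uint64_t total = 0;
	uint64_t rto = MIN_RTO_MS;

	for (unsigned int i = 0; i <= n; i++) {
		total += rto;
		rto = (rto * 2 > MAX_RTO_MS) ? MAX_RTO_MS : rto * 2;
	}
	return total;
}

/* Worst case when every backoff is reverted: the RTO never grows. */
static uint64_t reverted_timeout_ms(unsigned int n)
{
	return (uint64_t)(n + 1) * MIN_RTO_MS;
}
```

Under these assumptions, n = 15 gives 924600 ms with exponential backoff but only 3200 ms when every backoff is reverted, which is the 3.2 second worst case mentioned above.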
     
  • Here, an ICMP host/network unreachable message, whose payload fits to
    TCP's SND.UNA, is taken as an indication that the RTO retransmission has
    not been lost due to congestion, but because of a route failure
    somewhere along the path.
    With true congestion, a router won't trigger such a message and the
    patched TCP will operate as standard TCP.

    This patch reverts one RTO backoff, if an ICMP host/network unreachable
    message, whose payload fits to TCP's SND.UNA, arrives.
    Based on the new RTO, the retransmission timer is reset to reflect the
    remaining time, or - if the revert clocked out the timer - a retransmission
    is sent out immediately.
    Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if
    there have been retransmissions and reversible backoffs, already.

    Changes from v2:
    1) Renaming of skb in tcp_v4_err() moved to another patch.
    2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
    3) Fixed code comments.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
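
A user-space sketch of the revert step described above: undo one doubling, recompute the RTO, then either re-arm the timer for the remaining time or retransmit at once if the new RTO has already elapsed. `struct rto_state` and `revert_one_backoff` are hypothetical, and the sketch ignores RTO clamping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state for illustration. */
struct rto_state {
	uint32_t base_rto_ms;	/* RTO before any backoff */
	unsigned int backoff;	/* number of doublings currently applied */
	uint32_t elapsed_ms;	/* time since the last retransmission */
};

/*
 * Revert one backoff. Returns true if a retransmission should be
 * sent immediately; otherwise *remaining_ms holds the time left on
 * the re-armed timer (0 if there was nothing to revert).
 */
static bool revert_one_backoff(struct rto_state *s, uint32_t *remaining_ms)
{
	uint32_t rto;

	*remaining_ms = 0;
	if (s->backoff == 0)
		return false;	/* not in RTO loss recovery: nothing to do */

	s->backoff--;
	rto = s->base_rto_ms << s->backoff;
	if (rto > s->elapsed_ms) {
		*remaining_ms = rto - s->elapsed_ms;
		return false;
	}
	return true;		/* the revert clocked out the timer */
}
```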
     
  • This supplementary patch renames skb to icmp_skb in tcp_v4_err() in order to
    disambiguate from another sk_buff variable, which will be introduced
    in a separate patch.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
     
  • Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

29 Aug, 2009

6 commits

  • Remove the copy of the MD5 authentication key from tcp_check_req().
    This key has already been copied by tcp_v4_syn_recv_sock() or
    tcp_v6_syn_recv_sock().

    Signed-off-by: John Dykstra
    Signed-off-by: David S. Miller

    John Dykstra
     
  • There is a race condition in the time-wait sockets code that can lead
    to premature termination of FIN_WAIT2 and, subsequently, to RST
    generation when the FIN,ACK from the peer finally arrives:

    Time TCP header
    0.000000 30755 > http [SYN] Seq=0 Win=2920 Len=0 MSS=1460 TSV=282912 TSER=0
    0.000008 http > 30755 [SYN, ACK] Seq=0 Ack=1 Win=2896 Len=0 MSS=1460 TSV=...
    0.136899 HEAD /1b.html?n1Lg=v1 HTTP/1.0 [Packet size limited during capture]
    0.136934 HTTP/1.0 200 OK [Packet size limited during capture]
    0.136945 http > 30755 [FIN, ACK] Seq=187 Ack=207 Win=2690 Len=0 TSV=270521...
    0.136974 30755 > http [ACK] Seq=207 Ack=187 Win=2734 Len=0 TSV=283049 TSER=...
    0.177983 30755 > http [ACK] Seq=207 Ack=188 Win=2733 Len=0 TSV=283089 TSER=...
    0.238618 30755 > http [FIN, ACK] Seq=207 Ack=188 Win=2733 Len=0 TSV=283151...
    0.238625 http > 30755 [RST] Seq=188 Win=0 Len=0

    Say twdr->slot = 1 and we are running inet_twdr_hangman and in this
    instance inet_twdr_do_twkill_work returns 1. At that point we will
    mark slot 1 and schedule inet_twdr_twkill_work. We will also make
    twdr->slot = 2.

    Next, a connection is closed and tcp_time_wait(TCP_FIN_WAIT2, timeo)
    is called which will create a new FIN_WAIT2 time-wait socket and will
    place it in the last to be reached slot, i.e. twdr->slot = 1.

    At this point say inet_twdr_twkill_work will run which will start
    destroying the time-wait sockets in slot 1, including the just added
    TCP_FIN_WAIT2 one.

    To avoid this issue we increment the slot only if all entries in the
    slot have been purged.

    This change may delay the slots cleanup by a time-wait death row
    period but only if the worker thread didn't have the time to run/purge
    the current slot in the next period (6 seconds with default sysctl
    settings). However, on such a busy system even without this change we
    would probably see delays...

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
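
The rule "increment the slot only if all entries in the slot have been purged" can be modelled with a toy death row. The slot count, the per-tick budget and all names here are hypothetical; the sketch only shows the advance condition, not the kernel's timer or workqueue logic.

```c
#define SLOTS 8	/* hypothetical number of death-row slots */

struct death_row {
	unsigned int slot;		/* slot currently being purged */
	unsigned int pending[SLOTS];	/* time-wait sockets per slot */
};

/*
 * One timer tick: purge at most `budget` sockets from the current
 * slot and advance to the next slot only once it is empty.
 */
static void hangman_tick(struct death_row *dr, unsigned int budget)
{
	unsigned int *n = &dr->pending[dr->slot];
	unsigned int killed = (*n < budget) ? *n : budget;

	*n -= killed;
	if (*n == 0)
		dr->slot = (dr->slot + 1) % SLOTS;
}
```

Because the slot does not advance while entries remain, a socket added to the "last to be reached" slot can no longer land in a slot that is still being purged.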
     
  • Here is a rework and cleanup of the resize function.

    We had some bugs. We were using ->parent when we should have used
    node_parent(). We also used ->parent, which is not assigned by
    inflate, in the inflate loop.

    Also a fix to set thresholds to powers of 2 to fit the halve
    and double strategy.

    max_resize is renamed to max_work, which better indicates
    its function.

    Reaching max_work is not an error, so warning is removed.
    max_work only limits amount of work done per resize.
    (limits CPU-usage, outstanding memory etc).

    The clean-up makes it relatively easy to add fixed sized
    root-nodes if we would like to decrease the memory pressure
    on routers with large routing tables and dynamic routing.
    If we'll need that...

    It has been tested with 280k routes.

    Work done together with Robert Olsson.

    Signed-off-by: Jens Låås
    Signed-off-by: Robert Olsson
    Signed-off-by: David S. Miller

    Jens Låås
     
  • While doing some forwarding benchmarks, I noticed
    ip_rt_send_redirect() is rather expensive, even if send_redirects is
    false for the device.

    The fix is to avoid two atomic ops; we don't really need to take a
    reference on in_dev.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Introduce keepalive_probes(tp) helper, and use it, like
    keepalive_time_when(tp) and keepalive_intvl_when(tp)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
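
A sketch of what such a helper typically looks like: use the per-socket setting if one was given, otherwise fall back to the system-wide default. The struct, the default variable and its value of 9 are assumptions for illustration; only the helper name comes from the message above.

```c
/* Hypothetical per-socket options; 0 means "not set by the application". */
struct tcp_opts {
	unsigned int keepalive_probes;
};

/* Assumed system-wide default, standing in for the sysctl value. */
static unsigned int default_keepalive_probes = 9;

/* Per-socket value if set, otherwise the system-wide default. */
static unsigned int keepalive_probes(const struct tcp_opts *tp)
{
	return tp->keepalive_probes ? tp->keepalive_probes
				    : default_keepalive_probes;
}
```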
     
  • Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Aug, 2009

1 commit


25 Aug, 2009

3 commits


24 Aug, 2009

1 commit


20 Aug, 2009

1 commit


15 Aug, 2009

1 commit

  • The GRE header length should be subtracted when the tunnel MTU is
    calculated. This just corrects for the associativity change
    introduced by commit 42aa916265d740d66ac1f17290366e9494c884c2
    ("gre: Move MTU setting out of ipgre_tunnel_bind_dev").

    Signed-off-by: Tom Goff
    Signed-off-by: David S. Miller

    Tom Goff
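
The kind of associativity slip described above is easy to reproduce in isolation. In this sketch the parameter names `outer_hlen` and `gre_hlen` and both function names are hypothetical; it illustrates the arithmetic only and is not the driver code.

```c
/* Buggy form: `mtu -= a - b` subtracts a but adds b back. */
static int tunnel_mtu_buggy(int mtu, int outer_hlen, int gre_hlen)
{
	mtu -= outer_hlen - gre_hlen;
	return mtu;
}

/* Intended form: both header lengths are subtracted from the MTU. */
static int tunnel_mtu_fixed(int mtu, int outer_hlen, int gre_hlen)
{
	mtu -= outer_hlen + gre_hlen;
	return mtu;
}
```

With a 1500-byte underlying MTU, a 20-byte outer header and a 4-byte GRE header, the buggy form yields 1484 while the fixed form yields 1476.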
     

14 Aug, 2009

2 commits

  • Conflicts:
    arch/sparc/kernel/smp_64.c
    arch/x86/kernel/cpu/perf_counter.c
    arch/x86/kernel/setup_percpu.c
    drivers/cpufreq/cpufreq_ondemand.c
    mm/percpu.c

    Conflicts in core and arch percpu codes are mostly from commit
    ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
    num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all
    the first chunk allocators into mm/percpu.c, the changes are moved
    from arch code to mm/percpu.c.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • The networking code checks CAP_SYS_MODULE before using request_module() to
    try to load a kernel module. While this seems reasonable it's actually
    weakening system security since we have to allow CAP_SYS_MODULE for things
    like /sbin/ip and bluetoothd which need to be able to trigger module loads.
    CAP_SYS_MODULE actually grants those binaries the ability to directly load
    any code into the kernel. We should instead be protecting modprobe and the
    modules on disk, rather than granting random programs the ability to load code
    directly into the kernel. Instead we are going to gate those networking checks
    on CAP_NET_ADMIN which still limits them to root but which does not grant
    those processes the ability to load arbitrary code into the kernel.

    Signed-off-by: Eric Paris
    Acked-by: Serge Hallyn
    Acked-by: Paul Moore
    Acked-by: David S. Miller
    Signed-off-by: James Morris

    Eric Paris
     

10 Aug, 2009

5 commits