18 Jun, 2008

7 commits

  • genetlink has a circular locking dependency when dumping the registered
    families:

    - dump start:
    genl_rcv() : take genl_mutex
    genl_rcv_msg() : call netlink_dump_start() while holding genl_mutex
    netlink_dump_start(),
    netlink_dump() : take nlk->cb_mutex
    ctrl_dumpfamily() : try to detect this case and not take genl_mutex a
    second time

    - dump continuance:
    netlink_rcv() : call netlink_dump
    netlink_dump : take nlk->cb_mutex
    ctrl_dumpfamily() : take genl_mutex

    Register genl_lock as callback mutex with netlink to fix this. This slightly
    widens an already existing module unload race: the genl ops used during the
    dump might go away when the module is unloaded. Thomas Graf is working on a
    separate fix for this.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • This reverts commit 608961a5eca8d3c6bd07172febc27b5559408c5d.

    The problem is that the mac80211 stack not only needs to be able to
    muck with the link-level headers, it also might need to mangle all of
    the packet data if doing sw wireless encryption.

    This fixes kernel bugzilla #10903. Thanks to Didier Raboud (for the
    bugzilla report), Andrew Prince (for bisecting), Johannes Berg (for
    bringing this bisection analysis to my attention), and Ilpo (for
    trying to analyze this purely from the TCP side).

    In 2.6.27 we can take another stab at this, by using something like
    skb_cow_data() when the TX path of mac80211 ends up with a non-NULL
    tx->key. The ESP protocol code in the IPSEC stack can be used as a
    model for implementation.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The unix_dgram_sendmsg routine implements a (somewhat crude)
    form of receiver-imposed flow control by comparing the length of the
    receive queue of the 'peer socket' with the max_ack_backlog value
    stored in the corresponding sock structure, either blocking
    the thread which caused the send-routine to be called or returning
    EAGAIN. This routine is used by both SOCK_DGRAM and SOCK_SEQPACKET
    sockets. The poll-implementation for these socket types is
    datagram_poll from core/datagram.c. A socket is deemed to be writeable
    by this routine when the memory presently consumed by datagrams
    owned by it is less than the configured socket send buffer size. This
    is always wrong for connected PF_UNIX non-stream sockets when the
    abovementioned receive queue is currently considered to be full.
    'poll' will then return, indicating that the socket is writeable, but
    a subsequent write results in EAGAIN, effectively causing a
    (usual) application to 'poll for writeability by repeated send request
    with O_NONBLOCK set' until it has consumed its time quantum.

    The change below uses a suitably modified variant of the datagram_poll
    routines for both types of PF_UNIX sockets, which tests if the
    recv-queue of the peer a socket is connected to is presently
    considered to be 'full' as part of the 'is this socket
    writeable'-checking code. The socket being polled is additionally
    put onto the peer_wait wait queue associated with its peer, because the
    unix_dgram_sendmsg routine does a wake up on this queue after a
    datagram was received and the 'other wakeup call' is done implicitly
    as part of skb destruction, meaning, a process blocked in poll
    because of a full peer receive queue could otherwise sleep forever
    if no datagram owned by its socket was already sitting on this queue.
    This change also includes a small (inline) helper routine named
    'unix_recvq_full', which consolidates the actual testing code (in three
    different places) into a single location.

    Signed-off-by: Rainer Weikusat
    Signed-off-by: David S. Miller

    Rainer Weikusat
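
    The writeability test described above can be modelled in userspace roughly
    as follows. This is a minimal sketch, not the kernel code: struct
    model_sock, its fields and model_dgram_writable() are hypothetical
    stand-ins, and the strict '>' comparison is an assumption. Only the helper
    name unix_recvq_full and the idea of comparing the peer's receive queue
    length against max_ack_backlog come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct sock. */
struct model_sock {
	unsigned int recv_queue_len;   /* datagrams waiting in the receive queue */
	unsigned int max_ack_backlog;  /* receiver-imposed limit */
	unsigned int wmem_alloc;       /* memory held by datagrams this socket sent */
	unsigned int sndbuf;           /* configured send buffer size */
	struct model_sock *peer;       /* connected peer, or NULL */
};

/* Single place for the "receive queue is full" test (assumed strict '>'). */
static inline bool unix_recvq_full(const struct model_sock *sk)
{
	return sk->recv_queue_len > sk->max_ack_backlog;
}

/* Writeable only if the send buffer has room AND the peer can take more. */
static bool model_dgram_writable(const struct model_sock *sk)
{
	if (sk->wmem_alloc >= sk->sndbuf)
		return false;
	if (sk->peer != NULL && unix_recvq_full(sk->peer))
		return false;
	return true;
}
```

    With the old test only the send-buffer check was made, so a connected
    socket whose peer queue was full would still be reported as writeable.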
     
  • When generating the ip header for the transformed packet we just copy
    the frag_off field of the ip header from the original packet to the ip
    header of the newly generated packet. If we receive a packet as a chain
    of fragments, all but the last of the newly generated packets have the
    IP_MF flag set. We have to mask the frag_off field to only keep the
    IP_DF flag from the original packet. This got lost with git commit
    36cf9acf93e8561d9faec24849e57688a81eb9c5 ("[IPSEC]: Separate
    inner/outer mode processing on output").

    Signed-off-by: Steffen Klassert
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Steffen Klassert
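
    The masking step can be illustrated in isolation. This is a sketch under
    stated assumptions: outer_frag_off() is a hypothetical helper name, and the
    real patch applies the mask inside the xfrm output path rather than in a
    standalone function. The flag values are the standard IPv4 ones.

```c
#include <stdint.h>
#include <arpa/inet.h>  /* htons */

#ifndef IP_DF
#define IP_DF 0x4000    /* "don't fragment" flag */
#endif
#ifndef IP_MF
#define IP_MF 0x2000    /* "more fragments" flag */
#endif

/*
 * Hypothetical helper: derive the outer header's frag_off from the inner
 * one (both in network byte order), keeping only IP_DF and dropping IP_MF
 * and the fragment offset.
 */
static uint16_t outer_frag_off(uint16_t inner_frag_off)
{
	return inner_frag_off & htons(IP_DF);
}
```

    Copying frag_off unmasked would carry IP_MF and the fragment offset of
    each received fragment into the newly generated packet.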
     
  • The H.245 helper is not registered/unregistered, but assigned to
    connections manually from the Q.931 helper. This means on unload
    existing expectations and connections using the helper are not
    cleaned up, leading to the following oops on module unload:

    CPU 0 Unable to handle kernel paging request at virtual address c00a6828, epc == 802224dc, ra == 801d4e7c
    Oops[#1]:
    Cpu 0
    $ 0 : 00000000 00000000 00000004 c00a67f0
    $ 4 : 802a5ad0 81657e00 00000000 00000000
    $ 8 : 00000008 801461c8 00000000 80570050
    $12 : 819b0280 819b04b0 00000006 00000000
    $16 : 802a5a60 80000000 80b46000 80321010
    $20 : 00000000 00000004 802a5ad0 00000001
    $24 : 00000000 802257a8
    $28 : 802a4000 802a59e8 00000004 801d4e7c
    Hi : 0000000b
    Lo : 00506320
    epc : 802224dc ip_conntrack_help+0x38/0x74 Tainted: P
    ra : 801d4e7c nf_iterate+0xbc/0x130
    Status: 1000f403 KERNEL EXL IE
    Cause : 00800008
    BadVA : c00a6828
    PrId : 00019374
    Modules linked in: ip_nat_pptp ip_conntrack_pptp ath_pktlog wlan_acl wlan_wep wlan_tkip wlan_ccmp wlan_xauth ath_pci ath_dev ath_dfs ath_rate_atheros wlan ath_hal ip_nat_tftp ip_conntrack_tftp ip_nat_ftp ip_conntrack_ftp pppoe ppp_async ppp_deflate ppp_mppe pppox ppp_generic slhc
    Process swapper (pid: 0, threadinfo=802a4000, task=802a6000)
    Stack : 801e7d98 00000004 802a5a60 80000000 801d4e7c 801d4e7c 802a5ad0 00000004
    00000000 00000000 801e7d98 00000000 00000004 802a5ad0 00000000 00000010
    801e7d98 80b46000 802a5a60 80320000 80000000 801d4f8c 802a5b00 00000002
    80063834 00000000 80b46000 802a5a60 801e7d98 80000000 802ba854 00000000
    81a02180 80b7e260 81a021b0 819b0000 819b0000 80570056 00000000 00000001
    ...
    Call Trace:
    [] ip_finish_output+0x0/0x23c
    [] nf_iterate+0xbc/0x130
    [] nf_iterate+0xbc/0x130
    [] ip_finish_output+0x0/0x23c
    [] ip_finish_output+0x0/0x23c
    [] nf_hook_slow+0x9c/0x1a4

    One way to fix this would be to split helper cleanup from the unregistration
    function and invoke it for the H.245 helper, but since ctnetlink needs to be
    able to find the helper for synchronization purposes, a better fix is to
    register it normally and make sure it's not assigned to connections during
    helper lookup. The missing l3num initialization is enough for this, but this
    patch changes it to use AF_UNSPEC to make it more explicit.

    Reported-by: liannan
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Properly free h323_buffer when helper registration fails.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Fix three ct_extend/NAT extension related races:

    - When cleaning up the extension area and removing it from the bysource hash,
    the nat->ct pointer must not be set to NULL since it may still be used by
    an RCU read side

    - When replacing a NAT extension area in the bysource hash, the nat->ct
    pointer must be assigned before performing the replacement

    - When reallocating extension storage in ct_extend, the old memory must
    not be freed immediately since it may still be used by an RCU read side

    Possibly fixes https://bugzilla.redhat.com/show_bug.cgi?id=449315
    and/or http://bugzilla.kernel.org/show_bug.cgi?id=10875

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

17 Jun, 2008

11 commits

  • From: Eric Kinzie
    Signed-off-by: Chas Williams
    Signed-off-by: David S. Miller

    Eric Kinzie
     
  • It happens that if a packet arrives in a VC between the call to open it on
    the hardware and the call to change the backend to br2684, br2684_regvcc
    processes the packet and oopses dereferencing skb->dev because it is
    NULL before the call to br2684_push().

    Signed-off-by: Jorge Boncompte [DTI2]
    Signed-off-by: Chas Williams

    Jorge Boncompte [DTI2]
     
  • 1) Remove ICMP_MIN_LENGTH, as it is unused.

    2) Remove unneeded tcp_v4_send_check() declaration.

    Signed-off-by: Rami Rosen
    Signed-off-by: David S. Miller

    Rami Rosen
     
  • I just noticed "cat /proc/net/raw" was buggy, missing '\n' separators.

    I believe this was introduced by commit 8cd850efa4948d57a2ed836911cfd1ab299e89c6
    ([RAW]: Cleanup IPv4 raw_seq_show.)

    This trivial patch restores correct behavior, and applies to current
    Linus tree (should be applied to stable tree as well.)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Selected device feature bits can be propagated to VLAN devices, so we
    can make use of TX checksum offload and TSO on VLAN-tagged packets.
    However, if the physical device does not do VLAN tag insertion or
    generic checksum offload then the test for TX checksum offload in
    dev_queue_xmit() will see a protocol of htons(ETH_P_8021Q) and yield
    false.

    This splits the checksum offload test into two functions:

    - can_checksum_protocol() tests a given protocol against a feature bitmask

    - dev_can_checksum() first tests the skb protocol against the device
    features; if that fails and the protocol is htons(ETH_P_8021Q) then
    it tests the encapsulated protocol against the effective device
    features for VLANs

    Signed-off-by: Ben Hutchings
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Ben Hutchings
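
    The two-step test described above could look roughly like this in a
    userspace model. It is a sketch, not the patch itself: the MODEL_* feature
    bits, struct model_dev and struct model_pkt are hypothetical, protocols are
    kept in host byte order for simplicity, and taking "effective device
    features for VLANs" to mean features & vlan_features is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Ethertypes (host byte order in this model). */
#define MODEL_ETH_P_IP     0x0800
#define MODEL_ETH_P_IPV6   0x86DD
#define MODEL_ETH_P_8021Q  0x8100

/* Hypothetical feature bits; the kernel's real flag values differ. */
#define MODEL_F_GEN_CSUM   (1u << 0)  /* checksums any protocol */
#define MODEL_F_IP_CSUM    (1u << 1)  /* checksums over IPv4 only */
#define MODEL_F_IPV6_CSUM  (1u << 2)  /* checksums over IPv6 only */

struct model_dev { unsigned int features; unsigned int vlan_features; };
struct model_pkt { uint16_t protocol; uint16_t encap_protocol; };

/* Test a given protocol against a feature bitmask. */
static bool can_checksum_protocol(unsigned int features, uint16_t protocol)
{
	return (features & MODEL_F_GEN_CSUM) ||
	       ((features & MODEL_F_IP_CSUM) && protocol == MODEL_ETH_P_IP) ||
	       ((features & MODEL_F_IPV6_CSUM) && protocol == MODEL_ETH_P_IPV6);
}

/* Try the outer protocol first, then the protocol inside the VLAN tag. */
static bool dev_can_checksum(const struct model_dev *dev,
			     const struct model_pkt *pkt)
{
	if (can_checksum_protocol(dev->features, pkt->protocol))
		return true;
	if (pkt->protocol == MODEL_ETH_P_8021Q)
		return can_checksum_protocol(dev->features & dev->vlan_features,
					     pkt->encap_protocol);
	return false;
}
```

    A device with only IPv4 checksum offload fails the first test for a
    VLAN-tagged packet and passes the second when the encapsulated protocol
    is IPv4.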
     
    Right now, any time we set a primary transport we set
    the changeover_active flag. As a result, we invoke SFR-CACC
    even when there have been no changeover events.

    Only set changeover_active when there is a true changeover
    event, i.e. we had a primary path and we are changing to
    another transport.

    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
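
    The condition for a "true changeover" can be sketched as below. The types
    and the function model_set_primary() are hypothetical simplifications for
    illustration; the real association and transport structures carry far more
    state, and where the flag lives in them is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the SCTP structures. */
struct model_transport { int id; };

struct model_assoc {
	struct model_transport *primary_path;
	bool changeover_active;
};

/*
 * Set the primary transport. The flag is raised only when a primary path
 * already existed and the new transport differs from it.
 */
static void model_set_primary(struct model_assoc *asoc,
			      struct model_transport *t)
{
	bool changeover = asoc->primary_path != NULL &&
			  asoc->primary_path != t;

	asoc->primary_path = t;
	if (changeover)
		asoc->changeover_active = true;
}
```

    Setting the first primary path, or setting the same path again, leaves
    the flag untouched in this model.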
     
    This patch removes the proc fs entries that were already created if
    setting up a proc fs entry for the SCTP protocol fails.

    Signed-off-by: Wei Yongjun
    Acked-by: Neil Horman
    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Wei Yongjun
     
    Ingo's system is still seeing strange behavior, and he
    reports that it goes away if the rest of the deferred
    accept changes are reverted too.

    Therefore this reverts e4c78840284f3f51b1896cf3936d60a6033c4d2c
    ("[TCP]: TCP_DEFER_ACCEPT updates - dont retxmt synack") and
    539fae89bebd16ebeafd57a87169bc56eb530d76 ("[TCP]: TCP_DEFER_ACCEPT
    updates - defer timeout conflicts with max_thresh").

    Just like the other revert, these ideas can be revisited for
    2.6.27.

    Signed-off-by: David S. Miller

    David S. Miller
     
    We've introduced an extra need for a compat layer for ip_tunnel_prl{}
    for PRL (Potential Router List) management. Though compat_ioctl
    is still missing in ipv4/ipv6, let's make the interface more
    straightforward and eliminate the extra need for a nasty compat layer
    anyway, since the interface is new for 2.6.26.

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    YOSHIFUJI Hideaki
     
    Add a htb_hysteresis parameter to sch_htb.ko and by sysfs magic make
    it runtime adjustable via
    /sys/module/sch_htb/parameters/htb_hysteresis mode 640.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Martin Devera
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
    The HTB hysteresis mode reduces the CPU load, but at the
    cost of scheduling accuracy.

    On ADSL links (512 kbit/s upstream), this inaccuracy introduces
    significant jitter, enough to disturb VoIP. For details see my
    master's thesis (http://www.adsl-optimizer.dk/thesis/), chapter 7,
    section 7.3.1, pp 69-70.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Martin Devera
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

15 Jun, 2008

1 commit


14 Jun, 2008

2 commits


13 Jun, 2008

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    tcp: Revert 'process defer accept as established' changes.
    ipv6: Fix duplicate initialization of rawv6_prot.destroy
    bnx2x: Updating the Maintainer
    net: Eliminate flush_scheduled_work() calls while RTNL is held.
    drivers/net/r6040.c: correct bad use of round_jiffies()
    fec_mpc52xx: MPC52xx_MESSAGES_DEFAULT: 2nd NETIF_MSG_IFDOWN => IFUP
    ipg: fix receivemode IPG_RM_RECEIVEMULTICAST{,HASH} in ipg_nic_set_multicast_list()
    netfilter: nf_conntrack: fix ctnetlink related crash in nf_nat_setup_info()
    netfilter: Make nflog quiet when no one listen in userspace.
    ipv6: Fail with appropriate error code when setting not-applicable sockopt.
    ipv6: Check IPV6_MULTICAST_LOOP option value.
    ipv6: Check the hop limit setting in ancillary data.
    ipv6 route: Fix route lifetime in netlink message.
    ipv6 mcast: Check address family of gf_group in getsockopt(MS_FILTER).
    dccp: Bug in initial acknowledgment number assignment
    dccp ccid-3: X truncated due to type conversion
    dccp ccid-3: TFRC reverse-lookup Bug-Fix
    dccp ccid-2: Bug-Fix - Ack Vectors need to be ignored on request sockets
    dccp: Fix sparse warnings
    dccp ccid-3: Bug-Fix - Zero RTT is possible

    Linus Torvalds
     
  • This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
    ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
    the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38
    ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

    This change causes several problems, first reported by Ingo Molnar
    as a distcc-over-loopback regression where connections were getting
    stuck.

    Ilpo Järvinen first spotted the locking problems. The new function
    added by this code, tcp_defer_accept_check(), only has the
    child socket locked, yet it is modifying state of the parent
    listening socket.

    Fixing that is non-trivial at best: we can't simply grab the parent
    listening socket lock at this point, because that would create an
    ABBA deadlock. The normal ordering is parent listening
    socket --> child socket, but this code path would require the
    reverse lock ordering.

    Next is a problem noticed by Vitaliy Gusev, who noted:

    ----------------------------------------
    >--- a/net/ipv4/tcp_timer.c
    >+++ b/net/ipv4/tcp_timer.c
    >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
    > goto death;
    > }
    >
    >+ if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
    >+ tcp_send_active_reset(sk, GFP_ATOMIC);
    >+ goto death;

    Here socket sk is not attached to listening socket's request queue. tcp_done()
    will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
    release this sk) as socket is not DEAD. Therefore socket sk will be lost for
    freeing.
    ----------------------------------------

    Finally, Alexey Kuznetsov argues that there might not even be any
    real value or advantage to these new semantics even if we fix all
    of the bugs:

    ----------------------------------------
    Hiding from accept() sockets with only out-of-order data only
    is the only thing which is impossible with old approach. Is this really
    so valuable? My opinion: no, this is nothing but a new loophole
    to consume memory without control.
    ----------------------------------------

    So revert this thing for now.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In changeset 22dd485022f3d0b162ceb5e67d85de7c3806aa20
    ("raw: Raw socket leak.") code was added so that we
    flush pending frames on raw sockets to avoid leaks.

    The ipv4 part was fine, but the ipv6 part was not
    done correctly. Unlike the ipv4 side, the ipv6 code
    already has a .destroy method for rawv6_prot.

    So now there were two assignments to this member, and
    what the compiler does is use the last one, effectively
    making the ipv6 parts of that changeset a NOP.

    Fix this by removing the:

    .destroy = inet6_destroy_sock,

    line, and adding an inet6_destroy_sock() call to the
    end of raw6_destroy().

    Noticed by Al Viro.

    Signed-off-by: David S. Miller
    Acked-by: YOSHIFUJI Hideaki

    David S. Miller
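
    The shape of the fix, a single .destroy handler that does the
    protocol-specific cleanup and then chains to the generic one, can be
    sketched as follows. All model_* names and the struct fields are
    hypothetical; they only mirror the roles of rawv6_prot, raw6_destroy() and
    inet6_destroy_sock() named in the commit message.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the socket and protocol structures. */
struct model_sock {
	int pending_frames;    /* frames still queued on the raw socket */
	int generic_released;  /* set once the generic cleanup has run */
};

struct model_proto {
	void (*destroy)(struct model_sock *sk);
};

/* Plays the role of inet6_destroy_sock(): generic IPv6 socket cleanup. */
static void model_inet6_destroy_sock(struct model_sock *sk)
{
	sk->generic_released = 1;
}

/* Plays the role of raw6_destroy(): flush, then chain to the generic step. */
static void model_raw6_destroy(struct model_sock *sk)
{
	sk->pending_frames = 0;
	model_inet6_destroy_sock(sk);
}

/* Exactly one initializer for .destroy, so nothing is silently overridden. */
static const struct model_proto model_rawv6_prot = {
	.destroy = model_raw6_destroy,
};
```

    With two .destroy initializers in the same struct, only the last one
    takes effect, which is why the flush never ran before the fix.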
     

12 Jun, 2008

9 commits


11 Jun, 2008

7 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (42 commits)
    net: Fix routing tables with id > 255 for legacy software
    sky2: Hold RTNL while calling dev_close()
    s2io iomem annotations
    atl1: fix suspend regression
    qeth: start dev queue after tx drop error
    qeth: Prepare-function to call s390dbf was wrong
    qeth: reduce number of kernel messages
    qeth: Use ccw_device_get_id().
    qeth: layer 3 Oops in ip event handler
    virtio: use callback on empty in virtio_net
    virtio: virtio_net free transmit skbs in a timer
    virtio: Fix typo in virtio_net_hdr comments
    virtio_net: Fix skb->csum_start computation
    ehea: set mac address fix
    sfc: Recover from RX queue flush failure
    add missing lance_* exports
    ixgbe: fix typo
    forcedeth: msi interrupts
    ipsec: pfkey should ignore events when no listeners
    pppoe: Unshare skb before anything else
    ...

    Linus Torvalds
     
  • Step 8.5 in RFC 4340 says for the newly cloned socket

    Initialize S.GAR := S.ISS,

    but what in fact the code (minisocks.c) does is

    Initialize S.GAR := S.ISR,

    which is wrong (typo?) -- fixed by the patch.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This fixes a bug in computing the inter-packet-interval t_ipi = s/X:

    scaled_div32(a, b) uses u32 for b, but in "scaled_div32(s, X)" the type of the
    sending rate `X' is u64. Since X is scaled by 2^6, this truncates rates greater
    than 2^26 Bps (~537 Mbps).

    Using full 64-bit division now.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
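
    The truncation is easy to reproduce outside the kernel. The sketch below
    only models the type conversion and a full 64-bit division; the function
    names are hypothetical, and the microsecond unit for t_ipi is an assumption
    made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* X is kept scaled by 2^6, so 2^26 bytes/s is stored as 2^32. */
static uint64_t scale_rate(uint64_t bytes_per_sec)
{
	return bytes_per_sec << 6;
}

/* What a u32 parameter receives when it is handed a u64 rate. */
static uint32_t as_u32_divisor(uint64_t x_scaled)
{
	return (uint32_t)x_scaled;  /* the upper 32 bits are silently lost */
}

/* Full 64-bit division: t_ipi = s / X, here expressed in microseconds. */
static uint64_t t_ipi_usec(uint32_t s, uint64_t x_scaled)
{
	assert(x_scaled != 0);
	/* (s << 6) matches the 2^6 scaling of X; 1000000 converts s to usec. */
	return ((uint64_t)s << 6) * 1000000u / x_scaled;
}
```

    At 2^26 bytes per second the scaled rate is exactly 2^32, which a u32
    divisor sees as 0; the 64-bit version keeps the full value.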
     
  • This fixes a bug in the reverse lookup of p: given a value f(p), instead of p,
    the function returned the smallest tabulated value f(p).

    The smallest tabulated value of 10^6 * f(p), where

    f(p) = sqrt(2*p/3) + 12 * sqrt(3*p/8) * (32 * p^3 + p),

    is 8172, for p=0.0001.

    Since this value is scaled by 10^6, the outcome of this bug is that a loss
    of 8172/10^6 = 0.8172% was reported whenever the input was below the table
    resolution of 0.01%.

    This means that the value was over 80 times too high, resulting in large spikes
    of the initial loss interval, thus unnecessarily reducing the throughput.

    Also corrected the printk format (%u for u32).

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This fixes an oversight from an earlier patch, ensuring that Ack Vectors
    are not processed on request sockets.

    The issue is that Ack Vectors must not be parsed on request sockets, since
    the Ack Vector feature depends on the selection of the (TX) CCID. During the
    initial handshake the CCIDs are undefined, and so RFC 4340, 10.3 applies:

    "Using CCID-specific options and feature options during a negotiation
    for the corresponding CCID feature is NOT RECOMMENDED [...]"

    And it is not even possible: when the server receives the Request from the
    client, the CCID and Ack vector features are undefined; when the Ack finalising
    the 3-way hanshake arrives, the request socket has not been cloned yet into a
    full socket. (This order is necessary, since otherwise the newly created socket
    would have to be destroyed whenever an option error occurred - a malicious
    hacker could simply send garbage options and exploit this.)

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This patch fixes the following sparse warnings:
    * nested min(max()) expression:
    net/dccp/ccids/ccid3.c:91:21: warning: symbol '__x' shadows an earlier one
    net/dccp/ccids/ccid3.c:91:21: warning: symbol '__y' shadows an earlier one

    * Declaration of function prototypes in .c instead of .h file, resulting in
    "should it be static?" warnings.

    * Declared "struct dccpw" static (local to dccp_probe).

    * Disabled dccp_delayed_ack() - not fully removed due to RFC 4340, 11.3
    ("Receivers SHOULD implement delayed acknowledgement timers ...").

    * Used a different local variable name to avoid
    net/dccp/ackvec.c:293:13: warning: symbol 'state' shadows an earlier one
    net/dccp/ackvec.c:238:33: originally declared here

    * Removed unused functions `dccp_ackvector_print' and `dccp_ackvec_print'.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • In commit $(825de27d9e40b3117b29a79d412b7a4b78c5d815) (from 27th May, commit
    message `dccp ccid-3: Fix "t_ipi explosion" bug'), the CCID-3 window counter
    computation was fixed to cope with RTTs < 4 microseconds.

    Such RTTs can be found e.g. when running CCID-3 over loopback. The fix removed
    a check against RTT < 4, but introduced a divide-by-zero bug.

    All steady-state RTTs in DCCP are filtered using dccp_sample_rtt(), which
    ensures non-zero samples. However, a zero RTT is possible on initialisation,
    when there is no RTT sample from the Request/Response exchange.

    The fix is to use the fallback-RTT from RFC 4340, 3.4.

    This is also better than just fixing update_win_count() since it allows other
    parts of the code to always assume that the RTT is non-zero during the time
    that the CCID is used.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
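
    The fallback idea can be sketched as a small guard in front of any
    RTT-based division. The names below are hypothetical and the 0.2 second
    constant reflects the default suggested in RFC 4340, 3.4 as I understand
    it; the window-counter formula is a simplified illustration, not the
    CCID-3 code.

```c
#include <stdint.h>

/* Assumed fallback of 0.2 seconds, expressed in microseconds. */
#define MODEL_FALLBACK_RTT_USEC 200000u

/* Never hand a zero RTT to the rest of the code. */
static uint32_t rtt_or_fallback(uint32_t rtt_usec)
{
	return rtt_usec != 0 ? rtt_usec : MODEL_FALLBACK_RTT_USEC;
}

/*
 * Hypothetical consumer: number of quarter-RTTs elapsed. Multiplying first
 * avoids computing rtt / 4, which would be 0 for RTTs below 4 usec.
 */
static uint32_t quarter_rtts_elapsed(uint32_t elapsed_usec, uint32_t rtt_usec)
{
	return (uint32_t)((4ull * elapsed_usec) / rtt_or_fallback(rtt_usec));
}
```

    Substituting the fallback once, where the RTT is read, lets every later
    division assume a non-zero value.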