12 Oct, 2007

10 commits

  • This addition of lost_retrans_low to tcp_sock might be
    unnecessary, it's not clear how often lost_retrans worker is
    executed when there wasn't work to do.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • Detection implemented with lost_retrans must work also when
    fastpath is taken, yet most of the queue is skipped including
    (very likely) those retransmitted skb's we're interested in.
    This problem appeared when the hints got added, which removed
    a need to always walk over the whole write queue head.
    Therefore the decision for the lost_retrans worker loop entry must
    be separated from the sacktag processing more than it was
    necessary before.

    It turns out to be problematic to optimize the worker loop
    very heavily because ack_seqs of skb may have a number of
    discontinuity points. Maybe similar approach as currently is
    implemented could be attempted but that's becoming more and
    more complex because the trend is towards less skb walking
    in sacktag marker. Trying a simple "work until all rexmitted
    skbs have been processed" approach.

    Maybe after(highest_sack_end_seq, tp->high_seq) checking is not
    sufficiently accurate and causes entry too often in no-work-to-do
    cases. Since that's not known, I've separated solution to that
    from this patch.

    Noticed because of a report against a related problem from TAKANO
    Ryousei. He also provided a patch to
    that part of the problem. This patch includes a solution to it
    (though this patch has to use somewhat different placement).
    TAKANO's description and patch is available here:

    http://marc.info/?l=linux-netdev&m=119149311913288&w=2

    ...In short, TAKANO's problem is that the end_seq the loop is using
    is not necessarily the largest SACK block's end_seq, because the
    current ACK may still have higher SACK blocks which are processed
    later by the loop.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
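
The after(highest_sack_end_seq, tp->high_seq) check mentioned above depends on comparing 32-bit sequence numbers in a way that survives wraparound. A minimal userspace sketch of that kind of comparison follows; the names seq_before and seq_after are illustrative stand-ins, not the kernel's own helpers, and the cast relies on the usual two's-complement conversion.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if seq1 comes earlier than seq2 in 32-bit sequence space.
 * The subtraction wraps modulo 2^32, so the sign of the difference
 * tells the order as long as the two values are less than 2^31 apart. */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

/* True if seq1 comes later than seq2 (illustrative counterpart). */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
    return seq_before(seq2, seq1);
}
```

With this definition a small value such as 5 still counts as "after" 0xfffffff0, because the sequence space has wrapped in between.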
     
  • Both sacked_out and fackets_out are directly known from the how
    parameter. Since fackets_out is accurate, there's no need for
    recounting (sacked_out was previously unnecessarily counted
    in the loop anyway).

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • This is necessary for the upcoming DSACK bugfix. Reduces sacktag
    length, which is not a very sad thing at all... :-)

    Notice that there's a need to handle out-of-mem at caller's
    place.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • It's on the way for future cutting of that function.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • This condition (plain R) can arise at least in recovery that
    is triggered after tcp_undo_loss. There isn't any reason why
    such skbs should not be marked as lost; not marking them makes the
    in_flight estimator return values that are too large.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • I was reading tcp_enter_loss while looking for Cedric's bug and
    noticed bytes_acked adjustment is missing from FRTO side.

    Since bytes_acked will only be used in tcp_cong_avoid, I think
    it's safe to assume RTO would be spurious. During FRTO cwnd
    will not be controlled by tcp_cong_avoid and if FRTO calls for
    conventional recovery, cwnd is adjusted and the result of wrong
    assumption is cleared from bytes_acked. If RTO was in fact
    spurious, we did normal ABC already and can continue without
    any additional adjustments.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • From RFC 3493, Section 5.2:

    IPV6_MULTICAST_IF

    Set the interface to use for outgoing multicast packets. The
    argument is the index of the interface to use. If the
    interface index is specified as zero, the system selects the
    interface (for example, by looking up the address in a routing
    table and using the resulting interface).

    This patch adds support for (index == 0) to reset the value to its
    original state, allowing the system to choose the best interface. IPv4
    already behaves this way.

    Signed-off-by: Brian Haley
    Acked-by: David L Stevens
    Signed-off-by: David S. Miller

    Brian Haley
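
From an application's point of view, the behaviour described above would be exercised roughly as follows. This is a sketch only: reset_multicast_if is a hypothetical helper name, and the code assumes a POSIX system with IPv6 support where the option takes an unsigned int interface index as RFC 3493 describes.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the system to choose the outgoing multicast
 * interface itself by passing interface index 0.
 * Returns 0 on success, -1 with errno set on failure. */
static int reset_multicast_if(int fd)
{
    unsigned int ifindex = 0; /* 0 = let the system select the interface */

    return setsockopt(fd, IPPROTO_IPV6, IPV6_MULTICAST_IF,
                      &ifindex, sizeof(ifindex));
}
```

A caller would typically create an AF_INET6 datagram socket, pin it to a specific interface index for a while, and later call the helper to return to the default selection.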
     
  • The patch will add MODULE_ALIAS("ip6t_") where missing,
    otherwise you will get

    ip6tables: No chain/target/match by that name

    when xt_ is not already loaded.

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jan Engelhardt
     
  • With your description I could reproduce the bug and actually you were
    completely right: the code above is incorrect. Somehow I was able to
    misread RFC1122 and mixed the roles :-(:

    When a connection is >>closed actively<<, it MUST linger in
    TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime).
    However, it MAY >>accept<< a new SYN from the remote TCP to
    reopen the connection directly from TIME-WAIT state, if it:
    [...]

    The fix is as follows: if the receiver initiated an active close, then the
    sender may reopen the connection - otherwise try to figure out if we hold
    a dead connection.

    Signed-off-by: Jozsef Kadlecsik
    Tested-by: Krzysztof Piotr Oledzki
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jozsef Kadlecsik
     

11 Oct, 2007

30 commits

  • 1) fibnl needs to be declared outside of config ifdefs,
    and also should not be explicitly initialized to NULL
    2) nl_fib_input() args are wrong for netlink_kernel_create()
    input method

    Signed-off-by: David S. Miller

    David S. Miller
     
  • As discussed before, this patch provides userland with a way to access
    relevant options in Router Advertisements, after they are processed
    and validated by the kernel. Extra options are processed in a generic
    way; this patch only exports RDNSS options described in RFC5006, but
    support to control which options are exported could be easily added.

    A new rtnetlink message type is defined, to transport Neighbor
    Discovery options, along with optional context information. At the
    moment only the address of the router sending an RDNSS option is
    included, but additional attributes may be later defined, if needed by
    new use cases.

    Signed-off-by: Pierre Ynard
    Signed-off-by: David S. Miller

    Pierre Ynard
     
  • This patch makes processing of netlink user -> kernel messages synchronous.
    This change was inspired by a talk with Alexey Kuznetsov about the current
    netlink message processing. He says that he was badly wrong when he introduced
    asynchronous user -> kernel communication.

    The call netlink_unicast is the only path to send message to the kernel
    netlink socket. But, unfortunately, it is also used to send data to the
    user.

    Before this change the user message has been attached to the socket queue
    and sk->sk_data_ready was called. The process has been blocked until all
    pending messages were processed. The bad thing is that this processing
    may occur in the arbitrary process context.

    This patch changes nlk->data_ready callback to get 1 skb and force packet
    processing right in the netlink_unicast.

    Kernel -> user path in netlink_unicast remains untouched.

    EINTR processing in netlink_run_queue was changed. It forces an rtnl_lock
    drop, but the process remains in the cycle until the message has been fully
    processed. So, there is no need to use these kludges now.

    Signed-off-by: Denis V. Lunev
    Acked-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • There are currently two ways to determine whether the netlink socket is a
    kernel one or a user one. This patch creates a single inline call for
    this purpose and unifies all the calls in af_netlink.c.

    No similar calls are found outside af_netlink.c.

    Signed-off-by: Denis V. Lunev
    Acked-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • netlink_sendskb does not use third argument. Clean it and save a couple of
    bytes.

    Signed-off-by: Denis V. Lunev
    Acked-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
    like outdated copy/paste from rtnetlink.c. Push them into sync with the
    original.

    Changes from v1:
    - deleted comment in nfnetlink_rcv_msg by request of Patrick McHardy

    Signed-off-by: Denis V. Lunev
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • There is no need to process outstanding netlink user->kernel packets
    during rtnl_unlock now. There is no rtnl_trylock in the rtnetlink_rcv
    anymore.

    Normal code path is the following:
    netlink_sendmsg
    netlink_unicast
    netlink_sendskb
    skb_queue_tail
    netlink_data_ready
    rtnetlink_rcv
    mutex_lock(&rtnl_mutex);
    netlink_run_queue(sk, qlen, &rtnetlink_rcv_msg);
    mutex_unlock(&rtnl_mutex);

    So, it is possible, that packets can be present in the rtnl->sk_receive_queue
    during rtnl_unlock, but there is no need to process them at that moment as
    rtnetlink_rcv for that packet is pending.

    Signed-off-by: Denis V. Lunev
    Acked-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • If kernel_accept() returns an error, it may pass back a pointer to
    freed memory (which the caller should ignore). Make it pass back NULL
    instead for better safety.

    Signed-off-by: Tony Battersby
    Signed-off-by: David S. Miller

    Tony Battersby
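
The defensive pattern described above can be illustrated outside the kernel. The sketch below is a userspace analogue under stated assumptions: struct conn, conn_accept and the simulate_failure flag are all hypothetical and only mimic the shape of an accept-style call that hands back an object through an out-parameter.

```c
#include <stdlib.h>

struct conn {
    int fd; /* placeholder payload for the illustration */
};

/* Hypothetical accept-style call. On failure the object is released and
 * the out-parameter is reset to NULL, so a caller that forgets to check
 * the return value cannot end up using freed memory. */
static int conn_accept(int simulate_failure, struct conn **newconn)
{
    struct conn *c = malloc(sizeof(*c));

    *newconn = c;
    if (!c)
        return -1;

    if (simulate_failure) {
        free(c);
        *newconn = NULL; /* never pass back a dangling pointer */
        return -1;
    }

    c->fd = 3;
    return 0;
}
```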
     
  • Expansion of original idea from Denis V. Lunev

    Add robustness and locking to the local_port_range sysctl.
    1. Enforce that low < high when setting.
    2. Use seqlock to ensure atomic update.

    The locking might seem like overkill, but there are
    cases where sysadmin might want to change value in the
    middle of a DoS attack.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
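
The two points above (validate the range, then publish both bounds so readers never see a mixed pair) can be sketched in userspace with C11 atomics. This is not the kernel's seqlock API; struct port_range and the port_range_* functions are hypothetical, the writer side assumes a single writer at a time, and the check follows the "low < high" wording of the message.

```c
#include <errno.h>
#include <stdatomic.h>

struct port_range {
    atomic_uint seq;  /* even: stable, odd: update in progress */
    atomic_int  low;
    atomic_int  high;
};

static void port_range_init(struct port_range *r, int low, int high)
{
    atomic_init(&r->seq, 0u);
    atomic_init(&r->low, low);
    atomic_init(&r->high, high);
}

/* Single-writer update; rejects ranges that do not satisfy low < high. */
static int port_range_set(struct port_range *r, int low, int high)
{
    if (low >= high)
        return -EINVAL;

    unsigned int s = atomic_load_explicit(&r->seq, memory_order_relaxed);

    atomic_store_explicit(&r->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&r->low, low, memory_order_relaxed);
    atomic_store_explicit(&r->high, high, memory_order_relaxed);
    atomic_store_explicit(&r->seq, s + 2, memory_order_release);
    return 0;
}

/* Readers retry until both bounds come from the same completed update. */
static void port_range_get(struct port_range *r, int *low, int *high)
{
    unsigned int s;

    do {
        s = atomic_load_explicit(&r->seq, memory_order_acquire);
        *low = atomic_load_explicit(&r->low, memory_order_relaxed);
        *high = atomic_load_explicit(&r->high, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while ((s & 1u) ||
             atomic_load_explicit(&r->seq, memory_order_relaxed) != s);
}
```

The reader never blocks the writer, which is the property that matters when an administrator changes the range while the system is under heavy connection load.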
     
  • Add port randomization rather than a simple fixed rover
    for use with SCTP. This makes it act similar to TCP, UDP, DCCP
    when allocating ports.

    No longer need port_alloc_lock as well (suggestion by Brian Haley).

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
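
The difference between a fixed rover and a randomized start can be shown with a toy allocator. Everything here is an assumption made for illustration: pick_port is a hypothetical function, the in_use array stands in for the real port hash table, and rand() stands in for whatever random source the kernel uses.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical allocator: begin probing at a random offset inside
 * [low, high] and wrap around, instead of resuming from a fixed rover.
 * in_use[p - low] marks ports that are already taken.
 * Returns the chosen port, or -1 if every port in the range is busy. */
static int pick_port(const bool *in_use, int low, int high)
{
    int remaining = high - low + 1;
    int offset = rand() % remaining;

    for (int i = 0; i < remaining; i++) {
        int port = low + (offset + i) % remaining;

        if (!in_use[port - low])
            return port;
    }
    return -1;
}
```

Because the starting point is random, an observer cannot predict the next port from the previous one, while the wrap-around probe still finds any free port in the range.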
     
  • The fourth parameter of /proc/net/psched is supposed to show the timer
    resolution and is used by HTB userspace to calculate the necessary
    burst rate. Currently we show the clock resolution, which results in a
    too low burst rate when the two differ.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
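
For readers unfamiliar with the file, a userspace tool might extract the fourth parameter along these lines. The sketch assumes the file holds four whitespace-separated hexadecimal fields on one line; parse_psched is a hypothetical helper, and it works on a string so that it does not depend on the file being present.

```c
#include <stdio.h>

/* Hypothetical parser for one line of four hexadecimal fields.
 * field[3] receives the fourth parameter discussed above.
 * Returns 0 on success, -1 if fewer than four fields were found. */
static int parse_psched(const char *line, unsigned int field[4])
{
    int n = sscanf(line, "%x %x %x %x",
                   &field[0], &field[1], &field[2], &field[3]);

    return n == 4 ? 0 : -1;
}
```

Given an illustrative line such as "000003e8 00000040 000f4240 3b9aca00", the fourth field parses to 0x3b9aca00, that is 1000000000.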
     
  • This patch makes the IPv4 x->type->input functions return the next protocol
    instead of setting it directly. This is identical to how we do things in
    IPv6 and will help us merge common code on the input path.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch moves the setting of the IP length and checksum fields out of
    the transforms and into the xfrmX_output functions. This would help future
    efforts in merging the transforms themselves.

    It also adds an optimisation to ipcomp due to the fact that the transport
    offset is guaranteed to be zero.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch removes the duplicate ipv6_{auth,esp,comp}_hdr structures since
    they're identical to the IPv4 versions. Duplicating them would only create
    problems for ourselves later when we need to add things like extended
    sequence numbers.

    I've also added transport header type conversion headers for these types
    which are now used by the transforms.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • The IPv6 calling convention for x->mode->output is more general and could
    help an eventual protocol-generic x->type->output implementation. This
    patch adopts it for IPv4 as well and modifies the IPv4 type output functions
    accordingly.

    It also rewrites the IPv6 mac/transport header calculation to be based off
    the network header where practical.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch changes the calling convention so that on entry from
    x->mode->output and before entry into x->type->output skb->data
    will point to the payload instead of the IP header.

    This is essentially a redistribution of skb_push/skb_pull calls
    with the aim of minimising them on the common path of tunnel +
    ESP.

    It'll also let us use the same calling convention between IPv4
    and IPv6 with the next patch.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • The beet output function completely kills any extension headers by replacing
    them with the IPv6 header. This is because it essentially ignores the
    result of ip6_find_1stfragopt by simply acting as if there aren't any
    extension headers.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • I pointed this out back when this patch was first proposed but it looks like
    it got lost along the way.

    The checksum only needs to be ignored for NAT-T in transport mode where
    we lose the original inner addresses due to NAT. With BEET the inner
    addresses will be intact so the checksum remains valid.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • To judge the timing for DAD, netif_carrier_ok() is used. However,
    there is a possibility that dev->qdisc stays noop_qdisc even if
    netif_carrier_ok() returns true. In that case, DAD NS is not sent out.
    We need to defer the IPv6 device initialization until a valid qdisc
    is specified.

    Signed-off-by: Mitsuru Chinen
    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Mitsuru Chinen
     
  • The unregister_netdevice() and dev_change_net_namespace()
    both check for dev->flags to be IFF_UP before calling the
    dev_close(), but the dev_close() checks for IFF_UP itself,
    so remove those unneeded checks.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Follows the "own function for each task" principle; this is really
    a somewhat separate task being done in sacktag. Also reduces
    indentation.

    In addition, added ack_seq local var to break some long
    lines & fixed coding style things.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • Just switch to the consolidated code.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Just switch to the consolidated code

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Just switch to the consolidated code.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Just switch to the consolidated calls.

    ipt_recent() has to initialize the private, so use
    the __seq_open_private() helper.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This concerns the ipv4 and ipv6 code mostly, but also the netlink
    and unix sockets.

    The netlink code is an example of how to use the __seq_open_private()
    call - it saves the net namespace on this private.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The decryption handlers will skip the frame if the RX_FLAG_DECRYPTED
    flag is set, so the early flag setting introduced by Johannes breaks
    decryption. To work around this, call the handlers first and then set
    the flag.

    Signed-off-by: Mattias Nissler
    Signed-off-by: John W. Linville

    Mattias Nissler
     
  • Problem description by Daniel Drake:

    "This sequence of events causes loss of connectivity:

    ifconfig eth7 down
    iwconfig eth7 mode monitor
    ifconfig eth7 up
    ifconfig eth7 down
    iwconfig eth7 mode managed

    At this point you are associated but TX does not work. This is because
    the eth7 hard_start_xmit is still ieee80211_monitor_start_xmit."

    The problem is caused by ieee80211_if_set_type checking for a non-zero
    hard_start_xmit pointer value in order to avoid changing that value for
    master devices. The fix is to make that check more explicitly linked to
    master devices rather than simply checking if the value has been
    previously set.

    CC: Daniel Drake
    Acked-by: Michael Wu
    Signed-off-by: John W. Linville

    John W. Linville
     
  • This patch releases the lock on the state before calling x->type->output.
    It also adds the lock to the spots where they're currently needed.

    Most of those places (all except mip6) are expected to disappear with
    async crypto.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch adds locking so that when we're copying non-atomic fields such as
    life-time or coaddr to user-space we don't get a partial result.

    For af_key I've changed every instance of pfkey_xfrm_state2msg apart from
    expiration notification to include the keys and life-times. This is in-line
    with XFRM behaviour.

    The actual cases affected are:

    * pfkey_getspi: No change as we don't have any keys to copy.
    * key_notify_sa:
    + ADD/UPD: This wouldn't work otherwise.
    + DEL: It can't hurt.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu