11 Sep, 2009

1 commit


10 Sep, 2009

1 commit

  • When new child qdiscs are attached to the mq qdisc, they are actually
    attached as root qdiscs to the device queues. The lock selection for
    new estimators incorrectly picks the root lock of the existing and
    to be replaced qdisc, which results in a use-after-free once the old
    qdisc has been destroyed.

    Mark mq qdisc instances with a new flag and treat qdiscs attached to
    mq as children similar to regular root qdiscs.

    Additionally prevent estimators from being attached to the mq qdisc
    itself since it only updates its byte and packet counters during dumps.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

06 Sep, 2009

2 commits

  • This patch adds a classful dummy scheduler which can be used as root qdisc
    for multiqueue devices and exposes each device queue as a child class.

    This allows queues to be addressed individually and grafted like regular
    classes. Additionally it presents an accumulated view of the statistics of
    all real root qdiscs in the dummy root.

    Two new callbacks are added to the qdisc_ops and qdisc_class_ops:

    - cl_ops->select_queue selects the tx queue number for new child classes.

    - qdisc_ops->attach() overrides root qdisc device grafting to attach
    non-shared qdiscs to the queues.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    David S. Miller
     
  • It will be used in a following patch by the multiqueue qdisc.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

05 Sep, 2009

9 commits

  • This shrinks the size of struct sctp_association a little.

    Signed-off-by: Wei Yongjun
    Signed-off-by: Vlad Yasevich

    Wei Yongjun
     
  • This patch introduces a new sysctl option to make IPv4 Address Scoping
    configurable.

    In networking environments where DNAT rules in iptables prerouting
    chains convert destination IPs to link-local/private IP addresses,
    SCTP connections fail to establish as the INIT chunk is dropped by the
    kernel due to address scope match failure.
    For example, to support overlapping IP addresses (same IP address with
    different vlan id) a Layer-5 application listens on link-local IPs,
    and there is a DNAT rule that maps the destination IP to a link-local
    IP. Such applications never get the SCTP INIT if the address-scoping
    draft is strictly followed.

    This sysctl configuration allows SCTP to function in such
    unconventional networking environments.

    Sysctl options:
    0 - Disable IPv4 address scoping draft altogether
    1 - Enable IPv4 address scoping (default, current behavior)
    2 - Enable address scoping but allow IPv4 private addresses in init/init-ack
    3 - Enable address scoping but allow IPv4 link local address in init/init-ack

    Signed-off-by: Bhaskar Dutta
    Signed-off-by: Vlad Yasevich

    Bhaskar Dutta
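
The four sysctl values above can be read as a small policy check. The following is a minimal sketch, not the kernel's implementation: the names `ScopePolicy`, `address_permitted` and the `passes_scope_check` flag (standing in for whatever the strict scoping rules decide) are hypothetical, and "private" is assumed to mean the RFC 1918 ranges.

```python
import ipaddress
from enum import IntEnum


class ScopePolicy(IntEnum):
    # Values mirror the sysctl options listed in the commit message.
    DISABLE = 0        # address scoping disabled altogether
    ENABLE = 1         # strict address scoping (default)
    ALLOW_PRIVATE = 2  # scoping, but private IPv4 addresses are allowed
    ALLOW_LINK = 3     # scoping, but link-local IPv4 addresses are allowed


# Assumption: "private" means the RFC 1918 ranges.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_rfc1918(addr):
    return any(addr in net for net in RFC1918)


def address_permitted(policy, addr_text, passes_scope_check):
    """Decide whether an address may appear in INIT/INIT-ACK.

    passes_scope_check is a hypothetical stand-in for the result of the
    strict scoping rules; this sketch only models the policy exceptions.
    """
    addr = ipaddress.ip_address(addr_text)
    if policy == ScopePolicy.DISABLE or passes_scope_check:
        return True
    if policy == ScopePolicy.ALLOW_PRIVATE and is_rfc1918(addr):
        return True
    if policy == ScopePolicy.ALLOW_LINK and addr.is_link_local:
        return True
    return False
```

In the DNAT scenario described above, a link-local destination that fails the strict check would be rejected under policy 1 and accepted under policy 3.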
     
  • This shrinks the size of sctp_packet a little.

    Signed-off-by: Vlad Yasevich

    Vlad Yasevich
     
  • We had a bug where we never stored the user-defined value for
    MAXSEG when setting the value on an association. Thus future
    PMTU events ended up re-writing the frag point and increasing
    it past the user limit. Additionally, when setting the option on
    the socket/endpoint, we affected all current associations, which
    is against spec.

    Now, we store the user 'maxseg' value along with the computed
    'frag_point'. We inherit 'maxseg' from the socket at association
    creation and use it as an upper limit for 'frag_point' when it's
    set.

    Signed-off-by: Vlad Yasevich

    Vlad Yasevich
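
A minimal sketch of the rule described above, assuming a hypothetical `Association` class in which `pmtu_payload` is the payload room derived from the path MTU and a `user_maxseg` of 0 means "not set". It only illustrates that the stored user value caps `frag_point` across later PMTU events.

```python
class Association:
    """Toy model: keeps the user 'maxseg' next to the computed 'frag_point'."""

    def __init__(self, pmtu_payload, user_maxseg=0):
        # user_maxseg is inherited from the socket at creation; 0 = not set.
        self.user_maxseg = user_maxseg
        self.frag_point = self._compute(pmtu_payload)

    def _compute(self, pmtu_payload):
        # The user value acts as an upper limit only when it is set.
        if self.user_maxseg:
            return min(pmtu_payload, self.user_maxseg)
        return pmtu_payload

    def on_pmtu_change(self, pmtu_payload):
        # Recompute on every PMTU event without forgetting the user limit.
        self.frag_point = self._compute(pmtu_payload)
```

Because the user value is kept separately, a PMTU increase can no longer push `frag_point` above it, while a PMTU decrease below it still takes effect.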
     
  • SCTP will delay the last part of a large write due to NAGLE, if that
    part is smaller than MTU. Since we are doing large writes, we might
    as well send the last portion now instead of waiting until the next
    large write happens. The small portion will be sent as is regardless,
    so it's better to not delay it.

    This is the result of much discussion with Wei Yongjun
    and Doug Graham. Many thanks go out to them.

    Signed-off-by: Vlad Yasevich

    Vlad Yasevich
     
  • SCTP has a problem that when small chunks are used, it is possible
    to exhaust the receiver buffer without fully closing the receive window.
    This happens due to all the overhead that we have to account for with
    small messages. To fix this, when the receive buffer is exceeded, we'll
    drop the window to 0 and save the 'drop' portion. When the application
    starts reading data and freeing up receive buffer space, we'll wait until
    we've reached the 'drop' window and then add back this 'drop' one
    mtu at a time. This worked well in testing and under stress produced
    rather even recovery.

    Signed-off-by: Vlad Yasevich

    Vlad Yasevich
     
  • Currently, sctp breaks up user messages into fragments and
    sends each fragment to the lower layer by itself. This means
    that for each fragment we go all the way down the stack
    and back up. This also discourages bundling of multiple
    fragments when they can fit into a single packet (ex: due
    to user setting a low fragmentation threshold).

    We introduce a new command SCTP_CMD_SND_MSG and hand the
    whole message down the state machine. The state machine and
    the side-effect parser will cork the queue, add all chunks
    from the message to the queue, and then un-cork the queue
    thus causing the chunks to get transmitted.

    Signed-off-by: Vlad Yasevich

    Vlad Yasevich
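
The cork / add / un-cork sequence can be illustrated with a toy queue. This is a sketch under stated assumptions, not the SCTP output queue: `OutQueue`, `send_msg` and the simple size-based bundling rule are all hypothetical.

```python
class OutQueue:
    """Toy output queue that bundles pending chunks into packets on flush."""

    def __init__(self, mtu):
        self.mtu = mtu
        self.corked = False
        self.pending = []   # chunks waiting to be transmitted
        self.packets = []   # each packet is a list of bundled chunks

    def cork(self):
        self.corked = True

    def add(self, chunk):
        self.pending.append(chunk)
        if not self.corked:
            self.flush()    # uncorked: every chunk goes out on its own

    def uncork(self):
        self.corked = False
        self.flush()

    def flush(self):
        packet, size = [], 0
        for chunk in self.pending:
            # Start a new packet when the next chunk would exceed the MTU.
            if packet and size + len(chunk) > self.mtu:
                self.packets.append(packet)
                packet, size = [], 0
            packet.append(chunk)
            size += len(chunk)
        if packet:
            self.packets.append(packet)
        self.pending = []


def send_msg(queue, fragments):
    """Hand the whole message down at once: cork, queue all, un-cork."""
    queue.cork()
    for fragment in fragments:
        queue.add(fragment)
    queue.uncork()
```

With three 300-byte fragments and an MTU of 1000, `send_msg` produces one packet carrying all three, whereas adding them one by one to an uncorked queue produces three packets.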
     
  • If a socket has a lot of associations that are in the process of
    being closed/aborted, it is possible for a remote to establish
    new associations during the time period that the old ones are shutting
    down. If this was a result of a close() call, there will be no socket,
    and this will cause a memory leak. We'll prevent this by setting the
    socket state to CLOSING and disallowing new associations when in this state.

    Signed-off-by: Vlad Yasevich

    Vlad Yasevich
     
  • This patch removes an unused union definition (sctp_cmsg_data_t)
    from include/net/sctp/user.h.

    Signed-off-by: Rami Rosen
    Signed-off-by: Vlad Yasevich

    Rami Rosen
     

03 Sep, 2009

2 commits

  • This fixed a lockdep warning which appeared when doing stress
    memory tests over NFS:

    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

    page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock

    mount_root => nfs_root_data => tcp_close => lock sk_lock =>
    tcp_send_fin => alloc_skb_fclone => page reclaim

    David raised a concern that if the allocation fails in tcp_send_fin(), and it's
    GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
    for the allocation to succeed.

    But the fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks
    weird, but it is no worse than the implicit sleep inside GFP_KERNEL. Both could
    loop endlessly under memory pressure.

    CC: Arnaldo Carvalho de Melo
    CC: David S. Miller
    CC: Herbert Xu
    Signed-off-by: Wu Fengguang
    Signed-off-by: David S. Miller

    Wu Fengguang
     
  • vlan devices are currently not multi-queue capable.

    We can do that with a new rtnl_link_ops method,
    get_tx_queues(), called from rtnl_create_link().

    This new method gets num_tx_queues/real_num_tx_queues
    from the real device.

    register_vlan_device() is also handled.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Sep, 2009

6 commits

  • The function block inet_connect_sock_af_ops contains no data;
    make it constant.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • Conflicts:
    drivers/net/yellowfin.c

    David S. Miller
     
  • These are full of unresolved problems, mainly that conversions don't
    work 1-1 from hrtimers to tasklet_hrtimers because, unlike hrtimers,
    tasklets can't be killed from softirq context.

    And when a qdisc gets reset, that's exactly what we need to do here.

    We'll work this out in the net-next-2.6 tree and if warranted we'll
    backport that work to -stable.

    This reverts the following 3 changesets:

    a2cb6a4dd470d7a64255a10b843b0d188416b78f
    ("pkt_sched: Fix bogon in tasklet_hrtimer changes.")

    38acce2d7983632100a9ff3fd20295f6e34074a8
    ("pkt_sched: Convert CBQ to tasklet_hrtimer.")

    ee5f9757ea17759e1ce5503bdae2b07e48e32af9
    ("pkt_sched: Convert qdisc_watchdog to tasklet_hrtimer")

    Signed-off-by: David S. Miller

    David S. Miller
     
  • These tables are never modified at runtime. Move to read-only
    section.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • This patch affects the retransmits_timed_out() function.

    Changes:
    1) Variables have more meaningful names
    2) retransmits_timed_out() has an introductory comment.
    3) Small coding style changes.

    Signed-off-by: Damian Lukowski
    Signed-off-by: David S. Miller

    Damian Lukowski
     
  • struct net::ipv6.ip6_dst_ops is separately dynamically allocated,
    but there is no fundamental reason for it. Embed it directly into
    struct netns_ipv6.

    For that:
    * move struct dst_ops into a separate header to fix circular dependencies
      (I honestly tried not to, but it's pretty much impossible to do it any
      other way)
    * drop dynamic allocation, allocate together with netns

    For a change, remove struct dst_ops::dst_net; it's deducible
    by using container_of() given a dst_ops pointer.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

01 Sep, 2009

3 commits

  • RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
    which may represent a number of allowed retransmissions or a timeout value.
    Currently Linux uses sysctl_tcp_retries{1,2} to specify the thresholds
    in number of allowed retransmissions.

    For any desired threshold R2 (by means of time) one can specify tcp_retries2
    (by means of number of retransmissions) such that TCP will not time out
    earlier than R2. This is the case, because the RTO schedule follows a fixed
    pattern, namely exponential backoff.

    However, the RTO behaviour is not predictable any more if RTO backoffs can be
    reverted, as is the case in the draft
    "Make TCP more Robust to Long Connectivity Disruptions"
    (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).

    In the worst case TCP would time out a connection after 3.2 seconds, if the
    initial RTO equaled MIN_RTO and each backoff has been reverted.

    This patch introduces a function retransmits_timed_out(N),
    which calculates the timeout of a TCP connection, assuming an initial
    RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.

    Whenever timeout decisions are made by comparing the retransmission counter
    to some value N, this function can be used instead.

    The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
    can occur than the value indicates. However, it yields a timeout which is
    similar to the one of an unpatched, exponentially backing off TCP in the same
    scenario. As no application could rely on an RTO greater than MIN_RTO, there
    should be no risk of a regression.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
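
The arithmetic behind this commit can be checked with a short sketch. It is not the kernel's retransmits_timed_out(); the helper names are hypothetical, and the constants are assumptions: MIN_RTO of 200 ms, an optional RTO cap of 120 s, and N = 15 (the usual tcp_retries2 default). The timeout is taken as the initial RTO plus N backed-off RTO intervals.

```python
MIN_RTO_MS = 200       # assumed minimum RTO
MAX_RTO_MS = 120_000   # assumed upper bound on a single RTO


def backoff_timeout_ms(n, min_rto_ms=MIN_RTO_MS, max_rto_ms=None):
    """Total time for the initial RTO plus n exponentially backed-off RTOs."""
    total = 0
    for k in range(n + 1):
        rto = min_rto_ms << k          # RTO doubles after every timeout
        if max_rto_ms is not None:
            rto = min(rto, max_rto_ms)
        total += rto
    return total


def reverted_timeout_ms(n, min_rto_ms=MIN_RTO_MS):
    """Worst case when every backoff is reverted: each interval is MIN_RTO."""
    return (n + 1) * min_rto_ms


def has_timed_out(elapsed_ms, n, max_rto_ms=None):
    """Time-based decision replacing a plain 'retransmit counter >= n' test."""
    return elapsed_ms >= backoff_timeout_ms(n, max_rto_ms=max_rto_ms)
```

For N = 15, `reverted_timeout_ms(15)` gives 3200 ms, which matches the 3.2 seconds quoted above, while `backoff_timeout_ms(15, max_rto_ms=MAX_RTO_MS)` gives 924600 ms. Comparing elapsed time against the latter keeps the timeout close to that of an unpatched, exponentially backing-off TCP even when backoffs are reverted.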
     
  • Here, an ICMP host/network unreachable message, whose payload matches
    TCP's SND.UNA, is taken as an indication that the RTO retransmission has
    not been lost due to congestion, but because of a route failure
    somewhere along the path.
    With true congestion, a router won't trigger such a message and the
    patched TCP will operate as standard TCP.

    This patch reverts one RTO backoff if an ICMP host/network unreachable
    message, whose payload matches TCP's SND.UNA, arrives.
    Based on the new RTO, the retransmission timer is reset to reflect the
    remaining time, or - if the revert clocked out the timer - a retransmission
    is sent out immediately.
    Backoffs are only reverted if TCP is in RTO loss recovery, i.e. if
    there have already been retransmissions and reversible backoffs.

    Changes from v2:
    1) Renaming of skb in tcp_v4_err() moved to another patch.
    2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
    3) Fixed code comments.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
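
A minimal sketch of the revert step described above, assuming a hypothetical `on_icmp_unreachable` helper that works on plain integers. `base_rto_ms` is the RTO before any backoff, `backoff` is the number of doublings applied so far, `elapsed_ms` is the time since the last retransmission, and the 120 s cap is an assumption.

```python
MAX_RTO_MS = 120_000  # assumed upper bound on the RTO


def current_rto_ms(base_rto_ms, backoff):
    return min(base_rto_ms << backoff, MAX_RTO_MS)


def on_icmp_unreachable(base_rto_ms, backoff, elapsed_ms, icmp_seq, snd_una):
    """Return (new_backoff, remaining_ms, retransmit_now)."""
    # Revert only if the ICMP payload refers to SND.UNA and TCP is in RTO
    # loss recovery, i.e. there is at least one backoff to revert.
    if icmp_seq == snd_una and backoff > 0:
        backoff -= 1
        remaining = current_rto_ms(base_rto_ms, backoff) - elapsed_ms
        if remaining <= 0:
            # The revert clocked out the timer: retransmit immediately.
            return backoff, 0, True
        return backoff, remaining, False
    # Otherwise leave the timer alone.
    remaining = max(current_rto_ms(base_rto_ms, backoff) - elapsed_ms, 0)
    return backoff, remaining, False
```

For example, with a 200 ms base RTO and three backoffs (RTO 1600 ms), a matching ICMP message 500 ms after the last retransmission reduces the RTO to 800 ms and leaves 300 ms on the timer; at 900 ms it triggers an immediate retransmission.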
     
  • Adds support of dcbnl setapp/getapp to dcbnl_rtnl_ops in netdev to allow
    LLDs to implement their corresponding dcbnl setapp/getapp ops to support
    the IEEE 802.1Q DCBX setapp/getapp commands.

    Signed-off-by: Yi Zou
    Acked-by: Peter P Waskiewicz Jr
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    Yi Zou
     

31 Aug, 2009

1 commit


29 Aug, 2009

3 commits


26 Aug, 2009

1 commit


25 Aug, 2009

2 commits


24 Aug, 2009

1 commit


23 Aug, 2009

8 commits

  • None of this stuff should execute in hw IRQ context, therefore
    use a tasklet_hrtimer so that it runs in softirq context.

    Signed-off-by: David S. Miller
    Acked-by: Thomas Gleixner

    David S. Miller
     
  • When using DEFER_SETUP on an RFCOMM socket, a SABM frame triggers
    authorization which, when rejected, sends a DM response. This is fine
    according to the RFCOMM spec:

    the responding implementation may replace the "proper" response
    on the Multiplexer Control channel with a DM frame, sent on the
    referenced DLCI to indicate that the DLCI is not open, and that
    the responder would not grant a request to open it later either.

    But some stacks don't seem to cope with this, leaving DLCI 0 open after
    receiving the DM frame.

    To fix it properly, a timer was introduced to rfcomm_session which is used
    to set a timeout when the last active DLC of a session is unlinked. This
    will give the remote stack some time to reply with a proper DISC frame on
    DLCI 0, avoiding both sides sending DISC to each other on stacks that
    follow the specification, and taking care of those that don't by taking
    down DLCI 0.

    Signed-off-by: Luiz Augusto von Dentz
    Signed-off-by: Marcel Holtmann

    Luiz Augusto von Dentz
     
  • Support for receiving of SREJ frames as specified by the state table.

    Signed-off-by: Gustavo F. Padovan
    Signed-off-by: Marcel Holtmann

    Gustavo F. Padovan
     
  • When L2CAP loses an I-frame we send a SREJ frame to the transmitter side
    requesting the lost packet. This patch implements all Recv I-frame events
    on the SREJ_SENT state table except the ones that deal with SendRej (the
    REJ exception at the receiver side is not yet implemented).

    Signed-off-by: Gustavo F. Padovan
    Signed-off-by: Marcel Holtmann

    Gustavo F. Padovan
     
  • Implement CRC16 check for L2CAP packets. FCS is used by Streaming Mode and
    Enhanced Retransmission Mode and is an extra check for the packet content.

    Using CRC16 is the default; L2CAP omits the FCS only when both sides send
    a "No FCS" request.

    Initially based on a patch from Nathan Holstein.

    Signed-off-by: Gustavo F. Padovan
    Signed-off-by: Marcel Holtmann

    Gustavo F. Padovan
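
A minimal sketch of the two ideas in this commit: a CRC16 frame check sequence and the "No FCS" negotiation rule. It is not the kernel code. The sketch assumes the common reflected CRC-16 with polynomial 0xA001 and initial value 0, and assumes the FCS is appended in little-endian byte order; the helper names are hypothetical.

```python
def crc16(data, crc=0x0000):
    """Bitwise reflected CRC-16 (polynomial 0xA001, assumed initial value 0)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def fcs_in_use(local_no_fcs, remote_no_fcs):
    """FCS is dropped only when both sides requested 'No FCS'."""
    return not (local_no_fcs and remote_no_fcs)


def append_fcs(frame):
    # Assumption: the 16-bit FCS follows the frame in little-endian order.
    return frame + crc16(frame).to_bytes(2, "little")


def check_fcs(frame_with_fcs):
    body, fcs = frame_with_fcs[:-2], frame_with_fcs[-2:]
    return crc16(body) == int.from_bytes(fcs, "little")
```

A receiver would call `check_fcs` on each incoming frame whenever `fcs_in_use` is true for the channel and discard frames that fail the check.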
     
  • L2CAP uses retransmission and monitor timers to query the other side
    about unacked I-frames. After sending each I-frame we (re)start the
    retransmission timer. If it expires, we start a monitor timer, send an
    S-frame with the P bit set and wait for an S-frame with the F bit set. If
    the monitor timer expires, we try again, up to a maximum of
    L2CAP_DEFAULT_MAX_TX.

    Signed-off-by: Gustavo F. Padovan
    Signed-off-by: Marcel Holtmann

    Gustavo F. Padovan
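
The interplay of the two timers can be sketched as a tiny state machine. This is an illustration under assumptions, not the L2CAP implementation: the class and method names are hypothetical, `max_tx` stands in for L2CAP_DEFAULT_MAX_TX, and giving up with a "disconnected" state once the limit is reached is an assumption the commit message does not spell out.

```python
class ErtmTimers:
    """Toy model of the retransmission and monitor timers."""

    def __init__(self, max_tx):
        self.max_tx = max_tx
        self.retry_count = 0
        self.state = "idle"   # idle | retrans | monitor | disconnected
        self.sent = []        # log of poll frames sent

    def on_iframe_sent(self):
        # Every I-frame (re)starts the retransmission timer.
        self.state = "retrans"

    def on_retrans_timeout(self):
        # Retransmission timer expired: poll the peer, start monitor timer.
        self._poll()

    def on_monitor_timeout(self):
        if self.retry_count >= self.max_tx:
            self.state = "disconnected"   # assumed behaviour at the limit
        else:
            self._poll()

    def on_final_received(self):
        # S-frame with the F bit set answers the poll.
        self.retry_count = 0
        self.state = "idle"

    def _poll(self):
        self.retry_count += 1
        self.sent.append("S-frame P=1")
        self.state = "monitor"
```

With `max_tx=3`, one retransmission timeout followed by repeated monitor timeouts sends three polls and then gives up on the fourth expiry.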
     
  • When receiving an I-frame with an unexpected txSeq, the receiver side
    starts the recovery procedure by sending a REJ S-frame to the transmitter
    side, so that the transmitter can re-send the lost I-frame.

    This patch just adds basic support for retransmission; it doesn't
    mean that ERTM now has full support for packet retransmission.

    Signed-off-by: Gustavo F. Padovan
    Signed-off-by: Marcel Holtmann

    Gustavo F. Padovan
     
  • ERTM should use Segmentation and Reassembly to break down an SDU into
    many PDUs when sending data to the other side.

    On sending packets we queue all 'segments' until the end of segmentation
    and only then add them to the queue for sending. On receiving we create a
    new SKB with the SDU reassembled.

    Initially based on a patch from Nathan Holstein.

    Signed-off-by: Gustavo F. Padovan
    Signed-off-by: Marcel Holtmann

    Gustavo F. Padovan
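
Segmentation and reassembly can be sketched with plain byte strings. This is a simplified illustration, not the kernel code: the tag names and both helpers are hypothetical, and the SDU length field that a real start segment would carry is left out.

```python
def segment(sdu, max_pdu_payload):
    """Split an SDU into (tag, payload) PDUs of at most max_pdu_payload bytes."""
    if len(sdu) <= max_pdu_payload:
        return [("unsegmented", sdu)]
    pieces = [sdu[i:i + max_pdu_payload]
              for i in range(0, len(sdu), max_pdu_payload)]
    # At least two pieces exist here: first is 'start', last is 'end'.
    tags = ["start"] + ["continue"] * (len(pieces) - 2) + ["end"]
    return list(zip(tags, pieces))


def reassemble(pdus):
    """Rebuild one SDU from an ordered sequence of (tag, payload) PDUs."""
    buf = bytearray()
    in_sdu = False
    for tag, payload in pdus:
        if tag == "unsegmented":
            if in_sdu:
                raise ValueError("unsegmented PDU inside a segmented SDU")
            return bytes(payload)
        if tag == "start":
            if in_sdu:
                raise ValueError("unexpected start segment")
            in_sdu = True
        elif not in_sdu:
            raise ValueError("segment received without a start")
        buf += payload
        if tag == "end":
            return bytes(buf)
    raise ValueError("SDU incomplete")
```

On the sending side, all segments returned by `segment` would be queued together once segmentation is finished; on the receiving side, `reassemble` plays the role of building the single buffer that holds the complete SDU.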