05 Jul, 2011

1 commit

  • This patch rearranges the order of statements of the slow-path input processing
    (i.e. any other state than OPEN), to resolve the following issues.

    1. Dependencies: the order of statements now better matches RFC 4340, 8.5, i.e.
    step 7 is before step 9 (previously 9 was before 7), and parsing options in
    step 8 (which may consume resources) now comes after step 7.
    2. Sequence number checks are omitted if in state LISTEN/REQUEST, due to the
    note underneath the table in RFC 4340, 7.5.3.
    As a result, CCID processing is now indeed confined to OPEN/PARTOPEN states,
    i.e. congestion control is performed only on the flow of data packets. This
    avoids pathological cases of doing congestion control on those messages
    which set up and terminate the connection.
    3. Packets are now passed on to Ack Vector / CCID processing only after
    - step 7 (receive unexpected packets),
    - step 9 (receive Reset),
    - step 13 (receive CloseReq),
    - step 14 (receive Close)
    and only if the state is PARTOPEN. This simplifies CCID processing:
    - in LISTEN/CLOSED the CCIDs are non-existent;
    - in RESPOND/REQUEST the CCIDs have not yet been negotiated;
    - in CLOSEREQ and active-CLOSING the node has already closed this socket;
    - in passive-CLOSING the client is waiting for its Reset.
    In the last case, RFC 4340, 8.3 leaves it open to ignore further incoming
    data, which is the approach taken here.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     

02 Mar, 2011

1 commit

  • This fixes a bug in the order of dccp_rcv_state_process() that still permitted
    reception even after closing the socket. A Reset after close thus causes a NULL
    pointer dereference by not preventing operations on an already torn-down socket.

    dccp_v4_do_rcv()
    |
    | state other than OPEN
    v
    dccp_rcv_state_process()
    |
    | DCCP_PKT_RESET
    v
    dccp_rcv_reset()
    |
    v
    dccp_time_wait()

    WARNING: at net/ipv4/inet_timewait_sock.c:141 __inet_twsk_hashdance+0x48/0x128()
    Modules linked in: arc4 ecb carl9170 rt2870sta(C) mac80211 r8712u(C) crc_ccitt ah
    [] (unwind_backtrace+0x0/0xec) from [] (warn_slowpath_common)
    [] (warn_slowpath_common+0x4c/0x64) from [] (warn_slowpath_n)
    [] (warn_slowpath_null+0x1c/0x24) from [] (__inet_twsk_hashd)
    [] (__inet_twsk_hashdance+0x48/0x128) from [] (dccp_time_wai)
    [] (dccp_time_wait+0x40/0xc8) from [] (dccp_rcv_state_proces)
    [] (dccp_rcv_state_process+0x120/0x538) from [] (dccp_v4_do_)
    [] (dccp_v4_do_rcv+0x11c/0x14c) from [] (release_sock+0xac/0)
    [] (release_sock+0xac/0x110) from [] (dccp_close+0x28c/0x380)
    [] (dccp_close+0x28c/0x380) from [] (inet_release+0x64/0x70)

    The fix is by testing the socket state first. Receiving a packet in Closed state
    now also produces the required "No connection" Reset reply of RFC 4340, 8.3.1.

    Reported-and-tested-by: Johan Hovold
    Cc: stable@kernel.org
    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     

07 Jan, 2011

1 commit

  • Currently dccp_check_seqno returns 0 (indicating a valid packet) if the
    acknowledgment number is out of bounds and the sync that RFC 4340 mandates at
    this point is currently being rate-limited. This function should return -1,
    indicating an invalid packet.

    Signed-off-by: Samuel Jero
    Acked-by: Gerrit Renker

    Samuel Jero
     

09 Dec, 2010

1 commit


29 Nov, 2010

1 commit

  • This fixes a bug in updating the Greatest Acknowledgment number Received (GAR):
    the current implementation does not track the greatest received value -
    lower values in the range AWL..AWH (RFC 4340, 7.5.1) erase higher ones.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     

15 Nov, 2010

2 commits

  • This aggregates Ack Vector processing (handling input and clearing old state)
    into one function, for the following reasons and benefits:
    * all Ack Vector-specific processing is now in one place;
    * duplicated code is removed;
    * ensuring sanity: from an Ack Vector point of view, it is better to clear the
    old state first before entering new state;
    * Ack Event handling happens mostly within the CCIDs, not the main DCCP module.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This provides a routine to consistently update the buffer state when the
    peer acknowledges receipt of Ack Vectors; updating state in the list of Ack
    Vectors as well as in the circular buffer.

    While based on RFC 4340, several additional (and necessary) precautions were
    added to protect the consistency of the buffer state. These additions are
    essential, since analysis and experience showed that the basic algorithm was
    insufficient for this task (which lead to problems that were hard to debug).

    The algorithm now
    * deals with HC-sender acknowledging to HC-receiver and vice versa,
    * keeps track of the last unacknowledged but received seqno in tail_ackno,
    * has special cases to reset the overflow condition when appropriate,
    * is protected against receiving older information (would mess up buffer state).

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     

11 Nov, 2010

1 commit

  • This patch brings the Ack Vector interface up to date. Its main purpose is
    to lay the basis for the subsequent patches of this set, which will use the
    new data structure fields and routines.

    There are no real algorithmic changes, rather an adaptation:

    (1) Replaced the static Ack Vector size (2) with a #define so that it can
    be adapted (with low loss / Ack Ratio, a value of 1 works, so 2 seems
    to be sufficient for the moment) and added a solution so that computing
    the ECN nonce will continue to work - even with larger Ack Vectors.

    (2) Replaced the #defines for Ack Vector states with a complete enum.

    (3) Replaced #defines to compute Ack Vector length and state with general
    purpose routines (inlines), and updated code to use these.

    (4) Added a `tail' field (conversion to circular buffer in subsequent patch).

    (5) Updated the (outdated) documentation for Ack Vector struct.

    (6) All sequence number containers now trimmed to 48 bits.

    (7) Removal of unused bits:
    * removed dccpav_ack_nonce from struct dccp_ackvec, since this is already
    redundantly stored in the `dccpavr_ack_nonce' (of Ack Vector record);
    * removed Elapsed Time for Ack Vectors (it was nowhere used);
    * replaced semantics of dccpavr_sent_len with dccpavr_ack_runlen, since
    the code needs to be able to remember the old run length;
    * reduced the de-/allocation routines (redundant / duplicate tests).

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     

12 Oct, 2010

2 commits

  • This omits the redundant "DCCP:" in warning messages, since DCCP_WARN() already
    echoes the function name, avoiding messages like

    kernel: [10988.766503] dccp_close: DCCP: ABORT -- 209 bytes unread

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This fixes a problem and a potential loophole with regard to seqno/ackno
    validity: currently the initial adjustments to AWL/SWL are only performed
    once at the begin of the connection, during the handshake.

    Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
    it is however necessary to perform these adjustments at least for the first
    W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

    This requirement is complicated by the fact that W/W' can change at any time
    during the lifetime of a connection.

    Therefore it is better to perform that safety check each time SWL/AWL are
    updated, as implemented by the patch.

    A second problem solved by this patch is that the remote/local Sequence Window
    feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
    feature negotiation has completed.

    During the initial handshake we have more stringent sequence number protection;
    the changes added by this patch effect that {A,S}W{L,H} are within the correct
    bounds at the instant that feature negotiation completes (since the SeqWin
    feature activation handlers call dccp_update_gsr/gss()).

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     

26 Jun, 2010

1 commit

  • This patch is thanks to Andre Noll who reported the issue and helped testing.

    The Syn-RTT sampled during the initial handshake currently only works for
    the client sending the DCCP-Request. TFRC penalizes the absence of an RTT
    sample with a very slow initial speed (1 packet per second), which delays
    slow-start significantly, resulting in sluggish performance.

    This patch mirrors the "Syn RTT" principle by adding a timestamp also onto
    the DCCP-Response, producing an RTT sample when the (Data)Ack completing
    the handshake arrives.

    Also changed the documentation to 'TFRC' since Syn RTTs are also used by CCID-4.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     

25 May, 2010

1 commit


12 Apr, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2010

1 commit


05 Jan, 2009

1 commit


08 Dec, 2008

2 commits

  • This removes the use of the sysctl and the minisock variable for the Send Ack
    Vector feature, as it now is handled fully dynamically via feature negotiation
    (i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
    RFC 4341, 4.).

    Using a sysctl in parallel to this implementation would open the door to
    crashes, since much of the code relies on tests of the boolean minisock /
    sysctl variable. Thus, this patch replaces all tests of type

    if (dccp_msk(sk)->dccpms_send_ack_vector)
    /* ... */
    with
    if (dp->dccps_hc_rx_ackvec != NULL)
    /* ... */

    The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
    negotiation concluded that Ack Vectors are to be used on the half-connection.
    Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
    so that the test is a valid one.

    The activation handler for Ack Vectors is called as soon as the feature
    negotiation has concluded at the
    * server when the Ack marking the transition RESPOND => OPEN arrives;
    * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

    Adding the sequence number of the Response packet to the Ack Vector has been
    removed, since
    (a) connection establishment implies that the Response has been received;
    (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
    this entry will always be ignored;
    (c) it can not be used for anything useful - to detect loss for instance, only
    packets received after the loss can serve as pseudo-dupacks.

    There was a FIXME to change the error code when dccp_ackvec_add() fails.
    I removed this after finding out that:
    * the check whether ackno < ISN is already made earlier,
    * this Response is likely the 1st packet with an Ackno that the client gets,
    * so when dccp_ackvec_add() fails, the reason is likely not a packet error.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This integrates feature-activation in the client:

    1. When dccp_parse_options() fails, the reset code is already set; request_sent\
    _state_process() currently overrides this with `Packet Error', which is not
    intended - changed to use the reset code supplied by dccp_parse_options().

    2. When feature negotiation fails, the socket should be marked as not usable,
    so that the application is notified that an error occurred. This is achieved
    by a new label 'unable_to_proceed': generating an error code of `Aborted',
    setting the socket state to CLOSED, returning with ECOMM in sk_err.

    3. Avoids parsing the Ack twice in Respond state by not doing option processing
    again in dccp_rcv_respond_partopen_state_process (as option processing has
    already been done on the request_sock in dccp_check_req).

    Since this addresses congestion-control initialisation, a corresponding
    FIXME has been removed.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

05 Nov, 2008

1 commit

  • This provides feature-negotiation initialisation for both DCCP sockets
    and DCCP request_sockets, to support feature negotiation during
    connection setup.

    It also resolves a FIXME regarding the congestion control
    initialisation.

    Thanks to Wei Yongjun for help with the IPv6 side of this patch.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

09 Sep, 2008

1 commit


04 Sep, 2008

10 commits

  • This extracts the clamping part of dccp_sample_rtt() and makes it available
    to other parts of the code (as e.g. used in the next patch).

    Note: The function dccp_sample_rtt() now reduces to subtracting the elapsed
    time. This could be eliminated but would require shorter prefixes and thus
    is not done by this patch - maybe an idea for later.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This patch rearranges the order of statements of the slow-path input processing
    (i.e. any other state than OPEN), to resolve the following issues.

    1. Dependencies: the order of statements now better matches RFC 4340, 8.5, i.e.
    step 7 is before step 9 (previously 9 was before 7), and parsing options in
    step 8 (which can consume resources) now comes after step 7.
    2. Bug-fix: in state CLOSED, there should not be any sequence number checking
    or option processing. This is why the test for CLOSED has been moved after
    the test for LISTEN.
    3. As before sequence number checks are omitted if in state LISTEN/REQUEST, due
    to the note underneath the table in RFC 4340, 7.5.3.
    4. Packets are now passed on to Ack Vector / CCID processing only after
    - step 7 (receive unexpected packets),
    - step 9 (receive Reset),
    - step 13 (receive CloseReq),
    - step 14 (receive Close)
    and only if the state is PARTOPEN. This simplifies CCID processing:
    - in LISTEN/CLOSED the CCIDs are non-existent;
    - in RESPOND/REQUEST the CCIDs have not yet been negotiated;
    - in CLOSEREQ and active-CLOSING the node has already closed this socket;
    - in passive-CLOSING the client is waiting for its Reset.
    In the last case, RFC 4340, 8.3 leaves it open to ignore further incoming
    data, which is the approach taken here.

    As a result of (3), CCID processing is now indeed confined to OPEN/PARTOPEN
    states, i.e. congestion control is performed only on the flow of data packets.

    This avoids pathological cases of doing congestion control on those messages
    which set up and terminate the connection.

    I have done a few checks to see if this creates a problem in other parts of
    the code. This seems not to be the case; even if there were one, it would be
    better to fix it than to perform congestion control on Close/Request/Response
    messages. Similarly for Ack Vectors (as they depend on the negotiated CCID).

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This aggregates Ack Vector processing (handling input and clearing old state)
    into one function, for the following reasons and benefits:
    * all Ack Vector-specific processing is now in one place;
    * duplicated code is removed;
    * ensuring sanity: from an Ack Vector point of view, it is better to clear the
    old state first before entering new state;
    * Ack Event handling happens mostly within the CCIDs, not the main DCCP module.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This provides a routine to consistently update the buffer state when the
    peer acknowledges receipt of Ack Vectors; updating state in the list of Ack
    Vectors as well as in the circular buffer.

    While based on RFC 4340, several additional (and necessary) precautions were
    added to protect the consistency of the buffer state. These additions are
    essential, since analysis and experience showed that the basic algorithm was
    insufficient for this task (which lead to problems that were hard to debug).

    The algorithm now
    * deals with HC-sender acknowledging to HC-receiver and vice versa,
    * keeps track of the last unacknowledged but received seqno in tail_ackno,
    * has special cases to reset the overflow condition when appropriate,
    * is protected against receiving older information (would mess up buffer state).

    Note: The older code performed an unnecessary step, where the sender cleared
    Ack Vector state by parsing the Ack Vector received by the HC-receiver. Doing
    this was entirely redundant, since
    * the receiver always puts the full acknowledgment window (groups 2,3 in 11.4.2)
    into the Ack Vectors it sends; hence the HC-receiver is only interested in the
    highest state that the HC-sender received;
    * this means that the acknowledgment number on the (Data)Ack from the HC-sender
    is sufficient; and work done in parsing earlier state is not necessary, since
    the later state subsumes the earlier one (see also RFC 4340, A.4).
    This older interface (dccp_ackvec_parse()) is therefore removed.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This patch brings the Ack Vector interface up to date. Its main purpose is
    to lay the basis for the subsequent patches of this set, which will use the
    new data structure fields and routines.

    There are no real algorithmic changes, rather an adaptation:

    (1) Replaced the static Ack Vector size (2) with a #define so that it can
    be adapted (with low loss / Ack Ratio, a value of 1 works, so 2 seems
    to be sufficient for the moment) and added a solution so that computing
    the ECN nonce will continue to work - even with larger Ack Vectors.

    (2) Replaced the #defines for Ack Vector states with a complete enum.

    (3) Replaced #defines to compute Ack Vector length and state with general
    purpose routines (inlines), and updated code to use these.

    (4) Added a `tail' field (conversion to circular buffer in subsequent patch).

    (5) Updated the (outdated) documentation for Ack Vector struct.

    (6) All sequence number containers now trimmed to 48 bits.

    (7) Removal of unused bits:
    * removed dccpav_ack_nonce from struct dccp_ackvec, since this is already
    redundantly stored in the `dccpavr_ack_nonce' (of Ack Vector record);
    * removed Elapsed Time for Ack Vectors (it was nowhere used);
    * replaced semantics of dccpavr_sent_len with dccpavr_ack_runlen, since
    the code needs to be able to remember the old run length;
    * reduced the de-/allocation routines (redundant / duplicate tests).

    Justification for removing Elapsed Time information [can be removed]:
    ---------------------------------------------------------------------
    1. The Elapsed Time information for Ack Vectors was nowhere used in the code.
    2. DCCP does not implement rate-based pacing of acknowledgments. The only
    recommendation for always including Elapsed Time is in section 11.3 of
    RFC 4340: "Receivers that rate-pace acknowledgements SHOULD [...]
    include Elapsed Time options". But such is not the case here.
    3. It does not really improve estimation accuracy. The Elapsed Time field only
    records the time between the arrival of the last acknowledgeable packet and
    the time the Ack Vector is sent out. Since Linux does not (yet) implement
    delayed Acks, the time difference will typically be small, since often the
    arrival of a data packet triggers sending feedback at the HC-receiver.

    Justification for changes in de-/allocation routines [can be removed]:
    ----------------------------------------------------------------------
    * INIT_LIST_HEAD in dccp_ackvec_record_new was redundant, since the list
    pointers were later overwritten when the node was added via list_add();
    * dccp_ackvec_record_new() was called in a single place only;
    * calls to list_del_init() before calling dccp_ackvec_record_delete() were
    redundant, since subsequently the entire element was k-freed;
    * since all calls to dccp_ackvec_record_delete() were preceded to a call to
    list_del_init(), the WARN_ON test would never evaluate to true;
    * since all calls to dccp_ackvec_record_delete() were made from within
    list_for_each_entry_safe(), the test for avr == NULL was redundant;
    * list_empty() in ackvec_free was redundant, since the same condition is
    embedded in the loop condition of the subsequent list_for_each_entry_safe().

    Signed-off-by: Gerrit Renker

    Gerrit Renker
     
  • This fixes a problem and a potential loophole with regard to seqno/ackno
    validity: the problem is that the initial adjustments to AWL/SWL were
    only performed at the begin of the connection, during the handshake.

    Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
    it is however necessary to perform these adjustments at least for the first
    W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

    This requirement is complicated by the fact that W/W' can change at any time
    during the lifetime of a connection.

    Therefore the consequence is to perform this safety check each time SWL/AWL
    are updated.

    A second problem solved by this patch is that the remote/local Sequence Window
    feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
    feature negotiation has completed.

    During the initial handshake we have more stringent sequence number protection,
    the changes added by this patch effect that {A,S}W{L,H} are within the correct
    bounds at the instant that feature negotiation completes (since the SeqWin
    feature activation handlers call dccp_update_gsr/gss()).

    A detailed rationale is below -- can be removed from the commit message.

    1. Server sequence number checks during initial handshake
    ---------------------------------------------------------
    The server can not use the fields of the listening socket for seqno/ackno checks
    and thus needs to store all relevant information on a per-connection basis on
    the dccp_request socket. This is a size-constrained structure and has currently
    only ISS (dreq_iss) and ISR (dreq_isr) defined.
    Adding further fields (SW{L,H}, AW{L,H}) would increase the size of the struct
    and it is questionable whether this will have any practical gain. The currently
    implemented solution is as follows.
    * receiving first Request: dccp_v{4,6}_conn_request sets
    ISR := P.seqno, ISS := dccp_v{4,6}_init_sequence()

    * sending first Response: dccp_v{4,6}_send_response via dccp_make_response()
    sets P.seqno := ISS, sets P.ackno := ISR

    * receiving retransmitted Request: dccp_check_req() overrides ISR := P.seqno

    * answering retransmitted Request: dccp_make_response() sets ISS += 1,
    otherwise as per first Response

    * completing the handshake: succeeds in dccp_check_req() for the first Ack
    where P.ackno == ISS (P.seqno is not tested)

    * creating child socket: ISS, ISR are copied from the request_sock

    This solution will succeed whenever the server can receive the Request and the
    subsequent Ack in succession, without retransmissions. If there is packet loss,
    the client needs to retransmit until this condition succeeds; it will otherwise
    eventually give up. Adding further fields to the request_sock could increase
    the robustness a bit, in that it would make possible to let a reordered Ack
    (from a retransmitted Response) pass. The argument against such a solution is
    that if the packet loss is not persistent and an Ack gets through, why not
    wait for the one answering the original response: if the loss is persistent, it
    is probably better to not start the connection in the first place.

    Long story short: the present design (by Arnaldo) is simple and will likely work
    just as well as a more complicated solution. As a consequence, {A,S}W{L,H} are
    not needed until the moment the request_sock is cloned into the accept queue.

    At that stage feature negotiation has completed, so that the values for the local
    and remote Sequence Window feature (7.5.2) are known, i.e. we are now in a better
    position to compute {A,S}W{L,H}.

    2. Client sequence number checks during initial handshake
    ---------------------------------------------------------
    Until entering PARTOPEN the client does not need the adjustments, since it
    constrains the Ack window to the packet it sent.

    * sending first Request: dccp_v{4,6}_connect() choose ISS,
    dccp_connect() then sets GAR := ISS (as per 8.5),
    dccp_transmit_skb() (with the previous bug fix) sets
    GSS := ISS, AWL := ISS, AWH := GSS
    * n-th retransmitted Request (with previous patch):
    dccp_retransmit_skb() via timer calls
    dccp_transmit_skb(), which sets GSS := ISS+n
    and then AWL := ISS, AWH := ISS+n

    * receiving any Response: dccp_rcv_request_sent_state_process()
    -- accepts packet if AWL

    Gerrit Renker
     
  • This removes the use of the sysctl and the minisock variable for the Send Ack
    Vector feature, which is now handled fully dynamically via feature negotiation;
    i.e. when CCID2 is enabled, Ack Vectors are automatically enabled (as per
    RFC 4341, 4.).

    Using a sysctl in parallel to this implementation would open the door to
    crashes, since much of the code relies on tests of the boolean minisock /
    sysctl variable. Thus, this patch replaces all tests of type

    if (dccp_msk(sk)->dccpms_send_ack_vector)
    /* ... */
    with
    if (dp->dccps_hc_rx_ackvec != NULL)
    /* ... */

    The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
    negotiation concluded that Ack Vectors are to be used on the half-connection.
    Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
    so that the test is a valid one.

    The activation handler for Ack Vectors is called as soon as the feature
    negotiation has concluded at the
    * server when the Ack marking the transition RESPOND => OPEN arrives;
    * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

    Adding the sequence number of the Response packet to the Ack Vector has been
    removed, since
    (a) connection establishment implies that the Response has been received;
    (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
    this entry will always be ignored;
    (c) it can not be used for anything useful - to detect loss for instance, only
    packets received after the loss can serve as pseudo-dupacks.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald

    Gerrit Renker
     
  • This integrates feature-activation in the client, with these details:

    1. When dccp_parse_options() fails, the reset code is already set, request_sent
    _state_process() currently overrides this with `Packet Error', which is not
    intended - so changed to use the reset code set in dccp_parse_options();

    2. There was a FIXME to change the error code when dccp_ackvec_add() fails.
    I have looked this up and found that:
    * the check whether ackno < ISN is already made earlier,
    * this Response is likely the 1st packet with an Ackno that the client gets,
    * so when dccp_ackvec_add() fails, the reason is likely not a packet error.

    3. When feature negotiation fails, the socket should be marked as not usable,
    so that the application is notified that an error occurs. This is achieved
    by a new label, which uses an error code of `Aborted' and which sets the
    socket state to CLOSED, as well as sk_err.

    4. Avoids parsing the Ack twice in Respond state by not doing option processing
    again in dccp_rcv_respond_partopen_state_process (as option processing has
    already been done on the request_sock in dccp_check_req).

    Since this addresses congestion-control initialisation, a corresponding
    FIXME has been removed.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald

    Gerrit Renker
     
  • This provides feature-negotiation initialisation for both DCCP sockets and
    DCCP request_sockets, to support feature negotiation during connection setup.

    It also resolves a FIXME regarding the congestion control initialisation.

    Thanks to Wei Yongjun for help with the IPv6 side of this patch.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald

    Gerrit Renker
     
  • RFC4340 states that if a packet is received with an option error (such as a
    Mandatory Option as the last byte of the option list), the endpoint should
    repond with a Reset.

    In the LISTEN and RESPOND states, the endpoint correctly reponds with Reset,
    while in the REQUEST/OPEN states, packets with option errors are just ignored.

    The packet sequence is as follows:

    Case 1:

    Endpoint A Endpoint B
    (CLOSED) (CLOSED)

    (*1)
    (with invalid option)
    (*2)
    (with invalid option)

    Acked-by: Arnaldo Carvalho de Melo
    Acked-by: Gerrit Renker

    Wei Yongjun
     

19 Aug, 2008

1 commit

  • Thanks is due to Wei Yongjun for the detailed analysis and description of this
    bug at http://marc.info/?l=dccp&m=121739364909199&w=2

    The problem is that invalid packets received by a client in state REQUEST cause
    the retransmission timer for the DCCP-Request to be reset. This includes freeing
    the Request-skb ( in dccp_rcv_request_sent_state_process() ). As a consequence,
    * the arrival of further packets cause a double-free, triggering a panic(),
    * the connection then may hang, since further retransmissions are blocked.

    This patch changes the order of statements so that the retransmission timer is
    reset, and the pending Request freed, only if a valid Response has arrived (or
    the number of sysctl-retries has been exhausted).

    Further changes:
    ----------------
    To be on the safe side, replaced __kfree_skb with kfree_skb so that if due to
    unexpected circumstances the sk_send_head is NULL the WARN_ON is used instead.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     

26 Jul, 2008

1 commit

  • Removes legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better to automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuively BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON() though some might actually be
    promoted to BUG_ON() but I left that to future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

29 Jan, 2008

6 commits

  • The option parsing code currently only parses on full sk's. This causes a problem for
    options sent during the initial handshake (in particular timestamps and feature-negotiation
    options). Therefore, this patch extends the option parsing code with an additional argument
    for request_socks: if it is non-NULL, options are parsed on the request socket, otherwise
    the normal path (parsing on the sk) is used.

    Subsequent patches, which implement feature negotiation during connection setup, make use
    of this facility.

    Signed-off-by: Gerrit Renker
    Signed-off-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch performs two changes:

    1) Close the write-end in addition to the read-end when a fin-like segment
    (Close or CloseReq) is received by DCCP. This accounts for the fact that DCCP,
    in contrast to TCP, does not have a half-close. RFC 4340 says in this respect
    that when a fin-like segment has been sent there is no guarantee at all that
    any further data will be processed.
    Thus this patch performs SHUT_WR in addition to the SHUT_RD when a fin-like
    segment is encountered.

    2) Minor change: I noted that code appears twice in different places and think it
    makes sense to put this into a self-contained function (dccp_enqueue()).

    Signed-off-by: Gerrit Renker
    Signed-off-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This removes a redundant test for unexpected packet types. In dccp_rcv_state_process
    it is tested twice whether a DCCP-server has received a CloseReq (Step 7):

    * first in the combined if-statement,
    * then in the call to dccp_rcv_closereq().

    The latter is necesssary since dccp_rcv_closereq() is also called from
    __dccp_rcv_established().

    This patch removes the duplicate test.

    Signed-off-by: Gerrit Renker
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This adds the necessary state transitions for the two forms of passive-close

    * PASSIVE_CLOSE - which is entered when a host receives a Close;
    * PASSIVE_CLOSEREQ - which is entered when a client receives a CloseReq.

    Here is a detailed account of what the patch does in each state.

    1) Receiving CloseReq

    The pseudo-code in 8.5 says:

    Step 13: Process CloseReq
    If P.type == CloseReq and S.state < CLOSEREQ,
    Generate Close
    S.state := CLOSING
    Set CLOSING timer.

    This means we need to address what to do in CLOSED, LISTEN, REQUEST, RESPOND, PARTOPEN, and OPEN.

    * CLOSED: silently ignore - it may be a late or duplicate CloseReq;
    * LISTEN/RESPOND: will not appear, since Step 7 is performed first (we know we are the client);
    * REQUEST: perform Step 13 directly (no need to enqueue packet);
    * OPEN/PARTOPEN: enter PASSIVE_CLOSEREQ so that the application has a chance to process unread data.

    When already in PASSIVE_CLOSEREQ, no second CloseReq is enqueued. In any other state, the CloseReq is ignored.
    I think that this offers some robustness against rare and pathological cases: e.g. a simultaneous close where
    the client sends a Close and the server a CloseReq. The client will then be retransmitting its Close until it
    gets the Reset, so ignoring the CloseReq while in state CLOSING is sane.

    2) Receiving Close

    The code below from 8.5 is unconditional.

    Step 14: Process Close
    If P.type == Close,
    Generate Reset(Closed)
    Tear down connection
    Drop packet and return

    Thus we need to consider all states:
    * CLOSED: silently ignore, since this can happen when a retransmitted or late Close arrives;
    * LISTEN: dccp_rcv_state_process() will generate a Reset ("No Connection");
    * REQUEST: perform Step 14 directly (no need to enqueue packet);
    * RESPOND: dccp_check_req() will generate a Reset ("Packet Error") -- left it at that;
    * OPEN/PARTOPEN: enter PASSIVE_CLOSE so that application has a chance to process unread data;
    * CLOSEREQ: server performed active-close -- perform Step 14;
    * CLOSING: simultaneous-close: use a tie-breaker to avoid message ping-pong (see comment);
    * PASSIVE_CLOSEREQ: ignore - the peer has a bug (sending first a CloseReq and now a Close);
    * TIMEWAIT: packet is ignored.

    Note that the condition of receiving a packet in state CLOSED here is different from the condition "there
    is no socket for such a connection": the socket still exists, but its state indicates it is unusable.

    Last, dccp_finish_passive_close sets either DCCP_CLOSED or DCCP_CLOSING = TCP_CLOSING, so that
    sk_stream_wait_close() will wait for the final Reset (which will trigger CLOSING => CLOSED).

    Signed-off-by: Gerrit Renker
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • The sock_wake_async() performs a bit different actions
    depending on "how" argument. Unfortunately this argument
    ony has numerical magic values.

    I propose to give names to their constants to help people
    reading this function callers understand what's going on
    without looking into this function all the time.

    I suppose this is 2.6.25 material, but if it's not (or the
    naming seems poor/bad/awful), I can rework it against the
    current net-2.6 tree.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This extends the DCCP socket API by honouring any shutdown(2) option set by the user.
    The behaviour is, as much as possible, made consistent with the API for TCP's shutdown.

    This patch exploits the information provided by the user via the socket API to reduce
    processing costs:
    * if the read end is closed (SHUT_RD), it is not necessary to deliver to input CCID;
    * if the write end is closed (SHUT_WR), the same idea applies, but with a difference -
    as long as the TX queue has not been drained, we need to receive feedback to keep
    congestion-control rates up to date. Hence SHUT_WR is honoured only after the last
    packet (under congestion control) has been sent;
    * although SHUT_RDWR seems nonsensical, it is nevertheless supported in the same manner
    as for TCP (and agrees with test for SHUTDOWN_MASK in dccp_poll() in net/dccp/proto.c).

    Furthermore, most of the code already honours the sk_shutdown flags (dccp_recvmsg() for
    instance sets the read length to 0 if SHUT_RD had been called); CCID handling is now added
    to this by the present patch.

    There will also no longer be any delivery when the socket is in the final stages, i.e. when
    one of dccp_close(), dccp_fin(), or dccp_done() has been called - which is fine since at
    that stage the connection is its final stages.

    Motivation and background are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/shutdown

    A FIXME has been added to notify the other end if SHUT_RD has been set (RFC 4340, 11.7).

    Note: There is a comment in inet_shutdown() in net/ipv4/af_inet.c which asks to "make
    sure the socket is a TCP socket". This should probably be extended to mean
    `TCP or DCCP socket' (the code is also used by UDP and raw sockets).

    Signed-off-by: Gerrit Renker
    Signed-off-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     

24 Oct, 2007

1 commit

  • This adds support for converting the 11 currently defined Reset codes into system
    error numbers, which are stored in sk_err for further interpretation.

    This makes the externally visible API behaviour similar to TCP, since a client
    connecting to a non-existing port will experience ECONNREFUSED.

    * Code 0, Unspecified, is interpreted as non-error (0);
    * Code 1, Closed (normal termination), also maps into 0;
    * Code 2, Aborted, maps into "Connection reset by peer" (ECONNRESET);
    * Code 3, No Connection and
    Code 7, Connection Refused, map into "Connection refused" (ECONNREFUSED);
    * Code 4, Packet Error, maps into "No message of desired type" (ENOMSG);
    * Code 5, Option Error, maps into "Illegal byte sequence" (EILSEQ);
    * Code 6, Mandatory Error, maps into "Operation not supported on transport endpoint" (EOPNOTSUPP);
    * Code 8, Bad Service Code, maps into "Invalid request code" (EBADRQC);
    * Code 9, Too Busy, maps into "Too many users" (EUSERS);
    * Code 10, Bad Init Cookie, maps into "Invalid request descriptor" (EBADR);
    * Code 11, Aggression Penalty, maps into "Quota exceeded" (EDQUOT)
    which makes sense in terms of using more than the `fair share' of bandwidth.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo

    Gerrit Renker
     

18 Oct, 2007

1 commit

  • Do not define the sysctl_dccp_sync_ratelimit sysctl variable in the
    CONFIG_SYSCTL dependent sysctl.c module - move it to input.c instead.

    This fixes the following build bug:

    net/built-in.o: In function `dccp_check_seqno':
    input.c:(.text+0xbd859): undefined reference to `sysctl_dccp_sync_ratelimit'
    distcc[29953] ERROR: compile (null) on localhost failed
    make: *** [vmlinux] Error 1

    Found via 'make randconfig' build testing.

    Signed-off-by: Ingo Molnar
    Acked-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Ingo Molnar