Eric Lee / smarc-fsl-linux-kernel

04 Sep, 2008

40 commits

c8bf462bc dccp ccid-2: Separate option parsing from CCID processing ... Browse Code »

This patch replaces an almost identical replication of code: large parts
of dccp_parse_options() re-appeared as ccid2_ackvector() in ccid2.c.

Apart from the duplication, this caused two more problems:
1. CCIDs should not need to be concerned with parsing header options;
2. one can not assume that Ack Vectors appear as a contiguous area within an
skb, it is legal to insert other options and/or padding in between. The
current code would throw an error and stop reading in such a case.

The patch provides a new data structure and associated list housekeeping.

Only small changes were necessary to integrate with CCID-2: data structure
initialisation, adapt list traversal routine, and add call to the provided
cleanup routine.

The latter also lead to fixing the following BUG: CCID-2 so far ignored
Ack Vectors on all packets other than Ack/DataAck, which is incorrect,
since Ack Vectors can be present on any packet that has an Ack field.

Details:
--------
* received Ack Vectors are parsed by dccp_parse_options() alone, which passes
the result on to the CCID-specific routine ccid_hc_tx_parse_options();
* CCIDs interested in using/decoding Ack Vector information will add code
to fetch parsed Ack Vectors via this interface;
* a data structure, `struct dccp_ackvec_parsed' is provided as interface;
* this structure arranges Ack Vectors of the same skb into a FIFO order;
* a doubly-linked list is used to keep the required FIFO code small.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:37 +0800
5a577b488 dccp ccid-2: Remove old infrastructure ... Browse Code »

This removes
* functions for which updates have been provided in the preceding patches and
* the @av_vec_len field - it is no longer necessary since the buffer length is
now always computed dynamically;
* conditional debugging code (CONFIG_IP_DCCP_ACKVEC).

The reason for removing the conditional debugging code is that Ack Vectors are
an almost inevitable necessity - RFC 4341 says that for CCID-2, Ack Vectors must
be used. Furthermore, the code would be only interesting for coding - after some
extensive testing with this patch set, having the debug code around is no longer
of real help.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:37 +0800
c2f42077b dccp ccid-2: Schedule Sync as out-of-band mechanism ... Browse Code »

The problem with Ack Vectors is that

i) their length is variable and can in principle grow quite large,
ii) it is hard to predict exactly how large they will be.

Due to the second point it seems not a good idea to reduce the MPS; in
particular when on average there is enough room for the Ack Vector and an
increase in length is momentarily due to some burst loss, after which the
Ack Vector returns to its normal/average length.

The solution taken by this patch is to subtract a minimum-expected Ack Vector
length from the MPS (previous patch), and to defer any larger Ack Vectors onto
a separate Sync - but only if indeed there is no space left on the skb.

This patch provides the infrastructure to schedule Sync-packets for transporting
(urgent) out-of-band data. Its signalling is quicker than scheduling an Ack, since
it does not need to wait for new application data.

It can thus serve other parts of the DCCP code as well.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:37 +0800
283fb4a5f dccp ccid-2: Consolidate Ack-Vector processing within main DCCP module ... Browse Code »

This aggregates Ack Vector processing (handling input and clearing old state)
into one function, for the following reasons and benefits:
* all Ack Vector-specific processing is now in one place;
* duplicated code is removed;
* ensuring sanity: from an Ack Vector point of view, it is better to clear the
old state first before entering new state;
* Ack Event handling happens mostly within the CCIDs, not the main DCCP module.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:37 +0800
e28fe59f9 dccp ccid-2: Update code for the Ack Vector input/registration routine ... Browse Code »

This patch uupdates the code which registers new packets as received, using the
new circular buffer interface. It contributes a new algorithm which
* supports both tail/head pointers and buffer wrap-around and
* deals with overflow (head/tail move in lock-step).

The updated code is also partioned differently, into
1. dealing with the empty buffer,
2. adding new packets into non-empty buffer,
3. reserving space when encountering a `hole' in the sequence space,
4. updating old state and deciding when old state is irrelevant.

Protection against large burst losses: With regard to (3), it is too costly to
reserve space when there are large bursts of losses. When bursts get too large,
the code does no longer reserve space and just fills in cells normally. This
measure reduces space consumption by a factor of 63.

The code reuses in part the previous implementation by Arnaldo de Melo.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:37 +0800
68b1de157 dccp ccid-2: Algorithm to update buffer state ... Browse Code »

This provides a routine to consistently update the buffer state when the
peer acknowledges receipt of Ack Vectors; updating state in the list of Ack
Vectors as well as in the circular buffer.

While based on RFC 4340, several additional (and necessary) precautions were
added to protect the consistency of the buffer state. These additions are
essential, since analysis and experience showed that the basic algorithm was
insufficient for this task (which lead to problems that were hard to debug).

The algorithm now
* deals with HC-sender acknowledging to HC-receiver and vice versa,
* keeps track of the last unacknowledged but received seqno in tail_ackno,
* has special cases to reset the overflow condition when appropriate,
* is protected against receiving older information (would mess up buffer state).

Note: The older code performed an unnecessary step, where the sender cleared
Ack Vector state by parsing the Ack Vector received by the HC-receiver. Doing
this was entirely redundant, since
* the receiver always puts the full acknowledgment window (groups 2,3 in 11.4.2)
into the Ack Vectors it sends; hence the HC-receiver is only interested in the
highest state that the HC-sender received;
* this means that the acknowledgment number on the (Data)Ack from the HC-sender
is sufficient; and work done in parsing earlier state is not necessary, since
the later state subsumes the earlier one (see also RFC 4340, A.4).
This older interface (dccp_ackvec_parse()) is therefore removed.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:37 +0800
d7dc7e5f4 dccp ccid-2: Implementation of circular Ack Vector buffer with overflow handling ... Browse Code »

This completes the implementation of a circular buffer for Ack Vectors, by
extending the current (linear array-based) implementation. The changes are:

(a) An `overflow' flag to deal with the case of overflow. As before, dynamic
growth of the buffer will not be supported; but code will be added to deal
robustly with overflowing Ack Vector buffers.

(b) A `tail_seqno' field. When naively implementing the algorithm of Appendix A
in RFC 4340, problems arise whenever subsequent Ack Vector records overlap,
which can bring the entire run length calculation completely out of synch.
(This is documented on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/\
ack_vectors/tracking_tail_ackno/ .)
(c) The buffer lengthi is now computed dynamically (i.e. current fill level),
as the span between head to tail.

As a result, dccp_ackvec_pending() is now simpler - the #ifdef is no longer
necessary since buf_empty is always true when IP_DCCP_ACKVEC is not configured.

Note on overflow handling:
-------------------------
The Ack Vector code previously simply started to drop packets when the
Ack Vector buffer overflowed. This means that the userspace application
will not be able to receive, only because of an Ack Vector storage problem.

Furthermore, overflow may be transient, so that applications may later
recover from the overflow. Recovering from dropped packets is more difficult
(e.g. video key frames).

Hence the patch uses a different policy: when the buffer overflows, the oldest
entries are subsequently overwritten. This has a higher chance of recovery.
Details are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ack_vectors/

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:36 +0800
4829007c7 dccp ccid-2: Separate internals of Ack Vectors from option-parsing code ... Browse Code »

This patch
* separates Ack Vector housekeeping code from option-insertion code;
* shifts option-specific code from ackvec.c into options.c;
* introduces a dedicated routine to take care of the Ack Vector records;
* simplifies the dccp_ackvec_insert_avr() routine: the BUG_ON was redundant,
since the list is automatically arranged in descending order of ack_seqno.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:36 +0800
ff49e2708 dccp ccid-2: Ack Vector interface clean-up ... Browse Code »

This patch brings the Ack Vector interface up to date. Its main purpose is
to lay the basis for the subsequent patches of this set, which will use the
new data structure fields and routines.

There are no real algorithmic changes, rather an adaptation:

(1) Replaced the static Ack Vector size (2) with a #define so that it can
be adapted (with low loss / Ack Ratio, a value of 1 works, so 2 seems
to be sufficient for the moment) and added a solution so that computing
the ECN nonce will continue to work - even with larger Ack Vectors.

(2) Replaced the #defines for Ack Vector states with a complete enum.

(3) Replaced #defines to compute Ack Vector length and state with general
purpose routines (inlines), and updated code to use these.

(4) Added a `tail' field (conversion to circular buffer in subsequent patch).

(5) Updated the (outdated) documentation for Ack Vector struct.

(6) All sequence number containers now trimmed to 48 bits.

(7) Removal of unused bits:
* removed dccpav_ack_nonce from struct dccp_ackvec, since this is already
redundantly stored in the `dccpavr_ack_nonce' (of Ack Vector record);
* removed Elapsed Time for Ack Vectors (it was nowhere used);
* replaced semantics of dccpavr_sent_len with dccpavr_ack_runlen, since
the code needs to be able to remember the old run length;
* reduced the de-/allocation routines (redundant / duplicate tests).

Justification for removing Elapsed Time information [can be removed]:
---------------------------------------------------------------------
1. The Elapsed Time information for Ack Vectors was nowhere used in the code.
2. DCCP does not implement rate-based pacing of acknowledgments. The only
recommendation for always including Elapsed Time is in section 11.3 of
RFC 4340: "Receivers that rate-pace acknowledgements SHOULD [...]
include Elapsed Time options". But such is not the case here.
3. It does not really improve estimation accuracy. The Elapsed Time field only
records the time between the arrival of the last acknowledgeable packet and
the time the Ack Vector is sent out. Since Linux does not (yet) implement
delayed Acks, the time difference will typically be small, since often the
arrival of a data packet triggers sending feedback at the HC-receiver.

Justification for changes in de-/allocation routines [can be removed]:
----------------------------------------------------------------------
* INIT_LIST_HEAD in dccp_ackvec_record_new was redundant, since the list
pointers were later overwritten when the node was added via list_add();
* dccp_ackvec_record_new() was called in a single place only;
* calls to list_del_init() before calling dccp_ackvec_record_delete() were
redundant, since subsequently the entire element was k-freed;
* since all calls to dccp_ackvec_record_delete() were preceded to a call to
list_del_init(), the WARN_ON test would never evaluate to true;
* since all calls to dccp_ackvec_record_delete() were made from within
list_for_each_entry_safe(), the test for avr == NULL was redundant;
* list_empty() in ackvec_free was redundant, since the same condition is
embedded in the loop condition of the subsequent list_for_each_entry_safe().

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:36 +0800
b8c6bcee1 dccp: Reduce noise in output and convert to ktime_t ... Browse Code »

This fixes the problem that dccp_probe output can grow quite large without
apparent benefit (many identical data points), creating huge files (up to
over one Gigabyte for a few minutes' test run) which are very hard to
post-process (in one instance it got so bad that gnuplot ate up all memory
plus swap).

The cause for the problem is that the kprobe is inserted into dccp_sendmsg(),
which can be called in a polling-mode (whenever the TX queue is full due to
congestion-control issues, EAGAIN is returned). This creates many very
similar data points, i.e. the increase of processing time does not increase
the quality/information of the probe output.

The fix is to attach the probe to a different function -- write_xmit was
chosen since it gets called continually (both via userspace and timer);
an input-path function would stop sampling as soon as the other end stops
sending feedback.

For comparison the output file sizes for the same 20 second test
run over a lossy link:
* before / without patch: 118 Megabytes
* after / with patch: 1.2 Megabytes
and there was much less noise in the output.

To allow backward compatibility with scripts that people use, the now-unused
`size' field in the output has been replaced with the CCID identifier. This
also serves for future compatibility - support for CCID2 is work in progress
(depends on the still unfinished SRTT/RTTVAR updates).

While at it, the update to ktime_t was also performed.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:36 +0800
a9c1656ab dccp: Merge now-reduced connect_init() function ... Browse Code »

After moving the assignment of GAR/ISS from dccp_connect_init() to
dccp_transmit_skb(), the former function becomes very small, so that
a merger with dccp_connect() suggests itself.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:35 +0800
bfbddd085 dccp: Fix the adjustments to AWL and SWL ... Browse Code »

This fixes a problem and a potential loophole with regard to seqno/ackno
validity: the problem is that the initial adjustments to AWL/SWL were
only performed at the begin of the connection, during the handshake.

Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
it is however necessary to perform these adjustments at least for the first
W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

This requirement is complicated by the fact that W/W' can change at any time
during the lifetime of a connection.

Therefore the consequence is to perform this safety check each time SWL/AWL
are updated.

A second problem solved by this patch is that the remote/local Sequence Window
feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
feature negotiation has completed.

During the initial handshake we have more stringent sequence number protection,
the changes added by this patch effect that {A,S}W{L,H} are within the correct
bounds at the instant that feature negotiation completes (since the SeqWin
feature activation handlers call dccp_update_gsr/gss()).

A detailed rationale is below -- can be removed from the commit message.

1. Server sequence number checks during initial handshake
---------------------------------------------------------
The server can not use the fields of the listening socket for seqno/ackno checks
and thus needs to store all relevant information on a per-connection basis on
the dccp_request socket. This is a size-constrained structure and has currently
only ISS (dreq_iss) and ISR (dreq_isr) defined.
Adding further fields (SW{L,H}, AW{L,H}) would increase the size of the struct
and it is questionable whether this will have any practical gain. The currently
implemented solution is as follows.
* receiving first Request: dccp_v{4,6}_conn_request sets
ISR := P.seqno, ISS := dccp_v{4,6}_init_sequence()

* sending first Response: dccp_v{4,6}_send_response via dccp_make_response()
sets P.seqno := ISS, sets P.ackno := ISR

* receiving retransmitted Request: dccp_check_req() overrides ISR := P.seqno

* answering retransmitted Request: dccp_make_response() sets ISS += 1,
otherwise as per first Response

* completing the handshake: succeeds in dccp_check_req() for the first Ack
where P.ackno == ISS (P.seqno is not tested)

* creating child socket: ISS, ISR are copied from the request_sock

This solution will succeed whenever the server can receive the Request and the
subsequent Ack in succession, without retransmissions. If there is packet loss,
the client needs to retransmit until this condition succeeds; it will otherwise
eventually give up. Adding further fields to the request_sock could increase
the robustness a bit, in that it would make possible to let a reordered Ack
(from a retransmitted Response) pass. The argument against such a solution is
that if the packet loss is not persistent and an Ack gets through, why not
wait for the one answering the original response: if the loss is persistent, it
is probably better to not start the connection in the first place.

Long story short: the present design (by Arnaldo) is simple and will likely work
just as well as a more complicated solution. As a consequence, {A,S}W{L,H} are
not needed until the moment the request_sock is cloned into the accept queue.

At that stage feature negotiation has completed, so that the values for the local
and remote Sequence Window feature (7.5.2) are known, i.e. we are now in a better
position to compute {A,S}W{L,H}.

2. Client sequence number checks during initial handshake
---------------------------------------------------------
Until entering PARTOPEN the client does not need the adjustments, since it
constrains the Ack window to the packet it sent.

* sending first Request: dccp_v{4,6}_connect() choose ISS,
dccp_connect() then sets GAR := ISS (as per 8.5),
dccp_transmit_skb() (with the previous bug fix) sets
GSS := ISS, AWL := ISS, AWH := GSS
* n-th retransmitted Request (with previous patch):
dccp_retransmit_skb() via timer calls
dccp_transmit_skb(), which sets GSS := ISS+n
and then AWL := ISS, AWH := ISS+n

* receiving any Response: dccp_rcv_request_sent_state_process()
-- accepts packet if AWL

Gerrit Renker
2008-09-04 13:45:35 +0800
2975abd25 dccp: Schedule an Ack when receiving timestamps ... Browse Code »

This schedules an Ack when receiving a timestamp, exploiting the
existing inet_csk_schedule_ack() function, saving one case in the
`dccp_ack_pending()' function.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:35 +0800
d0995e6a9 dccp ccid-3: Remove dead states ... Browse Code »

This patch is thanks to an investigation by Leandro Sales de Melo and his
colleagues. They worked out two state diagrams which highlight the fact that
the xxx_TERM states in CCID-3/4 are in fact not necessary.

And this can be confirmed by in turn looking at the code: the xxx_TERM states
are only ever set in ccid3_hc_{rx,tx}_exit(). These two functions are part
of the following call chain:

* ccid_hc_{tx,rx}_exit() are called from ccid_delete() only;
* ccid_delete() invokes ccid_hc_{tx,rx}_exit() in the way of a destructor:
after calling ccid_hc_{tx,rx}_exit(), the CCID is released from memory;
* ccid_delete() is in turn called only by ccid_hc_{tx,rx}_delete();
* ccid_hc_{tx,rx}_delete() is called only if
- feature negotiation failed (dccp_feat_activate_values()),
- when changing the RX/TX CCID (to eject the current CCID),
- when destroying the socket (in dccp_destroy_sock()).

In other words, when CCID-3 sets the state to xxx_TERM, it is at a time where
no more processing should be going on, hence it is not necessary to introduce
a dedicated exit state - this is implicit when unloading the CCID.

The patch removes this state, one switch-statement collapses as a result.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:35 +0800
5fe94963a dccp ccid-3: Remove duplicate documentation ... Browse Code »

This removes RX-socket documentation which is either duplicate or non-existent.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:35 +0800
c506d91d9 dccp: Unused argument in CCID tx function ... Browse Code »

This removes the argument `more' from ccid_hc_tx_packet_sent, since it was
nowhere used in the entire code.

(Anecdotally, this argument was not even used in the original KAME code where
the function originally came from; compare the variable moreToSend in the
freebsd61-dccp-kame-28.08.2006.patch now maintained by Emmanuel Lochin.)

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:35 +0800
f10ecaee6 dccp: Replace magic CCID-specific numbers by symbolic constants ... Browse Code »

The constants DCCPO_{MIN,MAX}_CCID_SPECIFIC are nowhere used in the code, but
instead for the CCID-specific options numbers are used.

This patch unifies the use of CCID-specific option numbers, by adding symbolic
names reflecting the definitions in RFC 4340, 10.3.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:34 +0800
ce177ae2e dccp ccid-3: Remove redundant 'options_received' struct ... Browse Code »

The `options_received' struct is redundant, since it re-duplicates the existing
`p' and `x_recv' fields. This patch removes the sub-struct and migrates the
format conversion operations (cf. below) to ccid3_hc_tx_parse_options().

Why the fields are redundant
----------------------------
The Loss Event Rate p and the Receive Rate x_recv are initially 0 when first
loading CCID-3, as ccid_new() zeroes out the entire ccid3_hc_tx_sock.

When Loss Event Rate or Receive Rate options are received, they are stored by
ccid3_hc_tx_parse_options() into the fields `ccid3or_loss_event_rate' and
`ccid3or_receive_rate' of the sub-struct `options_received' in ccid3_hc_tx_sock.

After parsing (considering only the established state - dccp_rcv_established()),
the packet is passed on to ccid_hc_tx_packet_recv(). This calls the CCID-3
specific routine ccid3_hc_tx_packet_recv(), which performs the following copy
operations between fields of ccid3_hc_tx_sock:

* hctx->options_received.ccid3or_receive_rate is copied into hctx->x_recv,
after scaling it for fixpoint arithmetic, by 2^64;
* hctx->options_received.ccid3or_loss_event_rate is copied into hctx->p,
considering the above special cases; in addition, a value of 0 here needs to
be mapped into p=0 (when no Loss Event Rate option has been received yet).

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:34 +0800
535c55df1 dccp tfrc/ccid-3: Computing Loss Rate from Loss Event Rate ... Browse Code »

This adds a function to take care of the following cases occurring in the
computation of the Loss Rate p:

* 1/(2^32-1) is mapped into 0% as per RFC 4342, 8.5;
* 1/0 is mapped into the maximum of 100%;
* we want to avoid that p = 1/x is rounded down to 0 when x is very large,
since this means accidentally re-entering slow-start (indicated by p==0).

In the last case, the minimum-resolution value of p is returned.

Furthermore, a bug in ccid3_hc_rx_getsockopt is fixed (1/0 was mapped into ~0U),
which now allows to consistently print the scaled p-values as

printf("Loss Event Rate = %u.%04u %%\n", rx_info.tfrcrx_p / 10000,
rx_info.tfrcrx_p % 10000);

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:34 +0800
3306c781f dccp: Add packet type information to CCID-specific option parsing ... Browse Code »

This patch ...
1. adds packet type information to ccid_hc_{rx,tx}_parse_options(). This is
necessary, since table 3 in RFC 4340, 5.8 leaves it to the CCIDs to state
which options may (not) appear on what packet type.

2. adds such a check for CCID-3's {Loss Event, Receive} Rate as specified in
RFC 4340 8.3 ("Receive Rate options MUST NOT be sent on DCCP-Data packets")
and 8.5 ("Loss Event Rate options MUST NOT be sent on DCCP-Data packets").

3. removes an unused argument `idx' from ccid_hc_{rx,tx}_parse_options(). This
is also no longer necessary, since the CCID-specific option-parsing routines
are passed every single parameter of the type-length-value option encoding.

Also added documentation and made argument naming scheme consistent.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:34 +0800
47a61e7b4 dccp ccid-3: Simplify and consolidate tx_parse_options ... Browse Code »

This simplifies and consolidates the TX option-parsing code:

1. The Loss Intervals option is not currently used, so dead code related to
this option is removed. I am aware of no plans to support the option, but
if someone wants to implement it (e.g. for inter-op tests), it is better
to start afresh than having to also update currently unused code.

2. The Loss Event and Receive Rate options have a lot of code in common (both
are 32 bit, both have same length etc.), so this is consolidated.

3. The test against GSR is not necessary, because
- on first loading CCID3, ccid_new() zeroes out all fields in the socket;
- ccid3_hc_tx_packet_recv() treats 0 and ~0U equivalently, due to

pinv = opt_recv->ccid3or_loss_event_rate;
if (pinv == ~0U || pinv == 0)
hctx->p = 0;

- as a result, the sequence number field is removed from opt_recv.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:34 +0800
63b3a73bb dccp ccid-3: Remove ugly RTT-sampling history lookup ... Browse Code »

This removes the RTT-sampling function tfrc_tx_hist_rtt(), since

1. it suffered from complex passing of return values (the return value both
indicated successful lookup while the value doubled as RTT sample);

2. when for some odd reason the sample value equalled 0, this triggered a bug
warning about "bogus Ack", due to the ambiguity of the return value;

3. on a passive host which has not sent anything the TX history is empty and
thus will lead to unwanted "bogus Ack" warnings such as
ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-28197148
ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-26641606.

The fix is to replace the implicit encoding by performing the steps manually.

Furthermore, the "bogus Ack" warning has been removed, since it can actually be
triggered due to several reasons (network reordering, old packet, (3) above),
hence it is not very useful.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:34 +0800
de6f2b59e dccp ccid-3: Bug fix for the inter-packet scheduling algorithm ... Browse Code »

This fixes a subtle bug in the calculation of the inter-packet gap and shows
that t_delta, as it is currently used, is not needed. And hence replaced.

The algorithm from RFC 3448, 4.6 below continually computes a send time t_nom,
which is initialised with the current time t_now; t_gran = 1E6 / HZ specifies
the scheduling granularity, s the packet size, and X the sending rate:

t_distance = t_nom - t_now; // in microseconds
t_delta = min(t_ipi, t_gran) / 2; // `delta' parameter in microseconds

if (t_distance >= t_delta) {
reschedule after (t_distance / 1000) milliseconds;
} else {
t_ipi = s / X; // inter-packet interval in usec
t_nom += t_ipi; // compute the next send time
send packet now;
}

1) Description of the bug
-------------------------
Rescheduling requires a conversion into milliseconds, due to this call chain:

* ccid3_hc_tx_send_packet() returns a timeout in milliseconds,
* this value is converted by msecs_to_jiffies() in dccp_write_xmit(),
* and finally used as jiffy-expires-value for sk_reset_timer().

The highest jiffy resolution with HZ=1000 is 1 millisecond, so using a higher
granularity does not make much sense here.

As a consequence, values of t_distance < 1000 are truncated to 0. This issue
has so far been resolved by using instead

if (t_distance >= t_delta + 1000)
reschedule after (t_distance / 1000) milliseconds;

The bug is in artificially inflating t_delta to t_delta' = t_delta + 1000. This
is unnecessarily large, a more adequate value is t_delta' = max(t_delta, 1000).

2) Consequences of using the corrected t_delta'
-----------------------------------------------
Since t_delta = 500. This means that t_delta' = max(1000, t_delta) is constant at 1000.

On the other hand, when using a coarse HZ value of HZ < 500, we have three
sub-cases that can all be reduced to using another constant of t_gran/2.

(a) The first case arises when t_ipi > t_gran. Here t_delta' is the constant
t_delta' = max(1000, t_gran/2) = t_gran/2.

(b) If t_ipi < t_gran = 10^6/HZ usec, then t_delta = t_ipi/2

Gerrit Renker
2008-09-04 13:45:33 +0800
b2e317f4b dccp ccid-3: No more CCID control blocks in LISTEN state ... Browse Code »

The CCIDs are activated as last of the features, at the end of the handshake,
were the LISTEN state of the master socket is inherited into the server
state of the child socket. Thus, the only states visible to CCIDs now are
OPEN/PARTOPEN, and the closing states.

This allows to remove tests which were previously necessary to protect
against referencing a socket in the listening state (in CCID3), but which
now have become redundant.

As a further byproduct of enabling the CCIDs only after the connection has been
fully established, several typecast-initialisations of ccid3_hc_{rx,tx}_sock
can now be eliminated:
* the CCID is loaded, so it is not necessary to test if it is NULL,
* if it is possible to load a CCID and leave the private area NULL, then this
is a bug, which should crash loudly - and earlier,
* the test for state==OPEN || state==PARTOPEN now reduces only to the closing
phase (e.g. when the node has received an unexpected Reset).

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:33 +0800
842d1ef14 dccp ccid-3: Remove ccid3hc{tx,rx}_ prefixes ... Browse Code »

This patch does the same for CCID-3 as the previous patch for CCID-2:

s#ccid3hctx_##g;
s#ccid3hcrx_##g;

plus manual editing to retain consistency.

Please note: expanded the fields of the `struct tfrc_tx_info' in the hc_tx_sock,
since using short #define identifiers is not a good idea. The only place where
this embedded struct was used is ccid3_hc_tx_getsockopt().

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:33 +0800
1fb875096 dccp ccid-2: Remove ccid2hc{tx,rx}_ prefixes ... Browse Code »

This patch fixes two problems caused by the ubiquitous long "hctx->ccid2htx_"
and "hcrx->ccid2hcrx_" prefixes:
* code becomes hard to read;
* multiple-line statements are almost inevitable even for simple expressions;
The prefixes are not really necessary (compare with "struct tcp_sock").

There had been previous discussion of this on dccp@vger, but so far this was
not followed up (most people agreed that the prefixes are too long).

Signed-off-by: Gerrit Renker
Signed-off-by: Leandro Melo de Sales

Gerrit Renker
2008-09-04 13:45:33 +0800
88ddac513 dccp: Special case of the MPS for client-PARTOPEN with DataAcks ... Browse Code »

To increase robustness, it is necessary to resend Confirm feature-negotiation
options, even though the RFC does not mandate it. But feature negotiation
options can take (much) more room than the options on common DataAck packets.

Instead of reducing the MPS always for a case which only applies to the three
messages send during initial handshake, this patch devises a special case:

if the payload length of the DataAck in PARTOPEN is too large, an Ack is sent
to carry the options, and the feature-negotiation list is then flushed.

This means that the server gets two Acks for one Response. If both Acks get
lost, it is probably better to restart the connection anyway and devising yet
another special-case does not seem worth the extra complexity.

The patch (over-)estimates the expected overhead to be 32*4 bytes -- commonly
seen values were 20-90 bytes for initial feature-negotiation options.

It uses sizeof(u32) to mean "aligned units of 4 bytes". For consistency,
another use of sizeof is modified.

Signed-off-by: Gerrit Renker

Gerrit Renker
2008-09-04 13:45:33 +0800
55ebe3ab2 dccp: Leave headroom for options when calculating the MPS ... Browse Code »

The Maximum Packet Size (MPS) is of interest for applications which want
to transfer data, so it is only relevant to the data transfer phase of a
connection (unless one wants to send data on the DCCP-Request, but that is
not considered here).

The strategy chosen to deal with this requirement is to leave room for only
such options that may appear on data packets.

A special consideration applies to Ack Vectors: this is purely guesswork,
since these can have any length between 3 and 1020 bytes. The strategy
chosen here is to subtract a configurable minimum, the value of 16 bytes
(2 bytes for type/length plus 14 Ack Vector cells) has been found by
experimentatation. If people experience this as too much or too little,
this could later be turned into a Kconfig option.

There are currently no CCID-specific header options which may appear on data
packets, hence it is not necessary to define a corresponding CCID field.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:33 +0800
2faae5587 dccp ccid-2: Use feature-negotiation to report Ack Ratio changes ... Browse Code »

This uses the new feature-negotiation framework to signal Ack Ratio changes,
as required by RFC 4341, sec. 6.1.2.

This raises some problems for CCID-2 since it can at the moment not cope
gracefully with Ack Ratio of e.g. 2. A FIXME has thus been added which
reverts to the existing policy of bypassing the Ack Ratio sysctl.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:32 +0800
4861a3544 dccp: Support for exchanging of NN options in established state ... Browse Code »

This patch provides support for the reception of NN options in (PART)OPEN state.

It is a combination of change_recv() and confirm_recv(), specifically geared
towards receiving the `fast-path' NN options.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:32 +0800
624a965a9 dccp: Support for the exchange of NN options in established state ... Browse Code »

In contrast to static feature negotiation at the begin of a connection, which
establishes the capabilities of both endpoints, this patch introduces support
for dynamic exchange of feature negotiation options.

Such a dynamic exchange is necessary in at least two cases:
* CCID-2's Ack Ratio (RFC 4341, 6.1.2) which changes during the connection;
* Sequence Window values that, as per RFC 4340, 7.5.2, should be sent "as
as the connection progresses".

Both are NN (non-negotiable) features. Hence dynamic feature "negotiation" is
distinguished from static/pre-connection negotiation by the following:
* no new capabilities are negotiated (those that matter for the connection
are negotiated prior to setting up the connection, comparable to SIP);
* features must be understood by each endpoint: as per RFC 4340, 6.4,
Sequence Window is "Req'd" and Ack Ratio must be understood when CCID-2
is used as per the note underneath Table 4.

These characteristics are reflected in the implementation:
* only NN options can be exchanged after connection setup;
* NN options are activated directly after validating them. The rationale is
that a peer must accept every valid NN value (RFC 4340, 6.3.2), hence it
will either accept the value and send a "Confirm R", or it will send an
empty Confirm (which will reset the connection according to FN rules).
* An Ack is scheduled directly after activation to accelerate communicating
the update to the peer.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:32 +0800
76f738a79 dccp: Debugging functions for feature negotiation ... Browse Code »

Since all feature-negotiation processing now takes place in feat.c, functions
for producing verbose debugging output are concentrated there.

New functions to print out values, entry records, and options are provided,
and also a macro is defined to not always have the function name in the
output line.

Thanks a lot to Wei Yongjun and Giuseppe Galeota for help with errors in an
earlier revision of this patch.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:32 +0800
0a4822679 dccp: Initialisation and type-checking of feature sysctls ... Browse Code »

This patch takes care of initialising and type-checking sysctls related to
feature negotiation. Type checking is important since some of the sysctls
now directly act on the feature-negotiation process.

The sysctls are initialised with the known default values for each feature.
For the type-checking the value constraints from RFC 4340 are used:

* Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes),
tested and confirmed that it works up to 4294967295 - for Gbps speed;
* Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer);
* CCIDs are between 0 .. 255;
* request_retries, retries1, retries2 also between 0..255 for good measure;
* tx_qlen is checked to be non-negative;
* sync_ratelimit remains as before.

Further changes:
----------------
Performed s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:32 +0800
51c7d4fa2 dccp: Implement both feature-local and feature-remote Sequence Window feature ... Browse Code »

This adds full support for local/remote Sequence Window feature, from which the
* sequence-number-validity (W) and
* acknowledgment-number-validity (W') windows
derive as specified in RFC 4340, 7.5.3.

Specifically, the following changes are introduced:
* integrated new socket fields into dccp_sk;
* updated the update_gsr/gss routines with regard to these fields;
* updated handler code: the Sequence Window feature is located at the TX side,
so the local feature is meant if the handler-rx flag is false;
* the initialisation of `rcv_wnd' in reqsk is removed, since
- rcv_wnd is not used by the code anywhere;
- sequence number checks are not done in the LISTEN state (cf. 7.5.3);
- dccp_check_req checks the Ack number validity more rigorously;
* the `struct dccp_minisock' became empty and is now removed.

Until the handshake completes with activating negotiated values, the local/remote
Sequence-Window values are undefined and thus can not reliably be estimated.
This issue is addressed in a separate patch.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:32 +0800
09856c108 dccp: Auto-load (when supported) CCID plugins for negotiation ... Browse Code »

This adds auto-loading of CCIDs (when module loading is enabled)
for the purpose of feature negotiation.

The problem with loading the CCIDs at the end of feature negotiation is
that this would happen in software interrupt context. Besides, if the host
advertises CCIDs during negotiation, it should have them ready to use, in
case an agreeing peer wants to use it for the connection.

Signed-off-by: Gerrit Renker
Signed-off-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:31 +0800
5d3dac267 dccp: Initialisation framework for feature negotiation ... Browse Code »

This initialises feature negotiation from two tables, which are initialised
from sysctls.

As a novel feature, specifics of the implementation (e.g. currently short
seqnos and ECN are not supported) are advertised for robustness.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:31 +0800
b235dc4ab dccp ccid-2: Phase out the use of boolean Ack Vector sysctl ... Browse Code »

This removes the use of the sysctl and the minisock variable for the Send Ack
Vector feature, which is now handled fully dynamically via feature negotiation;
i.e. when CCID2 is enabled, Ack Vectors are automatically enabled (as per
RFC 4341, 4.).

Using a sysctl in parallel to this implementation would open the door to
crashes, since much of the code relies on tests of the boolean minisock /
sysctl variable. Thus, this patch replaces all tests of type

if (dccp_msk(sk)->dccpms_send_ack_vector)
/* ... */
with
if (dp->dccps_hc_rx_ackvec != NULL)
/* ... */

The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
negotiation concluded that Ack Vectors are to be used on the half-connection.
Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
so that the test is a valid one.

The activation handler for Ack Vectors is called as soon as the feature
negotiation has concluded at the
* server when the Ack marking the transition RESPOND => OPEN arrives;
* client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

Adding the sequence number of the Response packet to the Ack Vector has been
removed, since
(a) connection establishment implies that the Response has been received;
(b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
this entry will always be ignored;
(c) it can not be used for anything useful - to detect loss for instance, only
packets received after the loss can serve as pseudo-dupacks.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:31 +0800
68e074bfc dccp: Remove manual influence on NDP Count feature ... Browse Code »

Updating the NDP count feature is handled automatically now:
* for CCID-2 it is disabled, since the code does not use NDP counts;
* for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.

Allowing the user to change NDP values leads to unpredictable and failing
behaviour, since it is then possible to disable NDP counts even when they
are needed (e.g. in CCID-3).

This means that only those user settings are sensible that agree with the
values for Send NDP Count implied by the choice of CCID. But those settings
are already activated by the feature negotiation (CCID dependency tracking),
hence this form of support is redundant.

At startup the initialisation of the NDP count feature is with the default
value of 0, which is done implicitly by the zeroing-out of the socket when
it is allocated. If the choice of CCID or feature negotiation enables NDP
count, this will then be updated via the NDP activation handler.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:31 +0800
78673e24d dccp: Remove obsolete parts of the old CCID interface ... Browse Code »

The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
case, their value equals initially that of the sysctl, but at the end of
feature negotiation may be something different.

The old interface removed by this patch thus has been replaced by the newer
interface to dynamically query the currently loaded CCIDs earlier in this
patch set.

Also removed the constructors for the TX CCID and the RX CCID, since the
switch rx/non-rx is done by the handler in minisocks.c (and the handler is
the only place in the code where CCIDs are loaded).

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:31 +0800
23479cbfd dccp: Clean up old feature-negotiation infrastructure ... Browse Code »

The code removed by this patch is no longer referenced or used, the added
lines update documentation and copyrights.

Signed-off-by: Gerrit Renker
Acked-by: Ian McDonald

Gerrit Renker
2008-09-04 13:45:30 +0800