02 Mar, 2009

2 commits

  • This fixes a problem caused by the overlap of the connection-setup and
    established-state phases of DCCP connections.

    During connection setup, the client retransmits Confirm Feature-Negotiation
    options until a response from the server signals that it can move from the
    half-established PARTOPEN into the OPEN state, whereupon the connection is
    fully established on both ends (RFC 4340, 8.1.5).

    However, since the client may already send data while it is in the PARTOPEN
    state, consequences arise for the Maximum Packet Size: the problem is that the
    initial option overhead is much higher than for the subsequent established
    phase, as it involves potentially many variable-length list-type options
    (server-priority options, RFC 4340, 6.4).

    Applying the standard MPS is insufficient here: especially with larger
    payloads this can lead to annoying, counter-intuitive EMSGSIZE errors.

    On the other hand, reducing the MPS available for the established phase by
    the added initial overhead is highly wasteful and inefficient.

    The solution chosen therefore is a two-phase strategy:

    If the payload length of the DataAck in PARTOPEN is too large, an Ack is sent
    to carry the options, and the feature-negotiation list is then flushed.

    This means that the server gets two Acks for one Response. If both Acks get
    lost, it is probably better to restart the connection anyway and devising yet
    another special-case does not seem worth the extra complexity.

    The result is a higher utilisation of the available packet space for the data
    transmission phase (established state) of a connection.

    The patch (over-)estimates the initial overhead to be 32*4 bytes -- commonly
    seen values were around 90 bytes for initial feature-negotiation options.

    It uses sizeof(u32) to mean "aligned units of 4 bytes".
    For consistency, another use of 4-byte alignment is adapted.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch resolves a long-standing FIXME to dynamically update the Maximum
    Packet Size depending on actual options usage.

    It uses the flags set by the feature-negotiation infrastructure to compute
    the required header option size.

    Most options are fixed-size, a notable exception are Ack Vectors (required
    currently only by CCID-2). These can have any length between 3 and 1020
    bytes. As a result of testing, 16 bytes (2 bytes for type/length plus 14 Ack
    Vector cells) have been found to be sufficient for loss-free situations.

    There are currently no CCID-specific header options which may appear on data
    packets, thus it is not necessary to define a corresponding CCID field as
    suggested in the old comment.

    Further changes:
    ----------------
    Adjusted the type of 'cur_mps' to match the unsigned return type of the
    function.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

22 Jan, 2009

4 commits

  • Since all feature-negotiation processing now takes place in feat.c,
    functions for producing verbose debugging output are concentrated
    there.

    New functions to print out values, entry records, and options are
    provided, and also a macro is defined to not always have the function
    name in the output line.

    Thanks a lot to Wei Yongjun and Giuseppe Galeota for help and
    discussion with an earlier revision of this patch.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch takes care of initialising and type-checking sysctls
    related to feature negotiation. Type checking is important since some
    of the sysctls now directly impact the feature-negotiation process.

    The sysctls are initialised with the known default values for each
    feature. For the type-checking the value constraints from RFC 4340
    are used:

    * Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes),
    tested and confirmed that it works up to 4294967295 - for Gbps speed;
    * Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer);
    * CCIDs are between 0 .. 255;
    * request_retries, retries1, retries2 also between 0..255 for good measure;
    * tx_qlen is checked to be non-negative;
    * sync_ratelimit remains as before.

    Notes:
    ------
    1. Die s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c.
    2. As pointed out by Arnaldo, the pattern of type-checking repeats itself in
    other places, sometimes with exactly the same kind of definitions (e.g.
    "static int zero;"). It may be a good idea (kernel janitors?) to consolidate
    type checking. For the sake of keeping the changeset small and in order not
    to affect other subsystems, I have not strived to generalise here.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This adds full support for local/remote Sequence Window feature, from which the
    * sequence-number-validity (W) and
    * acknowledgment-number-validity (W') windows
    derive as specified in RFC 4340, 7.5.3.

    Specifically, the following is contained in this patch:
    * integrated new socket fields into dccp_sk;
    * updated the update_gsr/gss routines with regard to these fields;
    * updated handler code: the Sequence Window feature is located at the TX side,
    so the local feature is meant if the handler-rx flag is false;
    * the initialisation of `rcv_wnd' in reqsk is removed, since
    - rcv_wnd is not used by the code anywhere;
    - sequence number checks are not done in the LISTEN state (cf. 7.5.3);
    - dccp_check_req checks the Ack number validity more rigorously;
    * the `struct dccp_minisock' became empty and is now removed.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This initialises feature negotiation from two tables, which are in
    turn are initialised from sysctls.

    As a novel feature, specifics of the implementation (e.g. that short
    seqnos and ECN are not yet available) are advertised for robustness.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

11 Jan, 2009

2 commits

  • Thanks to Wei and Arnaldo for pointing out the correct
    new reference for CCID-3.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • Removed the __exit annotation of tfrc_lib_exit(), in order to suppress the following section mismatch messages:

    WARNING: net/dccp/dccp.o(.text+0xd9): Section mismatch in reference from the function ccid_cleanup_builtins() to the function .exit.text:tfrc_lib_exit()
    The function ccid_cleanup_builtins() references a function in an exit section.
    Often the function tfrc_lib_exit() has valid usage outside the exit section
    and the fix is to remove the __exit annotation of tfrc_lib_exit.

    WARNING: net/dccp/dccp.o(.init.text+0x48): Section mismatch in reference from the function ccid_initialize_builtins() to the function .exit.text:tfrc_lib_exit()
    The function __init ccid_initialize_builtins() references
    a function __exit tfrc_lib_exit().
    This is often seen when error handling in the init function
    uses functionality in the exit path.
    The fix is often to remove the __exit annotation of
    tfrc_lib_exit() so it may be used outside an exit section.

    Signed-off-by: Leonardo Potenza
    Acked-by: Arnaldo Carvalho de Melo
    Acked-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Leonardo Potenza
     

05 Jan, 2009

3 commits

  • This patch integrates the TFRC library, which is a dependency of CCID-3 (and
    CCID-4), with the new use of CCIDs in the DCCP module.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch cleans up after integrating the CCID modules and, in addition,

    * moves the if/else cases from ccid_delete() into ccid_hc_{tx,rx}_delete();
    * removes the 'gfp' argument to ccid_new() - since it is always gfp_any().

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • Based on Arnaldo's earlier patch, this patch integrates the standardised
    CCID congestion control plugins (CCID-2 and CCID-3) of DCCP with dccp.ko:

    * enables a faster connection path by eliminating the need to always go
    through the CCID registration lock;

    * updates the implementation to use only a single array whose size equals
    the number of configured CCIDs instead of the maximum (256);

    * since the CCIDs are now fixed array elements, synchronization is no
    longer needed, simplifying use and implementation.

    CCID-2 is suggested as minimum for a basic DCCP implementation (RFC 4340, 10);
    CCID-3 is a standards-track CCID supported by RFC 4342 and RFC 5348.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     

30 Dec, 2008

1 commit

  • When we converted the protocol atomic counters such as the orphan
    count and the total socket count deadlocks were introduced due to
    the mismatch in BH status of the spots that used the percpu counter
    operations.

    Based on the diagnosis and patch by Peter Zijlstra, this patch
    fixes these issues by disabling BH where we may be in process
    context.

    Reported-by: Jeff Kirsher
    Tested-by: Ingo Molnar
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

18 Dec, 2008

1 commit

  • And thus when we try to use 'ss -danemi' on these sockets that have no
    ccid blocks (data collected using systemtap after I fixed the problem):

    dccp_diag_get_info sk=0xffff8801220a3100, dp->dccps_hc_rx_ccid=0x0000000000000000, dp->dccps_hc_tx_ccid=0x0000000000000000

    We get an OOPS:

    mica.ghostprotocols.net login: BUG: unable to handle kernel NULL pointer
    dereferenc0
    IP: [] dccp_diag_get_info+0x82/0xc0 [dccp_diag]
    PGD 12106f067 PUD 122488067 PMD 0
    Oops: 0000 [#1] PREEMPT

    Fix is trivial, and 'ss -d' is working again:

    [root@mica ~]# ss -danemi
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    LISTEN 0 0 *:5001 *:*
    ino:7288 sk:220a3100ffff8801
    mem:(r0,w0,f0,t0) cwnd:0 ssthresh:0
    [root@mica ~]#

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

08 Dec, 2008

7 commits

  • This removes the use of the sysctl and the minisock variable for the Send Ack
    Vector feature, as it now is handled fully dynamically via feature negotiation
    (i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
    RFC 4341, 4.).

    Using a sysctl in parallel to this implementation would open the door to
    crashes, since much of the code relies on tests of the boolean minisock /
    sysctl variable. Thus, this patch replaces all tests of type

    if (dccp_msk(sk)->dccpms_send_ack_vector)
    /* ... */
    with
    if (dp->dccps_hc_rx_ackvec != NULL)
    /* ... */

    The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
    negotiation concluded that Ack Vectors are to be used on the half-connection.
    Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
    so that the test is a valid one.

    The activation handler for Ack Vectors is called as soon as the feature
    negotiation has concluded at the
    * server when the Ack marking the transition RESPOND => OPEN arrives;
    * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

    Adding the sequence number of the Response packet to the Ack Vector has been
    removed, since
    (a) connection establishment implies that the Response has been received;
    (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
    this entry will always be ignored;
    (c) it can not be used for anything useful - to detect loss for instance, only
    packets received after the loss can serve as pseudo-dupacks.

    There was a FIXME to change the error code when dccp_ackvec_add() fails.
    I removed this after finding out that:
    * the check whether ackno < ISN is already made earlier,
    * this Response is likely the 1st packet with an Ackno that the client gets,
    * so when dccp_ackvec_add() fails, the reason is likely not a packet error.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • Updating the NDP count feature is handled automatically now:
    * for CCID-2 it is disabled, since the code does not use NDP counts;
    * for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.

    Allowing the user to change NDP values leads to unpredictable and failing
    behaviour, since it is then possible to disable NDP counts even when they
    are needed (e.g. in CCID-3).

    This means that only those user settings are sensible that agree with the
    values for Send NDP Count implied by the choice of CCID. But those settings
    are already activated by the feature negotiation (CCID dependency tracking),
    hence this form of support is redundant.

    At startup the initialisation of the NDP count feature uses the default
    value of 0, which is done implicitly by the zeroing-out of the socket when
    it is allocated. If the choice of CCID or feature negotiation enables NDP
    count, this will then be updated via the NDP activation handler.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
    case, their value equals initially that of the sysctl, but at the end of
    feature negotiation may be something different.

    The old interface removed by this patch thus has been replaced by the newer
    interface to dynamically query the currently loaded CCIDs.

    Also removed are the constructors for the TX CCID and the RX CCID, since the
    switch "rx non-rx" is done by the handler in minisocks.c (and the handler
    is the only place in the code where CCIDs are loaded).

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • The code removed by this patch is no longer referenced or used, the added
    lines update documentation and copyrights.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This integrates feature-activation in the client:

    1. When dccp_parse_options() fails, the reset code is already set; request_sent\
    _state_process() currently overrides this with `Packet Error', which is not
    intended - changed to use the reset code supplied by dccp_parse_options().

    2. When feature negotiation fails, the socket should be marked as not usable,
    so that the application is notified that an error occurred. This is achieved
    by a new label 'unable_to_proceed': generating an error code of `Aborted',
    setting the socket state to CLOSED, returning with ECOMM in sk_err.

    3. Avoids parsing the Ack twice in Respond state by not doing option processing
    again in dccp_rcv_respond_partopen_state_process (as option processing has
    already been done on the request_sock in dccp_check_req).

    Since this addresses congestion-control initialisation, a corresponding
    FIXME has been removed.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch integrates the activation of features at the end of negotiation
    into the server-side code.

    Note regarding the removal of 'const':
    --------------------------------------
    The 'const' attribute has been removed from 'dreq' since dccp_activate_values()
    needs to operate on dreq's feature list. Part of the activation is to remove
    those options from the list that have already been confirmed, hence it is not
    purely read-only.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This first patch out of three replaces the hardcoded default settings with
    initialisation code for the dynamic feature negotiation.

    The patch also ensures that the client feature-negotiation queue is flushed
    only when entering the OPEN state.

    Since confirmed Change options are removed as soon as they are confirmed
    (in the DCCP-Response), this ensures that Confirm options are retransmitted.

    Note on retransmitting Confirm options:
    ---------------------------------------
    Implementation experience showed that it is necessary to retransmit Confirm
    options. Thanks to Leandro Melo de Sales who reported a bug in an earlier
    revision of the patch set, resulting from not retransmitting these options.

    As long as the client is in PARTOPEN, it needs to retransmit the Confirm
    options for the Change options received on the DCCP-Response from the server.

    Otherwise, if the packet containing the Confirm options gets dropped in the
    network, the connection aborts due to undefined feature negotiation state.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

06 Dec, 2008

1 commit


02 Dec, 2008

6 commits

  • This patch provides the post-processing of feature negotiation state, after
    the negotiation has completed.

    To this purpose, handlers are used and added to the dccp_feat_table. Each
    handler is passed a boolean flag whether the RX or TX side of the feature
    is meant.

    Several handlers are provided already, new handlers can easily be added.

    The initialisation is now fully dynamic, i.e. CCIDs are activated only
    after the feature negotiation. The integration of this dynamic activation
    is done in the subsequent patches.

    Thanks to Wei Yongjun for pointing out the necessity of skipping over empty
    Confirm options while copying the negotiated feature values.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • Analogous to the previous patch, this adds code to interpret incoming Confirm
    feature-negotiation options. Both functions operate on the feature-negotiation
    list of either the request_sock (server) or the dccp_sock (client).

    Thanks to Wei Yongjun for pointing out that it is overly restrictive to check
    the entire list of confirmed SP values.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This adds/replaces code for processing incoming ChangeL/R options.
    The main difference is that:
    * mandatory FN options are now interpreted inside the function
    (there are too many individual cases to do this externally);
    * the function returns an appropriate Reset code or 0,
    which is then used to fill in the data for the Reset packet.

    Old code, which is no longer used or referenced, has been removed.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This provides two functions to
    * reconcile preference lists (with appropriate return codes) and
    * reorder the preference list if successful reconciliation changed the
    preferred value.

    The patch also removes the old code for processing SP/NN Change options, since
    new code to process these is mostly there already; related references have been
    commented out.

    The code for processing Change options follows in the next patch.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • The patch implements insertion of feature negotiation at the server (listening
    and request socket) and the client (connecting socket).

    In dccp_insert_options(), several statements have been grouped together now
    to achieve (it is hoped) better efficiency by reducing the number of tests
    each packet has to go through:
    - Ack Vectors are sent if the packet is neither a Data or a Request packet;
    - a previous issue is corrected - feature negotiation options are allowed
    on DataAck packets (5.8).

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch replaces the earlier insertion routine from options.c, so that
    code specific to feature negotiation can remain in feat.c. This is possible
    by calling a function already existing in options.c.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

26 Nov, 2008

3 commits

  • Instead of using one atomic_t per protocol, use a percpu_counter
    for "orphan_count", to reduce cache line contention on
    heavy duty network servers.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns
    to flow_cache_lookup() and resolver callback.

    Take it from socket or netdevice. Stub DECnet to init_net.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • this warning:

    net/dccp/options.c: In function ‘dccp_parse_options’:
    net/dccp/options.c:67: warning: ‘value’ may be used uninitialized in this function

    is a bogus GCC warning. The compiler does not recognize the relation
    between "value" and "mandatory" variables: the code flow can ever reach
    the "out_invalid_option:" label if 'mandatory' is set to 1, and when
    'mandatory' is non-zero, we'll always have 'value' initialized.

    Help out the compiler by annotating the variable.

    Signed-off-by: Ingo Molnar
    Signed-off-by: David S. Miller

    Ingo Molnar
     

24 Nov, 2008

5 commits

  • The patch extends existing code:
    * Confirm options divide into the confirmed value plus an optional preference
    list for SP values. Previously only the preference list was echoed for SP
    values, now the confirmed value is added as per RFC 4340, 6.1;
    * length and sanity checks are added to avoid illegal memory (or NULL) access.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • Support for Mandatory options is provided by this patch, which will
    be used by subsequent feature-negotiation patches.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This extends the scope of two available functions,
    encode|decode_value_var, to work up to 6 (8) bytes, to match maximum
    requirements in the RFC.

    These functions are going to be used both by general option processing
    and feature negotiation code, hence declarations have been put into
    feat.h.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This provides function to query the current TX/RX CCID dynamically,
    without reliance on the minisock value, using dynamic information
    available in the currently loaded CCID module.

    This query function is then used to
    (a) provide the getsockopt part for getting/setting CCIDs via sockopts;
    (b) replace the current test for "which CCID is in use" in probe.c.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • With this patch, TX/RX CCIDs can now be changed on a per-connection
    basis, which overrides the defaults set by the global sysctl variables
    for TX/RX CCIDs.

    To make full use of this facility, the remaining patches of this patch
    set are needed, which track dependencies and activate negotiated
    feature values.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     

20 Nov, 2008

3 commits


17 Nov, 2008

2 commits

  • This splits the setsockopt calls into two groups, depending on whether an
    integer argument (val) is required and whether routines being called do
    their own locking.

    Some options (such as setting the CCID) use u8 rather than int, so that for
    these the test with regard to integer-sizeof can not be used.

    The second switch-case statement now only has those statements which need
    locking and which make use of `val'.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Acked-by: Arnaldo Carvalho de Melo
    Reviewed-by: Eugene Teo
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch deprecates the Ack Ratio sysctl, since
    * Ack Ratio is entirely ignored by CCID-3 and CCID-4,
    * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
    * even if it would work in CCID-2, there is no point for a user to change it:
    - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
    - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
    (since waiting for Acks which will never arrive in this window),
    - cwnd is not a user-configurable value.

    The only reasonable place for Ack Ratio is to print it for debugging. It is
    planned to do this later on, as part of e.g. dccp_probe.

    With this patch Ack Ratio is now under full control of feature negotiation:
    * Ack Ratio is resolved as a dependency of the selected CCID;
    * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
    the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
    Ratio 2 for both endpoints";
    * what happens then is part of another patch set, since it concerns the
    dynamic update of Ack Ratio while the connection is in full flight.

    Thanks to Tomasz Grobelny for discussion leading up to this patch.

    Signed-off-by: Gerrit Renker
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker