09 Mar, 2013

1 commit

  • This patch requires interface-scoped multicast addresses to supply a
    sin6_scope_id. Because the sin6_scope_id is now also used correctly
    for interface-scoped multicast traffic, interface-scoped addresses can
    be used over interfaces that are not targeted by the default multicast
    route (the route has to be added manually, though).

    getsockname() and getpeername() now return the correct sin6_scope_id in
    case of interface-local mc addresses.
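
From the userspace side, the requirement above means the caller fills in sin6_scope_id before using an interface-local multicast destination. The following is a minimal sketch only; the helper name fill_scoped_mcast_dest and its parameters are hypothetical and are not part of the patch.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/*
 * Hypothetical helper: build a destination for a multicast group and
 * record the outgoing interface in sin6_scope_id.
 * Returns 0 on success, -1 if the group is not a valid IPv6 multicast
 * address or no interface index was given.
 */
int fill_scoped_mcast_dest(struct sockaddr_in6 *dst, const char *group,
                           unsigned int ifindex, unsigned short port)
{
    if (ifindex == 0)
        return -1;                      /* a scope id is mandatory here */

    memset(dst, 0, sizeof(*dst));
    dst->sin6_family = AF_INET6;
    dst->sin6_port = htons(port);

    if (inet_pton(AF_INET6, group, &dst->sin6_addr) != 1)
        return -1;                      /* not a parsable IPv6 address */
    if (!IN6_IS_ADDR_MULTICAST(&dst->sin6_addr))
        return -1;                      /* only multicast groups make sense */

    dst->sin6_scope_id = ifindex;       /* the interface the scope refers to */
    return 0;
}
```

A caller would typically obtain ifindex from if_nametoindex() and pass the filled structure to sendto() or connect().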

    v2:
    a) rebased on top of patch 1/4 (now uses ipv6_addr_props)

    v3:
    a) reverted changes for ipv6_addr_props

    v4:
    a) unchanged

    Cc: YOSHIFUJI Hideaki
    Acked-by: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Acked-by: YOSHIFUJI Hideaki dave
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

06 Mar, 2013

1 commit


28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones. The list iterator takes three parameters:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
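
To illustrate why the extra cursor is unnecessary, here is a userspace sketch of a three-parameter hlist-style iterator. It is not the kernel's definition: the names hnode, hhead, entry_of, entry_or_null and for_each_entry3 are hypothetical, and the macro relies on the GNU C __typeof__ extension.

```c
#include <stddef.h>

/* Minimal singly linked hash-list node and head (illustrative only). */
struct hnode { struct hnode *next; };
struct hhead { struct hnode *first; };

/* Recover the enclosing structure from a pointer to its embedded node. */
#define entry_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Same, but map a NULL node to a NULL entry so the loop can stop on it. */
#define entry_or_null(ptr, type, member) \
    ((ptr) ? entry_of(ptr, type, member) : NULL)

/* Three parameters: the entry cursor doubles as the loop condition. */
#define for_each_entry3(pos, head, member)                                  \
    for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);    \
         pos;                                                               \
         pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

struct item {
    int value;
    struct hnode link;
};

/* Walk the list with only the entry pointer as cursor. */
int sum_items(struct hhead *head)
{
    struct item *it;
    int sum = 0;

    for_each_entry3(it, head, link)
        sum += it->value;
    return sum;
}
```

Because the node pointer can always be reached as pos->member, a separate node cursor carries no extra information.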

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

30 Jan, 2013

1 commit

  • When attempting to build linux-next with user namespaces enabled I ran
    into this fun build error.

    CC net/ipv6/inet6_connection_sock.o
    .../net/ipv6/inet6_connection_sock.c: In function ‘inet6_csk_bind_conflict’:
    .../net/ipv6/inet6_connection_sock.c:37:12: error: incompatible types when initializing type ‘int’ using
    type ‘kuid_t’
    .../net/ipv6/inet6_connection_sock.c:54:30: error: incompatible type for argument 1 of ‘uid_eq’
    .../include/linux/uidgid.h:48:20: note: expected ‘kuid_t’ but argument is of type ‘int’
    make[3]: *** [net/ipv6/inet6_connection_sock.o] Error 1
    make[2]: *** [net/ipv6] Error 2
    make[2]: *** Waiting for unfinished jobs....

    Using kuid_t instead of int to hold the uid fixes this.
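
The build error is what a struct-wrapped uid type is meant to produce: a plain int cannot silently stand in for it. A small userspace sketch of the idea follows; demo_kuid_t, demo_uid_eq and demo_same_owner are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>

/* Wrapping the raw number in a struct makes it a distinct type. */
typedef struct {
    unsigned int val;
} demo_kuid_t;

/* Comparison goes through a helper instead of the == operator. */
static inline bool demo_uid_eq(demo_kuid_t left, demo_kuid_t right)
{
    return left.val == right.val;
}

/* Keep the uid in its wrapper type from initialization to comparison. */
bool demo_same_owner(demo_kuid_t sock_uid, demo_kuid_t other_uid)
{
    demo_kuid_t uid = sock_uid;     /* declaring this as int would not compile */

    return demo_uid_eq(uid, other_uid);
}
```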

    Cc: Tom Herbert
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

24 Jan, 2013

1 commit

  • The motivation for SO_REUSEPORT would be something like a web server
    binding to port 80 and running with multiple threads, where each thread
    might have its own listener socket. This could be done as an
    alternative to other models: 1) have one listener thread which
    dispatches completed connections to workers; 2) accept on a single
    listener socket from multiple threads. In case #1 the listener thread
    can easily become the bottleneck with a high connection turnover rate.
    In case #2, the proportion of connections accepted per thread tends
    to be uneven under high connection load (assuming a simple event loop:
    while (1) { accept(); process() }), because wakeup does not promote
    fairness among the sockets. We have seen the disproportion to be as high
    as a 3:1 ratio between the thread accepting the most connections and the
    one accepting the fewest. With SO_REUSEPORT the distribution is
    uniform.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

21 Nov, 2012

1 commit


19 Sep, 2012

1 commit

  • IPv6 dst should take care of rt_genid too. When a xfrm policy is inserted or
    deleted, all dst should be invalidated.
    To force the validation, dst entries should be created with ->obsolete set to
    DST_OBSOLETE_FORCE_CHK. This was already the case for all functions calling
    ip6_dst_alloc(), except for ip6_rt_copy().

    As a consequence, we can remove the specific code in inet6_connection_sock.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

18 Jul, 2012

1 commit

  • We should provide inet6_csk_route_socket() with a struct flowi6 pointer,
    so that inet6_csk_xmit() works correctly instead of sending garbage.

    Also add some consts.

    Signed-off-by: Eric Dumazet
    Reported-by: Yuchung Cheng
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jul, 2012

1 commit

  • This will be used so that we can compose a full flow key.

    Even though we have a route in this context, we need more. In the
    future the routes will be without destination address, source address,
    etc. keying. One ipv4 route will cover entire subnets, etc.

    In this environment we have to have a way to possess persistent storage
    for redirects and PMTU information. This persistent storage will exist
    in the FIB tables, and that's why we'll need to be able to rebuild a
    full lookup flow key here. Using that flow key will do a fib_lookup()
    and create/update the persistent entry.

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Jul, 2012

1 commit


29 Jun, 2012

2 commits

  • This commit changes inet6_csk_route_req() so that it uses a pointer to
    a struct flowi6, rather than allocating its own on the stack. This
    brings its behavior in line with its IPv4 cousin,
    inet_csk_route_req(), and allows a follow-on patch to fix a dst leak.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Fix inet6_csk_route_req() to use as the flowi6_oif the treq->iif,
    which is correctly fixed up in tcp_v6_conn_request() to handle the
    case of link-local addresses. This brings it in line with the
    tcp_v6_send_synack() code, which is already correctly using the
    treq->iif in this way.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

15 Apr, 2012

1 commit

  • We must try harder to get unique (addr, port) pairs when
    doing port autoselection for sockets with SO_REUSEADDR
    option set.

    We achieve this by adding a relaxation parameter to
    inet_csk_bind_conflict. When 'relax' parameter is off
    we return a conflict whenever the current searched
    pair (addr, port) is not unique.

    This tries to address the problems reported in patch:
    8d238b25b1ec22a73b1c2206f111df2faaff8285
    Revert "tcp: bind() fix when many ports are bound"

    Tests were run for creating and binding(0) many sockets
    on 100 IPs. The results are, on average:

    * 60000 sockets, 600 ports / IP:
      * 0.210 s, 620 (IP, port) duplicates without patch
      * 0.219 s, no duplicates with patch
    * 100000 sockets, 1000 ports / IP:
      * 0.371 s, 1720 duplicates without patch
      * 0.373 s, no duplicates with patch
    * 200000 sockets, 2000 ports / IP:
      * 0.766 s, 6900 duplicates without patch
      * 0.768 s, no duplicates with patch
    * 500000 sockets, 5000 ports / IP:
      * 2.227 s, 41500 duplicates without patch
      * 2.284 s, no duplicates with patch

    Signed-off-by: Alex Copot
    Signed-off-by: Daniel Baluta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alex Copot
     

27 Nov, 2011

1 commit


24 Nov, 2011

1 commit

  • commit 72a3effaf633bc ([NET]: Size listen hash tables using backlog
    hint) added a bug allowing inet6_synq_hash() to return an out of bound
    array index, because of u16 overflow.

    The bug can occur if system administrators set the net.core.somaxconn and
    net.ipv4.tcp_max_syn_backlog sysctls to values greater than 65536.
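
One way such an overflow can yield an out-of-bounds index is sketched below. This is an assumption about the failure mode rather than a copy of the kernel code: a power-of-two table size is narrowed to 16 bits before being used as a mask. The function names are hypothetical.

```c
#include <stdint.h>

/*
 * Table size carried in a 16-bit parameter: a size of 65536 or more is
 * truncated on the way in, so the mask no longer bounds the hash.
 */
uint32_t demo_index_u16(uint32_t hash, uint16_t table_size)
{
    return hash & (table_size - 1);
}

/* Same computation with a 32-bit size: the mask matches the real table. */
uint32_t demo_index_u32(uint32_t hash, uint32_t table_size)
{
    return hash & (table_size - 1);
}
```

With a table of 131072 entries, the 16-bit variant receives a size of 0, the mask becomes all ones, and the hash is returned unchanged, while the 32-bit variant stays below the table size.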

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Nov, 2011

1 commit


27 Oct, 2011

1 commit

  • commit 66b13d99d96a (ipv4: tcp: fix TOS value in ACK messages sent from
    TIME_WAIT) fixed IPv4 only.

    This part is for the IPv6 side, adding a tclass param to ip6_xmit()

    We alias tw_tclass and tw_tos, if socket family is INET6.

    [ if the socket is IPv4-mapped, only the IP_TOS socket option is used to
    fill the TOS field; TCLASS is not taken into account ]

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Aug, 2011

1 commit


09 May, 2011

1 commit

  • This allows us to acquire the exact route keying information from the
    protocol, however that might be managed.

    It handles all of the possibilities, from the simplest case of storing
    the key in inet->cork.fl to the more complex setup SCTP has where
    individual transports determine the flow.

    Signed-off-by: David S. Miller

    David S. Miller
     

14 Apr, 2011

1 commit

  • This reverts commit c191a836a908d1dd6b40c503741f91b914de3348.

    It causes known regressions for programs that expect to be able to use
    SO_REUSEADDR to shutdown a socket, then successfully rebind another
    socket to the same ID.

    Programs such as haproxy and amavisd expect this to work.

    This should fix kernel bugzilla 32832.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 Mar, 2011

4 commits


02 Mar, 2011

1 commit

  • Route lookups follow a general pattern in the ipv6 code wherein
    we first find the non-IPSEC route, potentially override the
    flow destination address due to ipv6 options settings, and then
    finally make an IPSEC search using either xfrm_lookup() or
    __xfrm_lookup().

    __xfrm_lookup() is used when we want to generate a blackhole route
    if the key manager needs to resolve the IPSEC rules (in this case
    -EREMOTE is returned and the original 'dst' is left unchanged).

    Otherwise plain xfrm_lookup() is used and when asynchronous IPSEC
    resolution is necessary, we simply fail the lookup completely.

    All of these cases are encapsulated into two routines,
    ip6_dst_lookup_flow and ip6_sk_dst_lookup_flow. The latter of which
    handles unconnected UDP datagram sockets.

    Signed-off-by: David S. Miller

    David S. Miller
     

12 Jan, 2011

1 commit

  • inet_csk_bind_conflict() logic currently disallows a bind() if
    it finds a friend socket (a socket bound on same address/port)
    satisfying a set of conditions :

    1) Current (to be bound) socket doesn't have sk_reuse set
    OR
    2) other socket doesn't have sk_reuse set
    OR
    3) other socket is in LISTEN state

    We should add the CLOSE state in the 3) condition, in order to avoid two
    REUSEADDR sockets in CLOSE state with same local address/port, since
    this can deny further operations.
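
The three conditions, with CLOSE added to the third, can be written as a single predicate. The sketch below is a userspace illustration with hypothetical types (demo_sock, demo_state) and assumes the two sockets have already been found to share the same address and port; it is not the kernel function.

```c
#include <stdbool.h>

enum demo_state { DEMO_ESTABLISHED, DEMO_LISTEN, DEMO_CLOSE };

struct demo_sock {
    bool reuse;                 /* stands in for sk_reuse */
    enum demo_state state;
};

/*
 * Decide whether binding sk conflicts with an existing socket "other"
 * bound to the same address and port.
 */
bool demo_bind_conflict(const struct demo_sock *sk,
                        const struct demo_sock *other)
{
    return !sk->reuse ||                    /* 1) new socket lacks reuse */
           !other->reuse ||                 /* 2) other socket lacks reuse */
           other->state == DEMO_LISTEN ||   /* 3) other socket is listening */
           other->state == DEMO_CLOSE;      /*    ... or, newly, closed */
}
```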

    Note : a prior patch tried to address the problem in a different (and
    buggy) way. (commit fda48a0d7a8412ced tcp: bind() fix when many ports
    are bound).

    Reported-by: Gaspar Chilingarov
    Reported-by: Daniel Baluta
    Tested-by: Daniel Baluta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Dec, 2010

1 commit


29 Nov, 2010

1 commit

  • jhash is widely used in the kernel and because the functions
    are inlined, the cost in size is significant. Also, the new jhash
    functions are slightly larger than the previous ones so better un-inline.
    As a preparation step, the calls to the internal macros are replaced
    with the plain jhash function calls.

    Signed-off-by: Jozsef Kadlecsik
    Signed-off-by: David S. Miller

    Jozsef Kadlecsik
     

02 Jun, 2010

1 commit

  • There are more than a dozen occurrences of following code in the
    IPv6 stack:

    if (opt && opt->srcrt) {
            struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
            ipv6_addr_copy(&final, &fl.fl6_dst);
            ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
            final_p = &final;
    }

    Replace those with a helper. Note that the helper overrides final_p
    in all cases. This is ok as final_p was previously initialized to
    NULL when declared.
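
A helper of the kind described could look roughly like the userspace sketch below. The types demo_flow and demo_opts and the name demo_update_dst are hypothetical simplifications: the routing header is reduced to a pointer to its first hop.

```c
#include <netinet/in.h>
#include <stddef.h>

struct demo_flow {
    struct in6_addr dst;                /* stands in for fl.fl6_dst */
};

struct demo_opts {
    const struct in6_addr *first_hop;   /* stands in for rt0->addr */
};

/*
 * If a source route is present, save the real destination in *orig,
 * point the flow at the first hop, and return orig. Otherwise return
 * NULL, so the caller's final_p is overwritten in every case.
 */
struct in6_addr *demo_update_dst(struct demo_flow *fl,
                                 const struct demo_opts *opt,
                                 struct in6_addr *orig)
{
    if (!opt || !opt->first_hop)
        return NULL;

    *orig = fl->dst;
    fl->dst = *opt->first_hop;
    return orig;
}
```

A call site then shrinks to a single assignment such as final_p = demo_update_dst(&fl, opt, &final).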

    Signed-off-by: Arnaud Ebalard
    Signed-off-by: David S. Miller

    Arnaud Ebalard
     

03 May, 2010

1 commit


29 Apr, 2010

1 commit

  • This reverts two commits:

    fda48a0d7a8412cedacda46a9c0bf8ef9cd13559
    tcp: bind() fix when many ports are bound

    and a follow-on fix for it:

    6443bb1fc2050ca2b6585a3fa77f7833b55329ed
    ipv6: Fix inet6_csk_bind_conflict()

    It causes problems with binding listening sockets when time-wait
    sockets from a previous instance still are alive.

    It's too late to keep fiddling with this so late in the -rc
    series, and we'll deal with it in net-next-2.6 instead.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Apr, 2010

1 commit


26 Apr, 2010

1 commit

  • Commit fda48a0d7a84 (tcp: bind() fix when many ports are bound)
    introduced a bug on IPV6 part.
    We should not call ipv6_addr_any(inet6_rcv_saddr(sk2)) but
    ipv6_addr_any(inet6_rcv_saddr(sk)) because sk2 can be IPV4, while sk is
    IPV6.

    Reported-by: Michael S. Tsirkin
    Signed-off-by: Eric Dumazet
    Tested-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Apr, 2010

1 commit

  • Port autoselection done by kernel only works when number of bound
    sockets is under a threshold (typically 30000).

    When this threshold is exceeded, we must check if there is a conflict
    before exiting the first loop in inet_csk_get_port().

    Change inet_csk_bind_conflict() to forbid two reuse-enabled sockets from
    binding to the same (address,port) tuple (with a non-ANY address).

    Same change for inet6_csk_bind_conflict()

    Reported-by: Gaspar Chilingarov
    Signed-off-by: Eric Dumazet
    Acked-by: Evgeniy Polyakov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Apr, 2010

1 commit

  • As Herbert Xu said: we should be able to simply replace ipfragok
    with skb->local_df. Commit f88037 (sctp: Drop ipfargok in sctp_xmit function)
    has dropped ipfragok and set the local_df value properly.

    The patch kills the ipfragok parameter of .queue_xmit().

    Signed-off-by: Shan Wei
    Signed-off-by: David S. Miller

    Shan Wei
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

21 Oct, 2009

1 commit


19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer fields used at lookup time into the first
    read-mostly cache line (inside struct sock_common) and move sk_refcnt
    to a separate cache line (only written by the rx path).

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.
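
The reason for the prefix can be shown with a small sketch. The layout below is hypothetical (demo_sock_common and demo_inet_sock are illustrative, not the kernel's structures): once a field name becomes a macro, every other use of that identifier is rewritten as well, so a short name like daddr would collide with ordinary local variables, while inet_daddr does not.

```c
/* Hypothetical shared block holding the fields used at lookup time. */
struct demo_sock_common {
    unsigned int skc_daddr;
    unsigned short skc_dport;
};

struct demo_inet_sock {
    struct demo_sock_common common;
};

/* Prefixed names can safely become macros that reach into the block. */
#define inet_daddr common.skc_daddr
#define inet_dport common.skc_dport

unsigned int demo_peer_addr(const struct demo_inet_sock *inet)
{
    /* A local called daddr stays legal because the macro is inet_daddr. */
    unsigned int daddr = inet->inet_daddr;

    return daddr;
}
```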

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Oct, 2009

1 commit

  • Atis Elsts wrote:
    > Not sure if there is need to fill the mark from skb in tunnel xmit functions. In any case, it's not done for GRE or IPIP tunnels at the moment.

    Ok, I'll just drop that part, I'm not sure what should be done in this case.

    > Also, in this patch you are doing that for SIT (v6-in-v4) tunnels only, and not doing it for v4-in-v6 or v6-in-v6 tunnels. Any reason for that?

    I just sent that patch out too quickly, here's a better one with the updates.

    Add support for IPv6 route lookups using sk_mark.

    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
     

03 Jun, 2009

1 commit

  • Define three accessors to get/set dst attached to a skb

    struct dst_entry *skb_dst(const struct sk_buff *skb)

    void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

    void skb_dst_drop(struct sk_buff *skb)
    This one should replace occurrences of :
    dst_release(skb->dst)
    skb->dst = NULL;

    Delete skb->dst field
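
The shape of the three accessors can be sketched in userspace as follows. The structures and the reference counting are simplified, hypothetical stand-ins (demo_skb, demo_dst, demo_dst_release), meant only to show how callers stop touching the field directly.

```c
#include <stddef.h>

struct demo_dst {
    int refcnt;
};

struct demo_skb {
    struct demo_dst *dst;       /* only the accessors touch this field */
};

static inline struct demo_dst *demo_skb_dst(const struct demo_skb *skb)
{
    return skb->dst;
}

static inline void demo_skb_dst_set(struct demo_skb *skb, struct demo_dst *dst)
{
    skb->dst = dst;
}

/* Simplified stand-in for dst_release(): drop one reference if any. */
static inline void demo_dst_release(struct demo_dst *dst)
{
    if (dst)
        dst->refcnt--;
}

/* Replaces the open-coded pair: release the dst, then clear the pointer. */
static inline void demo_skb_dst_drop(struct demo_skb *skb)
{
    demo_dst_release(skb->dst);
    skb->dst = NULL;
}
```

Hiding the field behind accessors is what allows its storage to change later without another tree-wide edit.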

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet