31 Dec, 2011

1 commit

  • During some debugging I needed to look into how /proc/net/ipv6_route
    operates, and while digging I found that it calls fib6_clean_all(), which
    takes "write_lock_bh(&table->tb6_lock)" before walking the table. I
    found this on 2.6.32, but from reading the code I believe the same basic
    idea exists currently. The rtnetlink code, by contrast, only takes
    "read_lock_bh(&table->tb6_lock);" via fib6_dump_table(). While I realize
    reading from proc isn't the recommended way of fetching the ipv6 route
    table, taking a write lock seems unnecessary and would probably cause
    network performance issues.

    To verify this I loaded up the ipv6 route table and then ran iperf in 3
    cases:
    * doing nothing
    * reading ipv6 route table via proc
      (while :; do cat /proc/net/ipv6_route > /dev/null; done)
    * reading ipv6 route table via rtnetlink
      (while :; do ip -6 route show table all > /dev/null; done)

    Load the ipv6 route table up with:
    * for ((i = 0;i < 4000;i++)); do ip route add unreachable 2000::$i; done

    iperf commands:
    * client: iperf -i 1 -V -c
    * server: iperf -V -s

    iperf results - 3 runs each (in Mbits/sec):
    * nothing: client: 927,927,927 server: 927,927,927
    * proc: client: 179,97,96,113 server: 142,112,133
    * iproute: client: 928,927,928 server: 927,927,927

    lock_stat shows taking the write lock is causing the slowdown. Using this
    info I decided to write a version of fib6_clean_all() which replaces
    write_lock_bh(&table->tb6_lock) with read_lock_bh(&table->tb6_lock). With
    this new function I see the same results as with my rtnetlink iperf test.

    Signed-off-by: Josh Hunt
    Signed-off-by: David S. Miller

    Josh Hunt
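
The effect described in this commit can be illustrated with a toy, single-threaded model of reader/writer admission rules. This is only a sketch: `toy_rwlock` and its helpers are hypothetical names, not the kernel's `rwlock_t` API, and the "lookup" stands in for any forwarding-path reader of the table lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of rwlock admission rules (hypothetical, not kernel code). */
struct toy_rwlock {
    int readers;   /* number of active read holders */
    bool writer;   /* true while a writer holds the lock */
};

static bool toy_read_trylock(struct toy_rwlock *l)
{
    if (l->writer)
        return false;          /* readers are excluded by a writer */
    l->readers++;
    return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
    assert(l->readers > 0);
    l->readers--;
}

static bool toy_write_trylock(struct toy_rwlock *l)
{
    if (l->writer || l->readers > 0)
        return false;          /* a writer needs exclusive access */
    l->writer = true;
    return true;
}

static void toy_write_unlock(struct toy_rwlock *l)
{
    assert(l->writer);
    l->writer = false;
}

/* Can a route lookup (a reader) proceed while a table dump holds the lock? */
static bool lookup_can_run_during_dump(bool dump_takes_write_lock)
{
    struct toy_rwlock tb_lock = { 0, false };
    bool lookup_ok;

    if (dump_takes_write_lock)
        toy_write_trylock(&tb_lock);   /* the /proc walk before the change */
    else
        toy_read_trylock(&tb_lock);    /* the rtnetlink-style walk */

    lookup_ok = toy_read_trylock(&tb_lock);
    if (lookup_ok)
        toy_read_unlock(&tb_lock);

    if (dump_takes_write_lock)
        toy_write_unlock(&tb_lock);
    else
        toy_read_unlock(&tb_lock);

    return lookup_ok;
}
```

With a read-locked walk the lookup is admitted; with a write-locked walk it is not, which is consistent with the throughput drop reported for the proc case.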
     

29 Dec, 2011

1 commit


18 Jul, 2011

1 commit


25 Apr, 2011

1 commit

  • These header files are never installed for userland consumption, so any
    __KERNEL__ cpp checks are superfluous.

    Projects should also not copy these files into their userland utility
    sources and try to use them there. If they insist on doing so, the
    onus is on them to sanitize the headers as needed.

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Apr, 2011

1 commit


16 Apr, 2011

1 commit


13 Mar, 2011

1 commit


11 Feb, 2011

1 commit

  • If we didn't have a routing cache, we would not be able to properly
    propagate certain kinds of dynamic path attributes, for example
    PMTU information and redirects.

    The reason is that if we didn't have a routing cache, then there would
    be no way to look up all of the active cached routes hanging off of
    sockets, tunnels, IPSEC bundles, etc.

    Consider the case where we created a cached route, but no inetpeer
    entry existed and also we were not asked to pre-COW the route metrics
    and therefore did not force the creation of a new inetpeer entry.

    If we later get a PMTU message, or a redirect, and store this
    information in a new inetpeer entry, there is no way to teach that
    cached route about the newly existing inetpeer entry.

    The facilities implemented here handle this problem.

    First we create a generation ID. When we create a cached route of any
    kind, we remember the generation ID at the time of attachment. Any
    time we force-create an inetpeer entry in response to new path
    information, we bump that generation ID.

    The dst_ops->check() callback is where the knowledge of this event
    is propagated. If the global generation ID does not equal the one
    stored in the cached route, and the cached route has not attached
    to an inetpeer yet, we look it up and attach if one is found. Now
    that we've updated the cached route's information, we update the
    route's generation ID too.

    This clears the way for implementing PMTU and redirects directly in
    the inetpeer cache. There is absolutely no need to consult cached
    route information in order to maintain this information.

    At this point nothing bumps the inetpeer genids; that comes in
    later changes which handle PMTUs and redirects using inetpeers.

    Signed-off-by: David S. Miller

    David S. Miller
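
The generation ID scheme described in this commit can be sketched in plain C. All names here (`toy_peer`, `toy_route`, `toy_peer_table`) are hypothetical stand-ins, and the one-slot "table" replaces the real inetpeer lookup; this is a model of the idea, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

struct toy_peer {
    unsigned int pmtu;              /* learned path information */
};

struct toy_route {
    struct toy_peer *peer;          /* NULL until a peer is attached */
    unsigned int peer_genid;        /* generation ID seen at attach/check time */
};

static unsigned int toy_peer_genid;     /* global generation ID */
static struct toy_peer *toy_peer_table; /* stand-in for the peer lookup */

/* Creating a cached route remembers the current generation ID. */
static void toy_route_init(struct toy_route *rt)
{
    rt->peer = toy_peer_table;      /* attach only if a peer already exists */
    rt->peer_genid = toy_peer_genid;
}

/* Force-creating a peer for new path information bumps the ID. */
static void toy_peer_force_create(struct toy_peer *p)
{
    assert(p != NULL);
    toy_peer_table = p;
    toy_peer_genid++;
}

/* Models the dst_ops->check() step: resync if the global ID moved on. */
static void toy_route_check(struct toy_route *rt)
{
    if (rt->peer_genid != toy_peer_genid) {
        if (!rt->peer)
            rt->peer = toy_peer_table;  /* look up and attach if found */
        rt->peer_genid = toy_peer_genid;
    }
}
```

A route created before any peer exists picks the peer up on its next check, because the stored and global IDs no longer match.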
     

01 Dec, 2010

1 commit


11 Jun, 2010

1 commit


02 Apr, 2010

1 commit

  • The head element of rt6_info{} is dst_entry{}, and
    IPv6 specific elements follow.

    Because elements at the end of dst_entry{} are frequently
    updated, it is not good to put frequently-used static
    elements, such as rt6i_idev, rt6i_dst or rt6i_flags in the
    same cache line.

    On the other hand, fib6_table, rt6i_node or rt6i_gateway are
    rarely used, so it is okay for them to stay in the same cache line.

    Let's rearrange rt6_info{}.

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    YOSHIFUJI Hideaki / 吉藤英明
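
The layout concern in this commit can be checked with `offsetof`. The sketch below assumes a 64-byte cache line and uses made-up structures (`toy_dst_entry`, `toy_rt6_info`) with invented field sizes. It also forces the boundary with `_Alignas`, which is an assumption made to keep the example deterministic; the commit message itself only speaks of rearranging fields.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64   /* assumed cache line size */

struct toy_dst_entry {
    char read_mostly[TOY_CACHE_LINE]; /* placeholder for earlier fields */
    unsigned long lastuse;            /* frequently updated tail */
    int use_count;                    /* frequently updated tail */
};

struct toy_rt6_info {
    struct toy_dst_entry dst;         /* head element, as in the commit */

    /* Rarely used: fine next to the frequently written tail of dst. */
    void *rt6i_table;
    void *rt6i_node;
    unsigned char rt6i_gateway[16];

    /* Frequently read, rarely written: start on a fresh cache line. */
    _Alignas(TOY_CACHE_LINE) void *rt6i_idev;
    unsigned char rt6i_dst[16];
    unsigned int rt6i_flags;
};

static_assert(offsetof(struct toy_rt6_info, rt6i_idev) % TOY_CACHE_LINE == 0,
              "hot read-mostly fields should start a cache line");

/* Index of the cache line that a given struct offset falls into. */
static size_t toy_cache_line_of(size_t offset)
{
    return offset / TOY_CACHE_LINE;
}
```

Comparing `toy_cache_line_of()` for `dst.lastuse` and `rt6i_idev` shows whether writes to the former would invalidate the line holding the latter.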
     

19 Feb, 2010

1 commit


13 Feb, 2010

1 commit

  • When the fib size exceeds what can be dumped in a single skb, the
    dump is suspended and resumed once the last skb has been received
    by userspace. When the fib is changed while the dump is suspended,
    the walker might contain stale pointers, causing a crash when the
    dump is resumed.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] fib6_walk_continue+0xbb/0x124 [ipv6]
    PGD 5347a067 PUD 65c7067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    ...
    RIP: 0010:[]
    [] fib6_walk_continue+0xbb/0x124 [ipv6]
    ...
    Call Trace:
    [] ? mutex_spin_on_owner+0x59/0x71
    [] inet6_dump_fib+0x11b/0x1b9 [ipv6]
    [] netlink_dump+0x5b/0x19e
    [] ? consume_skb+0x28/0x2a
    [] netlink_recvmsg+0x1ab/0x2c6
    [] ? netlink_unicast+0xfa/0x151
    [] __sock_recvmsg+0x6d/0x79
    [] sock_recvmsg+0xca/0xe3
    [] ? autoremove_wake_function+0x0/0x38
    [] ? radix_tree_lookup_slot+0xe/0x10
    [] ? find_get_page+0x90/0xa5
    [] ? filemap_fault+0x201/0x34f
    [] ? fget_light+0x2f/0xac
    [] ? verify_iovec+0x4f/0x94
    [] sys_recvmsg+0x14d/0x223

    Store the serial number when beginning to walk the fib and reload
    pointers when continuing to walk after a change occurred. Similar
    to other dumping functions, this might cause unrelated entries to
    be missed when entries are deleted.

    Tested-by: Ben Greear
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
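
The serial-number check described in this commit can be modelled on a small array-backed table. The types and functions below (`toy_fib`, `toy_walker`, `toy_walk_continue`) are hypothetical. An array cannot reproduce the dangling pointer from the oops, so the sketch only shows the resume logic: if the table changed while the dump was suspended, discard the cached cursor, restart from the root, and skip as many entries as were already dumped.

```c
#include <assert.h>

struct toy_fib {
    int entries[8];
    int count;
    unsigned int sernum;    /* bumped on every table change */
};

struct toy_walker {
    const int *cursor;      /* cached position; stale after a change */
    int count;              /* entries dumped so far */
    unsigned int sernum;    /* serial number seen when cursor was loaded */
};

/* Delete the first entry and record that the table changed. */
static void toy_fib_delete_first(struct toy_fib *t)
{
    assert(t->count > 0);
    for (int i = 1; i < t->count; i++)
        t->entries[i - 1] = t->entries[i];
    t->count--;
    t->sernum++;
}

static void toy_walk_start(struct toy_walker *w, const struct toy_fib *t)
{
    w->cursor = t->entries;
    w->count = 0;
    w->sernum = t->sernum;
}

/* Dump up to `budget` entries into `out`; returns how many were written. */
static int toy_walk_continue(struct toy_walker *w, const struct toy_fib *t,
                             int *out, int budget)
{
    int n = 0;

    if (w->sernum != t->sernum) {
        /* Table changed while suspended: reload from the root and
         * skip the number of entries that were already dumped. */
        int skip = w->count < t->count ? w->count : t->count;
        w->cursor = t->entries + skip;
        w->sernum = t->sernum;
    }
    while (n < budget && w->cursor < t->entries + t->count) {
        out[n++] = *w->cursor++;
        w->count++;
    }
    return n;
}
```

Deleting an already-dumped entry between two calls shifts the remaining entries, so the skip count steps over one entry that was never dumped. That is the "unrelated entries might be missed" caveat from the commit message.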
     

04 Nov, 2009

1 commit

  • This cleanup patch puts struct/union/enum opening braces
    on the first line to ease grep games.

    struct something
    {

    becomes :

    struct something {

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Jul, 2009

1 commit

  • Choose saner defaults for xfrm[4|6] gc_thresh values on init

    Currently, the xfrm[4|6] code has hard-coded initial gc_thresh values
    (set to 1024). Given that the ipv4 and ipv6 routing caches are sized
    dynamically at boot time, the static selections can be nonsensical.
    This patch dynamically selects an appropriate gc threshold based on
    the corresponding main routing table size, using the assumption that
    we should in the worst case be able to handle as many connections as
    the routing table can.

    For ipv4, the maximum route cache size is 16 * the number of hash
    buckets in the route cache. Given that xfrm4 starts garbage
    collection at the gc_thresh and prevents new allocations at 2 *
    gc_thresh, we set gc_thresh to half the maximum route cache size.

    For ipv6, it's a bit trickier. There is no maximum route cache size,
    but the ipv6 dst_ops gc_thresh is statically set to 1024. It seems
    sane to select a similar gc_thresh for the xfrm6 code that is half
    the number of hash buckets in the v6 route cache times 16 (like the v4
    code does).

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller

    Neil Horman
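
The ipv4 arithmetic in this commit can be written out as a short sketch. The function names are hypothetical and the bucket count is an arbitrary example value; only the factors (16 per bucket, half for gc_thresh, 2 * gc_thresh as the allocation limit) come from the commit message.

```c
#include <assert.h>
#include <limits.h>

/* Maximum route cache size: 16 entries per hash bucket. */
static unsigned int toy_rt_max_size(unsigned int hash_buckets)
{
    assert(hash_buckets <= UINT_MAX / 16);  /* guard against overflow */
    return 16u * hash_buckets;
}

/* xfrm gc threshold: half of the maximum route cache size. */
static unsigned int toy_xfrm_gc_thresh(unsigned int hash_buckets)
{
    return toy_rt_max_size(hash_buckets) / 2;
}

/* New allocations are refused at twice the gc threshold. */
static unsigned int toy_xfrm_alloc_limit(unsigned int gc_thresh)
{
    return 2u * gc_thresh;
}
```

For 1024 buckets this gives a maximum cache size of 16384 and a gc_thresh of 8192, so the allocation limit of 2 * gc_thresh lines up with the route cache maximum.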
     

05 Mar, 2008

1 commit


04 Mar, 2008

3 commits

  • The fib tables are now relative to the network namespace. When the
    garbage collector timer expires, we must have a network namespace
    parameter in order to retrieve the tables. For now this is the
    init_net, but we should be able to have a timer per namespace and use
    the timer callback parameter to pass the network namespace from the
    expired timer.

    The timer callback, fib6_run_gc, is actually called synchronously
    by some functions and asynchronously when the timer expires.

    When the timer expires, the delay passed as the fib6_run_gc parameter
    is always zero. So, I changed fib6_run_gc from a timer callback into
    a function called by the timer callback, and I added a timer
    callback whose only job is to retrieve the network namespace from
    the data arg of the timer and call fib6_run_gc with a zero expiry
    time and the network namespace as parameters. That makes the code
    cleaner for the fib6_run_gc callers.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • The function fib6_clean_all takes the network namespace as a
    parameter. That allows flushing the routes related to a specific
    network namespace.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • The fib tables for ipv6 are moved to the network namespace structure.
    All references to them are made relative to the network namespace.

    All external calls to the ip6_fib functions taking the network
    namespace parameter are made using the init_net variable, so the
    ip6_fib engine is ready for the namespaces but the callers are not yet.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     

08 Feb, 2008

1 commit


29 Jan, 2008

5 commits

  • An IPv6-specific piece was wrongly removed from transformation in
    net-2.6.25. This patch restores it within the current design.

    o Update "path" of xfrm_dst since IPv6 transformation should
      care about routing changes. It is required by MIPv6 and
      off-link destined IPsec.
    o Rename nfheader_len, which is used for non-fragment transformation
      by MIPv6, to rt6i_nfheader_len to fit the IPv6 name space.

    Signed-off-by: Masahide NAKAMURA
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Masahide NAKAMURA
     
  • The patch defines the usual static inline functions for when the code
    is disabled for fib6_rules. That allows removing some ifdefs in the
    route.c file and makes the code a little clearer.

    Signed-off-by: Daniel Lezcano
    Acked-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Daniel Lezcano
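
The "static inline stub" pattern mentioned in this commit looks roughly like the sketch below. The config macro and all function names are invented for illustration. The point is that the caller contains no `#ifdef`, because the header supplies empty inline versions when the feature is compiled out.

```c
#include <assert.h>

/* Define this to model a build with the optional feature enabled. */
/* #define TOY_CONFIG_FIB6_RULES 1 */

#ifdef TOY_CONFIG_FIB6_RULES
int toy_fib6_rules_init(void);
void toy_fib6_rules_cleanup(void);
#else
/* Feature disabled: stubs that compile away to nothing. */
static inline int toy_fib6_rules_init(void) { return 0; }
static inline void toy_fib6_rules_cleanup(void) { }
#endif

static int toy_route_ready;

/* Caller side: no #ifdef needed around the rules calls. */
static int toy_ip6_route_init(void)
{
    int ret = toy_fib6_rules_init();

    if (ret)
        return ret;
    toy_route_ready = 1;
    return 0;
}

static void toy_ip6_route_cleanup(void)
{
    assert(toy_route_ready);
    toy_fib6_rules_cleanup();
    toy_route_ready = 0;
}
```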
     
  • When the fib_rules initialization finishes, no return code is provided,
    so there is no way for the caller to know whether the initialization
    succeeded or failed. This patch fixes that.

    Signed-off-by: Daniel Lezcano
    Acked-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • If there is an error in the initialization function, nothing is
    propagated to the caller. So I add a return value to be set by the
    init function.

    Signed-off-by: Daniel Lezcano
    Acked-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
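
The two init-related commits above both make initialization failures visible to the caller. A minimal sketch of that pattern follows, using hypothetical function names and a test knob (`toy_fail_rules`) to force a failure: each step returns an error code, and the caller unwinds the steps that had already succeeded.

```c
#include <assert.h>
#include <errno.h>

static int toy_fail_rules;    /* test knob: make the rules init fail */
static int toy_fib6_active;   /* 1 while the fib6 part is initialized */

static int toy_fib6_init(void)
{
    toy_fib6_active = 1;
    return 0;
}

static void toy_fib6_exit(void)
{
    assert(toy_fib6_active > 0);
    toy_fib6_active = 0;
}

/* Returns 0 on success or a negative errno value on failure. */
static int toy_rules_init(void)
{
    return toy_fail_rules ? -ENOMEM : 0;
}

static int toy_route_subsys_init(void)
{
    int ret;

    ret = toy_fib6_init();
    if (ret)
        goto out;
    ret = toy_rules_init();
    if (ret)
        goto out_fib6;       /* undo the step that already succeeded */
    return 0;

out_fib6:
    toy_fib6_exit();
out:
    return ret;
}
```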
     
  • The dst member nfheader_len is only used by IPv6. It's also currently
    creating a rather ugly alignment hole in struct dst. Therefore this patch
    moves it from there into struct rt6_info.

    It also reorders the fields in rt6_info to minimize holes.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

11 Oct, 2007

1 commit

  • When XFRM policy and state become ready after a TCP connection has
    started, the traffic should be transformed immediately; however, this
    does not happen for IPv6 TCP.

    It depends on the dst cache replacement policy for a connected socket.
    It seems that the replacement is always done for IPv4, whereas for
    IPv6 it is done only when the routing cookie has changed.

    This patch fixes that, so that a non-transformation dst can be
    replaced by a transformation one.
    This behavior is required by MIPv6 and improves IPv6 IPsec.

    Fixes by Masahide NAKAMURA.

    Signed-off-by: Noriaki TAKAMIYA
    Signed-off-by: Masahide NAKAMURA
    Signed-off-by: David S. Miller

    Noriaki TAKAMIYA
     

26 Apr, 2007

1 commit


26 Mar, 2007

1 commit

  • As per RFC2461, section 6.3.6, item #2, when no routers on the
    matching list are known to be reachable or probably reachable we
    do round robin on those available routes so that we make sure
    to probe as many of them as possible to detect when one becomes
    reachable faster.

    Each routing table has a rwlock protecting the tree and the linked
    list of routes at each leaf. The round robin code executes during
    lookup and thus with the rwlock taken as a reader. A small local
    spinlock tries to provide protection but this does not work at all
    for two reasons:

    1) The round-robin list manipulation, as coded, goes like this (with
    read lock held):

    walk routes finding head and tail

    spin_lock();
    rotate list using head and tail
    spin_unlock();

    While one thread is rotating the list, another thread can
    end up with stale values of head and tail and then proceed
    to corrupt the list when it gets the lock. This ends up causing
    the OOPS in fib6_add() later on that many people have been hitting.

    2) All the other code paths that run with the rwlock held as
    a reader do not expect the list to change on them, they
    expect it to remain completely fixed while they hold the
    lock in that way.

    So, simply stated, it is impossible to implement this correctly using
    a manipulation of the list without violating the rwlock locking
    semantics.

    Reimplement using a per-fib6_node round-robin pointer. This way we
    don't need to manipulate the list at all, and since the round-robin
    pointer can only ever point to real existing entries we don't need
    to perform any locking on the changing of the round-robin pointer
    itself. We only need to reset the round-robin pointer to NULL when
    the entry it is pointing to is removed.

    The idea is from Thomas Graf and it is very similar to how this
    was implemented before the advanced router selection code went in.

    Signed-off-by: David S. Miller

    David S. Miller
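
The per-node round-robin pointer can be sketched with a singly linked list. The types and functions (`toy_rt`, `toy_node`, `toy_rr_select`, `toy_del_route`) are hypothetical and leave out reachability checks and metrics. What the sketch keeps is the core idea: selection only advances a pointer and never relinks the list, and deletion clears the pointer if it referenced the removed entry.

```c
#include <assert.h>
#include <stddef.h>

struct toy_rt {
    struct toy_rt *next;
    int id;
};

struct toy_node {
    struct toy_rt *leaf;    /* head of the route list at this node */
    struct toy_rt *rr_ptr;  /* next route to try; NULL means start at leaf */
};

/* Reader path: pick a route and advance rr_ptr. The list is untouched. */
static struct toy_rt *toy_rr_select(struct toy_node *fn)
{
    struct toy_rt *rt;

    assert(fn != NULL);
    rt = fn->rr_ptr ? fn->rr_ptr : fn->leaf;
    if (rt)
        fn->rr_ptr = rt->next ? rt->next : fn->leaf;  /* wrap around */
    return rt;
}

/* Writer path: unlink a route and reset rr_ptr if it pointed at it. */
static void toy_del_route(struct toy_node *fn, struct toy_rt *victim)
{
    struct toy_rt **pp;

    for (pp = &fn->leaf; *pp; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;
            if (fn->rr_ptr == victim)
                fn->rr_ptr = NULL;
            victim->next = NULL;
            return;
        }
    }
}
```

Since readers never modify the links, other code that walks the list under the read lock sees it unchanged, which is the property the old rotation scheme violated.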
     

11 Feb, 2007

1 commit


14 Dec, 2006

1 commit


03 Dec, 2006

1 commit


23 Sep, 2006

7 commits


22 Jun, 2005

1 commit

  • Essentially netlink at the moment always reports a pid and sequence of 0
    for v6 route activities.
    To understand the repercussions of this look at:
    http://lists.quagga.net/pipermail/quagga-dev/2005-June/003507.html

    While fixing this, I took the liberty of resolving the outstanding issue
    so that IPv6 routes inserted via ioctls have the correct pids as well.

    This patch tries to behave as closely as possible to the v4 routes, i.e.
    it maintains whatever PID the socket issuing the command owns as opposed
    to the process. That made the patch a little bulky.

    I have tested against both a netlink-derived utility to add/del routes
    and an ioctl-derived one. The Quagga folks have tested against quagga.
    This fixes the problem and so far hasn't been detected to introduce any
    new issues.

    Signed-off-by: Jamal Hadi Salim
    Acked-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds