24 Oct, 2011

1 commit


22 Jun, 2011

1 commit


11 May, 2011

1 commit


04 May, 2011

1 commit


23 Apr, 2011

1 commit


08 Apr, 2011

1 commit

  • Commit 1018b5c01636c7c6bda31a719bda34fc631db29a ("Set rt->rt_iif more
    sanely on output routes.") breaks rt_is_{output,input}_route.

    As a result, IP_PKTINFO's ->ipi_ifindex is reported as 0.

    To fix it, this does:

    1) Add "int rt_route_iif;" to struct rtable

    2) For input routes, always set rt_route_iif to the same value as rt_iif

    3) For output routes, always set rt_route_iif to zero. Set rt_iif
    as it is done currently.

    4) Change rt_is_{output,input}_route() to test rt_route_iif
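A minimal user-space sketch of steps 1-4, assuming a simplified stand-in for the kernel's struct rtable (only the two fields the commit discusses are modeled):

```c
#include <assert.h>

/* Illustrative stand-in, not the kernel's real struct rtable. */
struct rtable {
    int rt_iif;        /* may now be set on output routes too */
    int rt_route_iif;  /* equals rt_iif on input routes, 0 on output */
};

/* The predicates now test rt_route_iif, which output routes never set. */
static int rt_is_input_route(const struct rtable *rt)
{
    return rt->rt_route_iif != 0;
}

static int rt_is_output_route(const struct rtable *rt)
{
    return rt->rt_route_iif == 0;
}
```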

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: David S. Miller


13 Mar, 2011

5 commits


05 Mar, 2011

1 commit


03 Mar, 2011

1 commit


02 Mar, 2011

1 commit


24 Feb, 2011

1 commit


23 Feb, 2011

2 commits


27 Jan, 2011

1 commit

  • Routing metrics are now copy-on-write.

    Initially a route entry points its metrics at a read-only location.
    If a routing table entry exists, it will point there. Otherwise it
    will point at the all-zero metric place-holder called
    'dst_default_metrics'.

    The writability state of the metrics is stored in the low bits of
    the metrics pointer; two bits remain spare if we want to store more
    states.

    For the initial implementation, COW is implemented simply via kmalloc.
    However future enhancements will change this to place the writable
    metrics somewhere else, in order to increase sharing. Very likely
    this "somewhere else" will be the inetpeer cache.

    Note also that this means that metrics updates may transiently fail
    if we cannot COW the metrics successfully.

    But even by itself, this patch should decrease memory usage and
    increase cache locality especially for routing workloads. In those
    cases the read-only metric copies stay in place and never get written
    to.

    TCP workloads where metrics get updated, and those rare cases where
    PMTU triggers occur, will take a very slight performance hit. But
    that hit will be alleviated when the long-term writable metrics
    move to a more sharable location.

    Since the metrics storage went from a u32 array of RTAX_MAX entries to
    what is essentially a pointer, some retooling of the dst_entry layout
    was necessary.

    Most importantly, we need to preserve the alignment of the reference
    count so that it doesn't share cache lines with the read-mostly state,
    as per Eric Dumazet's alignment assertion checks.

    The only non-trivial bit here is the move of the 'flags' member
    into the writable cache line. This is OK since we always access the
    flags around the same moment we modify the reference count.
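A user-space sketch of the low-bit pointer tagging and the kmalloc-based COW described above; the names echo the commit (dst_default_metrics, a read-only flag in the pointer's low bits), but the code is illustrative, with malloc standing in for kmalloc:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RTAX_MAX 16
#define DST_METRICS_READ_ONLY 0x1UL /* kept in the pointer's low bits */

/* All-zero read-only placeholder, as described in the commit text. */
static const uint32_t dst_default_metrics[RTAX_MAX];

struct dst_entry {
    unsigned long _metrics; /* metrics pointer | writability flag */
};

static uint32_t *dst_metrics_ptr(struct dst_entry *dst)
{
    return (uint32_t *)(dst->_metrics & ~3UL); /* mask the two flag bits */
}

static int dst_metrics_read_only(const struct dst_entry *dst)
{
    return dst->_metrics & DST_METRICS_READ_ONLY;
}

/* COW: on first write, copy the read-only metrics into heap storage.
 * May transiently fail under memory pressure, as the commit notes. */
static uint32_t *dst_metrics_write_ptr(struct dst_entry *dst)
{
    uint32_t *p;

    if (!dst_metrics_read_only(dst))
        return dst_metrics_ptr(dst);

    p = malloc(RTAX_MAX * sizeof(*p));
    if (!p)
        return NULL; /* caller must tolerate a failed metrics update */
    memcpy(p, dst_metrics_ptr(dst), RTAX_MAX * sizeof(*p));
    dst->_metrics = (unsigned long)p; /* flag bit now clear: writable */
    return p;
}
```

malloc's alignment guarantees the low bits of the returned pointer are zero, which is what makes the tag bits safe to steal.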

    Signed-off-by: David S. Miller


18 Nov, 2010

1 commit


16 Nov, 2010

1 commit

  • The GRE Key field is intended to identify an individual traffic
    flow within a tunnel. It is useful for XFRM policy selectors to be
    able to match on it, so that different GRE tunnels can have
    different policies.

    Signed-off-by: Timo Teräs
    Signed-off-by: David S. Miller


12 Nov, 2010

1 commit

  • It seems the idev field in struct rtable serves no special
    purpose, but adds extra atomic ops.

    We hold refcounts on the device itself (using percpu data, so pretty
    cheap in current kernel).

    The infiniband case is solved by using dst.dev instead of idev->dev.

    Removal of this field means routing without the route cache now
    uses shared data and percpu data; the only potential contention is
    a pair of atomic ops on struct neighbour per forwarded packet.

    About 5% speedup on routing test.

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hal Rosenstock
    Signed-off-by: David S. Miller


12 Oct, 2010

1 commit

  • struct dst_ops tracks the number of allocated dst entries in an
    atomic_t field, subject to high cache line contention under stress
    workloads.

    Switch to a percpu_counter to reduce the number of times we need
    to dirty a central location. Place it on a separate cache line to
    avoid dirtying read-only fields.
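The idea behind the percpu_counter switch can be sketched in user space like this (a single-threaded toy model; the batch size and layout are assumptions, and the real kernel API lives in <linux/percpu_counter.h>):

```c
#include <assert.h>

/* Toy model of a percpu counter: each CPU batches updates locally and
 * only folds them into the shared total when the local delta exceeds
 * a batch threshold, so the shared cache line is dirtied far less
 * often. Single-threaded sketch; no locking or preemption shown. */
#define NR_CPUS   4
#define PCP_BATCH 32

struct percpu_counter {
    long count;             /* shared, approximately up to date */
    long counters[NR_CPUS]; /* per-cpu local deltas */
};

static void percpu_counter_add(struct percpu_counter *fbc, int cpu, long amount)
{
    long c = fbc->counters[cpu] + amount;

    if (c >= PCP_BATCH || c <= -PCP_BATCH) {
        fbc->count += c;        /* rare: dirty the shared line */
        fbc->counters[cpu] = 0;
    } else {
        fbc->counters[cpu] = c; /* common: stay CPU-local */
    }
}

static long percpu_counter_sum(const struct percpu_counter *fbc)
{
    long sum = fbc->count;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += fbc->counters[cpu];
    return sum;
}
```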

    Stress test :

    (Sending 160,000,000 UDP frames,
    IP route cache disabled, dual E5540 @ 2.53GHz,
    32-bit kernel, FIB_TRIE, SLUB/NUMA)

    Before:

    real 0m51.179s
    user 0m15.329s
    sys 10m15.942s

    After:

    real 0m45.570s
    user 0m15.525s
    sys 9m56.669s

    With a small reordering of struct neighbour fields (the subject of
    a following patch, to separate refcnt from other read-mostly
    fields):

    real 0m41.841s
    user 0m15.261s
    sys 8m45.949s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


23 Sep, 2010

1 commit


08 Jul, 2010

1 commit


05 Jul, 2010

1 commit

  • While using xfrm's MARK feature on 2.6.34 - 2.6.35 kernels, the
    mark is always cleared in the flowi structure via memset in
    _decode_session4 (net/ipv4/xfrm4_policy.c), so the policy lookup
    fails. IPv6 code is affected by this bug too.

    Signed-off-by: Peter Kosyh
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller


11 Jun, 2010

1 commit


07 Apr, 2010

1 commit

  • __xfrm_lookup() is called for each packet transmitted out of the
    system. xfrm_find_bundle() does a linear search, which can kill
    system performance depending on how many bundles are required per
    policy.

    This modifies __xfrm_lookup() to store bundles directly in the
    flow cache. If we do not get a hit, we just create a new bundle
    instead of doing a slow search. This means that we can now get
    multiple xfrm_dst's for the same flow (on a per-cpu basis).

    Signed-off-by: Timo Teras
    Signed-off-by: David S. Miller


03 Mar, 2010

1 commit

  • When I merged the bundle creation code, I introduced a bogus flowi
    value in the bundle. Instead of getting it from the caller, it was
    set to the flow in the route object, which is totally different.

    The end result is that the bundles we created never match, and
    we instead end up with an ever growing bundle list.

    Thanks to Jamal for finding this problem.

    Reported-by: Jamal Hadi Salim
    Signed-off-by: Herbert Xu
    Acked-by: Steffen Klassert
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller


25 Jan, 2010

1 commit

  • GC is non-existent in netns, so after you hit GC threshold, no new
    dst entries will be created until someone triggers cleanup in init_net.

    Make xfrm4_dst_ops and xfrm6_dst_ops per-netns.
    This is not done in a generic way, because it would waste
    (AF_MAX - 2) * sizeof(struct dst_ops) bytes per-netns.

    Reorder GC threshold initialization so it'd be done before registering
    XFRM policies.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller


12 Nov, 2009

1 commit

  • Now that sys_sysctl is a compatibility wrapper around /proc/sys,
    all sysctl strategy routines, and all ctl_name and strategy
    entries in the sysctl tables, are unused and can be removed.

    In addition, neigh_sysctl_register has been modified to no longer
    take a strategy argument, and its callers have been modified not
    to pass one.

    Cc: "David Miller"
    Cc: Hideaki YOSHIFUJI
    Cc: netdev@vger.kernel.org
    Signed-off-by: Eric W. Biederman


05 Aug, 2009

1 commit

  • Fix build errors when SYSCTLs are not enabled:
    (.init.text+0x5154): undefined reference to `net_ipv4_ctl_path'
    (.init.text+0x5176): undefined reference to `register_net_sysctl_table'
    xfrm4_policy.c:(.exit.text+0x573): undefined reference to `unregister_net_sysctl_table'

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller


31 Jul, 2009

1 commit

  • Choose saner defaults for xfrm[4|6] gc_thresh values on init

    Currently, the xfrm[4|6] code has hard-coded initial gc_thresh
    values (set to 1024). Given that the ipv4 and ipv6 routing caches
    are sized dynamically at boot time, the static selections can be
    nonsensical. This patch dynamically selects an appropriate gc
    threshold based on the corresponding main routing table size, using
    the assumption that we should in the worst case be able to handle
    as many connections as the routing table can hold.

    For ipv4, the maximum route cache size is 16 * the number of hash
    buckets in the route cache. Given that xfrm4 starts garbage
    collection at the gc_thresh and prevents new allocations at 2 *
    gc_thresh, we set gc_thresh to half the maximum route cache size.

    For ipv6, it's a bit trickier. There is no maximum route cache
    size, but the ipv6 dst_ops gc_thresh is statically set to 1024. It
    seems sane to select a similar gc_thresh for the xfrm6 code that is
    half the number of hash buckets in the v6 route cache times 16
    (like the v4 code does).
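The v4 sizing rule above works out to simple arithmetic (a sketch only; the function name and parameter are hypothetical, not the kernel's):

```c
#include <assert.h>

/* gc_thresh sizing sketch: the v4 route cache holds at most 16
 * entries per hash bucket, and xfrm refuses new allocations at
 * 2 * gc_thresh, so setting gc_thresh to half the route cache
 * maximum lets xfrm hold as many entries as the route cache can
 * in the worst case. */
static unsigned int xfrm4_gc_thresh_guess(unsigned int rt_hash_buckets)
{
    unsigned int rt_max_size = 16 * rt_hash_buckets;
    return rt_max_size / 2;
}
```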

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller


28 Jul, 2009

1 commit

  • Export garbage collector thresholds for xfrm[4|6]_dst_ops

    Had a problem reported to me recently in which a system with a
    high volume of ipsec connections eventually began returning
    ENOBUFS for new connections.

    It seemed that after about 2000 connections we started being unable to
    create more. A quick look revealed that the xfrm code used a dst_ops
    structure that limited the gc_thresh value to 1024, and always
    dropped route cache entries after 2x the gc_thresh.

    It seems the most direct solution is to export the gc_thresh values in
    the xfrm[4|6] dst_ops as sysctls, like the main routing table does, so
    that higher volumes of connections can be supported. This patch has
    been tested and allows the reporter to increase their ipsec connection
    volume successfully.

    Reported-by: Joe Nall
    Signed-off-by: Neil Horman

    ipv4/xfrm4_policy.c | 18 ++++++++++++++++++
    ipv6/xfrm6_policy.c | 18 ++++++++++++++++++
    2 files changed, 36 insertions(+)
    Signed-off-by: David S. Miller


04 Jul, 2009

1 commit

  • The SCTP code pushed the skb data above the sctp chunk header, so
    the check of pskb_may_pull(skb, xprth + 4 - skb->data) in
    _decode_session4() will never return 0 because
    xprth + 4 - skb->data < 0, and the sctp ports decode will always
    fail.
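The underlying hazard is a signed pointer difference turning into an unsigned length. A toy model of that step (the helper name is illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* If the transport-header pointer xprth sits *before* the current
 * data pointer (because SCTP already pulled the headers), then
 * xprth + 4 - data is negative; converted to pskb_may_pull()'s
 * unsigned length parameter it becomes enormous. A fixed caller has
 * to handle the negative case explicitly, as sketched here. */
static int pull_len_is_valid(const unsigned char *data,
                             const unsigned char *xprth)
{
    ptrdiff_t need = xprth + 4 - data;
    return need >= 0;
}
```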

    Signed-off-by: Wei Yongjun
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller


01 Feb, 2009

1 commit


26 Nov, 2008

3 commits


12 Nov, 2008

1 commit


03 Nov, 2008

1 commit