31 Dec, 2011

1 commit

  • During some debugging I needed to look into how /proc/net/ipv6_route
    operated, and in my digging I found that it calls fib6_clean_all(),
    which takes "write_lock_bh(&table->tb6_lock)" before walking the
    table. I found this on 2.6.32, but reading the code I believe the
    same basic idea still exists today. The rtnetlink code, by contrast,
    only takes "read_lock_bh(&table->tb6_lock);" via fib6_dump_table().
    While I realize reading from proc isn't the recommended way of
    fetching the ipv6 route table, taking a write lock seems unnecessary
    and would probably cause network performance issues.

    To verify this I loaded up the ipv6 route table and then ran iperf in 3
    cases:
    * doing nothing
    * reading ipv6 route table via proc
    (while :; do cat /proc/net/ipv6_route > /dev/null; done)
    * reading ipv6 route table via rtnetlink
    (while :; do ip -6 route show table all > /dev/null; done)

    * Load the ipv6 route table up with:
    * for ((i = 0;i < 4000;i++)); do ip route add unreachable 2000::$i; done

    * iperf commands:
    * client: iperf -i 1 -V -c
    * server: iperf -V -s

    * iperf results - 3 runs each (in Mbits/sec)
    * nothing: client: 927,927,927 server: 927,927,927
    * proc: client: 179,97,96,113 server: 142,112,133
    * iproute: client: 928,927,928 server: 927,927,927

    lock_stat shows taking the write lock is causing the slowdown. Using this
    info I decided to write a version of fib6_clean_all() which replaces
    write_lock_bh(&table->tb6_lock) with read_lock_bh(&table->tb6_lock). With
    this new function I see the same results as with my rtnetlink iperf test.
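
    A hedged sketch of such a read-only variant, modeled on the
    2.6.32-era walker; fib6_clean_tree(), fib_table_hash and
    FIB6_TABLE_HASHSZ are as in that era's net/ipv6/ip6_fib.c, and the
    exact shape is an assumption, not the submitted patch:

        static void fib6_clean_all_ro(struct net *net,
                                      int (*func)(struct rt6_info *, void *arg),
                                      int prune, void *arg)
        {
                struct fib6_table *table;
                struct hlist_node *node;
                struct hlist_head *head;
                unsigned int h;

                rcu_read_lock();
                for (h = 0; h < FIB6_TABLE_HASHSZ; h++) {
                        head = &net->ipv6.fib_table_hash[h];
                        hlist_for_each_entry_rcu(table, node, head, tb6_hlist) {
                                /* Readers only need the read side of tb6_lock. */
                                read_lock_bh(&table->tb6_lock);
                                fib6_clean_tree(net, &table->tb6_root,
                                                func, prune, arg);
                                read_unlock_bh(&table->tb6_lock);
                        }
                }
                rcu_read_unlock();
        }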

    Signed-off-by: Josh Hunt
    Signed-off-by: David S. Miller


29 Dec, 2011

1 commit


06 Dec, 2011

1 commit


04 Dec, 2011

1 commit

  • 1) x == NULL --> !x
    2) x != NULL --> x
    3) if() --> if ()
    4) while() --> while ()
    5) (x & BIT) == 0 --> !(x & BIT)
    6) (x&BIT) --> (x & BIT)
    7) x=y --> x = y
    8) (BIT1|BIT2) --> (BIT1 | BIT2)
    9) if ((x & BIT)) --> if (x & BIT)
    10) proper argument and struct member alignment (example below)
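
    For illustration, a hypothetical before/after combining several of
    the rules above:

        /* before */
        if(skb==NULL || (flags&RTF_UP)==0)
                return;

        /* after */
        if (!skb || !(flags & RTF_UP))
                return;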

    Signed-off-by: David S. Miller


17 Nov, 2011

2 commits


16 Nov, 2011

1 commit


15 Nov, 2011

1 commit

  • Add support for NLM_F_* flags in IPv6 routing requests.

    If the NLM_F_CREATE flag is not set on an RTM_NEWROUTE request, a
    warning is printed but no error is returned; the new route is still
    added. Later, NLM_F_CREATE may become mandatory for new route
    creation.

    The exception is when the NLM_F_REPLACE flag is given without
    NLM_F_CREATE and no matching route is found. In that case it should
    be safe to assume that the request issuer is familiar with the
    NLM_F_* flags and really does not want a route to be created.

    Specifying the NLM_F_REPLACE flag will now make the kernel search
    for a matching route and replace it with the new one. If no route
    is found and NLM_F_CREATE is specified as well, a new route is
    created.

    Also, specifying NLM_F_EXCL will return an error if a matching
    route is found. The combined semantics are sketched below.
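
    A hedged sketch of that decision logic; rt6_find_match() and
    ip6_route_replace_rt() are illustrative placeholders, not the
    patch's actual helpers:

        static int ip6_route_request(struct fib6_config *cfg,
                                     struct nlmsghdr *nlh)
        {
                struct rt6_info *match = rt6_find_match(cfg); /* hypothetical */

                if (match) {
                        if (nlh->nlmsg_flags & NLM_F_EXCL)
                                return -EEXIST; /* exists, caller forbade it */
                        if (nlh->nlmsg_flags & NLM_F_REPLACE)
                                return ip6_route_replace_rt(match, cfg);
                } else if ((nlh->nlmsg_flags & NLM_F_REPLACE) &&
                           !(nlh->nlmsg_flags & NLM_F_CREATE)) {
                        /* Caller knows the flags and did not ask to create. */
                        return -ENOENT;
                }
                /* NLM_F_CREATE is not yet mandatory: warn, then add. */
                return ip6_route_add(cfg);
        }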

    Patch created against linux-3.2-rc1

    Signed-off-by: Matti Vaittinen
    Signed-off-by: David S. Miller


20 Oct, 2011

1 commit


03 Aug, 2011

1 commit

  • Gergely Kalman reported crashes in check_peer_redir().

    It appears commit f39925dbde778 (ipv4: Cache learned redirect
    information in inetpeer.) added a race, leading to possible NULL ptr
    dereference.

    Since we can now change dst neighbour, we should make sure a reader can
    safely use a neighbour.

    Add RCU protection to dst neighbour, and make sure check_peer_redir()
    can be called safely by different cpus in parallel.

    As neighbours are already freed after one RCU grace period, this
    patch should not add the typical RCU penalty (cache cold effects).
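
    A minimal sketch of the reader side, assuming the RCU-annotated
    dst->_neighbour field and accessor shape of that era (treat both
    names as assumptions):

        static struct neighbour *dst_get_neighbour(struct dst_entry *dst)
        {
                return rcu_dereference(dst->_neighbour);
        }

        static void check_peer_redir_reader(struct dst_entry *dst)
        {
                struct neighbour *n;

                rcu_read_lock();
                n = dst_get_neighbour(dst); /* pinned for this section */
                if (n && (n->nud_state & NUD_VALID))
                        neigh_event_send(n, NULL);
                rcu_read_unlock();
        }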

    Many thanks to Gergely for providing a pretty report pointing to the
    bug.

    Reported-by: Gergely Kalman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


18 Jul, 2011

2 commits


10 Jun, 2011

1 commit

  • The message size allocated for rtnl ifinfo dumps was limited to a
    single page. This is not enough for the additional interface info
    available with devices that support SR-IOV, and it caused a bug in
    which VF info would not be displayed if more than approximately 40
    VFs were created per interface.

    Implement a new function pointer for the rtnl_register service that will
    calculate the amount of data required for the ifinfo dump and allocate
    enough data to satisfy the request.
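
    A hedged sketch of the new hook; the calcit signature and
    min_ifinfo_dump_size are modeled on the 3.0-era rtnetlink code and
    should be treated as assumptions:

        static u16 rtnl_calcit_sketch(struct sk_buff *skb,
                                      struct nlmsghdr *nlh)
        {
                /* Worst-case ifinfo message size, so the dump can
                 * allocate more than a single page when needed. */
                return min_ifinfo_dump_size;
        }

        static void __init rtnl_sketch_init(void)
        {
                rtnl_register(PF_UNSPEC, RTM_GETLINK, rtnl_getlink,
                              rtnl_dump_ifinfo, rtnl_calcit_sketch);
        }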

    Signed-off-by: Greg Rose
    Signed-off-by: Jeff Kirsher


03 May, 2011

1 commit

  • Four years ago, Patrick made a change to hold rtnl mutex during netlink
    dump callbacks.

    I believe it was a wrong move. This slows down concurrent dumps, making
    good old /proc/net/ files faster than rtnetlink in some situations.

    This occurred to me because one "ip link show dev ..." was _very_
    slow on a workload adding/removing network devices in the
    background.

    All dump callbacks are able to use RCU locking now, so this patch is
    roughly a revert of commits:

    1c2d670f366 : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks
    6313c1e0992 : [RTNETLINK]: Remove unnecessary locking in dump callbacks

    This lets writers fight over the rtnl mutex while readers go full
    speed, as in the sketch below.

    It also takes care of phonet: phonet_route_get() is now called from
    an RCU read-side section, so I renamed it to phonet_route_get_rcu().
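
    The resulting pattern in a dump callback, as a rough sketch (the
    function name is illustrative; for_each_netdev_rcu() is real):

        static int dev_dump_sketch(struct sk_buff *skb,
                                   struct netlink_callback *cb)
        {
                struct net *net = sock_net(skb->sk);
                struct net_device *dev;
                int idx = 0;

                rcu_read_lock();        /* instead of holding rtnl */
                for_each_netdev_rcu(net, dev) {
                        if (idx++ < cb->args[0])
                                continue;
                        /* fill one RTM_NEWLINK message for dev here */
                }
                rcu_read_unlock();
                cb->args[0] = idx;
                return skb->len;
        }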

    Signed-off-by: Eric Dumazet
    Cc: Patrick McHardy
    Cc: Remi Denis-Courmont
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller


23 Apr, 2011

1 commit


13 Mar, 2011

1 commit


17 Oct, 2010

1 commit

  • While doing profile analysis, I found fib_hash_table was sometimes
    in a cache line shared by a possibly often written kernel structure.

    (CONFIG_IP_ROUTE_MULTIPATH || !CONFIG_IPV6_MULTIPLE_TABLES)

    It's hard to detect because it's not easily reproducible.

    Make sure we allocate a full cache line to keep this table shared
    in all CPUs' caches.
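
    A sketch of the described fix in the per-net init path (max_t() and
    L1_CACHE_BYTES are real helpers; the surrounding function is
    illustrative):

        static int __net_init fib6_net_init_sketch(struct net *net)
        {
                /* Round up to a full cache line so a frequently
                 * written neighbouring object cannot share it. */
                size_t size = sizeof(struct hlist_head) * FIB6_TABLE_HASHSZ;

                size = max_t(size_t, size, L1_CACHE_BYTES);
                net->ipv6.fib_table_hash = kzalloc(size, GFP_KERNEL);
                return net->ipv6.fib_table_hash ? 0 : -ENOMEM;
        }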

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


11 Jun, 2010

1 commit


21 Apr, 2010

1 commit


12 Apr, 2010

1 commit


31 Mar, 2010

1 commit

  • addr_bit_test() is used in various places in the IPv6 routing table
    subsystem. It checks whether the bit at the given position fn_bit is
    set, where fn_bit counts bits from the MSB across the words of the
    address in network order.

    fn_bit : 0 .... 31 32 .... 63 64 .... 95 96 .... 127

    fn_bit >> 5 gives the offset of the word, and (~fn_bit & 0x1f) gives
    the count from the LSB within the network-endian word in question.

    fn_bit >> 5 : 0 1 2 3
    ~fn_bit & 0x1f: 31 .... 0 31 .... 0 31 .... 0 31 .... 0

    Thus, the mask was generated as htonl(1 << (~fn_bit & 0x1f)). This
    can be optimized by a "swizzle" (see
    include/asm-generic/bitops/le.h).

    In little-endian,
    htonl(1 << bit) = 1 << (bit ^ BITOP_BE32_SWIZZLE)
    where
    BITOP_BE32_SWIZZLE is (0x1f & ~7)
    So,
    htonl(1 << (~fn_bit & 0x1f)) = 1 << ((~fn_bit & 0x1f) ^ (0x1f & ~7))
    = 1 << ((~fn_bit ^ ~7) & 0x1f)
    = 1 << ((~fn_bit ^ BITOP_BE32_SWIZZLE) & 0x1f)

    In big-endian, BITOP_BE32_SWIZZLE is equal to 0.
    1 << ((~fn_bit ^ BITOP_BE32_SWIZZLE) & 0x1f)
    = 1 << ((~fn_bit) & 0x1f)
    = htonl(1 << (~fn_bit & 0x1f))
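
    Putting the derivation into code, roughly the shape the optimized
    helper takes (with BITOP_BE32_SWIZZLE defined per the text above):

        #if defined(__LITTLE_ENDIAN)
        # define BITOP_BE32_SWIZZLE     (0x1f & ~7)
        #else
        # define BITOP_BE32_SWIZZLE     0
        #endif

        static __be32 addr_bit_set(const void *token, int fn_bit)
        {
                const __be32 *addr = token;

                /* Equals htonl(1 << (~fn_bit & 0x1f)) & addr[fn_bit >> 5]
                 * on both endiannesses, without the byte swap. */
                return (__force __be32)(1 << ((~fn_bit ^ BITOP_BE32_SWIZZLE) & 0x1f))
                       & addr[fn_bit >> 5];
        }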

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller


30 Mar, 2010

1 commit

  • …implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming their availability. As
    this conversion needs to touch a large number of source files, the
    following script was used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order
    conforms to its surroundings. It's put in the include block which
    contains core kernel includes, in the same order that the rest are
    ordered - alphabetical, Christmas tree, rev-Xmas-tree, or at the end
    if there doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints
    out an error message indicating which .h file needs to be added to
    the file. (A tiny example of the resulting edits follows this list.)
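
    As a tiny hypothetical example of the resulting edits, a .c file
    that only uses kmalloc()/kfree() gains an explicit include instead
    of inheriting it through percpu.h:

        #include <linux/slab.h> /* kmalloc, kfree: no longer implicit */

    while a file that only uses allocation flags gets gfp.h instead:

        #include <linux/gfp.h>  /* GFP_KERNEL and friends */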

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored, as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and
    failures were fixed. CONFIG_GCOV_KERNEL was turned off for all
    tests (as my distributed build env didn't work with gcov compiles)
    and a few more options had to be turned off depending on the arch
    to make things build (like ipr on powerpc/64, which failed due to
    missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on
    step 6, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one
    of the arch headers, which should be easily discoverable on most
    builds of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


19 Feb, 2010

1 commit


13 Feb, 2010

1 commit

  • When the fib size exceeds what can be dumped in a single skb, the
    dump is suspended and resumed once the last skb has been received
    by userspace. When the fib is changed while the dump is suspended,
    the walker might contain stale pointers, causing a crash when the
    dump is resumed.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: fib6_walk_continue+0xbb/0x124 [ipv6]
    PGD 5347a067 PUD 65c7067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    ...
    RIP: 0010: fib6_walk_continue+0xbb/0x124 [ipv6]
    ...
    Call Trace:
    ? mutex_spin_on_owner+0x59/0x71
    inet6_dump_fib+0x11b/0x1b9 [ipv6]
    netlink_dump+0x5b/0x19e
    ? consume_skb+0x28/0x2a
    netlink_recvmsg+0x1ab/0x2c6
    ? netlink_unicast+0xfa/0x151
    __sock_recvmsg+0x6d/0x79
    sock_recvmsg+0xca/0xe3
    ? autoremove_wake_function+0x0/0x38
    ? radix_tree_lookup_slot+0xe/0x10
    ? find_get_page+0x90/0xa5
    ? filemap_fault+0x201/0x34f
    ? fget_light+0x2f/0xac
    ? verify_iovec+0x4f/0x94
    sys_recvmsg+0x14d/0x223

    Store the serial number when beginning to walk the fib and reload
    pointers when continuing to walk after a change occurred. Similar
    to other dumping functions, this might cause unrelated entries to
    be missed when entries are deleted.
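
    A hedged sketch of the resume check described above; the walker
    field names follow net/ipv6/ip6_fib.c of that era, and the
    cb->args slot holding the serial number is an assumption:

        static void fib6_dump_resume_sketch(struct fib6_walker_t *w,
                                            struct netlink_callback *cb)
        {
                if (cb->args[5] != w->root->fn_sernum) {
                        /* Tree changed while suspended: restart from
                         * the root, skipping entries already dumped. */
                        cb->args[5] = w->root->fn_sernum;
                        w->state = FWS_INIT;
                        w->node = w->root;
                        w->skip = w->count;
                } else {
                        w->skip = 0;
                }
        }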

    Tested-by: Ben Greear
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller


18 Jan, 2010

1 commit


31 Jul, 2009

1 commit

  • Choose saner defaults for xfrm[4|6] gc_thresh values on init

    Currently, the xfrm[4|6] code has hard-coded initial gc_thresh
    values (set to 1024). Given that the ipv4 and ipv6 routing caches
    are sized dynamically at boot time, the static selections can be
    nonsensical. This patch dynamically selects an appropriate gc
    threshold based on the corresponding main routing table size, using
    the assumption that we should in the worst case be able to handle
    as many connections as the routing table can.

    For ipv4, the maximum route cache size is 16 * the number of hash
    buckets in the route cache. Given that xfrm4 starts garbage
    collection at the gc_thresh and prevents new allocations at 2 *
    gc_thresh, we set gc_thresh to half the maximum route cache size.

    For ipv6, it's a bit trickier: there is no maximum route cache
    size, but the ipv6 dst_ops gc_thresh is statically set to 1024. It
    seems sane to select a similar gc_thresh for the xfrm6 code that is
    half the number of hash buckets in the v6 route cache times 16
    (like the v4 code does).
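
    A sketch of the sizing rules under those assumptions; rt_max_size
    as the v4 route cache maximum handed in at init, and the v6
    constant mirroring "hash buckets * 16, halved" from the text (names
    and wiring are illustrative):

        void __init xfrm4_init_sketch(int rt_max_size)
        {
                /* GC starts at gc_thresh and allocation stops at
                 * 2 * gc_thresh, so half the cache maximum suffices. */
                xfrm4_dst_ops.gc_thresh = rt_max_size / 2;
        }

        void __init xfrm6_init_sketch(void)
        {
                xfrm6_dst_ops.gc_thresh = (FIB6_TABLE_HASHSZ * 16) / 2;
        }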

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller


14 Jan, 2009

1 commit

  • When a fib6 table dump is prematurely ended, we won't unlink
    its walker from the list. This causes all sorts of grief for
    other users of the list later.
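
    A plausible shape of the cleanup, as a hedged sketch; the
    fib6_dump_done()/fib6_walker_unlink() names exist in
    net/ipv6/ip6_fib.c, but the exact wiring here is an assumption:

        static int fib6_dump_done(struct netlink_callback *cb)
        {
                struct fib6_walker_t *w = (void *)cb->args[2];

                if (w) {
                        fib6_walker_unlink(w); /* off the global list */
                        cb->args[2] = 0;
                        kfree(w);
                }
                return 0;
        }

    Registering this as cb->done at dump start lets netlink invoke it
    even when the dump terminates early.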

    Reported-by: Chris Caputo
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller


15 Aug, 2008

1 commit


26 Jul, 2008

1 commit

  • Removes a legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better with automated debugging aids such
    as kerneloops.org (and others), and is unambiguous due to better
    naming. Non-intuitively, BUG_TRAP() is actually equal to WARN_ON()
    rather than BUG_ON(); some instances might eventually be promoted
    to BUG_ON(), but I left that for the future.

    I could make at least one BUILD_BUG_ON conversion.
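
    The mechanical rule, in a tiny illustrative function: BUG_TRAP(x)
    only logged when x was false, so the equivalent is WARN_ON(!(x)):

        static void sketch(struct sk_buff *skb)
        {
                /* was: BUG_TRAP(skb != NULL); */
                WARN_ON(!skb);
        }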

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller


23 Jul, 2008

5 commits


22 Jul, 2008

1 commit

  • This fixes the bridge reference count problem and cleans up the
    ipv6 FIB timer management. Don't use the expires field to test
    whether the timer is armed; that is not a proper test. Use
    timer_pending() instead.
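
    A minimal sketch of the rule, assuming a per-net embedded FIB GC
    timer as in that era's netns_ipv6 (field name and embedding treated
    as assumptions):

        static void fib6_gc_rearm_sketch(struct net *net,
                                         unsigned long delay)
        {
                /* was: testing net->ipv6.ip6_fib_timer->expires */
                if (!timer_pending(&net->ipv6.ip6_fib_timer))
                        mod_timer(&net->ipv6.ip6_fib_timer,
                                  jiffies + delay);
        }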

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller


12 Jun, 2008

1 commit


22 Apr, 2008

1 commit


18 Apr, 2008

1 commit


26 Mar, 2008

1 commit


05 Mar, 2008

1 commit