20 Oct, 2011

1 commit


03 Aug, 2011

1 commit

  • Gergely Kalman reported crashes in check_peer_redir().

    It appears commit f39925dbde778 (ipv4: Cache learned redirect
    information in inetpeer.) added a race, leading to possible NULL ptr
    dereference.

    Since we can now change dst neighbour, we should make sure a reader can
    safely use a neighbour.

    Add RCU protection to dst neighbour, and make sure check_peer_redir()
    can be called safely by different cpus in parallel.

    As neighbours are already freed after one RCU grace period, this patch
    should not add the typical RCU penalty (cache-cold effects).

    Many thanks to Gergely for providing a pretty report pointing to the
    bug.

    Reported-by: Gergely Kalman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Jul, 2011

2 commits


10 Jun, 2011

1 commit

  • The message size allocated for rtnl ifinfo dumps was limited to
    a single page. This is not enough for additional interface info
    available with devices that support SR-IOV and caused a bug in
    which VF info would not be displayed if more than approximately
    40 VFs were created per interface.

    Implement a new function pointer for the rtnl_register service that will
    calculate the amount of data required for the ifinfo dump and allocate
    enough data to satisfy the request.

    Signed-off-by: Greg Rose
    Signed-off-by: Jeff Kirsher
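
    The sizing idea can be sketched in userspace C as follows. This is
    only an illustration under stated assumptions: the names calcit_fn,
    ifinfo_size_sketch and dump_alloc_size, as well as the byte counts,
    are hypothetical and are not taken from the patch.

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH   4096u  /* assumption: one 4 KiB page */
#define IFINFO_BASE_SKETCH  256u  /* hypothetical fixed per-interface cost */
#define IFINFO_VF_SKETCH    100u  /* hypothetical per-VF cost */

/* Hypothetical callback type: reports how many bytes a dump needs. */
typedef size_t (*calcit_fn)(unsigned int num_vfs);

/* Example callback: the size grows linearly with the number of VFs. */
static size_t ifinfo_size_sketch(unsigned int num_vfs)
{
    return IFINFO_BASE_SKETCH + (size_t)num_vfs * IFINFO_VF_SKETCH;
}

/* Use the callback when one is registered, but never go below a page. */
static size_t dump_alloc_size(calcit_fn calcit, unsigned int num_vfs)
{
    size_t need = calcit ? calcit(num_vfs) : 0;

    return need > PAGE_SIZE_SKETCH ? need : PAGE_SIZE_SKETCH;
}
```

    Without a callback the sketch falls back to a single page, which is
    the old behaviour described above; with one, a device with many VFs
    gets a buffer sized to its actual needs.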

    Greg Rose
     

03 May, 2011

1 commit

  • Four years ago, Patrick made a change to hold rtnl mutex during netlink
    dump callbacks.

    I believe it was a wrong move. This slows down concurrent dumps, making
    good old /proc/net/ files faster than rtnetlink in some situations.

    This occurred to me because one "ip link show dev ..." was _very_ slow
    on a workload adding/removing network devices in background.

    All dump callbacks are able to use RCU locking now, so this patch does
    roughly a revert of commits :

    1c2d670f366 : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks
    6313c1e0992 : [RTNETLINK]: Remove unnecessary locking in dump callbacks

    This lets writers fight for the rtnl mutex while readers go full speed.

    It also takes care of phonet: phonet_route_get() is now called from an
    RCU read-side section, so I renamed it to phonet_route_get_rcu().

    Signed-off-by: Eric Dumazet
    Cc: Patrick McHardy
    Cc: Remi Denis-Courmont
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Apr, 2011

1 commit


13 Mar, 2011

1 commit


17 Oct, 2010

1 commit

  • While doing profile analysis, I found fib_hash_table was sometimes in a
    cache line shared with a possibly often-written kernel structure.

    (CONFIG_IP_ROUTE_MULTIPATH || !CONFIG_IPV6_MULTIPLE_TABLES)

    It's hard to detect because it is not easily reproducible.

    Make sure we allocate a full cache line to keep this shared in all cpus
    caches.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
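
    A userspace C11 sketch of the general technique, giving a read-mostly
    table a cache line of its own so that writes to neighbouring data do
    not invalidate it. The 64-byte line size and the names
    read_mostly_table, table_sketch and is_line_aligned are assumptions
    for illustration, not the kernel's actual declarations.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE_SKETCH 64  /* assumption: common x86 line size */

/* Aligning the first member aligns the struct and pads its size. */
struct read_mostly_table {
    alignas(CACHE_LINE_SIZE_SKETCH) void *buckets[4];
};

/* The struct occupies whole cache lines, so nothing else shares them. */
static_assert(sizeof(struct read_mostly_table) % CACHE_LINE_SIZE_SKETCH == 0,
              "table must fill whole cache lines");

static struct read_mostly_table table_sketch;

/* Returns 1 when p starts on a cache-line boundary. */
static int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_SIZE_SKETCH) == 0;
}
```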

    Eric Dumazet
     

11 Jun, 2010

1 commit


21 Apr, 2010

1 commit


12 Apr, 2010

1 commit


31 Mar, 2010

1 commit

  • addr_bit_test() is used in various places in IPv6 routing table
    subsystem. It checks if the given fn_bit is set,
    where fn_bit counts bits from MSB in words in network-order.

    fn_bit : 0 .... 31 32 .... 63 64 .... 95 96 ....127

    fn_bit >> 5 gives offset of word, and (~fn_bit & 0x1f) gives
    count from LSB in the network-endian word in question.

    fn_bit >> 5 : 0 1 2 3
    ~fn_bit & 0x1f: 31 .... 0 31 .... 0 31 .... 0 31 .... 0

    Thus, the mask was generated as htonl(1 << (~fn_bit & 0x1f)).
    This can be optimized by "swizzle" (See include/asm-generic/bitops/le.h).

    In little-endian,
    htonl(1 << bit) = 1 << (bit ^ BITOP_BE32_SWIZZLE)
    where
    BITOP_BE32_SWIZZLE is (0x1f & ~7)
    So,
    htonl(1 << (~fn_bit & 0x1f)) = 1 << ((~fn_bit & 0x1f) ^ (0x1f & ~7))
    = 1 << ((~fn_bit ^ ~7) & 0x1f)
    = 1 << ((~fn_bit ^ BITOP_BE32_SWIZZLE) & 0x1f)

    In big-endian, BITOP_BE32_SWIZZLE is equal to 0.
    1 << ((~fn_bit ^ BITOP_BE32_SWIZZLE) & 0x1f)
    = 1 << ((~fn_bit) & 0x1f)
    = htonl(1 << (~fn_bit & 0x1f))

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller
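
    The identity derived above can be checked with a small userspace C
    sketch. The names BE32_SWIZZLE_SKETCH, mask_htonl, mask_swizzle,
    masks_agree and addr_bit_is_set are made up for this illustration,
    and the byte-order detection assumes a GCC- or Clang-style
    __BYTE_ORDER__ macro (falling back to little-endian if it is absent).

```c
#include <arpa/inet.h>  /* htonl() */
#include <assert.h>
#include <stdint.h>

/* 0 on big-endian hosts, (0x1f & ~7) on little-endian ones. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define BE32_SWIZZLE_SKETCH 0u
#else
#define BE32_SWIZZLE_SKETCH (0x1fu & ~7u)
#endif

/* Original form: build the mask in host order, then byte-swap it. */
static uint32_t mask_htonl(unsigned int fn_bit)
{
    return htonl((uint32_t)1 << (~fn_bit & 0x1f));
}

/* Optimized form: fold the byte swap into the shift count. */
static uint32_t mask_swizzle(unsigned int fn_bit)
{
    return (uint32_t)1 << ((~fn_bit ^ BE32_SWIZZLE_SKETCH) & 0x1f);
}

/* Returns 1 if both forms agree for every bit of a 128-bit address. */
static int masks_agree(void)
{
    for (unsigned int fn_bit = 0; fn_bit < 128; fn_bit++)
        if (mask_htonl(fn_bit) != mask_swizzle(fn_bit))
            return 0;
    return 1;
}

/* Tests bit fn_bit (counted from the MSB) of a network-order address. */
static int addr_bit_is_set(const uint32_t addr[4], unsigned int fn_bit)
{
    assert(fn_bit < 128);
    return (addr[fn_bit >> 5] & mask_swizzle(fn_bit)) != 0;
}
```

    Calling masks_agree() exercises all 128 bit positions, so the
    equivalence does not depend on spot checks.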

    YOSHIFUJI Hideaki / 吉藤英明
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

19 Feb, 2010

1 commit


13 Feb, 2010

1 commit

  • When the fib size exceeds what can be dumped in a single skb, the
    dump is suspended and resumed once the last skb has been received
    by userspace. When the fib is changed while the dump is suspended,
    the walker might contain stale pointers, causing a crash when the
    dump is resumed.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] fib6_walk_continue+0xbb/0x124 [ipv6]
    PGD 5347a067 PUD 65c7067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    ...
    RIP: 0010:[]
    [] fib6_walk_continue+0xbb/0x124 [ipv6]
    ...
    Call Trace:
    [] ? mutex_spin_on_owner+0x59/0x71
    [] inet6_dump_fib+0x11b/0x1b9 [ipv6]
    [] netlink_dump+0x5b/0x19e
    [] ? consume_skb+0x28/0x2a
    [] netlink_recvmsg+0x1ab/0x2c6
    [] ? netlink_unicast+0xfa/0x151
    [] __sock_recvmsg+0x6d/0x79
    [] sock_recvmsg+0xca/0xe3
    [] ? autoremove_wake_function+0x0/0x38
    [] ? radix_tree_lookup_slot+0xe/0x10
    [] ? find_get_page+0x90/0xa5
    [] ? filemap_fault+0x201/0x34f
    [] ? fget_light+0x2f/0xac
    [] ? verify_iovec+0x4f/0x94
    [] sys_recvmsg+0x14d/0x223

    Store the serial number when beginning to walk the fib and reload
    pointers when continuing to walk after a change occurred. Similar
    to other dumping functions, this might cause unrelated entries to
    be missed when entries are deleted.

    Tested-by: Ben Greear
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller
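
    A simplified userspace C sketch of the serial-number idea. It uses a
    flat array instead of the fib6 tree, and it reloads the cursor by
    skipping the number of entries already dumped, which is an assumption
    of this sketch rather than something stated in the message. All names
    (table, walker, table_delete, walk_start, walk_continue) are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES_SKETCH 16

struct table {
    int entries[MAX_ENTRIES_SKETCH];
    size_t count;
    unsigned int sernum;    /* bumped on every modification */
};

struct walker {
    const int *cursor;      /* cached position; stale after a change */
    size_t dumped;          /* entries already handed to the reader */
    unsigned int sernum;    /* serial number seen when the walk stopped */
};

/* Remove entry idx and mark the table as changed. */
static void table_delete(struct table *t, size_t idx)
{
    assert(idx < t->count);
    for (size_t i = idx; i + 1 < t->count; i++)
        t->entries[i] = t->entries[i + 1];
    t->count--;
    t->sernum++;
}

static void walk_start(struct walker *w, const struct table *t)
{
    w->cursor = t->entries;
    w->dumped = 0;
    w->sernum = t->sernum;
}

/* Dump up to budget entries into out; returns the number written. */
static size_t walk_continue(struct walker *w, const struct table *t,
                            int *out, size_t budget)
{
    size_t n = 0;

    if (w->sernum != t->sernum) {
        /* The table changed while suspended: reload the cursor. */
        if (w->dumped > t->count)
            w->dumped = t->count;
        w->cursor = t->entries + w->dumped;
        w->sernum = t->sernum;
    }
    while (n < budget && w->cursor < t->entries + t->count) {
        out[n++] = *w->cursor++;
        w->dumped++;
    }
    return n;
}
```

    Deleting an already-dumped entry between two walk_continue() calls
    shifts the remaining entries, so the resumed walk skips one that was
    never dumped. That is the "unrelated entries missed" trade-off the
    message mentions, accepted in exchange for not following a stale
    pointer.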

    Patrick McHardy
     

18 Jan, 2010

1 commit


31 Jul, 2009

1 commit

  • Choose saner defaults for xfrm[4|6] gc_thresh values on init

    Currently, the xfrm[4|6] code has hard-coded initial gc_thresh values
    (set to 1024). Given that the ipv4 and ipv6 routing caches are sized
    dynamically at boot time, the static selections can be nonsensical.
    This patch dynamically selects an appropriate gc threshold based on
    the corresponding main routing table size, using the assumption that
    we should in the worst case be able to handle as many connections as
    the routing table can.

    For ipv4, the maximum route cache size is 16 * the number of hash
    buckets in the route cache. Given that xfrm4 starts garbage
    collection at the gc_thresh and prevents new allocations at 2 *
    gc_thresh, we set gc_thresh to half the maximum route cache size.

    For ipv6, it's a bit trickier. There is no maximum route cache size,
    but the ipv6 dst_ops gc_thresh is statically set to 1024. It seems
    sane to select a similar gc_thresh for the xfrm6 code that is half
    the number of hash buckets in the v6 route cache times 16 (like the v4
    code does).

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller
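
    The arithmetic described for ipv4 can be written out as a short C
    sketch. The helper names route_cache_max_size and xfrm_gc_thresh are
    hypothetical; only the factors 16 and 1/2 come from the text above.

```c
/* Maximum route cache size: 16 entries per hash bucket. */
static unsigned int route_cache_max_size(unsigned int hash_buckets)
{
    return hash_buckets * 16;
}

/*
 * Garbage collection starts at gc_thresh and allocations stop at
 * 2 * gc_thresh, so gc_thresh is half the maximum cache size.
 */
static unsigned int xfrm_gc_thresh(unsigned int hash_buckets)
{
    return route_cache_max_size(hash_buckets) / 2;
}
```

    For example, 1024 hash buckets give a maximum cache size of 16384 and
    a gc_thresh of 8192, so the hard allocation limit of 2 * gc_thresh
    coincides with the maximum cache size.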

    Neil Horman
     

14 Jan, 2009

1 commit

  • When a fib6 table dump is prematurely ended, we won't unlink
    its walker from the list. This causes all sorts of grief for
    other users of the list later.

    Reported-by: Chris Caputo
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

15 Aug, 2008

1 commit


26 Jul, 2008

1 commit

  • Removes legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better to automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuitively, BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON(), though some might actually be
    promoted to BUG_ON(); I left that for the future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller
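
    A userspace sketch of the relationship between the two macros. These
    are not the kernel definitions: WARN_ON_SKETCH, BUG_TRAP_SKETCH and
    warn_count are stand-ins for illustration. The point shown is that
    the legacy macro took the condition expected to hold, so a
    conversion to the warning macro has to negate it.

```c
#include <stdio.h>

static int warn_count;  /* number of warnings emitted so far */

/* Report and keep running; evaluates to 1 when cond was true. */
#define WARN_ON_SKETCH(cond) \
    ((cond) ? (warn_count++, \
               fprintf(stderr, "WARNING at %s:%d\n", __FILE__, __LINE__), 1) \
            : 0)

/* Legacy form asserted what must hold, so the condition is inverted. */
#define BUG_TRAP_SKETCH(x) ((void)WARN_ON_SKETCH(!(x)))
```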

    Ilpo Järvinen
     

23 Jul, 2008

5 commits


22 Jul, 2008

1 commit

  • This fixes the bridge reference count problem and cleans up ipv6 FIB
    timer management. Don't use the expires field, because it is not a
    proper way to test; use timer_pending() instead.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

12 Jun, 2008

1 commit


22 Apr, 2008

1 commit


18 Apr, 2008

1 commit


26 Mar, 2008

1 commit


05 Mar, 2008

2 commits


04 Mar, 2008

7 commits

  • The rt6_stats is now per namespace with this patch. It is allocated
    when a network namespace is created and freed when the network
    namespace exits and references are relative to the network namespace.

    Signed-off-by: Benjamin Thery
    Signed-off-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Benjamin Thery
     
  • This patch allocates the rt6_stats struct dynamically when the fib6 is
    initialized. That provides the ability to create several instances of
    this structure for the network namespaces.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • The fib6_clean_node function should have the network namespace it is
    working on. The fib6_cleaner_t structure is extended with the network
    namespace field to be passed to the fib6_clean_node function.

    The different functions calling the fib6_clean_node function are
    extended with the netns parameter when needed to propagate the netns
    pointer.

    Signed-off-by: Benjamin Thery
    Signed-off-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Benjamin Thery
     
  • Move the timer initialization to network namespace creation and
    store the network namespace in the timer argument.

    That enables multiple timers (one per network namespace) to do garbage
    collecting.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • The ip6_fib_timer gc timer is dynamically allocated and initialized in
    the ip6 fib init function. There are no more references to a static
    global variable. That will allow creating multiple instances of the
    garbage collecting timer and making them per namespace.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • The fib tables are now relative to the network namespace. When the
    garbage collector timer expires, we must have a network namespace
    parameter in order to retrieve the tables. For now this is the
    init_net, but we should be able to have a timer per namespace and use
    the timer callback parameter to pass the network namespace from the
    expired timer.

    The timer callback, fib6_run_gc, is actually called synchronously by
    some functions and asynchronously when the timer expires.

    When the timer expires, the delay specified for fib6_run_gc parameter
    is always zero. So, I changed fib6_run_gc to not be a timer callback
    but a function called by the timer callback and I added a timer
    callback where its work is just to retrieve from the data arg of the
    timer the network namespace and call fib6_run_gc with zero expiring
    time and the network namespace parameters. That makes the code cleaner
    for the fib6_run_gc callers.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Benjamin Thery
    Signed-off-by: David S. Miller
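
    The split between the gc function and its timer wrapper can be
    sketched in userspace C as below. The struct net_sketch type, its
    counters, and the _sketch function names are hypothetical; the sketch
    passes the namespace through a uintptr_t where the timer API of the
    time carried an unsigned long data argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a network namespace; the fields exist only for the demo. */
struct net_sketch {
    unsigned long gc_runs;       /* how often gc ran for this namespace */
    unsigned long last_expires;  /* expires value of the most recent run */
};

/* Plain function: synchronous callers pass their own expires value. */
static void fib6_run_gc_sketch(unsigned long expires, struct net_sketch *net)
{
    assert(net != NULL);
    net->gc_runs++;
    net->last_expires = expires;
}

/* Timer callback: recover the namespace from data, always use expires 0. */
static void fib6_gc_timer_cb_sketch(uintptr_t data)
{
    fib6_run_gc_sketch(0, (struct net_sketch *)data);
}
```

    Synchronous callers use fib6_run_gc_sketch() directly with whatever
    expiry they need, while the timer path only ever goes through the
    thin wrapper.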

    Daniel Lezcano
     
  • The function fib6_clean_all takes the network namespace as a
    parameter. That allows flushing the routes related to a specific
    network namespace.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano