12 Dec, 2011

1 commit


06 Dec, 2011

1 commit


19 Nov, 2011

1 commit

  • ip_gre: Set needed_headroom dynamically again

    Now that all needed_headroom users have been fixed up so that
    we can safely increase needed_headroom, this patch restores the
    dynamic update of needed_headroom.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

09 Nov, 2011

1 commit

  • Tunnels can force an alignment of their percpu data to reduce number of
    cache lines used in fast path, or read in .ndo_get_stats()

    percpu_alloc() is a very fine grained allocator, so any small hole will
    be used anyway.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2011

1 commit

  • It seems ip_gre is able to change dev->needed_headroom on the fly.

    Unfortunately this is not legal, and it triggers a BUG in raw_sendmsg():

    skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev)

    < another cpu changes dev->needed_headroom (making it bigger)

    ...
    skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));

    We end up with LL_RESERVED_SPACE() being bigger than LL_ALLOCATED_SPACE()
    -> we crash later because skb head is exhausted.

    Bug introduced in commit 243aad83 in 2.6.34 (ip_gre: include route
    header_len in max_headroom calculation)

    Reported-by: Elmar Vonlanthen
    Signed-off-by: Eric Dumazet
    CC: Timo Teräs
    CC: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
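
The mismatch described in the message above can be reproduced with a small user-space model. This is only a sketch: `toy_dev`, `toy_reserved_space()` and the two `toy_send_*` helpers are hypothetical names, the alignment rounding done by the real LL_RESERVED_SPACE() is ignored, and the single-read variant is shown purely for contrast. It is not presented as the change this commit made to ip_gre.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the two net_device fields involved; not the kernel struct. */
struct toy_dev {
    size_t hard_header_len;
    size_t needed_headroom;
};

/* Simplified LL_RESERVED_SPACE(): the real macro also rounds up for alignment. */
static size_t toy_reserved_space(const struct toy_dev *dev)
{
    assert(dev != NULL);
    return dev->hard_header_len + dev->needed_headroom;
}

/*
 * Returns 1 if headroom plus payload still fit in the allocation, 0 otherwise.
 * 'grow' models another CPU enlarging needed_headroom between the two reads.
 */
static int toy_send_two_reads(struct toy_dev *dev, size_t payload, size_t grow)
{
    size_t allocated, reserved;

    allocated = payload + toy_reserved_space(dev); /* read at allocation time */
    dev->needed_headroom += grow;                  /* concurrent update */
    reserved = toy_reserved_space(dev);            /* read again at reserve time */
    return reserved + payload <= allocated;
}

/* Same sequence, but the headroom is read once and the value is reused. */
static int toy_send_one_read(struct toy_dev *dev, size_t payload, size_t grow)
{
    size_t allocated, reserved;

    reserved = toy_reserved_space(dev);            /* single read */
    allocated = payload + reserved;
    dev->needed_headroom += grow;                  /* update no longer matters here */
    return reserved + payload <= allocated;
}
```

With a 14-byte link header and a 100-byte payload, growing the headroom by 24 bytes between the two reads makes the first helper report an overflow, while the second keeps allocation and reservation consistent.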
     

18 Jul, 2011

1 commit


06 May, 2011

1 commit

  • Force dev_alloc_name() to be called from register_netdevice() by
    dev_get_valid_name(). That allows removing multiple explicit
    dev_alloc_name() calls.

    The possibility to call dev_alloc_name in advance remains.

    This also fixes the veth creation regression caused by
    84c49d8c3e4abefb0a41a77b25aa37ebe8d6b743

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

05 May, 2011

1 commit


23 Apr, 2011

1 commit


13 Mar, 2011

1 commit


11 Mar, 2011

1 commit


10 Mar, 2011

1 commit

  • Since a8f80e8ff94ecba629542d9b4b5f5a8ee3eb565c any process with
    CAP_NET_ADMIN may load any module from /lib/modules/. This doesn't mean
    that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are
    limited to /lib/modules/**. However, the CAP_NET_ADMIN capability shouldn't
    allow anybody to load any module not related to networking.

    This patch restricts module autoloading to netdev modules
    with explicit aliases. This fixes CVE-2011-1019.

    Arnd Bergmann suggested leaving untouched the old pre-v2.6.32 behavior
    of loading netdev modules by name (without any prefix) for processes
    with CAP_SYS_MODULE to maintain the compatibility with network scripts
    that use autoloading netdev modules by aliases like "eth0", "wlan0".

    Currently there are only three users of the feature in the upstream
    kernel: ipip, ip_gre and sit.

    root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) --
    root@albatros:~# grep Cap /proc/$$/status
    CapInh: 0000000000000000
    CapPrm: fffffff800001000
    CapEff: fffffff800001000
    CapBnd: fffffff800001000
    root@albatros:~# modprobe xfs
    FATAL: Error inserting xfs
    (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# lsmod | grep sit
    root@albatros:~# ifconfig sit
    sit: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep sit
    root@albatros:~# ifconfig sit0
    sit0 Link encap:IPv6-in-IPv4
    NOARP MTU:1480 Metric:1

    root@albatros:~# lsmod | grep sit
    sit 10457 0
    tunnel4 2957 1 sit

    For CAP_SYS_MODULE module loading is still relaxed:

    root@albatros:~# grep Cap /proc/$$/status
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    xfs 745319 0

    Reference: https://lkml.org/lkml/2011/2/24/203

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: Michael Tokarev
    Acked-by: David S. Miller
    Acked-by: Kees Cook
    Signed-off-by: James Morris

    Vasiliy Kulikov
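
The capability split described above can be modelled in a few lines of user-space C. This is a hedged sketch, not the kernel's dev_load(): `toy_autoload_names()` is a hypothetical helper, the capability flags are plain integers, and the "netdev-" alias prefix is stated from memory of the kernel sources rather than from the message itself. The function only lists which module names would be requested, in order.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_NAME_MAX 32

/*
 * Fills 'out' with the module names a dev_load()-like helper would ask for
 * and returns how many there are (0, 1 or 2).
 *
 * - CAP_NET_ADMIN alone may only request the prefixed netdev alias.
 * - CAP_SYS_MODULE additionally keeps the old behavior of requesting the
 *   bare interface name.
 */
static int toy_autoload_names(const char *ifname, int cap_net_admin,
                              int cap_sys_module, char out[2][TOY_NAME_MAX])
{
    int n = 0;

    assert(ifname != NULL && strlen(ifname) > 0);
    if (cap_net_admin)
        snprintf(out[n++], TOY_NAME_MAX, "netdev-%s", ifname);
    if (cap_sys_module)
        snprintf(out[n++], TOY_NAME_MAX, "%s", ifname);
    return n;
}
```

Under this model, `ifconfig xfs` from a CAP_NET_ADMIN-only shell asks for "netdev-xfs", which no module provides, whereas `ifconfig sit0` asks for "netdev-sit0", which the sit module can satisfy if it declares that alias. That matches the transcript in the message.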
     

03 Mar, 2011

1 commit


12 Feb, 2011

1 commit


13 Dec, 2010

2 commits

  • Always go through a new ip4_dst_hoplimit() helper, just like ipv6.

    This allowed several simplifications:

    1) The interim dst_metric_hoplimit() can go as it's no longer
    used.

    2) The sysctl_ip_default_ttl entry no longer needs to use
    ipv4_doint_and_flush, since the sysctl is not cached in
    routing cache metrics any longer.

    3) ipv4_doint_and_flush no longer needs to be exported and
    therefore can be marked static.

    When ipv4_doint_and_flush_strategy was removed some time ago,
    the external declaration in ip.h was mistakenly left around
    so kill that off too.

    We have to move the sysctl_ip_default_ttl declaration into
    ipv4's route cache definition header net/route.h, because
    currently net/ip.h (where the declaration lives now) has
    a back dependency on net/route.h

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: David S. Miller

    David S. Miller
     

10 Dec, 2010

1 commit

  • Use helper functions to hide all direct accesses, especially writes,
    to dst_entry metrics values.

    This will allow us to:

    1) More easily change how the metrics are stored.

    2) Implement COW for metrics.

    In particular this will help us put metrics into the inetpeer
    cache if that is what we end up doing. We can make the _metrics
    member a pointer instead of an array, initially have it point
    at the read-only metrics in the FIB, and then on the first set
    grab an inetpeer entry and point the _metrics member there.

    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
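
The copy-on-write idea sketched in the message can be illustrated in user space as follows. All names here (`toy_dst`, `toy_dst_metric()`, `toy_dst_metric_set()`) are hypothetical, and a private malloc()ed copy stands in for the inetpeer entry the message mentions; the kernel's eventual implementation may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_METRIC_MAX 4

/* Toy dst entry: 'metrics' starts out pointing at shared read-only values. */
struct toy_dst {
    const unsigned int *metrics; /* what readers use */
    unsigned int *own;           /* private copy, allocated on first write */
};

static void toy_dst_init(struct toy_dst *dst, const unsigned int *shared)
{
    assert(dst != NULL && shared != NULL);
    dst->metrics = shared;
    dst->own = NULL;
}

/* Read accessor: callers never touch the array directly. */
static unsigned int toy_dst_metric(const struct toy_dst *dst, int idx)
{
    assert(idx >= 0 && idx < TOY_METRIC_MAX);
    return dst->metrics[idx];
}

/* Write accessor: copies the shared metrics on the first set. Returns 0 or -1. */
static int toy_dst_metric_set(struct toy_dst *dst, int idx, unsigned int val)
{
    assert(idx >= 0 && idx < TOY_METRIC_MAX);
    if (!dst->own) {
        dst->own = malloc(TOY_METRIC_MAX * sizeof(*dst->own));
        if (!dst->own)
            return -1;
        memcpy(dst->own, dst->metrics, TOY_METRIC_MAX * sizeof(*dst->own));
        dst->metrics = dst->own; /* readers now see the private copy */
    }
    dst->own[idx] = val;
    return 0;
}

static void toy_dst_release(struct toy_dst *dst)
{
    free(dst->own);
    dst->own = NULL;
    dst->metrics = NULL;
}
```

Because every access goes through the two helpers, the storage can change from an embedded array to a pointer without touching callers, which is the point the message makes about hiding direct accesses.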
     

02 Dec, 2010

2 commits


18 Nov, 2010

1 commit


16 Nov, 2010

1 commit

  • The GRE Key field is intended to be used for identifying an individual
    traffic flow within a tunnel. It is useful to be able to have XFRM
    policy selector matches to have different policies for different
    GRE tunnels.

    Signed-off-by: Timo Teräs
    Signed-off-by: David S. Miller

    Timo Teräs
     

12 Nov, 2010

1 commit


31 Oct, 2010

1 commit

  • Before making the fallback tunnel visible to lookups, we should make
    sure it is completely set up, i.e. after ipgre_tunnel_init() has been
    called and the tstats per_cpu pointer allocated.

    Move rcu_assign_pointer(ign->tunnels_wc[0], tunnel); from
    ipgre_fb_tunnel_init() to ipgre_init_net()

    Based on a patch from Pavel Emelyanov

    Reported-by: Pavel Emelyanov
    Signed-off-by: Eric Dumazet
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Oct, 2010

1 commit

  • After adding RCU protection for tunnels (ipip, gre, sit and ip6) a bug
    was introduced into the SIOCCHGTUNNEL code.

    The tunnel is first unlinked, then its addresses change, then it is linked
    back, probably into another bucket. But while the parms are being changed,
    the hash table is unlocked to readers and they can look up the wrong tunnel.

    Respective commits are b7285b79 (ipip: get rid of ipip_lock), 1507850b
    (gre: get rid of ipgre_lock), 3a43be3c (sit: get rid of ipip6_lock) and
    94767632 (ip6tnl: get rid of ip6_tnl_lock).

    The quick fix is to wait for quiescent state to pass after unlinking,
    but if it is inappropriate I can invent something better, just let me
    know.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

19 Oct, 2010

1 commit


06 Oct, 2010

1 commit

  • In various situations, a device provides a packet to our stack and we
    drop it before it enters the protocol stack:
    - softnet backlog full (accounted in /proc/net/softnet_stat)
    - bad vlan tag (not accounted)
    - unknown/unregistered protocol (not accounted)

    We can handle a per-device counter of such dropped frames at core level,
    and automatically add it to the device provided stats (rx_dropped), so
    that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)

    This is a generalization of commit 8990f468a (net: rx_dropped
    accounting), thus reverting it.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2010

2 commits

  • HARD_TX_LOCK no longer protects tunnels from dead loops;
    the xmit_recursion percpu counter does.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
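
A recursion counter of the kind named in the message can be sketched as below. This is an illustrative single-threaded model only: `toy_xmit_recursion`, `TOY_RECURSION_LIMIT` and `toy_dev_queue_xmit()` are hypothetical names, the limit value is an assumption, and a plain static variable stands in for the kernel's per-CPU counter.

```c
#include <assert.h>

/* Assumed limit for the sketch; not taken from the kernel sources. */
#define TOY_RECURSION_LIMIT 3

/* Stand-in for a per-CPU counter; a single thread plays the role of one CPU. */
static int toy_xmit_recursion;

/*
 * Models a transmit path where a tunnel device re-enters the transmit
 * function 'nested' more times (tunnel over tunnel over ...).
 * Returns 0 if the packet reaches the bottom device, -1 if it is dropped
 * because the nesting exceeded the limit.
 */
static int toy_dev_queue_xmit(int nested)
{
    int rc;

    assert(nested >= 0);
    if (toy_xmit_recursion > TOY_RECURSION_LIMIT)
        return -1;                      /* dead loop suspected: drop */

    toy_xmit_recursion++;
    rc = nested > 0 ? toy_dev_queue_xmit(nested - 1) : 0;
    toy_xmit_recursion--;
    return rc;
}
```

Since the guard is a counter rather than a lock, it detects runaway nesting without serializing transmitters on the device.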
     
  • GRE tunnels can benefit from lockless xmits, using NETIF_F_LLTX

    Note: If tunnels are created with the "oseq" option, LLTX is not
    enabled:

    Even using an atomic_t o_seq, we would increase the chance of packets
    being out of order at the receiver.

    Bench on a 16 cpus machine (dual E5540 cpus), 16 threads sending
    10000000 UDP frames via one gre tunnel (size:200 bytes per frame)

    Before patch :
    real 3m0.094s
    user 0m9.365s
    sys 47m50.103s

    After patch:
    real 0m29.756s
    user 0m11.097s
    sys 7m33.012s

    Last problem to solve is the contention on dst :

    38660.00 21.4% __ip_route_output_key vmlinux
    20786.00 11.5% dst_release vmlinux
    14191.00 7.8% __xfrm_lookup vmlinux
    12410.00 6.9% ip_finish_output vmlinux
    4540.00 2.5% ip_push_pending_frames vmlinux
    4427.00 2.4% ip_append_data vmlinux
    4265.00 2.4% __alloc_skb vmlinux
    4140.00 2.3% __ip_local_out vmlinux
    3991.00 2.2% dev_queue_xmit vmlinux

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Sep, 2010

1 commit

  • On Monday 27 September 2010 at 14:29 +0100, Ben Hutchings wrote:

    > > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
    > > index 5d6ddcb..de39b22 100644
    > > --- a/net/ipv4/ip_gre.c
    > > +++ b/net/ipv4/ip_gre.c
    > [...]
    > > @@ -377,7 +405,7 @@ static struct ip_tunnel *ipgre_tunnel_locate(struct net *net,
    > > if (parms->name[0])
    > > strlcpy(name, parms->name, IFNAMSIZ);
    > > else
    > > - sprintf(name, "gre%%d");
    > > + strcpy(name, "gre%d");
    > >
    > > dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
    > > if (!dev)
    > [...]
    >
    > This is a valid fix, but doesn't belong in this patch!
    >

    Sorry ? It was not a fix, but at most a cleanup ;)

    Anyway I forgot the gretap case...

    [PATCH 2/4 v2] ip_gre: percpu stats accounting

    Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.

    Other seldom used fields are kept in netdev->stats structure, possibly
    unsafe.

    This is preliminary work to support a lockless transmit path and to
    correct the RX stats, which are already unsafe.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
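
The per-CPU accounting described above can be pictured with the following user-space sketch. It is a model under stated assumptions: `toy_tunnel`, `toy_pcpu_tstats` and the helpers are hypothetical names, a fixed array indexed by a caller-supplied CPU number replaces real per-CPU allocation, and no synchronization for 64-bit counters on 32-bit machines is shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_CPUS 4

/* The four hot counters kept per CPU, as listed in the message. */
struct toy_pcpu_tstats {
    uint64_t rx_packets;
    uint64_t rx_bytes;
    uint64_t tx_packets;
    uint64_t tx_bytes;
};

struct toy_tunnel {
    struct toy_pcpu_tstats tstats[TOY_NR_CPUS];
};

static void toy_tunnel_init(struct toy_tunnel *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
}

/* Fast path: each CPU only touches its own slot, so no lock is needed. */
static void toy_tx_account(struct toy_tunnel *t, unsigned int cpu, uint64_t len)
{
    assert(cpu < TOY_NR_CPUS);
    t->tstats[cpu].tx_packets++;
    t->tstats[cpu].tx_bytes += len;
}

/* Slow path (what a get-stats callback would do): sum over all CPUs. */
static struct toy_pcpu_tstats toy_get_stats(const struct toy_tunnel *t)
{
    struct toy_pcpu_tstats sum = {0, 0, 0, 0};
    unsigned int cpu;

    for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        sum.rx_packets += t->tstats[cpu].rx_packets;
        sum.rx_bytes += t->tstats[cpu].rx_bytes;
        sum.tx_packets += t->tstats[cpu].tx_packets;
        sum.tx_bytes += t->tstats[cpu].tx_bytes;
    }
    return sum;
}
```

The cost moves from the transmit path to the reader, which is called far less often.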
     

27 Sep, 2010

1 commit


24 Sep, 2010

1 commit


21 Sep, 2010

2 commits

  • Under load, netif_rx() can drop incoming packets but administrators don't
    have a chance to spot which device needs some tuning (RPS activation for
    example)

    This patch adds rx_dropped accounting in vlans and tunnels.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ipv6 can be a module, so we should test both CONFIG_IPV6 and
    CONFIG_IPV6_MODULE to enable the ipv6 bits in ip_gre.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Sep, 2010

1 commit

  • As RTNL is held while doing tunnel inserts and deletes, we can remove
    ipgre_lock spinlock. My initial RCU conversion was conservative and
    converted the rwlock to spinlock, with no RTNL requirement.

    Use appropriate rcu annotations and modern lockdep checks as well.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Aug, 2010

1 commit

  • PPP: introduce "pptp" module which implements point-to-point tunneling protocol using pppox framework
    NET: introduce the "gre" module for demultiplexing GRE packets on version criteria
    (required so that pptp and ip_gre may coexist)
    NET: ip_gre: update to use the "gre" module

    This patch introduces pptp support to the linux kernel, which
    dramatically speeds up pptp vpn connections and decreases cpu usage
    compared with the existing user-space implementation
    (poptop/pptpclient). There is an accel-pptp project
    (https://sourceforge.net/projects/accel-pptp/) to utilize this module;
    it contains a plugin for pppd to use pptp in client mode and a modified
    pptpd (poptop) to build a high-performance pptp NAS.

    There were many changes from the initially submitted patch; the most important are:
    1. using rcu instead of read-write locks
    2. using static bitmap instead of dynamically allocated
    3. using vmalloc for memory allocation instead of BITS_PER_LONG + __get_free_pages
    4. fixed many coding style issues
    Thanks to Eric Dumazet.

    Signed-off-by: Dmitry Kozlov
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Dmitry Kozlov
     

09 Jul, 2010

1 commit

  • This patch makes the IPv6 over IPv4 GRE tunnel propagate the traffic
    class field from the underlying IPv6 header to the IPv4 Type Of Service
    field. Without the patch, all IPv6 packets in the tunnel look the same to QoS.

    This assumes that the IPv6 traffic class is exactly the same
    as IPv4 TOS. Not sure if that is always the case? Maybe need
    to mask off some bits.

    The mask and shift to get tclass is copied from ipv6/datagram.c

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
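
The mask and shift mentioned in the message can be shown on raw header bytes. This is a user-space sketch with a hypothetical helper name, `toy_ipv6_tclass()`; it assembles the first 32-bit word of the IPv6 header by hand instead of using the kernel's header struct and byte-order helpers. The layout assumed is the standard one: 4 bits of version, 8 bits of traffic class, 20 bits of flow label.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Extracts the 8-bit traffic class from the first four bytes of an IPv6
 * header given in network byte order.
 */
static uint8_t toy_ipv6_tclass(const uint8_t *hdr)
{
    uint32_t word;

    assert(hdr != NULL);
    /* Build the first header word in host order, independent of endianness. */
    word = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
           ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];

    /* Drop the 20-bit flow label, then keep the 8 traffic class bits. */
    return (uint8_t)((word >> 20) & 0xff);
}
```

For a header starting with the bytes 6b 8a 12 34, the version is 6, the traffic class is 0xb8 and the flow label is 0xa1234; the value returned would be copied into the outer IPv4 TOS field under the assumption the message itself questions.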
     

11 Jun, 2010

1 commit


18 May, 2010

2 commits

  • This patch removes from net/ (but not any netfilter files)
    all the unnecessary return; statements that precede the
    last closing brace of void functions.

    It does not remove the returns that are immediately
    preceded by a label as gcc doesn't like that.

    Done via:
    $ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
    xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • skb rxhash should be cleared when a skb is handled by a tunnel before
    being delivered again, so that correct packet steering can take place.

    There are other cleanups and accounting that we can factorize in a new
    helper, skb_tunnel_rx()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo