23 Aug, 2014

1 commit

  • Commit 7a9bc9b81a5b ("ipv4: Elide fib_validate_source() completely when possible.")
    introduced a short-circuit to avoid calling fib_validate_source when not
    needed. That change took rp_filter into account, but not accept_local.
    This resulted in a change of behaviour: with rp_filter and accept_local
    off, incoming packets with a local address in the source field should be
    dropped, but after that commit they were accepted.

    Here is how to reproduce the change pre/post 7a9bc9b81a5b commit:
    - Configure the same IPv4 address on hosts A and B.
    - Try to send an ARP request from B to A.
    - The ARP request is dropped before that commit, but accepted and
      answered after it.

    This adds a check for ACCEPT_LOCAL, to maintain full
    fib validation in case it is 0. We also leave __fib_validate_source() earlier
    when possible, based on the same check as fib_validate_source(), once the
    accept_local stuff is verified.
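The decision described above can be sketched in userspace C. This is a minimal model, not the kernel code: `struct in_dev_conf`, `can_elide_source_validation()` and its parameters are hypothetical names chosen for illustration, and the real check may consider further conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device settings the check consults. */
struct in_dev_conf {
	int rp_filter;     /* 0 = reverse-path filtering off */
	bool accept_local; /* true = packets with a local source are accepted */
};

/*
 * Returns true when the full source validation can be skipped.
 * tclassid_users models a counter of routing-realm users.
 */
static bool can_elide_source_validation(const struct in_dev_conf *conf,
					bool has_ipsec_path,
					int tclassid_users)
{
	bool no_rpfilter;

	assert(conf != NULL);
	no_rpfilter = has_ipsec_path || conf->rp_filter == 0;

	/* accept_local == 0 keeps the full validation, so local sources are dropped. */
	return no_rpfilter && tclassid_users == 0 && conf->accept_local;
}
```

With `rp_filter` and `accept_local` both off, the function returns false, so the model falls through to full validation, which is the behaviour the fix restores.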

    Cc: Gregory Detal
    Cc: Christoph Paasch
    Cc: Hannes Frederic Sowa
    Cc: Sergei Shtylyov
    Signed-off-by: Sébastien Barré
    Signed-off-by: David S. Miller

    Sébastien Barré
     

17 Apr, 2014

1 commit

  • As suggested by Julian:

    Simply, flowi4_iif must not contain 0, it does not
    look logical to ignore all ip rules with specified iif.

    because in fib_rule_match() we do:

    if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
    goto out;

    flowi4_iif should be LOOPBACK_IFINDEX by default.
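To see why a zero input interface is a problem, here is a small userspace sketch built around the test quoted from fib_rule_match(). The `struct rule` and `struct flow` types and the `rule_iif_matches()` helper are simplified, hypothetical stand-ins; only the comparison itself comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LOOPBACK_IFINDEX 1 /* the loopback device has ifindex 1 */

struct rule { int iifindex; };  /* 0 means the rule names no iif */
struct flow { int flowi_iif; };

static bool rule_iif_matches(const struct rule *rule, const struct flow *fl)
{
	assert(rule != NULL && fl != NULL);

	/* Same test as the fragment quoted from fib_rule_match(). */
	if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
		return false;
	return true;
}
```

A flow whose `flowi_iif` is 0 can never match a rule that specifies an iif, including a rule for the loopback device, whereas a flow defaulting to `LOOPBACK_IFINDEX` matches such a rule.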

    We need to move LOOPBACK_IFINDEX to include/net/flow.h:

    1) It is mostly used by flowi_iif

    2) Doing so fixes the following compile error, which appears once the
    later patches use it in flow.h:

    In file included from include/linux/netfilter.h:277:0,
    from include/net/netns/netfilter.h:5,
    from include/net/net_namespace.h:21,
    from include/linux/netdevice.h:43,
    from include/linux/icmpv6.h:12,
    from include/linux/ipv6.h:61,
    from include/net/ipv6.h:16,
    from include/linux/sunrpc/clnt.h:27,
    from include/linux/nfs_fs.h:30,
    from init/do_mounts.c:32:
    include/net/flow.h: In function ‘flowi4_init_output’:
    include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)

    Cc: Eric Biederman
    Cc: Julian Anastasov
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

25 Mar, 2014

1 commit


25 Jan, 2014

1 commit

  • The two commits 0115e8e30d (net: remove delay at device dismantle) and
    748e2d9396a (net: reinstate rtnl in call_netdevice_notifiers()) silently
    removed a NULL pointer check for in_dev since Linux 3.7.

    This patch re-introduces the check, because without it the kernel crashes
    when small MTU values are set on non-IP-capable netdevices.
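The shape of such a guard can be sketched in userspace C. All names here (`struct net_device`, `ip_ptr`, `on_mtu_change()`, the return codes) are simplified assumptions for illustration and do not reproduce the actual handler.

```c
#include <assert.h>
#include <stddef.h>

struct in_device { int ifa_count; };

struct net_device {
	struct in_device *ip_ptr; /* NULL when the device carries no IPv4 state */
	unsigned int mtu;
};

enum { EVENT_IGNORED = 0, EVENT_HANDLED = 1 };

static int on_mtu_change(const struct net_device *dev)
{
	const struct in_device *in_dev;

	assert(dev != NULL);
	in_dev = dev->ip_ptr;

	/* The re-introduced check: bail out before touching in_dev. */
	if (in_dev == NULL)
		return EVENT_IGNORED;

	/* Safe to dereference from here on. */
	return in_dev->ifa_count >= 0 ? EVENT_HANDLED : EVENT_IGNORED;
}
```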

    Signed-off-by: Oliver Hartkopp
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Oliver Hartkopp
     

19 Oct, 2013

1 commit


28 Jun, 2013

1 commit

  • Since (c05cdb1 netlink: allow large data transfers from user-space),
    netlink splats if it invokes skb_clone on large netlink skbs since:

    * skb_shared_info was not correctly initialized.
    * skb->destructor is not set in the cloned skb.

    This was spotted by trinity:

    [ 894.990671] BUG: unable to handle kernel paging request at ffffc9000047b001
    [ 894.991034] IP: [] skb_clone+0x24/0xc0
    [...]
    [ 894.991034] Call Trace:
    [ 894.991034] [] nl_fib_input+0x6a/0x240
    [ 894.991034] [] ? _raw_read_unlock+0x26/0x40
    [ 894.991034] [] netlink_unicast+0x169/0x1e0
    [ 894.991034] [] netlink_sendmsg+0x251/0x3d0

    Fix it by:

    1) introducing a new netlink_skb_clone function that is used in nl_fib_input,
    that sets our special skb->destructor in the cloned skb. Moreover, handle
    the release of the large cloned skb head area in the destructor path.

    2) not allowing large skbuffs in the netlink broadcast path. I cannot find
    any reasonable use of the large data transfer using netlink in that path,
    moreover this helps to skip extra skb_clone handling.

    I found two more netlink clients that are cloning the skbs, but they are
    not in the sendmsg path. Therefore, the sole client cloning that I found
    seems to be the fib frontend.

    Thanks to Eric Dumazet for helping to address this issue.

    Reported-by: Fengguang Wu
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
     

29 May, 2013

1 commit

  • So far, only net_device * could be passed along with netdevice notifier
    event. This patch provides a possibility to pass custom structure
    able to provide info that event listener needs to know.
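One way to picture this is a base info structure that extended structures embed as their first member. The sketch below is a userspace model under that assumption; `struct mtu_change_info`, `notifier_info_to_dev()` and `listener_mtu_delta()` are hypothetical names, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

struct net_device { const char *name; };

/* Base info carried by every event. */
struct netdev_notifier_info {
	struct net_device *dev;
};

/* Hypothetical extended info: embeds the base as its first member. */
struct mtu_change_info {
	struct netdev_notifier_info info; /* must stay first */
	unsigned int old_mtu;
	unsigned int new_mtu;
};

/* Getter so listeners need not know the layout of the base struct. */
static struct net_device *notifier_info_to_dev(const struct netdev_notifier_info *info)
{
	assert(info != NULL);
	return info->dev;
}

/* A listener receives an opaque pointer and recovers what it needs. */
static unsigned int listener_mtu_delta(void *ptr)
{
	struct mtu_change_info *ci = ptr; /* valid because the base is the first member */

	assert(notifier_info_to_dev(&ci->info) != NULL);
	return ci->new_mtu > ci->old_mtu ? ci->new_mtu - ci->old_mtu
					 : ci->old_mtu - ci->new_mtu;
}
```

Listeners that only care about the device keep working through the getter, while listeners for a specific event can read the extra fields.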

    Signed-off-by: Jiri Pirko

    v2->v3: fix typo on simeth
    shortened dev_getter
    shortened notifier_info struct name
    v1->v2: fix notifier_call parameter in call_netdevice_notifier()
    Signed-off-by: David S. Miller

    Jiri Pirko
     

29 Mar, 2013

1 commit


22 Mar, 2013

1 commit


28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones, which take three arguments:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
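A three-argument hlist iterator can be sketched in userspace as follows. This is a simplified model that relies on the GNU `__typeof__` extension; the macros are written for illustration (the `hlist_entry_safe` here evaluates its pointer argument twice) and `struct item` and `sum_items()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* NULL-safe conversion from a node pointer to its containing entry. */
#define hlist_entry_safe(ptr, type, member) \
	((ptr) ? container_of((ptr), type, member) : NULL)

/* Three arguments, like list_for_each_entry(): no separate node cursor. */
#define hlist_for_each_entry(pos, head, member) \
	for (pos = hlist_entry_safe((head)->first, __typeof__(*(pos)), member); \
	     pos; \
	     pos = hlist_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

struct item { int value; struct hlist_node node; };

static int sum_items(struct hlist_head *head)
{
	struct item *it;
	int sum = 0;

	assert(head != NULL);
	hlist_for_each_entry(it, head, node)
		sum += it->value;
	return sum;
}
```

The entry pointer doubles as the cursor: the next node is reached through `pos->member.next`, which is why the extra parameter is not needed.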

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

12 Jan, 2013

1 commit

  • In fib_frontend.c, there is a confusing comment; NETLINK_CB(skb).portid does not
    refer to the pid of the sending process, but rather to a netlink portid.

    Signed-off-by: Rami Rosen
    Signed-off-by: David S. Miller

    Rami Rosen
     

19 Nov, 2012

3 commits

  • - Only allow moving network devices to network namespaces you have
    CAP_NET_ADMIN privileges over.

    - Enable creating/deleting/modifying interfaces
    - Enable adding/deleting addresses
    - Enable adding/setting/deleting neighbour entries
    - Enable adding/removing routes
    - Enable adding/removing fib rules
    - Enable setting the forwarding state
    - Enable adding/removing ipv6 address labels
    - Enable setting bridge parameter

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Allow an unprivileged user who has created a user namespace, and then
    created a network namespace to effectively use the new network
    namespace, by reducing capable(CAP_NET_ADMIN) and
    capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
    CAP_NET_ADMIN), or ns_capable(net->user_ns, CAP_NET_RAW) calls.
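The difference between the two checks can be illustrated with a simplified userspace model. Everything below is an assumption made for illustration: the structures carry only the fields the sketch needs, a single boolean stands in for one capability bit, and `model_ns_capable()` and `model_capable()` are hypothetical names rather than kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct user_ns {
	struct user_ns *parent; /* NULL for the initial namespace */
	unsigned int owner;     /* uid that created this namespace */
};

struct cred {
	struct user_ns *user_ns; /* namespace the task lives in */
	unsigned int euid;
	bool has_cap;            /* models one bit, e.g. CAP_NET_ADMIN */
};

/* Namespace-relative check: walk from the target namespace towards the root. */
static bool model_ns_capable(const struct cred *cred, const struct user_ns *ns)
{
	assert(cred != NULL && ns != NULL);
	for (; ns; ns = ns->parent) {
		if (ns == cred->user_ns)
			return cred->has_cap;
		/* The creator of a child namespace holds all capabilities in it. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
	}
	return false;
}

/* Old-style check: only capabilities in the initial namespace count. */
static bool model_capable(const struct cred *cred, const struct user_ns *init_ns)
{
	return model_ns_capable(cred, init_ns);
}
```

In this model a task that holds the capability only inside its own user namespace passes the check against that namespace but fails the check against the initial one, which is the distinction the conversion relies on.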

    Settings that merely control a single network device are allowed.
    Either the network device is a logical network device where
    restrictions make no difference or the network device is a hardware NIC
    that has been explicitly moved from the initial network namespace.

    In general policy and network stack state changes are allowed
    while resource control is left unchanged.

    Allow creating raw sockets.
    Allow the SIOCSARP ioctl to control the arp cache.
    Allow the SIOCSIFFLAGS ioctl to allow setting network device flags.
    Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
    Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
    Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
    Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
    Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting gre tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipip tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipsec virtual tunnel interfaces.

    Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
    MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
    sockets.

    Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and
    arbitrary ip options.

    Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
    Allow setting the IP_TRANSPARENT ipv4 socket option.
    Allow setting the TCP_REPAIR socket option.
    Allow setting the TCP_CONGESTION socket option.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • - In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
    to ns_capable(net->user_ns, CAP_NET_ADMIN). This allows unprivileged
    users to make netlink calls to modify their local network
    namespace.

    - In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
    that calls that are not safe for unprivileged users are still
    protected.

    Later patches will remove the extra capable calls from methods
    that are safe for unprivileged users.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

09 Oct, 2012

1 commit

  • After "Cache input routes in fib_info nexthops" (commit
    d2d68ba9fe) and "Elide fib_validate_source() completely when possible"
    (commit 7a9bc9b81a) we can not send ICMP redirects. It seems we
    should not cache the RTCF_DOREDIRECT flag in nh_rth_input because
    the same fib_info can be used for traffic that is not redirected,
    e.g. from other input devices or from sources that are not in the same subnet.

    As a result, we have to disable the caching of the RTCF_DOREDIRECT
    flag and force source validation for the case when forwarding
    traffic to the input device. If traffic comes from a directly connected
    source we allow redirection as it was done before both changes.

    Avoid setting RTCF_DOREDIRECT if IN_DEV_TX_REDIRECTS
    is disabled; this can avoid source address validation and
    helps with caching the routes.

    After the change "Adjust semantics of rt->rt_gateway"
    (commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages
    contain daddr instead of 0.0.0.0 when target is directly connected.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming fields
    that hold port identifiers portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

09 Sep, 2012

1 commit


08 Sep, 2012

1 commit


24 Aug, 2012

1 commit

  • Eric Biederman pointed out that not holding RTNL while calling
    call_netdevice_notifiers() was racy.

    This patch is a direct transcription of his feedback
    against commit 0115e8e30d6fc (net: remove delay at device dismantle)

    Thanks Eric !

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Mahesh Bandewar
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Acked-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Aug, 2012

1 commit

  • I noticed an extra one-second delay in device dismantle, tracked down to
    a call to dst_dev_event() while some call_rcu() are still in RCU queues.

    These call_rcu() were posted by rt_free(struct rtable *rt) calls.

    We then wait a little (one second, though) in netdev_wait_allrefs() before
    kicking NETDEV_UNREGISTER again.

    As the call_rcu() are now completed, dst_dev_event() can do the needed
    device swap on busy dst.

    To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
    after a rcu_barrier(), but outside of RTNL lock.

    Use NETDEV_UNREGISTER_FINAL with care !

    Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL

    Also remove NETDEV_UNREGISTER_BATCH, as it's not used anymore after
    IP cache removal.

    With help from Gao feng

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Mahesh Bandewar
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Aug, 2012

1 commit

  • As pointed out, there are places that access net->loopback_dev->ifindex,
    and after ifindex generation is made per-net this value becomes a constant
    equal to 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use
    it where appropriate.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

01 Aug, 2012

1 commit

  • When a device is unregistered, we have to purge all of the
    references to it that may exist in the entire system.

    If a route is uncached, we currently have no way of accomplishing
    this.

    So create a global list that is scanned when a network device goes
    down. This mirrors the logic in net/core/dst.c's dst_ifdown().

    Signed-off-by: David S. Miller

    David S. Miller
     

24 Jul, 2012

1 commit


21 Jul, 2012

1 commit

  • The ipv4 routing cache is non-deterministic, performance wise, and is
    subject to reasonably easy to launch denial of service attacks.

    The routing cache works great for well behaved traffic, and the world
    was a much friendlier place when the tradeoffs that led to the routing
    cache's design were considered.

    What it boils down to is that the performance of the routing cache is
    a product of the traffic patterns seen by a system rather than being a
    product of the contents of the routing tables. The former is
    controllable by external entities.

    Even for "well behaved" legitimate traffic, high volume sites can see
    hit rates in the routing cache of only ~10%.

    Signed-off-by: David S. Miller

    David S. Miller
     

19 Jul, 2012

1 commit


13 Jul, 2012

1 commit


06 Jul, 2012

1 commit

  • If the user hasn't actually installed any custom rules, or fiddled
    with the default ones, don't go through the whole FIB rules layer.

    It's just pure overhead.

    Instead do what we do with CONFIG_IP_MULTIPLE_TABLES disabled, check
    the individual tables by hand, one by one.
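The fast path can be pictured with a toy userspace model. All types and functions below are hypothetical, the tables use exact matching instead of longest-prefix matching, and the order local, main, default is an assumption of this sketch; `rules_walks` exists only to make the chosen path observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_route { uint32_t dst; uint32_t gw; };
struct toy_table { const struct toy_route *routes; size_t count; };

struct toy_net {
	bool has_custom_rules;
	const struct toy_table *local;
	const struct toy_table *main_tbl;
	const struct toy_table *dflt;
	unsigned int rules_walks; /* counts trips through the rules layer */
};

/* Exact-match lookup in one table; returns 0 on a hit, -1 otherwise. */
static int toy_table_lookup(const struct toy_table *t, uint32_t dst, uint32_t *gw)
{
	size_t i;

	if (!t)
		return -1;
	for (i = 0; i < t->count; i++) {
		if (t->routes[i].dst == dst) {
			*gw = t->routes[i].gw;
			return 0;
		}
	}
	return -1;
}

/* Check the individual tables by hand, one by one. */
static int toy_lookup_in_order(const struct toy_net *net, uint32_t dst, uint32_t *gw)
{
	if (toy_table_lookup(net->local, dst, gw) == 0)
		return 0;
	if (toy_table_lookup(net->main_tbl, dst, gw) == 0)
		return 0;
	return toy_table_lookup(net->dflt, dst, gw);
}

/* Stand-in for the rules layer: same result here, but with extra overhead. */
static int toy_rules_lookup(struct toy_net *net, uint32_t dst, uint32_t *gw)
{
	net->rules_walks++;
	return toy_lookup_in_order(net, dst, gw);
}

static int toy_fib_lookup(struct toy_net *net, uint32_t dst, uint32_t *gw)
{
	assert(net != NULL && gw != NULL);
	if (!net->has_custom_rules)
		return toy_lookup_in_order(net, dst, gw);
	return toy_rules_lookup(net, dst, gw);
}
```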

    Also, move fib_num_tclassid_users into the ipv4 network namespace.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Jun, 2012

1 commit

  • This patch adds the following structure:

    struct netlink_kernel_cfg {
    unsigned int groups;
    void (*input)(struct sk_buff *skb);
    struct mutex *cb_mutex;
    };

    That can be passed to netlink_kernel_create to set optional configurations
    for netlink kernel sockets.

    I've populated this structure by looking for NULL and zero parameters in the
    existing code. The remaining parameters that always need to be set are still
    left in the original interface.

    That includes optional parameters for the netlink socket creation. This allows
    easy extensibility of this interface in the future.

    This patch also adapts all callers to use this new interface.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

29 Jun, 2012

3 commits

  • If rpfilter is off (or the SKB has an IPSEC path) and there are no
    tclassid users, we don't have to do anything at all when
    fib_validate_source() is invoked besides setting the itag to zero.

    We monitor tclassid uses with a counter (modified only under RTNL and
    marked __read_mostly) and we protect the fib_validate_source() real
    work with a test against this counter and whether rpfilter is to be
    done.

    Having a way to know whether or not we need tclassid processing
    also opens the door for future optimized rpfilter algorithms that do
    not perform full FIB lookups.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Checking for in_dev being NULL is pointless.

    In fact, all of our callers have in_dev precomputed already,
    so just pass it in and remove the NULL checking.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Based upon feedback from Julian Anastasov.

    1) Use route flags to determine multicast/broadcast, not the
    packet flags.

    2) Leave saddr unspecified in flow key.

    3) Adjust how we invoke inet_select_addr(). Pass ip_hdr(skb)->saddr as
    second arg, and if it was zeronet use link scope.

    4) Use loopback as input interface in flow key.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Jun, 2012

2 commits

  • Signed-off-by: David S. Miller

    David S. Miller
     
  • The specific destination is the host we direct unicast replies to.
    Usually this is the original packet source address, but if we are
    responding to a multicast or broadcast packet we have to use something
    different.

    Specifically we must use the source address we would use if we were to
    send a packet to the unicast source of the original packet.

    The routing cache precomputes this value, but we want to remove that
    precomputation because it creates a hard dependency on the expensive
    rpfilter source address validation which we'd like to make cheaper.

    There are only three places where this matters:

    1) ICMP replies.

    2) pktinfo CMSG

    3) IP options

    Now there will be no real users of rt->rt_spec_dst and we can simply
    remove it altogether.

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Apr, 2012

1 commit


29 Mar, 2012

1 commit


12 Mar, 2012

1 commit

  • Use a more current kernel messaging style.

    Convert a printk block to print_hex_dump.
    Coalesce formats, align arguments.
    Use %s, __func__ instead of embedding function names.

    Some messages that were prefixed with _close are
    now prefixed with _fini. Some ah4 and esp messages
    are now not prefixed with "ip ".

    The intent of this patch is to later add something like
    #define pr_fmt(fmt) "IPv4: " fmt.
    to standardize the output messages.
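How such a prefix macro composes with a format string can be shown in userspace. The `pr_fmt` definition is the one quoted above; `log_to_buf()`, `log_fini()` and `has_ipv4_prefix()` are hypothetical helpers that write to a buffer instead of the kernel log, and the message text is made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define pr_fmt(fmt) "IPv4: " fmt

/* Userspace stand-in for a pr_*() call: formats into buf via pr_fmt(). */
#define log_to_buf(buf, size, fmt, ...) \
	snprintf((buf), (size), pr_fmt(fmt), __VA_ARGS__)

static int log_fini(char *buf, size_t size, int err)
{
	assert(buf != NULL && size > 0);
	/* %s with __func__ replaces a hard-coded function name in the format. */
	return log_to_buf(buf, size, "%s: can't remove protocol (err %d)\n",
			  __func__, err);
}

/* Every message produced through pr_fmt() starts with the same prefix. */
static int has_ipv4_prefix(const char *msg)
{
	return strncmp(msg, "IPv4: ", 6) == 0;
}
```

Because adjacent string literals are concatenated at compile time, the prefix costs nothing at run time, and renaming a function automatically updates its messages through `__func__`.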

    Text size is trivially reduced. (x86-32 allyesconfig)

    $ size net/ipv4/built-in.o*
    text data bss dec hex filename
    887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
    887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

10 Jun, 2011

1 commit

  • The message size allocated for rtnl ifinfo dumps was limited to
    a single page. This is not enough for additional interface info
    available with devices that support SR-IOV and caused a bug in
    which VF info would not be displayed if more than approximately
    40 VFs were created per interface.

    Implement a new function pointer for the rtnl_register service that will
    calculate the amount of data required for the ifinfo dump and allocate
    enough data to satisfy the request.

    Signed-off-by: Greg Rose
    Signed-off-by: Jeff Kirsher

    Greg Rose
     

11 Apr, 2011

2 commits

  • The reverse path filter interferes with IPsec subnet-to-subnet tunnels,
    especially when the link to the IPsec peer is on an interface other than
    the one hosting the default route.

    With dynamic routing, where the peer might be reachable through eth0
    today and eth1 tomorrow, it's difficult to keep rp_filter enabled unless
    fake routes to the remote subnets are configured on the interface
    currently used to reach the peer.

    IPsec provides a much stronger anti-spoofing policy than rp_filter, so
    this patch disables the rp_filter for packets with a security path.

    Signed-off-by: Michael Smith
    Signed-off-by: David S. Miller

    Michael Smith
     
  • This makes sk_buff available for other use in fib_validate_source().

    Signed-off-by: Michael Smith
    Signed-off-by: David S. Miller

    Michael Smith
     

31 Mar, 2011

1 commit