23 Aug, 2014
1 commit
-
Commit 7a9bc9b81a5b ("ipv4: Elide fib_validate_source() completely when possible.")
introduced a short-circuit to avoid calling fib_validate_source when not
needed. That change took rp_filter into account, but not accept_local.
This resulted in a change of behaviour: with rp_filter and accept_local
off, incoming packets with a local address in the source field should be
dropped.Here is how to reproduce the change pre/post 7a9bc9b81a5b commit:
-configure the same IPv4 address on hosts A and B.
-try to send an ARP request from B to A.
-The ARP request will be dropped before that commit, but accepted and answered
after that commit.This adds a check for ACCEPT_LOCAL, to maintain full
fib validation in case it is 0. We also leave __fib_validate_source() earlier
when possible, based on the same check as fib_validate_source(), once the
accept_local stuff is verified.Cc: Gregory Detal
Cc: Christoph Paasch
Cc: Hannes Frederic Sowa
Cc: Sergei Shtylyov
Signed-off-by: Sébastien Barré
Signed-off-by: David S. Miller
17 Apr, 2014
1 commit
-
As suggested by Julian:
Simply, flowi4_iif must not contain 0, it does not
look logical to ignore all ip rules with specified iif.because in fib_rule_match() we do:
if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
goto out;flowi4_iif should be LOOPBACK_IFINDEX by default.
We need to move LOOPBACK_IFINDEX to include/net/flow.h:
1) It is mostly used by flowi_iif
2) Fix the following compile error if we use it in flow.h
by the patches latter:In file included from include/linux/netfilter.h:277:0,
from include/net/netns/netfilter.h:5,
from include/net/net_namespace.h:21,
from include/linux/netdevice.h:43,
from include/linux/icmpv6.h:12,
from include/linux/ipv6.h:61,
from include/net/ipv6.h:16,
from include/linux/sunrpc/clnt.h:27,
from include/linux/nfs_fs.h:30,
from init/do_mounts.c:32:
include/net/flow.h: In function ‘flowi4_init_output’:
include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)Cc: Eric Biederman
Cc: Julian Anastasov
Cc: David S. Miller
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
25 Mar, 2014
1 commit
-
ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it.
Signed-off-by: Li RongQing
Signed-off-by: David S. Miller
25 Jan, 2014
1 commit
-
The two commits 0115e8e30d (net: remove delay at device dismantle) and
748e2d9396a (net: reinstate rtnl in call_netdevice_notifiers()) silently
removed a NULL pointer check for in_dev since Linux 3.7.This patch re-introduces this check as it causes crashing the kernel when
setting small mtu values on non-ip capable netdevices.Signed-off-by: Oliver Hartkopp
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
19 Oct, 2013
1 commit
-
fib_table_lookup has included the rcu lock protection.
Signed-off-by: baker.zhang
Signed-off-by: David S. Miller
28 Jun, 2013
1 commit
-
Since (c05cdb1 netlink: allow large data transfers from user-space),
netlink splats if it invokes skb_clone on large netlink skbs since:* skb_shared_info was not correctly initialized.
* skb->destructor is not set in the cloned skb.This was spotted by trinity:
[ 894.990671] BUG: unable to handle kernel paging request at ffffc9000047b001
[ 894.991034] IP: [] skb_clone+0x24/0xc0
[...]
[ 894.991034] Call Trace:
[ 894.991034] [] nl_fib_input+0x6a/0x240
[ 894.991034] [] ? _raw_read_unlock+0x26/0x40
[ 894.991034] [] netlink_unicast+0x169/0x1e0
[ 894.991034] [] netlink_sendmsg+0x251/0x3d0Fix it by:
1) introducing a new netlink_skb_clone function that is used in nl_fib_input,
that sets our special skb->destructor in the cloned skb. Moreover, handle
the release of the large cloned skb head area in the destructor path.2) not allowing large skbuffs in the netlink broadcast path. I cannot find
any reasonable use of the large data transfer using netlink in that path,
moreover this helps to skip extra skb_clone handling.I found two more netlink clients that are cloning the skbs, but they are
not in the sendmsg path. Therefore, the sole client cloning that I found
seems to be the fib frontend.Thanks to Eric Dumazet for helping to address this issue.
Reported-by: Fengguang Wu
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller
29 May, 2013
1 commit
-
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.Signed-off-by: Jiri Pirko
v2->v3: fix typo on simeth
shortened dev_getter
shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller
29 Mar, 2013
1 commit
-
Signed-off-by: Hong Zhiguo
Signed-off-by: David S. Miller
22 Mar, 2013
1 commit
-
With decnet converted, we can finally get rid of rta_buf and its
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.Signed-off-by: Thomas Graf
Signed-off-by: David S. Miller
28 Feb, 2013
1 commit
-
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;type T;
expression a,c,d,e;
identifier b;
statement S;
@@-T b;
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin
Acked-by: Paul E. McKenney
Signed-off-by: Sasha Levin
Cc: Wu Fengguang
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Jan, 2013
1 commit
-
In fib_frontend.c, there is a confusing comment; NETLINK_CB(skb).portid does not
refer to a pid of sending process, but rather to a netlink portid.Signed-off-by: Rami Rosen
Signed-off-by: David S. Miller
19 Nov, 2012
3 commits
-
- Only allow moving network devices to network namespaces you have
CAP_NET_ADMIN privileges over.- Enable creating/deleting/modifying interfaces
- Enable adding/deleting addresses
- Enable adding/setting/deleting neighbour entries
- Enable adding/removing routes
- Enable adding/removing fib rules
- Enable setting the forwarding state
- Enable adding/removing ipv6 address labels
- Enable setting bridge parameterSigned-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller -
Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference or the network device is hardware NIC
that has been explicity moved from the initial network namespace.In general policy and network stack state changes are allowed
while resource control is left unchanged.Allow creating raw sockets.
Allow the SIOCSARP ioctl to control the arp cache.
Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting gre tunnels.Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting ipip tunnels.Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting ipsec virtual tunnel interfaces.Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
sockets.Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and
arbitrary ip options.Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
Allow setting the IP_TRANSPARENT ipv4 socket option.
Allow setting the TCP_REPAIR socket option.
Allow setting the TCP_CONGESTION socket option.Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller -
- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged
users to make netlink calls to modify their local network
namespace.- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
that calls that are not safe for unprivileged users are still
protected.Later patches will remove the extra capable calls from methods
that are safe for unprivilged users.Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller
09 Oct, 2012
1 commit
-
After "Cache input routes in fib_info nexthops" (commit
d2d68ba9fe) and "Elide fib_validate_source() completely when possible"
(commit 7a9bc9b81a) we can not send ICMP redirects. It seems we
should not cache the RTCF_DOREDIRECT flag in nh_rth_input because
the same fib_info can be used for traffic that is not redirected,
eg. from other input devices or from sources that are not in same subnet.As result, we have to disable the caching of RTCF_DOREDIRECT
flag and to force source validation for the case when forwarding
traffic to the input device. If traffic comes from directly connected
source we allow redirection as it was done before both changes.Avoid setting RTCF_DOREDIRECT if IN_DEV_TX_REDIRECTS
is disabled, this can avoid source address validation and to
help caching the routes.After the change "Adjust semantics of rt->rt_gateway"
(commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages
contain daddr instead of 0.0.0.0 when target is directly connected.Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller
11 Sep, 2012
1 commit
-
It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.I have successfully built an allyesconfig kernel with this change.
Signed-off-by: "Eric W. Biederman"
Acked-by: Stephen Hemminger
Signed-off-by: David S. Miller
09 Sep, 2012
1 commit
-
This patch defines netlink_kernel_create as a wrapper function of
__netlink_kernel_create to hide the struct module *me parameter
(which seems to be THIS_MODULE in all existing netlink subsystems).Suggested by David S. Miller.
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller
08 Sep, 2012
1 commit
-
Since route cache deletion (89aef8921bfbac22f), delay is no
more used. Remove it.Signed-off-by: Nicolas Dichtel
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
24 Aug, 2012
1 commit
-
Eric Biederman pointed out that not holding RTNL while calling
call_netdevice_notifiers() was racy.This patch is a direct transcription his feedback
against commit 0115e8e30d6fc (net: remove delay at device dismantle)Thanks Eric !
Signed-off-by: Eric Dumazet
Cc: Tom Herbert
Cc: Mahesh Bandewar
Cc: "Eric W. Biederman"
Cc: Gao feng
Acked-by: "Eric W. Biederman"
Signed-off-by: David S. Miller
23 Aug, 2012
1 commit
-
I noticed extra one second delay in device dismantle, tracked down to
a call to dst_dev_event() while some call_rcu() are still in RCU queues.These call_rcu() were posted by rt_free(struct rtable *rt) calls.
We then wait a little (but one second) in netdev_wait_allrefs() before
kicking again NETDEV_UNREGISTER.As the call_rcu() are now completed, dst_dev_event() can do the needed
device swap on busy dst.To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
after a rcu_barrier(), but outside of RTNL lock.Use NETDEV_UNREGISTER_FINAL with care !
Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL
Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
IP cache removal.With help from Gao feng
Signed-off-by: Eric Dumazet
Cc: Tom Herbert
Cc: Mahesh Bandewar
Cc: "Eric W. Biederman"
Cc: Gao feng
Signed-off-by: David S. Miller
10 Aug, 2012
1 commit
-
As pointed out, there are places, that access net->loopback_dev->ifindex
and after ifindex generation is made per-net this value becomes constant
equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use
it where appropriate.Signed-off-by: Pavel Emelyanov
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
01 Aug, 2012
1 commit
-
When a device is unregistered, we have to purge all of the
references to it that may exist in the entire system.If a route is uncached, we currently have no way of accomplishing
this.So create a global list that is scanned when a network device goes
down. This mirrors the logic in net/core/dst.c's dst_ifdown().Signed-off-by: David S. Miller
24 Jul, 2012
1 commit
-
It is redundant to set no_addr and accept_local to 0 and then set them
with other values just after that.Signed-off-by: Lin Ming
Signed-off-by: David S. Miller
21 Jul, 2012
1 commit
-
The ipv4 routing cache is non-deterministic, performance wise, and is
subject to reasonably easy to launch denial of service attacks.The routing cache works great for well behaved traffic, and the world
was a much friendlier place when the tradeoffs that led to the routing
cache's design were considered.What it boils down to is that the performance of the routing cache is
a product of the traffic patterns seen by a system rather than being a
product of the contents of the routing tables. The former of which is
controllable by external entitites.Even for "well behaved" legitimate traffic, high volume sites can see
hit rates in the routing cache of only ~%10.Signed-off-by: David S. Miller
19 Jul, 2012
1 commit
-
ip_options_compile can be called for forwarded packets,
make sure the specific-destionation address is a local one as
specified in RFC 1812, 4.2.2.2 Addresses in OptionsSigned-off-by: Julian Anastasov
Signed-off-by: David S. Miller
13 Jul, 2012
1 commit
-
We only use it to fetch the rule's tclassid, so just store the
tclassid there instead.This also decreases the size of fib_result by a full 8 bytes on
64-bit. On 32-bits it's a wash.Signed-off-by: David S. Miller
06 Jul, 2012
1 commit
-
If the user hasn't actually installed any custom rules, or fiddled
with the default ones, don't go through the whole FIB rules layer.It's just pure overhead.
Instead do what we do with CONFIG_IP_MULTIPLE_TABLES disabled, check
the individual tables by hand, one by one.Also, move fib_num_tclassid_users into the ipv4 network namespace.
Signed-off-by: David S. Miller
30 Jun, 2012
1 commit
-
This patch adds the following structure:
struct netlink_kernel_cfg {
unsigned int groups;
void (*input)(struct sk_buff *skb);
struct mutex *cb_mutex;
};That can be passed to netlink_kernel_create to set optional configurations
for netlink kernel sockets.I've populated this structure by looking for NULL and zero parameters at the
existing code. The remaining parameters that always need to be set are still
left in the original interface.That includes optional parameters for the netlink socket creation. This allows
easy extensibility of this interface in the future.This patch also adapts all callers to use this new interface.
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller
29 Jun, 2012
3 commits
-
If rpfilter is off (or the SKB has an IPSEC path) and there are not
tclassid users, we don't have to do anything at all when
fib_validate_source() is invoked besides setting the itag to zero.We monitor tclassid uses with a counter (modified only under RTNL and
marked __read_mostly) and we protect the fib_validate_source() real
work with a test against this counter and whether rpfilter is to be
done.Having a way to know whether we need no tclassid processing or not
also opens the door for future optimized rpfilter algorithms that do
not perform full FIB lookups.Signed-off-by: David S. Miller
-
Checking for in_dev being NULL is pointless.
In fact, all of our callers have in_dev precomputed already,
so just pass it in and remove the NULL checking.Signed-off-by: David S. Miller
-
Based upon feedback from Julian Anastasov.
1) Use route flags to determine multicast/broadcast, not the
packet flags.2) Leave saddr unspecified in flow key.
3) Adjust how we invoke inet_select_addr(). Pass ip_hdr(skb)->saddr as
second arg, and if it was zeronet use link scope.4) Use loopback as input interface in flow key.
Signed-off-by: David S. Miller
28 Jun, 2012
2 commits
-
Signed-off-by: David S. Miller
-
The specific destination is the host we direct unicast replies to.
Usually this is the original packet source address, but if we are
responding to a multicast or broadcast packet we have to use something
different.Specifically we must use the source address we would use if we were to
send a packet to the unicast source of the original packet.The routing cache precomputes this value, but we want to remove that
precomputation because it creates a hard dependency on the expensive
rpfilter source address validation which we'd like to make cheaper.There are only three places where this matters:
1) ICMP replies.
2) pktinfo CMSG
3) IP options
Now there will be no real users of rt->rt_spec_dst and we can simply
remove it altogether.Signed-off-by: David S. Miller
16 Apr, 2012
1 commit
-
Use of "unsigned int" is preferred to bare "unsigned" in net tree.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
29 Mar, 2012
1 commit
-
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:perl -p -i -e 's!^#\s*include\s*.*\n!!' `grep -Irl '^#\s*include\s*' *`
Signed-off-by: David Howells
12 Mar, 2012
1 commit
-
Use a more current kernel messaging style.
Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.Some messages that were prefixed with _close are
now prefixed with _fini. Some ah4 and esp messages
are now not prefixed with "ip ".The intent of this patch is to later add something like
#define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.Text size is trivially reduced. (x86-32 allyesconfig)
$ size net/ipv4/built-in.o*
text data bss dec hex filename
887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.oldSigned-off-by: Joe Perches
Signed-off-by: David S. Miller
10 Jun, 2011
1 commit
-
The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.Signed-off-by: Greg Rose
Signed-off-by: Jeff Kirsher
11 Apr, 2011
2 commits
-
The reverse path filter interferes with IPsec subnet-to-subnet tunnels,
especially when the link to the IPsec peer is on an interface other than
the one hosting the default route.With dynamic routing, where the peer might be reachable through eth0
today and eth1 tomorrow, it's difficult to keep rp_filter enabled unless
fake routes to the remote subnets are configured on the interface
currently used to reach the peer.IPsec provides a much stronger anti-spoofing policy than rp_filter, so
this patch disables the rp_filter for packets with a security path.Signed-off-by: Michael Smith
Signed-off-by: David S. Miller -
This makes sk_buff available for other use in fib_validate_source().
Signed-off-by: Michael Smith
Signed-off-by: David S. Miller
31 Mar, 2011
1 commit
-
Daniel J Blueman reported a lockdep splat in trie_firstleaf(), caused by
RTNL being not locked before a call to fib_table_flush()Reported-by: Daniel J Blueman
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller