01 Oct, 2020

1 commit

  • With its use in BPF, the cookie generator can be called very frequently,
    in particular when used from cgroup v2 hooks (e.g. connect / sendmsg)
    attached to the root cgroup, for example in mixed cgroup v1/v2
    environments. When there is high churn on sockets in the system, there
    can be many parallel requests to the bpf_get_socket_cookie() and
    bpf_get_netns_cookie() helpers, which then causes contention on the
    atomic counter.

    As was done in f991bd2e1421 ("fs: introduce a per-cpu last_ino
    allocator"), add a small helper library that both can use for their
    64-bit counters. Given this can be called from different contexts, we
    also need to deal with potential nested calls, even though in practice
    they are considered extremely rare. One idea, suggested by Eric Dumazet,
    was to use a reverse counter for that situation: since we don't expect
    64-bit overflows anyway, nested calls can draw from the top of the
    counter space, avoiding the bigger gaps in the 64-bit counter space that
    a plain batch-wise increase would leave. Even on machines with a small
    number of cores (e.g. 4), cookie generation latency shrinks from
    min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run in
    parallel from multiple CPUs.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net

    Daniel Borkmann
     

08 Sep, 2020

1 commit

  • This reverts commit 8d7e5dee972f1cde2ba96c621f1541fa36e7d4f4.

    To protect netns ids, nsid_lock is used when a netns id is allocated
    or removed, by peernet2id_alloc() and unhash_nsid(). nsid_lock can be
    taken in BH context, but this code uses only spin_lock(). Using
    spin_lock() instead of spin_lock_bh() can result in a deadlock in the
    scenario reported by lockdep below. To avoid the deadlock,
    spin_lock_bh() should be used instead of spin_lock() to acquire
    nsid_lock.
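
    A sketch of the resulting locking pattern (illustrative, not the
    verbatim revert diff):

    /* nsid_lock can also be taken from BH context via the notifier path
     * shown in the splat below, so the process-context side must disable
     * BHs while holding it. */
    static int peernet2id_alloc_sketch(struct net *net, struct net *peer)
    {
        int id;

        spin_lock_bh(&net->nsid_lock);  /* was spin_lock() pre-revert */
        id = __peernet2id(net, peer);
        /* ... allocate a new id here if none is assigned yet ... */
        spin_unlock_bh(&net->nsid_lock);
        return id;
    }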

    Test commands:
    ip netns del nst
    ip netns add nst
    ip link add veth1 type veth peer name veth2
    ip link set veth1 netns nst
    ip netns exec nst ip link add name br1 type bridge vlan_filtering 1
    ip netns exec nst ip link set dev br1 up
    ip netns exec nst ip link set dev veth1 master br1
    ip netns exec nst ip link set dev veth1 up
    ip netns exec nst ip link add macvlan0 link br1 up type macvlan

    Splat looks like:
    [ 33.615860][ T607] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
    [ 33.617194][ T607] 5.9.0-rc1+ #665 Not tainted
    [ ... ]
    [ 33.670615][ T607] Chain exists of:
    [ 33.670615][ T607] &mc->mca_lock --> &bridge_netdev_addr_lock_key --> &net->nsid_lock
    [ 33.670615][ T607]
    [ 33.673118][ T607] Possible interrupt unsafe locking scenario:
    [ 33.673118][ T607]
    [ 33.674599][ T607]        CPU0                    CPU1
    [ 33.675557][ T607]        ----                    ----
    [ 33.676516][ T607]   lock(&net->nsid_lock);
    [ 33.677306][ T607]                                local_irq_disable();
    [ 33.678517][ T607]                                lock(&mc->mca_lock);
    [ 33.679725][ T607]                                lock(&bridge_netdev_addr_lock_key);
    [ 33.681166][ T607]   <Interrupt>
    [ 33.681791][ T607]     lock(&mc->mca_lock);
    [ 33.682579][ T607]
    [ 33.682579][ T607]  *** DEADLOCK ***
    [ ... ]
    [ 33.922046][ T607] stack backtrace:
    [ 33.922999][ T607] CPU: 3 PID: 607 Comm: ip Not tainted 5.9.0-rc1+ #665
    [ 33.924099][ T607] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    [ 33.925714][ T607] Call Trace:
    [ 33.926238][ T607] dump_stack+0x78/0xab
    [ 33.926905][ T607] check_irq_usage+0x70b/0x720
    [ 33.927708][ T607] ? iterate_chain_key+0x60/0x60
    [ 33.928507][ T607] ? check_path+0x22/0x40
    [ 33.929201][ T607] ? check_noncircular+0xcf/0x180
    [ 33.930024][ T607] ? __lock_acquire+0x1952/0x1f20
    [ 33.930860][ T607] __lock_acquire+0x1952/0x1f20
    [ 33.931667][ T607] lock_acquire+0xaf/0x3a0
    [ 33.932366][ T607] ? peernet2id_alloc+0x3a/0x170
    [ 33.933147][ T607] ? br_port_fill_attrs+0x54c/0x6b0 [bridge]
    [ 33.934140][ T607] ? br_port_fill_attrs+0x5de/0x6b0 [bridge]
    [ 33.935113][ T607] ? kvm_sched_clock_read+0x14/0x30
    [ 33.935974][ T607] _raw_spin_lock+0x30/0x70
    [ 33.936728][ T607] ? peernet2id_alloc+0x3a/0x170
    [ 33.937523][ T607] peernet2id_alloc+0x3a/0x170
    [ 33.938313][ T607] rtnl_fill_ifinfo+0xb5e/0x1400
    [ 33.939091][ T607] rtmsg_ifinfo_build_skb+0x8a/0xf0
    [ 33.939953][ T607] rtmsg_ifinfo_event.part.39+0x17/0x50
    [ 33.940863][ T607] rtmsg_ifinfo+0x1f/0x30
    [ 33.941571][ T607] __dev_notify_flags+0xa5/0xf0
    [ 33.942376][ T607] ? __irq_work_queue_local+0x49/0x50
    [ 33.943249][ T607] ? irq_work_queue+0x1d/0x30
    [ 33.943993][ T607] ? __dev_set_promiscuity+0x7b/0x1a0
    [ 33.944878][ T607] __dev_set_promiscuity+0x7b/0x1a0
    [ 33.945758][ T607] dev_set_promiscuity+0x1e/0x50
    [ 33.946582][ T607] br_port_set_promisc+0x1f/0x40 [bridge]
    [ 33.947487][ T607] br_manage_promisc+0x8b/0xe0 [bridge]
    [ 33.948388][ T607] __dev_set_promiscuity+0x123/0x1a0
    [ 33.949244][ T607] __dev_set_rx_mode+0x68/0x90
    [ 33.950021][ T607] dev_uc_add+0x50/0x60
    [ 33.950720][ T607] macvlan_open+0x18e/0x1f0 [macvlan]
    [ 33.951601][ T607] __dev_open+0xd6/0x170
    [ 33.952269][ T607] __dev_change_flags+0x181/0x1d0
    [ 33.953056][ T607] rtnl_configure_link+0x2f/0xa0
    [ 33.953884][ T607] __rtnl_newlink+0x6b9/0x8e0
    [ 33.954665][ T607] ? __lock_acquire+0x95d/0x1f20
    [ 33.955450][ T607] ? lock_acquire+0xaf/0x3a0
    [ 33.956193][ T607] ? is_bpf_text_address+0x5/0xe0
    [ 33.956999][ T607] rtnl_newlink+0x47/0x70

    Acked-by: Guillaume Nault
    Fixes: 8d7e5dee972f ("netns: don't disable BHs when locking "nsid_lock"")
    Reported-by: syzbot+3f960c64a104eaa2c813@syzkaller.appspotmail.com
    Signed-off-by: Taehee Yoo
    Signed-off-by: Jakub Kicinski

    Taehee Yoo
     

09 May, 2020

1 commit

  • Add a simple struct nsset. It holds all the pieces necessary to switch
    to a new set of namespaces without leaving a task in a half-switched
    state, which we will make use of in the next patch. This patch switches
    the existing setns logic over without causing a change in setns()
    behavior, bringing setns() closer to how unshare() works. The
    prepare_ns() function is responsible for preparing all necessary
    information. There are two reasons for this. First, it minimizes
    dependencies between individual namespaces, i.e. all install handlers
    can expect that all fields are properly initialized regardless of the
    order in which they are called. Second, it makes the code easier to
    maintain and easier to follow if it needs to be changed.

    The prepare_ns() helper will only be switched over to a flags argument
    in the next patch. Here it still takes nstype as a simple integer
    argument, which was argued to be clearer. I'm not particularly
    opinionated about whether that really helps or not. The struct nsset
    itself already contains the flags field, since its name indicates that
    it can carry information required by different namespaces. None of this
    should have functional consequences.
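
    For reference, a sketch of the structure as described (field set
    inferred from the patch description; treat as illustrative):

    /* Everything needed to commit a namespace switch in one go, so a
     * failing setns() never leaves the task half-switched. */
    struct nsset {
        unsigned flags;           /* CLONE_NEW* flags being switched */
        struct nsproxy *nsproxy;  /* prepared copy, installed on commit */
        struct fs_struct *fs;     /* root/cwd handling for CLONE_NEWNS */
        const struct cred *cred;  /* prepared creds for CLONE_NEWUSER */
    };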

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com

    Christian Brauner
     

28 Mar, 2020

1 commit

  • In Cilium we're mainly using BPF cgroup hooks today in order to implement
    kube-proxy-free Kubernetes service translation for ClusterIP, NodePort (*),
    ExternalIP, and LoadBalancer, as well as HostPort mapping [0], for all
    traffic between Cilium-managed nodes. While this works in its current
    shape and avoids packet-level NAT for traffic between Cilium-managed
    nodes, there is one major limitation we're facing today: lack of netns
    awareness.

    In Kubernetes, the concept of Pods (which hold one or multiple containers)
    has been built around network namespaces, so while we can use the global
    scope of attaching to root BPF cgroup hooks to our advantage (e.g. for
    exposing NodePort ports on loopback addresses), we also need to
    differentiate between the initial network namespace and non-initial
    ones. For example, ExternalIP services mandate that non-local service
    IPs are not to be translated from the host (initial) network namespace.
    Right now, we have an ugly workaround in place where non-local service
    IPs for ExternalIP services are not translated in the connect() (and
    friends) BPF hooks but instead via less efficient packet-level NAT on
    the veth tc ingress hook for Pod traffic.

    On top of determining whether we're in the initial or a non-initial
    network namespace, we also need a socket-cookie-like mechanism scoped to
    network namespaces. Socket cookies have the nice property that they can
    be combined as part of a key structure, e.g. for BPF LRU maps, without
    having to worry that the cookie could be recycled. We are planning to
    use this for our sessionAffinity implementation for services. Therefore,
    add a new bpf_get_netns_cookie() helper which resolves both use cases at
    once: bpf_get_netns_cookie(NULL) provides the cookie for the initial
    network namespace, while passing the context instead of NULL provides
    the cookie of the application's network namespace. We're using a hole,
    so there is no size increase, and the assignment happens only once. This
    allows for a comparison against the initial namespace as well as regular
    cookie usage as we have today with socket cookies. We could later enable
    this helper for other program types as well, as the need arises.

    (*) Both externalTrafficPolicy={Local|Cluster} types
    [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
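
    A sketch of how a cgroup connect hook might use the helper (hypothetical
    program; only the helper semantics follow the description above):

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("cgroup/connect4")
    int connect4_prog(struct bpf_sock_addr *ctx)
    {
        /* Cookie of the initial (host) network namespace ... */
        __u64 init_cookie = bpf_get_netns_cookie(NULL);
        /* ... and of the calling application's network namespace. */
        __u64 netns_cookie = bpf_get_netns_cookie(ctx);

        if (netns_cookie == init_cookie) {
            /* Caller runs in the host netns: e.g. skip ExternalIP
             * translation for non-local service IPs here. */
        }
        return 1;  /* allow the connect() to proceed */
    }

    char _license[] SEC("license") = "GPL";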

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net

    Daniel Borkmann
     

17 Jan, 2020

1 commit


15 Jan, 2020

3 commits

  • When peernet2id() had to lock "nsid_lock" before iterating through the
    nsid table, we had to disable BHs, because VXLAN can call peernet2id()
    from the xmit path:
    vxlan_xmit() -> vxlan_fdb_miss() -> vxlan_fdb_notify()
    -> __vxlan_fdb_notify() -> vxlan_fdb_info() -> peernet2id().

    Now that peernet2id() uses RCU protection, "nsid_lock" isn't used in BH
    context anymore. Therefore, we can safely use plain
    spin_lock()/spin_unlock() and let BHs run when holding "nsid_lock".

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     
  • __peernet2id() can be protected by RCU as it only calls idr_for_each(),
    which is RCU-safe, and never modifies the nsid table.

    rtnl_net_dumpid() can also do lockless lookups. It does two nested
    idr_for_each() calls on nsid tables (one direct call and one indirect
    call, because rtnl_net_dumpid_one() calls __peernet2id()). Neither call
    updates the netnsid tables. Therefore it is safe to not take the
    nsid_lock and run within an RCU critical section instead.
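
    A minimal sketch of the lookup pattern this enables (illustrative):

    /* Pure lookups no longer need nsid_lock; an RCU read-side critical
     * section is enough. */
    int peernet2id(const struct net *net, struct net *peer)
    {
        int id;

        rcu_read_lock();
        id = __peernet2id(net, peer);  /* idr_for_each() is RCU-safe */
        rcu_read_unlock();
        return id;
    }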

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     
  • __peernet2id_alloc() was used for both plain lookups and for netns ID
    allocations (depending on the value of '*alloc'). Let's separate lookups
    from allocations instead. That is, integrate the lookup code into
    __peernet2id() and make peernet2id_alloc() responsible for allocating
    new netns IDs when necessary.

    This makes it clear that __peernet2id() doesn't modify the idr and
    prepares the code for lockless lookups.

    Also, mark the 'net' argument of __peernet2id() as 'const', since we're
    modifying this line anyway.
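
    A sketch of the resulting split (shape only; locking and allocation
    details simplified):

    /* __peernet2id() is now a pure lookup; peernet2id_alloc() allocates
     * a new id only on a lookup miss. */
    int peernet2id_alloc(struct net *net, struct net *peer)
    {
        bool alloc = false;
        int id;

        spin_lock_bh(&net->nsid_lock);
        id = __peernet2id(net, peer);  /* never touches the idr */
        if (id == NETNSA_NSID_NOT_ASSIGNED) {
            id = alloc_netid(net, peer, -1);
            alloc = id >= 0;
        }
        spin_unlock_bh(&net->nsid_lock);
        if (alloc)
            rtnl_net_notifyid(net, RTM_NEWNSID, id);
        return id;
    }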

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     

26 Oct, 2019

1 commit

  • In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
    rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
    but there are a few paths calling rtnl_net_notifyid() from atomic
    context or from RCU critical sections. The latter also precludes the use
    of gfp_any(), as it wouldn't detect the RCU case. Also, the nlmsg_new()
    call is wrong too, as it uses GFP_KERNEL unconditionally.

    Therefore, we need to pass the GFP flags as a parameter and propagate
    them through the function calls until the proper flags can be
    determined.

    In most cases, GFP_KERNEL is fine. The exceptions are:
    * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
    indirectly call rtnl_net_notifyid() from an RCU critical section;

    * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
    a parameter.

    Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
    by nlmsg_new(). The function is allowed to sleep, so better to make the
    flags consistent with the ones used in the following
    ovs_vport_cmd_fill_info() call.

    Found by code inspection.
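
    A sketch of the propagation pattern (simplified; the real function
    takes additional arguments):

    /* Callers now determine the right GFP flags and pass them down,
     * instead of this function hardcoding GFP_KERNEL. */
    static void rtnl_net_notifyid(struct net *net, int cmd, int id,
                                  gfp_t gfp)
    {
        struct sk_buff *msg;

        msg = nlmsg_new(NLMSG_DEFAULT_SIZE, gfp);  /* was GFP_KERNEL */
        if (!msg)
            return;

        /* ... fill the RTM_NEWNSID / RTM_DELNSID message for 'id' ... */

        rtnl_notify(msg, net, 0, RTNLGRP_NSID, NULL, gfp);  /* was 0 */
    }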

    Fixes: 9a9634545c70 ("netns: notify netns id events")
    Signed-off-by: Guillaume Nault
    Acked-by: Nicolas Dichtel
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Guillaume Nault
     

25 Oct, 2019

1 commit

  • If copy_net_ns() fails after net_alloc(), net->key_domain is leaked.
    Fix this by freeing key_domain in the error path.

    syzbot report:
    BUG: memory leak
    unreferenced object 0xffff8881175007e0 (size 32):
    comm "syz-executor902", pid 7069, jiffies 4294944350 (age 28.400s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [] slab_post_alloc_hook mm/slab.h:439 [inline]
    [] slab_alloc mm/slab.c:3326 [inline]
    [] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    [] kmalloc include/linux/slab.h:547 [inline]
    [] kzalloc include/linux/slab.h:742 [inline]
    [] net_alloc net/core/net_namespace.c:398 [inline]
    [] copy_net_ns+0xb2/0x220 net/core/net_namespace.c:445
    [] create_new_namespaces+0x141/0x2a0 kernel/nsproxy.c:103
    [] unshare_nsproxy_namespaces+0x7f/0x100 kernel/nsproxy.c:202
    [] ksys_unshare+0x236/0x490 kernel/fork.c:2674
    [] __do_sys_unshare kernel/fork.c:2742 [inline]
    [] __se_sys_unshare kernel/fork.c:2740 [inline]
    [] __x64_sys_unshare+0x16/0x20 kernel/fork.c:2740
    [] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    syzbot also reported another leak, in copy_net_ns() -> setup_net().
    That problem is already fixed by commit
    cf47a0b882a4e5f6b34c7949d7b293e9287f1972.
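
    The shape of the fix, sketched (assuming key_remove_domain() is the
    release helper from the keys domain API; illustrative):

    /* copy_net_ns() error path: release the key domain allocated right
     * after net_alloc(), mirroring the teardown done on net destruction. */
    put_userns:
    #ifdef CONFIG_KEYS
        key_remove_domain(net->key_domain);
    #endif
        put_user_ns(user_ns);
        net_drop_ns(net);
        return ERR_PTR(rv);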

    Fixes: 9b242610514f ("keys: Network namespace domain tag")
    Reported-and-tested-by: syzbot+3b3296d032353c33184b@syzkaller.appspotmail.com
    Signed-off-by: Takeshi Misawa
    Signed-off-by: David S. Miller

    Takeshi Misawa
     

10 Oct, 2019

1 commit

  • The NLM_F_ECHO flag is meant to echo the message notified to all
    listeners back to the requesting user.
    That was not the case with the RTM_NEWNSID command; let's fix this.

    Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
    Reported-by: Guillaume Nault
    Signed-off-by: Nicolas Dichtel
    Acked-by: Guillaume Nault
    Tested-by: Guillaume Nault
    Signed-off-by: Jakub Kicinski

    Nicolas Dichtel
     

12 Jul, 2019

1 commit

  • Pull networking updates from David Miller:
    "Some highlights from this development cycle:

    1) Big refactoring of ipv6 route and neigh handling to support
    nexthop objects configurable as units from userspace. From David
    Ahern.

    2) Convert explored_states in BPF verifier into a hash table,
    significantly decreased state held for programs with bpf2bpf
    calls, from Alexei Starovoitov.

    3) Implement bpf_send_signal() helper, from Yonghong Song.

    4) Various classifier enhancements to mvpp2 driver, from Maxime
    Chevallier.

    5) Add aRFS support to hns3 driver, from Jian Shen.

    6) Fix use after free in inet frags by allocating fqdirs dynamically
    and reworking how rhashtable dismantle occurs, from Eric Dumazet.

    7) Add act_ctinfo packet classifier action, from Kevin
    Darbyshire-Bryant.

    8) Add TFO key backup infrastructure, from Jason Baron.

    9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

    10) Add devlink notifications for flash update status to mlxsw driver,
    from Jiri Pirko.

    11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

    12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

    13) Various enhancements to ipv6 flow label handling, from Eric
    Dumazet and Willem de Bruijn.

    14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
    der Merwe, and others.

    15) Various improvements to axienet driver including converting it to
    phylink, from Robert Hancock.

    16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

    17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
    Radulescu.

    18) Add devlink health reporting to mlx5, from Moshe Shemesh.

    19) Convert stmmac over to phylink, from Jose Abreu.

    20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
    Shalom Toledo.

    21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

    22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

    23) Track spill/fill of constants in BPF verifier, from Alexei
    Starovoitov.

    24) Support bounded loops in BPF, from Alexei Starovoitov.

    25) Various page_pool API fixes and improvements, from Jesper Dangaard
    Brouer.

    26) Just like ipv4, support ref-countless ipv6 route handling. From
    Wei Wang.

    27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

    28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

    29) Add flower GRE encap/decap support to nfp driver, from Pieter
    Jansen van Vuuren.

    30) Protect against stack overflow when using act_mirred, from John
    Hurley.

    31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

    32) Use page_pool API in netsec driver, Ilias Apalodimas.

    33) Add Google gve network driver, from Catherine Sullivan.

    34) More indirect call avoidance, from Paolo Abeni.

    35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

    36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

    37) Add MPLS manipulation actions to TC, from John Hurley.

    38) Add sending a packet to connection tracking from TC actions, and
    then allow flower classifier matching on conntrack state. From
    Paul Blakey.

    39) Netfilter hw offload support, from Pablo Neira Ayuso"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
    net/mlx5e: Return in default case statement in tx_post_resync_params
    mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
    net: dsa: add support for BRIDGE_MROUTER attribute
    pkt_sched: Include const.h
    net: netsec: remove static declaration for netsec_set_tx_de()
    net: netsec: remove superfluous if statement
    netfilter: nf_tables: add hardware offload support
    net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
    net: flow_offload: add flow_block_cb_is_busy() and use it
    net: sched: remove tcf block API
    drivers: net: use flow block API
    net: sched: use flow block API
    net: flow_offload: add flow_block_cb_{priv, incref, decref}()
    net: flow_offload: add list handling functions
    net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
    net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
    net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
    net: flow_offload: add flow_block_cb_setup_simple()
    net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
    net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
    ...

    Linus Torvalds
     

09 Jul, 2019

1 commit

  • Merge tag 'keys-namespace-20190627' of
    git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

    Pull keyring namespacing from David Howells:
    "These patches help make keys and keyrings more namespace aware.

    Firstly some miscellaneous patches to make the process easier:

    - Simplify key index_key handling so that the word-sized chunks
    assoc_array requires don't have to be shifted about, making it
    easier to add more bits into the key.

    - Cache the hash value in the key so that we don't have to calculate
    on every key we examine during a search (it involves a bunch of
    multiplications).

    - Allow keyring_search() to search non-recursively.

    Then the main patches:

    - Make it so that keyring names are per-user_namespace from the point
    of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
    accessible cross-user_namespace.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

    - Move the user and user-session keyrings to the user_namespace
    rather than the user_struct. This prevents them propagating
    directly across user_namespaces boundaries (ie. the KEY_SPEC_*
    flags will only pick from the current user_namespace).

    - Make it possible to include the target namespace in which the key
    shall operate in the index_key. This will allow the possibility of
    multiple keys with the same description, but different target
    domains to be held in the same keyring.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

    - Make it so that keys are implicitly invalidated by removal of a
    domain tag, causing them to be garbage collected.

    - Institute a network namespace domain tag that allows keys to be
    differentiated by the network namespace in which they operate. New
    keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
    the network domain in force when they are created.

    - Make it so that the desired network namespace can be handed down
    into the request_key() mechanism. This allows AFS, NFS, etc. to
    request keys specific to the network namespace of the superblock.

    This also means that the keys in the DNS record cache are
    thenceforth namespaced, provided network filesystems pass the
    appropriate network namespace down into dns_query().

    For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
    cache keyrings, such as idmapper keyrings, also need to set the
    domain tag - for which they need access to the network namespace of
    the superblock"

    * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Pass the network namespace into request_key mechanism
    keys: Network namespace domain tag
    keys: Garbage collect keys for which the domain has been removed
    keys: Include target namespace in match criteria
    keys: Move the user and user-session keyrings to the user_namespace
    keys: Namespace keyring names
    keys: Add a 'recurse' flag for keyring searches
    keys: Cache the hash value to avoid lots of recalculation
    keys: Simplify key description management

    Linus Torvalds
     

27 Jun, 2019

1 commit

  • Create key domain tags for network namespaces and make it possible to
    automatically tag keys that are used by networked services (e.g. AF_RXRPC,
    AFS, DNS) with the default network namespace if not set by the caller.

    This allows keys with the same description but in different namespaces to
    coexist within a keyring.
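
    A sketch of how a key type opts in (the example type is hypothetical;
    the flag name is per this series):

    #include <linux/key-type.h>

    /* A key type flagged KEY_TYPE_NET_DOMAIN gets its keys tagged with
     * the network namespace in force at creation time, so identical
     * descriptions can coexist in one keyring across netns boundaries. */
    static struct key_type key_type_example = {
        .name  = "example",
        .flags = KEY_TYPE_NET_DOMAIN,
    };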

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: linux-nfs@vger.kernel.org
    cc: linux-cifs@vger.kernel.org
    cc: linux-afs@lists.infradead.org

    David Howells
     

23 Jun, 2019

1 commit

  • ops has already been iterated back to the first element by the time
    pre_exit is called, so it needs to be restored from saved_ops, not
    saved into saved_ops.
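
    The fix, roughly (sketch of the setup_net() error-path shape;
    illustrative):

    /* After the pre_exit walk, 'ops' points at the first element again,
     * so restore it before the exit walk instead of overwriting the
     * saved position. */
        saved_ops = ops;
        list_for_each_entry_continue_reverse(ops, &pernet_list, list)
            ops_pre_exit_list(ops, &net_exit_list);

        synchronize_rcu();

        ops = saved_ops;  /* the fix; was: saved_ops = ops; */
        list_for_each_entry_continue_reverse(ops, &pernet_list, list)
            ops_exit_list(ops, &net_exit_list);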

    Fixes: d7d99872c144 ("netns: add pre_exit method to struct pernet_operations")
    Signed-off-by: Li RongQing
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Li RongQing
     

19 Jun, 2019

1 commit

  • Current struct pernet_operations exit() handlers are highly
    discouraged from calling synchronize_rcu().

    There are cases where we need them, and exit_batch() does
    not help the common case where a single netns is dismantled.

    This patch leverages the existing synchronize_rcu() call
    in cleanup_net().

    Calling the optional ->pre_exit() method before ->exit() or
    ->exit_batch() allows them to benefit from a single synchronize_rcu()
    call.

    Note that the synchronize_rcu() calls added in this patch
    are only in error paths or slow paths.

    Tested:

    $ time for i in {1..1000}; do unshare -n /bin/false;done

    real 0m2.612s
    user 0m0.171s
    sys 0m2.216s
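
    A sketch of the new hook in use (illustrative; the example_* names are
    hypothetical):

    /* ->pre_exit() unpublishes per-netns objects; cleanup_net() then
     * issues a single synchronize_rcu() before ->exit()/->exit_batch()
     * free them. */
    static void __net_exit example_net_pre_exit(struct net *net)
    {
        /* e.g. unhash objects so no new RCU lookups can find them */
    }

    static void __net_exit example_net_exit(struct net *net)
    {
        /* safe to free: a grace period has elapsed since pre_exit */
    }

    static struct pernet_operations example_net_ops = {
        .pre_exit = example_net_pre_exit,
        .exit     = example_net_exit,
    };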

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

28 Apr, 2019

1 commit

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

12 Apr, 2019

1 commit


29 Mar, 2019

1 commit

  • net_hash_mix() currently uses the kernel address of a struct net,
    and is used in many places that could be used to reveal this
    address to a patient attacker, thus defeating KASLR for
    the typical case (the initial net namespace, where &init_net is
    not dynamically allocated).

    I believe the original implementation tried to avoid spending
    too many cycles in this function, but security comes first.

    Also provide entropy regardless of CONFIG_NET_NS.
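
    The hardened helper, sketched (assuming a per-netns random salt seeded
    at namespace setup; illustrative):

    /* Return a random per-netns salt instead of leaking the struct net
     * kernel address into hash values. */
    static inline u32 net_hash_mix(const struct net *net)
    {
        return net->hash_mix;  /* e.g. from get_random_u32() at setup */
    }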

    Fixes: 0b4419162aa6 ("netns: introduce the net_hash_mix "salt" for hashes")
    Signed-off-by: Eric Dumazet
    Reported-by: Amit Klein
    Reported-by: Benny Pinkas
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jan, 2019

1 commit


25 Dec, 2018

2 commits


28 Nov, 2018

5 commits


09 Oct, 2018

1 commit


27 Aug, 2018

1 commit

  • Pull IDA updates from Matthew Wilcox:
    "A better IDA API:

    id = ida_alloc(ida, GFP_xxx);
    ida_free(ida, id);

    rather than the cumbersome ida_simple_get(), ida_simple_remove().

    The new IDA API is similar to ida_simple_get() but better named. The
    internal restructuring of the IDA code removes the bitmap
    preallocation nonsense.

    I hope the net -200 lines of code is convincing"

    * 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
    ida: Change ida_get_new_above to return the id
    ida: Remove old API
    test_ida: check_ida_destroy and check_ida_alloc
    test_ida: Convert check_ida_conv to new API
    test_ida: Move ida_check_max
    test_ida: Move ida_check_leaf
    idr-test: Convert ida_check_nomem to new API
    ida: Start new test_ida module
    target/iscsi: Allocate session IDs from an IDA
    iscsi target: fix session creation failure handling
    drm/vmwgfx: Convert to new IDA API
    dmaengine: Convert to new IDA API
    ppc: Convert vas ID allocation to new IDA API
    media: Convert entity ID allocation to new IDA API
    ppc: Convert mmu context allocation to new IDA API
    Convert net_namespace to new IDA API
    cb710: Convert to new IDA API
    rsxx: Convert to new IDA API
    osd: Convert to new IDA API
    sd: Convert to new IDA API
    ...

    Linus Torvalds
     

22 Aug, 2018

1 commit


21 Jul, 2018

1 commit

  • Make net_ns_get_ownership() reusable by networking code outside of core.
    This is useful, for example, to allow bridge-related sysfs files to be
    owned by container root.

    Add a function comment, since this is a potentially dangerous function
    to use given the way that kobject_get_ownership() works, by initializing
    uid and gid before calling .get_ownership().
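
    A sketch of a caller (illustrative; example_get_ownership() is
    hypothetical):

    /* A sysfs .get_ownership() hook delegating to net_ns_get_ownership().
     * kobject_get_ownership() initializes *uid/*gid to GLOBAL_ROOT_UID/GID
     * first; net_ns_get_ownership() only overrides them when the netns
     * owner's ids map to valid kuids/kgids. */
    static void example_get_ownership(struct kobject *kobj,
                                      kuid_t *uid, kgid_t *gid)
    {
        struct net_device *dev = to_net_dev(kobj_to_dev(kobj));

        net_ns_get_ownership(dev_net(dev), uid, gid);
    }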

    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller

    Tyler Hicks
     

01 Apr, 2018

1 commit

  • This function calls call_netdevice_notifiers(), which may also
    take net_rwsem, so we can't use net_rwsem here.

    This patch makes callers of this function take pernet_ops_rwsem,
    like register_netdevice_notifier() does. This protects
    the modifications of net_namespace_list and allows notifiers
    to take it (they won't have to care about context).

    Since __rtnl_link_unregister() is used on module load
    and unload (which are not frequent operations), this looks
    better to me than making all call_netdevice_notifiers() calls
    always execute in a "protected net_namespace_list" context.

    Also, this fixes the problem we dealt with in 328fbe747ad4
    "Close race between {un, }register_netdevice_notifier and ...",
    and guarantees that __rtnl_link_unregister() does not skip
    an exiting net.
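
    The caller-side pattern, sketched (illustrative; rtnl locking
    simplified):

    /* Callers wrap the unregister in pernet_ops_rwsem, which protects
     * net_namespace_list and lets notifiers take net_rwsem freely. */
    void rtnl_link_unregister(struct rtnl_link_ops *ops)
    {
        down_write(&pernet_ops_rwsem);
        rtnl_lock();
        __rtnl_link_unregister(ops);
        rtnl_unlock();
        up_write(&pernet_ops_rwsem);
    }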

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

30 Mar, 2018

1 commit

  • rtnl_lock() is used everywhere, and contention is very high.
    When someone wants to iterate over alive net namespaces,
    they have no way to do that without the exclusive lock.
    But an exclusive rtnl_lock() in such places is overkill,
    and it just increases the contention. Yes, there is already
    for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
    so the iteration can't sleep. Also, sometimes we may really need
    to prevent net_namespace_list growth, so for_each_net_rcu()
    does not fit there either.

    This patch introduces a new rw_semaphore, which will be used
    instead of rtnl_mutex to protect net_namespace_list. It is
    sleepable and allows non-exclusive iterations over the net
    namespaces list. It lets us stop using rtnl_lock()
    in several places (which is done in the next patches) and reduces
    the time we hold rtnl_mutex. Here we just add the new lock;
    the explanation of why we can remove rtnl_lock() there is
    in the next patches.

    Fine-grained locks are generally better than one big lock,
    so let's do that with net_namespace_list, while the situation
    allows it.
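
    A sketch of the iteration pattern this enables (illustrative
    fragment):

    struct net *net;

    /* Sleepable, non-exclusive walk over the namespace list under the
     * new net_rwsem read side. */
    down_read(&net_rwsem);
    for_each_net(net) {
        /* may sleep here, unlike under for_each_net_rcu() */
    }
    up_read(&net_rwsem);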

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

28 Mar, 2018

4 commits


23 Mar, 2018

1 commit


07 Mar, 2018

1 commit

  • The patch adds SLAB_ACCOUNT to the flags of the net_cachep cache,
    which enables accounting of struct net memory to memcg kmem.
    Since the number of net namespaces may be significant, users
    want to know how much memory is consumed and to be able to control it.

    Note that we do not account net_generic to the same memcg
    where net was accounted; in fact, we don't account it at all (*).
    We do not want a situation where a single memcg's memory deficit
    prevents us from registering new pernet_operations.

    (*) Even despite there being !current process accounting already
    available in linux-next. See kmalloc_memcg() there for the details.
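
    The change, roughly (sketch; mirrors the cache creation described
    above):

    /* Create net_cachep with SLAB_ACCOUNT so each struct net allocation
     * is charged to the kmem counters of the allocating task's memcg. */
    net_cachep = kmem_cache_create("net_namespace", sizeof(struct net),
                                   SMP_CACHE_BYTES,
                                   SLAB_PANIC | SLAB_ACCOUNT, NULL);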

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai