01 Oct, 2020

1 commit

  • With its use in BPF, the cookie generator can be called very frequently,
    in particular when used from cgroup v2 hooks (e.g. connect / sendmsg)
    attached to the root cgroup, for example in mixed cgroup v1/v2
    environments. When there is high churn on sockets in the system, there
    can be many parallel requests to the bpf_get_socket_cookie() and
    bpf_get_netns_cookie() helpers, which then causes contention on the
    atomic counter.

    As was done in f991bd2e1421 ("fs: introduce a per-cpu last_ino
    allocator"), add a small helper library that both can use for their
    64-bit counters. Given this can be called from different contexts, we
    also need to deal with potential nested calls, even though in practice
    they are considered extremely rare. One idea, suggested by Eric Dumazet,
    was to use a reverse counter for that situation: since we don't expect
    64-bit overflows anyway, nested calls can draw from the top of the
    counter space, avoiding the bigger gaps in the 64-bit counter space that
    a plain batch-wise increase would leave. Even on machines with a small
    number of cores (e.g. 4), cookie generation latency shrinks from
    min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run in
    parallel from multiple CPUs.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net

    Daniel Borkmann
     

08 Sep, 2020

1 commit

  • This reverts commit 8d7e5dee972f1cde2ba96c621f1541fa36e7d4f4.

    To protect netns ids, nsid_lock is used when a netns id is allocated
    or removed, by peernet2id_alloc() and unhash_nsid(). nsid_lock can be
    taken in BH context, but this code uses only spin_lock(). Using
    spin_lock() instead of spin_lock_bh() can result in a deadlock in the
    scenario reported by lockdep below. To avoid the deadlock,
    spin_lock_bh() should be used instead of spin_lock() to acquire
    nsid_lock.
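
    A sketch of the resulting locking pattern (illustrative, not the
    verbatim revert diff):

    /* nsid_lock can also be taken from BH context via the notifier path
     * shown in the splat below, so the process-context side must disable
     * BHs while holding it. */
    static int peernet2id_alloc_sketch(struct net *net, struct net *peer)
    {
        int id;

        spin_lock_bh(&net->nsid_lock);  /* was spin_lock() pre-revert */
        id = __peernet2id(net, peer);
        /* ... allocate a new id here if none is assigned yet ... */
        spin_unlock_bh(&net->nsid_lock);
        return id;
    }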

    Test commands:
    ip netns del nst
    ip netns add nst
    ip link add veth1 type veth peer name veth2
    ip link set veth1 netns nst
    ip netns exec nst ip link add name br1 type bridge vlan_filtering 1
    ip netns exec nst ip link set dev br1 up
    ip netns exec nst ip link set dev veth1 master br1
    ip netns exec nst ip link set dev veth1 up
    ip netns exec nst ip link add macvlan0 link br1 up type macvlan

    Splat looks like:
    [ 33.615860][ T607] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
    [ 33.617194][ T607] 5.9.0-rc1+ #665 Not tainted
    [ ... ]
    [ 33.670615][ T607] Chain exists of:
    [ 33.670615][ T607] &mc->mca_lock --> &bridge_netdev_addr_lock_key --> &net->nsid_lock
    [ 33.670615][ T607]
    [ 33.673118][ T607] Possible interrupt unsafe locking scenario:
    [ 33.673118][ T607]
    [ 33.674599][ T607]        CPU0                    CPU1
    [ 33.675557][ T607]        ----                    ----
    [ 33.676516][ T607]   lock(&net->nsid_lock);
    [ 33.677306][ T607]                                local_irq_disable();
    [ 33.678517][ T607]                                lock(&mc->mca_lock);
    [ 33.679725][ T607]                                lock(&bridge_netdev_addr_lock_key);
    [ 33.681166][ T607]   <Interrupt>
    [ 33.681791][ T607]     lock(&mc->mca_lock);
    [ 33.682579][ T607]
    [ 33.682579][ T607]  *** DEADLOCK ***
    [ ... ]
    [ 33.922046][ T607] stack backtrace:
    [ 33.922999][ T607] CPU: 3 PID: 607 Comm: ip Not tainted 5.9.0-rc1+ #665
    [ 33.924099][ T607] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    [ 33.925714][ T607] Call Trace:
    [ 33.926238][ T607] dump_stack+0x78/0xab
    [ 33.926905][ T607] check_irq_usage+0x70b/0x720
    [ 33.927708][ T607] ? iterate_chain_key+0x60/0x60
    [ 33.928507][ T607] ? check_path+0x22/0x40
    [ 33.929201][ T607] ? check_noncircular+0xcf/0x180
    [ 33.930024][ T607] ? __lock_acquire+0x1952/0x1f20
    [ 33.930860][ T607] __lock_acquire+0x1952/0x1f20
    [ 33.931667][ T607] lock_acquire+0xaf/0x3a0
    [ 33.932366][ T607] ? peernet2id_alloc+0x3a/0x170
    [ 33.933147][ T607] ? br_port_fill_attrs+0x54c/0x6b0 [bridge]
    [ 33.934140][ T607] ? br_port_fill_attrs+0x5de/0x6b0 [bridge]
    [ 33.935113][ T607] ? kvm_sched_clock_read+0x14/0x30
    [ 33.935974][ T607] _raw_spin_lock+0x30/0x70
    [ 33.936728][ T607] ? peernet2id_alloc+0x3a/0x170
    [ 33.937523][ T607] peernet2id_alloc+0x3a/0x170
    [ 33.938313][ T607] rtnl_fill_ifinfo+0xb5e/0x1400
    [ 33.939091][ T607] rtmsg_ifinfo_build_skb+0x8a/0xf0
    [ 33.939953][ T607] rtmsg_ifinfo_event.part.39+0x17/0x50
    [ 33.940863][ T607] rtmsg_ifinfo+0x1f/0x30
    [ 33.941571][ T607] __dev_notify_flags+0xa5/0xf0
    [ 33.942376][ T607] ? __irq_work_queue_local+0x49/0x50
    [ 33.943249][ T607] ? irq_work_queue+0x1d/0x30
    [ 33.943993][ T607] ? __dev_set_promiscuity+0x7b/0x1a0
    [ 33.944878][ T607] __dev_set_promiscuity+0x7b/0x1a0
    [ 33.945758][ T607] dev_set_promiscuity+0x1e/0x50
    [ 33.946582][ T607] br_port_set_promisc+0x1f/0x40 [bridge]
    [ 33.947487][ T607] br_manage_promisc+0x8b/0xe0 [bridge]
    [ 33.948388][ T607] __dev_set_promiscuity+0x123/0x1a0
    [ 33.949244][ T607] __dev_set_rx_mode+0x68/0x90
    [ 33.950021][ T607] dev_uc_add+0x50/0x60
    [ 33.950720][ T607] macvlan_open+0x18e/0x1f0 [macvlan]
    [ 33.951601][ T607] __dev_open+0xd6/0x170
    [ 33.952269][ T607] __dev_change_flags+0x181/0x1d0
    [ 33.953056][ T607] rtnl_configure_link+0x2f/0xa0
    [ 33.953884][ T607] __rtnl_newlink+0x6b9/0x8e0
    [ 33.954665][ T607] ? __lock_acquire+0x95d/0x1f20
    [ 33.955450][ T607] ? lock_acquire+0xaf/0x3a0
    [ 33.956193][ T607] ? is_bpf_text_address+0x5/0xe0
    [ 33.956999][ T607] rtnl_newlink+0x47/0x70

    Acked-by: Guillaume Nault
    Fixes: 8d7e5dee972f ("netns: don't disable BHs when locking "nsid_lock"")
    Reported-by: syzbot+3f960c64a104eaa2c813@syzkaller.appspotmail.com
    Signed-off-by: Taehee Yoo
    Signed-off-by: Jakub Kicinski

    Taehee Yoo
     

09 May, 2020

1 commit

  • Add a simple struct nsset. It holds all the pieces necessary to switch
    to a new set of namespaces without leaving a task in a half-switched
    state, which we will make use of in the next patch. This patch switches
    the existing setns logic over without causing a change in setns()
    behavior, bringing setns() closer to how unshare() works. The
    prepare_ns() function is responsible for preparing all necessary
    information. There are two reasons for this. First, it minimizes
    dependencies between individual namespaces, i.e. all install handlers
    can expect that all fields are properly initialized regardless of the
    order in which they are called. Second, it makes the code easier to
    maintain and easier to follow if it needs to be changed.

    The prepare_ns() helper will only be switched over to a flags argument
    in the next patch. Here it still takes nstype as a simple integer
    argument, which was argued to be clearer. I'm not particularly
    opinionated about whether that really helps or not. The struct nsset
    itself already contains the flags field, since its name indicates that
    it can carry information required by different namespaces. None of this
    should have functional consequences.
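
    For reference, a sketch of the structure as described (field set
    inferred from the patch description; treat as illustrative):

    /* Everything needed to commit a namespace switch in one go, so a
     * failing setns() never leaves the task half-switched. */
    struct nsset {
        unsigned flags;           /* CLONE_NEW* flags being switched */
        struct nsproxy *nsproxy;  /* prepared copy, installed on commit */
        struct fs_struct *fs;     /* root/cwd handling for CLONE_NEWNS */
        const struct cred *cred;  /* prepared creds for CLONE_NEWUSER */
    };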

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com

    Christian Brauner
     

28 Mar, 2020

1 commit

  • In Cilium we're mainly using BPF cgroup hooks today in order to implement
    kube-proxy-free Kubernetes service translation for ClusterIP, NodePort (*),
    ExternalIP, and LoadBalancer, as well as HostPort mapping [0], for all
    traffic between Cilium-managed nodes. While this works in its current
    shape and avoids packet-level NAT for traffic between Cilium-managed
    nodes, there is one major limitation we're facing today: lack of netns
    awareness.

    In Kubernetes, the concept of Pods (which hold one or multiple containers)
    has been built around network namespaces, so while we can use the global
    scope of attaching to root BPF cgroup hooks to our advantage (e.g. for
    exposing NodePort ports on loopback addresses), we also need to
    differentiate between the initial network namespace and non-initial
    ones. For example, ExternalIP services mandate that non-local service
    IPs are not to be translated from the host (initial) network namespace.
    Right now, we have an ugly workaround in place where non-local service
    IPs for ExternalIP services are not translated in the connect() (and
    friends) BPF hooks but instead via less efficient packet-level NAT on
    the veth tc ingress hook for Pod traffic.

    On top of determining whether we're in the initial or a non-initial
    network namespace, we also need a socket-cookie-like mechanism scoped to
    network namespaces. Socket cookies have the nice property that they can
    be combined as part of a key structure, e.g. for BPF LRU maps, without
    having to worry that the cookie could be recycled. We are planning to
    use this for our sessionAffinity implementation for services. Therefore,
    add a new bpf_get_netns_cookie() helper which resolves both use cases at
    once: bpf_get_netns_cookie(NULL) provides the cookie for the initial
    network namespace, while passing the context instead of NULL provides
    the cookie of the application's network namespace. We're using a hole,
    so there is no size increase, and the assignment happens only once. This
    allows for a comparison against the initial namespace as well as regular
    cookie usage as we have today with socket cookies. We could later enable
    this helper for other program types as well, as the need arises.

    (*) Both externalTrafficPolicy={Local|Cluster} types
    [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
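
    A sketch of how a cgroup connect hook might use the helper (hypothetical
    program; only the helper semantics follow the description above):

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("cgroup/connect4")
    int connect4_prog(struct bpf_sock_addr *ctx)
    {
        /* Cookie of the initial (host) network namespace ... */
        __u64 init_cookie = bpf_get_netns_cookie(NULL);
        /* ... and of the calling application's network namespace. */
        __u64 netns_cookie = bpf_get_netns_cookie(ctx);

        if (netns_cookie == init_cookie) {
            /* Caller runs in the host netns: e.g. skip ExternalIP
             * translation for non-local service IPs here. */
        }
        return 1;  /* allow the connect() to proceed */
    }

    char _license[] SEC("license") = "GPL";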

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net

    Daniel Borkmann
     

17 Jan, 2020

1 commit


15 Jan, 2020

3 commits

  • When peernet2id() had to lock "nsid_lock" before iterating through the
    nsid table, we had to disable BHs, because VXLAN can call peernet2id()
    from the xmit path:
    vxlan_xmit() -> vxlan_fdb_miss() -> vxlan_fdb_notify()
    -> __vxlan_fdb_notify() -> vxlan_fdb_info() -> peernet2id().

    Now that peernet2id() uses RCU protection, "nsid_lock" isn't used in BH
    context anymore. Therefore, we can safely use plain
    spin_lock()/spin_unlock() and let BHs run when holding "nsid_lock".

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     
  • __peernet2id() can be protected by RCU as it only calls idr_for_each(),
    which is RCU-safe, and never modifies the nsid table.

    rtnl_net_dumpid() can also do lockless lookups. It does two nested
    idr_for_each() calls on nsid tables (one direct call and one indirect
    call, because rtnl_net_dumpid_one() calls __peernet2id()). Neither call
    updates the netnsid tables. Therefore it is safe to not take the
    nsid_lock and run within an RCU critical section instead.
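
    A minimal sketch of the lookup pattern this enables (illustrative):

    /* Pure lookups no longer need nsid_lock; an RCU read-side critical
     * section is enough. */
    int peernet2id(const struct net *net, struct net *peer)
    {
        int id;

        rcu_read_lock();
        id = __peernet2id(net, peer);  /* idr_for_each() is RCU-safe */
        rcu_read_unlock();
        return id;
    }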

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     
  • __peernet2id_alloc() was used for both plain lookups and for netns ID
    allocations (depending on the value of '*alloc'). Let's separate lookups
    from allocations instead. That is, integrate the lookup code into
    __peernet2id() and make peernet2id_alloc() responsible for allocating
    new netns IDs when necessary.

    This makes it clear that __peernet2id() doesn't modify the idr and
    prepares the code for lockless lookups.

    Also, mark the 'net' argument of __peernet2id() as 'const', since we're
    modifying this line anyway.
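
    A sketch of the resulting split (shape only; locking and allocation
    details simplified):

    /* __peernet2id() is now a pure lookup; peernet2id_alloc() allocates
     * a new id only on a lookup miss. */
    int peernet2id_alloc(struct net *net, struct net *peer)
    {
        bool alloc = false;
        int id;

        spin_lock_bh(&net->nsid_lock);
        id = __peernet2id(net, peer);  /* never touches the idr */
        if (id == NETNSA_NSID_NOT_ASSIGNED) {
            id = alloc_netid(net, peer, -1);
            alloc = id >= 0;
        }
        spin_unlock_bh(&net->nsid_lock);
        if (alloc)
            rtnl_net_notifyid(net, RTM_NEWNSID, id);
        return id;
    }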

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     

26 Oct, 2019

1 commit

  • In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
    rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
    but there are a few paths calling rtnl_net_notifyid() from atomic
    context or from RCU critical sections. The latter also precludes the use
    of gfp_any(), as it wouldn't detect the RCU case. Also, the nlmsg_new()
    call is wrong too, as it uses GFP_KERNEL unconditionally.

    Therefore, we need to pass the GFP flags as a parameter and propagate
    them through the function calls until the proper flags can be
    determined.

    In most cases, GFP_KERNEL is fine. The exceptions are:
    * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
    indirectly call rtnl_net_notifyid() from an RCU critical section;

    * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
    a parameter.

    Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
    by nlmsg_new(). The function is allowed to sleep, so better to make the
    flags consistent with the ones used in the following
    ovs_vport_cmd_fill_info() call.

    Found by code inspection.
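
    A sketch of the propagation pattern (simplified; the real function
    takes additional arguments):

    /* Callers now determine the right GFP flags and pass them down,
     * instead of this function hardcoding GFP_KERNEL. */
    static void rtnl_net_notifyid(struct net *net, int cmd, int id,
                                  gfp_t gfp)
    {
        struct sk_buff *msg;

        msg = nlmsg_new(NLMSG_DEFAULT_SIZE, gfp);  /* was GFP_KERNEL */
        if (!msg)
            return;

        /* ... fill the RTM_NEWNSID / RTM_DELNSID message for 'id' ... */

        rtnl_notify(msg, net, 0, RTNLGRP_NSID, NULL, gfp);  /* was 0 */
    }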

    Fixes: 9a9634545c70 ("netns: notify netns id events")
    Signed-off-by: Guillaume Nault
    Acked-by: Nicolas Dichtel
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Guillaume Nault
     

25 Oct, 2019

1 commit

  • If copy_net_ns() fails after net_alloc(), net->key_domain is leaked.
    Fix this by freeing key_domain in the error path.

    syzbot report:
    BUG: memory leak
    unreferenced object 0xffff8881175007e0 (size 32):
    comm "syz-executor902", pid 7069, jiffies 4294944350 (age 28.400s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [] slab_post_alloc_hook mm/slab.h:439 [inline]
    [] slab_alloc mm/slab.c:3326 [inline]
    [] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    [] kmalloc include/linux/slab.h:547 [inline]
    [] kzalloc include/linux/slab.h:742 [inline]
    [] net_alloc net/core/net_namespace.c:398 [inline]
    [] copy_net_ns+0xb2/0x220 net/core/net_namespace.c:445
    [] create_new_namespaces+0x141/0x2a0 kernel/nsproxy.c:103
    [] unshare_nsproxy_namespaces+0x7f/0x100 kernel/nsproxy.c:202
    [] ksys_unshare+0x236/0x490 kernel/fork.c:2674
    [] __do_sys_unshare kernel/fork.c:2742 [inline]
    [] __se_sys_unshare kernel/fork.c:2740 [inline]
    [] __x64_sys_unshare+0x16/0x20 kernel/fork.c:2740
    [] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    syzbot also reported another leak, in copy_net_ns() -> setup_net().
    That problem is already fixed by commit
    cf47a0b882a4e5f6b34c7949d7b293e9287f1972.
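
    The shape of the fix, sketched (assuming key_remove_domain() is the
    release helper from the keys domain API; illustrative):

    /* copy_net_ns() error path: release the key domain allocated right
     * after net_alloc(), mirroring the teardown done on net destruction. */
    put_userns:
    #ifdef CONFIG_KEYS
        key_remove_domain(net->key_domain);
    #endif
        put_user_ns(user_ns);
        net_drop_ns(net);
        return ERR_PTR(rv);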

    Fixes: 9b242610514f ("keys: Network namespace domain tag")
    Reported-and-tested-by: syzbot+3b3296d032353c33184b@syzkaller.appspotmail.com
    Signed-off-by: Takeshi Misawa
    Signed-off-by: David S. Miller

    Takeshi Misawa
     

10 Oct, 2019

1 commit

  • The NLM_F_ECHO flag is meant to echo the message notified to all
    listeners back to the requesting user.
    That was not the case with the RTM_NEWNSID command; let's fix this.

    Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
    Reported-by: Guillaume Nault
    Signed-off-by: Nicolas Dichtel
    Acked-by: Guillaume Nault
    Tested-by: Guillaume Nault
    Signed-off-by: Jakub Kicinski

    Nicolas Dichtel
     

12 Jul, 2019

1 commit

  • Pull networking updates from David Miller:
    "Some highlights from this development cycle:

    1) Big refactoring of ipv6 route and neigh handling to support
    nexthop objects configurable as units from userspace. From David
    Ahern.

    2) Convert explored_states in BPF verifier into a hash table,
    significantly decreased state held for programs with bpf2bpf
    calls, from Alexei Starovoitov.

    3) Implement bpf_send_signal() helper, from Yonghong Song.

    4) Various classifier enhancements to mvpp2 driver, from Maxime
    Chevallier.

    5) Add aRFS support to hns3 driver, from Jian Shen.

    6) Fix use after free in inet frags by allocating fqdirs dynamically
    and reworking how rhashtable dismantle occurs, from Eric Dumazet.

    7) Add act_ctinfo packet classifier action, from Kevin
    Darbyshire-Bryant.

    8) Add TFO key backup infrastructure, from Jason Baron.

    9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

    10) Add devlink notifications for flash update status to mlxsw driver,
    from Jiri Pirko.

    11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

    12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

    13) Various enhancements to ipv6 flow label handling, from Eric
    Dumazet and Willem de Bruijn.

    14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
    der Merwe, and others.

    15) Various improvements to axienet driver including converting it to
    phylink, from Robert Hancock.

    16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

    17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
    Radulescu.

    18) Add devlink health reporting to mlx5, from Moshe Shemesh.

    19) Convert stmmac over to phylink, from Jose Abreu.

    20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
    Shalom Toledo.

    21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

    22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

    23) Track spill/fill of constants in BPF verifier, from Alexei
    Starovoitov.

    24) Support bounded loops in BPF, from Alexei Starovoitov.

    25) Various page_pool API fixes and improvements, from Jesper Dangaard
    Brouer.

    26) Just like ipv4, support ref-countless ipv6 route handling. From
    Wei Wang.

    27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

    28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

    29) Add flower GRE encap/decap support to nfp driver, from Pieter
    Jansen van Vuuren.

    30) Protect against stack overflow when using act_mirred, from John
    Hurley.

    31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

    32) Use page_pool API in netsec driver, Ilias Apalodimas.

    33) Add Google gve network driver, from Catherine Sullivan.

    34) More indirect call avoidance, from Paolo Abeni.

    35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

    36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

    37) Add MPLS manipulation actions to TC, from John Hurley.

    38) Add sending a packet to connection tracking from TC actions, and
    then allow flower classifier matching on conntrack state. From
    Paul Blakey.

    39) Netfilter hw offload support, from Pablo Neira Ayuso"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
    net/mlx5e: Return in default case statement in tx_post_resync_params
    mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
    net: dsa: add support for BRIDGE_MROUTER attribute
    pkt_sched: Include const.h
    net: netsec: remove static declaration for netsec_set_tx_de()
    net: netsec: remove superfluous if statement
    netfilter: nf_tables: add hardware offload support
    net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
    net: flow_offload: add flow_block_cb_is_busy() and use it
    net: sched: remove tcf block API
    drivers: net: use flow block API
    net: sched: use flow block API
    net: flow_offload: add flow_block_cb_{priv, incref, decref}()
    net: flow_offload: add list handling functions
    net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
    net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
    net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
    net: flow_offload: add flow_block_cb_setup_simple()
    net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
    net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
    ...

    Linus Torvalds
     

09 Jul, 2019

1 commit

  • Merge tag 'keys-namespace-20190627' of
    git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

    Pull keyring namespacing from David Howells:
    "These patches help make keys and keyrings more namespace aware.

    Firstly some miscellaneous patches to make the process easier:

    - Simplify key index_key handling so that the word-sized chunks
    assoc_array requires don't have to be shifted about, making it
    easier to add more bits into the key.

    - Cache the hash value in the key so that we don't have to calculate
    on every key we examine during a search (it involves a bunch of
    multiplications).

    - Allow keyring_search() to search non-recursively.

    Then the main patches:

    - Make it so that keyring names are per-user_namespace from the point
    of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
    accessible cross-user_namespace.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

    - Move the user and user-session keyrings to the user_namespace
    rather than the user_struct. This prevents them propagating
    directly across user_namespaces boundaries (ie. the KEY_SPEC_*
    flags will only pick from the current user_namespace).

    - Make it possible to include the target namespace in which the key
    shall operate in the index_key. This will allow the possibility of
    multiple keys with the same description, but different target
    domains to be held in the same keyring.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

    - Make it so that keys are implicitly invalidated by removal of a
    domain tag, causing them to be garbage collected.

    - Institute a network namespace domain tag that allows keys to be
    differentiated by the network namespace in which they operate. New
    keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
    the network domain in force when they are created.

    - Make it so that the desired network namespace can be handed down
    into the request_key() mechanism. This allows AFS, NFS, etc. to
    request keys specific to the network namespace of the superblock.

    This also means that the keys in the DNS record cache are
    thenceforth namespaced, provided network filesystems pass the
    appropriate network namespace down into dns_query().

    For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
    cache keyrings, such as idmapper keyrings, also need to set the
    domain tag - for which they need access to the network namespace of
    the superblock"

    * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Pass the network namespace into request_key mechanism
    keys: Network namespace domain tag
    keys: Garbage collect keys for which the domain has been removed
    keys: Include target namespace in match criteria
    keys: Move the user and user-session keyrings to the user_namespace
    keys: Namespace keyring names
    keys: Add a 'recurse' flag for keyring searches
    keys: Cache the hash value to avoid lots of recalculation
    keys: Simplify key description management

    Linus Torvalds
     

27 Jun, 2019

1 commit

  • Create key domain tags for network namespaces and make it possible to
    automatically tag keys that are used by networked services (e.g. AF_RXRPC,
    AFS, DNS) with the default network namespace if not set by the caller.

    This allows keys with the same description but in different namespaces to
    coexist within a keyring.
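
    A sketch of how a key type opts in (the example type is hypothetical;
    the flag name is per this series):

    #include <linux/key-type.h>

    /* A key type flagged KEY_TYPE_NET_DOMAIN gets its keys tagged with
     * the network namespace in force at creation time, so identical
     * descriptions can coexist in one keyring across netns boundaries. */
    static struct key_type key_type_example = {
        .name  = "example",
        .flags = KEY_TYPE_NET_DOMAIN,
    };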

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: linux-nfs@vger.kernel.org
    cc: linux-cifs@vger.kernel.org
    cc: linux-afs@lists.infradead.org

    David Howells
     

23 Jun, 2019

1 commit

  • ops has already been iterated back to the first element by the time
    pre_exit is called, so it needs to be restored from saved_ops, not
    saved into saved_ops.
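
    The fix, roughly (sketch of the setup_net() error-path shape;
    illustrative):

    /* After the pre_exit walk, 'ops' points at the first element again,
     * so restore it before the exit walk instead of overwriting the
     * saved position. */
        saved_ops = ops;
        list_for_each_entry_continue_reverse(ops, &pernet_list, list)
            ops_pre_exit_list(ops, &net_exit_list);

        synchronize_rcu();

        ops = saved_ops;  /* the fix; was: saved_ops = ops; */
        list_for_each_entry_continue_reverse(ops, &pernet_list, list)
            ops_exit_list(ops, &net_exit_list);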

    Fixes: d7d99872c144 ("netns: add pre_exit method to struct pernet_operations")
    Signed-off-by: Li RongQing
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Li RongQing
     

19 Jun, 2019

1 commit

  • Current struct pernet_operations exit() handlers are highly
    discouraged from calling synchronize_rcu().

    There are cases where we need them, and exit_batch() does
    not help the common case where a single netns is dismantled.

    This patch leverages the existing synchronize_rcu() call
    in cleanup_net().

    Calling the optional ->pre_exit() method before ->exit() or
    ->exit_batch() allows them to benefit from a single synchronize_rcu()
    call.

    Note that the synchronize_rcu() calls added in this patch
    are only in error paths or slow paths.

    Tested:

    $ time for i in {1..1000}; do unshare -n /bin/false;done

    real 0m2.612s
    user 0m0.171s
    sys 0m2.216s
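
    A sketch of the new hook in use (illustrative; the example_* names are
    hypothetical):

    /* ->pre_exit() unpublishes per-netns objects; cleanup_net() then
     * issues a single synchronize_rcu() before ->exit()/->exit_batch()
     * free them. */
    static void __net_exit example_net_pre_exit(struct net *net)
    {
        /* e.g. unhash objects so no new RCU lookups can find them */
    }

    static void __net_exit example_net_exit(struct net *net)
    {
        /* safe to free: a grace period has elapsed since pre_exit */
    }

    static struct pernet_operations example_net_ops = {
        .pre_exit = example_net_pre_exit,
        .exit     = example_net_exit,
    };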

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

28 Apr, 2019

1 commit

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

12 Apr, 2019

1 commit


29 Mar, 2019

1 commit

  • net_hash_mix() currently uses the kernel address of a struct net,
    and is used in many places that could be used to reveal this
    address to a patient attacker, thus defeating KASLR for
    the typical case (the initial net namespace, where &init_net is
    not dynamically allocated).

    I believe the original implementation tried to avoid spending
    too many cycles in this function, but security comes first.

    Also provide entropy regardless of CONFIG_NET_NS.
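
    The hardened helper, sketched (assuming a per-netns random salt seeded
    at namespace setup; illustrative):

    /* Return a random per-netns salt instead of leaking the struct net
     * kernel address into hash values. */
    static inline u32 net_hash_mix(const struct net *net)
    {
        return net->hash_mix;  /* e.g. from get_random_u32() at setup */
    }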

    Fixes: 0b4419162aa6 ("netns: introduce the net_hash_mix "salt" for hashes")
    Signed-off-by: Eric Dumazet
    Reported-by: Amit Klein
    Reported-by: Benny Pinkas
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jan, 2019

1 commit


25 Dec, 2018

2 commits


28 Nov, 2018

5 commits


09 Oct, 2018

1 commit


27 Aug, 2018

1 commit

  • Pull IDA updates from Matthew Wilcox:
    "A better IDA API:

    id = ida_alloc(ida, GFP_xxx);
    ida_free(ida, id);

    rather than the cumbersome ida_simple_get(), ida_simple_remove().

    The new IDA API is similar to ida_simple_get() but better named. The
    internal restructuring of the IDA code removes the bitmap
    preallocation nonsense.

    I hope the net -200 lines of code is convincing"

    * 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
    ida: Change ida_get_new_above to return the id
    ida: Remove old API
    test_ida: check_ida_destroy and check_ida_alloc
    test_ida: Convert check_ida_conv to new API
    test_ida: Move ida_check_max
    test_ida: Move ida_check_leaf
    idr-test: Convert ida_check_nomem to new API
    ida: Start new test_ida module
    target/iscsi: Allocate session IDs from an IDA
    iscsi target: fix session creation failure handling
    drm/vmwgfx: Convert to new IDA API
    dmaengine: Convert to new IDA API
    ppc: Convert vas ID allocation to new IDA API
    media: Convert entity ID allocation to new IDA API
    ppc: Convert mmu context allocation to new IDA API
    Convert net_namespace to new IDA API
    cb710: Convert to new IDA API
    rsxx: Convert to new IDA API
    osd: Convert to new IDA API
    sd: Convert to new IDA API
    ...

    Linus Torvalds
     

22 Aug, 2018

1 commit


21 Jul, 2018

1 commit

  • Make net_ns_get_ownership() reusable by networking code outside of core.
    This is useful, for example, to allow bridge-related sysfs files to be
    owned by container root.

    Add a function comment, since this is a potentially dangerous function
    to use given the way that kobject_get_ownership() works, by initializing
    uid and gid before calling .get_ownership().
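
    A sketch of a caller (illustrative; example_get_ownership() is
    hypothetical):

    /* A sysfs .get_ownership() hook delegating to net_ns_get_ownership().
     * kobject_get_ownership() initializes *uid/*gid to GLOBAL_ROOT_UID/GID
     * first; net_ns_get_ownership() only overrides them when the netns
     * owner's ids map to valid kuids/kgids. */
    static void example_get_ownership(struct kobject *kobj,
                                      kuid_t *uid, kgid_t *gid)
    {
        struct net_device *dev = to_net_dev(kobj_to_dev(kobj));

        net_ns_get_ownership(dev_net(dev), uid, gid);
    }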

    Signed-off-by: Tyler Hicks
    Signed-off-by: David S. Miller

    Tyler Hicks
     

01 Apr, 2018

1 commit

  • This function calls call_netdevice_notifiers(), which may also
    take net_rwsem, so we can't use net_rwsem here.

    This patch makes callers of this function take pernet_ops_rwsem,
    like register_netdevice_notifier() does. This protects
    the modifications of net_namespace_list and allows notifiers
    to take it (they won't have to care about context).

    Since __rtnl_link_unregister() is used on module load
    and unload (which are not frequent operations), this looks
    better to me than making all call_netdevice_notifiers() calls
    always execute in a "protected net_namespace_list" context.

    Also, this fixes the problem we dealt with in 328fbe747ad4
    "Close race between {un, }register_netdevice_notifier and ...",
    and guarantees that __rtnl_link_unregister() does not skip
    an exiting net.
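
    The caller-side pattern, sketched (illustrative; rtnl locking
    simplified):

    /* Callers wrap the unregister in pernet_ops_rwsem, which protects
     * net_namespace_list and lets notifiers take net_rwsem freely. */
    void rtnl_link_unregister(struct rtnl_link_ops *ops)
    {
        down_write(&pernet_ops_rwsem);
        rtnl_lock();
        __rtnl_link_unregister(ops);
        rtnl_unlock();
        up_write(&pernet_ops_rwsem);
    }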

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

30 Mar, 2018

1 commit

  • rtnl_lock() is used everywhere, and contention is very high.
    When someone wants to iterate over alive net namespaces,
    they have no way to do that without the exclusive lock.
    But an exclusive rtnl_lock() in such places is overkill,
    and it just increases the contention. Yes, there is already
    for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
    so the iteration can't sleep. Also, sometimes we may really need
    to prevent net_namespace_list growth, so for_each_net_rcu()
    does not fit there either.

    This patch introduces a new rw_semaphore, which will be used
    instead of rtnl_mutex to protect net_namespace_list. It is
    sleepable and allows non-exclusive iterations over the net
    namespaces list. It lets us stop using rtnl_lock()
    in several places (which is done in the next patches) and reduces
    the time we hold rtnl_mutex. Here we just add the new lock;
    the explanation of why we can remove rtnl_lock() there is
    in the next patches.

    Fine-grained locks are generally better than one big lock,
    so let's do that with net_namespace_list, while the situation
    allows it.
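
    A sketch of the iteration pattern this enables (illustrative
    fragment):

    struct net *net;

    /* Sleepable, non-exclusive walk over the namespace list under the
     * new net_rwsem read side. */
    down_read(&net_rwsem);
    for_each_net(net) {
        /* may sleep here, unlike under for_each_net_rcu() */
    }
    up_read(&net_rwsem);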

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

28 Mar, 2018

4 commits


23 Mar, 2018

1 commit


07 Mar, 2018

1 commit

  • The patch adds SLAB_ACCOUNT to the flags of the net_cachep cache,
    which enables accounting of struct net memory to memcg kmem.
    Since the number of net namespaces may be significant, users
    want to know how much memory is consumed and to be able to control it.

    Note that we do not account net_generic to the same memcg
    where net was accounted; in fact, we don't account it at all (*).
    We do not want a situation where a single memcg's memory deficit
    prevents us from registering new pernet_operations.

    (*) Even despite there being !current process accounting already
    available in linux-next. See kmalloc_memcg() there for the details.
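
    The change, roughly (sketch; mirrors the cache creation described
    above):

    /* Create net_cachep with SLAB_ACCOUNT so each struct net allocation
     * is charged to the kmem counters of the allocating task's memcg. */
    net_cachep = kmem_cache_create("net_namespace", sizeof(struct net),
                                   SMP_CACHE_BYTES,
                                   SLAB_PANIC | SLAB_ACCOUNT, NULL);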

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai