03 Jan, 2018

1 commit

  • [ Upstream commit 21b5944350052d2583e82dd59b19a9ba94a007f0 ]

    (I can trivially verify that that idr_remove in cleanup_net happens
    after the network namespace count has dropped to zero --EWB)

    Function get_net_ns_by_id() does not check for net::count
    after it has found a peer in netns_ids idr.

    It may dereference a peer after its count has already been
    finally decremented. This leads to a double free and memory
    corruption:

    put_net(peer)                                   rtnl_lock()
    atomic_dec_and_test(&peer->count) [count=0]     ...
    __put_net(peer)                                 get_net_ns_by_id(net, id)
      spin_lock(&cleanup_list_lock)
      list_add(&net->cleanup_list, &cleanup_list)
      spin_unlock(&cleanup_list_lock)
    queue_work()                                      peer = idr_find(&net->netns_ids, id)
      |                                               get_net(peer) [count=1]
      |                                               ...
      |                                               (use after final put)
      v                                             ...
    cleanup_net()                                   ...
      spin_lock(&cleanup_list_lock)                 ...
      list_replace_init(&cleanup_list, ..)          ...
      spin_unlock(&cleanup_list_lock)               ...
      ...                                           ...
      ...                                           put_net(peer)
      ...                                             atomic_dec_and_test(&peer->count) [count=0]
      ...                                               spin_lock(&cleanup_list_lock)
      ...                                               list_add(&net->cleanup_list, &cleanup_list)
      ...                                               spin_unlock(&cleanup_list_lock)
      ...                                             queue_work()
      ...                                           rtnl_unlock()
      rtnl_lock()                                   ...
      for_each_net(tmp) {                           ...
        id = __peernet2id(tmp, peer)                ...
        spin_lock_irq(&tmp->nsid_lock)              ...
        idr_remove(&tmp->netns_ids, id)             ...
        ...                                         ...
        net_drop_ns()                               ...
          net_free(peer)                            ...
      }                                             ...
      |
      v
      cleanup_net()
        ...
        (Second free of peer)

    Also, put_net() on the right cpu may be reordered with the left cpu's
    list_replace_init(&cleanup_list, ..), and then cleanup_list
    will be corrupted.

    Since cleanup_net() is executed in a worker thread, while
    put_net(peer) can happen anywhere, there should be
    enough time for a concurrent get_net_ns_by_id() to pick
    the peer up, so the race does not seem unlikely.
    The patch fixes the problem in the standard way.
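
    The "take a reference only if the count is still non-zero" idea behind
    this kind of fix can be sketched in userspace C. This is a minimal model
    using C11 atomics, not the kernel code: struct peer and the helpers
    peer_get_unless_zero(), peer_put() and lookup_and_get() are hypothetical
    names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct net; only the count matters. */
struct peer {
	atomic_int count;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool peer_get_unless_zero(struct peer *p)
{
	int old = atomic_load(&p->count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value; retry. */
		if (atomic_compare_exchange_weak(&p->count, &old, old + 1))
			return true;
	}
	return false;	/* the final put already happened */
}

/* Drop a reference; returns true if this was the final put. */
static bool peer_put(struct peer *p)
{
	return atomic_fetch_sub(&p->count, 1) == 1;
}

/* Lookup helper: returns the peer with a reference held, or NULL. */
static struct peer *lookup_and_get(struct peer *found)
{
	if (found && !peer_get_unless_zero(found))
		return NULL;
	return found;
}
```

    With an unconditional increment, a lookup that races with the final put
    would move the count from 0 back to 1 and later trigger a second final
    put. With the conditional increment the lookup simply fails instead.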

    (Also, there is a possible problem in peernet2id_alloc(), which requires
    a check for net::count under nsid_lock and maybe_get_net(peer), but
    in the current stable kernel it is used under rtnl_lock() and so has to be
    safe. Openswitch has begun to use peernet2id_alloc(), and possibly it should
    be fixed too. Since this is not in the stable kernel yet, I'll send
    a separate message to netdev@ later).

    Cc: Nicolas Dichtel
    Signed-off-by: Kirill Tkhai
    Fixes: 0c7aecd4bde4 "netns: add rtnl cmd to add and get peer netns ids"
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: Eric W. Biederman
    Reviewed-by: Eric Dumazet
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

10 Aug, 2017

2 commits


01 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

30 Jun, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. This batch contains connection tracking updates for the cleanup
    iteration path, patches from Florian Westphal:

    X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
    dying bit to let the CPU release them.

    X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
    conntracks from all namespaces.

    X) Restart iteration on hashtable resizing, since both may occur at
    the same time.

    X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
    mapping on module removal.

    X) Use nf_ct_iterate_destroy() to remove conntrack entries on helper
    module removal, from Liping Zhang.

    X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
    if user requests this, also from Liping.

    X) Add net_ns_barrier() and use it from FTP helper, to make sure
    no concurrent namespace removal happens at the same time while
    the helper module is being removed.

    X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
    module size. Same thing in nf_tables.

    Updates for the nf_tables infrastructure:

    X) Prepare usage of the extended ACK reporting infrastructure for
    nf_tables.

    X) Remove unnecessary forward declaration in nf_tables hash set.

    X) Skip set size estimation if number of elements is not specified.

    X) Changes to accommodate a (faster) unresizable hash set implementation,
    for anonymous sets and dynamic size fixed sets with no timeouts.

    X) Faster lookup function for unresizable hash table for 2 and 4
    bytes key.

    And, finally, a bunch of assorted small updates and cleanups:

    X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe
    to device events and look up for index from the packet path, this
    is fixing an issue that is present since the very beginning, patch
    from Xin Long.

    X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.

    X) Use ebt_invalid_target() whenever possible in the ebtables tree,
    from Gao Feng.

    X) Calm down compilation warning in nf_dup infrastructure, patch from
    stephen hemminger.

    X) Statify functions in nftables rt expression, also from stephen.

    X) Update Makefile to use canonical method to specify nf_tables-objs.
    From Jike Song.

    X) Use nf_conntrack_helpers_register() in amanda and H323.

    X) Space cleanup for ctnetlink, from linzhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Jun, 2017

1 commit

  • Quoting Joe Stringer:
    If a user loads nf_conntrack_ftp, sends FTP traffic through a network
    namespace, destroys that namespace then unloads the FTP helper module,
    then the kernel will crash.

    Events that lead to the crash:
    1. conntrack is created with ftp helper in netns x
    2. This netns is destroyed
    3. netns destruction is scheduled
    4. netns destruction wq starts, removes netns from global list
    5. ftp helper is unloaded, which resets all helpers of the conntracks
    via for_each_net()

    but because netns is already gone from list the for_each_net() loop
    doesn't include it, therefore all of these conntracks are unaffected.

    6. helper module unload finishes
    7. netns wq invokes destructor for rmmod'ed helper

    CC: "Eric W. Biederman"
    Reported-by: Joe Stringer
    Signed-off-by: Florian Westphal
    Acked-by: David S. Miller
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

11 Jun, 2017

2 commits


26 May, 2017

1 commit

  • The default value for somaxconn is set in sysctl_core_net_init(), but this
    function is not called when kernel is configured without CONFIG_SYSCTL.

    This results in the kernel not being able to accept TCP connections,
    because the backlog has zero size. Usually, the user ends up with:
    "TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request. Check SNMP counters."
    If SYN cookies are not enabled the connection is rejected.

    Before ef547f2ac16 (tcp: remove max_qlen_log), the effects were less
    severe, because the backlog was always at least eight slots long.

    Signed-off-by: Roman Kapl
    Signed-off-by: David S. Miller

    Roman Kapl
     

01 May, 2017

1 commit

  • Initialise init_net.count to 1 for its pointer from init_nsproxy lest
    someone tries to do a get_net() and a put_net() in a process in which
    current->nsproxy->net_ns points to the initial network namespace.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

18 Apr, 2017

1 commit

  • Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
    for doit functions that call it directly.

    This is the first step to using extended error reporting in rtnetlink.
    From here individual subsystems can be updated to set netlink_ext_ack as
    needed.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

14 Apr, 2017

1 commit


02 Mar, 2017

1 commit


15 Dec, 2016

2 commits

  • Pull audit updates from Paul Moore:
    "After the small number of patches for v4.9, we've got a much bigger
    pile for v4.10.

    The bulk of these patches involve a rework of the audit backlog queue
    to enable us to move the netlink multicasting out of the task/thread
    that generates the audit record and into the kernel thread that emits
    the record (just like we do for the audit unicast to auditd).

    While we were playing with the backlog queue(s) we fixed a number of
    other little problems with the code, and from all the testing so far
    things look to be in much better shape now. Doing this also allowed us
    to re-enable disabling IRQs for some netns operations ("netns: avoid
    disabling irq for netns id").

    The remaining patches fix some small problems that are well documented
    in the commit descriptions, as well as adding session ID filtering
    support"

    * 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
    audit: use proper refcount locking on audit_sock
    netns: avoid disabling irq for netns id
    audit: don't ever sleep on a command record/message
    audit: handle a clean auditd shutdown with grace
    audit: wake up kauditd_thread after auditd registers
    audit: rework audit_log_start()
    audit: rework the audit queue handling
    audit: rename the queues and kauditd related functions
    audit: queue netlink multicast sends just like we do for unicast sends
    audit: fixup audit_init()
    audit: move kaudit thread start from auditd registration to kaudit init (#2)
    audit: add support for session ID user filter
    audit: fix formatting of AUDIT_CONFIG_CHANGE events
    audit: skip sessionid sentinel value when auto-incrementing
    audit: tame initialization warning len_abuf in audit_log_execve_info
    audit: less stack usage for /proc/*/loginuid

    Linus Torvalds
     
  • Bring back commit bc51dddf98c9 ("netns: avoid disabling irq for netns
    id") now that we've fixed some audit multicast issues that caused
    problems with original attempt. Additional information, and history,
    can be found in the links below:

    * https://github.com/linux-audit/audit-kernel/issues/22
    * https://github.com/linux-audit/audit-kernel/issues/23

    Signed-off-by: Cong Wang
    Signed-off-by: Paul Moore

    Paul Moore
     

04 Dec, 2016

3 commits

  • net_generic() function is both a) inline and b) used ~600 times.

    It has the following code inside

    ...
    ptr = ng->ptr[id - 1];
    ...

    "id" is never compile time constant so compiler is forced to subtract 1.
    And those decrements or LEA [r32 - 1] instructions add up.

    We also start id'ing from 1 to catch bugs where a pernet subsystem id
    is not initialized and is 0. This is a quite pointless idea in general
    (either nothing will work, or there will be immediate interference with
    the first registered subsystem) but it hints at what needs to be done for
    code size reduction.

    Namely, overlaying allocation of pointer array and fixed part of
    structure in the beginning and using usual base-0 addressing.

    Ids are just cookies, their exact values do not matter, so let's start
    with 3 on x86_64.

    Code size savings (oh boy): -4.2 KB

    As usual, ignore the initial compiler stupidity part of the table.

    add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
    function                       old    new  delta
    tipc_nametbl_insert_publ      1250   1270    +20
    nlmclnt_lookup_host            686    703    +17
    nfsd4_encode_fattr            5930   5941    +11
    nfs_get_client                1050   1061    +11
    register_pernet_operations     333    342     +9
    tcf_mirred_init                843    849     +6
    tcf_bpf_init                  1143   1149     +6
    gss_setup_upcall               990    994     +4
    idmap_name_to_id               432    434     +2
    ops_init                       274    275     +1
    nfsd_inject_forget_client      259    260     +1
    nfs4_alloc_client              612    613     +1
    tunnel_key_walker              164    163     -1

    ...

    tipc_bcbase_select_primary     392    360    -32
    mac80211_hwsim_new_radio      2808   2767    -41
    ipip6_tunnel_ioctl            2228   2186    -42
    tipc_bcast_rcv                 715    672    -43
    tipc_link_build_proto_msg     1140   1089    -51
    nfsd4_lock                    3851   3796    -55
    tipc_mon_rcv                  1012    956    -56
    Total: Before=156643951, After=156639743, chg -0.00%

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • This is precursor to fixing "[id - 1]" bloat inside net_generic().

    Name "s" is chosen to complement name "u" often used for dummy unions.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • Publishing the net_generic pointer is done with a silly mistake: the new
    array is published BEFORE setting the freshly acquired pernet subsystem
    pointer.

    memcpy
    rcu_assign_pointer
    kfree_rcu
    ng->ptr[id - 1] = data;

    This bug was introduced with commit dec827d174d7f76c457238800183ca864a639365
    ("[NETNS]: The generic per-net pointers.") in the glorious days of
    chopping networking stack into containers proper 8.5 years ago (whee...)

    How did it not trigger for so long?
    Well, you need a quite specific set of conditions:

    *) race window opens once per pernet subsystem addition
    (read: modprobe or boot)

    *) not every pernet subsystem is eligible (need ->id and ->size)

    *) not every pernet subsystem is vulnerable (need incorrect or absent
    ordering of register_pernet_subsys() and actually using net_generic())

    *) to hide the bug even more, the default is to preallocate 13 pointers,
    which is actually quite a lot. You need IPv6, netfilter, bridging etc.
    loaded together to trigger reallocation in the first place. Trimmed-down
    configs are OK.
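
    The ordering described above (fill the slot, then publish the array) can
    be modelled in userspace C. This is only a sketch under stated
    assumptions: struct generic_array, assign_generic() and lookup_generic()
    are hypothetical names, a C11 release store stands in for
    rcu_assign_pointer(), and the old array is freed immediately where the
    kernel would defer the free with kfree_rcu().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model of the per-net pointer array. */
struct generic_array {
	size_t len;
	void *ptr[];	/* one slot per subsystem id, ids start at 1 */
};

static _Atomic(struct generic_array *) published;

/* Grow the array, store data for id, and only then publish it. */
static int assign_generic(size_t id, void *data)
{
	struct generic_array *old =
		atomic_load_explicit(&published, memory_order_relaxed);
	size_t old_len = old ? old->len : 0;
	size_t new_len = id > old_len ? id : old_len;
	struct generic_array *ng;

	if (id == 0)
		return -1;
	ng = calloc(1, sizeof(*ng) + new_len * sizeof(void *));
	if (!ng)
		return -1;
	ng->len = new_len;
	if (old)
		memcpy(ng->ptr, old->ptr, old_len * sizeof(void *));
	ng->ptr[id - 1] = data;		/* fill the slot first ... */
	/* ... then make the fully initialised array visible to readers. */
	atomic_store_explicit(&published, ng, memory_order_release);
	free(old);	/* single-threaded sketch; no reader can still hold old */
	return 0;
}

/* Reader side: returns the pointer stored for id, or NULL. */
static void *lookup_generic(size_t id)
{
	struct generic_array *ng =
		atomic_load_explicit(&published, memory_order_acquire);

	if (!ng || id == 0 || id > ng->len)
		return NULL;
	return ng->ptr[id - 1];
}
```

    If the store to ng->ptr[id - 1] came after the publishing store, a
    concurrent reader could observe the new array with an empty slot, which
    is the window the commit message describes.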

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

23 Nov, 2016

1 commit

  • All conflicts were simple overlapping changes except perhaps
    for the Thunder driver.

    That driver has a change_mtu method explicitly for sending
    a message to the hardware. If that fails it returns an
    error.

    Normally a driver doesn't need an ndo_change_mtu method because those
    are usually just range changes, which are now handled generically.
    But since this extra operation is needed in the Thunder driver, it has
    to stay.

    However, if the message send fails we have to restore the original
    MTU before the change because the entire call chain expects that if
    an error is thrown by ndo_change_mtu then the MTU did not change.
    Therefore code is added to nicvf_change_mtu to remember the original
    MTU, and to restore it upon nicvf_update_hw_max_frs() failure.

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Nov, 2016

2 commits

  • Make struct pernet_operations::id unsigned.

    There are 2 reasons to do so:

    1)
    This field is really an index into a zero-based array and
    thus is an unsigned entity. Using a negative value is an out-of-bounds
    access by definition.

    2)
    On x86_64, unsigned 32-bit data which are mixed with pointers
    via array indexing or offsets added to or subtracted from pointers
    are preferred to signed 32-bit data.

    "int" being used as an array index needs to be sign-extended
    to 64-bit before being used.

    void f(long *p, int i)
    {
            g(p[i]);
    }

    roughly translates to

    movsx rsi, esi
    mov rdi, [rsi+...]
    call g

    MOVSX is 3 byte instruction which isn't necessary if the variable is
    unsigned because x86_64 is zero extending by default.

    Now, there is net_generic() function which, you guessed it right, uses
    "int" as an array index:

    static inline void *net_generic(const struct net *net, int id)
    {
            ...
            ptr = ng->ptr[id - 1];
            ...
    }

    And this function is used a lot, so those sign extensions add up.

    Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
    messing with code generation):

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

    Unfortunately some functions actually grow bigger.
    This is a seemingly random artefact of code generation with the register
    allocator being used differently. gcc decides that some variable
    needs to live in the new r8+ registers and every access now requires a REX
    prefix. Or it is shifted into r12, so the [r12+0] addressing mode has to be
    used, which is longer than [r8].

    However, overall balance is in negative direction:

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
    function                       old    new  delta
    nfsd4_lock                    3886   3959    +73
    tipc_link_build_proto_msg     1096   1140    +44
    mac80211_hwsim_new_radio      2776   2808    +32
    tipc_mon_rcv                  1032   1058    +26
    svcauth_gss_legacy_init       1413   1429    +16
    tipc_bcbase_select_primary     379    392    +13
    nfsd4_exchange_id             1247   1260    +13
    nfsd4_setclientid_confirm      782    793    +11
    ...
    put_client_renew_locked        494    480    -14
    ip_set_sockfn_get              730    716    -14
    geneve_sock_add                829    813    -16
    nfsd4_sequence_done            721    703    -18
    nlmclnt_lookup_host            708    686    -22
    nfsd4_lockt                   1085   1063    -22
    nfs_get_client                1077   1050    -27
    tcf_bpf_init                  1106   1076    -30
    nfsd4_encode_fattr            5997   5930    -67
    Total: Before=154856051, After=154854321, chg -0.00%

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     
  • Andrei reports we still allocate netns ID from idr after we destroy
    it in cleanup_net().

    cleanup_net():
    ...
    idr_destroy(&net->netns_ids);
    ...
    list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list);
    -> rollback_registered_many()
    -> rtmsg_ifinfo_build_skb()
    -> rtnl_fill_ifinfo()
    -> peernet2id_alloc()

    After that point we should not even access net->netns_ids, we
    should check the death of the current netns as early as we can in
    peernet2id_alloc().

    For net-next we can consider to avoid sending rtmsg totally,
    it is a good optimization for netns teardown path.

    Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
    Reported-by: Andrei Vagin
    Cc: Nicolas Dichtel
    Signed-off-by: Cong Wang
    Acked-by: Andrei Vagin
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    WANG Cong
     

31 Oct, 2016

1 commit


24 Oct, 2016

1 commit

  • net_mutex can be locked for a long time. It may be because many
    namespaces are being destroyed or many processes decide to create
    a network namespace.

    Both these operations are heavy, so it is better to have an ability to
    kill a process which is waiting for net_mutex.

    Cc: "David S. Miller"
    Cc: Eric W. Biederman
    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

23 Oct, 2016

1 commit

  • This reverts commit bc51dddf98c9 ("netns: avoid disabling irq for
    netns id") as it was found to cause problems with systems running
    SELinux/audit, see the mailing list thread below:

    * http://marc.info/?t=147694653900002&r=1&w=2

    Eventually we should be able to reintroduce this code once we have
    rewritten the audit multicast code to queue messages much the same
    way we do for unicast messages. A tracking issue for this can be
    found below:

    * https://github.com/linux-audit/audit-kernel/issues/23

    Reported-by: Stephen Smalley
    Reported-by: Elad Raz
    Cc: Cong Wang
    Signed-off-by: Paul Moore
    Signed-off-by: David S. Miller

    Paul Moore
     

07 Oct, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs the code needs to be inexpensive to use as it is always on, and
    the default limits need to be high enough that well behaved programs
    on well behaved systems don't encounter them.

    To this end, after some review I have implemented per user per user
    namespace limits, and use them to limit the number of namespaces. The
    limits being per user mean that one user can not exhaust the limits of
    another user. The limits being per user namespace allow contexts where
    the limit is 0 and security conscious folks can remove from their
    threat analysis the code used to manage namespaces (as they have
    historically done as it is root only). At the same time the limits being
    per user namespace allow other parts of the system to use namespaces.

    Namespaces are increasingly being used in application sand boxing
    scenarios so an all or nothing disable for the entire system for the
    security conscious folks makes increasing use of these sandboxes
    impossible.

    There is also added a limit on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system wide limit on the number of mount structures in the kernel would
    be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace
    however is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to be 30,000 to 50,000 mounts I have set
    the default limit for the number of mounts at 100,000 which is well
    above every known set of users but low enough that the mount hash
    tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace as well as interest expressed in making details about what
    is going on in the kernel more visible"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns; Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

24 Sep, 2016

1 commit

  • With the newly enforced limit on the number of namespaces,
    we get a build warning if CONFIG_NETNS is disabled:

    net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function]
    net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function]

    This moves the two added functions inside the #ifdef that guards
    their callers.

    Fixes: 703286608a22 ("netns: Add a limit on the number of net namespaces")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Eric W. Biederman

    Arnd Bergmann
     

23 Sep, 2016

3 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace and currently there is no way
    to discover these relationships.

    Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships either.

    Why might we want to know the relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use-case (which is usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There [1] was a discussion about which interface to choose to determine
    relationships between namespaces.

    Eric suggested to add two ioctl-s [2]:
    > Grumble, Grumble. I think this may actually a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementation of these ioctl-s.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user
    namespace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.
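
    A minimal userspace sketch of calling these ioctls follows. The helper
    name ns_related_fd() is hypothetical. The request values are repeated
    from <linux/nsfs.h> only so the sketch builds where that header is not
    installed; on a real system, including the header is preferable.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Same values as <linux/nsfs.h>; defined here only as a fallback. */
#ifndef NS_GET_USERNS
#define NSIO		0xb7
#define NS_GET_USERNS	_IO(NSIO, 0x1)
#define NS_GET_PARENT	_IO(NSIO, 0x2)
#endif

/*
 * Hypothetical helper: open a namespace file such as /proc/self/ns/net and
 * ask the kernel for a related namespace (owner or parent).
 * Returns a new file descriptor, or -1 with errno set.
 */
static int ns_related_fd(const char *ns_path, unsigned long request)
{
	int ns_fd = open(ns_path, O_RDONLY);
	int fd, saved_errno;

	if (ns_fd < 0)
		return -1;
	fd = ioctl(ns_fd, request);
	saved_errno = errno;	/* close() may overwrite errno */
	close(ns_fd);
	errno = saved_errno;
	return fd;
}
```

    For example, ns_related_fd("/proc/self/ns/net", NS_GET_USERNS) would ask
    for the user namespace that owns the caller's network namespace. Per the
    text above, EPERM is expected when that namespace is outside the caller's
    scope, and EINVAL when NS_GET_PARENT is used on a nonhierarchical
    namespace.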

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
     
  • Return -EPERM if an owning user namespace is outside of a process's
    current user namespace.

    v2: In the first version ns_get_owner returned ENOENT for init_user_ns.
    This special case was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • The current error codes returned when the per user per user
    namespace limits are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
    asked for advice on linux-api and it was made clear that those were
    the wrong error codes, but a correct error code was not suggested.

    The best general error code I have found for hitting a resource limit
    is ENOSPC. It is not perfect but as it is unambiguous it will serve
    until someone comes up with a better error code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Sep, 2016

2 commits

  • We never read or change netns id in hardirq context,
    the only place we read netns id in softirq context
    is in vxlan_xmit(). So, it should be enough to just
    disable BH.

    Cc: Nicolas Dichtel
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     
  • netns id should be already allocated each time we change
    netns, that is, in dev_change_net_namespace() (more precisely
    in rtnl_fill_ifinfo()). It is safe to just call peernet2id() here.

    Cc: Nicolas Dichtel
    Signed-off-by: Cong Wang
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    WANG Cong
     

02 Sep, 2016

1 commit


15 Aug, 2016

1 commit

  • When CONFIG_NET_NS is disabled, registering pernet operations causes
    init() to be called immediately with init_net as an argument. Unfortunately
    this leads to some pernet ops, such as proc_net_ns_init(), being called too
    early, when the init_net namespace has not been fully initialized. This causes
    issues when we want to change pernet ops to use more data from the net
    namespace in question, for example to reference the user namespace that
    owns our network namespace.

    To fix this we could either play game of musical chairs and rearrange init
    order, or we could do the same as when CONFIG_NET_NS is enabled, and
    postpone calling pernet ops->init() until namespace is set up properly.

    Note that we can not simply undo commit ed160e839d2e ("[NET]: Cleanup
    pernet operation without CONFIG_NET_NS") and use the same implementations
    for __register_pernet_operations() and __unregister_pernet_operations(),
    because many pernet ops are marked as __net_initdata and will be discarded,
    which wreaks havoc on our ops lists. Here we rely on the fact that we only
    use lists until init_net is fully initialized, which happens much earlier
    than discarding __net_initdata sections.

    Signed-off-by: Dmitry Torokhov
    Signed-off-by: David S. Miller

    Dmitry Torokhov
     

09 Aug, 2016

1 commit


18 May, 2015

1 commit

  • The spinlock is used to protect netns_ids which is per net,
    so there is no need to use a global spinlock.

    Cc: Nicolas Dichtel
    Signed-off-by: Cong Wang
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    WANG Cong
     

15 May, 2015

1 commit


14 May, 2015

1 commit

  • Four minor merge conflicts:

    1) qca_spi.c renamed the local variable used for the SPI device
    from spi_device to spi, meanwhile the spi_set_drvdata() call
    got moved further up in the probe function.

    2) Two changes were both adding new members to codel params
    structure, and thus we had overlapping changes to the
    initializer function.

    3) 'net' was making a fix to sk_release_kernel() which is
    completely removed in 'net-next'.

    4) In net_namespace.c, the rtnl_net_fill() call for GET operations
    had the command value fixed, meanwhile 'net-next' adjusted the
    argument signature a bit.

    This also matches example merge resolutions posted by Stephen
    Rothwell over the past two days.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 May, 2015

1 commit


10 May, 2015

2 commits

  • More accurately, listen to all netns that have a nsid assigned into the netns
    where the netlink socket is opened.
    For this purpose, a netlink socket option is added:
    NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
    socket will receive netlink notifications from all netns that have a nsid
    assigned into the netns where the socket has been opened. The nsid is sent
    to userland via ancillary data.

    With this patch, a daemon needs only one socket to listen to many netns. This
    is useful when the number of netns is high.

    Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
    the field nsid is valid or not. skb->cb is initialized to 0 on skb
    allocation, thus we are sure that we will never send a nsid 0 by error to
    the userland.
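
    The reason for a separate validity flag can be shown with a small
    userspace model. struct nsid_cb and its helpers are hypothetical
    stand-ins for the control-block fields named above, not the kernel
    definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the per-message control block fields. */
struct nsid_cb {
	bool nsid_is_set;
	int nsid;
};

/* Models the zero-initialisation done at allocation time. */
static void cb_init(struct nsid_cb *cb)
{
	memset(cb, 0, sizeof(*cb));
}

/* Record a nsid; 0 is a legitimate value, so the flag carries validity. */
static void cb_set_nsid(struct nsid_cb *cb, int nsid)
{
	cb->nsid = nsid;
	cb->nsid_is_set = true;
}

/* Writes the nsid to *out and returns true only if one was recorded. */
static bool cb_get_nsid(const struct nsid_cb *cb, int *out)
{
	if (!cb->nsid_is_set)
		return false;
	*out = cb->nsid;
	return true;
}
```

    A freshly zeroed block reports "no nsid" even though its nsid field
    reads 0, while an explicitly stored nsid of 0 is reported as valid.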

    Signed-off-by: Nicolas Dichtel
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Before this patch, nsid were protected by the rtnl lock. The goal of this
    patch is to be able to find a nsid without needing to hold the rtnl lock.

    The next patch will introduce a netlink socket option to listen to all
    netns that have a nsid assigned into the netns where the socket is opened.
    Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
    avoid a recursive lock (nsid are notified via rtnl). This was the main
    reason of the previous patch.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel