03 Mar, 2009

1 commit

  • It turns out that net_alive is unnecessary, and the original problem
    that led to it being added was simply that the icmp code thought
    it was a network device and wound up being unable to handle packets
    while there were still packets in the network namespace.

    Now that icmp and tcp have been fixed to properly register themselves
    this problem is no longer present and we have a stronger guarantee
    that packets will not arrive in a network namespace than that provided
    by net_alive in netif_receive_skb. So remove net_alive, allowing
    packet reception to run a little faster.

    Additionally document the strong reason why network namespace cleanup
    is safe so that if something happens again someone else will have
    a chance of figuring it out.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

26 Nov, 2008

1 commit


12 Nov, 2008

1 commit

  • This patch introduces two helpers that deal with reading and writing
    struct net pointers in various network structures.

    Their implementation depends on CONFIG_NET_NS.

    For symmetry, both functions work with "struct net **pnet".

    Their usage should reduce the number of #ifdef CONFIG_NET_NS blocks,
    without adding many helpers for each network structure
    that holds a "struct net *" pointer.
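
    A minimal user-space sketch of what such a helper pair could look like.
    The helper names, the MODEL_NET_NS switch, struct demo_sock and the
    stand-in struct net are assumptions made for illustration, not the
    patch's actual code.

```c
#include <stddef.h>

/* User-space stand-in for the kernel's struct net; illustrative only. */
struct net {
    int id;
};

/* Stand-in for the initial namespace, which always exists. */
struct net init_net = { 0 };

/* Hypothetical switch: 1 models a build with CONFIG_NET_NS, 0 without. */
#ifndef MODEL_NET_NS
#define MODEL_NET_NS 1
#endif

#if MODEL_NET_NS
/* With namespaces: the slot really stores the pointer. */
static inline void write_pnet(struct net **pnet, struct net *net)
{
    *pnet = net;
}

static inline struct net *read_pnet(struct net *const *pnet)
{
    return *pnet;
}
#else
/* Without namespaces: nothing is stored and every read yields init_net. */
static inline void write_pnet(struct net **pnet, struct net *net)
{
    (void)pnet;
    (void)net;
}

static inline struct net *read_pnet(struct net *const *pnet)
{
    (void)pnet;
    return &init_net;
}
#endif

/* Hypothetical network structure that carries a namespace pointer. */
struct demo_sock {
    struct net *sk_net;
};
```

    A caller would write write_pnet(&sk->sk_net, net) and read it back with
    read_pnet(&sk->sk_net), with no #ifdef at the call site.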

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Oct, 2008

1 commit

  • netns ops which are registered with register_pernet_gen_device() are
    shut down strictly before those which are registered with
    register_pernet_subsys(). Sometimes this leads to opposite (read: buggy)
    shutdown ordering between two modules.

    Add register_pernet_gen_subsys()/unregister_pernet_gen_subsys() for modules
    which aren't elite enough for entry in struct net, and which can't use
    register_pernet_gen_device(). The PPTP conntracking module is one such.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

08 Oct, 2008

1 commit


27 Jul, 2008

1 commit

  • New object: set of sysctls [currently - root and per-net-ns].
    Contains: pointer to parent set, list of tables and "should I see this set?"
    method (->is_seen(set)).
    Current lists of tables are subsumed by that; net-ns contains such a beast.
    ->lookup() for ctl_table_root returns pointer to ctl_table_set instead of
    that to ->list of that ctl_table_set.

    [folded compile fixes by rdd for configs without sysctl]

    Signed-off-by: Al Viro

    Al Viro
     

18 Jul, 2008

1 commit

  • The only structure declared within is the netns_mib, which will
    carry all our mibs within. I didn't put the mibs in the existing
    netns_xxx structures to make it possible to mark this one as
    properly aligned and get in a separate "read-mostly" cache-line.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

28 Jun, 2008

1 commit


21 Jun, 2008

1 commit

  • Alexey Dobriyan writes:
    > Subject: ICMP sockets destruction vs ICMP packets oops

    > After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
    > icmp_reply() wants ICMP socket.
    >
    > Steps to reproduce:
    >
    > launch shell in new netns
    > move real NIC to netns
    > setup routing
    > ping -i 0
    > exit from shell
    >
    > BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    > IP: [] icmp_sk+0x17/0x30
    > PGD 17f3cd067 PUD 17f3ce067 PMD 0
    > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
    > CPU 0
    > Modules linked in: usblp usbcore
    > Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
    > RIP: 0010:[] [] icmp_sk+0x17/0x30
    > RSP: 0018:ffffffff8057fc30 EFLAGS: 00010286
    > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
    > RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
    > RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
    > R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
    > R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
    > FS: 0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
    > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    > CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
    > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    > Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
    > Stack: 0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
    > ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
    > 000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
    > Call Trace:
    > [] icmp_reply+0x44/0x1e0
    > [] ? ip_route_input+0x23a/0x1360
    > [] icmp_echo+0x65/0x70
    > [] icmp_rcv+0x180/0x1b0
    > [] ip_local_deliver+0xf4/0x1f0
    > [] ip_rcv+0x33b/0x650
    > [] netif_receive_skb+0x27a/0x340
    > [] process_backlog+0x9d/0x100
    > [] net_rx_action+0x18d/0x250
    > [] __do_softirq+0x75/0x100
    > [] call_softirq+0x1c/0x30
    > [] do_softirq+0x65/0xa0
    > [] irq_exit+0x97/0xa0
    > [] do_IRQ+0xa8/0x130
    > [] ? mwait_idle+0x0/0x60
    > [] ret_from_intr+0x0/0xf
    > [] ? mwait_idle+0x4c/0x60
    > [] ? mwait_idle+0x43/0x60
    > [] ? cpu_idle+0x57/0xa0
    > [] ? rest_init+0x70/0x80
    > Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
    > 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 8b 04 c3 48 83 c4 08
    > 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
    > RIP [] icmp_sk+0x17/0x30
    > RSP
    > CR2: 0000000000000000
    > ---[ end trace ea161157b76b33e8 ]---
    > Kernel panic - not syncing: Aiee, killing interrupt handler!

    Receiving packets while we are cleaning up a network namespace is a
    racy proposition. It is possible when the packet arrives that we have
    removed some but not all of the state we need to fully process it. We
    have the choice of either playing whack-a-mole with the cleanup routines
    or simply dropping packets when we don't have a network namespace to
    handle them.

    Since the check looks inexpensive in netif_receive_skb let's just
    drop the incoming packets.
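
    The check can be pictured with a small user-space model. The struct
    layouts, the count field and demo_receive_skb() are assumptions for
    illustration; only the idea of testing liveness before delivery comes
    from the text above.

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel types; field names are illustrative. */
struct net {
    int count;            /* references keeping the namespace alive */
};

struct sk_buff {
    struct net *net;      /* namespace the packet arrived in */
};

enum { NET_RX_SUCCESS = 0, NET_RX_DROP = 1 };

/* A namespace counts as alive while it exists and is still referenced. */
static int net_alive(const struct net *net)
{
    return net != NULL && net->count > 0;
}

/* Hypothetical receive entry point: drop early if the namespace is going away. */
static int demo_receive_skb(struct sk_buff *skb)
{
    if (!net_alive(skb->net))
        return NET_RX_DROP;
    /* ... normal protocol delivery would continue here ... */
    return NET_RX_SUCCESS;
}
```

    Dropping at this single point avoids teaching every cleanup routine
    about packets that arrive mid-teardown.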

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

20 May, 2008

1 commit


16 Apr, 2008

1 commit

  • Make release_net/hold_net noop for performance-hungry people. This is debug
    stuff and should be used in debug mode only.

    Add check for net != NULL in hold/release calls. This will be required
    later on.
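
    A user-space sketch of the idea, assuming a plain int counter and a
    hypothetical MODEL_NETNS_REFCNT_DEBUG switch (the kernel's real counter
    type and config symbol are not shown in this message):

```c
#include <stddef.h>

/* Illustrative stand-in for struct net with a debug-only use counter. */
struct net {
    int use_count;
};

/* Hypothetical switch: 1 models a debug build, 0 a production build. */
#ifndef MODEL_NETNS_REFCNT_DEBUG
#define MODEL_NETNS_REFCNT_DEBUG 1
#endif

#if MODEL_NETNS_REFCNT_DEBUG
/* Debug build: count users, tolerating a NULL namespace pointer. */
static inline struct net *hold_net(struct net *net)
{
    if (net)
        net->use_count++;
    return net;
}

static inline void release_net(struct net *net)
{
    if (net)
        net->use_count--;
}
#else
/* Production build: both calls compile down to nothing. */
static inline struct net *hold_net(struct net *net)
{
    return net;
}

static inline void release_net(struct net *net)
{
    (void)net;
}
#endif
```

    Callers are written the same way in both builds; only the debug build
    pays for the counter updates.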

    [ Added minor simplifications suggested by Brian Haley. -DaveM ]

    Signed-off-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Denis V. Lunev
     

15 Apr, 2008

2 commits

  • Add an elastic array of void * pointers to the struct net.
    The access rules are simple:

    1. register the ops with register_pernet_gen_device to get
    the id of your private pointer
    2. call net_assign_generic() to put the private data on the
    struct net (most preferably this should be done in the
    ->init callback of the ops registered)
    3. do not store any private reference on the net_generic array;
    4. do not change this pointer while the net is alive;
    5. use the net_generic() to get the pointer.

    When adding a new pointer, I copy the old array, replace it
    with a new one and schedule the old for kfree after an RCU
    grace period.

    Since net_generic() reads the net->gen array inside an RCU
    read section and, once set, the net->gen->ptr[x] pointer never
    changes, this grants us safe access to generic pointers.

    Quoting Paul: "... RCU is protecting -only- the net_generic
    structure that net_generic() is traversing, and the [pointer]
    returned by net_generic() is protected by a reference counter
    in the upper-level struct net."
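
    The copy-and-replace growth described above can be sketched in user
    space as follows. This is a single-threaded model: the struct layout and
    the 1-based id convention are assumptions, and the old array is freed
    immediately where the kernel would wait for an RCU grace period.

```c
#include <stdlib.h>
#include <string.h>

/* Growable table of per-subsystem private pointers (ids start at 1 here). */
struct net_generic {
    unsigned int len;
    void *ptr[];
};

struct net {
    struct net_generic *gen;
};

/* Return the private pointer for id, or NULL if none was assigned. */
static void *net_generic(const struct net *net, unsigned int id)
{
    const struct net_generic *ng = net->gen;

    if (!ng || id == 0 || id > ng->len)
        return NULL;
    return ng->ptr[id - 1];
}

/*
 * Store data under id. If the table is too short, build a longer copy
 * and swap it in. The old table is freed at once in this model; the
 * kernel would defer the kfree until after an RCU grace period.
 */
static int net_assign_generic(struct net *net, unsigned int id, void *data)
{
    struct net_generic *old = net->gen;
    struct net_generic *ng;
    unsigned int old_len = old ? old->len : 0;

    if (id == 0)
        return -1;
    if (id <= old_len) {
        old->ptr[id - 1] = data;
        return 0;
    }
    ng = calloc(1, sizeof(*ng) + id * sizeof(void *));
    if (!ng)
        return -1;
    ng->len = id;
    if (old)
        memcpy(ng->ptr, old->ptr, old_len * sizeof(void *));
    ng->ptr[id - 1] = data;
    net->gen = ng;
    free(old);
    return 0;
}
```

    Readers never see a half-built table: the new array is fully populated
    before net->gen is switched over to it.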

    Signed-off-by: Pavel Emelyanov
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • To make some per-net generic pointers, we need some way to address
    them, i.e. IDs. This is a simple IDA-based ID generator for pernet
    subsystems.

    Addressing questions about potential checkpoint/restart problems:
    these IDs are "lite-offsets" within the net structure and are by no
    means supposed to be exported to the userspace.

    Since it will be used in the near future by devices only (tun,
    vlan, tunnels, bridge, etc), I make it resemble the functionality
    of register_pernet_device().

    The new id is stored in the *id pointer _before_ calling the init
    callback to make this id available in this callback.
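
    The ordering matters because the init callback may need its own id. A
    user-space sketch, assuming a plain counter in place of the IDA and a
    hypothetical demo_register_gen_device() that handles a single namespace:

```c
#include <stddef.h>

struct net {
    int unused;
};

/* Simplified pernet operations: only the init hook is modelled. */
struct pernet_operations {
    int (*init)(struct net *net);
};

static int next_id = 1;      /* a plain counter stands in for the IDA */

/*
 * Hand out an id, publish it through *id, and only then run ->init,
 * so the callback can already read its own id.
 */
static int demo_register_gen_device(int *id, struct pernet_operations *ops,
                                    struct net *net)
{
    *id = next_id++;
    return ops->init ? ops->init(net) : 0;
}

static int demo_id;          /* filled in by registration */
static int id_seen_in_init;  /* what the callback observed */

/* Example init callback that records the id visible at call time. */
static int demo_init(struct net *net)
{
    (void)net;
    id_seen_in_init = demo_id;
    return 0;
}
```

    Had the id been stored after the callback returned, demo_init() would
    have seen a stale zero.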

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

14 Apr, 2008

1 commit


04 Apr, 2008

1 commit

  • This does not look good, but there is no other choice. The compilation
    without CONFIG_NET is broken and can not be fixed with ease.

    After that there is no need for the following commits:
    1567ca7eec7664b8be3b07755ac59dc1b1ec76cb
    3edf8fa5ccf10688a9280b5cbca8ed3947c42866
    2d38f9a4f8d2ebdc799f03eecf82345825495711

    Revert them.

    Signed-off-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Denis V. Lunev
     

02 Apr, 2008

2 commits


01 Apr, 2008

1 commit

  • There's already some stuff on the struct net, that should better
    be folded into netns_core structure. I'm making the per-proto inuse
    counter be per-net also, which is also a candidate for this, so
    introduce this structure and populate it a bit.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

26 Mar, 2008

1 commit


08 Mar, 2008

1 commit

  • Current /proc/net is done with so-called "shadows", but the current
    implementation is broken and has little chance of being fixed.

    The problem is that dentries subtree of /proc/net directory has
    fancy revalidation rules to make processes living in different
    net namespaces see different entries in /proc/net subtree, but
    currently, tasks see in the /proc/net subdir the contents of any
    other namespace, depending on who opened the file first.

    The proposed fix is to turn /proc/net into a symlink, which points
    to /proc/self/net, which in turn shows what previously was in
    /proc/net - the network-related info, from the net namespace the
    appropriate task lives in.

    # ls -l /proc/net
    lrwxrwxrwx 1 root root 8 Mar 5 15:17 /proc/net -> self/net

    In other words - this behaves like /proc/mounts, but unlike
    "mounts", "net" is not a file, but a directory.

    Changes from v2:
    * Fixed discrepancy of /proc/net nlink count and selinux labeling
    screwup pointed out by Stephen.

    To get the correct nlink count the ->getattr callback for /proc/net
    is overridden to read one from the net->proc_net entry.

    To make selinux still work the net->proc_net entry is initialized
    properly, i.e. with the "net" name and the proc_net parent.

    Selinux fixes are
    Acked-by: Stephen Smalley

    Changes from v1:
    * Fixed a task_struct leak in get_proc_task_net, pointed out by Paul.

    Signed-off-by: Pavel Emelyanov
    Acked-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

01 Feb, 2008

1 commit

  • In fact all we want is a per-netns set of rules, however doing that will
    unnecessarily complicate routines such as ipt_hook()/ipt_do_table, so
    make the full xt_table array per-netns.

    Every user stubbed with init_net for a while.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

29 Jan, 2008

12 commits

  • Move static rules_ops & rules_mod_lock to the struct net, register the
    pernet subsys to init them and enjoy the fact that the core rules
    infrastructure works in the namespace.

    Real IPv4 fib rules virtualization requires fib tables support in the
    namespace and will be done seriously later in the patchset.

    Acked-by: Benjamin Thery
    Acked-by: Daniel Lezcano
    Signed-off-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • Like the ipv4 part, this patch adds an ipv6 structure in the net
    structure to aggregate the different resources to make ipv6 per
    namespace.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • The ipv4 will store its parameters inside this structure.
    This one is empty now, but it will be eventually filled.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Signed-off-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • Recently David Miller and Herbert Xu pointed out that struct net becomes
    overbloated and un-maintainable. There are two solutions:
    - provide a pointer to a network subsystem definition from struct net.
    This costs an additional dereference
    - place sub-system definition into the structure itself. This will speed up
    run-time access at the cost of recompilation time

    The second approach looks better for us. Other sub-systems will follow.

    Signed-off-by: Denis V. Lunev
    Acked-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • Just move the variable on the struct net and adjust
    its usage.

    Other sysctls from sys.net.core table are more
    difficult to virtualize (i.e. make them per-namespace),
    but I'll look at them as well a bit later.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Making them per-namespace is required for the following
    two reasons:

    First, some ctl values have a per-namespace meaning.
    Second, making them writable from the sub-namespace
    is an isolation hole.

    So I introduce the pernet operations to create these
    tables. For init_net I use the existing statically
    declared tables, for sub-namespace they are duplicated
    and the write bits are removed from the mode.
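
    A user-space sketch of the duplicate-and-strip step. The cut-down
    struct ctl_table and dup_table_readonly() are assumptions for
    illustration; only the idea of clearing the write bits (0222) on the
    copy comes from the text above.

```c
#include <stdlib.h>
#include <string.h>

/* Cut-down model of a sysctl table entry; a NULL procname ends a table. */
struct ctl_table {
    const char *procname;
    unsigned int mode;
};

/*
 * Duplicate a NULL-terminated table for a sub-namespace and clear the
 * write permission bits (0222) on every entry. The caller frees the result.
 */
static struct ctl_table *dup_table_readonly(const struct ctl_table *src)
{
    size_t n = 0;
    size_t i;
    struct ctl_table *copy;

    while (src[n].procname)
        n++;
    copy = malloc((n + 1) * sizeof(*copy));
    if (!copy)
        return NULL;
    memcpy(copy, src, (n + 1) * sizeof(*copy));
    for (i = 0; i < n; i++)
        copy[i].mode &= ~0222u;
    return copy;
}
```

    The statically declared table stays writable for init_net, while each
    sub-namespace gets its own read-only copy.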

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This is the core.

    * add the ctl_table_header on the struct net;
    * make the unix_sysctl_register and _unregister clone the table;
    * move calls to them into per-net init and exit callbacks;
    * move the .data pointer in the proper place.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric W. Biederman
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This will make all the sub-namespaces always use the
    default value (10) and leave the tuning via sysctl
    to the init namespace only.

    Per-namespace tuning is coming.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric W. Biederman
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The user interface is: register_net_sysctl_table and
    unregister_net_sysctl_table. Very much like the current
    interface except there is a network namespace parameter.

    With this any sysctl registered with register_net_sysctl_table
    will only show up to tasks in the same network namespace.

    All other sysctls continue to be globally visible.

    Signed-off-by: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Daniel Lezcano
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This is done by making packet_sklist_lock and packet_sklist per
    network namespace and adding an additional filter condition on
    received packets to ensure they came from the proper network
    namespace.

    Changes from v1:
    - prohibit calling inet_dgram_ops.ioctl in namespaces other than init_net

    Signed-off-by: Denis V. Lunev
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • After this patch none of the netlink callbacks support anything
    except the initial network namespace but the rtnetlink infrastructure
    now handles multiple network namespaces.

    Changes from v2:
    - IPv6 addrlabel processing

    Changes from v1:
    - no need for special rtnl_unlock handling
    - fixed IPv6 ndisc

    Signed-off-by: Denis V. Lunev
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Denis V. Lunev
     

13 Nov, 2007

1 commit


01 Nov, 2007

1 commit


27 Oct, 2007

1 commit

  • It is not safe to place struct pernet_operations in a special section.
    We need struct pernet_operations to last until we call unregister_pernet_subsys,
    which doesn't happen until module unload.

    So marking struct pernet_operations is a disaster for modules in two ways.
    - We discard it before we call the exit method it points to.
    - Because I keep struct pernet_operations on a linked list, discarding
    it for compiled-in code removes elements in the middle of a linked
    list and does horrible things for linked list insert.

    So this looks safe assuming __exit_refok is not discarded
    for modules.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

11 Oct, 2007

4 commits

  • With the net namespaces much code has left the __init section,
    thus making the kernel occupy more memory than it did before.
    Since we have a config option that prohibits the namespace
    creation, the functions that initialize/finalize some netns
    stuff are simply not needed and can be freed after the boot.

    Currently, this is almost not noticeable, since few calls
    are no longer in __init, but once the namespaces are
    merged it will be possible to free more code. I propose to
    use the __net_init, __net_exit and __net_initdata "attributes"
    for functions/variables that are not used if the CONFIG_NET_NS
    is not set to save more space in memory.

    The exiting functions cannot just reside in the __exit section,
    as noticed by David, since the init section will have
    references on it and the compilation will fail due to modpost
    checks. These references can exist, since the init namespace
    never dies and the exit callbacks are never called. So I
    introduce the __exit_refok attribute just like it is already
    done with the __init_refok.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Denis V. Lunev noticed that the locking rules
    for the network namespace list are overcomplicated and broken.

    In particular register_netdev_notifier currently
    does not take any lock, making the for_each_net iteration racy
    with network namespace creation and destruction. Oops.

    The fact that we need to use for_each_net in rtnl_unlock() when
    the rtnetlink support becomes per network namespace makes designing
    the proper locking tricky. In addition we need to be able to call
    rtnl_lock() and rtnl_unlock() when we have the net_mutex held.

    After thinking about it and looking at the alternatives carefully
    it looks like the simplest and most maintainable solution is
    to remove net_list_mutex altogether, and to use the rtnl_mutex instead.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch makes loopback_dev per network namespace, adding
    code to create a different loopback device for each network
    namespace and adding the code to free a loopback device
    when a network namespace exits.

    This patch modifies all users of loopback_dev so they
    access it as init_net.loopback_dev, keeping all of the
    code compiling and working. A later pass will be needed to
    update the users to use something other than the initial network
    namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch allows you to create a new network namespace
    using sys_clone, or sys_unshare.

    As the network namespace is still experimental and under development
    clone and unshare support is only made available when CONFIG_NET_NS is
    selected at compile time.

    As this patch introduces network namespace support into code paths
    that exist when the CONFIG_NET is not selected there are a few
    additions made to net_namespace.h to allow a few more functions
    to be used when the networking stack is not compiled in.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman