11 Oct, 2007

40 commits

  • The problem: proc_net files remember which network namespace they are
    against but do not hold a reference count (as that would pin
    the network namespace). So we currently have a small window where
    the reference count on a network namespace may be incremented when opening
    a /proc file when it has already gone to zero.

    To fix this introduce maybe_get_net and get_proc_net.

    maybe_get_net increments the network namespace reference count only if it is
    greater than zero, ensuring we don't increment a reference count after it
    has gone to zero.

    get_proc_net handles all of the magic to go from a proc inode to the network
    namespace instance and call maybe_get_net on it.

    PROC_NET, the old accessor, is removed so that we don't get confused and use
    the wrong helper function.

    Then I fix up the callers to use get_proc_net and handle the case
    where get_proc_net returns NULL. In that case I return -ENXIO because
    effectively the network namespace has already gone away so the files
    we are trying to access don't exist anymore.
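
    The "increment only if greater than zero" rule can be sketched in
    userspace C11. This is an illustration, not the kernel code: struct
    net_sketch and maybe_get_net_sketch are hypothetical names, and C11
    atomics stand in for the kernel's own atomic primitives.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a namespace object; only the count matters. */
struct net_sketch {
    atomic_int count;
};

/*
 * Take a reference only while the count is still above zero.
 * Returns the object on success and NULL once the count has hit zero.
 */
static struct net_sketch *maybe_get_net_sketch(struct net_sketch *net)
{
    int old = atomic_load(&net->count);

    while (old > 0) {
        /* On failure the CAS reloads 'old', so the loop re-checks it. */
        if (atomic_compare_exchange_weak(&net->count, &old, old + 1))
            return net;
    }
    return NULL;
}
```

    A caller that gets NULL back would then report an error such as
    -ENXIO, as described above.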

    Signed-off-by: Eric W. Biederman
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • When CONFIG_NET=no, init_net is unresolved because net_namespace.c
    is not compiled and the include pulls in the init_net definition.

    This problem is very similar to the one with the ipc namespace, where
    the kernel can be compiled without SYSV ipc.

    This patch fixes that by defining a macro which simply removes the init_net
    initialization from the nsproxy namespace aggregator.

    Compiled and booted on qemu-i386 with CONFIG_NET=no and CONFIG_NET=yes.

    Signed-off-by: Daniel Lezcano
    Acked-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • This is done in order to add support for changing the rate table to
    use the upper-boundary L2T (length to time) value. Currently we use the
    lower-boundary value, which results in under-estimating the actual
    bandwidth usage.

    Extend the tc_ratespec struct with two parameters: 1) "cell_align",
    which allows adjusting the alignment of the rate table, and 2) "overhead",
    which allows adding a packet overhead before the lookup.
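
    One way the two new fields could enter the slot computation is
    sketched below. The helper name and the clamp at zero are assumptions
    made for illustration. The sketch also assumes a table in which slot i
    holds the time for the upper boundary (i + 1) << cell_log, so that a
    cell_align of -1 maps a length exactly on a boundary to the slot that
    ends there.

```c
/*
 * Hypothetical helper: compute the rate-table slot for a packet length.
 * cell_align and overhead are added before the shift by cell_log.
 */
static unsigned int l2t_slot_sketch(unsigned int pktlen, int cell_log,
                                    int cell_align, int overhead)
{
    long adjusted = (long)pktlen + cell_align + overhead;

    if (adjusted < 0)   /* assumption: never index below slot 0 */
        adjusted = 0;
    return (unsigned int)(adjusted >> cell_log);
}
```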

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • Change the L2T (length to time) macros in all rate-based schedulers to
    call a common function, qdisc_l2t(), that does the rate table lookup.
    This function handles the case where the packet size being looked up is
    larger than the rate table covers, which often occurs with TSO enabled.
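
    A bounded lookup of this kind might look like the sketch below. The
    struct, the 256-slot size and the choice to saturate at the last slot
    are assumptions for illustration; the actual policy qdisc_l2t() applies
    to oversized packets is not stated here.

```c
#include <stdint.h>

#define RTAB_SLOTS_SKETCH 256

/* Hypothetical rate table: slot i holds a transmit time for its cell. */
struct rate_table_sketch {
    int cell_log;
    uint32_t data[RTAB_SLOTS_SKETCH];
};

/* Look up the time for a packet length without reading past the table. */
static uint32_t l2t_lookup_sketch(const struct rate_table_sketch *rtab,
                                  unsigned int pktlen)
{
    unsigned int slot = pktlen >> rtab->cell_log;

    if (slot >= RTAB_SLOTS_SKETCH)
        slot = RTAB_SLOTS_SKETCH - 1;   /* assumption: saturate */
    return rtab->data[slot];
}
```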

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • This patch makes the following needlessly global variables static:
    - sctp_memory_pressure
    - sctp_memory_allocated
    - sctp_sockets_allocated

    Signed-off-by: Adrian Bunk
    Signed-off-by: David S. Miller

    Adrian Bunk
     
  • sctp_addto_param() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: David S. Miller

    Adrian Bunk
     
  • raise_softirq_irqoff no longer has any modular user.

    Signed-off-by: Adrian Bunk
    Signed-off-by: David S. Miller

    Adrian Bunk
     
  • The macro definition is bad. When calling next_net_device with
    parameter name "dev", the resulting code is:
    struct net_device *dev = dev and that leads to unexpected
    behavior. In particular, when llc_core is compiled in, the kernel panics
    at boot time.
    The patchset replaces the macro definitions with static inline functions,
    as they were defined before.
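
    The name-capture problem and the inline-function fix can be shown with
    a reduced example. The types and names below are hypothetical, and the
    buggy macro appears only in a comment because expanding it would read
    an uninitialized variable.

```c
#include <stddef.h>

/* Hypothetical device node in a simple singly linked list. */
struct dev_sketch {
    int ifindex;
    struct dev_sketch *next;
};

/*
 * Buggy shape (GNU statement expression), kept as a comment:
 *
 *   #define next_dev_bad(d) ({ struct dev_sketch *dev = (d); dev->next; })
 *
 * Called as next_dev_bad(dev), it expands to
 *   struct dev_sketch *dev = (dev);
 * so the macro's local is initialized from itself, not from the caller's
 * variable.
 */

/* An inline function has its own scope, so the argument name is harmless. */
static inline struct dev_sketch *next_dev_sketch(struct dev_sketch *dev)
{
    return dev->next;
}
```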

    Signed-off-by: Benjamin Thery
    Signed-off-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • The core patchset of the network namespace sent by
    Eric Biederman does not do dynamic loopback creation.
    So there is no call to alloc_netdev_mq which fills the
    network namespace field of the netdevice.

    This patch assigns the loopback to the init network namespace.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • Add the appropriate EXPORT_SYMBOLs for proc_net_create,
    proc_net_fops_create and proc_net_remove to fix errors when
    compiling allmodconfig.

    Signed-off-by: Mark Nelson
    Acked-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • This change allows the generic attribute interface to be used within
    the netfilter subsystem where this flag was initially introduced.

    The byte-order flag is as yet unused; its intended use is to
    allow automatic byte order conversions for all atomic types.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Requested by Johannes Berg.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When the periodic IP route cache flush is done (every 600 seconds in
    the default configuration), some hosts suffer a lot and eventually
    trigger the "soft lockup" message.

    dst_run_gc() is doing a scan of a possibly huge list of dst_entries,
    eventually freeing some (less than 1%) of them, while holding the
    dst_lock spinlock for the whole scan.

    Then it rearms a timer to redo the full thing 1/10 s later...
    The slowdown can last one minute or so, depending on how active
    the tcp sessions are.

    This second version of the patch converts the processing from a softirq
    based one to a workqueue.

    Even if the list of entries in garbage_list is huge, host is still
    responsive to softirqs and can make progress.

    Instead of resetting gc timer to 0.1 second if one entry was freed in a
    gc run, we do this if more than 10% of entries were freed.
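
    The rearm decision can be sketched as follows. Only the "more than 10%
    freed" test comes from the description above; the growing back-off in
    the other branch, the names and the constants (in timer ticks) are
    assumptions for illustration.

```c
/* Hypothetical constants, in timer ticks. */
#define GC_MIN_SKETCH 25UL
#define GC_INC_SKETCH 125UL
#define GC_MAX_SKETCH 30000UL

struct gc_timer_sketch {
    unsigned long expires;  /* delay until the next gc run */
    unsigned long inc;      /* how much to lengthen the delay next time */
};

/* Choose the next delay from how productive the last run was. */
static void gc_rearm_sketch(struct gc_timer_sketch *t,
                            unsigned long freed, unsigned long delayed)
{
    if (freed > delayed / 10) {
        /* More than 10% of the entries were freed: run again soon. */
        t->expires = GC_MIN_SKETCH;
        t->inc = GC_INC_SKETCH;
    } else {
        /* Assumption: otherwise back off, up to a maximum delay. */
        t->expires += t->inc;
        if (t->expires > GC_MAX_SKETCH)
            t->expires = GC_MAX_SKETCH;
        t->inc += GC_INC_SKETCH;
    }
}
```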

    Before patch :

    Aug 16 06:21:37 SRV1 kernel: BUG: soft lockup detected on CPU#0!
    Aug 16 06:21:37 SRV1 kernel:
    Aug 16 06:21:37 SRV1 kernel: Call Trace:
    Aug 16 06:21:37 SRV1 kernel: [] wake_up_process+0x10/0x20
    Aug 16 06:21:37 SRV1 kernel: [] softlockup_tick+0xe9/0x110
    Aug 16 06:21:37 SRV1 kernel: [] dst_run_gc+0x0/0x140
    Aug 16 06:21:37 SRV1 kernel: [] run_local_timers+0x13/0x20
    Aug 16 06:21:37 SRV1 kernel: [] update_process_times+0x57/0x90
    Aug 16 06:21:37 SRV1 kernel: [] smp_local_timer_interrupt+0x34/0x60
    Aug 16 06:21:37 SRV1 kernel: [] smp_apic_timer_interrupt+0x5c/0x80
    Aug 16 06:21:37 SRV1 kernel: [] apic_timer_interrupt+0x66/0x70
    Aug 16 06:21:37 SRV1 kernel: [] dst_run_gc+0x53/0x140
    Aug 16 06:21:37 SRV1 kernel: [] dst_run_gc+0x46/0x140
    Aug 16 06:21:37 SRV1 kernel: [] run_timer_softirq+0x148/0x1c0
    Aug 16 06:21:37 SRV1 kernel: [] __do_softirq+0x6c/0xe0
    Aug 16 06:21:37 SRV1 kernel: [] call_softirq+0x1c/0x30
    Aug 16 06:21:37 SRV1 kernel: [] do_softirq+0x34/0x90
    Aug 16 06:21:37 SRV1 kernel: [] local_bh_enable_ip+0x3f/0x60
    Aug 16 06:21:37 SRV1 kernel: [] _spin_unlock_bh+0x13/0x20
    Aug 16 06:21:37 SRV1 kernel: [] rt_garbage_collect+0x1d8/0x320
    Aug 16 06:21:37 SRV1 kernel: [] dst_alloc+0x1d/0xa0
    Aug 16 06:21:37 SRV1 kernel: [] __ip_route_output_key+0x573/0x800
    Aug 16 06:21:37 SRV1 kernel: [] sock_common_recvmsg+0x32/0x50
    Aug 16 06:21:37 SRV1 kernel: [] ip_route_output_flow+0x1c/0x60
    Aug 16 06:21:37 SRV1 kernel: [] tcp_v4_connect+0x150/0x610
    Aug 16 06:21:37 SRV1 kernel: [] inet_bind_bucket_create+0x17/0x60
    Aug 16 06:21:37 SRV1 kernel: [] inet_stream_connect+0xa6/0x2c0
    Aug 16 06:21:37 SRV1 kernel: [] _spin_lock_bh+0x11/0x30
    Aug 16 06:21:37 SRV1 kernel: [] lock_sock_nested+0xcf/0xe0
    Aug 16 06:21:37 SRV1 kernel: [] _spin_lock_bh+0x11/0x30
    Aug 16 06:21:37 SRV1 kernel: [] sys_connect+0x71/0xa0
    Aug 16 06:21:37 SRV1 kernel: [] tcp_setsockopt+0x1f/0x30
    Aug 16 06:21:37 SRV1 kernel: [] sock_common_setsockopt+0xf/0x20
    Aug 16 06:21:37 SRV1 kernel: [] sys_setsockopt+0x9d/0xc0
    Aug 16 06:21:37 SRV1 kernel: [] sys_ioctl+0x5e/0x80
    Aug 16 06:21:37 SRV1 kernel: [] system_call+0x7e/0x83

    After patch : (RT_CACHE_DEBUG set to 2 to get following traces)

    dst_total: 75469 delayed: 74109 work_perf: 141 expires: 150 elapsed: 8092 us
    dst_total: 78725 delayed: 73366 work_perf: 743 expires: 400 elapsed: 8542 us
    dst_total: 86126 delayed: 71844 work_perf: 1522 expires: 775 elapsed: 8849 us
    dst_total: 100173 delayed: 68791 work_perf: 3053 expires: 1256 elapsed: 9748 us
    dst_total: 121798 delayed: 64711 work_perf: 4080 expires: 1997 elapsed: 10146 us
    dst_total: 154522 delayed: 58316 work_perf: 6395 expires: 25 elapsed: 11402 us
    dst_total: 154957 delayed: 58252 work_perf: 64 expires: 150 elapsed: 6148 us
    dst_total: 157377 delayed: 57843 work_perf: 409 expires: 400 elapsed: 6350 us
    dst_total: 163745 delayed: 56679 work_perf: 1164 expires: 775 elapsed: 7051 us
    dst_total: 176577 delayed: 53965 work_perf: 2714 expires: 1389 elapsed: 8120 us
    dst_total: 198993 delayed: 49627 work_perf: 4338 expires: 1997 elapsed: 8909 us
    dst_total: 226638 delayed: 46865 work_perf: 2762 expires: 2748 elapsed: 7351 us

    I successfully reduced the IP route cache of many hosts by a factor of
    four thanks to this patch. Previously, I had to disable "ip route flush
    cache" to avoid crashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • My bad.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We will undo this once it is actually used.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Until we support multiple network namespaces with netfilter only allow
    netfilter configuration in the initial network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The simplest thing to implement is moving network devices between
    namespaces. With the same attribute, IFLA_NET_NS_PID, we could
    easily implement creating devices in the destination network
    namespace as well, but that is a little bit trickier, so this
    patch sticks to what is simple and easy.

    A pid is used to identify a process that happens to be a member
    of the network namespace we want to move the network device to.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch introduces NETIF_F_NETNS_LOCAL, a flag to indicate that
    a network device is local to a single network namespace and
    should never be moved. This is useful for pseudo devices that
    need an instance in each network namespace (like the loopback
    device) and for any device we find that cannot handle multiple
    network namespaces, so that we may trap it in the initial network
    namespace.

    This patch introduces dev_change_net_namespace,
    a function used to move a network device from one network
    namespace to another. To the network device nothing
    special appears to happen; to the components of the network
    stack it appears as if the network device was unregistered
    in the network namespace it is in, and a new device
    was registered in the network namespace the device
    was moved to.

    This patch sets up a namespace device destructor that
    upon the exit of a network namespace moves all of the
    movable network devices to the initial network namespace
    so they are not lost.
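
    The effect of the flag on a move can be sketched like this. The flag
    value, the types and the -EINVAL return are assumptions for
    illustration, and the real function does far more (unregistering and
    re-registering the device as seen by the rest of the stack).

```c
#include <errno.h>

#define NETNS_LOCAL_SKETCH 0x1u  /* hypothetical flag bit */

struct net_sketch {
    int id;
};

struct netdev_sketch {
    unsigned int features;
    struct net_sketch *nd_net;   /* back pointer, not reference counted */
};

/* Move a device to another namespace unless it is marked namespace-local. */
static int change_netns_sketch(struct netdev_sketch *dev,
                               struct net_sketch *net)
{
    if (dev->features & NETNS_LOCAL_SKETCH)
        return -EINVAL;          /* assumption: refuse with -EINVAL */
    dev->nd_net = net;
    return 0;
}
```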

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • When forcibly changing the network namespace of a device
    I need something that can generate a name for the device
    in the new namespace without overwriting the old name.

    __dev_alloc_name provides me that functionality.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch makes most of the generic device layer network
    namespace safe. This patch makes dev_base_head a
    network namespace variable, and then it picks up
    a few associated variables. The functions:
    dev_getbyhwaddr
    dev_getfirsthwbytype
    dev_get_by_flags
    dev_get_by_name
    __dev_get_by_name
    dev_get_by_index
    __dev_get_by_index
    dev_ioctl
    dev_ethtool
    dev_load
    wireless_process_ioctl

    were modified to take a network namespace argument, and
    deal with it.

    vlan_ioctl_set and brioctl_set were modified so their
    hooks will receive a network namespace argument.

    So basically anything in the core of the network stack that was
    affected by the change of dev_base was modified to handle
    multiple network namespaces. The rest of the network stack was
    simply modified to explicitly use &init_net, the initial network
    namespace. This can be fixed when those components of the network
    stack are modified to handle multiple network namespaces.

    For now the ifindex generator is left global.

    Fundamentally ifindex numbers are per namespace, or else
    we will have corner case problems with migration when
    we get that far.

    At the same time there are assumptions in the network stack
    that the ifindex of a network device won't change. Making
    the ifindex number global seems a good compromise until
    the network stack can cope with ifindex changes when
    you change namespaces, and the like.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Each netlink socket will live in exactly one network namespace;
    this includes the controlling kernel sockets.

    This patch updates all of the existing netlink protocols
    to only support the initial network namespace. Requests
    by clients in other namespaces will get -ECONNREFUSED,
    as they would if the kernel did not have support for
    that netlink protocol compiled in.

    As each netlink protocol is updated to be multiple network
    namespace safe it can register multiple kernel sockets
    to acquire a presence in the rest of the network namespaces.

    The implementation in af_netlink is a simple filter implementation
    at hash table insertion and hash table look up time.
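
    Filtering at lookup time amounts to matching on the namespace as well
    as the port id. The sketch below uses a linear array instead of a hash
    table to stay short; all names are hypothetical.

```c
#include <stddef.h>

struct net_sketch {
    int id;
};

struct nl_sock_sketch {
    const struct net_sketch *net;  /* namespace the socket lives in */
    unsigned int pid;              /* netlink port id */
};

/* Find a socket by port id, but only within the given namespace. */
static const struct nl_sock_sketch *
nl_lookup_sketch(const struct nl_sock_sketch *socks, size_t n,
                 const struct net_sketch *net, unsigned int pid)
{
    for (size_t i = 0; i < n; i++) {
        if (socks[i].net == net && socks[i].pid == pid)
            return &socks[i];
    }
    return NULL;
}
```

    With this shape, two sockets may share a port id as long as they live
    in different namespaces.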

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Every user of the network device notifiers is either a protocol
    stack or a pseudo device. If a protocol stack that does not have
    support for multiple network namespaces receives an event for a
    device that is not in the initial network namespace it quite possibly
    can get confused and do the wrong thing.

    To avoid problems until all of the protocol stacks are converted
    this patch modifies all netdev event handlers to ignore events on
    devices that are not in the initial network namespace.

    As the rest of the code is made network namespace aware these
    checks can be removed.
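
    The check added to each handler has roughly the shape below. The
    types, the return codes and the events_seen counter are hypothetical
    stand-ins so the sketch compiles on its own.

```c
enum { NOTIFY_DONE_SKETCH = 0, NOTIFY_OK_SKETCH = 1 };

struct net_sketch {
    int id;
};

/* Stand-in for the initial network namespace. */
static struct net_sketch init_net_sketch;

struct netdev_sketch {
    struct net_sketch *nd_net;
    int events_seen;             /* counts events the handler acted on */
};

/* Ignore events for devices outside the initial namespace. */
static int netdev_event_sketch(struct netdev_sketch *dev, unsigned long event)
{
    (void)event;                 /* the event type is irrelevant here */
    if (dev->nd_net != &init_net_sketch)
        return NOTIFY_DONE_SKETCH;

    dev->events_seen++;
    return NOTIFY_OK_SKETCH;
}
```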

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch modifies every packet receive function
    registered with dev_add_pack() to drop packets if they
    are not from the initial network namespace.

    This should ensure that the various network stacks do
    not receive packets in anything but the initial network
    namespace until the code has been converted and is ready
    for them.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Except for carefully selected pseudo devices all network
    interfaces should start out in the initial network namespace.
    Ultimately it will be register_netdev that examines what
    dev->nd_net is set to and places a device in a network namespace.

    This patch modifies alloc_netdev to initialize the network
    namespace a device is in with the initial network namespace.
    This gets it right for the vast majority of devices so their
    drivers need not be modified and for those few pseudo devices
    that need something different they can change this parameter
    before calling register_netdevice.

    The network namespace parameter on a network device is not
    reference counted as the devices are inside of a network namespace
    and cannot remain in that namespace past the lifetime of the
    network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch passes in the namespace a new socket should be created in
    and has the socket code do the appropriate reference counting. By
    virtue of this all socket create methods are touched. In addition
    the socket create methods are modified so that they will fail if
    you attempt to create a socket in a non-default network namespace.

    Failing if we attempt to create a socket outside of the default
    network namespace ensures that as we incrementally make the network stack
    network namespace aware we will not export functionality that someone
    has not audited and made certain is network namespace safe.
    This allows us to partially enable network namespaces before all of the
    exotic protocols are supported.

    Any protocol layers I have missed will fail to compile because I now
    pass an extra parameter into the socket creation code.

    [ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Sockets need to get a reference to their network namespace,
    or possibly a simple hold if someone registers on the network
    namespace notifier and will free the sockets when the namespace
    is going to be destroyed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Please note that network devices do not increase the reference
    count on the network namespace. They are inside the network
    namespace and so the network namespace tag is in the nature
    of a back pointer and so getting and putting the network namespace
    is unnecessary.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This is the network namespace from which all sockets
    and anything else under user control ultimately get their network
    namespace parameters.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This is the basic infrastructure needed to support network
    namespaces. This infrastructure is:
    - Registration functions to support initializing per network
    namespace data when a network namespace is created or destroyed.

    - struct net. The network namespace data structure.
    This structure will grow as variables are made per network
    namespace but this is the minimal starting point.

    - Functions to grab a reference to the network namespace.
    I provide both get/put functions that keep a network namespace
    from being freed, and hold/release functions that serve as weak
    references and will warn if their count is not zero when the data
    structure is freed. These are useful for dealing with more complicated
    data structures like the ipv4 route cache.

    - A list of all of the network namespaces so we can iterate over them.

    - A slab for the network namespace data structure allowing leaks
    to be spotted.
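
    The difference between the two kinds of reference can be sketched with
    two counters. This is a single-threaded illustration with hypothetical
    names and plain ints; a real implementation would use atomic counters
    and actually free the structure.

```c
#include <stdbool.h>

struct net_sketch {
    int count;       /* get/put: keeps the namespace alive */
    int use_count;   /* hold/release: weak, only checked at free time */
    bool freed;
    bool warned;     /* set if weak references were still outstanding */
};

static void get_net_sketch(struct net_sketch *n)     { n->count++; }
static void hold_net_sketch(struct net_sketch *n)    { n->use_count++; }
static void release_net_sketch(struct net_sketch *n) { n->use_count--; }

/* Drop a strong reference; the last put frees and checks the weak count. */
static void put_net_sketch(struct net_sketch *n)
{
    if (--n->count == 0) {
        n->warned = (n->use_count != 0);
        n->freed = true;
    }
}
```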

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The current implementation of dev_ifname makes maintenance difficult
    because updates to the implementation of the ioctl have to be made in
    two places. So this patch updates dev_ifname32 to do a classic 32/64
    structure conversion and call sys_ioctl like the rest of the
    compat calls do.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This slightly improves code safety and clarity.

    Later network namespace patches touch this code so this is a
    preliminary cleanup.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch modifies the current ipsec audit layer
    by breaking it up into purpose-driven audit calls.

    So far, the only audit calls made are when adding or deleting
    an SA/policy. It had been discussed to give each
    key manager its own calls to do this, but I found
    there to be much redundancy since they did the exact
    same things, except for how they got auid and sid, so I
    combined them. The below audit calls can be made by any
    key manager. Hopefully, this is ok.

    Signed-off-by: Joy Latten
    Signed-off-by: David S. Miller

    Joy Latten
     
  • The type of owner in sock_lock_t is currently (struct sock_iocb *),
    presumably for historical reasons. It is never used as this type, only
    tested as NULL or set to (void *)1. For clarity, this changes it to type
    int and renames it to owned, to avoid any possible type casting errors.

    Signed-off-by: John Heffner
    Signed-off-by: David S. Miller

    John Heffner
     
  • Changes asserts in sunrpc to use sock_owned_by_user() macro instead of
    referencing sock_lock.owner directly.

    Signed-off-by: John Heffner
    Signed-off-by: David S. Miller

    John Heffner
     
  • Removed sparse warnings from tg3 driver. The new logic seems fine (I
    don't immediately see where we are running over values for any of the
    variables that need to be saved).

    This patch compiles fine and I'm currently using a tg3 with the patched
    driver to post this patch as a basic proof of concept.

    Signed-off-by: Andy Gospodarek
    Signed-off-by: David S. Miller

    Andy Gospodarek
     
  • Andi mentioned he did something like this already, but never submitted
    it.

    The dhcp client application uses AF_PACKET with a packet filter to
    receive data. The application doesn't even use timestamps, but because
    the AF_PACKET API has timestamps, they get turned on globally which
    causes an expensive time of day lookup for every packet received on
    any system that uses the standard DHCP client.

    The fix is to not enable the timestamp (but use it if available).
    This causes the time lookup to only occur on those packets that are
    destined for the AF_PACKET socket. The timestamping occurs after
    packet filtering so all packets dropped by filtering do not cause a
    clock call.
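
    The "use it if available, otherwise stamp after filtering" idea can be
    sketched as below. All names are hypothetical, a zero timestamp is
    assumed to mean "not stamped yet", and the demo clock only counts how
    often it is read so the saving is visible.

```c
#include <stdbool.h>
#include <stdint.h>

struct pkt_sketch {
    uint64_t tstamp_ns;          /* 0 means no timestamp was taken yet */
};

typedef uint64_t (*clock_fn_sketch)(void);

/* Demo clock: counts its reads and returns a made-up time. */
static unsigned int clock_reads_sketch;
static uint64_t demo_clock_sketch(void)
{
    clock_reads_sketch++;
    return 1000u * clock_reads_sketch;
}

/*
 * Deliver a packet to the socket. Filtered-out packets return false
 * before the clock is touched; accepted packets reuse an existing
 * timestamp and read the clock only when none is present.
 */
static bool deliver_sketch(struct pkt_sketch *pkt, bool passes_filter,
                           clock_fn_sketch now)
{
    if (!passes_filter)
        return false;
    if (pkt->tstamp_ns == 0)
        pkt->tstamp_ns = now();
    return true;
}
```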

    The one downside of this is a few microseconds of additional delay
    from the normal timestamping location (netif_rx) until the receive
    callback in AF_PACKET. But since the offset is fairly consistent it
    should not upset applications that really do want to use timestamps,
    like wireshark.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • This trivial patch removes the unneeded pointer newdp, which is never used.

    Signed-off-by: Micah Gruber
    Signed-off-by: David S. Miller

    Micah Gruber
     
  • This trivial patch removes the unneeded pointer iph, which is never used.

    Signed-off-by: Micah Gruber
    Signed-off-by: David S. Miller

    Micah Gruber
     
  • The sta_info.assoc_ap value is used as a flag, move it
    into flags.

    Signed-off-by: Johannes Berg
    Acked-by: Michael Wu
    Signed-off-by: John W. Linville
    Signed-off-by: David S. Miller

    Johannes Berg