17 Nov, 2008

1 commit

  • Unlike ifconfig, iproute doesn't report an error when setting
    an interface up fails:

    (example: put wireless network mac80211 interface into repeater mode
    with iwconfig but do not set a peer MAC address, it should fail with
    -ENOLINK)

    without patch:
    # ip link set wlan0 up ; echo $?
    0
    #

    with patch:
    # ip link set wlan0 up ; echo $?
    RTNETLINK answers: Link has been severed
    2
    #

    Propagate the return value from dev_change_flags() to fix this.

    Signed-off-by: Patrick McHardy
    Tested-by: Johannes Berg
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
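
    The fix above can be illustrated with a small userspace model. This is
    only a sketch: model_change_flags() is a hypothetical stand-in for
    dev_change_flags(), and the two setlink helpers are invented names that
    contrast dropping the result with handing it back to the caller.

```c
#include <errno.h>

/* Hypothetical stand-in for dev_change_flags(): fails with -ENOLINK
 * when the interface cannot be brought up (e.g. no peer address set). */
static int model_change_flags(int has_peer, unsigned int flags)
{
    (void)flags;
    return has_peer ? 0 : -ENOLINK;
}

/* Behaviour before the patch: the result is dropped, the caller sees 0. */
static int setlink_ignoring_error(int has_peer, unsigned int flags)
{
    model_change_flags(has_peer, flags);
    return 0;
}

/* Behaviour after the patch: the result is handed back to the caller. */
static int setlink_propagating_error(int has_peer, unsigned int flags)
{
    int err = model_change_flags(has_peer, flags);

    if (err < 0)
        return err;
    return 0;
}
```

    With the second variant a failing "up" request reaches userspace as an
    error code, which is what lets ip print the RTNETLINK message and exit
    non-zero.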
     

17 Oct, 2008

1 commit


09 Oct, 2008

1 commit


08 Oct, 2008

1 commit

  • Benjamin Thery tracked down a bug that explains many instances
    of the error

    unregister_netdevice: waiting for %s to become free. Usage count = %d

    It turns out that netdev_run_todo can dead-lock with itself if
    a second instance of it is run in a thread that will then free
    a reference to the device waited on by the first instance.

    The problem is really quite silly. We were trying to create
    parallelism where none was required. As netdev_run_todo always
    follows a RTNL section, and that todo tasks can only be added
    with the RTNL held, by definition you should only need to wait
    for the very ones that you've added and be done with it.

    There is no need for a second mutex or spinlock.

    This is exactly what the following patch does.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
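
    The idea of waiting only for the entries you queued yourself can be
    sketched as a single-threaded userspace model. All names here are
    hypothetical, and the RTNL is represented only by comments: the pending
    list is detached into a local snapshot, and only that snapshot is
    processed.

```c
#include <stddef.h>

struct todo_item {
    struct todo_item *next;
    int done;
};

/* Pending list; in the kernel it is only modified with the RTNL held. */
static struct todo_item *todo_head;

static void todo_add(struct todo_item *item)
{
    item->done = 0;
    item->next = todo_head;
    todo_head = item;
}

/* Detach everything queued so far, then process only that snapshot. */
static int run_todo(void)
{
    struct todo_item *local = todo_head;  /* snapshot taken while "locked" */
    int processed = 0;

    todo_head = NULL;  /* later additions belong to whoever queues them */

    while (local) {
        struct todo_item *next = local->next;

        local->done = 1;
        local->next = NULL;
        processed++;
        local = next;
    }
    return processed;
}
```

    Because each caller works on a private list, two callers never wait on
    each other's entries, so no second lock is needed in this model.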
     

23 Sep, 2008

1 commit

  • This patch adds support for keeping an additional character alias
    associated with a network interface. This is useful for maintaining
    the SNMP ifAlias value which is a user defined value. Routers use this
    to hold information like which circuit or line it is connected to. It
    is just an arbitrary text label on the network device.

    There are two exposed interfaces with this patch, the value can be
    read/written either via netlink or sysfs.

    This could be maintained just by the snmp daemon, but it is more
    generally useful for other management tools, and the kernel is a good
    place to act as an agreed upon interface to store it.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
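
    A minimal sketch of reading the alias through sysfs from userspace. The
    commit message does not give the path; this assumes the attribute is
    exposed as "ifalias" under /sys/class/net/<ifname>/, as it is on current
    kernels. The helper names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Build "/sys/class/net/<ifname>/ifalias" into buf.
 * Returns 0 on success, -1 if the result would not fit. */
static int ifalias_path(const char *ifname, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/sys/class/net/%s/ifalias", ifname);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the alias into alias[]; returns its length (0 if none is set)
 * or -1 if the interface or attribute cannot be opened. */
static int ifalias_read(const char *ifname, char *alias, size_t len)
{
    char path[128];
    FILE *f;

    if (len == 0 || ifalias_path(ifname, path, sizeof(path)) < 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(alias, (int)len, f))
        alias[0] = '\0';               /* empty file: no alias configured */
    fclose(f);
    alias[strcspn(alias, "\n")] = '\0'; /* strip the trailing newline */
    return (int)strlen(alias);
}
```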
     

18 Jul, 2008

1 commit

  • alloc_netdev_mq() now allocates an array of netdev_queue
    structures for TX, based upon the queue_count argument.

    Furthermore, all accesses to the TX queues are now vectored
    through the netdev_get_tx_queue() and netdev_for_each_tx_queue()
    interfaces. This makes it easy to grep the tree for all
    things that want to get to a TX queue of a net device.

    Problem spots which are not really multiqueue aware yet, and
    only work with one queue, can easily be spotted by grepping
    for all netdev_get_tx_queue() calls that pass in a zero index.

    Signed-off-by: David S. Miller

    David S. Miller
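
    The accessor pattern can be sketched in userspace as follows. The
    model_* names and structures are hypothetical stand-ins for
    alloc_netdev_mq(), netdev_get_tx_queue() and netdev_for_each_tx_queue();
    the bounds check in the accessor is an addition of this sketch, not
    something the commit message describes.

```c
#include <stdlib.h>

struct model_queue {
    unsigned long tx_packets;
};

struct model_dev {
    unsigned int num_tx_queues;
    struct model_queue *tx;   /* array of num_tx_queues entries */
};

/* Allocate a device together with queue_count TX queues. */
static struct model_dev *model_alloc_dev_mq(unsigned int queue_count)
{
    struct model_dev *dev;

    if (queue_count == 0)
        return NULL;
    dev = calloc(1, sizeof(*dev));
    if (!dev)
        return NULL;
    dev->tx = calloc(queue_count, sizeof(*dev->tx));
    if (!dev->tx) {
        free(dev);
        return NULL;
    }
    dev->num_tx_queues = queue_count;
    return dev;
}

/* Single access point for one TX queue (NULL if index is out of range). */
static struct model_queue *model_get_tx_queue(struct model_dev *dev,
                                              unsigned int index)
{
    return index < dev->num_tx_queues ? &dev->tx[index] : NULL;
}

/* Single access point for visiting every TX queue. */
static void model_for_each_tx_queue(struct model_dev *dev,
                                    void (*fn)(struct model_queue *, void *),
                                    void *arg)
{
    unsigned int i;

    for (i = 0; i < dev->num_tx_queues; i++)
        fn(&dev->tx[i], arg);
}

/* Example callback: counts the queues it is called for. */
static void model_count_queue(struct model_queue *q, void *arg)
{
    (void)q;
    (*(unsigned int *)arg)++;
}

static void model_free_dev(struct model_dev *dev)
{
    free(dev->tx);
    free(dev);
}
```

    Funnelling every access through two helpers is what makes callers easy
    to find with grep, including those that still pass a literal index of
    zero.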
     

09 Jul, 2008

1 commit


10 Jun, 2008

1 commit


04 Jun, 2008

1 commit

  • Make nlmsg_trim(), nlmsg_cancel(), genlmsg_cancel(), and
    nla_nest_cancel() void functions.

    Return -EMSGSIZE instead of -1 if the provided message buffer is not
    big enough.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
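
    A rough userspace model of the two conventions described above: an
    append helper that reports -EMSGSIZE when the buffer is too small, and a
    trim helper that cannot fail and therefore returns void. struct
    model_msg and both function names are hypothetical and do not mirror the
    real netlink structures.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct model_msg {
    unsigned char *buf;
    size_t len;   /* bytes used so far */
    size_t cap;   /* total buffer size */
};

/* Append n bytes; a distinct errno value tells the caller why it failed. */
static int model_put(struct model_msg *msg, const void *data, size_t n)
{
    if (n > msg->cap - msg->len)
        return -EMSGSIZE;
    memcpy(msg->buf + msg->len, data, n);
    msg->len += n;
    return 0;
}

/* Roll the message back to an earlier length; nothing here can fail,
 * so there is no return value to check. */
static void model_trim(struct model_msg *msg, size_t mark)
{
    if (mark < msg->len)
        msg->len = mark;
}
```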
     

22 May, 2008

1 commit


24 Apr, 2008

1 commit

  • ASSERT_RTNL uses mutex_trylock to test whether the rtnl_mutex is
    held. This triggers bogus warnings when running in atomic context, which
    f.e. happens when adding secondary unicast addresses through
    macvlan or vlan or when synchronizing multicast addresses from
    wireless devices.

    Mid-term we might want to consider moving all address updates
    to process context since the locking seems overly complicated,
    for now just fix the bogus warning by changing ASSERT_RTNL to
    use mutex_is_locked().

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

16 Apr, 2008

2 commits


26 Mar, 2008

2 commits


24 Feb, 2008

1 commit

  • RTM_NEWLINK allows for already existing links to be modified. For this
    purpose do_setlink() is called which expects address attributes with a
    payload length of at least dev->addr_len. This patch adds the necessary
    validation for the RTM_NEWLINK case.

    The address length for links to be created is not checked for now as the
    actual attribute length is used when copying the address to the netdevice
    structure. It might make sense to report an error if less than addr_len
    bytes are provided but enforcing this might break drivers trying to be
    smart with not transmitting all zero addresses.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
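
    The validation for the modify case amounts to a length comparison. The
    sketch below is a userspace model with hypothetical names; the choice of
    -EINVAL as the error code is an assumption of this example and is not
    stated in the commit message.

```c
#include <errno.h>
#include <stddef.h>

struct model_link {
    size_t addr_len;   /* hardware address length of the existing device */
};

/* addr_attr == NULL means the request carried no address attribute.
 * An attribute shorter than the device address length is rejected. */
static int validate_link_addr(const struct model_link *dev,
                              const unsigned char *addr_attr,
                              size_t attr_len)
{
    if (addr_attr && attr_len < dev->addr_len)
        return -EINVAL;
    return 0;
}
```

    Longer payloads are accepted in this model because the text only
    requires "at least" dev->addr_len bytes.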
     

20 Feb, 2008

1 commit


18 Feb, 2008

1 commit


13 Feb, 2008

1 commit

  • In do_setlink() a single notification is sent at the end of the
    function if any modification occurred. If the address has been changed,
    another notification is sent.

    Both of them are required because originally only the NETDEV_CHANGEADDR
    notification was sent and although device state change implies address
    change, some programs may expect the original notification. It remains
    for compatibility.

    If set_operstate() is called from do_setlink(), it doesn't send a
    notification; it does so only when called from rtnl_create_link(), as before.

    Signed-off-by: Laszlo Attila Toth
    Signed-off-by: David S. Miller

    Laszlo Attila Toth
     

05 Feb, 2008

1 commit


29 Jan, 2008

6 commits

  • During the network namespace stop process, kernel-side netlink sockets
    belonging to a namespace should be closed. They should not prevent
    the namespace from stopping, so they do not increment the namespace usage
    counter. However, this counter will be put during the last sock_put.

    Replacing the correct netns with init_ns solves the problem only
    partially, as the socket to be stopped remains, until its proper stop,
    a valid netlink kernel socket that can be looked up by user processes. This
    is not a problem until it resides in initial namespace (no processes
    inside this net), but this is not true for init_net.

    So, hold a reference to the socket, remove it from the lookup tables and
    only after that change namespace and perform a last put.

    Signed-off-by: Denis V. Lunev
    Tested-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • Create a specific helper for netlink kernel socket disposal. This just
    makes the code look better and provides a basis for proper disposal
    inside a namespace.

    Signed-off-by: Denis V. Lunev
    Tested-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • Network namespace allocates 2 kernel netlink sockets, fibnl &
    rtnl. These sockets should be disposed properly, i.e. by
    sock_release. Plain sock_put is not enough.

    Signed-off-by: Denis V. Lunev
    Tested-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • After the previous prep work this just consists of removing checks
    limiting the code to work in the initial network namespace, and
    updating rtmsg_ifinfo so we can generate events for devices in
    something other than the initial network namespace.

    Referring to other network devices, as the IFLA_LINK
    and IFLA_MASTER attributes do, gets interesting if those network
    devices happen to be in other network namespaces. Currently
    ifindex numbers are allocated globally so I have taken the path
    of least resistance and still report the information even
    though the devices they are talking about are invisible.

    If applications start getting confused or when ifindex
    numbers become local to the network namespace we may need
    to do something different in the future.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • After this patch none of the netlink callbacks support anything
    except the initial network namespace but the rtnetlink infrastructure
    now handles multiple network namespaces.

    Changes from v2:
    - IPv6 addrlabel processing

    Changes from v1:
    - no need for special rtnl_unlock handling
    - fixed IPv6 ndisc

    Signed-off-by: Denis V. Lunev
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • Before I can enable rtnetlink to work in all network namespaces I need
    to be certain that something won't break. So this patch deliberately
    disables all of the rtnetlink methods in everything except the
    initial network namespace. After the methods have been audited this
    extra check can be disabled.

    Changes from v1:
    - added IPv6 addrlabel protection

    Signed-off-by: Denis V. Lunev
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller
    Signed-off-by: Herbert Xu

    Denis V. Lunev
     

21 Jan, 2008

1 commit

  • When unregistering the rtnl_link_ops, all existing devices using
    the ops are destroyed. With nested devices this may lead to a
    use-after-free despite the use of for_each_netdev_safe() in case
    the upper device is next in the device list and is destroyed
    by the NETDEV_UNREGISTER notifier.

    The easy fix is to restart scanning the device list after removing
    a device. Alternatively we could add new devices to the front of
    the list to avoid having dependent devices follow the device they
    depend on. A third option would be to only restart scanning if
    dev->iflink of the next device matches dev->ifindex of the current
    one. For now this seems like the safest solution.

    With this patch, the veth rtnl_link_ops unregistration can use
    rtnl_link_unregister() directly since it now also handles destruction
    of multiple devices at once.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
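
    The "restart scanning after each removal" approach can be sketched with
    a userspace linked list. Everything here is a hypothetical model:
    model_del() stands in for device destruction plus whatever a
    NETDEV_UNREGISTER notifier might tear down on top of it, which is why a
    saved "next" pointer cannot be trusted.

```c
#include <stdlib.h>

struct model_netdev {
    struct model_netdev *next;
    struct model_netdev *lower;  /* device this one is stacked on, or NULL */
    int ops_id;
};

static struct model_netdev *dev_list;

/* Append at the tail, so an upper device follows the one it depends on. */
static struct model_netdev *model_add(int ops_id, struct model_netdev *lower)
{
    struct model_netdev *dev = calloc(1, sizeof(*dev));
    struct model_netdev **pp = &dev_list;

    if (!dev)
        return NULL;
    dev->ops_id = ops_id;
    dev->lower = lower;
    while (*pp)
        pp = &(*pp)->next;
    *pp = dev;
    return dev;
}

/* Free dev and, first, every device stacked on top of it. */
static void model_del(struct model_netdev *dev)
{
    struct model_netdev **pp, *d;

restart:
    for (d = dev_list; d; d = d->next) {
        if (d->lower == dev) {
            model_del(d);
            goto restart;   /* list changed underneath us */
        }
    }
    for (pp = &dev_list; *pp; pp = &(*pp)->next) {
        if (*pp == dev) {
            *pp = dev->next;
            free(dev);
            return;
        }
    }
}

/* Destroy all devices using ops_id; returns how many were removed directly. */
static int model_unregister_ops(int ops_id)
{
    struct model_netdev *d;
    int removed = 0;

restart:
    for (d = dev_list; d; d = d->next) {
        if (d->ops_id == ops_id) {
            model_del(d);
            removed++;
            goto restart;   /* the next entry may have been freed too */
        }
    }
    return removed;
}

static int model_count(void)
{
    struct model_netdev *d;
    int n = 0;

    for (d = dev_list; d; d = d->next)
        n++;
    return n;
}
```

    Restarting costs extra passes over the list, but it never dereferences
    an entry that was freed as a side effect of removing another one.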
     

27 Oct, 2007

1 commit

  • The pid namespace patches changed the semantics of
    find_task_by_pid without breaking the compile resulting
    in get_net_ns_by_pid doing the wrong thing.

    So switch to using the intended find_task_by_vpid.

    Combined with Denis' earlier patch to make netlink traffic
    fully synchronous the inadvertent race I introduced with
    accessing current is actually removed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

20 Oct, 2007

1 commit

  • When someone wants to deal with some other task's namespaces it has to lock
    the task and then to get the desired namespace if the one exists. This is
    slow on read-only paths and may be impossible in some cases.

    E.g. Oleg recently noticed a race between unshare() and the (sent for
    review in cgroups) pid namespaces - when the task notifies the parent it
    has to know the parent's namespace, but taking the task_lock() is
    impossible there - the code is under write locked tasklist lock.

    On the other hand switching the namespace on task (daemonize) and releasing
    the namespace (after the last task exit) is a rather rare operation and we
    can sacrifice its speed to solve the issues above.

    The access to other task namespaces is proposed to be performed
    like this:

    rcu_read_lock();
    nsproxy = task_nsproxy(tsk);
    if (nsproxy != NULL) {
            /*
             * work with the namespaces here
             * e.g. get the reference on one of them
             */
    } /*
       * NULL task_nsproxy() means that this task is
       * almost dead (zombie)
       */
    rcu_read_unlock();

    This patch has passed the review by Eric and Oleg :) and,
    of course, tested.

    [clg@fr.ibm.com: fix unshare()]
    [ebiederm@xmission.com: Update get_net_ns_by_pid]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Serge Hallyn
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

11 Oct, 2007

9 commits

  • This patch makes processing of netlink user -> kernel messages synchronous.
    This change was inspired by a talk with Alexey Kuznetsov about current
    netlink message processing. He says that he was badly wrong when he introduced
    asynchronous user -> kernel communication.

    The call netlink_unicast is the only path to send message to the kernel
    netlink socket. But, unfortunately, it is also used to send data to the
    user.

    Before this change the user message was attached to the socket queue
    and sk->sk_data_ready was called. The process was blocked until all
    pending messages were processed. The bad thing is that this processing
    may occur in an arbitrary process context.

    This patch changes nlk->data_ready callback to get 1 skb and force packet
    processing right in the netlink_unicast.

    Kernel -> user path in netlink_unicast remains untouched.

    EINTR processing in netlink_run_queue was changed. It forces an rtnl_lock
    drop, but the process remains in the loop until the message is fully
    processed. So, there is no need for this kludge now.

    Signed-off-by: Denis V. Lunev
    Acked-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • There is no need to process outstanding netlink user->kernel packets
    during rtnl_unlock now. There is no rtnl_trylock in the rtnetlink_rcv
    anymore.

    Normal code path is the following:
    netlink_sendmsg
      netlink_unicast
        netlink_sendskb
          skb_queue_tail
          netlink_data_ready
            rtnetlink_rcv
              mutex_lock(&rtnl_mutex);
              netlink_run_queue(sk, qlen, &rtnetlink_rcv_msg);
              mutex_unlock(&rtnl_mutex);

    So, it is possible, that packets can be present in the rtnl->sk_receive_queue
    during rtnl_unlock, but there is no need to process them at that moment as
    rtnetlink_rcv for that packet is pending.

    Signed-off-by: Denis V. Lunev
    Acked-by: Alexey Kuznetsov
    Signed-off-by: David S. Miller

    Denis V. Lunev
     
  • I was looking at Patrick's fix to inet_diag and it occurred
    to me that we're using a pointer argument to return values
    unnecessarily in netlink_run_queue. Changing it to return
    the value will allow the compiler to generate better code
    since the value won't have to be memory-backed.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • The simplest thing to implement is moving network devices between
    namespaces. However with the same attribute IFLA_NET_NS_PID we can
    easily implement creating devices in the destination network
    namespace as well. However that is a little bit trickier so this
    patch sticks to what is simple and easy.

    A pid is used to identify a process that happens to be a member
    of the network namespace we want to move the network device to.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch makes most of the generic device layer network
    namespace safe. This patch makes dev_base_head a
    network namespace variable, and then it picks up
    a few associated variables. The functions:
    dev_getbyhwaddr
    dev_getfirsthwbytype
    dev_get_by_flags
    dev_get_by_name
    __dev_get_by_name
    dev_get_by_index
    __dev_get_by_index
    dev_ioctl
    dev_ethtool
    dev_load
    wireless_process_ioctl

    were modified to take a network namespace argument, and
    deal with it.

    vlan_ioctl_set and brioctl_set were modified so their
    hooks will receive a network namespace argument.

    So basically anything in the core of the network stack that was
    affected by the change of dev_base was modified to handle
    multiple network namespaces. The rest of the network stack was
    simply modified to explicitly use &init_net the initial network
    namespace. This can be fixed when those components of the network
    stack are modified to handle multiple network namespaces.

    For now the ifindex generator is left global.

    Fundamentally ifindex numbers are per namespace, or else
    we will have corner case problems with migration when
    we get that far.

    At the same time there are assumptions in the network stack
    that the ifindex of a network device won't change. Making
    the ifindex number global seems a good compromise until
    the network stack can cope with ifindex changes when
    you change namespaces, and the like.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Each netlink socket will live in exactly one network namespace,
    this includes the controlling kernel sockets.

    This patch updates all of the existing netlink protocols
    to only support the initial network namespace. Requests
    by clients in other namespaces will get -ECONNREFUSED,
    as they would if the kernel did not have support for
    that netlink protocol compiled in.

    As each netlink protocol is updated to be multiple network
    namespace safe it can register multiple kernel sockets
    to acquire a presence in the rest of the network namespaces.

    The implementation in af_netlink is a simple filter implementation
    at hash table insertion and hash table look up time.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Every user of the network device notifiers is either a protocol
    stack or a pseudo device. If a protocol stack that does not have
    support for multiple network namespaces receives an event for a
    device that is not in the initial network namespace it quite possibly
    can get confused and do the wrong thing.

    To avoid problems until all of the protocol stacks are converted
    this patch modifies all netdev event handlers to ignore events on
    devices that are not in the initial network namespace.

    As the rest of the code is made network namespace aware these
    checks can be removed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This routine gets the parsed rtnl attributes and creates a new
    link with generic info (IFLA_LINKINFO policy). Its intention
    is to help drivers that need to create several links at
    once (like VETH).

    This is nothing but a copy-pasted part of the rtnl_newlink() function
    that is responsible for creating a new device.

    Signed-off-by: Pavel Emelianov
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Pavel Emelianov
     
  • Several devices have multiple independent RX queues per net
    device, and some have a single interrupt doorbell for several
    queues.

    In either case, it's easier to support layouts like that if the
    structure representing the poll is independent from the net
    device itself.

    The signature of the ->poll() call back goes from:

    int foo_poll(struct net_device *dev, int *budget)

    to

    int foo_poll(struct napi_struct *napi, int budget)

    The caller is returned the number of RX packets processed (or
    the number of "NAPI credits" consumed if you want to get
    abstract). The callee no longer messes around bumping
    dev->quota, *budget, etc. because that is all handled in the
    caller upon return.

    The napi_struct is to be embedded in the device driver private data
    structures.

    Furthermore, it is the driver's responsibility to disable all NAPI
    instances in its ->stop() device close handler. Since the
    napi_struct is privatized into the driver's private data structures,
    only the driver knows how to get at all of the napi_struct instances
    it may have per-device.

    With lots of help and suggestions from Rusty Russell, Roland Dreier,
    Michael Chan, Jeff Garzik, and Jamal Hadi Salim.

    Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
    Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.

    [ Ported to current tree and all drivers converted. Integrated
    Stephen's follow-on kerneldoc additions, and restored poll_list
    handling to the old style to fix mutual exclusion issues. -DaveM ]

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
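
    The new ->poll() contract from the last commit in this list can be
    sketched as a userspace model. All model_* names are hypothetical; the
    sketch only shows the shape described above: the NAPI context is
    embedded in driver-private data, the callee receives a budget by value
    and returns the work done, and the caller does the quota accounting.

```c
#include <stddef.h>

struct model_napi {
    int pending;     /* packets waiting in this RX ring */
    int scheduled;   /* non-zero while polling is active */
};

/* The NAPI context lives inside the driver's private structure. */
struct model_priv {
    int rx_processed;
    struct model_napi napi;
};

#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* New-style poll: gets the NAPI context and a budget, returns work done. */
static int model_poll(struct model_napi *napi, int budget)
{
    struct model_priv *priv =
        model_container_of(napi, struct model_priv, napi);
    int work_done = 0;

    while (work_done < budget && napi->pending > 0) {
        napi->pending--;
        work_done++;
    }
    priv->rx_processed += work_done;

    /* Ring drained before the budget ran out: stop polling
     * (a real driver would re-enable its interrupt here). */
    if (work_done < budget)
        napi->scheduled = 0;
    return work_done;
}

/* Caller side: quota bookkeeping happens here, after the callee returns. */
static int model_net_rx(struct model_napi *napi, int *quota, int weight)
{
    int budget = weight < *quota ? weight : *quota;
    int work = model_poll(napi, budget);

    *quota -= work;
    return work;
}
```

    Since the poll function only sees its own model_napi, a driver with
    several RX rings can embed one context per ring and poll them
    independently.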
     

01 Aug, 2007

1 commit


19 Jul, 2007

1 commit