29 Mar, 2013

1 commit


22 Mar, 2013

1 commit


28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    they don't really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

12 Jan, 2013

1 commit

  • In fib_frontend.c, there is a confusing comment; NETLINK_CB(skb).portid does not
    refer to a pid of sending process, but rather to a netlink portid.

    Signed-off-by: Rami Rosen
    Signed-off-by: David S. Miller

    Rami Rosen
     

19 Nov, 2012

3 commits

  • - Only allow moving network devices to network namespaces you have
    CAP_NET_ADMIN privileges over.

    - Enable creating/deleting/modifying interfaces
    - Enable adding/deleting addresses
    - Enable adding/setting/deleting neighbour entries
    - Enable adding/removing routes
    - Enable adding/removing fib rules
    - Enable setting the forwarding state
    - Enable adding/removing ipv6 address labels
    - Enable setting bridge parameter

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Allow an unpriviled user who has created a user namespace, and then
    created a network namespace to effectively use the new network
    namespace, by reducing capable(CAP_NET_ADMIN) and
    capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
    CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.

    Settings that merely control a single network device are allowed.
    Either the network device is a logical network device where
    restrictions make no difference or the network device is hardware NIC
    that has been explicity moved from the initial network namespace.

    In general policy and network stack state changes are allowed
    while resource control is left unchanged.

    Allow creating raw sockets.
    Allow the SIOCSARP ioctl to control the arp cache.
    Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
    Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
    Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
    Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
    Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
    Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting gre tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipip tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipsec virtual tunnel interfaces.

    Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
    MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
    sockets.

    Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and
    arbitrary ip options.

    Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
    Allow setting the IP_TRANSPARENT ipv4 socket option.
    Allow setting the TCP_REPAIR socket option.
    Allow setting the TCP_CONGESTION socket option.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • - In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
    to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged
    users to make netlink calls to modify their local network
    namespace.

    - In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
    that calls that are not safe for unprivileged users are still
    protected.

    Later patches will remove the extra capable calls from methods
    that are safe for unprivilged users.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

09 Oct, 2012

1 commit

  • After "Cache input routes in fib_info nexthops" (commit
    d2d68ba9fe) and "Elide fib_validate_source() completely when possible"
    (commit 7a9bc9b81a) we can not send ICMP redirects. It seems we
    should not cache the RTCF_DOREDIRECT flag in nh_rth_input because
    the same fib_info can be used for traffic that is not redirected,
    eg. from other input devices or from sources that are not in same subnet.

    As result, we have to disable the caching of RTCF_DOREDIRECT
    flag and to force source validation for the case when forwarding
    traffic to the input device. If traffic comes from directly connected
    source we allow redirection as it was done before both changes.

    Avoid setting RTCF_DOREDIRECT if IN_DEV_TX_REDIRECTS
    is disabled, this can avoid source address validation and to
    help caching the routes.

    After the change "Adjust semantics of rt->rt_gateway"
    (commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages
    contain daddr instead of 0.0.0.0 when target is directly connected.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming fields
    that hold port identifiers portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

09 Sep, 2012

1 commit


08 Sep, 2012

1 commit


24 Aug, 2012

1 commit

  • Eric Biederman pointed out that not holding RTNL while calling
    call_netdevice_notifiers() was racy.

    This patch is a direct transcription his feedback
    against commit 0115e8e30d6fc (net: remove delay at device dismantle)

    Thanks Eric !

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Mahesh Bandewar
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Acked-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Aug, 2012

1 commit

  • I noticed extra one second delay in device dismantle, tracked down to
    a call to dst_dev_event() while some call_rcu() are still in RCU queues.

    These call_rcu() were posted by rt_free(struct rtable *rt) calls.

    We then wait a little (but one second) in netdev_wait_allrefs() before
    kicking again NETDEV_UNREGISTER.

    As the call_rcu() are now completed, dst_dev_event() can do the needed
    device swap on busy dst.

    To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
    after a rcu_barrier(), but outside of RTNL lock.

    Use NETDEV_UNREGISTER_FINAL with care !

    Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL

    Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
    IP cache removal.

    With help from Gao feng

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Mahesh Bandewar
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Aug, 2012

1 commit

  • As pointed out, there are places, that access net->loopback_dev->ifindex
    and after ifindex generation is made per-net this value becomes constant
    equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use
    it where appropriate.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

01 Aug, 2012

1 commit

  • When a device is unregistered, we have to purge all of the
    references to it that may exist in the entire system.

    If a route is uncached, we currently have no way of accomplishing
    this.

    So create a global list that is scanned when a network device goes
    down. This mirrors the logic in net/core/dst.c's dst_ifdown().

    Signed-off-by: David S. Miller

    David S. Miller
     

24 Jul, 2012

1 commit


21 Jul, 2012

1 commit

  • The ipv4 routing cache is non-deterministic, performance wise, and is
    subject to reasonably easy to launch denial of service attacks.

    The routing cache works great for well behaved traffic, and the world
    was a much friendlier place when the tradeoffs that led to the routing
    cache's design were considered.

    What it boils down to is that the performance of the routing cache is
    a product of the traffic patterns seen by a system rather than being a
    product of the contents of the routing tables. The former of which is
    controllable by external entitites.

    Even for "well behaved" legitimate traffic, high volume sites can see
    hit rates in the routing cache of only ~%10.

    Signed-off-by: David S. Miller

    David S. Miller
     

19 Jul, 2012

1 commit


13 Jul, 2012

1 commit


06 Jul, 2012

1 commit

  • If the user hasn't actually installed any custom rules, or fiddled
    with the default ones, don't go through the whole FIB rules layer.

    It's just pure overhead.

    Instead do what we do with CONFIG_IP_MULTIPLE_TABLES disabled, check
    the individual tables by hand, one by one.

    Also, move fib_num_tclassid_users into the ipv4 network namespace.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Jun, 2012

1 commit

  • This patch adds the following structure:

    struct netlink_kernel_cfg {
    unsigned int groups;
    void (*input)(struct sk_buff *skb);
    struct mutex *cb_mutex;
    };

    That can be passed to netlink_kernel_create to set optional configurations
    for netlink kernel sockets.

    I've populated this structure by looking for NULL and zero parameters at the
    existing code. The remaining parameters that always need to be set are still
    left in the original interface.

    That includes optional parameters for the netlink socket creation. This allows
    easy extensibility of this interface in the future.

    This patch also adapts all callers to use this new interface.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

29 Jun, 2012

3 commits

  • If rpfilter is off (or the SKB has an IPSEC path) and there are not
    tclassid users, we don't have to do anything at all when
    fib_validate_source() is invoked besides setting the itag to zero.

    We monitor tclassid uses with a counter (modified only under RTNL and
    marked __read_mostly) and we protect the fib_validate_source() real
    work with a test against this counter and whether rpfilter is to be
    done.

    Having a way to know whether we need no tclassid processing or not
    also opens the door for future optimized rpfilter algorithms that do
    not perform full FIB lookups.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Checking for in_dev being NULL is pointless.

    In fact, all of our callers have in_dev precomputed already,
    so just pass it in and remove the NULL checking.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Based upon feedback from Julian Anastasov.

    1) Use route flags to determine multicast/broadcast, not the
    packet flags.

    2) Leave saddr unspecified in flow key.

    3) Adjust how we invoke inet_select_addr(). Pass ip_hdr(skb)->saddr as
    second arg, and if it was zeronet use link scope.

    4) Use loopback as input interface in flow key.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Jun, 2012

2 commits

  • Signed-off-by: David S. Miller

    David S. Miller
     
  • The specific destination is the host we direct unicast replies to.
    Usually this is the original packet source address, but if we are
    responding to a multicast or broadcast packet we have to use something
    different.

    Specifically we must use the source address we would use if we were to
    send a packet to the unicast source of the original packet.

    The routing cache precomputes this value, but we want to remove that
    precomputation because it creates a hard dependency on the expensive
    rpfilter source address validation which we'd like to make cheaper.

    There are only three places where this matters:

    1) ICMP replies.

    2) pktinfo CMSG

    3) IP options

    Now there will be no real users of rt->rt_spec_dst and we can simply
    remove it altogether.

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Apr, 2012

1 commit


29 Mar, 2012

1 commit


12 Mar, 2012

1 commit

  • Use a more current kernel messaging style.

    Convert a printk block to print_hex_dump.
    Coalesce formats, align arguments.
    Use %s, __func__ instead of embedding function names.

    Some messages that were prefixed with _close are
    now prefixed with _fini. Some ah4 and esp messages
    are now not prefixed with "ip ".

    The intent of this patch is to later add something like
    #define pr_fmt(fmt) "IPv4: " fmt.
    to standardize the output messages.

    Text size is trivially reduced. (x86-32 allyesconfig)

    $ size net/ipv4/built-in.o*
    text data bss dec hex filename
    887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
    887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

10 Jun, 2011

1 commit

  • The message size allocated for rtnl ifinfo dumps was limited to
    a single page. This is not enough for additional interface info
    available with devices that support SR-IOV and caused a bug in
    which VF info would not be displayed if more than approximately
    40 VFs were created per interface.

    Implement a new function pointer for the rtnl_register service that will
    calculate the amount of data required for the ifinfo dump and allocate
    enough data to satisfy the request.

    Signed-off-by: Greg Rose
    Signed-off-by: Jeff Kirsher

    Greg Rose
     

11 Apr, 2011

2 commits

  • The reverse path filter interferes with IPsec subnet-to-subnet tunnels,
    especially when the link to the IPsec peer is on an interface other than
    the one hosting the default route.

    With dynamic routing, where the peer might be reachable through eth0
    today and eth1 tomorrow, it's difficult to keep rp_filter enabled unless
    fake routes to the remote subnets are configured on the interface
    currently used to reach the peer.

    IPsec provides a much stronger anti-spoofing policy than rp_filter, so
    this patch disables the rp_filter for packets with a security path.

    Signed-off-by: Michael Smith
    Signed-off-by: David S. Miller

    Michael Smith
     
  • This makes sk_buff available for other use in fib_validate_source().

    Signed-off-by: Michael Smith
    Signed-off-by: David S. Miller

    Michael Smith
     

31 Mar, 2011

1 commit


25 Mar, 2011

1 commit

  • Any operation that:

    1) Brings up an interface
    2) Adds an IP address to an interface
    3) Deletes an IP address from an interface

    can potentially invalidate the nh_saddr value, requiring
    it to be recomputed.

    Perform the recomputation lazily using a generation ID.

    Reported-by: Julian Anastasov
    Signed-off-by: David S. Miller

    David S. Miller
     

22 Mar, 2011

1 commit

  • Alex Sidorenko reported for problems with local
    routes left after IP addresses are deleted. It happens
    when same IPs are used in more than one subnet for the
    device.

    Fix fib_del_ifaddr to restrict the checks for duplicate
    local and broadcast addresses only to the IFAs that use
    our primary IFA or another primary IFA with same address.
    And we expect the prefsrc to be matched when the routes
    are deleted because it is possible they to differ only by
    prefsrc. This patch prevents local and broadcast routes
    to be leaked until their primary IP is deleted finally
    from the box.

    As the secondary address promotion needs to delete
    the routes for all secondaries that used the old primary IFA,
    add option to ignore these secondaries from the checks and
    to assume they are already deleted, so that we can safely
    delete the route while these IFAs are still on the device list.

    Reported-by: Alex Sidorenko
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

13 Mar, 2011

3 commits


10 Mar, 2011

1 commit


08 Mar, 2011

1 commit

  • When doing output route lookups, we have to select the source address
    if the user has not specified an explicit one.

    First, if the route has an explicit preferred source address
    specified, then we use that.

    Otherwise we search the route's outgoing interface for a suitable
    address.

    This search can be precomputed and cached at route insertion time.

    The only missing part is that we have to refresh this precomputed
    value any time addresses are added or removed from the interface, and
    this is accomplished by fib_update_nh_saddrs().

    Signed-off-by: David S. Miller

    David S. Miller