13 Dec, 2011

1 commit

  • This patch allows each namespace to independently set up
    its levels for tcp memory pressure thresholds. This patch
    alone does not buy much: we need to make this values
    per group of process somehow. This is achieved in the
    patches that follows in this patchset.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: David S. Miller
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     

12 Dec, 2011

1 commit


03 Dec, 2011

1 commit


22 Nov, 2011

1 commit

  • This patch fixes an oops that can be triggered following this recipe:

    0) make sure nf_conntrack_netlink and nf_conntrack_ipv4 are loaded.
    1) container is started.
    2) connect to it via lxc-console.
    3) generate some traffic with the container to create some conntrack
    entries in its table.
    4) stop the container: you hit one oops because the conntrack table
    cleanup tries to report the destroy event to user-space but the
    per-netns nfnetlink socket has already gone (as the nfnetlink
    socket is per-netns but event callback registration is global).

    To fix this situation, we make the ctnl_notifier per-netns so the
    callback is registered/unregistered if the container is
    created/destroyed.

    Alex Bligh and Alexey Dobriyan originally proposed one small patch to
    check if the nfnetlink socket is gone in nfnetlink_has_listeners,
    but this is a very visited path for events, thus, it may reduce
    performance and it looks a bit hackish to check for the nfnetlink
    socket only to workaround this situation. As a result, I decided
    to follow the bigger path choice, which seems to look nicer to me.

    Cc: Alexey Dobriyan
    Reported-by: Alex Bligh
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

14 Nov, 2011

1 commit

  • Reading /proc/net/snmp6 on a machine with a lot of cpus is very
    expensive (can be ~88000 us).

    This is because ICMPV6MSG MIB uses 4096 bytes per cpu, and folding
    values for all possible cpus can read 16 Mbytes of memory (32MBytes on
    non x86 arches)

    ICMP messages are not considered as fast path on a typical server, and
    eventually few cpus handle them anyway. We can afford an atomic
    operation instead of using percpu data.

    This saves 4096 bytes per cpu and per network namespace.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Nov, 2011

1 commit

  • Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
    (can be ~88000 us).

    This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
    for all possible cpus can read 16 Mbytes of memory.

    ICMP messages are not considered as fast path on a typical server, and
    eventually few cpus handle them anyway. We can afford an atomic
    operation instead of using percpu data.

    This saves 4096 bytes per cpu and per network namespace.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

14 May, 2011

1 commit

  • This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
    ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
    without any special privileges. In other words, the patch makes it
    possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
    order not to increase the kernel's attack surface, the new functionality
    is disabled by default, but is enabled at bootup by supporting Linux
    distributions, optionally with restriction to a group or a group range
    (see below).

    Similar functionality is implemented in Mac OS X:
    http://www.manpagez.com/man/4/icmp/

    A new ping socket is created with

    socket(PF_INET, SOCK_DGRAM, PROT_ICMP)

    Message identifiers (octets 4-5 of ICMP header) are interpreted as local
    ports. Addresses are stored in struct sockaddr_in. No port numbers are
    reserved for privileged processes, port 0 is reserved for API ("let the
    kernel pick a free number"). There is no notion of remote ports, remote
    port numbers provided by the user (e.g. in connect()) are ignored.

    Data sent and received include ICMP headers. This is deliberate to:
    1) Avoid the need to transport headers values like sequence numbers by
    other means.
    2) Make it easier to port existing programs using raw sockets.

    ICMP headers given to send() are checked and sanitized. The type must be
    ICMP_ECHO and the code must be zero (future extensions might relax this,
    see below). The id is set to the number (local port) of the socket, the
    checksum is always recomputed.

    ICMP reply packets received from the network are demultiplexed according
    to their id's, and are returned by recv() without any modifications.
    IP header information and ICMP errors of those packets may be obtained
    via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
    quenches and redirects are reported as fake errors via the error queue
    (IP_RECVERR); the next hop address for redirects is saved to ee_info (in
    network order).

    socket(2) is restricted to the group range specified in
    "/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
    that nobody (not even root) may create ping sockets. Setting it to "100
    100" would grant permissions to the single group (to either make
    /sbin/ping g+s and owned by this group or to grant permissions to the
    "netadmins" group), "0 4294967295" would enable it for the world, "100
    4294967295" would enable it for the users, but not daemons.

    The existing code might be (in the unlikely case anyone needs it)
    extended rather easily to handle other similar pairs of ICMP messages
    (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
    etc.).

    Userspace ping util & patch for it:
    http://openwall.info/wiki/people/segoon/ping

    For Openwall GNU/*/Linux it was the last step on the road to the
    setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
    is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
    http://mirrors.kernel.org/openwall/Owl/current/iso/

    Initially this functionality was written by Pavel Kankovsky for
    Linux 2.4.32, but unfortunately it was never made public.

    All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
    the patch.

    PATCH v3:
    - switched to flowi4.
    - minor changes to be consistent with raw sockets code.

    PATCH v2:
    - changed ping_debug() to pr_debug().
    - removed CONFIG_IP_PING.
    - removed ping_seq_fops.owner field (unused for procfs).
    - switched to proc_net_fops_create().
    - switched to %pK in seq_printf().

    PATCH v1:
    - fixed checksumming bug.
    - CAP_NET_RAW may not create icmp sockets anymore.

    RFC v2:
    - minor cleanups.
    - introduced sysctl'able group range to restrict socket(2).

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: David S. Miller

    Vasiliy Kulikov
     

25 Mar, 2011

1 commit

  • Any operation that:

    1) Brings up an interface
    2) Adds an IP address to an interface
    3) Deletes an IP address from an interface

    can potentially invalidate the nh_saddr value, requiring
    it to be recomputed.

    Perform the recomputation lazily using a generation ID.

    Reported-by: Julian Anastasov
    Signed-off-by: David S. Miller

    David S. Miller
     

15 Mar, 2011

1 commit

  • Remove include/net/netns/ip_vs.h because it depends on
    structures from include/net/ip_vs.h. As ipvs is pointer in
    struct net it is better to move struct netns_ipvs into
    include/net/ip_vs.h, so that we can easily use other structures
    in struct netns_ipvs.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     

19 Jan, 2011

1 commit

  • This patch adds flow-based timestamping for conntracks. This
    conntrack extension is disabled by default. Basically, we use
    two 64-bits variables to store the creation timestamp once the
    conntrack has been confirmed and the other to store the deletion
    time. This extension is disabled by default, to enable it, you
    have to:

    echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp

    This patch allows to save memory for user-space flow-based
    loogers such as ulogd2. In short, ulogd2 does not need to
    keep a hashtable with the conntrack in user-space to know
    when they were created and destroyed, instead we use the
    kernel timestamp. If we want to have a sane IPFIX implementation
    in user-space, this nanosecs resolution timestamps are also
    useful. Other custom user-space applications can benefit from
    this via libnetfilter_conntrack.

    This patch modifies the /proc output to display the delta time
    in seconds since the flow start. You can also obtain the
    flow-start date by means of the conntrack-tools.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     

14 Jan, 2011

1 commit


13 Jan, 2011

17 commits

  • Last two global vars to be moved,
    ip_vs_ftpsvc_counter and ip_vs_nullsvc_counter.

    [horms@verge.net.au: removed whitespace-change-only hunk]
    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • trash list per namspace,
    and reordering of some params in dst struct.

    [ horms@verge.net.au: Use cancel_delayed_work_sync() instead of
    cancel_rearming_delayed_work(). Found during
    merge conflict resoliution ]
    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • This patch makes defense work timer per name-space,
    A net ptr had to be added to the ipvs struct,
    since it's needed by defense_work_handler.

    [ horms@verge.net.au: Use cancel_delayed_work_sync() instead of
    cancel_rearming_delayed_work(). Found during
    merge conflict resoliution ]
    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • Moving global vars to ipvs struct, except for svc table lock.
    Next patch for ctl will be drop-rate handling.

    *v3
    __ip_vs_mutex remains global
    ip_vs_conntrack_enabled(struct netns_ipvs *ipvs)

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • Connection hash table is now name space aware.
    i.e. net ptr >> 8 is xor:ed to the hash,
    and this is the first param to be compared.
    The net struct is 0xa40 in size ( a little bit smaller for 32 bit arch:s)
    and cache-line aligned, so a ptr >> 5 might be a more clever solution ?

    All lookups where net is compared uses net_eq() which returns 1 when netns
    is disabled, and the compiler seems to do something clever in that case.

    ip_vs_conn_fill_param() have *net as first param now.

    Three new inlines added to keep conn struct smaller
    when names space is disabled.
    - ip_vs_conn_net()
    - ip_vs_conn_net_set()
    - ip_vs_conn_net_eq()

    *v3
    moved net compare to the end in "fast path"

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • The statistic counter locks for every packet are now removed,
    and that statistic is now per CPU, i.e. no locks needed.
    However summing is made in ip_vs_est into ip_vs_stats struct
    which is moved to ipvs struc.

    procfs, ip_vs_stats now have a "per cpu" count and a grand total.
    A new function seq_file_single_net() in ip_vs.h created for handling of
    single_open_net() since it does not place net ptr in a struct, like others.

    /var/lib/lxc # cat /proc/net/ip_vs_stats_percpu
    Total Incoming Outgoing Incoming Outgoing
    CPU Conns Packets Packets Bytes Bytes
    0 0 3 1 9D 34
    1 0 1 2 49 70
    2 0 1 2 34 76
    3 1 2 2 70 74
    ~ 1 7 7 18A 18E

    Conns/s Pkts/s Pkts/s Bytes/s Bytes/s
    0 0 0 0 0

    *v3
    ip_vs_stats reamains as before, instead ip_vs_stats_percpu is added.
    u64 seq lock added

    *v4
    Bug correction inbytes and outbytes as own vars..
    per_cpu counter for all stats now as suggested by Julian.

    [horms@verge.net.au: removed whitespace-change-only hunk]
    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • All global variables moved to struct ipvs,
    most external changes fixed (i.e. init_net removed)
    in sync_buf create + 4 replaced by sizeof(struct..)

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • All variables moved to struct ipvs,
    most external changes fixed (i.e. init_net removed)

    *v3
    timer per ns instead of a common timer in estimator.

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • All variables moved to struct ipvs,
    most external changes fixed (i.e. init_net removed)

    in ip_vs_protocol param struct net *net added to:
    - register_app()
    - unregister_app()
    This affected almost all proto_xxx.c files

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • In this phase (one), all local vars will be moved to ipvs struct.

    Remaining work, add param struct net *net to a couple of
    functions that is common for all protos and use ip_vs_proto_data

    *v3
    Removed unuset function set_state_timeout()

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • In this phase (one), all local vars will be moved to ipvs struct.

    Remaining work, add param struct net *net to a couple of
    functions that is common for all protos and use ip_vs_proto_data

    *v3
    Removed unused function set_state_timeout()

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • In this phase (one), all local vars will be moved to ipvs struct.

    Remaining work, add param struct net *net to a couple of
    functions that is common for all protos and use all
    ip_vs_proto_data

    *v3
    Removed unused function as sugested by Simon

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • Add support for protocol data per name-space.
    in struct ip_vs_protocol, appcnt will be removed when all protos
    are modified for network name-space.

    This patch causes warnings of unused functions, they will be used
    when next patch will be applied.

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • var sysctl_ip_vs_lblc_expiration moved to ipvs struct as
    sysctl_lblc_expiration

    procfs updated to handle this.

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • var sysctl_ip_vs_lblcr_expiration moved to ipvs struct as
    sysctl_lblcr_expiration

    procfs updated to handle this.

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • Services hash tables got netns ptr a hash arg,
    While Real Servers (rs) has been moved to ipvs struct.
    Two new inline functions added to get net ptr from skb.

    Since ip_vs is called from different contexts there is two
    places to dig for the net ptr skb->dev or skb->sk
    this is handled in skb_net() and skb_sknet()

    Global functions, ip_vs_service_get() ip_vs_lookup_real_service()
    etc have got struct net *net as first param.
    If possible get net ptr skb etc,
    - if not &init_net is used at this early stage of patching.

    ip_vs_ctl.c procfs not ready for netns yet.

    *v3
    Comments by Julian
    - __ip_vs_service_find and __ip_vs_svc_fwm_find are fast path,
    net_eq(svc->net, net) so the check is at the end now.
    - net = skb_net(skb) in ip_vs_out moved after check for skb_dst.

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     
  • Preparation for network name-space init, in this stage
    some empty functions exists.

    In most files there is a check if it is root ns i.e. init_net
    if (!net_eq(net, &init_net))
    return ...
    this will be removed by the last patch, when enabling name-space.

    *v3
    ip_vs_conn.c merge error corrected.
    net_ipvs #ifdef removed as sugested by Jan Engelhardt

    [ horms@verge.net.au: Removed whitespace-change-only hunks ]
    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Hans Schillstrom
     

22 Nov, 2010

1 commit


18 Oct, 2010

1 commit

  • In a network bench, I noticed an unfortunate false sharing between
    'loopback_dev' and 'count' fields in "struct net".

    'count' is written each time a socket is created or destroyed, while
    loopback_dev might be often read in routing code.

    Move loopback_dev in a read mostly section of "struct net"

    Note: struct netns_xfrm is cache line aligned on SMP.
    (It contains a "struct dst_ops")
    Move it at the end to avoid holes, and reduce sizeof(struct net) by 128
    bytes on ia32.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 May, 2010

4 commits

  • This patch adds support for multiple independant multicast routing instances,
    named "tables".

    Userspace multicast routing daemons can bind to a specific table instance by
    issuing a setsockopt call using a new option MRT6_TABLE. The table number is
    stored in the raw socket data and affects all following ip6mr setsockopt(),
    getsockopt() and ioctl() calls. By default, a single table (RT6_TABLE_DFLT)
    is created with a default routing rule pointing to it. Newly created pim6reg
    devices have the table number appended ("pim6regX"), with the exception of
    devices created in the default table, which are named just "pim6reg" for
    compatibility reasons.

    Packets are directed to a specific table instance using routing rules,
    similar to how regular routing rules work. Currently iif, oif and mark
    are supported as keys, source and destination addresses could be supported
    additionally.

    Example usage:

    - bind pimd/xorp/... to a specific table:

    uint32_t table = 123;
    setsockopt(fd, SOL_IPV6, MRT6_TABLE, &table, sizeof(table));

    - create routing rules directing packets to the new table:

    # ip -6 mrule add iif eth0 lookup 123
    # ip -6 mrule add oif eth0 lookup 123

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • The unres_queue is currently shared between all namespaces. Following patches
    will additionally allow to create multiple multicast routing tables in each
    namespace. Having a single shared queue for all these users seems to excessive,
    move the queue and the cleanup timer to the per-namespace data to unshare it.

    As a side-effect, this fixes a bug in the seq file iteration functions: the
    first entry returned is always from the current namespace, entries returned
    after that may belong to any namespace.

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

08 May, 2010

1 commit

  • A while back there was a discussion regarding the rt_secret_interval timer.
    Given that we've had the ability to do emergency route cache rebuilds for awhile
    now, based on a statistical analysis of the various hash chain lengths in the
    cache, the use of the flush timer is somewhat redundant. This patch removes the
    rt_secret_interval sysctl, allowing us to rely solely on the statistical
    analysis mechanism to determine the need for route cache flushes.

    Signed-off-by: Neil Horman
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     

28 Apr, 2010

1 commit


14 Apr, 2010

3 commits

  • This patch adds support for multiple independant multicast routing instances,
    named "tables".

    Userspace multicast routing daemons can bind to a specific table instance by
    issuing a setsockopt call using a new option MRT_TABLE. The table number is
    stored in the raw socket data and affects all following ipmr setsockopt(),
    getsockopt() and ioctl() calls. By default, a single table (RT_TABLE_DEFAULT)
    is created with a default routing rule pointing to it. Newly created pimreg
    devices have the table number appended ("pimregX"), with the exception of
    devices created in the default table, which are named just "pimreg" for
    compatibility reasons.

    Packets are directed to a specific table instance using routing rules,
    similar to how regular routing rules work. Currently iif, oif and mark
    are supported as keys, source and destination addresses could be supported
    additionally.

    Example usage:

    - bind pimd/xorp/... to a specific table:

    uint32_t table = 123;
    setsockopt(fd, IPPROTO_IP, MRT_TABLE, &table, sizeof(table));

    - create routing rules directing packets to the new table:

    # ip mrule add iif eth0 lookup 123
    # ip mrule add oif eth0 lookup 123

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy