13 Dec, 2013

1 commit

  • Reorder struct netns_ct so that atomic_t "count" changes don't
    slow down users of read-mostly fields.

    This is based on Eric Dumazet's proposed patch:
    "netfilter: conntrack: remove the central spinlock"
    http://thread.gmane.org/gmane.linux.network/268758/focus=47306

    The tricky part of cache-aligning this structure is that it is
    inlined in struct net (include/net/net_namespace.h), so changes to
    other netns_xxx structures affect our alignment.

    Eric's original patch contained an ambiguity on 32-bit regarding
    alignment in struct net. This patch also takes 32-bit into account,
    and in case of changed (struct net) alignment sysctl_xxx entries have
    been ordered according to how often they are accessed.
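
    The layout idea can be sketched in plain C11 as follows. This is only an
    illustration: the struct and field names are hypothetical stand-ins, the
    64-byte cache-line size is an assumption, and C11 alignas is used in place
    of the kernel's own alignment annotations.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_CACHELINE 64 /* assumed cache-line size */

/* Hypothetical stand-in for struct netns_ct; field names are illustrative. */
struct sketch_netns_ct {
	/* Hot field: written on every conntrack allocation and free. */
	alignas(SKETCH_CACHELINE) atomic_int count;

	/*
	 * Read-mostly fields start on their own cache line, so writes to
	 * "count" do not invalidate this line on other CPUs.
	 */
	alignas(SKETCH_CACHELINE) unsigned int htable_size;
	int sysctl_checksum;
	int sysctl_log_invalid;
};

/*
 * Hypothetical stand-in for struct net. The embedded struct carries its own
 * alignment requirement, so members placed before it cannot shift its
 * cache-line boundaries.
 */
struct sketch_net_layout {
	int ifindex;
	struct sketch_netns_ct ct;
};
```

    With this layout, offsetof() can be used at build time to check that the
    hot counter and the read-mostly fields never share a cache line.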

    Signed-off-by: Jesper Dangaard Brouer
    Reviewed-by: Jiri Benc
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     

22 Oct, 2013

1 commit


15 Oct, 2013

3 commits

  • This patch registers the ARP family and the filter chain type
    for this family.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This patch adds batch support to nfnetlink. Basically, it adds
    two new control messages:

    * NFNL_MSG_BATCH_BEGIN, that indicates the beginning of a batch,
    the nfgenmsg->res_id indicates the nfnetlink subsystem ID.

    * NFNL_MSG_BATCH_END, that results in the invocation of the
    ss->commit callback function. If not specified or an error
    occurred in the batch, the ss->abort function is invoked
    instead.

    The end message represents the commit operation in nftables, the
    lack of end message results in an abort. This patch also adds the
    .call_batch function that is only called from the batch receive
    path.
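
    The commit/abort decision described above can be sketched in userspace C.
    This is a simplified model, not the nfnetlink code: the message kinds, the
    handler and the result enum are hypothetical names invented for the
    example.

```c
#include <stddef.h>

/*
 * Hypothetical message kinds. The first and last stand in for
 * NFNL_MSG_BATCH_BEGIN and NFNL_MSG_BATCH_END.
 */
enum sketch_msg {
	SKETCH_BATCH_BEGIN,
	SKETCH_UPDATE,
	SKETCH_UPDATE_BAD, /* an update that the handler rejects */
	SKETCH_BATCH_END
};

enum sketch_result { SKETCH_COMMITTED, SKETCH_ABORTED };

/* Hypothetical stand-in for the subsystem's .call_batch handler. */
static int sketch_call_batch(enum sketch_msg m)
{
	return m == SKETCH_UPDATE_BAD ? -1 : 0;
}

/* Commit only if the batch is terminated by an end message and no update failed. */
static enum sketch_result sketch_process_batch(const enum sketch_msg *msgs, size_t n)
{
	int failed = 0;
	int saw_end = 0;
	size_t i;

	if (n == 0 || msgs[0] != SKETCH_BATCH_BEGIN)
		return SKETCH_ABORTED;

	for (i = 1; i < n; i++) {
		if (msgs[i] == SKETCH_BATCH_END) {
			saw_end = 1;
			break;
		}
		if (sketch_call_batch(msgs[i]) < 0)
			failed = 1;
	}

	return (saw_end && !failed) ? SKETCH_COMMITTED : SKETCH_ABORTED;
}
```

    A batch that is cut short, or that contains a failing update, ends in the
    abort path in this model, which mirrors the behaviour described for
    ss->abort.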

    This patch adds atomic rule updates and dumps based on
    bitmask generations. This allows a set of rule-set updates to be
    committed atomically and incrementally without altering the internal
    state of existing nf_tables expressions/matches/targets.

    The idea consists of using a generation cursor of 1 bit and
    a bitmask of 2 bits per rule. Assuming the gencursor is 0,
    then the genmask (expressed as a bitmask) can be interpreted
    as:

    00 active in the present, will be active in the next generation.
    01 inactive in the present, will be active in the next generation.
    10 active in the present, will be deleted in the next generation.
    ^
    gencursor

    Once you invoke the transition to the next generation, the global
    gencursor is updated:

    00 active in the present, will be active in the next generation.
    01 active in the present, needs to zero its future, it becomes 00.
    10 inactive in the present, delete now.
    ^
    gencursor
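
    The two tables above follow from one rule: a rule is active in generation
    g when bit g of its genmask is clear. A minimal userspace sketch of that
    reading, with hypothetical names and a global cursor standing in for the
    per-namespace state:

```c
#include <stdbool.h>

/* Hypothetical stand-in for a rule; only the 2-bit generation mask is modelled. */
struct sketch_rule {
	unsigned int genmask; /* bit g set => inactive in generation g */
};

static unsigned int sketch_gencursor; /* 1 bit: current generation, 0 or 1 */

static unsigned int sketch_next_gen(void)
{
	return sketch_gencursor ^ 1u;
}

static bool sketch_active_now(const struct sketch_rule *r)
{
	return (r->genmask & (1u << sketch_gencursor)) == 0;
}

static bool sketch_active_next(const struct sketch_rule *r)
{
	return (r->genmask & (1u << sketch_next_gen())) == 0;
}

/* Added rule: inactive in the present, active in the next generation. */
static void sketch_add(struct sketch_rule *r)
{
	r->genmask = 1u << sketch_gencursor;
}

/* Deleted rule: active in the present, inactive in the next generation. */
static void sketch_del(struct sketch_rule *r)
{
	r->genmask = 1u << sketch_next_gen();
}

/* Transition to the next generation: only the cursor moves. */
static void sketch_commit(void)
{
	sketch_gencursor = sketch_next_gen();
}

/* After the transition, a surviving rule zeroes its future (becomes 00). */
static void sketch_clear(struct sketch_rule *r)
{
	r->genmask = 0;
}
```

    Starting with the cursor at 0, an added rule gets genmask 01 and a deleted
    rule gets 10; after sketch_commit() the first is active and the second is
    not, without touching either rule.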

    If a dump is in progress and nf_tables enters a new generation,
    the dump will stop and return -EBUSY to let userspace know that
    it has to retry. In order to invalidate dumps, a global
    genctr counter is increased every time nf_tables enters a new
    generation.

    This new operation can be used from the user-space utility
    that controls the firewall, e.g.

    nft -f restore

    The rule updates contained in `file' will be applied atomically.

    cat file
    -----
    add filter INPUT ip saddr 1.1.1.1 counter accept #1
    del filter INPUT ip daddr 2.2.2.2 counter drop #2
    -EOF-

    Note that rule 1 will be inactive until the transition to the
    next generation, and rule 2 will be evicted in the next generation.

    There is a penalty during the rule update due to the branch
    misprediction in the packet matching framework. But that should be
    quickly resolved once the iteration over the commit list that
    contains rules that require updates is finished.

    Event notification happens once the rule-set update has been
    committed. So we skip notifications in case the rule-set update
    is aborted, which can happen when the rule-set is only being tested
    to check that it applies correctly.

    This patch squashed the following patches from Pablo:

    * nf_tables: atomic rule updates and dumps
    * nf_tables: get rid of per rule list_head for commits
    * nf_tables: use per netns commit list
    * nfnetlink: add batch support and use it from nf_tables
    * nf_tables: all rule updates are transactional
    * nf_tables: attach replacement rule after stale one
    * nf_tables: do not allow deletion/replacement of stale rules
    * nf_tables: remove unused NFTA_RULE_FLAGS

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Register the family per network namespace to ensure that sets are
    only visible in their appropriate namespace.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

01 Oct, 2013

1 commit

  • - Move sysctl_local_ports from a global variable into struct netns_ipv4.
    - Modify inet_get_local_port_range to take a struct net, and update all
    of the callers.
    - Move the initialization of sysctl_local_ports into
    sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c
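
    The shape of the change can be sketched as below. The structs are
    simplified, hypothetical stand-ins for struct net and struct netns_ipv4,
    the default range is an assumed example value, and the locking a real
    implementation needs around the range is left out.

```c
/* Hypothetical, simplified stand-ins for struct netns_ipv4 and struct net. */
struct sketch_netns_ipv4 {
	int local_ports[2]; /* low, high */
};

struct sketch_net {
	struct sketch_netns_ipv4 ipv4;
};

/* Per-namespace initialisation, standing in for the sysctl init hook. */
static void sketch_init_net(struct sketch_net *net)
{
	net->ipv4.local_ports[0] = 32768; /* assumed example default */
	net->ipv4.local_ports[1] = 61000; /* assumed example default */
}

/* The range is read from the namespace passed in, not from a global. */
static void sketch_get_local_port_range(const struct sketch_net *net, int *low, int *high)
{
	*low = net->ipv4.local_ports[0];
	*high = net->ipv4.local_ports[1];
}
```

    Because every caller now names its namespace, changing the range in one
    namespace leaves the others untouched.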

    v2:
    - Ensure indentation used tabs
    - Fixed ip.h so it applies cleanly to today's net-next

    v3:
    - Compile fixes of strange callers of inet_get_local_port_range.
    This patch now successfully passes an allmodconfig build.
    Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range

    Originally-by: Samya
    Acked-by: Nicolas Dichtel
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

10 Aug, 2013

2 commits


01 Aug, 2013

1 commit

  • The current net namespace has only one genid for both IPv4 and IPv6, which has
    the drawbacks below:

    - Adding/deleting an IPv4 address invalidates all IPv6 routing table entries.
    - Inserting/removing an XFRM policy also invalidates both IPv4/IPv6 routing table
    entries, even when the policy only applies to one address family.

    Thus, this patch attempts to split the single genid in two, so that IPv4 and
    IPv6 are handled separately at a finer granularity.
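
    A small sketch of why the split helps, assuming cached entries remember
    the genid they were created under. The struct, field and function names
    are hypothetical, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-namespace state: one generation id per address family. */
struct sketch_net_genid {
	atomic_int ipv4_genid;
	atomic_int ipv6_genid;
};

static void sketch_genid_init(struct sketch_net_genid *net)
{
	atomic_init(&net->ipv4_genid, 0);
	atomic_init(&net->ipv6_genid, 0);
}

/* An IPv4 address change bumps only the IPv4 generation. */
static void sketch_ipv4_addr_changed(struct sketch_net_genid *net)
{
	atomic_fetch_add(&net->ipv4_genid, 1);
}

/* A cached entry is valid while the genid it was created under is current. */
static bool sketch_entry_valid(int cached_genid, atomic_int *current_genid)
{
	return cached_genid == atomic_load(current_genid);
}
```

    With a single shared counter, the IPv4 address change would also have
    invalidated every cached IPv6 entry.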

    Signed-off-by: Fan Du
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    fan.du
     

23 May, 2013

1 commit


06 Apr, 2013

2 commits

  • This patch adds netns support to nf_log and it prepares netns
    support for existing loggers. It is composed of four major
    changes.

    1) nf_log_register has been split into two functions: nf_log_register
    and nf_log_set. The new nf_log_register is used to globally
    register the nf_logger and nf_log_set is used for enabling
    pernet support from nf_loggers.

    Per-netns support is not yet complete after this patch; it comes in
    separate follow-up patches.

    2) Add net as a parameter of nf_log_bind_pf. Per-netns support is not
    yet complete after this patch; it only allows to bind the
    nf_logger to the protocol family from init_net and it skips
    other cases.

    3) Adapt all nf_log_packet callers to pass netns as parameter.
    After this patch, this function only works for init_net.

    4) Make the sysctl net/netfilter/nf_log pernet.

    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • This patch makes this proc dentry pernet. So far only init_net
    had a /proc/net/netfilter directory.

    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     

25 Mar, 2013

1 commit

  • This patch adds a dev_addr_genid for IPv6. The goal is to use it, combined with
    dev_base_seq to check if a change occurs during a netlink dump.
    If a change is detected, the flag NLM_F_DUMP_INTR is set in the first message
    after the dump was interrupted.

    Note that only dump of unicast addresses is checked (multicast and anycast are
    not checked).
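
    The consistency check can be modelled as a snapshot taken when the dump
    starts and compared before each message. This is a sketch only: the
    globals and function names are hypothetical, SKETCH_DUMP_INTR is a local
    stand-in for NLM_F_DUMP_INTR, and the two counters are kept as a pair
    rather than folded into one value.

```c
#define SKETCH_DUMP_INTR 0x10u /* local stand-in for NLM_F_DUMP_INTR */

/* Hypothetical stand-ins for the two change counters. */
static unsigned int sketch_dev_addr_genid;
static unsigned int sketch_dev_base_seq;

/* Snapshot of both counters, kept across the messages of one dump. */
struct sketch_dump_cb {
	unsigned int addr_genid;
	unsigned int base_seq;
};

static void sketch_dump_start(struct sketch_dump_cb *cb)
{
	cb->addr_genid = sketch_dev_addr_genid;
	cb->base_seq = sketch_dev_base_seq;
}

/*
 * Returns the flags for the next message. The interrupted flag is set only
 * on the first message after a change, then the snapshot is refreshed.
 */
static unsigned int sketch_dump_next_flags(struct sketch_dump_cb *cb)
{
	unsigned int flags = 0;

	if (cb->addr_genid != sketch_dev_addr_genid ||
	    cb->base_seq != sketch_dev_base_seq) {
		flags |= SKETCH_DUMP_INTR;
		sketch_dump_start(cb);
	}
	return flags;
}
```

    Userspace that sees the flag knows the listing may be inconsistent and can
    restart the dump.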

    Reported-by: Junwei Zhang
    Reported-by: Hongjun Li
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

06 Feb, 2013

1 commit

  • The xfrm gc threshold can be configured via xfrm{4,6}_gc_thresh
    sysctl but currently only in init_net, other namespaces always
    use the default value. This can substantially limit the number
    of IPsec tunnels that can be effectively used.

    Signed-off-by: Michal Kubecek
    Signed-off-by: Steffen Klassert

    Michal Kubecek
     

18 Jan, 2013

1 commit

  • similar to connmarks, except labels are bit-based; i.e.
    all labels may be attached to a flow at the same time.

    Up to 128 labels are supported. Supporting more labels
    is possible, but requires increasing the ct offset delta
    from u8 to u16 type due to increased extension sizes.

    Mapping of bit-identifier to label name is done in userspace.
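
    Bit-based labels amount to a fixed-size bitmap per flow. A minimal sketch
    with hypothetical names, assuming the 128-label limit mentioned above:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_LABELS 128

/* Hypothetical per-flow label storage: one bit per label, 16 bytes in total. */
struct sketch_labels {
	uint8_t bits[SKETCH_MAX_LABELS / 8];
};

/* Attach a label; returns false if the bit is outside the supported range. */
static bool sketch_label_set(struct sketch_labels *l, unsigned int bit)
{
	if (bit >= SKETCH_MAX_LABELS)
		return false;
	l->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	return true;
}

/* Check whether a label is attached to the flow. */
static bool sketch_label_test(const struct sketch_labels *l, unsigned int bit)
{
	if (bit >= SKETCH_MAX_LABELS)
		return false;
	return (l->bits[bit / 8] >> (bit % 8)) & 1u;
}
```

    Unlike a single mark value, any combination of labels can be set on the
    same flow at once; the meaning of each bit is left to userspace.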

    The extension is enabled at run-time once "-m connlabel" netfilter
    rules are added.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

16 Jan, 2013

1 commit


07 Jan, 2013

1 commit

  • As suggested by Eric Dumazet, this patch makes the tcp_ecn sysctl
    namespace aware. The reason behind this patch is to ease the testing
    of ecn problems on the internet and to allow applications to tune their
    own use of ecn.

    Cc: Eric Dumazet
    Cc: David Miller
    Cc: Stephen Hemminger
    Signed-off-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

24 Dec, 2012

1 commit

  • Florian Westphal reported that the removal of the NOTRACK target
    (9655050 netfilter: remove xt_NOTRACK) is breaking some existing
    setups.

    The removal had been scheduled a long time ago, as
    described in Documentation/feature-removal-schedule.txt

    What: xt_NOTRACK
    Files: net/netfilter/xt_NOTRACK.c
    When: April 2011
    Why: Superseded by xt_CT

    Still, people may not have noticed, or may have decided to stick to an
    old iptables version. I agree with him that a more conservative
    approach, adding a printk to warn users for some time, is less
    aggressive.

    Current iptables 1.4.16.3 already contains the aliasing support
    that makes it point to the CT target, so upgrading would fix it.
    Still, the policy so far has been to avoid pushing our users to
    upgrade.

    As a solution, this patch restores the NOTRACK target inside the CT
    target, and it now emits a warning.

    Reported-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

17 Dec, 2012

1 commit

  • In (d871bef netfilter: ctnetlink: dump entries from the dying and
    unconfirmed lists), we assume that all conntrack objects are
    inserted in one of the existing lists. However, template conntrack
    objects were not. This results in hitting BUG_ON in the
    destroy_conntrack path while removing a rule that uses the CT target.

    This patch fixes the situation by adding the template lists, which
    is where template conntrack objects reside now.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

26 Oct, 2012

1 commit

  • Currently sctp allows for the optional use of md5 or sha1 hmac algorithms to
    generate cookie values when establishing new connections via two build time
    config options. There's no real reason to make this a static selection. We can
    add a sysctl that allows for the dynamic selection of these algorithms at run
    time, with the default value determined by the corresponding crypto library
    availability.
    This comes in handy when, for example, running a system in FIPS mode, where use
    of md5 is disallowed, but SHA1 is permitted.
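
    Run-time selection of this kind usually reduces to validating a string
    written by the administrator against a fixed list. The sketch below is
    illustrative only: the function and variable names are hypothetical, and
    the accepted strings and the initial default are assumptions made for the
    example.

```c
#include <stddef.h>
#include <string.h>

/* Currently selected algorithm name; "md5" is an assumed initial default. */
static const char *sketch_cookie_hmac = "md5";

/*
 * Hypothetical handler for a write to the sysctl. Returns 0 and switches the
 * algorithm if the name is recognised, -1 (leaving the setting unchanged)
 * otherwise.
 */
static int sketch_set_cookie_hmac(const char *name)
{
	static const char *const allowed[] = { "md5", "sha1", "none" };
	size_t i;

	for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
		if (strcmp(name, allowed[i]) == 0) {
			sketch_cookie_hmac = allowed[i];
			return 0;
		}
	}
	return -1;
}
```

    On a FIPS-mode system, an administrator would switch the value to sha1 at
    run time instead of rebuilding the kernel.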

    Note: This new sysctl has no corresponding socket option to select the cookie
    hmac algorithm. I chose not to implement that intentionally, as RFC 6458
    contains no option for this value, and I opted not to pollute the socket option
    namespace.

    Change notes:
    v2)
    * Updated subject to have the proper sctp prefix as per Dave M.
    * Replaced default selection options with new options that allow
    developers to explicitly select available hmac algs at build time
    as per suggestion by Vlad Y.

    Signed-off-by: Neil Horman
    CC: Vlad Yasevich
    CC: "David S. Miller"
    CC: netdev@vger.kernel.org
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Neil Horman
     

29 Sep, 2012

1 commit

  • Conflicts:
    drivers/net/team/team.c
    drivers/net/usb/qmi_wwan.c
    net/batman-adv/bat_iv_ogm.c
    net/ipv4/fib_frontend.c
    net/ipv4/route.c
    net/l2tp/l2tp_netlink.c

    The team, fib_frontend, route, and l2tp_netlink conflicts were simply
    overlapping changes.

    qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

    With help from Antonio Quartulli.

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Sep, 2012

1 commit

  • As pointed out by Michal, it is necessary to add a new
    namespace for nf_conntrack_reasm code; this prepares
    for the second patch.

    Cc: Herbert Xu
    Cc: Michal Kubeček
    Cc: David Miller
    Cc: Patrick McHardy
    Cc: Pablo Neira Ayuso
    Cc: netfilter-devel@vger.kernel.org
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     

19 Sep, 2012

1 commit


03 Sep, 2012

1 commit


30 Aug, 2012

2 commits


25 Aug, 2012

1 commit


24 Aug, 2012

1 commit

  • This patch fixes a broken build due to a missing header:
    ...
    CC net/ipv4/proc.o
    In file included from include/net/net_namespace.h:15,
    from net/ipv4/proc.c:35:
    include/net/netns/packet.h:11: error: field 'sklist_lock' has incomplete type
    ...

    The lock of netns_packet has been replaced by a recent patch to be a mutex instead of a spinlock,
    but we need to replace the header file to be linux/mutex.h instead of linux/spinlock.h as well.

    See commit 0fa7fa98dbcc2789409ed24e885485e645803d7f:
    packet: Protect packet sk list with mutex (v2)

    Signed-off-by: Rami Rosen
    Signed-off-by: David S. Miller

    Rami Rosen
     

23 Aug, 2012

1 commit

  • Change since v1:

    * Fixed inuse counters access spotted by Eric

    In patch eea68e2f (packet: Report socket mclist info via diag module) I
    introduced a "scheduling in atomic" problem in the packet diag module: the
    socket list is traversed under rcu_read_lock(), while the sk mclist access
    performed under it requires the rtnl lock (i.e. a mutex) to be taken.

    [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
    [152363.820573] 4 locks held by crtools/12517:
    [152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [] sock_diag_rcv+0x1f/0x3e
    [152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [] sock_diag_rcv_msg+0xdb/0x11a
    [152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [] netlink_dump+0x23/0x1ab
    [152363.820693] #3: (rcu_read_lock){.+.+..}, at: [] packet_diag_dump+0x0/0x1af

    A similar thing was then re-introduced by further packet diag patches (fanout
    mutex and pgvec mutex for rings) :(

    Apart from being terribly sorry for the above, I propose to change the packet
    sk list protection from spinlock to mutex. This lock currently protects two
    modifications:

    * sklist
    * prot inuse counters

    The sklist modifications can be just reprotected with mutex since they already
    occur in a sleeping context. The inuse counters modifications are trickier -- the
    __this_cpu_-s are used inside, thus requiring the caller to handle the potential
    issues with contexts itself. Since packet sockets' counters are modified in two
    places only (packet_create and packet_release) we only need to protect the context
    from being preempted. BH disabling is not required in this case.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

15 Aug, 2012

7 commits


31 Jul, 2012

1 commit


21 Jul, 2012

1 commit

  • Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not
    guaranteed to be a power of two.

    Uses hash_32() instead of custom hash.

    tcpmhash_entries should be an unsigned int.
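
    The power-of-two requirement matters because a hash reduced to "bits" bits
    only covers the whole table when the table has exactly 2^bits slots. A
    userspace sketch of both pieces follows; the function names are
    hypothetical, and the multiplier is an assumed constant for a generic
    multiplicative hash, not necessarily the one hash_32() uses.

```c
#include <stdint.h>

/* Round n up to the next power of two; returns 0 if n exceeds 2^31. */
static unsigned int sketch_roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	if (n > 0x80000000u)
		return 0;
	while (p < n)
		p <<= 1;
	return p;
}

/* log2 of a power of two, i.e. the number of index bits for the table. */
static unsigned int sketch_ilog2(unsigned int pow2)
{
	unsigned int bits = 0;

	while (pow2 > 1) {
		pow2 >>= 1;
		bits++;
	}
	return bits;
}

/*
 * Multiplicative hash keeping the top "bits" bits (1..32).
 * The multiplier is an assumption chosen for illustration.
 */
static uint32_t sketch_hash_32(uint32_t val, unsigned int bits)
{
	if (bits == 0 || bits > 32)
		return 0;
	return (uint32_t)(val * UINT32_C(0x61C88647)) >> (32 - bits);
}
```

    For a requested size of 1000 entries, the table would be rounded up to
    1024 slots and indexed with 10 hash bits, so every result is a valid slot.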

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jul, 2012

1 commit

  • tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
    per network namespace.

    This leads to bad behavior on multiqueue NICs, because many cpus
    contend for the socket lock and, once the socket lock is acquired, extra
    false sharing on various socket fields slows down the operations.

    To better resist attacks, we use a percpu socket. Each cpu can
    run without contention, using appropriate memory (local node).

    Additional features :

    1) We also mirror the queue_mapping of the incoming skb, so that
    answers use the same queue if possible.

    2) Setting the SOCK_USE_WRITE_QUEUE socket flag speeds up sock_wfree()

    3) We now limit the number of in-flight RST/ACK [1] packets
    per cpu, instead of per namespace, and we honor the sysctl_wmem_default
    limit dynamically. (Prior to this patch, sysctl_wmem_default value was
    copied at boot time, so any further change would not affect tcp_sock
    limit)

    [1] These packets are only generated when no socket was matched for
    the incoming packet.

    Reported-by: Bill Sommerfeld
    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jul, 2012

1 commit

  • Maintain a local hash table of TCP dynamic metrics blobs.

    Computed TCP metrics are no longer maintained in the route metrics.

    The table uses RCU and an extremely simple hash so that it has low
    latency and low overhead. A simple hash is legitimate because we only
    make metrics blobs for fully established connections.

    The default hash table sizes, metric timeouts, and the hash chain
    length limit certainly could use some tweaking. But
    the basic design seems sound.

    With help from Eric Dumazet and Joe Perches.

    Signed-off-by: David S. Miller

    David S. Miller