06 Apr, 2013

2 commits

  • This patch adds netns support to nf_log and prepares netns
    support for existing loggers. It is composed of four major
    changes.

    1) nf_log_register has been split into two functions:
    nf_log_register and nf_log_set. The new nf_log_register is used
    to globally register the nf_logger, and nf_log_set is used to
    enable per-netns support for nf_loggers.

    Per-netns support is not yet complete after this patch; it
    comes in separate follow-up patches.

    2) Add net as a parameter of nf_log_bind_pf. Per-netns support
    is still not complete after this patch; it only allows binding
    the nf_logger to the protocol family from init_net and skips
    other cases.

    3) Adapt all nf_log_packet callers to pass netns as a
    parameter. After this patch, this function only works for
    init_net.

    4) Make the sysctl net/netfilter/nf_log pernet.
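
    A minimal sketch of how a logger might use the split API
    (signatures simplified; my_logger and its pernet hook are
    hypothetical):

    #include <linux/netfilter.h>
    #include <net/net_namespace.h>
    #include <net/netfilter/nf_log.h>

    static struct nf_logger my_logger = {
        .name = "my-logger",
        /* .logfn and .me omitted in this sketch */
    };

    static int __net_init my_logger_net_init(struct net *net)
    {
        /* enable the logger for this netns; previously this
         * was part of the old nf_log_register */
        nf_log_set(net, NFPROTO_IPV4, &my_logger);
        return 0;
    }

    static int __init my_logger_init(void)
    {
        /* global registration only; per-netns enabling happens
         * in the pernet init above */
        return nf_log_register(NFPROTO_IPV4, &my_logger);
    }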

    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • This patch makes the /proc/net/netfilter proc dentry per
    netns. So far only init_net had a /proc/net/netfilter
    directory.

    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     

02 Apr, 2013

38 commits

  • Signed-off-by: Michal Kubecek
    Signed-off-by: Pablo Neira Ayuso

    Michal Kubeček
     
  • Because rev1 and rev3 of the target share the same hashing,
    generalize it by introducing nfqueue_hash().
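
    A sketch of the shared helper along the lines described
    (hash_v4/hash_v6 stand for the existing per-family hash
    routines):

    static u32 nfqueue_hash(const struct sk_buff *skb,
                            const struct xt_action_param *par)
    {
        const struct xt_NFQ_info_v1 *info = par->targinfo;
        u32 queue = info->queuenum;

        /* scale the 32-bit hash into the configured queue range */
        if (par->family == NFPROTO_IPV4)
            queue += ((u64) hash_v4(skb) * info->queues_total) >> 32;
        else if (par->family == NFPROTO_IPV6)
            queue += ((u64) hash_v6(skb) * info->queues_total) >> 32;

        return queue;
    }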

    Signed-off-by: Holger Eitzenberger
    Signed-off-by: Pablo Neira Ayuso

    holger@eitzenberger.org
     
  • The current NFQUEUE target uses a hash, computed over the
    source and destination addresses (and other parameters), to
    steer the packet to the actual NFQUEUE. This, however, ignores
    the fact that the packet is eventually handled by a particular
    CPU, as requested by the user.

    If, e.g.,

    1) IRQ affinity is already used to handle packets on a
    particular CPU (in both the single-queue and multi-queue case)

    and/or

    2) RPS is used to steer packets to a specific softirq

    the target easily chooses an NFQUEUE which is not handled by a process
    pinned to the same CPU.

    The idea is therefore to use the CPU index for determining the
    NFQUEUE handling the packet.

    E.g., on a system with 4 CPUs, 4 MQ queues and 4 NFQUEUEs it
    looks like this:

    +-----+  +-----+  +-----+  +-----+
    |NFQ#0|  |NFQ#1|  |NFQ#2|  |NFQ#3|
    +-----+  +-----+  +-----+  +-----+
       ^        ^        ^        ^
       |        |NFQUEUE |        |
       +        +        +        +
    +-----+  +-----+  +-----+  +-----+
    |rx-0 |  |rx-1 |  |rx-2 |  |rx-3 |
    +-----+  +-----+  +-----+  +-----+

    The NFQUEUEs do not necessarily have to start with number 0,
    and setups with fewer NFQUEUEs than packet-handling CPUs are
    not a problem either.

    This patch extends the NFQUEUE target to accept a new
    NFQ_FLAG_CPU_FANOUT flag. If it is specified, the target uses
    the CPU index to determine the NFQUEUE being used. I have to
    introduce rev3 for this. The 'flags' are folded into the _v2
    'bypass'.

    By changing the way the queue is assigned, I'm able to improve
    performance if the processes reading on the NFQUEUEs are
    pinned correctly.
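
    A hedged sketch of the rev3 target logic described above (the
    actual code may differ in details):

    static unsigned int
    nfqueue_tg_v3(struct sk_buff *skb, const struct xt_action_param *par)
    {
        const struct xt_NFQ_info_v3 *info = par->targinfo;
        u32 queue = info->queuenum;

        if (info->queues_total > 1) {
            if (info->flags & NFQ_FLAG_CPU_FANOUT)
                /* steer by CPU index instead of the tuple hash */
                queue = info->queuenum +
                        smp_processor_id() % info->queues_total;
            else
                queue = nfqueue_hash(skb, par);
        }

        return NF_QUEUE_NR(queue);
    }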

    Signed-off-by: Holger Eitzenberger
    Signed-off-by: Pablo Neira Ayuso

    holger@eitzenberger.org
     
  • Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • We used a global BH disable in the LOCAL_OUT hook.
    Add the _bh suffix to all places that need it and remove
    the disabling from LOCAL_OUT and the sync code.

    Functions like ip_defrag need protection from BH, so add it.
    As for nf_nat_mangle_tcp_packet, it needs the RCU lock.
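
    A minimal sketch of the pattern, assuming cp->lock is one of
    the affected locks:

    /* take _bh variants at the spots that can run in process
     * context (LOCAL_OUT) instead of one global BH disable in
     * the hook */
    spin_lock_bh(&cp->lock);
    /* update connection state */
    spin_unlock_bh(&cp->lock);

    /* ip_defrag touches state shared with softirqs, so wrap it */
    local_bh_disable();
    err = ip_defrag(skb, user);
    local_bh_enable();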

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • This is the final step in the RCU conversion.

    Things that are removed:

    - svc->usecnt: now svc is accessed under the RCU read lock
    - svc->inc: and some unused code
    - ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace the PE
    - __ip_vs_svc_lock: replaced with RCU
    - IP_VS_WAIT_WHILE: now readers look up svcs and dests under
    RCU and work in parallel with configuration

    Other changes:

    - before now, an RCU read-side critical section included the
    call to the schedule method; now it is extended to include
    the service lookup
    - ip_vs_svc_table and ip_vs_svc_fwm_table now use hlist
    - svc->pe and svc->scheduler remain valid to the end (of the
    grace period); the schedulers are prepared for such RCU
    readers even after done_service is called, but they need
    to use synchronize_rcu because the last ip_vs_scheduler_put
    can happen while RCU read-side critical sections
    use an outdated svc->scheduler pointer
    - as planned, update_service is removed
    - empty services can be freed immediately after the grace
    period. If dests were present, the services are freed from
    the dest trash code
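
    A hedged sketch of the resulting fast path (IPVS names,
    arguments simplified):

    rcu_read_lock();
    svc = ip_vs_service_find(net, af, fwmark, protocol, &daddr, dport);
    if (svc) {
        struct ip_vs_scheduler *sched = rcu_dereference(svc->scheduler);
        struct ip_vs_dest *dest = sched->schedule(svc, skb);
        /* bind the conn to dest while still inside the RCU
         * read-side critical section */
    }
    rcu_read_unlock();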

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • In previous commits the schedulers started to access
    svc->destinations with the _rcu list-traversal primitives,
    with the IP_VS_WAIT_WHILE macro still playing the role of
    grace period. Now it is time to finish the updating part,
    i.e. to add and delete dests with the _rcu suffix, before
    removing IP_VS_WAIT_WHILE in the next commit.

    We use the same rule for conns as for the schedulers: dests
    can be searched in an RCU read-side critical section, where
    ip_vs_dest_hold can be called by ip_vs_bind_dest.

    Some things are not perfect; for example, updating code calls
    functions like ip_vs_lookup_dest under RCU, just because we
    use some functions both from the reader and from the updater
    side.
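
    A minimal sketch of the updater side (n_list is the dest's
    node in svc->destinations):

    list_add_rcu(&dest->n_list, &svc->destinations);
    /* and on deletion: */
    list_del_rcu(&dest->n_list);
    /* dest is freed only after a grace period, so concurrent RCU
     * readers traversing svc->destinations stay safe */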

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • As all read_locks are gone, a spin lock is preferred.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • This method releases the scheduler state and cannot fail.
    This change will help to properly replace the scheduler in a
    following patch.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • All dests will go to the trash, no exceptions. But we have to
    use the new list node t_list for this, due to the RCU changes
    in the following patches. Dests will wait there an initial
    grace period, and later for all conns and schedulers to put
    their reference. The dests no longer get a reference for
    staying in the dest trash, as before.

    As a result, we do not load ip_vs_dest_put with extra checks
    for the last refcnt, and the schedulers do not need to play
    games with atomic_inc_not_zero while selecting the best
    destination.
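
    A one-line sketch of the new trash handling (names per the
    description above):

    /* unlinked dests are parked on the trash list via the new
     * t_list node, without taking a reference as before */
    list_add(&dest->t_list, &ipvs->dest_trash);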

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The schedule method now needs the _rcu list-traversal
    primitive for svc->destinations. As the weight of some dest
    can be reduced during dest selection, change the algorithm to
    check weights by using minimum weights in the
    1 .. max_weight-(di-1) range, with the same step (di). This
    way we ensure that there will always be a weight >= 1 check
    before claiming that all destinations are overloaded.
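
    A hypothetical helper (not the scheduler code itself)
    illustrating the bound:

    /* lowest current-weight value visited when scanning down from
     * max_weight (mw) in steps of di, the gcd of all weights; as
     * every active weight is a multiple of di, the result is in
     * 1 .. di, hence >= 1 and <= every configured weight */
    static int lowest_cw(int mw, int di)
    {
        return mw - ((mw - 1) / di) * di;
    }
    /* e.g. weights {2, 4}: di = 2, mw = 4, the scan visits 4, 2;
     * lowest_cw(4, 2) == 2 */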

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The schedule method now needs the _rcu list-traversal
    primitive for svc->destinations.
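
    A minimal sketch of the traversal pattern shared by this and
    the similar scheduler patches below:

    struct ip_vs_dest *dest;

    list_for_each_entry_rcu(dest, &svc->destinations, n_list) {
        if (dest->flags & IP_VS_DEST_F_OVERLOAD)
            continue;
        /* evaluate atomic_read(&dest->weight), conns, etc. */
    }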

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Use the 3 new methods to reassign dests.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The schedule method now needs the _rcu list-traversal
    primitive for svc->destinations.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The schedule method now needs the _rcu list-traversal
    primitive for svc->destinations. As the previous entry could
    be unlinked, limit the list traversals to 2 when the lookup
    starts from the previous entry.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The schedule method now needs the _rcu list-traversal
    primitive for svc->destinations.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The schedule method now needs the _rcu list-traversal
    primitive for svc->destinations.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The schedule method now needs the _rcu list-traversal
    primitive for svc->destinations. The read_lock for sched_lock
    is removed. The set.lock is removed because it is now used
    only in rare cases, mostly under sched_lock.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The schedule method now needs the _rcu list-traversal
    primitive for svc->destinations. The read_lock for sched_lock
    is removed. Use a dead flag to prevent new entries from being
    created while the scheduler is reclaimed. Use hlist for the
    hash table.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Use the new add_dest and del_dest methods
    to reassign dests.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • ip_vs_dest_hold will be used under the RCU lock, while
    ip_vs_dest_put can be called even after the dest is removed
    from the service, as happens for conns and some schedulers.
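
    Roughly, the two helpers look like this (a sketch; the real
    definitions live in the IPVS headers):

    static inline void ip_vs_dest_hold(struct ip_vs_dest *dest)
    {
        atomic_inc(&dest->refcnt);
    }

    static inline void ip_vs_dest_put(struct ip_vs_dest *dest)
    {
        atomic_dec(&dest->refcnt);
    }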

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Allow schedulers to use rcu_dereference when returning a
    destination on lookup. The RCU read-side critical section will
    allow ip_vs_bind_dest to get a dest refcnt, as preparation for
    the step where destinations will be deleted without an
    IP_VS_WAIT_WHILE guard that holds off packet processing during
    an update.

    Add the new optional scheduler methods add_dest, del_dest and
    upd_dest. For now the methods are called together with
    update_service, but update_service will be removed in a
    following change.
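
    A hedged sketch of the new optional ops as they might appear
    in struct ip_vs_scheduler (exact prototypes may differ):

    struct ip_vs_scheduler {
        /* ... existing fields and methods ... */

        /* dest was linked into the service */
        int (*add_dest)(struct ip_vs_service *svc, struct ip_vs_dest *dest);
        /* dest was unlinked from the service */
        int (*del_dest)(struct ip_vs_service *svc, struct ip_vs_dest *dest);
        /* dest was updated, e.g. a new weight */
        int (*upd_dest)(struct ip_vs_service *svc, struct ip_vs_dest *dest);
    };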

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The global list of schedulers, ip_vs_schedulers, is accessed
    only from user context: configuration and scheduler module
    [un]registration. Use a mutex, ip_vs_sched_mutex, instead.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • We have many fields to set and few to reset, so use
    kmem_cache_alloc instead to save some cycles.
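
    A sketch of the allocation change (ip_vs_conn_cachep is the
    existing conn cache; the few fields that must be zero are then
    reset explicitly):

    /* plain allocation: no zeroing pass over the whole object */
    cp = kmem_cache_alloc(ip_vs_conn_cachep, GFP_ATOMIC);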

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • __ip_vs_conn_in_get and ip_vs_conn_out_get are hot paths.
    Optimize them so that the ports are matched first. By moving
    net and fwmark below, on a 32-bit arch we can fit caddr in a
    32-byte cache line and all addresses in a 64-byte cache line.
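
    A hedged sketch of the reordering idea (layout approximate;
    exact offsets depend on the arch and config):

    struct ip_vs_conn {
        struct hlist_node       c_list;   /* hashed list heads */
        /* ports first: cheap 16-bit compares on lookup */
        __be16                  cport;
        __be16                  dport;
        __be16                  vport;
        u16                     af;       /* address family */
        union nf_inet_addr      caddr;    /* client address */
        union nf_inet_addr      vaddr;    /* virtual address */
        union nf_inet_addr      daddr;    /* destination address */
        /* net, fwmark and the rest moved below this point */
    };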

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Convert __ip_vs_conntbl_lock_array as follows:

    - readers that do not modify conn lists will use RCU lock
    - updaters that modify lists will use spinlock_t

    Conn lookups will now use an RCU read-side critical section.
    Without using __ip_vs_conn_get, such places have access to the
    connection fields and can dereference some pointers like pe
    and pe_data, plus they can update the timer expiration. If
    full access is required, we contend for a reference.

    We add a barrier in __ip_vs_conn_put, so that other CPUs see
    the refcnt operation after the other writes.

    With the introduction of ip_vs_conn_unlink() we try to
    reorganize ip_vs_conn_expire(), so that unhashing of
    connections that should stay longer is avoided, even if only
    for a very short time.
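
    A hedged sketch of the new lookup pattern (tuple_match() is a
    hypothetical stand-in for the real parameter comparison):

    rcu_read_lock();
    hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) {
        if (!tuple_match(cp, p))
            continue;
        /* pe, pe_data or the timer expiration may be used right
         * here under RCU; for full access, contend for a
         * reference */
        if (__ip_vs_conn_get(cp)) {
            rcu_read_unlock();
            return cp;
        }
    }
    rcu_read_unlock();
    return NULL;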

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Allow the readers to use the RCU lock, and for PE module
    registrations use a global mutex instead of a spinlock. All PE
    modules need to use synchronize_rcu in their module exit
    handler.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • rs_lock was used to protect rs_table (a hash table) from
    updaters (under the global mutex) and readers (packet
    handlers). We can remove rs_lock by using the RCU lock for
    readers. Reclaiming the dest only with kfree_rcu is enough,
    because the readers access only fields from the ip_vs_dest
    structure.

    Use hlist for rs_table.

    As we now use hlist_del_rcu, introduce an in_rs_table flag as
    a replacement for the list_empty checks, which do not work
    with RCU. It is needed because only NAT dests are in the
    rs_table.
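
    A minimal sketch of the updater side (d_list is assumed as the
    hash linkage member):

    hlist_add_head_rcu(&dest->d_list, &ipvs->rs_table[hash]);
    dest->in_rs_table = 1;

    /* on removal: */
    hlist_del_rcu(&dest->d_list);
    dest->in_rs_table = 0;
    /* dest itself is reclaimed with kfree_rcu() */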

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • We use locks like tcp_app_lock, udp_app_lock and sctp_app_lock
    to protect access to the protocol hash tables from readers in
    packet context, while the application instances (inc) are
    [un]registered under the global mutex.

    As the hash tables are mostly read when conns are created and
    bound to an app, use RCU for the readers and reclaim the app
    instance after a grace period.

    Simplify ip_vs_app_inc_get, because we use usecnt only for
    statistics and rely on module refcounting.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Currently, when forwarding requests to real servers, we use
    dst_lock and atomic operations when cloning the dst_cache
    value. As the dst_cache value does not change most of the
    time, it is better to use RCU and to take dst_lock only when
    we need to replace an obsolete dst. For this to work we keep
    dst_cache in a new structure protected by RCU. For packets to
    remote real servers we will use the noref version of
    dst_cache; it will be valid while we are in an RCU read-side
    critical section, because dst_release for replaced dsts is now
    invoked after the grace period. Packets to local real servers
    that are passed to the local stack with NF_ACCEPT need a dst
    clone.
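
    A hedged sketch of the arrangement (field and helper names
    approximate the IPVS code):

    struct ip_vs_dest_dst {
        struct dst_entry   *dst_cache;   /* cached route */
        u32                dst_cookie;
        struct rcu_head    rcu_head;
    };

    /* packet path for remote real servers: the noref dst stays
     * valid for the whole RCU read-side section because replaced
     * dsts are released only after the grace period */
    rcu_read_lock();
    dest_dst = rcu_dereference(dest->dest_dst);
    if (dest_dst)
        skb_dst_set_noref(skb, dest_dst->dst_cache);
    /* transmit skb before leaving the read-side section */
    rcu_read_unlock();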

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Consolidate the PMTU checks, ICMP sending and
    skb_dst modification in __ip_vs_get_out_rt and
    __ip_vs_get_out_rt_v6. Now skb_dst is changed early
    to simplify the transmitters.

    Make sure update_pmtu is called only for local clients.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • We run in contexts like ip_rcv, ipv6_rcv and br_handle_frame
    and do not expect shared skbs.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • After commit 70e7341673 ("ipv4: Show that ip_send_reply() is
    purely unicast routine."), we do not need to reroute DNAT-ed
    traffic over loopback, because the reply uses the iph daddr
    and not rt_spec_dst.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Move and give better names to two functions:

    - ip_vs_dst_reset to __ip_vs_dst_cache_reset
    - __ip_vs_dev_reset to ip_vs_forget_dev

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • It was a bad idea to hide return statements in macros.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • The real server becomes unreachable on a down event, so there
    is no need to wait for device unregistration. This should help
    release dsts early, before dst->dev is replaced with lo.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Avoid replacing the cached route for a real server on every
    packet with a different TOS. I doubt that routing by TOS for
    real servers is used at all, so we should be better off with
    this optimization.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Rename skb_dst_set_noref to __skb_dst_set_noref and add a
    force flag, as suggested by David Miller. The new wrapper
    skb_dst_set_noref_force will force dst entries that are not
    cached to be attached as the skb dst without taking a
    reference, as long as the provided dst is reclaimed after an
    RCU grace period.
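
    Roughly, the wrappers look like this (a sketch; the real
    definitions are in the skb headers):

    static inline void skb_dst_set_noref(struct sk_buff *skb,
                                         struct dst_entry *dst)
    {
        __skb_dst_set_noref(skb, dst, false);
    }

    /* force: attach even dsts that are not cached, relying on the
     * caller to guarantee reclaim after an RCU grace period */
    static inline void skb_dst_set_noref_force(struct sk_buff *skb,
                                               struct dst_entry *dst)
    {
        __skb_dst_set_noref(skb, dst, true);
    }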

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Acked-by: David S. Miller
    Signed-off-by: Simon Horman

    Julian Anastasov