24 Sep, 2015

15 commits


22 Aug, 2015

3 commits

  • - mcast_group: configure the multicast address, now IPv6
    is supported too

    - mcast_port: configure the multicast port

    - mcast_ttl: configure the multicast TTL/HOP_LIMIT

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Allow setups with large MTU to send large sync packets by
    adding sync_maxlen parameter. The default value is now based
    on MTU but no more than 1500 for compatibility reasons.

    To avoid problems if MTU changes allow fragmentation by
    sending packets with DF=0. Problem reported by Dan Carpenter.

    Reported-by: Dan Carpenter
    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
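
A minimal sketch of the default described above, not the kernel implementation. It assumes the default payload length is the device MTU capped at 1500, minus IPv4 and UDP header overhead, and that an explicitly configured sync_maxlen takes precedence. The function names, constants, and the header subtraction are hypothetical.

```c
#include <assert.h>

/* Hypothetical constants for illustration; not taken from the kernel. */
#define COMPAT_MTU   1500u  /* compatibility cap named in the commit message */
#define IPV4_HDR_LEN 20u    /* IPv4 header without options */
#define UDP_HDR_LEN  8u

/* Default sync payload length: MTU-based, but never derived from more
 * than 1500 bytes, minus the assumed IPv4 + UDP header overhead. */
static unsigned int default_sync_maxlen(unsigned int mtu)
{
    unsigned int capped = mtu < COMPAT_MTU ? mtu : COMPAT_MTU;

    assert(capped > IPV4_HDR_LEN + UDP_HDR_LEN);
    return capped - (IPV4_HDR_LEN + UDP_HDR_LEN);
}

/* A non-zero configured value wins; zero means "use the default". */
static unsigned int effective_sync_maxlen(unsigned int configured,
                                          unsigned int mtu)
{
    return configured ? configured : default_sync_maxlen(mtu);
}
```

Under these assumptions, a jumbo-frame setup still defaults to a 1500-based payload unless sync_maxlen is raised explicitly.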
     
  • When the sync daemon is started we need to hold rtnl
    lock while calling ip_mc_join_group. Currently, we have
    a wrong locking order because the correct one is
    rtnl_lock->__ip_vs_mutex. It is implied from the usage
    of __ip_vs_mutex in ip_vs_dst_event() which is called
    under rtnl lock during NETDEV_* notifications.

    Fix the problem by calling rtnl_lock early only for the
    start_sync_thread call. As a bonus this fixes the usage of
    __dev_get_by_name which was not called under rtnl lock.

    This patch actually extends and depends on commit 54ff9ef36bdf
    ("ipv4, ipv6: kill ip_mc_{join, leave}_group and
    ipv6_sock_mc_{join, drop}").

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
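
The ordering rule above (rtnl first, then __ip_vs_mutex) can be illustrated with a small user-space model. This is a sketch only: it uses no kernel API, and the rank enum and tracker type are hypothetical. Each lock gets a rank, and an acquisition is rejected when it would take a lower-ranked lock while a higher-ranked one is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ranks: a lock may only be taken while holding
 * locks of strictly lower rank. */
enum lock_rank { RANK_NONE = 0, RANK_RTNL = 1, RANK_IPVS_MUTEX = 2 };

struct lock_tracker {
    enum lock_rank held;  /* highest rank currently held */
};

/* Returns false when the acquisition would invert the lock order. */
static bool tracker_acquire(struct lock_tracker *t, enum lock_rank r)
{
    assert(t != NULL);
    if (r <= t->held)
        return false;
    t->held = r;
    return true;
}

/* Drops every lock recorded in the tracker. */
static void tracker_release_all(struct lock_tracker *t)
{
    assert(t != NULL);
    t->held = RANK_NONE;
}
```

In this model, taking RANK_RTNL and then RANK_IPVS_MUTEX succeeds, while the reverse sequence is reported as an inversion.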
     

14 Jul, 2015

1 commit


11 May, 2015

2 commits


19 Mar, 2015

1 commit

  • ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
    in favor of their inner __ ones, which don't grab rtnl.

    As these functions need to operate on a locked socket, we can't be
    grabbing rtnl by then. It's too late and doing so causes reversed
    locking.

    So this patch:
    - moves rtnl handling to callers instead, while also fixing some
    reversed locking situations, such as in the vxlan and ipvs code.
    - renames __ ones to not have the __ mark:
    __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
    __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

10 Mar, 2015

1 commit


25 Feb, 2015

1 commit

  • Currently, when TCP/SCTP port reusing happens, IPVS will find the old
    entry and use it for the new one, behaving like a forced persistence.
    But if you consider a cluster with a heavy load of small connections,
    such reuse will happen often and may lead to suboptimal load
    balancing and might prevent a new node from getting a fair load.

    This patch introduces a new sysctl, conn_reuse_mode, that allows
    controlling how to proceed when port reuse is detected. The default
    value will allow rescheduling of new connections only if the old entry
    was in TIME_WAIT state for TCP or CLOSED for SCTP.

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Marcelo Ricardo Leitner
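
The default behaviour described above can be sketched as a predicate. This is an illustration under assumptions, not the kernel code: the enums and the function name are hypothetical, and only the default mode named in the commit message is modelled.

```c
#include <stdbool.h>

/* Hypothetical, simplified protocol and state sets. */
enum proto { PROTO_TCP, PROTO_SCTP, PROTO_UDP };
enum conn_state { ST_ESTABLISHED, ST_TIME_WAIT, ST_CLOSED };

/* Default policy from the commit message: on port reuse, schedule the
 * new connection afresh only if the old entry was in TIME_WAIT (TCP)
 * or CLOSED (SCTP). Otherwise the old entry keeps being used. */
static bool may_reschedule_on_reuse(enum proto p, enum conn_state old_state)
{
    switch (p) {
    case PROTO_TCP:
        return old_state == ST_TIME_WAIT;
    case PROTO_SCTP:
        return old_state == ST_CLOSED;
    default:
        return false;  /* port reuse handling is described for TCP/SCTP only */
    }
}
```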
     

23 Feb, 2015

1 commit

  • ip_vs_conn_fill_param_sync() gets in param.pe a module
    reference for persistence engine from __ip_vs_pe_getbyname()
    but forgets to put it. Problem occurs in backup for
    sync protocol v1 (2.6.39).

    Also, pe_data usually comes in sync messages for
    connection templates and ip_vs_conn_new() copies
    the pointer only in this case. Make sure pe_data
    is not leaked if it comes unexpectedly for normal
    connections. Leak can happen only if bogus messages
    are sent to backup server.

    Fixes: fe5e7a1efb66 ("IPVS: Backup, Adding Version 1 receive capability")
    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     

20 Nov, 2014

1 commit


16 Sep, 2014

2 commits

  • The assumption that dest af is equal to service af is now unreliable, so we
    must specify it manually so as not to copy just the first 4 bytes of a v6
    address or do an illegal read of 16 bytes on a v4 address.

    We "lie" in two places: for synchronization (which we will explicitly
    disallow from happening when we have heterogeneous pools) and for black
    hole addresses where there's no real dest.

    Signed-off-by: Alex Gartrell
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Alex Gartrell
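
A rough sketch of the idea, assuming a simplified address union rather than the kernel's own types: the caller passes the address family of the destination explicitly, and the copy length follows from it. The union and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical stand-in for an address that is either IPv4 or IPv6. */
union addr {
    uint8_t v4[4];
    uint8_t v6[16];
};

/* Number of meaningful bytes for the given address family. */
static size_t addr_len(int af)
{
    return af == AF_INET6 ? sizeof(((union addr *)0)->v6)
                          : sizeof(((union addr *)0)->v4);
}

/* Copy exactly as many bytes as the explicit family requires and
 * zero the rest, so no stale bytes from another family remain. */
static void addr_copy(int af, union addr *dst, const union addr *src)
{
    assert(af == AF_INET || af == AF_INET6);
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, addr_len(af));
}
```

Passing the service's family when the destination uses the other one would either truncate a v6 address to 4 bytes or read past a v4 address, which is the failure the commit message describes.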
     
  • We need to remove the assumption that virtual address family is the same as
    real address family in order to support heterogeneous services (that is,
    services with v4 vips and v6 backends or the opposite).

    Signed-off-by: Alex Gartrell
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Alex Gartrell
     

16 Jul, 2014

1 commit


27 Dec, 2013

1 commit

  • net/netfilter/ipvs/ip_vs_sync.c: In function 'sync_thread_master':
    net/netfilter/ipvs/ip_vs_sync.c:1640:8: warning: unused variable 'ret' [-Wunused-variable]

    Commit 35a2af94c7ce7130ca292c68b1d27fcfdb648f6b ("sched/wait: Make the
    __wait_event*() interface more friendly") changed how the interruption
    state is returned. However, sync_thread_master() ignores this state,
    now causing a compile warning.

    According to Julian Anastasov, this behavior is OK:

    "Yes, your patch looks ok to me. In the past we used ssleep() but IPVS
    users were confused why IPVS threads increase the load average. So, we
    switched to _interruptible calls and later the socket polling was
    added."

    Document this, as requested by Peter Zijlstra, to avoid precious developers
    disappearing in this pitfall in the future.

    Signed-off-by: Geert Uytterhoeven
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Geert Uytterhoeven
     

04 Oct, 2013

1 commit

  • Change all __wait_event*() implementations to match the corresponding
    wait_event*() signature for convenience.

    In particular this does away with the weird 'ret' logic. Since there
    are __wait_event*() users this requires we update them too.

    Reviewed-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131002092529.042563462@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

26 Jun, 2013

2 commits

  • Add sync_persist_mode flag to reduce sync traffic
    by syncing only persistent templates.

    Signed-off-by: Julian Anastasov
    Tested-by: Aleksey Chudov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Convert the SCTP state table, so that it is more readable.
    Change the states to be according to the diagram in RFC 2960
    and add more states suitable for middle box. Still, such
    change in states adds incompatibility if systems in sync
    setup include this change and others do not include it.

    With this change we also have proper transitions in INPUT-ONLY
    mode (DR/TUN) where we see packets only from client. Now
    we should not switch to 10-second CLOSED state at a time
    when we should stay in ESTABLISHED state.

    The short names for states are because we have 16-char space
    in ipvsadm and 11-char limit for the connection list format.
    It is a consequence of the TCP implementation where the longest
    state name is ESTABLISHED.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     

23 Apr, 2013

1 commit

  • struct ip_vs_sync_mesg and ip_vs_sync_mesg_v0 are both sent across the wire
    and used internally to store IPVS synchronisation messages.

    Up until now the scheme used has been to convert the size field
    to network byte order before sending a message on the wire and
    convert it to host byte order when receiving a message.

    This patch changes that scheme to always treat the field
    as being network byte order. This seems appropriate as
    the structure is sent across the wire. And by consistently
    treating the field as network byte order it is now possible
    to take advantage of sparse to flag any future misuse.

    Acked-by: Julian Anastasov
    Acked-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Simon Horman
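
The scheme above can be sketched in user space as follows. The header layout and accessor names are hypothetical simplifications, not the kernel structure: the size field is always stored in network byte order, and every read or update converts at the point of access.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified message header. size_be is ALWAYS kept in
 * network byte order, both in memory and on the wire. */
struct sync_mesg_hdr {
    uint8_t  reserved;
    uint8_t  syncid;
    uint16_t size_be;
};

static void mesg_set_size(struct sync_mesg_hdr *m, uint16_t host_len)
{
    m->size_be = htons(host_len);
}

static uint16_t mesg_get_size(const struct sync_mesg_hdr *m)
{
    return ntohs(m->size_be);
}

/* Grow the message: convert to host order, add, convert back. */
static void mesg_add_size(struct sync_mesg_hdr *m, uint16_t delta)
{
    uint32_t total = (uint32_t)ntohs(m->size_be) + delta;

    assert(total <= UINT16_MAX);
    m->size_be = htons((uint16_t)total);
}
```

Because the field never holds a host-order value, no conversion step is needed right before sending or right after receiving.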
     

02 Apr, 2013

3 commits

  • We used a global BH disable in LOCAL_OUT hook.
    Add _bh suffix to all places that need it and remove
    the disabling from LOCAL_OUT and sync code.

    Functions like ip_defrag need protection from
    BH, so add it. As for nf_nat_mangle_tcp_packet, it needs
    RCU lock.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • In previous commits the schedulers started to access
    svc->destinations with _rcu list traversal primitives
    because the IP_VS_WAIT_WHILE macro still plays the role of
    grace period. Now it is time to finish the updating part,
    i.e. adding and deleting of dests with _rcu suffix before
    removing the IP_VS_WAIT_WHILE in next commit.

    We use the same rule for conns as for the
    schedulers: dests can be searched in RCU read-side critical
    section where ip_vs_dest_hold can be called by ip_vs_bind_dest.

    Some things are not perfect, for example, calling
    functions like ip_vs_lookup_dest from updating code under
    RCU, just because we use some function both from reader
    and from updater.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • ip_vs_dest_hold will be used under RCU lock
    while ip_vs_dest_put can be called even after dest
    is removed from service, as it happens for conns and
    some schedulers.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     

28 Jan, 2013

1 commit


09 May, 2012

2 commits

  • Allow master and backup servers to use many threads
    for sync traffic. Add sysctl var "sync_ports" to define the
    number of threads. Every thread will use a single UDP port:
    thread 0 will use the default port 8848 while the last thread
    will use port 8848+sync_ports-1.

    The sync traffic for connections is scheduled to many
    master threads based on the cp address but one connection is
    always assigned to the same thread to avoid reordering of the
    sync messages.

    Remove ip_vs_sync_switch_mode because this check
    for sync mode change is still risky. Instead, check for mode
    change under sync_buff_lock.

    Make sure the backup socks do not block on reading.

    Special thanks to Aleksey Chudov for helping in all tests.

    Signed-off-by: Julian Anastasov
    Tested-by: Aleksey Chudov
    Signed-off-by: Simon Horman

    Pablo Neira Ayuso
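
A minimal sketch of the port layout and the per-connection thread choice described above. The port arithmetic follows the commit message. The hash on the connection pointer is an assumption made for illustration and is not the kernel's function. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SYNC_BASE_PORT 8848u  /* default port used by thread 0 */

/* Thread id uses port 8848 + id, so the last thread of sync_ports
 * threads uses 8848 + sync_ports - 1. */
static unsigned int sync_thread_port(unsigned int id, unsigned int sync_ports)
{
    assert(id < sync_ports);
    return SYNC_BASE_PORT + id;
}

/* Pick a thread from the connection's address. The same pointer always
 * maps to the same thread, which keeps that connection's sync messages
 * in order. The shift drops low bits that tend to be identical for
 * aligned allocations (an assumption, chosen for illustration). */
static unsigned int sync_thread_for_conn(const void *cp,
                                         unsigned int sync_ports)
{
    uintptr_t key = (uintptr_t)cp >> 6;

    assert(sync_ports > 0);
    return (unsigned int)(key % sync_ports);
}
```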
     
  • Add two new sysctl vars to control the sync rate with the
    main idea to reduce the rate for connection templates because
    currently it depends on the packet rate for controlled connections.
    This mechanism should be useful also for normal connections
    with high traffic.

    sync_refresh_period: in seconds, difference in reported connection
    timer that triggers new sync message. It can be used to
    avoid sync messages for the specified period (or half of
    the connection timeout if it is lower) if connection state
    is not changed from last sync.

    sync_retries: integer, 0..3, defines sync retries with period of
    sync_refresh_period/8. Useful to protect against loss of
    sync messages.

    Allow sysctl_sync_threshold to be used with
    sysctl_sync_period=0, so that only single sync message is sent
    if sync_refresh_period is also 0.

    Add new field "sync_endtime" in connection structure to
    hold the reported time when connection expires. The 2 lowest
    bits will represent the retry count.

    As the sysctl_sync_period now can be 0 use ACCESS_ONCE to
    avoid division by zero.

    Special thanks to Aleksey Chudov for being patient with me,
    for his extensive reports and helping in all tests.

    Signed-off-by: Julian Anastasov
    Tested-by: Aleksey Chudov
    Signed-off-by: Simon Horman

    Julian Anastasov
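
The sync_endtime encoding and the retry period above can be sketched as plain bit arithmetic. This is an illustration under assumptions: the helper names are hypothetical, and the time unit is whatever the caller uses.

```c
#include <assert.h>

#define RETRY_MASK 3UL  /* the 2 lowest bits hold the retry count */

/* Pack an expiry time and a retry count (0..3) into one word. The two
 * low bits of the time are sacrificed to make room for the count. */
static unsigned long endtime_pack(unsigned long expires, unsigned int retries)
{
    assert(retries <= 3);
    return (expires & ~RETRY_MASK) | retries;
}

/* Reported expiry time with the retry bits cleared. */
static unsigned long endtime_time(unsigned long v)
{
    return v & ~RETRY_MASK;
}

/* Retry count stored in the two low bits. */
static unsigned int endtime_retries(unsigned long v)
{
    return (unsigned int)(v & RETRY_MASK);
}

/* Retries are spaced at sync_refresh_period / 8. */
static unsigned long retry_period(unsigned long sync_refresh_period)
{
    return sync_refresh_period / 8;
}
```

For example, packing time 1003 with 2 retries stores 1002: the time part reads back as 1000 and the retry count as 2.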