09 May, 2012

24 commits

  • This patch removes ip_queue support, which was marked as obsolete
    years ago. The nfnetlink_queue module provides a more advanced
    user-space packet queueing mechanism.

    This patch also removes capability code included in SELinux that
    refers to ip_queue. Otherwise, we break compilation.

    Several warnings about this have been sent to the mailing list
    in the past month, and nobody has raised a hand to stop it
    with a strong argument.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Explicit helper attachment via the CT target is broken with NAT
    if non-standard ports are used. This problem was hidden behind
    the automatic helper assignment routine. Thus, it becomes more
    noticeable now that we can disable the automatic helper assignment
    with Eric Leblond's:

    9e8ac5a netfilter: nf_ct_helper: allow to disable automatic helper assignment

    Basically, nf_conntrack_alter_reply looks the helper up again
    if NAT is enabled. Unfortunately, we don't have the conntrack
    template at that point anymore.

    Since we don't want to rely on the automatic helper assignment,
    we can skip the second look-up and stick to the helper that was
    attached by iptables. With the CT target, the user is in full
    control of helper attachment, thus, the policy is to trust what
    the user explicitly configures via iptables (no automatic magic
    anymore).

    Interestingly, this bug was hidden by the automatic helper look-up
    code. But it can easily be triggered if you attach the helper on
    a non-standard port, e.g.

    iptables -I PREROUTING -t raw -p tcp --dport 8888 \
    -j CT --helper ftp

    and you have disabled the automatic helper assignment.

    I added the IPS_HELPER_BIT that allows us to differentiate between
    a helper that has been explicitly attached and those that have been
    automatically assigned. I didn't come up with a better solution
    (having backward compatibility in mind).

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
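
The policy described in this commit message (keep a helper that was attached explicitly, otherwise follow the automatic lookup) can be sketched in plain C. This is an illustrative model only, not the kernel code: `struct toy_conn`, `TOY_HELPER_EXPLICIT` (standing in for IPS_HELPER_BIT) and `toy_reassign_helper()` are hypothetical names, and the result of the automatic lookup is simply passed in as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status bit standing in for IPS_HELPER_BIT. */
#define TOY_HELPER_EXPLICIT (1u << 0)

/* Toy connection: only the fields this sketch needs. */
struct toy_conn {
    unsigned int status;   /* bit flags */
    const char *helper;    /* name of the attached helper, NULL if none */
};

/*
 * Called after NAT has altered the reply tuple. auto_helper is whatever
 * an automatic port-based lookup would return (NULL if nothing matches
 * or if automatic assignment is disabled).
 */
static void toy_reassign_helper(struct toy_conn *ct, const char *auto_helper)
{
    assert(ct != NULL);

    /* Explicitly attached via the CT target: keep what the user chose. */
    if (ct->status & TOY_HELPER_EXPLICIT)
        return;

    /* Otherwise follow the automatic lookup, which may clear the helper. */
    ct->helper = auto_helper;
}
```

With the ftp helper attached explicitly on port 8888, the model keeps it even when the automatic lookup finds nothing; without the bit, the helper would be dropped.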
     
  • This refreshes the "timeout" attribute in existing expectations if one is
    given.

    The use case for this would be for userspace helpers to extend the lifetime
    of the expectation when requested, as this is not possible right now
    without deleting/recreating the expectation.

    I use this specifically for forwarding DCERPC traffic through:

    DCERPC has a port mapper daemon that chooses a (seemingly) random port for
    future traffic to go to. We expect this traffic (with a reasonable
    timeout), but sometimes the port mapper will tell the client to continue
    using the same port. This allows us to extend the expectation accordingly.

    Signed-off-by: Kelvie Wong
    Signed-off-by: Pablo Neira Ayuso

    Kelvie Wong
     
  • To build ip_vs as a module, sysctl_rmem_max and sysctl_wmem_max
    need to be exported.

    The dependency was added by "ipvs: wakeup master thread" patch.

    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman
    Acked-by: David S. Miller
    Signed-off-by: Pablo Neira Ayuso

    Hans Schillstrom
     
  • Functions not referenced outside of a source file should be marked
    static to prevent them from being exposed globally.

    This quiets the sparse warnings:

    warning: symbol '__ipvs_proto_data_get' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Signed-off-by: Simon Horman

    H Hartley Sweeten
     
  • Functions not referenced outside of a source file should be marked
    static to prevent them from being exposed globally.

    This quiets the sparse warnings:

    warning: symbol 'ip_vs_ftp_init' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Signed-off-by: Simon Horman

    H Hartley Sweeten
     
  • cp->flags is marked volatile, but ip_vs_bind_dest
    can safely modify the flags, so save some CPU cycles by
    using a temporary variable.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Pablo Neira Ayuso
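
The temporary-variable idea can be illustrated with a small standalone sketch. It assumes, as the commit message states, that nothing else modifies the flags while they are being updated. The flag values and the names `struct toy_conn` and `toy_bind_flags()` are hypothetical and do not match the real IP_VS_CONN_F_* constants.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; the real IP_VS_CONN_F_* constants differ. */
#define TOY_F_FWD_MASK  0x0007u
#define TOY_F_INACTIVE  0x0100u

struct toy_conn {
    volatile unsigned int flags;
};

/*
 * Merge the forwarding method from dest_flags into cp->flags using a
 * local copy, so the volatile field is read once and written once
 * instead of being reloaded and stored for every operation.
 */
static void toy_bind_flags(struct toy_conn *cp, unsigned int dest_flags)
{
    unsigned int flags;

    assert(cp != NULL);
    flags = cp->flags;                      /* single volatile read */
    flags &= ~TOY_F_FWD_MASK;               /* plain local arithmetic */
    flags |= dest_flags & TOY_F_FWD_MASK;
    cp->flags = flags;                      /* single volatile write */
}
```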
     
  • Allow master and backup servers to use many threads
    for sync traffic. Add sysctl var "sync_ports" to define the
    number of threads. Every thread will use a single UDP port:
    thread 0 will use the default port 8848, while the last thread
    will use port 8848+sync_ports-1.

    The sync traffic for connections is scheduled to many
    master threads based on the cp address, but one connection is
    always assigned to the same thread to avoid reordering of the
    sync messages.

    Remove ip_vs_sync_switch_mode because this check
    for sync mode change is still risky. Instead, check for mode
    change under sync_buff_lock.

    Make sure the backup socks do not block on reading.

    Special thanks to Aleksey Chudov for helping in all tests.

    Signed-off-by: Julian Anastasov
    Tested-by: Aleksey Chudov
    Signed-off-by: Simon Horman

    Pablo Neira Ayuso
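
A minimal sketch of the port layout and the per-connection thread choice described in this commit message follows. The port arithmetic comes straight from the text; the way a thread is derived from the connection's address (a shift followed by a modulo) is an assumption made for illustration, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SYNC_BASE_PORT 8848u   /* default port named in the text */

/* UDP port used by sync thread `id` (0-based) out of `sync_ports`. */
static unsigned int toy_sync_port(unsigned int id, unsigned int sync_ports)
{
    assert(sync_ports > 0 && id < sync_ports);
    return TOY_SYNC_BASE_PORT + id;
}

/*
 * Pick a thread from a stable per-connection key (here: the address of
 * the connection object). The same key always maps to the same thread,
 * so sync messages for one connection are never reordered across
 * threads. The shift of 6 drops low bits that tend to be identical for
 * aligned objects; that width is an assumption of this sketch.
 */
static unsigned int toy_select_thread(uintptr_t conn_key,
                                      unsigned int sync_ports)
{
    assert(sync_ports > 0);
    return (unsigned int)((conn_key >> 6) % sync_ports);
}
```

For example, with sync_ports set to 4 the threads would use ports 8848 through 8851.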
     
  • Add two new sysctl vars to control the sync rate with the
    main idea to reduce the rate for connection templates because
    currently it depends on the packet rate for controlled connections.
    This mechanism should be useful also for normal connections
    with high traffic.

    sync_refresh_period: in seconds, the difference in the reported
    connection timer that triggers a new sync message. It can be used to
    avoid sync messages for the specified period (or half of
    the connection timeout if it is lower) if the connection state
    has not changed since the last sync.

    sync_retries: integer, 0..3, defines sync retries with period of
    sync_refresh_period/8. Useful to protect against loss of
    sync messages.

    Allow sysctl_sync_threshold to be used with
    sysctl_sync_period=0, so that only a single sync message is sent
    if sync_refresh_period is also 0.

    Add new field "sync_endtime" in connection structure to
    hold the reported time when connection expires. The 2 lowest
    bits will represent the retry count.

    As sysctl_sync_period can now be 0, use ACCESS_ONCE to
    avoid division by zero.

    Special thanks to Aleksey Chudov for being patient with me,
    for his extensive reports and helping in all tests.

    Signed-off-by: Julian Anastasov
    Tested-by: Aleksey Chudov
    Signed-off-by: Simon Horman

    Julian Anastasov
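
Two details from this commit message lend themselves to a small sketch: keeping the retry count in the 2 lowest bits of the reported end time, and reading the period exactly once before dividing by it. The code below is a standalone model under those stated ideas, not the kernel implementation; every `toy_` name is hypothetical and the exact threshold comparison is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RETRY_MASK 3UL   /* 2 lowest bits: retry count 0..3 */

/* Store the retry count in the low bits of the reported end time. */
static unsigned long toy_pack_endtime(unsigned long end_time,
                                      unsigned int retries)
{
    assert(retries <= TOY_RETRY_MASK);
    return (end_time & ~TOY_RETRY_MASK) | retries;
}

static unsigned long toy_endtime(unsigned long packed)
{
    return packed & ~TOY_RETRY_MASK;
}

static unsigned int toy_retries(unsigned long packed)
{
    return (unsigned int)(packed & TOY_RETRY_MASK);
}

/* Retries are spaced sync_refresh_period/8 apart, as the text says. */
static unsigned long toy_retry_interval(unsigned long refresh_period)
{
    return refresh_period / 8;
}

/*
 * Take one snapshot of a tunable that may change concurrently, then
 * test and use that same snapshot, so a value of 0 can never reach the
 * modulo operation.
 */
static bool toy_sync_due(unsigned long pkts, unsigned long threshold,
                         const volatile unsigned long *period_ptr)
{
    unsigned long period = *period_ptr;   /* read once */

    if (period == 0)
        return pkts == threshold;         /* single sync message */
    return pkts % period == threshold;
}
```

Packing costs up to 3 units of timer resolution, which is the trade-off for not adding a separate retry field.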
     
  • A high rate of sync messages in master can lead to
    overflowing the socket buffer and dropping the messages.
    A fixed sleep of 1 second without wakeup events is not suitable
    for loaded masters.

    Use delayed_work to schedule sending for queued messages
    and limit the delay to IPVS_SYNC_SEND_DELAY (20ms). This will
    reduce the rate of wakeups, but to avoid sending long bursts we
    wake up the master thread after IPVS_SYNC_WAKEUP_RATE (8) messages.

    Add hard limit for the queued messages before sending
    by using "sync_qlen_max" sysctl var. It defaults to 1/32 of
    the memory pages but actually represents number of messages.
    It will protect us from allocating large parts of memory
    when the sending rate is lower than the queuing rate.

    As suggested by Pablo, add new sysctl var
    "sync_sock_size" to configure the SNDBUF (master) or
    RCVBUF (slave) socket limit. Default value is 0 (preserve
    system defaults).

    Change the master thread to detect and block on
    SNDBUF overflow, so that we do not drop messages when
    the socket limit is low but the sync_qlen_max limit is
    not reached. On ENOBUFS or other errors just drop the
    messages.

    Change master thread to enter TASK_INTERRUPTIBLE
    state early, so that we do not miss wakeups due to messages or
    kthread_should_stop event.

    Thanks to Pablo Neira Ayuso for his valuable feedback!

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Pablo Neira Ayuso
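
The queueing policy in this commit message (arm a delayed send when the queue becomes non-empty, wake the sender every 8 messages, drop once the hard limit is reached) can be modelled as a small decision function. This is a hedged sketch and not the kernel code: the enum, the struct and `toy_enqueue()` are hypothetical, and the real implementation may order these checks differently.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_WAKEUP_RATE 8u   /* mirrors IPVS_SYNC_WAKEUP_RATE from the text */

enum toy_action {
    TOY_DROP,        /* queue is at its hard limit */
    TOY_QUEUED,      /* queued, nothing else to do */
    TOY_ARM_TIMER,   /* first message: schedule the delayed send */
    TOY_WAKE_NOW     /* enough messages piled up: wake the sender */
};

struct toy_queue {
    unsigned int len;           /* messages currently queued */
    unsigned int since_wakeup;  /* messages queued since the last wakeup */
    unsigned int qlen_max;      /* hard limit, like sync_qlen_max */
};

/* Account for one new message and report what the caller should do. */
static enum toy_action toy_enqueue(struct toy_queue *q)
{
    enum toy_action act = TOY_QUEUED;

    assert(q != NULL);
    if (q->len >= q->qlen_max)
        return TOY_DROP;

    if (q->len == 0)
        act = TOY_ARM_TIMER;
    q->len++;

    if (++q->since_wakeup == TOY_WAKEUP_RATE) {
        q->since_wakeup = 0;
        act = TOY_WAKE_NOW;
    }
    return act;
}
```

The timer bounds latency for a trickle of messages, while the message counter bounds burst size under load.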
     
  • As the goal is to mirror the inactconns/activeconns
    counters in the backup server, make sure the cp->flags are
    updated even if cp is still not bound to dest. If cp->flags
    are not updated ip_vs_bind_dest will rely only on the initial
    flags when updating the counters. To avoid mistakes and
    complicated checks for protocol state rely only on the
    IP_VS_CONN_F_INACTIVE bit when updating the counters.

    Signed-off-by: Julian Anastasov
    Tested-by: Aleksey Chudov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Initially, when the synced connection is created, we
    use the forwarding method provided by the master, but once we
    bind to a destination it can be changed. As a result, we must
    update the application and the transmitter.

    As ip_vs_try_bind_dest is called always for connections
    that require dest binding, there is no need to validate the
    cp and dest pointers.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • As the IP_VS_CONN_F_INACTIVE bit is properly set
    in cp->flags for all kinds of connections, we do not need to
    add special checks for synced connections when updating
    the activeconns/inactconns counters for the first time. Now
    the logic will look just like in ip_vs_unbind_dest.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • As IP_VS_CONN_F_NOOUTPUT is derived from the
    forwarding method we should get it from conn_flags just
    like we do it for IP_VS_CONN_F_FWD_MASK bits when binding
    to real server.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Use GFP_KERNEL instead of GFP_ATOMIC when registering an ipvs protocol.

    This is safe since it will always run from a process context.

    Signed-off-by: Sasha Levin
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman
    Signed-off-by: Pablo Neira Ayuso

    Sasha Levin
     
  • Schedulers are initialized and bound to services only
    on commands.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Schedulers are initialized and bound to services only
    on commands.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Schedulers are initialized and bound to services only
    on commands.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Schedulers are initialized and bound to services only
    on commands.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • Schedulers are initialized and bound to services only
    on commands.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • They are called only on initialization.

    Signed-off-by: Julian Anastasov
    Signed-off-by: Hans Schillstrom
    Signed-off-by: Simon Horman

    Julian Anastasov
     
  • If the net.bridge.bridge-nf-filter-vlan-tagged sysctl is enabled, bridge
    netfilter removes the vlan header temporarily and then feeds the packet
    to ip(6)tables.

    When the new "bridge-nf-pass-vlan-input-device" sysctl is on
    (default off), bridge netfilter will also set the
    in-interface to the vlan interface, if such an interface exists.

    This is needed to make iptables REDIRECT target work with
    "vlan-on-top-of-bridge" setups and to allow use of "iptables -i" to
    match the vlan device name.

    Also update Documentation with current brnf default settings.

    Signed-off-by: Florian Westphal
    Acked-by: Bart De Schuymer
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This patch allows you to disable automatic conntrack helper
    lookup based on TCP/UDP ports, e.g.

    echo 0 > /proc/sys/net/netfilter/nf_conntrack_helper

    [ Note: flows that already got a helper will keep using it even
    if automatic helper assignment has been disabled ]

    Once this behaviour has been disabled, you have to explicitly
    use the iptables CT target to attach a helper to flows.

    There are good reasons to stop supporting automatic helper
    assignment, for further information, please read:

    http://www.netfilter.org/news.html#2012-04-03

    This patch also adds one message to inform that automatic helper
    assignment is deprecated and will be removed soon (it is
    printed only once, with the first flow that gets a helper attached,
    to make it as unobtrusive as possible).

    Signed-off-by: Eric Leblond
    Signed-off-by: Pablo Neira Ayuso

    Eric Leblond
     
  • * Remove the useless initialization of the ret variable
    * Merge similar lines of code so that the functions' code
    flow becomes more straightforward

    Signed-off-by: Tony Zelenoff
    Signed-off-by: Pablo Neira Ayuso

    Tony Zelenoff
     

08 May, 2012

2 commits

  • Conflicts:
    drivers/net/ethernet/intel/e1000e/param.c
    drivers/net/wireless/iwlwifi/iwl-agn-rx.c
    drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
    drivers/net/wireless/iwlwifi/iwl-trans.h

    Resolved the iwlwifi conflict with mainline using 3-way diff posted
    by John Linville and Stephen Rothwell. In 'net' we added a bug
    fix to make iwlwifi report a more accurate skb->truesize but this
    conflicted with RX path changes that happened meanwhile in net-next.

    In e1000e a conflict arose in the validation code for settings of
    adapter->itr. 'net-next' had more sophisticated logic so that
    logic was used.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Until now, struct ip_mreq has not been recognized and it was handled
    as if it were struct in_addr. That means imr_multiaddr was copied to
    imr_address. So recognize struct ip_mreq here and copy it correctly.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
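
The fix amounts to choosing the layout of the option buffer from its length. The sketch below models that dispatch in user space with locally defined structs; the field names follow the real `struct in_addr`, `struct ip_mreq` and `struct ip_mreqn`, but the `toy_` types and `toy_parse_mcast_if()` are hypothetical stand-ins and not the kernel's setsockopt code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the user-visible layouts, for illustration only. */
struct toy_in_addr { uint32_t s_addr; };
struct toy_ip_mreq {
    struct toy_in_addr imr_multiaddr;
    struct toy_in_addr imr_interface;
};
struct toy_ip_mreqn {
    struct toy_in_addr imr_multiaddr;
    struct toy_in_addr imr_address;
    int imr_ifindex;
};

/*
 * Normalise the option buffer into a toy_ip_mreqn, picking the layout
 * from optlen. Returns 0 on success, -1 if the buffer is too short.
 */
static int toy_parse_mcast_if(const void *optval, size_t optlen,
                              struct toy_ip_mreqn *out)
{
    assert(optval != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    if (optlen >= sizeof(struct toy_ip_mreqn)) {
        memcpy(out, optval, sizeof(*out));
    } else if (optlen >= sizeof(struct toy_ip_mreq)) {
        struct toy_ip_mreq mreq;

        memcpy(&mreq, optval, sizeof(mreq));
        out->imr_multiaddr = mreq.imr_multiaddr;
        /* The interface address is the second field, not the first. */
        out->imr_address = mreq.imr_interface;
    } else if (optlen >= sizeof(struct toy_in_addr)) {
        memcpy(&out->imr_address, optval, sizeof(struct toy_in_addr));
    } else {
        return -1;
    }
    return 0;
}
```

Without the middle branch, an 8-byte buffer would fall through to the in_addr case and its first 4 bytes (the multicast group) would be taken as the interface address, which is the bug described in this commit message.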
     

07 May, 2012

3 commits

  • With the recent changes to how we compute the skb truesize, we are
    probably going to have a lot of calls to skb_end_pointer - skb->head.
    Instead of open-coding that all over the place, it makes more sense
    to provide a separate inline, skb_end_offset(skb). That way we can
    return the correct value without relying on gcc to optimize away
    skb->head - skb->head.

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
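
The point about cancelling out skb->head - skb->head is easier to see with two toy layouts: one that stores the end of the buffer as a pointer and one that stores it as an offset. The structs and helpers below are hypothetical stand-ins and not the kernel's sk_buff definitions.

```c
#include <assert.h>

/* Variant A: the end of the buffer is stored as a pointer. */
struct toy_skb_ptr {
    unsigned char *head;
    unsigned char *end;
};

/* Variant B: the end of the buffer is stored as an offset from head. */
struct toy_skb_off {
    unsigned char *head;
    unsigned int end;
};

static unsigned int toy_end_offset_ptr(const struct toy_skb_ptr *skb)
{
    assert(skb->end >= skb->head);
    return (unsigned int)(skb->end - skb->head);
}

static unsigned char *toy_end_pointer_off(const struct toy_skb_off *skb)
{
    return skb->head + skb->end;
}

static unsigned int toy_end_offset_off(const struct toy_skb_off *skb)
{
    /* No "head + end - head" round trip: the offset is already stored. */
    return skb->end;
}
```

A caller that only needs the size uses the offset helper in both layouts and never has to know which representation is compiled in.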
     
  • Since there is now only one spot that actually uses "fastpath" there isn't
    much point in carrying it. Instead we can just use a check for skb_cloned
    to verify if we can perform the fast-path free for the head or not.

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • The fast-path for pskb_expand_head contains a check where the size plus the
    unaligned size of skb_shared_info is compared against the size of the data
    buffer. This code path has two issues. First is the fact that after the
    recent changes by Eric Dumazet to __alloc_skb and build_skb the shared info
    is always placed in the optimal spot for a buffer size making this check
    unnecessary. The second issue is the fact that the check doesn't take into
    account the aligned size of shared info. As a result the code burns cycles
    doing a memcpy with nothing actually being shifted.

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     

05 May, 2012

1 commit

  • It appears some networks play bad games with the two bits reserved for
    ECN. This can trigger false congestion notifications and very slow
    transfers.

    Since RFC 3168 (6.1.1) forbids SYN packets to carry ECT bits, we can
    disable TCP ECN negotiation if we receive mangled ECT bits in
    the SYN packet.

    Signed-off-by: Eric Dumazet
    Cc: Perry Lorier
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Wilmer van der Gaast
    Cc: Ankur Jain
    Cc: Tom Herbert
    Cc: Dave Täht
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
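
The check described in this commit message can be sketched as a pure function over the IP DS field and the TCP flags of an incoming SYN. It assumes the usual encoding in which the ECN field is the two low bits of the TOS / traffic class byte and an ECN-setup SYN carries both ECE and CWR; the `toy_` names are hypothetical and this is not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ECN_MASK 0x03u   /* two low bits of the TOS / traffic class byte */

/* True when the ECN field is Not-ECT (binary 00). */
static bool toy_is_not_ect(uint8_t dsfield)
{
    return (dsfield & TOY_ECN_MASK) == 0;
}

/*
 * Decide whether to negotiate ECN for an incoming SYN.
 * ece and cwr are the TCP header flags of that SYN; both set means the
 * peer asks for ECN. A SYN whose IP header already carries ECN bits is
 * treated as mangled in transit, so ECN is not negotiated for it.
 */
static bool toy_accept_ecn(uint8_t dsfield, bool ece, bool cwr)
{
    bool wants_ecn = ece && cwr;

    return wants_ecn && toy_is_not_ect(dsfield);
}
```

Only the ECN bits matter here; a DSCP value in the upper six bits of the same byte does not affect the decision.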
     

04 May, 2012

4 commits

  • Use qdisc_drop() helper where possible.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Pull networking fixes from David Miller:

    1) Transfer padding was wrong for full-speed USB in ASIX driver, fix
    from Ingo van Lil.

    2) Propagate the negative packet offset fix into the PowerPC BPF JIT.
    From Jan Seiffert.

    3) dl2k driver's private ioctls were letting unprivileged tasks make
    MII writes and other ugly bits like that. Fix from Jeff Mahoney.

    4) Fix TX VLAN and RX packet drops in ucc_geth, from Joakim Tjernlund.

    5) OOPS and network namespace fixes in IPVS from Hans Schillstrom and
    Julian Anastasov.

    6) Fix races and sleeping in locked context bugs in drop_monitor, from
    Neil Horman.

    7) Fix link status indication in smsc95xx driver, from Paolo Pisati.

    8) Fix bridge netfilter OOPS, from Peter Huang.

    9) L2TP sendmsg can return on error conditions with the socket lock
    held, oops. Fix from Sasha Levin.

    10) udp_diag should return meaningful values for socket memory usage,
    from Shan Wei.

    11) Eric Dumazet is so awesome he gets his own section:

    Socket memory cgroup code (I never should have applied those
    patches, grumble...) made erroneous changes to
    sk_sockets_allocated_read_positive(). It was changed to
    use percpu_counter_sum_positive (which requires BH disabling)
    instead of percpu_counter_read_positive (which does not).
    Revert back to avoid crashes and lockdep warnings.

    Adjust the default tcp_adv_win_scale and tcp_rmem[2] values
    to fix throughput regressions. This is necessary as a result
    of our more precise skb->truesize tracking.

    Fix SKB leak in netem packet scheduler.

    12) New device IDs for various bluetooth devices, from Manoj Iyer,
    AceLan Kao, and Steven Harms.

    13) Fix command completion race in ipw2200, from Stanislav Yakovlev.

    14) Fix rtlwifi oops on unload, from Larry Finger.

    15) Fix hard_mtu when adjusting hard_header_len in smsc95xx driver.
    From Stephane Fillod.

    16) ehea driver registers its IRQ before all the necessary state is
    set up, resulting in crashes. Fix from Thadeu Lima de Souza
    Cascardo.

    17) Fix PHY connection failures in davinci_emac driver, from Anatolij
    Gustschin.

    18) Missing break; in switch statement in bluetooth's
    hci_cmd_complete_evt(). Fix from Szymon Janc.

    19) Fix queue programming in iwlwifi, from Johannes Berg.

    20) Interrupt throttling defaults not being actually programmed into the
    hardware, fix from Jeff Kirsher and Ying Cai.

    21) TLAN driver SKB encoding in descriptor busted on 64-bit, fix from
    Benjamin Poirier.

    22) Fix blind status block RX producer pointer deref in TG3 driver, from
    Matt Carlson.

    23) Promisc and multicast are busted on ehea, fixes from Thadeu Lima de
    Souza Cascardo.

    24) Fix crashes in 6lowpan, from Alexander Smirnov.

    25) tcp_complete_cwr() needs to be careful to not rewind the CWND to
    ssthresh if ssthresh has the "infinite" value. Fix from Yuchung
    Cheng.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
    sungem: Fix WakeOnLan
    tcp: change tcp_adv_win_scale and tcp_rmem[2]
    net: l2tp: unlock socket lock before returning from l2tp_ip_sendmsg
    drop_monitor: prevent init path from scheduling on the wrong cpu
    usbnet: fix failure handling in usbnet_probe
    usbnet: fix leak of transfer buffer of dev->interrupt
    ucc_geth: Add 16 bytes to max TX frame for VLANs
    net: ucc_geth, increase no. of HW RX descriptors
    netem: fix possible skb leak
    sky2: fix receive length error in mixed non-VLAN/VLAN traffic
    sky2: propogate rx hash when packet is copied
    net: fix two typos in skbuff.h
    cxgb3: Don't call cxgb_vlan_mode until q locks are initialized
    ixgbe: fix calling skb_put on nonlinear skb assertion bug
    ixgbe: Fix a memory leak in IEEE DCB
    igbvf: fix the bug when initializing the igbvf
    smsc75xx: enable mac to detect speed/duplex from phy
    smsc75xx: declare smsc75xx's MII as GMII capable
    smsc75xx: fix phy interrupt acknowledge
    smsc75xx: fix phy init reset loop
    ...

    Linus Torvalds
     
  • This patch adds support for a skb_head_is_locked helper function. It is
    meant to be used any time we are considering transferring the head from
    skb->head to a paged frag. If the head is locked it means we cannot remove
    the head from the skb so it must be copied or we must take the skb as a
    whole.

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • GRO is very optimistic in skb truesize estimates, only taking into
    account the used part of fragments.

    Be conservative, and use more precise computation, so that bloated GRO
    skbs can be collapsed eventually.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Jeff Kirsher
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 May, 2012

6 commits

  • This change cleans up the last bits of tcp_try_coalesce so that we only
    need one goto which jumps to the end of the function. The idea is to make
    the code more readable by putting things in a linear order so that we start
    execution at the top of the function, and end it at the bottom.

    I also made a slight tweak to the code for handling frags when we are a
    clone. Instead of structuring it as "if (clone) loop, else nr_frags = 0",
    I changed the logic so that if (!clone) we just set the number of frags
    to 0, which disables the for loop anyway.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This change reorders the code related to the use of an skb->head_frag so it
    is placed before we check the rest of the frags. This allows the code to
    read more linearly instead of like some sort of loop.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch addresses several issues in the way we were tracking the
    truesize in tcp_try_coalesce.

    First it was using ksize which prevents us from having a 0 sized head frag
    and getting a usable result. To resolve that this patch uses the end
    pointer which is set based off either ksize, or the frag_size supplied in
    build_skb. This allows us to compute the original truesize of the entire
    buffer and remove that value leaving us with just what was added as pages.

    The second issue was the use of skb->len if there is a mergeable head frag.
    We should only need to remove the size of a data-aligned sk_buff from our
    current skb->truesize to compute the delta for a buffer with a reused head.
    By using skb->len the value of truesize was being artificially reduced
    which means that head frags could use more memory than buffers using
    standard allocations.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Reported-by: Stephen Rothwell
    Signed-off-by: David S. Miller

    David S. Miller
     
  • This change is meant to prevent stealing the skb->head to use as a page in
    the event that the skb->head was cloned. This allows the other clones to
    track each other via shinfo->dataref.

    Without this we break down to two methods for tracking the reference count,
    one being dataref, the other being the page count. As a result it becomes
    difficult to track how many references there are to skb->head.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Extend tcp coalescing by implementing it in tcp_queue_rcv(), the main
    receiver function when the application is not blocked in recvmsg().

    Function tcp_queue_rcv() is moved a bit to allow it to be called from
    tcp_data_queue().

    This gives good results, especially if GRO could not kick in and if
    the skb head is a fragment.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet