07 Oct, 2015

1 commit

  • Eric's net namespace changes in 1b75097dd7a26 leave net unreferenced if
    CONFIG_IP_VS_IPV6 is not enabled:

    ../net/netfilter/ipvs/ip_vs_core.c: In function ‘ip_vs_out’:
    ../net/netfilter/ipvs/ip_vs_core.c:1177:14: warning: unused variable ‘net’ [-Wunused-variable]

    After the net refactoring there is only one user, so push the reference
    down to that user. Although the resulting line slightly exceeds 80
    columns, this seems to be the best change.

    Fixes: 1b75097dd7a26("ipvs: Pass ipvs into ip_vs_out")
    Signed-off-by: David Ahern
    Acked-by: Julian Anastasov
    [horms: updated subject]
    Signed-off-by: Simon Horman

    David Ahern
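
The shape of this fix can be sketched in plain C. This is a hypothetical userspace illustration, not the ip_vs_core.c code: `struct net`, `struct ipvs`, `lookup_v6()`, `handle_out()` and `CONFIG_DEMO_IPV6` are stand-in names. A local that is only read inside an #ifdef block triggers -Wunused-variable when the option is off, whereas dereferencing at the single call site does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real IPVS code. */
struct net { int id; };
struct ipvs { struct net *net; };

#ifdef CONFIG_DEMO_IPV6
/* The only consumer of the namespace pointer in this sketch. */
static int lookup_v6(const struct net *net)
{
	return net->id;
}
#endif

/*
 * Instead of declaring "struct net *net = ipvs->net;" at the top of the
 * function (unused when CONFIG_DEMO_IPV6 is off), dereference ipvs->net
 * directly at its single user.
 */
static int handle_out(const struct ipvs *ipvs)
{
	assert(ipvs != NULL);
#ifdef CONFIG_DEMO_IPV6
	return lookup_v6(ipvs->net);
#else
	return 0;
#endif
}
```

Built without CONFIG_DEMO_IPV6, this compiles without an unused-variable warning because no local is left dangling outside the #ifdef.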
     

05 Oct, 2015

9 commits

  • This patch allows conntrack information to be included with the
    packet that is sent to user space via NFLOG, so that a user-space
    program can obtain NAT information from the NFULA_CT attribute.

    Including the conntrack information is optional; as with NFQUEUE,
    it is enabled by setting the NFULNL_CFG_F_CONNTRACK flag in the
    NFULA_CFG_FLAGS attribute.

    Signed-off-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Pablo Neira Ayuso

    Ken-ichirou MATSUZAWA
     
  • get_ct does not update its skb argument, and the only user of
    nfnl_ct_hook is currently nfqueue, so we can add a const qualifier.

    Signed-off-by: Ken-ichirou MATSUZAWA

    Ken-ichirou MATSUZAWA
     
  • The infrastructure for attaching conntrack information is now generic,
    and the previous patch renamed it to use `glue'. This patch updates
    the Kconfig symbol name and adds a dependency on NF_CT_NETLINK.

    Signed-off-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Pablo Neira Ayuso

    Ken-ichirou MATSUZAWA
     
  • The idea of this patch series is to attach conntrack information to
    nflog, as nfqueue already does. The nfqueue infrastructure for attaching
    conntrack info is generic, so rename it to a generic name, glue.

    Signed-off-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Pablo Neira Ayuso

    Ken-ichirou MATSUZAWA
     
  • Remove __nf_conntrack_find() from headers.

    Fixes: dcd93ed4cd1 ("netfilter: nf_conntrack: remove dead code")
    Signed-off-by: Flavio Leitner
    Signed-off-by: Pablo Neira Ayuso

    Flavio Leitner
     
  • The __build_packet_message function fills a nfulnl_msg_packet_timestamp
    structure that uses 64-bit seconds and is therefore y2038 safe, but
    it uses an intermediate 'struct timespec' which is not.

    This trivially changes the code to use 'struct timespec64' instead,
    to correct the result on 32-bit architectures.

    This is a copy and paste of Arnd's original patch for nfnetlink_log.

    Suggested-by: Arnd Bergmann
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
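
Why the intermediate type matters can be shown in a small userspace sketch. The struct and function names below (`ts32`, `ts64`, `wire_timestamp`, `fill_via_ts32`, `fill_via_ts64`) are illustrative stand-ins, not the kernel's definitions, and the sketch assumes a 32-bit seconds field as found on 32-bit architectures.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a timespec whose seconds field is 32 bits wide. */
struct ts32 { int32_t tv_sec; int32_t tv_nsec; };
/* Stand-in for a timespec64-style value with 64-bit seconds. */
struct ts64 { int64_t tv_sec; int32_t tv_nsec; };
/* Stand-in for a message timestamp that carries 64-bit seconds. */
struct wire_timestamp { uint64_t sec; uint64_t usec; };

/* Going through the 32-bit intermediate loses seconds past 2^31 - 1. */
static struct wire_timestamp fill_via_ts32(int64_t now_sec, int32_t now_nsec)
{
	assert(now_nsec >= 0 && now_nsec < 1000000000);
	struct ts32 ts = { (int32_t)now_sec, now_nsec };  /* seconds truncated */
	struct wire_timestamp out = {
		(uint64_t)(int64_t)ts.tv_sec,
		(uint64_t)(ts.tv_nsec / 1000)
	};
	return out;
}

/* Keeping 64-bit seconds end to end preserves the value. */
static struct wire_timestamp fill_via_ts64(int64_t now_sec, int32_t now_nsec)
{
	assert(now_nsec >= 0 && now_nsec < 1000000000);
	struct ts64 ts = { now_sec, now_nsec };
	struct wire_timestamp out = {
		(uint64_t)ts.tv_sec,
		(uint64_t)(ts.tv_nsec / 1000)
	};
	return out;
}
```

The destination field is wide enough in both cases; only the intermediate decides whether a timestamp after January 2038 survives.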
     
  • Simon Horman says:

    ====================
    Third Round of IPVS Updates for v4.4

    please consider this build fix from Eric Biederman which resolves
    a build problem introduced in his excellent work to clean up IPVS which
    you recently pulled: it's queued up for v4.4 so no need to worry
    about earlier kernel versions.

    I have another minor cleanup, to fix a build warning, pending.
    However, I wanted to send this one to you now as it has hit nf-next,
    net-next and in turn next, and a slow trickle of bug reports is appearing.
    ====================

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Now that we have integrated the ct glue code into nfnetlink_queue without
    introducing dependencies with the conntrack code.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • The original intention was to avoid dependencies between nfnetlink_queue and
    conntrack without ifdef pollution. However, we can achieve this by moving the
    conntrack dependent code into ctnetlink and keeping some glue code to access the
    nfq_ct indirection from nfqueue.

    After this patch, the nfq_ct indirection is always compiled into the netfilter
    core to avoid polluting nfqueue with ifdefs. Thus, if nf_conntrack is not
    compiled, this wastes only 8 bytes of memory on x86_64.

    This patch also adds ctnetlink_nfqueue_seqadj() so that the nf_conn
    structure layout is not exposed to nf_queue, which would create another
    compile-time dependency on nf_conntrack.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
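
The indirection described above can be sketched as a single always-present pointer to an operations table. This is a simplified, single-threaded userspace illustration: `ct_hook`, `struct ct_hook_ops`, `queue_extra_size()` and the demo provider are hypothetical names, and the RCU protection a kernel hook would need is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table that a conntrack-like module would provide. */
struct ct_hook_ops {
	size_t (*build_size)(const void *ct);
};

/*
 * The indirection itself: one pointer that always exists in the core.
 * It costs one pointer (8 bytes on x86_64) even if no provider is loaded.
 */
static const struct ct_hook_ops *ct_hook;

/* Caller in the queue code: no #ifdef, just a NULL check at run time. */
static size_t queue_extra_size(const void *ct)
{
	const struct ct_hook_ops *ops = ct_hook;

	if (ops == NULL || ct == NULL)
		return 0;
	return ops->build_size(ct);
}

/* Example provider, registered when the optional module "loads". */
static size_t demo_build_size(const void *ct)
{
	(void)ct;
	return 64;
}

static const struct ct_hook_ops demo_ops = { .build_size = demo_build_size };

/* Register a provider, or pass NULL to unregister. */
static void ct_hook_register(const struct ct_hook_ops *ops)
{
	assert(ops == NULL || ops->build_size != NULL);
	ct_hook = ops;
}
```

The caller never names the provider's data structures, so their layout stays private to the module that registers the hook.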
     

04 Oct, 2015

1 commit

  • Before letting request sockets be put in the regular TCP/DCCP
    ehash table, we need to either:

    - add the SLAB_DESTROY_BY_RCU flag to their kmem_cache, or
    - add an RCU grace period before freeing them.

    Since we carefully respected the SLAB_DESTROY_BY_RCU protocol,
    as for ESTABLISHED and TIMEWAIT sockets, use it here.

    Since req_prot_init() is only used by TCP and DCCP, I did not add
    a new slab_flags field to their rsk_prot, but reused prot->slab_flags.

    Since all reqsk_alloc() users correctly deal with failure,
    add the __GFP_NOWARN flag to avoid traces under memory pressure.

    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
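
The SLAB_DESTROY_BY_RCU protocol mentioned above requires readers to pin a candidate object and then re-validate its identity, because the memory may have been recycled for another object of the same type in between. A minimal userspace sketch with C11 atomics follows; `struct obj`, `obj_get_unless_zero()` and `lookup()` are hypothetical names, and where kernel lookups restart the walk, this simplified version just moves on to the next slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object living in a type-stable slab: its memory may be
 * reused for another object of the same type at any time. */
struct obj {
	atomic_int refcnt;   /* 0 means "free or being freed" */
	int key;
};

/* Take a reference only if the object is still live. */
static bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void obj_put(struct obj *o)
{
	atomic_fetch_sub(&o->refcnt, 1);
}

/*
 * Lookup following the protocol: find a candidate, pin it, then re-check
 * the key, because the slot may have been recycled between the two steps.
 */
static struct obj *lookup(struct obj *table, size_t n, int key)
{
	assert(table != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct obj *o = &table[i];

		if (o->key != key)
			continue;
		if (!obj_get_unless_zero(o))
			continue;        /* lost the race with a free */
		if (o->key != key) { /* recycled for another key */
			obj_put(o);
			continue;
		}
		return o;
	}
	return NULL;
}
```

A successful lookup returns the object with one extra reference held, which the caller must release with obj_put().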
     

03 Oct, 2015

29 commits

  • Jeff Kirsher says:

    ====================
    Intel Wired LAN Driver Updates 2015-09-30

    This series contains updates to i40e and i40evf only.

    Vasily Averin provides a couple of rtnl lock/unlock fixes for both i40e
    and i40evf.

    Shannon provides several updates and fixes. The first fixes a type clash
    in i40e_aq_rc_to_posix(), where the error codes are signed values, so we
    need to treat them as such. The next fixes a padding issue by adding an
    extra byte to i40e_aqc_get_cee_dcb_cfg_v1_resp to directly
    acknowledge the padding. Another update keeps debugfs register reads
    and writes in i40e from accessing outside of the io-remapped space. The
    last adds support and a device id for another 20 GbE device.

    Jesse fixes the transmit hang workaround code for ARM, which was causing
    Tx hangs to still be reported occasionally when there really was no hang,
    and then fixes the receive dropped counter so that it shows up in the
    netstat interface. Jesse also refactors the interrupt enable function,
    since it always made the caller add the base_vector from the VSI struct,
    which is already passed to the function, fixes kbuild warnings found by
    the 0day build infrastructure by adding a harmless cast to a dev_info(),
    and fixes 32-bit build warnings found by sparse.

    Greg fixed a configuration error that results if a port VLAN is set
    for a VF before the VF driver is loaded, so that when the VF driver is
    loaded the port VLAN is ignored.

    Mitch makes the use of the QOS field consistent in
    i40e_ndo_set_vf_port_vlan(), and modifies the init timing of the driver
    to increase stability on load/unload and SR-IOV enable/disable cycles.

    Anjali updates i40e to not collect VEB stats if they are disabled in the
    hardware for performance reasons.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Simon Horman says:

    ====================
    ravb: Add support for r8a7795 SoC

    please consider this series for net-next.
    It enhances the ravb driver to support the r8a7795 SoC.

    Changes:

    * Dropped RFC prefix
    * Details in changelog of individual patches

    Base:

    * net-next/master

    Availability:

    To aid review of this in conjunction with other EtherAVB changes
    the following branches are available in my renesas tree on kernel.org.

    * me/r8a7795-ravb-driver-v4: this series
    * me/r8a7795-ravb-pfc-v2: r8a7795 sh-pfc update for EthernetAVB
    * me/r8a7795-ravb-integration-v4: enable EthernetAVB on r8a7795
    * me/r8a7795-ravb-driver-and-integration-v4.runtime:
      the above three branches with their runtime dependencies
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch supports the r8a7795 SoC by:
    - Using two interrupts
      + One for E-MAC
      + One for everything else
      + Both can be handled by the existing common interrupt handler, which
        affords a simpler update to support the new SoC. In future some
        consideration may be given to implementing multiple interrupt handlers
    - Limiting the phy speed to 100Mbit/s for the new SoC;
      at this time it is not clear how this restriction may be lifted
      but I hope it will be possible as more information comes to light

    Signed-off-by: Kazuya Mizuguchi
    [horms: reworked]
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Kazuya Mizuguchi
     
  • This patch updates the ravb binding to support the r8a7795 SoC by:
    - Adding a compat string for the new hardware
    - Adding 25 named interrupts to the binding for the new SoC;
      older SoCs continue to use a single multiplexed interrupt

    The example is also updated to reflect the r8a7795 as this is the
    more complex case.

    Based on work by Kazuya Mizuguchi and others.

    Signed-off-by: Simon Horman
    Acked-by: Geert Uytterhoeven
    Signed-off-by: David S. Miller

    Kazuya Mizuguchi
     
  • This patch is in preparation for using this driver on arm64 where the
    implementation of __dma_alloc_coherent fails if a device parameter is not
    provided.

    Signed-off-by: Kazuya Mizuguchi
    Signed-off-by: Yoshihiro Shimoda
    Signed-off-by: Masaru Nagai
    [horms: squashed into a single patch]
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Kazuya Mizuguchi
     
  • Add a helper to allow ethernet drivers to limit the speed of a phy
    (that they are attached to).

    This mainly involves factoring out the business-end of
    of_set_phy_supported() and exporting a new symbol.

    This code seems to be open coded in several places, in several different
    variants.

    It is envisaged that this will be used in situations where setting the
    "max-speed" property in DT is not appropriate, e.g. because the maximum
    speed is not a property of the phy hardware.

    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Simon Horman
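
A helper of this kind can be sketched as a mask operation on the phy's supported link modes. The names below (`demo_phy`, `demo_phy_set_max_speed()`, the `DEMO_*` bits) are hypothetical and do not reproduce the kernel's symbols or feature masks; the sketch only shows the idea of clearing every mode faster than the requested limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; the kernel uses its own link-mode masks. */
#define DEMO_10BASET    (1u << 0)
#define DEMO_100BASET   (1u << 1)
#define DEMO_1000BASET  (1u << 2)

struct demo_phy { uint32_t supported; };

/*
 * Clear every link mode faster than max_speed (in Mbit/s).
 * Returns 0 on success, -1 for a speed this sketch does not know.
 */
static int demo_phy_set_max_speed(struct demo_phy *phy, unsigned int max_speed)
{
	assert(phy != NULL);

	switch (max_speed) {
	case 10:
		phy->supported &= ~DEMO_100BASET;
		/* fall through */
	case 100:
		phy->supported &= ~DEMO_1000BASET;
		/* fall through */
	case 1000:
		return 0;
	default:
		return -1;
	}
}
```

An Ethernet driver could call such a helper once after attaching to the phy, which avoids repeating the masking logic in each driver.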
     
  • Daniel Borkmann says:

    ====================
    BPF updates

    Some minor updates to {cls,act}_bpf to retrieve routing realms
    and to make skb->priority writable.

    Thanks!

    v1 -> v2:
    - Dropped preclassify patch for now from the series as the
      rest is pretty much independent of it
    - Rest unchanged, only rebased and already posted Acked-by's kept
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • {cls,act}_bpf can now set the skb->priority from an eBPF program based
    on various criteria, so that for example classful qdiscs like multiq can
    update the skb's priority during enqueue time and further push it down
    into subsequent qdiscs.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Using routing realms as part of the classifier is quite useful. A realm
    can be viewed as a tag for one or multiple routing entries (think of
    an analogy to net_cls cgroup for processes), set by user space routing
    daemons or via iproute2 as an indicator for traffic classifiers and
    later on processed in the eBPF program.

    Unlike actions, the classifier can inspect device flags and enable
    netif_keep_dst() if necessary. tc actions don't have that possibility,
    but in case people know what they are doing, it can be used from there
    as well (e.g. via devs that must keep dsts by design anyway).

    If a realm is set, the handler returns the non-zero realm. User space
    can set the full 32bit realm for the dst.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • As we need to add further flags to the bpf_prog structure, let's migrate
    both bools to a bitfield representation. The size of the base structure
    (excluding insns) remains unchanged at 40 bytes.

    Also add tags for the kmemchecker, so that it doesn't throw false
    positives. Even if gcc were to generate suboptimal code, the bitfield is
    not accessed in performance critical paths.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
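
The bool-to-bitfield migration can be illustrated with two simplified structures. These are hypothetical layouts, not the real bpf_prog definition, and the kmemcheck annotations are omitted; the sketch only shows that one-bit fields leave room for more flags without growing the structure on common ABIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Before: two separate bools (hypothetical, simplified layout). */
struct prog_bools {
	bool jited;
	bool gpl_compatible;
	uint32_t len;
};

/* After: the same flags as one-bit fields, with spare bits for new flags. */
struct prog_bits {
	unsigned int jited:1,
		     gpl_compatible:1;
	uint32_t len;
};

/* Setting one flag must leave its neighbours in the same word untouched. */
static void prog_mark_jited(struct prog_bits *p)
{
	assert(p != NULL);
	p->jited = 1;
}
```

Bitfield layout is implementation-defined, so the exact sizes depend on the compiler and ABI; the size comparison is a property to check, not a guarantee.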
     
  • Jiri Pirko says:

    ====================
    switchdev: bring back switchdev_obj

    The second version of the patch has grown into a patchset. Basically this
    patchset brings back the object structure, which disappeared with Vivien's
    recent patchset. It also makes a few naming changes to bring things in line,
    and puts the object id back into the object structure.
    Thanks to Scott and Vivien for review and suggestions.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Suggested-by: Scott Feldman
    Signed-off-by: Jiri Pirko
    Acked-by: Scott Feldman
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Replace "void *obj" with a generic structure. Introduce a couple of
    helpers along with that.

    Signed-off-by: Jiri Pirko
    Acked-by: Scott Feldman
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Jiri Pirko
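
Replacing an untyped pointer with a generic structure usually means embedding a small common header in each concrete object and recovering the outer type with a helper. The sketch below uses hypothetical `demo_*` names and does not reproduce the switchdev definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object ids; illustrative only. */
enum demo_obj_id { DEMO_OBJ_PORT_VLAN, DEMO_OBJ_PORT_FDB };

/* Generic header that replaces an untyped "void *obj". */
struct demo_obj { enum demo_obj_id id; };

/* Concrete object embedding the generic header. */
struct demo_obj_port_vlan {
	struct demo_obj obj;
	uint16_t vid_begin;
	uint16_t vid_end;
};

/* Recover the enclosing structure from a pointer to its member. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Typed helper: callers no longer cast a void pointer by hand. */
static struct demo_obj_port_vlan *demo_obj_to_port_vlan(struct demo_obj *obj)
{
	assert(obj != NULL && obj->id == DEMO_OBJ_PORT_VLAN);
	return demo_container_of(obj, struct demo_obj_port_vlan, obj);
}

/* A driver callback takes the generic type and dispatches on the id. */
static int demo_port_obj_add(struct demo_obj *obj)
{
	switch (obj->id) {
	case DEMO_OBJ_PORT_VLAN:
		return demo_obj_to_port_vlan(obj)->vid_begin;
	default:
		return -1;
	}
}
```

With this shape the compiler checks the argument type at every call site, which a "void *" parameter cannot do.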
     
  • Bring the struct name in sync with the object id name.

    Suggested-by: Vivien Didelot
    Signed-off-by: Jiri Pirko
    Acked-by: Scott Feldman
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Bring the struct name in sync with the object id name.

    Suggested-by: Vivien Didelot
    Signed-off-by: Jiri Pirko
    Acked-by: Scott Feldman
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • To be aligned with obj.

    Signed-off-by: Jiri Pirko
    Acked-by: Scott Feldman
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Suggested-by: Vivien Didelot
    Signed-off-by: Jiri Pirko
    Acked-by: Scott Feldman
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Eric Dumazet says:

    ====================
    tcp/dccp: lockless listener

    TCP listener refactoring: this is becoming interesting!

    This patch series takes the steps to use the normal TCP/DCCP ehash
    table to store SYN_RECV requests, instead of the private per-listener
    hash table we had until now.

    SYNACK skbs are now attached to their syn_recv request socket,
    so that we no longer heavily modify the listener's sk_wmem_alloc.

    The listener lock is no longer held in the fast path, including
    in SYNCOOKIE mode.

    During my tests, my server was able to process 3,500,000
    SYN packets per second on one listener and still had available
    cpu cycles.

    That is about two to three orders of magnitude more than what we had
    with older kernels.

    This effort started two years ago and I am pleased to reach expectations.

    We'll probably extend SO_REUSEPORT to add proper cpu/numa affinities,
    so that heavy duty TCP servers can get proper siloing thanks to
    multi-queue NICs.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Everything should now be ready to finally allow SYN
    packet processing without holding the listener lock.

    Tested:

    3.5 Mpps SYNFLOOD. Plenty of cpu cycles available.

    The next bottleneck is the refcount taken on the listener,
    which could be avoided if we removed the strict SLAB_DESTROY_BY_RCU
    semantics for listeners and used regular RCU.

    13.18% [kernel] [k] __inet_lookup_listener
    9.61% [kernel] [k] tcp_conn_request
    8.16% [kernel] [k] sha_transform
    5.30% [kernel] [k] inet_reqsk_alloc
    4.22% [kernel] [k] sock_put
    3.74% [kernel] [k] tcp_make_synack
    2.88% [kernel] [k] ipt_do_table
    2.56% [kernel] [k] memcpy_erms
    2.53% [kernel] [k] sock_wfree
    2.40% [kernel] [k] tcp_v4_rcv
    2.08% [kernel] [k] fib_table_lookup
    1.84% [kernel] [k] tcp_openreq_init_rwin

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If a listener with thousands of children in its accept queue
    is dismantled, it can take a while to close all of them.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This control variable was set at the first listen(fd, backlog)
    call, but not updated if the application tried to increase or decrease
    the backlog. It made sense at the time the listener had a non-resizable
    hash table.

    Also, rounding to powers of two was not very friendly.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
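
The effect of rounding a backlog to a power of two can be seen in a short sketch. The helper names are hypothetical and this does not reproduce the removed kernel code; it only illustrates why a limit stored as a log2 value cannot track an arbitrary backlog.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two (1 <= n <= 2^31). */
static uint32_t demo_roundup_pow_of_two(uint32_t n)
{
	uint32_t p = 1;

	assert(n >= 1 && n <= (1u << 31));
	while (p < n)
		p <<= 1;
	return p;
}

/* Smallest log such that (1 << log) >= n, i.e. a "queue length log". */
static unsigned int demo_qlen_log(uint32_t n)
{
	unsigned int log = 0;

	assert(n >= 1 && n <= (1u << 31));
	while ((1u << log) < n)
		log++;
	return log;
}
```

Under this scheme a requested backlog of 1025 maps to a log of 11, that is an effective limit of 2048, nearly twice what the application asked for.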
     
  • It is enough to check listener sk_state, no need for an extra
    condition.

    max_qlen_log can be moved into struct request_sock_queue

    We can remove syn_wait_lock and the alignment it enforced.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If a listen backlog is very big (to avoid syncookies), then
    the listener sk->sk_wmem_alloc is the main source of false
    sharing, as we need to touch it twice per SYNACK re-transmit
    and TX completion.

    (One SYN packet takes the listener lock once, but up to 6 SYNACKs
    are generated)

    By attaching the skb to the request socket, we remove this
    source of contention.

    Tested:

    listen(fd, 10485760); // single listener (no SO_REUSEPORT)
    16 RX/TX queue NIC
    Sustain a SYNFLOOD attack of ~320,000 SYN per second,
    Sending ~1,400,000 SYNACK per second.
    Perf profiles now show listener spinlock being next bottleneck.

    20.29% [kernel] [k] queued_spin_lock_slowpath
    10.06% [kernel] [k] __inet_lookup_established
    5.12% [kernel] [k] reqsk_timer_handler
    3.22% [kernel] [k] get_next_timer_interrupt
    3.00% [kernel] [k] tcp_make_synack
    2.77% [kernel] [k] ipt_do_table
    2.70% [kernel] [k] run_timer_softirq
    2.50% [kernel] [k] ip_finish_output
    2.04% [kernel] [k] cascade

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • inet6_csk_search_req() and inet6_csk_reqsk_queue_hash_add()
    no longer exist.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We no longer use hash_rnd, nr_table_entries and syn_table[]

    For a listener with a backlog of 10 million sockets, this
    saves 80 MBytes of vmalloced memory.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
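
The 80 MBytes figure is consistent with one pointer-sized slot per backlog entry on a 64-bit machine. The helper below is a hypothetical illustration of that arithmetic, assuming 8-byte slots; it is not derived from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes used by a table of n_slots entries of slot_size bytes each. */
static uint64_t demo_table_bytes(uint64_t n_slots, uint64_t slot_size)
{
	assert(slot_size > 0);
	return n_slots * slot_size;
}
```

For example, demo_table_bytes(10000000, 8) gives 80,000,000 bytes, which matches the saving quoted in the commit message.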
     
  • In this patch, we insert request sockets into TCP/DCCP
    regular ehash table (where ESTABLISHED and TIMEWAIT sockets
    are) instead of using the per listener hash table.

    ACK packets find SYN_RECV pseudo sockets without having
    to find and lock the listener.

    In nominal conditions, this halves pressure on listener lock.

    Note that this will allow for SO_REUSEPORT refinements,
    so that we can select a listener using cpu/numa affinities instead
    of the prior 'consistent hash', since only SYN packets will
    apply this selection logic.

    We will shrink listen_sock in the following patch to ease
    code review.

    Signed-off-by: Eric Dumazet
    Cc: Ying Cai
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is no longer used.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When request sockets are no longer in a per listener hash table
    but on regular TCP ehash, we need to access listener uid
    through req->rsk_listener

    get_openreq6() also gets a const for its request socket argument.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Once the listener is lockless, its sk_state can change at any time.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet