27 Sep, 2016

4 commits

  • This prevents potential future pointer leaks when an unprivileged eBPF
    program reads a pointer value from its context. Until now, even if
    is_valid_access() returned a pointer type, the eBPF verifier replaced it
    with UNKNOWN_VALUE, so a register value containing a kernel address was
    then allowed to leak. Moreover, this fix allows unprivileged eBPF
    programs to use functions with (legitimate) pointer arguments.

    Not an issue currently since reg_type is only set for PTR_TO_PACKET or
    PTR_TO_PACKET_END in XDP and TC programs that can only be loaded as
    privileged. For now, the only unprivileged eBPF program allowed is for
    socket filtering and all the types from its context are UNKNOWN_VALUE.
    However, this fix is important for future unprivileged eBPF programs
    which could use pointers in their context.

    Signed-off-by: Mickaël Salaün
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Mickaël Salaün
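
    As a rough illustration of the verifier change described above, the
    sketch below models how the type of a register loaded from the context
    might be decided before and after the fix. It is a simplified userspace
    model, not verifier code: every ex_ name, the reduced enum, and the
    allow_ptr_leaks parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified register types (hypothetical subset for this sketch). */
enum ex_reg_type {
    EX_UNKNOWN_VALUE,
    EX_PTR_TO_PACKET,
    EX_PTR_TO_PACKET_END,
};

/* Before the fix (as modelled here): an unprivileged program always saw
 * UNKNOWN_VALUE, even when the context access check reported a pointer. */
static enum ex_reg_type ex_ctx_load_old(enum ex_reg_type ctx_type,
                                        bool allow_ptr_leaks)
{
    return allow_ptr_leaks ? ctx_type : EX_UNKNOWN_VALUE;
}

/* After the fix: the register keeps the type reported for the access. */
static enum ex_reg_type ex_ctx_load_new(enum ex_reg_type ctx_type)
{
    return ctx_type;
}

/* A value may be exposed to user space only if it is not a pointer,
 * or if pointer leaks are explicitly allowed. */
static bool ex_may_expose(enum ex_reg_type type, bool allow_ptr_leaks)
{
    return allow_ptr_leaks || type == EX_UNKNOWN_VALUE;
}

/* Usage: the old model lets the address through, the new one does not. */
static void ex_demo(void)
{
    assert(ex_may_expose(ex_ctx_load_old(EX_PTR_TO_PACKET, false), false));
    assert(!ex_may_expose(ex_ctx_load_new(EX_PTR_TO_PACKET), false));
}
```

    In this model the leak check only works once the register carries its
    real type, which is the point the commit message makes.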
     
  • seccomp_phase1() does not exist anymore. Instead, update sample to use
    __seccomp_filter(). While at it, set max locked memory to unlimited.

    Signed-off-by: Naveen N. Rao
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Naveen N. Rao
     
  • These samples fail to compile because 'struct flow_keys' conflicts with
    the definition in net/flow_dissector.h. Fix this by renaming the
    structure used in the samples.

    Signed-off-by: Naveen N. Rao
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Naveen N. Rao
     
  • The ethtool API {get|set}_settings is deprecated.
    Move this driver to the new API {get|set}_link_ksettings.

    Signed-off-by: Philippe Reynes
    Acked-by: Michael Chan
    Signed-off-by: David S. Miller

    Philippe Reynes
     

26 Sep, 2016

20 commits

  • Jason Baron says:

    ====================
    bnx2x: page allocation failure

    While configuring ~500 multicast addrs, we ran into high order
    page allocation failures. They don't need to be high order, and
    thus I'm proposing to split them into at most PAGE_SIZE allocations.

    Below is a sample failure.

    [1201902.617882] bnx2x: [bnx2x_set_mc_list:12374(eth0)]Failed to create multicast MACs list: -12
    [1207325.695021] kworker/1:0: page allocation failure: order:2, mode:0xc020
    [1207325.702059] CPU: 1 PID: 15805 Comm: kworker/1:0 Tainted: G W
    [1207325.712940] Hardware name: SYNNEX CORPORATION 1x8-X4i SSD 10GE/S5512LE, BIOS V8.810 05/16/2013
    [1207325.722284] Workqueue: events bnx2x_sp_rtnl_task [bnx2x]
    [1207325.728206] 0000000000000000 ffff88012d873a78 ffffffff8267f7c7 000000000000c020
    [1207325.736754] 0000000000000000 ffff88012d873b08 ffffffff8212f8e0 fffffffc00000003
    [1207325.745301] ffff88041ffecd80 ffff880400000030 0000000000000002 0000c0206800da13
    [1207325.753846] Call Trace:
    [1207325.756789] [] dump_stack+0x4d/0x63
    [1207325.762426] [] warn_alloc_failed+0xe0/0x130
    [1207325.768756] [] ? wakeup_kswapd+0x48/0x140
    [1207325.774914] [] __alloc_pages_nodemask+0x2bc/0x970
    [1207325.781761] [] alloc_pages_current+0x91/0x100
    [1207325.788260] [] alloc_kmem_pages+0xe/0x10
    [1207325.794329] [] kmalloc_order+0x18/0x50
    [1207325.800227] [] kmalloc_order_trace+0x26/0xb0
    [1207325.806642] [] ? _xfer_secondary_pool+0xa8/0x1a0
    [1207325.813404] [] __kmalloc+0x19a/0x1b0
    [1207325.819142] [] bnx2x_set_rx_mode_inner+0x3d5/0x590 [bnx2x]
    [1207325.827000] [] bnx2x_sp_rtnl_task+0x28d/0x760 [bnx2x]
    [1207325.834197] [] process_one_work+0x134/0x3c0
    [1207325.840522] [] worker_thread+0x121/0x460
    [1207325.846585] [] ? process_one_work+0x3c0/0x3c0
    [1207325.853089] [] kthread+0xc9/0xe0
    [1207325.858459] [] ? notify_die+0x10/0x40
    [1207325.864263] [] ? kthread_create_on_node+0x180/0x180
    [1207325.871288] [] ret_from_fork+0x42/0x70
    [1207325.877183] [] ? kthread_create_on_node+0x180/0x180

    v2:
    -make use of list_next_entry()
    -only use PAGE_SIZE allocations
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Currently, we can have high order page allocations that specify
    GFP_ATOMIC when configuring multicast MAC address filters.

    For example, we have seen order 2 page allocation failures with
    ~500 multicast addresses configured.

    Convert the allocation for the pending list to be done in PAGE_SIZE
    increments.

    Signed-off-by: Jason Baron
    Cc: Yuval Mintz
    Cc: Ariel Elior
    Acked-by: Yuval Mintz
    Signed-off-by: David S. Miller

    Jason Baron
     
  • Currently, we can have high order page allocations that specify
    GFP_ATOMIC when configuring multicast MAC address filters.

    For example, we have seen order 2 page allocation failures with
    ~500 multicast addresses configured.

    Convert the allocation for 'mcast_list' to be done in PAGE_SIZE
    increments.

    Signed-off-by: Jason Baron
    Cc: Yuval Mintz
    Cc: Ariel Elior
    Signed-off-by: David S. Miller

    Jason Baron
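
    The idea behind the two bnx2x commits above can be sketched in
    userspace C: instead of one large allocation sized for every element,
    allocate a chain of chunks that are each at most one page. This is only
    a sketch under stated assumptions; EX_PAGE_SIZE, the element layout and
    all ex_ names are hypothetical, and malloc() stands in for the kernel
    allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define EX_PAGE_SIZE 4096u /* assumed page size for the sketch */

/* Hypothetical stand-in for one multicast MAC list element (32 bytes). */
struct ex_mac_elem {
    unsigned char mac[6];
    unsigned char pad[26];
};

/* One chunk: a small header plus as many elements as fit in a page. */
struct ex_chunk {
    struct ex_chunk *next;
    size_t used;
    struct ex_mac_elem elems[];
};

/* Number of elements that fit in a single page-sized chunk. */
static size_t ex_elems_per_chunk(void)
{
    return (EX_PAGE_SIZE - sizeof(struct ex_chunk)) / sizeof(struct ex_mac_elem);
}

/* Number of page-sized chunks needed for n elements (rounded up). */
static size_t ex_chunks_needed(size_t n)
{
    size_t per = ex_elems_per_chunk();

    assert(per > 0);
    return (n + per - 1) / per;
}

static void ex_free_chunks(struct ex_chunk *head)
{
    while (head) {
        struct ex_chunk *next = head->next;

        free(head);
        head = next;
    }
}

/* Allocate room for n elements; no single malloc() exceeds EX_PAGE_SIZE. */
static struct ex_chunk *ex_alloc_chunks(size_t n)
{
    struct ex_chunk *head = NULL, **tail = &head;
    size_t per = ex_elems_per_chunk();

    while (n > 0) {
        size_t take = n < per ? n : per;
        struct ex_chunk *c = malloc(EX_PAGE_SIZE);

        if (!c) {
            ex_free_chunks(head);
            return NULL;
        }
        c->next = NULL;
        c->used = take;
        *tail = c;
        tail = &c->next;
        n -= take;
    }
    return head;
}
```

    With these assumed sizes, 127 elements fit in a chunk, so the ~500
    addresses from the report would need four page-sized allocations
    rather than one higher-order one.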
     
  • I stumbled over a new warning during randconfig testing,
    with CONFIG_BPF_SYSCALL disabled:

    drivers/net/ethernet/netronome/nfp/nfp_net_offload.c: In function 'nfp_net_bpf_offload':
    drivers/net/ethernet/netronome/nfp/nfp_net_offload.c:263:3: error: '*((void *)&res+4)' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    drivers/net/ethernet/netronome/nfp/nfp_net_offload.c:263:3: error: 'res.n_instr' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    As far as I can tell, this is a false positive caused by the compiler
    getting confused about a function that is partially inlined, but it's
    easy to avoid while improving the code:

    The nfp_bpf_jit() stub helper for that configuration is unusual as it
    is defined in a header file but not marked 'static inline'. By moving
    the compile-time check into the caller using the IS_ENABLED() macro,
    we can remove that stub and simplify the nfp_net_bpf_offload_prepare()
    function enough to unconfuse the compiler.

    Fixes: 7533fdc0f77f ("nfp: bpf: add hardware bpf offload")
    Signed-off-by: Arnd Bergmann
    Acked-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Arnd Bergmann
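
    The pattern described above, checking the configuration in the caller
    instead of providing a stub, might look roughly like the following
    userspace sketch. EX_CONFIG_BPF_SYSCALL is a plain 0/1 macro standing
    in for the kernel's IS_ENABLED(CONFIG_BPF_SYSCALL); the ex_ functions,
    the struct and the error code are assumptions for illustration, not
    the driver's code.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a Kconfig symbol: 1 when the feature is built in, else 0. */
#ifndef EX_CONFIG_BPF_SYSCALL
#define EX_CONFIG_BPF_SYSCALL 0
#endif

struct ex_jit_result {
    unsigned int n_instr;
};

/* Hypothetical JIT entry point; always declared, no per-config stub. */
static int ex_jit(struct ex_jit_result *res)
{
    res->n_instr = 16;
    return 0;
}

/* The caller performs the configuration check itself. The condition is a
 * compile-time constant, so the compiler can discard the dead branch while
 * still type-checking both paths. */
static int ex_offload_prepare(struct ex_jit_result *res)
{
    if (!EX_CONFIG_BPF_SYSCALL)
        return -EOPNOTSUPP;
    return ex_jit(res);
}

/* Usage: with the feature disabled the result is never touched. */
static void ex_demo(void)
{
    struct ex_jit_result res = { 0 };
    int ret = ex_offload_prepare(&res);

    assert(ret == (EX_CONFIG_BPF_SYSCALL ? 0 : -EOPNOTSUPP));
}
```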
     
  • We get 1 warning when building kernel with W=1:
    drivers/net/ethernet/broadcom/genet/bcmgenet.c:2763:5: warning: no previous prototype for 'bcmgenet_hfb_add_filter' [-Wmissing-prototypes]

    In fact, this function is implemented in
    drivers/net/ethernet/broadcom/genet/bcmgenet.c but is called
    by no one, so it can be removed.

    So this patch removes the unused functions.

    Signed-off-by: Baoyou Xie
    Signed-off-by: David S. Miller

    Baoyou Xie
     
  • We get 10 warnings when building kernel with W=1:
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:304:5: warning: no previous prototype for 'cxgb4_dcb_enabled' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:194:5: warning: no previous prototype for 'setup_sge_queues_uld' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:241:6: warning: no previous prototype for 'free_sge_queues_uld' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:268:5: warning: no previous prototype for 'cfg_queues_uld' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:344:6: warning: no previous prototype for 'free_queues_uld' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:353:5: warning: no previous prototype for 'request_msix_queue_irqs_uld' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:379:6: warning: no previous prototype for 'free_msix_queue_irqs_uld' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:393:6: warning: no previous prototype for 'name_msix_vecs_uld' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:433:6: warning: no previous prototype for 'enable_rx_uld' [-Wmissing-prototypes]
    drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:442:6: warning: no previous prototype for 'quiesce_rx_uld' [-Wmissing-prototypes]

    In fact, these functions are only used in the file in which they are
    declared and don't need a declaration, but can be made static.
    So this patch marks these functions with 'static'.

    Signed-off-by: Baoyou Xie
    Signed-off-by: David S. Miller

    Baoyou Xie
     
  • We get 2 warnings when building kernel with W=1:
    drivers/net/ethernet/marvell/mvneta.c:639:27: warning: no previous prototype for 'mvneta_get_stats64' [-Wmissing-prototypes]
    drivers/net/ethernet/marvell/mvneta.c:3529:5: warning: no previous prototype for 'mvneta_ethtool_set_link_ksettings' [-Wmissing-prototypes]

    In fact, these two functions are only used in the file in which they are
    declared and don't need a declaration, but can be made static.
    So this patch marks these functions with 'static'.

    Signed-off-by: Baoyou Xie
    Signed-off-by: David S. Miller

    Baoyou Xie
     
  • We get 1 warning when building kernel with W=1:
    drivers/net/ethernet/hisilicon/hip04_eth.c:603:22: warning: no previous prototype for 'tx_done' [-Wmissing-prototypes]

    In fact, this function is only used in the file in which it is
    declared and doesn't need a declaration, but can be made static.
    So this patch marks this function with 'static'.

    Signed-off-by: Baoyou Xie
    Signed-off-by: David S. Miller

    Baoyou Xie
     
  • We get 2 warnings when building kernel with W=1:
    drivers/net/ethernet/hisilicon/hisi_femac.c:943:5: warning: no previous prototype for 'hisi_femac_drv_suspend' [-Wmissing-prototypes]
    drivers/net/ethernet/hisilicon/hisi_femac.c:960:5: warning: no previous prototype for 'hisi_femac_drv_resume' [-Wmissing-prototypes]

    In fact, these two functions are only used in the file in which they are
    declared and don't need a declaration, but can be made static.
    So this patch marks these functions with 'static'.

    Signed-off-by: Baoyou Xie
    Signed-off-by: David S. Miller

    Baoyou Xie
     
  • …etooth/bluetooth-next

    Johan Hedberg says:

    ====================
    pull request: bluetooth-next 2016-09-25

    Here are a few more Bluetooth & 802.15.4 patches for the 4.9 kernel that
    have popped up during the past week:

    - New USB ID for QCA_ROME Bluetooth device
    - NULL pointer dereference fix for Bluetooth mgmt sockets
    - Fixes for BCSP driver
    - Fix for updating LE scan response

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Fixes the following sparse warnings:

    drivers/net/dsa/mv88e6xxx/chip.c:219:5: warning:
    symbol 'mv88e6xxx_port_read' was not declared. Should it be static?
    drivers/net/dsa/mv88e6xxx/chip.c:227:5: warning:
    symbol 'mv88e6xxx_port_write' was not declared. Should it be static?

    Signed-off-by: Wei Yongjun
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Wei Yongjun
     
  • Fixes the following sparse warnings:

    drivers/net/ethernet/emulex/benet/be_main.c:47:25: warning:
    symbol 'be_err_recovery_workq' was not declared. Should it be static?
    drivers/net/ethernet/emulex/benet/be_main.c:63:25: warning:
    symbol 'be_wq' was not declared. Should it be static?

    Signed-off-by: Wei Yongjun
    Signed-off-by: David S. Miller

    Wei Yongjun
     
  • This aligns smc91x with its cousin, namely smc911x.c.
    It also allows the driver to run in a device-tree based lubbock
    board build, on which it was tested.

    Signed-off-by: Robert Jarzmik
    Signed-off-by: David S. Miller

    Robert Jarzmik
     
  • iq is unsigned, so the error check for iq < 0 has no effect and errors
    can slip past it. Fix this by making iq signed and by making
    get_filter_steerq() return a signed int so a negative error can be
    returned.

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
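
    The signedness problem described above can be reproduced in a few
    lines. The sketch below is not the cxgb4 code; ex_get_steerq() and the
    other ex_ names are hypothetical. It shows that a negative error code
    stored in an unsigned variable turns into a very large value, so a
    "less than zero" test on it can never fire, while the signed variant
    sees the error.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical lookup: returns a queue index, or -EINVAL on failure. */
static int ex_get_steerq(int requested, int nqueues)
{
    if (requested < 0 || requested >= nqueues)
        return -EINVAL;
    return requested;
}

/* Buggy shape: the result is stored in an unsigned variable, so the
 * negative errno wraps around to a huge positive value. */
static unsigned int ex_store_unsigned(int ret)
{
    return (unsigned int)ret;
}

/* Fixed shape: with a signed variable the error check works. */
static int ex_is_error(int iq)
{
    return iq < 0;
}

/* Usage: an out-of-range request fails, but only the signed path notices. */
static void ex_demo(void)
{
    int ret = ex_get_steerq(99, 8);

    assert(ex_store_unsigned(ret) > (unsigned int)INT_MAX);
    assert(ex_is_error(ret));
}
```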
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree, they are:

    1) Consolidate GRE protocol tracker using new GRE protocol definitions,
    patches from Gao Feng.

    2) Properly parse continuation lines in SIP helper, update allowed
    characters in Call-ID header and allow tabs in SIP headers as
    specified by RFC3261, from Marco Angaroni.

    3) Remove useless code in FTP conntrack helper, also from Gao Feng.

    4) Add number generation expression for nf_tables, with random and
    incremental generators. This also includes specific offset to add
    to the result, patches from Laura Garcia Liebana. Liping Zhang
    follows with a fix to avoid a race in this new expression.

    5) Fix new quota expression inversion logic, added in the previous
    pull request.

    6) Missing validation of queue configuration in nft_queue, patch
    from Liping Zhang.

    7) Remove unused ctl_table_path, as part of the deprecation of the
    ip_conntrack sysctl interface coming in the previous batch.
    Again from Liping Zhang.

    8) Add offset attribute to nft_hash expression, so we can generate
    any output from a specific base offset. Moreover, check for
    possible overflow, patches from Laura Garcia.

    9) Allow to invert dynamic set insertion from packet path, to check
    for overflows in case the set is full.

    10) Revisit nft_set_pktinfo*() logic from nf_tables to ensure
    proper initialization of layer 4 protocol. Consolidate pktinfo
    structure initialization for bridge and netdev families.

    11) Do not unconditionally drop IPv6 packets for which we cannot parse
    the transport protocol in the ip6 and inet families, let the user
    decide on this via ruleset policy.

    12) Get rid of gotos in __nf_ct_try_assign_helper().

    13) Check for return value in register_netdevice_notifier() and
    nft_register_chain_type(), patches from Gao Feng.

    14) Get rid of CONFIG_IP6_NF_IPTABLES dependency in nf_queue
    infrastructure that is common to nf_tables, from Liping Zhang.

    15) Disable 'found' and 'searched' stats that are updated from the
    packet hotpath, not very useful these days.

    16) Validate maximum value of u32 netlink attributes in nf_tables,
    this introduces nft_parse_u32_check(). From Laura Garcia.

    17) Add missing code to integrate nft_queue with maps, patch from
    Liping Zhang. This also includes missing support for ranges in the
    nft_queue bridge family.

    18) Fix check in nft_payload_fast_eval() that ensures that we don't
    go over the skbuff data boundary, from Liping Zhang.

    19) Check if transport protocol is set from nf_tables tracing and
    payload expression. Again from Liping Zhang.

    20) Use net_get_random_once() whenever possible, from Gao Feng.

    21) Replace hardcoded value by sizeof() in xt_helper, from Gao Feng.

    22) Remove superfluous check for found element in nft_lookup.

    23) Simplify TCPMSS logic to check for minimum MTU, from Gao Feng.

    24) Replace double linked list by single linked list in Netfilter
    core hook infrastructure, patchset from Aaron Conole. This
    includes several patches to prepare this update.

    25) Fix wrong sequence adjustment of TCP RST with no ACK, from
    Gao Feng.

    26) Relax check for direction attribute in nft_ct for layer 3 and 4
    protocol fields, from Liping Zhang.

    27) Add new revision for hashlimit to support higher pps of up to 1
    million, from Vishwanath Pai.

    28) Evict stale entries in nf_conntrack when reading entries from
    /proc/net/nf_conntrack, from Florian Westphal.

    29) Fix transparent match for IPv6 request sockets, from Krisztian
    Kovacs.

    30) Add new range expression for nf_tables.

    31) Add missing code to support flags in nft_log. Expose NF_LOG_*
    flags via uapi and use it from the generic logging infrastructure,
    instead of using xt specific definitions, from Liping Zhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Conflicts:
    net/netfilter/core.c
    net/netfilter/nf_tables_netdev.c

    Resolve two conflicts before pull request for David's net-next tree:

    1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant
    ip_hdr assignment") from the net tree and commit ddc8b6027ad0
    ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").

    2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and
    Aaron Conole's patches to replace list_head with single linked list.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • nf_log is used by both nftables and iptables, so using the XT_LOG_XXX
    macros here is not appropriate. Replace them with NF_LOG_XXX.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • The NFTA_LOG_FLAGS attribute is already supported, but the related
    NF_LOG_XXX flags are not exposed to userspace. So we cannot
    explicitly enable log flags to log uid, tcp sequence, ip options
    and so on, i.e. a rule such as "nft add rule filter output log uid"
    is not supported yet.

    So move the NF_LOG_XXX macro definitions to uapi/../nf_log.h. To
    stay consistent with other modules, change NF_LOG_MASK to
    refer to all supported log flags, and add a new
    NF_LOG_DEFAULT_MASK to refer to the original default log flags.

    Finally, if the user specifies unsupported log flags, or if
    NFTA_LOG_GROUP and NFTA_LOG_FLAGS are set at the same time, report
    EINVAL to userspace.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
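
    The validation rules in the commit above can be sketched as a small
    standalone check. The flag names and bit values below are illustrative
    assumptions (the EX_ prefix marks them as hypothetical); only the two
    rules, "no unknown flags" and "group and flags are mutually exclusive",
    come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical log flags; the bit values are assumptions for the sketch. */
#define EX_LOG_TCPSEQ 0x01u
#define EX_LOG_TCPOPT 0x02u
#define EX_LOG_IPOPT  0x04u
#define EX_LOG_UID    0x08u
#define EX_LOG_EXTRA  0x10u

/* All supported flags, and the subset that is enabled by default. */
#define EX_LOG_MASK \
    (EX_LOG_TCPSEQ | EX_LOG_TCPOPT | EX_LOG_IPOPT | EX_LOG_UID | EX_LOG_EXTRA)
#define EX_LOG_DEFAULT_MASK \
    (EX_LOG_TCPSEQ | EX_LOG_TCPOPT | EX_LOG_IPOPT | EX_LOG_UID)

/* Returns 0 if the attribute combination is acceptable, -EINVAL otherwise. */
static int ex_log_check(bool has_group, bool has_flags, uint32_t flags)
{
    if (has_group && has_flags)
        return -EINVAL;          /* the two attributes exclude each other */
    if (has_flags && (flags & ~EX_LOG_MASK))
        return -EINVAL;          /* a bit outside the supported set */
    return 0;
}

/* Usage: "log uid" is accepted, an unknown bit is rejected. */
static void ex_demo(void)
{
    assert(ex_log_check(false, true, EX_LOG_UID) == 0);
    assert(ex_log_check(false, true, 0x80u) == -EINVAL);
}
```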
     
  • Inverse ranges != [a,b] are not currently possible because rules are
    composites of && operations, and we need to express this:

    data < a || data > b

    This patch adds a new range expression. Positive ranges can already be
    expressed through two cmp expressions:

    cmp(sreg, data, >=)
    cmp(sreg, data, <=)

    Pablo Neira Ayuso
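
    To make the argument above concrete, here is a minimal sketch of a
    range evaluation that supports both the positive and the inverse form
    in a single expression. It is not the nf_tables implementation; the
    enum, the function name and the use of plain 32-bit integers are
    assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operators: inside [a,b] or outside [a,b]. */
enum ex_range_op {
    EX_RANGE_EQ,
    EX_RANGE_NEQ,
};

/* A single expression can evaluate "data < a || data > b", which a chain
 * of &&-combined comparisons cannot express. */
static bool ex_range_eval(uint32_t data, uint32_t a, uint32_t b,
                          enum ex_range_op op)
{
    bool outside = data < a || data > b;

    return op == EX_RANGE_NEQ ? outside : !outside;
}

/* Usage: 5 lies outside [10,20], 15 lies inside it. */
static void ex_demo(void)
{
    assert(ex_range_eval(5, 10, 20, EX_RANGE_NEQ));
    assert(ex_range_eval(15, 10, 20, EX_RANGE_EQ));
}
```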
     
  • The introduction of the TCP_NEW_SYN_RECV state, and the addition of request
    sockets to the ehash table, seem to have broken the --transparent option
    of the socket match for IPv6 (around commit a9407000).

    Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
    listener, the --transparent option tries to match on the no_srccheck flag
    of the request socket.

    Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
    by copying the transparent flag of the listener socket. This effectively
    causes '-m socket --transparent' not to match on the ACK packet sent by the
    client in a TCP handshake.

    Based on the suggestion from Eric Dumazet, this change moves the code
    initializing no_srccheck to tcp_conn_request(), making the above
    scenario work again.

    Fixes: a940700003 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support")
    Signed-off-by: Alex Badics
    Signed-off-by: KOVACS Krisztian
    Signed-off-by: Pablo Neira Ayuso

    KOVACS Krisztian
     

25 Sep, 2016

16 commits

  • Fabian reports a possible conntrack memory leak (could not reproduce so
    far), however, one minor issue can be easily resolved:

    > cat /proc/net/nf_conntrack | wc -l = 5
    > 4 minutes required to clean up the table.

    We should not report those timed-out entries to the user in the first place.
    And instead of just skipping those timed-out entries while iterating over
    the table we can also zap them (we already do this during ctnetlink
    walks, but I forgot about the /proc interface).

    Fixes: f330a7fdbe16 ("netfilter: conntrack: get rid of conntrack timer")
    Reported-by: Fabian Frederick
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Create a new revision for the hashlimit iptables extension module. Rev 2
    will support higher pps of up to 1 million; rev 1 supports only 10k.

    To support this we have to increase the size of the variables avg and
    burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
    and xt_hashlimit_mtinfo2 and also create newer versions of all the
    functions for match, checkentry and destroy.

    Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
    similar in both rev1 and rev2 with only minor changes, so I have split
    those functions and moved all the common code to a *_common function.

    Signed-off-by: Vishwanath Pai
    Signed-off-by: Joshua Hunt
    Signed-off-by: Pablo Neira Ayuso

    Vishwanath Pai
     
  • I am planning to add a revision 2 for the hashlimit xtables module to
    support higher packets per second rates. This patch renames all the
    functions and variables related to revision 1 by adding _v1 at the
    end of the names.

    Signed-off-by: Vishwanath Pai
    Signed-off-by: Joshua Hunt
    Signed-off-by: Pablo Neira Ayuso

    Vishwanath Pai
     
  • NFT_CT_MARK is unrelated to direction, so if NFTA_CT_DIRECTION attr is
    specified, report EINVAL to the userspace. This validation check was
    already done at nft_ct_get_init, but we missed it in nft_ct_set_init.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • Currently, if the user wants to match ct l3proto, the direction must be
    specified, for example:
    # nft add rule filter input ct original l3proto ipv4
                                   ^^^^^^^^
    Otherwise, an error message is reported:
    # nft add rule filter input ct l3proto ipv4
    nft add rule filter input ct l3proto ipv4
    :1:1-38: Error: Could not process rule: Invalid argument
    add rule filter input ct l3proto ipv4
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Actually, there's no need to require NFTA_CT_DIRECTION attr, because
    ct l3proto and protocol are unrelated to direction.

    And for compatibility, even if the user specifies the NFTA_CT_DIRECTION
    attr, do not report an error, just skip it.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • A TCP RST packet that does not set the ack flag, and whose ack number
    bytes are zero, is valid. But the current seqadj code would adjust the
    "0" ack to an invalid ack number. seqadj needs to check the ack flag
    before adjusting the ack number for these RST packets.

    The following is my test case.

    The client is 10.26.98.245, with one iptables rule added:
    iptables -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
    --connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
    tcp-reset
    This iptables rule generates one TCP RST without the ack flag.

    server:10.172.135.55
    Enable the synproxy with seqadjust by the following iptables rules
    iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
    -m tcp --syn -j CT --notrack

    iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
    --ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
    --mss 1460
    iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
    --ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT

    The following is my test result.

    1. packet trace on client
    root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
    win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
    length 0
    IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
    ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
    nop,wscale 7], length 0
    IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
    options [nop,nop,TS val 452367885 ecr 15643479], length 0
    IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
    options [nop,nop,TS val 15643479 ecr 452367885], length 0
    IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
    win 0, length 0

    2. seqadj log on server
    [62873.867319] Adjusting sequence number from 602341895->546723267,
    ack from 3695959830->3695959830
    [62873.867644] Adjusting sequence number from 602341895->546723267,
    ack from 3695959830->3695959830
    [62873.869040] Adjusting sequence number from 3695959830->3695959830,
    ack from 0->55618628

    To summarize, it is clear that the seqadj code adjusts the 0 ack when it
    receives a TCP RST packet without the ack flag.

    Signed-off-by: Gao Feng
    Signed-off-by: Pablo Neira Ayuso

    Gao Feng
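
    The behaviour reported above, and the check that avoids it, can be
    modelled in a few lines. This is a simplified sketch rather than the
    seqadj code: the struct and function names are hypothetical and the
    adjustment is reduced to adding one delta. The delta 55618628 is the
    difference between the two sequence numbers in the log above
    (602341895 - 546723267).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the TCP header fields that matter here. */
struct ex_tcp {
    uint32_t seq;
    uint32_t ack_seq;
    bool ack; /* ACK flag */
    bool rst; /* RST flag */
};

/* Unchecked variant: rewrites the ack number even when it carries no
 * meaning, which is what the log shows for the RST packet. */
static void ex_adjust_ack_unchecked(struct ex_tcp *th, uint32_t delta)
{
    th->ack_seq += delta;
}

/* Checked variant: leave the ack number alone unless ACK is set. */
static void ex_adjust_ack(struct ex_tcp *th, uint32_t delta)
{
    if (!th->ack)
        return;
    th->ack_seq += delta;
}

/* Usage: the RST from the trace, with seq 3695959830 and no ACK flag. */
static void ex_demo(void)
{
    struct ex_tcp a = { 3695959830u, 0, false, true };
    struct ex_tcp b = a;

    ex_adjust_ack_unchecked(&a, 55618628u);
    ex_adjust_ack(&b, 55618628u);
    assert(a.ack_seq == 55618628u); /* matches "ack from 0->55618628" */
    assert(b.ack_seq == 0);         /* ack number left untouched */
}
```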
     
  • The netfilter hook list never uses the prev pointer, and so can be trimmed to
    be a simple singly-linked list.

    In addition to having a more light weight structure for hook traversal,
    struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
    2176 bytes (down from 2240).

    Signed-off-by: Aaron Conole
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Aaron Conole
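
    A minimal sketch of what trimming the hook list to a singly-linked
    list means structurally is given below. The structs are simplified
    stand-ins, not the kernel definitions, and the verdict values and hook
    functions are hypothetical. Forward traversal only ever needs the next
    pointer, so dropping prev saves one pointer per list node.

```c
#include <assert.h>
#include <stddef.h>

enum { EX_DROP = 0, EX_ACCEPT = 1 };

/* Doubly-linked layout: two pointers per node. */
struct ex_list_head {
    struct ex_list_head *next, *prev;
};

struct ex_hook_old {
    struct ex_list_head list;
    int (*fn)(int pkt);
    int priority;
};

/* Singly-linked layout: only the pointer that traversal needs. */
struct ex_hook {
    struct ex_hook *next;
    int (*fn)(int pkt);
    int priority;
};

/* Hypothetical hook functions used by the example. */
static int ex_accept_all(int pkt)
{
    (void)pkt;
    return EX_ACCEPT;
}

static int ex_drop_odd(int pkt)
{
    return (pkt % 2) ? EX_DROP : EX_ACCEPT;
}

/* Walk the hooks in order; stop at the first non-accept verdict. */
static int ex_run_hooks(const struct ex_hook *head, int pkt)
{
    const struct ex_hook *h;

    for (h = head; h; h = h->next) {
        int verdict = h->fn(pkt);

        if (verdict != EX_ACCEPT)
            return verdict;
    }
    return EX_ACCEPT;
}

/* Usage: two chained hooks, the second one drops odd "packets". */
static void ex_demo(void)
{
    struct ex_hook second = { NULL, ex_drop_odd, 10 };
    struct ex_hook first = { &second, ex_accept_all, 0 };

    assert(ex_run_hooks(&first, 2) == EX_ACCEPT);
    assert(ex_run_hooks(&first, 3) == EX_DROP);
    assert(sizeof(struct ex_hook) < sizeof(struct ex_hook_old));
}
```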
     
  • Jeff Kirsher says:

    ====================
    40GbE Intel Wired LAN Driver Updates 2016-09-24

    This series contains updates to i40e and i40evf only.

    Harshitha removes the ability to set or advertise X722 to 100 Mbps,
    since it is not supported, so we should not be able to advertise or
    set the NIC to 100 Mbps.

    Alan fixes an issue where deleting a MAC filter did not really delete the
    filter in question. The reason being that the wrong cmd_flag is passed to
    the firmware.

    Preethi adds the encapsulation checksum offload negotiation flag, so that
    we can control it.

    Jake cleans up the ATR auto_disable_flags use, since some locations
    disable ATR accidentally using the "full" disable by disabling the flag
    in the standard flags field. This permanently forces ATR off instead of
    temporarily disabling it. Then updates checks to include when there are
    TCP/IP4 sideband rules in effect, where ATR should be disabled. Lastly,
    adds support to the i40evf driver for setting interrupt moderation values
    per queue, like in i40e.

    Henry cleans up unreachable code, since i40e_shutdown_adminq() is always
    true.

    Mitch enables support for adaptive interrupt throttling, since all the
    code for it is already in the interrupt handler. Then fixes a rare
    case where we might get a VSI with no queues and we try to configure
    RSS, which would result in a divide by zero.

    Alex fixes an issue where transmit cleanup flow was incorrectly assuming
    it could check for the flow director bits after it had unmapped the
    buffer. Then adds a txring_txq() to allow us to convert a i40e_ring/
    i40evf_ring to a netdev_tx_queue structure, like ixgbe and fm10k. This
    avoids having to make a multi-line function call for all the areas that
    need access to it. Re-factors the Flow Director filter configuration
    out into a separate function, like we did for the standard xmit path.
    Cleans up the debugfs hook for Flow Director since it was meant for
    debug only.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • …git/dhowells/linux-fs

    David Howells says:

    ====================
    rxrpc: Implement slow-start and other bits

    This set of patches implements the RxRPC slow-start feature for AF_RXRPC to
    improve performance and handling of occasional packet loss. This is more or
    less the same as TCP slow start [RFC 5681]. Firstly, there are some ACK
    generation improvements:

    (1) Send ACKs regularly to apprise the peer of our state so that they can do
    congestion management of their own.

    (2) Send an ACK when we fill in a hole in the buffer so that the peer can
    find out that we did this thus forestalling retransmission.

    (3) Note the final DATA packet's serial number in the final ACK for
    correlation purposes.

    and a couple of bug fixes:

    (4) Reinitialise the ACK state and clear the ACK and resend timers upon
    entering the client reply reception phase to kill off any pending probe
    ACKs.

    (5) Delay the resend timer to allow for nsec->jiffies conversion errors.

    and then there's the slow-start pieces:

    (6) Summarise an ACK.

    (7) Schedule a PING or IDLE ACK if the reply to a client call is overdue to
    try and find out what happened to it.

    (8) Implement the slow start feature.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
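
    Since the cover letter compares the feature to TCP slow start, a
    generic window update in the style of RFC 5681 is sketched below for
    orientation. It is not the AF_RXRPC implementation: the struct, the
    packet-based units and the function names are assumptions, and only
    the textbook rules (grow by one per ACK below ssthresh, roughly one
    per window above it, collapse on timeout) are shown.

```c
#include <assert.h>

/* Congestion state counted in packets (an assumption of this sketch). */
struct ex_cong {
    unsigned int cwnd;        /* congestion window */
    unsigned int ssthresh;    /* slow-start threshold */
    unsigned int acked_in_ca; /* ACKs seen since the last CA increase */
};

/* Called for each ACK that acknowledges new data. */
static void ex_on_ack(struct ex_cong *c)
{
    if (c->cwnd < c->ssthresh) {
        c->cwnd++;            /* slow start: +1 per ACK */
        return;
    }
    if (++c->acked_in_ca >= c->cwnd) {
        c->acked_in_ca = 0;   /* congestion avoidance: +1 per window */
        c->cwnd++;
    }
}

/* Called when a retransmission timeout fires. */
static void ex_on_timeout(struct ex_cong *c, unsigned int in_flight)
{
    unsigned int half = in_flight / 2;

    c->ssthresh = half > 2 ? half : 2;
    c->cwnd = 1;
    c->acked_in_ca = 0;
}

/* Usage: three ACKs take the window from 1 to the threshold of 4. */
static void ex_demo(void)
{
    struct ex_cong c = { 1, 4, 0 };
    int i;

    for (i = 0; i < 3; i++)
        ex_on_ack(&c);
    assert(c.cwnd == 4);
}
```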
     
  • In commit a75e8005d506f3 ("i40e: queue-specific settings for interrupt
    moderation") the i40e driver gained support for setting interrupt
    moderation values per queue. This patch adds support for this feature
    to the i40evf driver as well. In addition, a few changes are made to
    the i40e implementation to add function header documentation comments,
    as well.

    This behaves in a similar fashion to the implementation in i40e. Thus,
    requesting the moderation value when no queue is provided will report
    queue 0 value, while setting the value without a queue will set all
    queues at once.

    Change-ID: I1f310a57c8e6c84a8524c178d44d1b7a6d3a848e
    Signed-off-by: Jacob Keller
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Jacob Keller
     
  • In some rare cases, we might get a VSI with no queues. In this case, we
    cannot configure RSS on this VSI as it will try to divide by zero when
    configuring the lookup table.

    Change-ID: I6ae173a7dd3481a081e079eb10eb80275de2adb0
    Signed-off-by: Mitch Williams
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Mitch Williams
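
    The divide-by-zero mentioned above is easy to picture with a lookup
    table that spreads entries across queues using a modulo. The sketch
    below is an illustration only; the function name, the table layout and
    the choice of -EINVAL are assumptions, not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Fill an RSS-style lookup table by cycling through the queues.
 * With zero queues the modulo below would divide by zero, so bail out. */
static int ex_fill_rss_lut(uint8_t *lut, size_t lut_size,
                           unsigned int num_queues)
{
    size_t i;

    if (num_queues == 0)
        return -EINVAL;

    for (i = 0; i < lut_size; i++)
        lut[i] = (uint8_t)(i % num_queues);
    return 0;
}

/* Usage: a table of 8 entries over 3 queues, then the empty case. */
static void ex_demo(void)
{
    uint8_t lut[8];

    assert(ex_fill_rss_lut(lut, 8, 3) == 0);
    assert(lut[0] == 0 && lut[4] == 1 && lut[7] == 1);
    assert(ex_fill_rss_lut(lut, 8, 0) == -EINVAL);
}
```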
     
  • This interface was only ever meant for debugging. Since it is not
    supposed to be here, we are removing it.

    Change-ID: Id771a1e5e7d3e2b4b7f56591b61fb48c921e1d04
    Signed-off-by: Alexander Duyck
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Alexander Duyck
     
  • In an effort to improve code readability I am splitting the Flow Director
    filter configuration out into a separate function like we have done for the
    standard xmit path. The general idea is to provide a single block of code
    that translates the flow specification into a proper Flow Director
    descriptor.

    Change-ID: Id355ad8030c4e6c72c57504fa09de60c976a8ffe
    Signed-off-by: Alexander Duyck
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Alexander Duyck
     
  • This patch adds a txring_txq function which allows us to convert a
    i40e_ring/i40evf_ring to a netdev_tx_queue structure. This way we
    can avoid having to make a multi-line function call for all the spots
    that need access to this.

    Change-ID: Ic063b71d8b92ea406d2c32e798c8e2b02809d65b
    Signed-off-by: Alexander Duyck
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Alexander Duyck
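
    The kind of helper this commit describes might look like the sketch
    below. The types are simplified stand-ins with hypothetical names, not
    the i40e or netdev structures; the point is only that a one-line
    accessor replaces a repeated multi-line lookup from ring to queue.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the transmit queue, the net device and the ring. */
struct ex_txq {
    unsigned long state;
};

struct ex_netdev {
    struct ex_txq *tx;
    unsigned int num_tx_queues;
};

struct ex_ring {
    struct ex_netdev *netdev;
    unsigned short queue_index;
};

/* Map a ring to the transmit queue it feeds. */
static inline struct ex_txq *ex_txring_txq(const struct ex_ring *ring)
{
    assert(ring->queue_index < ring->netdev->num_tx_queues);
    return &ring->netdev->tx[ring->queue_index];
}

/* Usage: ring 2 of a four-queue device resolves to the third queue. */
static void ex_demo(void)
{
    struct ex_txq queues[4] = { { 0 } };
    struct ex_netdev dev = { queues, 4 };
    struct ex_ring ring = { &dev, 2 };

    assert(ex_txring_txq(&ring) == &queues[2]);
}
```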
     
  • The Tx cleanup flow was incorrectly assuming it could check for the flow
    director bits after it had unmapped the buffer. However in this case it
    results in us trying to free a raw_buf as though it is an sk_buff.

    To fix this I am moving up the flag test for the FD_SB bit so that when
    we find a non-NULL skb or raw_buf value we then check the flag and use
    the appropriate call to free the buffer.

    Change-ID: I6284034ba1ea87c9922e56f6eb3181f7f09bddde
    Signed-off-by: Alexander Duyck
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Alexander Duyck
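
    The ordering problem described above can be illustrated with a small
    model in which one pointer field holds either a packet buffer or a raw
    buffer and a flag says which. Everything here is hypothetical (the
    struct layout, the flag value, the ex_ names); the sketch only shows
    that the flag has to be tested before choosing how to free the memory.

```c
#include <assert.h>
#include <stdlib.h>

#define EX_TX_FLAGS_FD_SB 0x1u /* assumed flag: buffer is a raw buffer */

/* Stand-in for a packet buffer that owns a separate data area. */
struct ex_skb {
    void *data;
};

/* One Tx slot: mem is either a struct ex_skb or a raw malloc() buffer. */
struct ex_tx_buffer {
    void *mem;
    unsigned int tx_flags;
};

enum ex_free_path { EX_FREED_NONE, EX_FREED_SKB, EX_FREED_RAW };

static void ex_free_skb(struct ex_skb *skb)
{
    free(skb->data);
    free(skb);
}

/* Check the flag first, then free with the matching routine. Treating a
 * raw buffer as an ex_skb would dereference memory that is not one. */
static enum ex_free_path ex_free_tx_buffer(struct ex_tx_buffer *buf)
{
    enum ex_free_path path = EX_FREED_NONE;

    if (buf->mem) {
        if (buf->tx_flags & EX_TX_FLAGS_FD_SB) {
            free(buf->mem);
            path = EX_FREED_RAW;
        } else {
            ex_free_skb(buf->mem);
            path = EX_FREED_SKB;
        }
    }
    buf->mem = NULL;
    buf->tx_flags = 0;
    return path;
}

/* Usage: a raw buffer is released through the raw path. */
static void ex_demo(void)
{
    struct ex_tx_buffer b = { malloc(64), EX_TX_FLAGS_FD_SB };

    assert(b.mem != NULL);
    assert(ex_free_tx_buffer(&b) == EX_FREED_RAW);
    assert(b.mem == NULL);
}
```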
     
  • All of the code to support adaptive interrupt throttling is already in
    the interrupt handler, it just needs to be enabled. Fill out the data
    structures properly to make it happen. Single-flow traffic tests may
    show slightly lower throughput, but interrupts per second will drop by
    about 75%.

    Change-ID: I9cd7d42c025b906bf1bb85c6aeb6112684aa6471
    Signed-off-by: Mitch Williams
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Mitch Williams