10 May, 2019

1 commit

  • User space can flip the clean_acked_data_enabled static branch
    on and off with TLS offload when CONFIG_TLS_DEVICE is enabled.
    jump_label.h suggests we use the delayed version in this case.

    Deferred branches now also don't take the branch mutex on
    decrement, so we avoid potential locking issues.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

09 May, 2019

1 commit

  • inet_iif should be used for the raw socket lookup. inet_iif considers
    rt_iif which handles the case of local traffic.

    As it stands, ping to a local address with the '-I ' option fails
    ever since ping was changed to use SO_BINDTODEVICE instead of
    cmsg + IP_PKTINFO.

    IPv6 works fine.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

08 May, 2019

2 commits

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

    2) Add fib_sync_mem to control the amount of dirty memory we allow to
    queue up between synchronize RCU calls, from David Ahern.

    3) Make flow classifier more lockless, from Vlad Buslov.

    4) Add PHY downshift support to aquantia driver, from Heiner
    Kallweit.

    5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
    contention on SLAB spinlocks in heavy RPC workloads.

    6) Partial GSO offload support in XFRM, from Boris Pismenny.

    7) Add fast link down support to ethtool, from Heiner Kallweit.

    8) Use siphash for IP ID generator, from Eric Dumazet.

    9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
    entries, from David Ahern.

    10) Move skb->xmit_more into a per-cpu variable, from Florian
    Westphal.

    11) Improve eBPF verifier speed and increase maximum program size,
    from Alexei Starovoitov.

    12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
    spinlocks. From Neil Brown.

    13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

    14) Improve link partner cap detection in generic PHY code, from
    Heiner Kallweit.

    15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
    Maguire.

    16) Remove SKB list implementation assumptions in SCTP, yours truly.

    17) Various cleanups, optimizations, and simplifications in r8169
    driver. From Heiner Kallweit.

    18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

    19) Switch PHY drivers over to use dynamic feature detection, from
    Heiner Kallweit.

    20) Support flow steering without masking in dpaa2-eth, from Ioana
    Ciocoi.

    21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
    Pirko.

    22) Increase the strict parsing of current and future netlink
    attributes, also export such policies to userspace. From Johannes
    Berg.

    23) Allow DSA tag drivers to be modular, from Andrew Lunn.

    24) Remove legacy DSA probing support, also from Andrew Lunn.

    25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
    Haabendal.

    26) Add a generic tracepoint for TX queue timeouts to ease debugging,
    from Cong Wang.

    27) More indirect call optimizations, from Paolo Abeni"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
    cxgb4: Fix error path in cxgb4_init_module
    net: phy: improve pause mode reporting in phy_print_status
    dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
    net: macb: Change interrupt and napi enable order in open
    net: ll_temac: Improve error message on error IRQ
    net/sched: remove block pointer from common offload structure
    net: ethernet: support of_get_mac_address new ERR_PTR error
    net: usb: smsc: fix warning reported by kbuild test robot
    staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
    net: dsa: support of_get_mac_address new ERR_PTR error
    net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
    vrf: sit mtu should not be updated when vrf netdev is the link
    net: dsa: Fix error cleanup path in dsa_init_module
    l2tp: Fix possible NULL pointer dereference
    taprio: add null check on sched_nest to avoid potential null pointer dereference
    net: mvpp2: cls: fix less than zero check on a u32 variable
    net_sched: sch_fq: handle non connected flows
    net_sched: sch_fq: do not assume EDT packets are ordered
    net: hns3: use devm_kcalloc when allocating desc_cb
    net: hns3: some cleanup for struct hns3_enet_ring
    ...

    Linus Torvalds
     
  • Minor conflict with the DSA legacy code removal.

    Signed-off-by: David S. Miller

    David S. Miller
     

07 May, 2019

1 commit

  • Pull RCU updates from Ingo Molnar:
    "This cycle's RCU changes include:

    - a couple of straggling RCU flavor consolidation updates

    - SRCU updates

    - RCU CPU stall-warning updates

    - torture-test updates

    - an LKMM commit adding support for synchronize_srcu_expedited()

    - documentation updates

    - miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
    net/ipv4/netfilter: Update comment from call_rcu_bh() to call_rcu()
    tools/memory-model: Add support for synchronize_srcu_expedited()
    doc/kprobes: Update obsolete RCU update functions
    torture: Suppress false-positive CONFIG_INITRAMFS_SOURCE complaint
    locktorture: NULL cxt.lwsa and cxt.lrsa to allow bad-arg detection
    rcuperf: Fix cleanup path for invalid perf_type strings
    rcutorture: Fix cleanup path for invalid torture_type strings
    rcutorture: Fix expected forward progress duration in OOM notifier
    rcutorture: Remove ->ext_irq_conflict field
    rcutorture: Make rcutorture_extend_mask() comment match the code
    tools/.../rcutorture: Convert to SPDX license identifier
    torture: Don't try to offline the last CPU
    rcu: Fix nohz status in stall warning
    rcu: Move forward-progress checkers into tree_stall.h
    rcu: Move irq-disabled stall-warning checking to tree_stall.h
    rcu: Organize functions in tree_stall.h
    rcu: Move FAST_NO_HZ stall-warning code to tree_stall.h
    rcu: Inline RCU stall-warning info helper functions
    rcu: Move rcu_print_task_exp_stall() to tree_exp.h
    rcu: Inline RCU task stall-warning helper functions
    ...

    Linus Torvalds
     

06 May, 2019

3 commits

  • Pablo Neira Ayuso says:

    ===================
    Netfilter updates for net-next

    The following batch contains Netfilter updates for net-next, they are:

    1) Move nft_expr_clone() to nft_dynset, from Paul Gortmaker.

    2) Do not include module.h from net/netfilter/nf_tables.h,
    also from Paul.

    3) Restrict conntrack sysctl entries to boolean, from Tonghao Zhang.

    4) Several patches to add infrastructure to autoload NAT helper
    modules from their respective conntrack helper, this also includes
    the first client of this code in OVS, patches from Flavio Leitner.

    5) Add support to match for conntrack ID, from Brett Mastbergen.

    6) Spelling fix in connlabel, from Colin Ian King.

    7) Use struct_size() from hashlimit, from Gustavo A. R. Silva.

    8) Add optimized version of nf_inet_addr_mask(), from Li RongQing.
    ===================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • So that we avoid another indirect call per RX packet, if
    early demux is enabled.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • So that we avoid another indirect call per RX packet in the common
    case.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

05 May, 2019

3 commits

  • Similar to the cached routes, make IPv4 exceptions accessible when
    using an IPv6 nexthop struct with IPv4 routes. Simplify the exception
    functions by passing in fib_nh_common since that is all it needs,
    and then cleanup the call sites that have extraneous fib_nh conversions.

    As with the cached routes this is a change in location only, from fib_nh
    up to fib_nh_common; no functional change intended.

    Signed-off-by: David Ahern
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    David Ahern
     
  • Now that the cached routes are in fib_nh_common, pass it to
    rt_cache_route and simplify its callers. For rt_set_nexthop,
    the tclassid becomes the last user of fib_nh so move the
    container_of under the #ifdef CONFIG_IP_ROUTE_CLASSID.

    Signed-off-by: David Ahern
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    David Ahern
     
  • While the cached routes, nh_pcpu_rth_output and nh_rth_input, are IPv4
    specific, a later patch wants to make them accessible for IPv6 nexthops
    with IPv4 routes using a fib6_nh. Move the cached routes from fib_nh to
    fib_nh_common and update references.

    Initialization of the cached entries is moved to fib_nh_common_init,
    and free is moved to fib_nh_common_release.

    Change in location only, from fib_nh up to fib_nh_common; no functional
    change intended.

    Signed-off-by: David Ahern
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    David Ahern
     

04 May, 2019

1 commit

  • e is the counter used to save the location of a dump when an
    skb is filled. Once the walk of the table is complete, mr_table_dump
    needs to return without resetting that index to 0. Dump of a specific
    table is looping because of the reset because there is no way to
    indicate the walk of the table is done.

    Move the reset to the caller so the dump of each table starts at 0,
    but the loop counter is maintained if a dump fills an skb.
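
    The resume logic described above can be sketched as a small userspace
    model. This is not the kernel code: the struct, the function names and
    the "room" counter standing in for a filling skb are all hypothetical.
    It only illustrates why the per-table walk must leave the counter alone
    and the caller must do the reset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dump position: which table, and which entry inside it. */
struct ex_dump_state {
    unsigned int table;
    unsigned int e;
};

/*
 * Walk one table starting at st->e. Each emitted entry consumes one unit
 * of *room (a stand-in for space left in the skb). Returns -1 when the
 * buffer is full, 0 when the walk is complete. On completion st->e is
 * NOT reset, so a repeated call for the same table emits nothing.
 */
int ex_table_dump(unsigned int n_entries, struct ex_dump_state *st,
                  unsigned int *room)
{
    assert(st != NULL && room != NULL);
    while (st->e < n_entries) {
        if (*room == 0)
            return -1;      /* st->e records where to resume */
        (*room)--;
        st->e++;
    }
    return 0;
}

/* Caller walking every table: the reset to 0 lives here. */
int ex_dump_all(const unsigned int *sizes, unsigned int n_tables,
                struct ex_dump_state *st, unsigned int room)
{
    assert(sizes != NULL && st != NULL);
    for (; st->table < n_tables; st->table++) {
        if (ex_table_dump(sizes[st->table], st, &room) < 0)
            return -1;
        st->e = 0;          /* next table starts from its first entry */
    }
    return 0;
}
```

    In this model, calling ex_table_dump() again after a completed walk
    finds st->e already at the end and returns immediately, which is the
    "walk is done" signal the message says was missing.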

    Fixes: e1cedae1ba6b0 ("ipmr: Refactor mr_rtm_dumproute")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

03 May, 2019

1 commit


02 May, 2019

2 commits

  • syzbot was able to crash host by sending UDP packets with a 0 payload.

    TCP does not have this issue since we do not aggregate packets without
    payload.

    Since dev_gro_receive() sets gso_size based on skb_gro_len(skb)
    it seems not worth trying to cope with padded packets.
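
    A minimal userspace model of the kind of length check this message
    implies is shown below. The struct and function names are hypothetical,
    gro_len merely stands in for skb_gro_len(skb), and the actual kernel
    change may be written differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_UDP_HDR_LEN 8u   /* a UDP header is 8 bytes */

/* Hypothetical view of one received UDP packet. */
struct ex_udp_pkt {
    uint16_t udp_len;   /* length field from the UDP header (header + payload) */
    size_t gro_len;     /* bytes actually present for this packet */
};

/* Returns true only if the packet is a sane candidate for aggregation. */
bool ex_udp_gro_candidate(const struct ex_udp_pkt *pkt)
{
    assert(pkt != NULL);
    /* Zero payload: nothing to aggregate, so refuse. */
    if (pkt->udp_len <= EX_UDP_HDR_LEN)
        return false;
    /* Header length disagrees with the data present (padding): refuse. */
    if (pkt->udp_len != pkt->gro_len)
        return false;
    return true;
}
```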

    BUG: KASAN: slab-out-of-bounds in skb_gro_receive+0xf5f/0x10e0 net/core/skbuff.c:3826
    Read of size 16 at addr ffff88808893fff0 by task syz-executor612/7889

    CPU: 0 PID: 7889 Comm: syz-executor612 Not tainted 5.1.0-rc7+ #96
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
    kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
    __asan_report_load16_noabort+0x14/0x20 mm/kasan/generic_report.c:133
    skb_gro_receive+0xf5f/0x10e0 net/core/skbuff.c:3826
    udp_gro_receive_segment net/ipv4/udp_offload.c:382 [inline]
    call_gro_receive include/linux/netdevice.h:2349 [inline]
    udp_gro_receive+0xb61/0xfd0 net/ipv4/udp_offload.c:414
    udp4_gro_receive+0x763/0xeb0 net/ipv4/udp_offload.c:478
    inet_gro_receive+0xe72/0x1110 net/ipv4/af_inet.c:1510
    dev_gro_receive+0x1cd0/0x23c0 net/core/dev.c:5581
    napi_gro_frags+0x36b/0xd10 net/core/dev.c:5843
    tun_get_user+0x2f24/0x3fb0 drivers/net/tun.c:1981
    tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2027
    call_write_iter include/linux/fs.h:1866 [inline]
    do_iter_readv_writev+0x5e1/0x8e0 fs/read_write.c:681
    do_iter_write fs/read_write.c:957 [inline]
    do_iter_write+0x184/0x610 fs/read_write.c:938
    vfs_writev+0x1b3/0x2f0 fs/read_write.c:1002
    do_writev+0x15e/0x370 fs/read_write.c:1037
    __do_sys_writev fs/read_write.c:1110 [inline]
    __se_sys_writev fs/read_write.c:1107 [inline]
    __x64_sys_writev+0x75/0xb0 fs/read_write.c:1107
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x441cc0
    Code: 05 48 3d 01 f0 ff ff 0f 83 9d 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 83 3d 51 93 29 00 00 75 14 b8 14 00 00 00 0f 05 3d 01 f0 ff ff 0f 83 74 09 fc ff c3 48 83 ec 08 e8 ba 2b 00 00
    RSP: 002b:00007ffe8c716118 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
    RAX: ffffffffffffffda RBX: 00007ffe8c716150 RCX: 0000000000441cc0
    RDX: 0000000000000001 RSI: 00007ffe8c716170 RDI: 00000000000000f0
    RBP: 0000000000000000 R08: 000000000000ffff R09: 0000000000a64668
    R10: 0000000020000040 R11: 0000000000000246 R12: 000000000000c2d9
    R13: 0000000000402b50 R14: 0000000000000000 R15: 0000000000000000

    Allocated by task 5143:
    save_stack+0x45/0xd0 mm/kasan/common.c:75
    set_track mm/kasan/common.c:87 [inline]
    __kasan_kmalloc mm/kasan/common.c:497 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:505
    slab_post_alloc_hook mm/slab.h:437 [inline]
    slab_alloc mm/slab.c:3393 [inline]
    kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3555
    mm_alloc+0x1d/0xd0 kernel/fork.c:1030
    bprm_mm_init fs/exec.c:363 [inline]
    __do_execve_file.isra.0+0xaa3/0x23f0 fs/exec.c:1791
    do_execveat_common fs/exec.c:1865 [inline]
    do_execve fs/exec.c:1882 [inline]
    __do_sys_execve fs/exec.c:1958 [inline]
    __se_sys_execve fs/exec.c:1953 [inline]
    __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 5351:
    save_stack+0x45/0xd0 mm/kasan/common.c:75
    set_track mm/kasan/common.c:87 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
    __cache_free mm/slab.c:3499 [inline]
    kmem_cache_free+0x86/0x260 mm/slab.c:3765
    __mmdrop+0x238/0x320 kernel/fork.c:677
    mmdrop include/linux/sched/mm.h:49 [inline]
    finish_task_switch+0x47b/0x780 kernel/sched/core.c:2746
    context_switch kernel/sched/core.c:2880 [inline]
    __schedule+0x81b/0x1cc0 kernel/sched/core.c:3518
    preempt_schedule_irq+0xb5/0x140 kernel/sched/core.c:3745
    retint_kernel+0x1b/0x2d
    arch_local_irq_restore arch/x86/include/asm/paravirt.h:767 [inline]
    kmem_cache_free+0xab/0x260 mm/slab.c:3766
    anon_vma_chain_free mm/rmap.c:134 [inline]
    unlink_anon_vmas+0x2ba/0x870 mm/rmap.c:401
    free_pgtables+0x1af/0x2f0 mm/memory.c:394
    exit_mmap+0x2d1/0x530 mm/mmap.c:3144
    __mmput kernel/fork.c:1046 [inline]
    mmput+0x15f/0x4c0 kernel/fork.c:1067
    exec_mmap fs/exec.c:1046 [inline]
    flush_old_exec+0x8d9/0x1c20 fs/exec.c:1279
    load_elf_binary+0x9bc/0x53f0 fs/binfmt_elf.c:864
    search_binary_handler fs/exec.c:1656 [inline]
    search_binary_handler+0x17f/0x570 fs/exec.c:1634
    exec_binprm fs/exec.c:1698 [inline]
    __do_execve_file.isra.0+0x1394/0x23f0 fs/exec.c:1818
    do_execveat_common fs/exec.c:1865 [inline]
    do_execve fs/exec.c:1882 [inline]
    __do_sys_execve fs/exec.c:1958 [inline]
    __se_sys_execve fs/exec.c:1953 [inline]
    __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff88808893f7c0
    which belongs to the cache mm_struct of size 1496
    The buggy address is located 600 bytes to the right of
    1496-byte region [ffff88808893f7c0, ffff88808893fd98)
    The buggy address belongs to the page:
    page:ffffea0002224f80 count:1 mapcount:0 mapping:ffff88821bc40ac0 index:0xffff88808893f7c0 compound_mapcount: 0
    flags: 0x1fffc0000010200(slab|head)
    raw: 01fffc0000010200 ffffea00025b4f08 ffffea00027b9d08 ffff88821bc40ac0
    raw: ffff88808893f7c0 ffff88808893e440 0000000100000001 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff88808893fe80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88808893ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88808893ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ^
    ffff888088940000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff888088940080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Previously, during fragmentation after forwarding, skb->skb_iif isn't
    preserved, i.e. 'ip_copy_metadata' does not copy skb_iif from given
    'from' skb.

    As a result, ip_do_fragment creates fragments with zero skb_iif,
    leading to inconsistent behavior.

    Assume for example an eBPF program attached at tc egress (post
    forwarding) that examines __sk_buff->ingress_ifindex:
    - the correct iif is observed if forwarding path does not involve
    fragmentation/refragmentation
    - a bogus iif is observed if forwarding path involves
    fragmentation/refragmentation

    Fix, by preserving skb_iif during 'ip_copy_metadata'.

    Signed-off-by: Shmulik Ladkani
    Signed-off-by: David S. Miller

    Shmulik Ladkani
     

01 May, 2019

8 commits

  • Relocate the congestion window initialization from tcp_init_metrics()
    to tcp_init_transfer() to improve code readability.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Use a helper to consolidate two identical code blocks for passive TFO.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch makes passive Fast Open revert the cwnd to the default
    initial cwnd (10 packets) if the SYNACK timeout is spurious.

    Passive Fast Open uses a full socket during handshake so it can
    use the existing undo logic to detect spurious retransmission
    by recording the first SYNACK timeout in key state variable
    retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket
    has sent some data before the timeout, the spurious timeout
    is detected by tcp_try_undo_recovery() in tcp_process_loss()
    in tcp_ack().

    But if the socket has not sent any data yet, tcp_ack() does not
    execute the undo code since no data is acknowledged. The fix is to
    check such case explicitly after tcp_ack() during the ACK processing
    in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state
    in case the server closes the socket before handshake completes.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • TCP sender would use congestion window of 1 packet on the second SYN
    and SYNACK timeout except passive TCP Fast Open. This makes passive
    TFO too aggressive and unfair during congestion at handshake. This
    patch fixes this issue so TCP (fast open or not, passive or active)
    always conforms to the RFC6298.

    Note that tcp_enter_loss() is called only once during recurring
    timeouts. This is because during handshake, high_seq and snd_una
    are the same so tcp_enter_loss() would incorrectly set the undo state
    variables multiple times.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Linux implements RFC6298 and uses an initial congestion window
    of 1 upon establishing the connection if the SYNACK packet is
    retransmitted 2 or more times. In cellular networks SYNACK timeouts
    are often spurious if the wireless radio was dormant or idle. Also,
    some network paths are longer than the default SYNACK timeout. In
    both cases falsely starting with a minimal cwnd is detrimental
    to performance.

    This patch avoids doing so when the final ACK's TCP timestamp
    indicates the original SYNACK was delivered. It remembers the
    original SYNACK timestamp when SYNACK timeout has occurred and
    re-uses the function to detect spurious SYN timeout conveniently.
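
    The timestamp test described above can be modelled in a few lines.
    This sketch assumes the remembered value is the TSval carried by the
    original SYNACK and that 0 means "no timeout recorded"; the struct and
    function names are hypothetical and this is not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the timestamp option on the final handshake ACK. */
struct ex_ack_tsopt {
    bool saw_tstamp;    /* ACK carried a TCP timestamp option */
    uint32_t tsecr;     /* timestamp value echoed back by the peer */
};

/*
 * The peer echoes the timestamp of the segment it is answering. If the
 * echo matches the original SYNACK, that SYNACK was delivered and the
 * timeout that triggered the retransmission was spurious.
 */
bool ex_synack_timeout_spurious(uint32_t orig_synack_tsval,
                                const struct ex_ack_tsopt *ack)
{
    assert(ack != NULL);
    return orig_synack_tsval != 0 &&
           ack->saw_tstamp &&
           ack->tsecr == orig_synack_tsval;
}
```

    An ACK echoing the timestamp of a retransmitted SYNACK, or carrying no
    timestamp option at all, gives no evidence that the original arrived,
    so the model treats the timeout as genuine in both cases.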

    Note that a server may receive multiple SYNs and immediately
    retransmit SYNACKs without any SYNACK timeout. This often happens
    when the client SYNs have timed out due to the wireless delay
    above. In this case the server will still use the default
    initial congestion window (e.g. 10) because tp->undo_marker is reset in
    tcp_init_metrics(). This is an intentional design because packets
    are not lost but delayed.

    This patch only covers regular TCP passive open. Fast Open is
    supported in the next patch.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Detecting spurious SYNACK timeout using timestamp option requires
    recording the exact SYNACK skb timestamp. Previously the SYNACK
    sent timestamp was stamped slightly earlier before the skb
    was transmitted. This patch uses the SYNACK skb transmission
    timestamp directly.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Linux implements RFC6298 and uses an initial congestion window of 1
    upon establishing the connection if the SYN packet is retransmitted 2
    or more times. In cellular networks SYN timeouts are often spurious
    if the wireless radio was dormant or idle. Also, some network paths
    are longer than the default SYN timeout. Having a minimal cwnd in
    both cases is detrimental to TCP startup performance.

    This patch extends TCP undo feature (RFC3522 aka TCP Eifel) to detect
    spurious SYN timeout via TCP timestamps. Since tp->retrans_stamp
    records the initial SYN timestamp instead of first retransmission, we
    have to implement separate undo code. The detection
    also must happen before tcp_ack() as retrans_stamp is reset when
    SYN is acknowledged.

    Note this patch covers both active regular and fast open.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously if an active TCP open had a SYN timeout, it always undid the
    cwnd upon receiving the SYNACK. This is because tcp_clean_rtx_queue
    would reset tp->retrans_stamp when SYN is acked, which then fools
    tcp_try_undo_loss and tcp_packet_delayed. Addressing this issue is
    required to properly support undo for spurious SYN timeout.

    Fixing this is tricky -- for active TCP open tp->retrans_stamp
    records the time when the handshake starts, not the first
    retransmission time as the name may suggest. The simplest fix is
    for tcp_packet_delayed to ensure it is valid before comparing with
    other timestamp.

    One side effect of this change concerns active TCP Fast Open that
    incurred a SYN timeout. Upon receiving a SYN-ACK that only acknowledged
    the SYN, it would immediately retransmit unacknowledged data in
    tcp_ack() because the data is marked lost after SYN timeout. But
    the retransmission would have an incorrect ack sequence number since
    rcv_nxt has not yet been updated in tcp_rcv_synsent_state_process(), so
    the retransmission needs to be properly handled by
    tcp_rcv_fastopen_synack() like before.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

30 Apr, 2019

4 commits

  • Steffen Klassert says:

    ====================
    pull request (net-next): ipsec-next 2019-04-30

    1) A lot of work to remove indirections from the xfrm code.
    From Florian Westphal.

    2) Support ESP offload in combination with gso partial.
    From Boris Pismenny.

    3) Remove some duplicated code from vti4.
    From Jeremy Sowden.

    Please note that there is merge conflict

    between commit:

    8742dc86d0c7 ("xfrm4: Fix uninitialized memory read in _decode_session4")

    from the ipsec tree and commit:

    c53ac41e3720 ("xfrm: remove decode_session indirection from afinfo_policy")

    from the ipsec-next tree. The merge conflict will appear
    when those trees get merged during the merge window.
    The conflict can be solved as it is done in linux-next:

    https://lkml.org/lkml/2019/4/25/1207

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Steffen Klassert says:

    ====================
    pull request (net): ipsec 2019-04-30

    1) Fix out-of-bounds array accesses in __xfrm_policy_unlink.
    From YueHaibing.

    2) Reset the secpath on failure in the ESP GRO handlers
    to avoid dereferencing an invalid pointer on error.
    From Myungho Jung.

    3) Add and revert a patch that tried to add rcu annotations
    to netns_xfrm. From Su Yanjun.

    4) Wait for rcu callbacks before freeing xfrm6_tunnel_spi_kmem.
    From Su Yanjun.

    5) Fix forgotten vti4 ipip tunnel deregistration.
    From Jeremy Sowden.

    6) Remove some duplicated log messages in vti4.
    From Jeremy Sowden.

    7) Don't use IPSEC_PROTO_ANY when flushing states because
    this will flush only IPsec protocol specific states.
    IPPROTO_ROUTING states may remain in the lists when
    doing net exit. Fix this by replacing IPSEC_PROTO_ANY
    with zero. From Cong Wang.

    8) Add length check for UDP encapsulation to fix "Oversized IP packet"
    warnings on receive side. From Sabrina Dubroca.

    9) Fix xfrm interface lookup when the interface is associated to
    a vrf layer 3 master device. From Martin Willi.

    10) Reload header pointers after pskb_may_pull() in _decode_session4(),
    otherwise we may read from uninitialized memory.

    11) Update the documentation about xfrm[46]_gc_thresh, it
    is not used anymore after the flowcache removal.
    From Nicolas Dichtel.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Each NAT helper creates a module alias which follows a pattern.
    Use macros for consistency.

    Signed-off-by: Flavio Leitner
    Signed-off-by: Pablo Neira Ayuso

    Flavio Leitner
     
  • Richard and Bruno both reported that my commit added a bug,
    and Bruno was able to determine the problem came when a segment
    with a FIN packet was coalesced to a prior one in tcp backlog queue.

    It turns out the header prediction in tcp_rcv_established()
    looks back to TCP headers in the packet, not in the metadata
    (aka TCP_SKB_CB(skb)->tcp_flags)

    The fast path in tcp_rcv_established() is not supposed to
    handle a FIN flag (it does not call tcp_fin())

    Therefore we need to make sure to propagate the FIN flag,
    so that the coalesced packet does not go through the fast path,
    the same as a GRO packet carrying a FIN flag.

    While we are at it, make sure we do not coalesce packets with
    RST or SYN, or if they do not have ACK set.
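
    The flag rules listed above can be expressed as a small userspace
    model. The function name is hypothetical and the flag constants are
    defined locally with the standard TCP header bit values; the real
    backlog code checks more than flags before merging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* TCP header flag bits (standard header layout). */
#define EX_FLAG_FIN 0x01
#define EX_FLAG_SYN 0x02
#define EX_FLAG_RST 0x04
#define EX_FLAG_ACK 0x10

/*
 * Decide whether a new segment may be merged into the tail segment of
 * the backlog, looking at flags only. On success the tail's flags are
 * updated so that a FIN on the new segment is not lost.
 */
bool ex_backlog_try_coalesce(uint8_t *tail_flags, uint8_t skb_flags)
{
    assert(tail_flags != NULL);
    /* Never coalesce when either segment carries SYN or RST. */
    if ((*tail_flags | skb_flags) & (EX_FLAG_SYN | EX_FLAG_RST))
        return false;
    /* Both segments must have ACK set. */
    if (!(*tail_flags & skb_flags & EX_FLAG_ACK))
        return false;
    /* Propagate FIN so the merged segment avoids the fast path. */
    *tail_flags = (uint8_t)(*tail_flags | (skb_flags & EX_FLAG_FIN));
    return true;
}
```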

    Many thanks to Richard and Bruno for pinpointing the bad commit,
    and to Richard for providing a first version of the fix.

    Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue")
    Signed-off-by: Eric Dumazet
    Reported-by: Richard Purdie
    Reported-by: Bruno Prémont
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2019

4 commits

  • Currently, the UDP GRO code path does bad things on some edge
    conditions: aggregation can happen even on packets with different
    lengths.

    Fix the above by rewriting the 'complete' condition for GRO
    packets. While at it, note explicitly that we allow merging the
    first packet per burst below gso_size.

    Reported-by: Sean Tong
    Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Add options to strictly validate messages and dump messages.
    Sometimes validating dump messages non-strictly may be required,
    so add an option for that as well.

    Since none of this can really be applied to existing commands,
    set the options everywhere using the following spatch:

    @@
    identifier ops;
    expression X;
    @@
    struct genl_ops ops[] = {
    ...,
    {
    .cmd = X,
    + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
    ...
    },
    ...
    };

    For new commands one should just not copy the .validate 'opt-out'
    flags and thus get strict validation.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().
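
    To make the four options concrete, here is a toy validator in which
    each flag enables one independent check. Everything prefixed EX_ or
    ex_ is hypothetical and is not the kernel's API. The model also assumes
    that an attribute shorter than the policy length is rejected in every
    mode, which the message does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per strictness option named in the message. */
enum {
    EX_VALIDATE_TRAILING     = 1 << 0,
    EX_VALIDATE_MAXTYPE      = 1 << 1,
    EX_VALIDATE_UNSPEC       = 1 << 2,
    EX_VALIDATE_STRICT_ATTRS = 1 << 3,
};
#define EX_VALIDATE_LIBERAL 0
#define EX_VALIDATE_DEPRECATED_STRICT \
    (EX_VALIDATE_TRAILING | EX_VALIDATE_MAXTYPE)
#define EX_VALIDATE_STRICT \
    (EX_VALIDATE_TRAILING | EX_VALIDATE_MAXTYPE | \
     EX_VALIDATE_UNSPEC | EX_VALIDATE_STRICT_ATTRS)

struct ex_attr   { uint16_t type; uint16_t len; };
struct ex_policy { bool specified; uint16_t len; };

/* policy[] must hold maxtype + 1 entries. Returns 0 if accepted, -1 if not. */
int ex_validate(unsigned int flags, const struct ex_attr *attr,
                const struct ex_policy *policy, uint16_t maxtype,
                size_t trailing)
{
    assert(attr != NULL && policy != NULL);
    if ((flags & EX_VALIDATE_TRAILING) && trailing > 0)
        return -1;                       /* garbage after the attributes */
    if (attr->type > maxtype)            /* unknown type: reject or ignore */
        return (flags & EX_VALIDATE_MAXTYPE) ? -1 : 0;
    const struct ex_policy *p = &policy[attr->type];
    if (!p->specified)                   /* no policy entry for this type */
        return (flags & EX_VALIDATE_UNSPEC) ? -1 : 0;
    if (attr->len < p->len)
        return -1;                       /* too short (model assumption) */
    if ((flags & EX_VALIDATE_STRICT_ATTRS) && attr->len != p->len)
        return -1;                       /* size must match exactly */
    return 0;
}
```

    With no flags set the model accepts oversized attributes and trailing
    bytes, as in the liberal mode above, while EX_VALIDATE_STRICT turns on
    all four checks at once.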

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.
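
    As a rough illustration of the "common function" idea, the sketch
    below funnels a validate-only entry point and a parse entry point
    into one worker that takes a strictness bitmask. Everything prefixed
    demo_/DEMO_ is hypothetical and heavily simplified (attributes are
    reduced to a type and a value); it is not the kernel implementation.

```c
#include <stddef.h>

/* Hypothetical, simplified attribute: a real netlink attribute also
 * carries a length and a payload; only the type matters here. */
struct demo_attr {
    unsigned short type;
    int value;
};

/* Hypothetical strictness flag, loosely modelled on the validation
 * flags the commit message talks about. */
#define DEMO_VALIDATE_MAXTYPE 0x1u /* reject types outside 1..maxtype */

/* Common worker: always validates, fills tb[] only when tb != NULL.
 * tb must have room for maxtype + 1 entries. */
static int demo_validate_parse(const struct demo_attr *attrs, size_t n,
                               unsigned short maxtype, unsigned int validate,
                               const struct demo_attr **tb)
{
    size_t i;

    if (tb)
        for (i = 0; i <= maxtype; i++)
            tb[i] = NULL;

    for (i = 0; i < n; i++) {
        unsigned short type = attrs[i].type;

        if (type == 0 || type > maxtype) {
            if (validate & DEMO_VALIDATE_MAXTYPE)
                return -1; /* strict: unknown attribute is an error */
            continue;      /* liberal: silently skip it */
        }
        if (tb)
            tb[type] = &attrs[i];
    }
    return 0;
}

/* Validate-only entry point: no table to fill. */
static int demo_validate(const struct demo_attr *attrs, size_t n,
                         unsigned short maxtype, unsigned int validate)
{
    return demo_validate_parse(attrs, n, maxtype, validate, NULL);
}

/* Parse entry point: same checks, plus the attribute table. */
static int demo_parse(const struct demo_attr **tb, unsigned short maxtype,
                      const struct demo_attr *attrs, size_t n,
                      unsigned int validate)
{
    return demo_validate_parse(attrs, n, maxtype, validate, tb);
}
```

    With validate == 0 the sketch behaves like the liberal, "deprecated"
    style of parsing: unknown attribute types are skipped. Passing
    DEMO_VALIDATE_MAXTYPE turns the same input into an error.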

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even though the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().
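
    The compatibility concern can be sketched in plain userspace C. The
    DEMO_ constants are local copies of the flag values from
    <linux/netlink.h>; the demo_ functions are hypothetical stand-ins
    that only compute the nla_type value each variant would emit, not
    the real helpers (which operate on an skb).

```c
#include <stdint.h>

/* Local copies of the uapi flag bits from <linux/netlink.h>. */
#define DEMO_NLA_F_NESTED        (1 << 15)
#define DEMO_NLA_F_NET_BYTEORDER (1 << 14)
#define DEMO_NLA_TYPE_MASK \
    (~(DEMO_NLA_F_NESTED | DEMO_NLA_F_NET_BYTEORDER))

/* Old behaviour: the attribute type goes out exactly as given. */
static uint16_t demo_nest_start_noflag(int attrtype)
{
    return (uint16_t)attrtype;
}

/* New wrapper: same thing, with the nested flag ORed in. */
static uint16_t demo_nest_start(int attrtype)
{
    return demo_nest_start_noflag(attrtype | DEMO_NLA_F_NESTED);
}

/* Userspace parser that compares nla_type directly: this is the
 * pattern that breaks once the flag is set. */
static int demo_naive_match(uint16_t nla_type, int wanted)
{
    return nla_type == wanted;
}

/* Userspace parser that masks out the flag bits first: works with
 * and without the flag. */
static int demo_masked_match(uint16_t nla_type, int wanted)
{
    return (nla_type & DEMO_NLA_TYPE_MASK) == wanted;
}
```

    A parser written like demo_naive_match() stops recognizing the
    attribute as soon as the flag appears, which is why existing callers
    are kept on the _noflag variant.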

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

26 Apr, 2019

1 commit


25 Apr, 2019

1 commit

  • Before calling __ip_options_compile(), we need to ensure the network
    header is an IPv4 one, and that it is already pulled in skb->head.

    RAW sockets going through a tunnel can end up calling ipv4_link_failure()
    with total garbage in the skb, or arbitrary lengths.
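
    The kind of sanity check described above can be sketched in
    userspace C on a raw byte buffer. demo_ipv4_optlen() is a
    hypothetical helper, not the code from this patch; the kernel works
    on an skb and has to make sure the bytes are linear before reading
    them, which a flat buffer sidesteps.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_IPV4_MIN_HDR 20u /* IPv4 header without options */

/* Returns the number of option bytes, or -1 if the buffer cannot be
 * trusted to start with a complete IPv4 header. */
static int demo_ipv4_optlen(const uint8_t *buf, size_t avail)
{
    unsigned int version, ihl;

    /* The fixed part of the header must be present at all. */
    if (avail < DEMO_IPV4_MIN_HDR)
        return -1;

    version = buf[0] >> 4;   /* high nibble of the first byte */
    ihl = buf[0] & 0x0f;     /* header length in 32-bit words */

    /* Not IPv4, or a header length below the legal minimum. */
    if (version != 4 || ihl < 5)
        return -1;

    /* The options the header claims must actually be in the buffer. */
    if ((size_t)ihl * 4 > avail)
        return -1;

    return (int)(ihl * 4 - DEMO_IPV4_MIN_HDR);
}
```

    Only when this returns a value greater than zero would a caller go
    on to parse options, and then with a length that has been checked
    against the data actually available.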

    syzbot report:

    BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline]
    BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
    Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204

    CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
    kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
    check_memory_region_inline mm/kasan/generic.c:185 [inline]
    check_memory_region+0x123/0x190 mm/kasan/generic.c:191
    memcpy+0x38/0x50 mm/kasan/common.c:133
    memcpy include/linux/string.h:355 [inline]
    __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
    __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695
    ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204
    dst_link_failure include/net/dst.h:427 [inline]
    vti6_xmit net/ipv6/ip6_vti.c:514 [inline]
    vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553
    __netdev_start_xmit include/linux/netdevice.h:4414 [inline]
    netdev_start_xmit include/linux/netdevice.h:4423 [inline]
    xmit_one net/core/dev.c:3292 [inline]
    dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308
    __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878
    dev_queue_xmit+0x18/0x20 net/core/dev.c:3911
    neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527
    neigh_output include/net/neighbour.h:508 [inline]
    ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229
    ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317
    NF_HOOK_COND include/linux/netfilter.h:278 [inline]
    ip_output+0x21f/0x670 net/ipv4/ip_output.c:405
    dst_output include/net/dst.h:444 [inline]
    NF_HOOK include/linux/netfilter.h:289 [inline]
    raw_send_hdrinc net/ipv4/raw.c:432 [inline]
    raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663
    inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
    sock_sendmsg_nosec net/socket.c:651 [inline]
    sock_sendmsg+0xdd/0x130 net/socket.c:661
    sock_write_iter+0x27c/0x3e0 net/socket.c:988
    call_write_iter include/linux/fs.h:1866 [inline]
    new_sync_write+0x4c7/0x760 fs/read_write.c:474
    __vfs_write+0xe4/0x110 fs/read_write.c:487
    vfs_write+0x20c/0x580 fs/read_write.c:549
    ksys_write+0x14f/0x2d0 fs/read_write.c:599
    __do_sys_write fs/read_write.c:611 [inline]
    __se_sys_write fs/read_write.c:608 [inline]
    __x64_sys_write+0x73/0xb0 fs/read_write.c:608
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x458c29
    Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29
    RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003
    RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4
    R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff

    The buggy address belongs to the page:
    page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
    flags: 0x1fffc0000000000()
    raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000
    raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
    ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00
    >ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
    ^
    ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00
    ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure")
    Signed-off-by: Eric Dumazet
    Cc: Stephen Suryaputra
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Apr, 2019

2 commits

  • nhc_flags holds the RTNH_F flags for a given nexthop (fib{6}_nh).
    All of the RTNH_F_ flags fit in an unsigned char, and since the API to
    userspace (rtnh_flags and lower byte of rtm_flags) is 1 byte it cannot
    grow. Make nhc_flags in fib_nh_common an unsigned char and shrink the
    size of the struct by 8 bytes, from 56 to 48 bytes.

    Update the flags arguments for up netdevice events and fib_nexthop_info
    which determines the RTNH_F flags to return on a dump/event. The RTNH_F
    flags are passed in the lower byte of rtm_flags which is an unsigned int
    so use a temp variable for the flags to fib_nexthop_info and combine
    with rtm_flags in the caller.
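
    Why narrowing one field can save a whole 8 bytes comes down to
    padding. The two structs below are hypothetical and much smaller
    than fib_nh_common; they only show a flags field moving into space
    next to other one-byte members. demo_combine_flags() assumes the
    caller simply ORs the one-byte nexthop flags into the wider
    rtm_flags value.

```c
#include <stddef.h>

/* Hypothetical layout with a 4-byte flags field. */
struct demo_nh_before {
    void *dev;
    int oif;
    unsigned int flags;   /* 4 bytes, only the low byte is ever used */
    unsigned char scope;
    unsigned char family; /* padding follows to align the pointer */
    void *lwtstate;
};

/* Same members, flags narrowed and grouped with the other bytes. */
struct demo_nh_after {
    void *dev;
    int oif;
    unsigned char flags;  /* now shares a word with scope and family */
    unsigned char scope;
    unsigned char family;
    void *lwtstate;
};

/* Caller side: a one-byte temporary is merged into the unsigned int
 * that is reported to userspace (assumed to be a plain OR here). */
static unsigned int demo_combine_flags(unsigned int rtm_flags,
                                       unsigned char nh_flags)
{
    return rtm_flags | nh_flags;
}
```

    On common 64-bit ABIs the first struct occupies 32 bytes and the
    second 24, so the saving is one full pointer-sized slot even though
    the field itself only shrank by 3 bytes.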

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Currently, lwtunnel_fill_encap hardcodes the encap and encap type
    attributes as RTA_ENCAP and RTA_ENCAP_TYPE, respectively. The nexthop
    objects want to re-use this code, but their encap attributes are passed
    to userspace as NHA_ENCAP and NHA_ENCAP_TYPE. Since that is the only
    difference, change lwtunnel_fill_encap to take the attribute types as
    input.
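
    The refactoring pattern is simply turning two hardcoded constants
    into parameters. The sketch below uses a toy message that records
    attribute types in order; the DEMO_ attribute numbers are made up
    for illustration and are not the real RTA_/NHA_ values, and
    demo_fill_encap() is not the kernel function.

```c
#include <stddef.h>

/* Made-up attribute numbers for two message families. */
enum { DEMO_RTA_ENCAP = 101, DEMO_RTA_ENCAP_TYPE = 102 };
enum { DEMO_NHA_ENCAP = 201, DEMO_NHA_ENCAP_TYPE = 202 };

/* Toy message: just remembers which attribute types were added. */
struct demo_msg {
    int types[8];
    size_t count;
};

static int demo_put(struct demo_msg *msg, int attrtype)
{
    if (msg->count >= sizeof(msg->types) / sizeof(msg->types[0]))
        return -1; /* no room left */
    msg->types[msg->count++] = attrtype;
    return 0;
}

/* One fill routine for both families: the caller chooses which
 * attribute numbers the encap and its type are emitted under. */
static int demo_fill_encap(struct demo_msg *msg, int encap_attr,
                           int encap_type_attr)
{
    if (demo_put(msg, encap_attr) < 0)
        return -1;
    return demo_put(msg, encap_type_attr);
}
```

    A route dump would call demo_fill_encap(&msg, DEMO_RTA_ENCAP,
    DEMO_RTA_ENCAP_TYPE) and a nexthop dump would pass the DEMO_NHA_
    pair, with no other difference between the two callers.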

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

23 Apr, 2019

5 commits