26 Sep, 2018

1 commit

  • [ Upstream commit 5cf4a8532c992bb22a9ecd5f6d93f873f4eaccc2 ]

    According to the documentation in msg_zerocopy.rst, the SO_ZEROCOPY
    flag was introduced because send(2) ignores unknown message flags and
    any legacy application which was accidentally passing the equivalent of
    MSG_ZEROCOPY earlier should not see any new behaviour.

    Before commit f214f915e7db ("tcp: enable MSG_ZEROCOPY"), a send(2) call
    which passed the equivalent of MSG_ZEROCOPY without setting SO_ZEROCOPY
    would succeed. However, after that commit, it fails with -ENOBUFS. So
    it appears that the SO_ZEROCOPY flag fails to fulfill its intended
    purpose. Fix it.
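
    As a minimal userspace illustration of that contract (not part of the
    patch; the constant is defined as a fallback in case older headers lack
    it), a send(2) passing MSG_ZEROCOPY on a socket that never enabled
    SO_ZEROCOPY should behave like a plain send rather than fail with
    ENOBUFS:

    #include <sys/socket.h>

    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000
    #endif

    /* fd is any connected socket on which SO_ZEROCOPY was never set */
    static ssize_t legacy_send(int fd, const void *buf, size_t len)
    {
            /* the unknown flag is supposed to be ignored by the kernel */
            return send(fd, buf, len, MSG_ZEROCOPY);
    }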

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Vincent Whitchurch
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vincent Whitchurch
     

20 Sep, 2018

3 commits

    After working on IP defragmentation lately, I found that some large
    packets defeat the CHECKSUM_COMPLETE optimization because the NIC adds
    zero padding on the last (small) fragment.

    While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
    to CHECKSUM_NONE, forcing a full csum validation, even if all prior
    fragments had CHECKSUM_COMPLETE set.

    We can instead compute the checksum of the part we are trimming,
    usually smaller than the part we keep.
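
    A sketch of that idea (hypothetical helper name; the upstream change is
    in the pskb_trim_rcsum() slow path and may differ in detail): subtract
    the checksum of the bytes being trimmed from skb->csum instead of
    falling back to CHECKSUM_NONE.

    #include <linux/skbuff.h>
    #include <net/checksum.h>

    static int trim_keep_csum_complete(struct sk_buff *skb, unsigned int len)
    {
            if (skb->ip_summed == CHECKSUM_COMPLETE) {
                    int delta = skb->len - len;

                    /* remove the contribution of the padding we drop */
                    skb->csum = csum_block_sub(skb->csum,
                                               skb_checksum(skb, len, delta, 0),
                                               len);
            }
            return __pskb_trim(skb, len);
    }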

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 88078d98d1bb085d72af8437707279e203524fa5)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    Tested: see the next patch in the series.

    Suggested-by: Eric Dumazet
    Signed-off-by: Peter Oskolkov
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit 385114dec8a49b5e5945e77ba7de6356106713f4)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
    As measured in my prior patch ("sch_netem: faster rb tree removal"),
    rbtree_postorder_for_each_entry_safe() is nice looking but much slower
    than using rb_next() directly, except when the tree is small enough
    to fit in CPU caches (then the cost is the same).

    Also note that there is not even an increase in text size:
    $ size net/core/skbuff.o.before net/core/skbuff.o
       text    data     bss     dec     hex filename
      40711    1298       0   42009    a419 net/core/skbuff.o.before
      40711    1298       0   42009    a419 net/core/skbuff.o
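
    For reference, the rb_next()-based purge loop looks roughly like this
    (sketch; close to, but not necessarily identical to, the upstream
    skb_rbtree_purge()):

    #include <linux/rbtree.h>
    #include <linux/skbuff.h>

    static void skb_rbtree_purge_sketch(struct rb_root *root)
    {
            struct rb_node *p = rb_first(root);

            while (p) {
                    struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);

                    /* advance before erasing the current node */
                    p = rb_next(p);
                    rb_erase(&skb->rbnode, root);
                    kfree_skb(skb);
            }
    }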

    From: Eric Dumazet

    Signed-off-by: David S. Miller
    (cherry picked from commit 7c90584c66cc4b033a3b684b0e0950f79e7b7166)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

28 Jul, 2018

1 commit

  • [ Upstream commit ff907a11a0d68a749ce1a321f4505c03bf72190c ]

    syzbot caught a NULL deref [1], caused by skb_segment()

    skb_segment() has many "goto err;" that assume the @err variable
    contains -ENOMEM.

    A successful call to __skb_linearize() should not clear @err; otherwise
    a subsequent memory allocation failure would take a "goto err;" with @err
    still zero and make skb_segment() return NULL instead of an error pointer.

    While we are at it, we might use -EINVAL instead of -ENOMEM when the
    MAX_SKB_FRAGS limit is reached.
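
    A small userspace illustration (not kernel code) of the pitfall: once a
    successful helper overwrites @err with 0, a later "goto err" returns
    ERR_PTR(0), which is NULL.

    #include <errno.h>
    #include <stdio.h>

    #define ERR_PTR(x) ((void *)(long)(x))

    static void *build(int linearize_ok, int alloc_ok)
    {
            int err = -ENOMEM;

            if (!linearize_ok)
                    goto err;
            err = 0;                /* "linearize" succeeded, err is cleared */

            if (!alloc_ok)
                    goto err;       /* err is still 0: ERR_PTR(0) == NULL */

            return "segments";
    err:
            return ERR_PTR(err);
    }

    int main(void)
    {
            printf("%p\n", build(1, 0));    /* prints (nil); a caller would oops */
            return 0;
    }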

    [1]
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    CPU: 0 PID: 13285 Comm: syz-executor3 Not tainted 4.18.0-rc4+ #146
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:tcp_gso_segment+0x3dc/0x1780 net/ipv4/tcp_offload.c:106
    Code: f0 ff ff 0f 87 1c fd ff ff e8 00 88 0b fb 48 8b 75 d0 48 b9 00 00 00 00 00 fc ff df 48 8d be 90 00 00 00 48 89 f8 48 c1 e8 03 b6 14 08 48 8d 86 94 00 00 00 48 89 c6 83 e0 07 48 c1 ee 03 0f
    RSP: 0018:ffff88019b7fd060 EFLAGS: 00010206
    RAX: 0000000000000012 RBX: 0000000000000020 RCX: dffffc0000000000
    RDX: 0000000000040000 RSI: 0000000000000000 RDI: 0000000000000090
    RBP: ffff88019b7fd0f0 R08: ffff88019510e0c0 R09: ffffed003b5c46d6
    R10: ffffed003b5c46d6 R11: ffff8801dae236b3 R12: 0000000000000001
    R13: ffff8801d6c581f4 R14: 0000000000000000 R15: ffff8801d6c58128
    FS: 00007fcae64d6700(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000004e8664 CR3: 00000001b669b000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    tcp4_gso_segment+0x1c3/0x440 net/ipv4/tcp_offload.c:54
    inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
    inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
    skb_mac_gso_segment+0x3b5/0x740 net/core/dev.c:2792
    __skb_gso_segment+0x3c3/0x880 net/core/dev.c:2865
    skb_gso_segment include/linux/netdevice.h:4099 [inline]
    validate_xmit_skb+0x640/0xf30 net/core/dev.c:3104
    __dev_queue_xmit+0xc14/0x3910 net/core/dev.c:3561
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
    neigh_hh_output include/net/neighbour.h:473 [inline]
    neigh_output include/net/neighbour.h:481 [inline]
    ip_finish_output2+0x1063/0x1860 net/ipv4/ip_output.c:229
    ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
    NF_HOOK_COND include/linux/netfilter.h:276 [inline]
    ip_output+0x223/0x880 net/ipv4/ip_output.c:405
    dst_output include/net/dst.h:444 [inline]
    ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
    iptunnel_xmit+0x567/0x850 net/ipv4/ip_tunnel_core.c:91
    ip_tunnel_xmit+0x1598/0x3af1 net/ipv4/ip_tunnel.c:778
    ipip_tunnel_xmit+0x264/0x2c0 net/ipv4/ipip.c:308
    __netdev_start_xmit include/linux/netdevice.h:4148 [inline]
    netdev_start_xmit include/linux/netdevice.h:4157 [inline]
    xmit_one net/core/dev.c:3034 [inline]
    dev_hard_start_xmit+0x26c/0xc30 net/core/dev.c:3050
    __dev_queue_xmit+0x29ef/0x3910 net/core/dev.c:3569
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
    neigh_direct_output+0x15/0x20 net/core/neighbour.c:1403
    neigh_output include/net/neighbour.h:483 [inline]
    ip_finish_output2+0xa67/0x1860 net/ipv4/ip_output.c:229
    ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
    NF_HOOK_COND include/linux/netfilter.h:276 [inline]
    ip_output+0x223/0x880 net/ipv4/ip_output.c:405
    dst_output include/net/dst.h:444 [inline]
    ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
    ip_queue_xmit+0x9df/0x1f80 net/ipv4/ip_output.c:504
    tcp_transmit_skb+0x1bf9/0x3f10 net/ipv4/tcp_output.c:1168
    tcp_write_xmit+0x1641/0x5c20 net/ipv4/tcp_output.c:2363
    __tcp_push_pending_frames+0xb2/0x290 net/ipv4/tcp_output.c:2536
    tcp_push+0x638/0x8c0 net/ipv4/tcp.c:735
    tcp_sendmsg_locked+0x2ec5/0x3f00 net/ipv4/tcp.c:1410
    tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1447
    inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
    sock_sendmsg_nosec net/socket.c:641 [inline]
    sock_sendmsg+0xd5/0x120 net/socket.c:651
    __sys_sendto+0x3d7/0x670 net/socket.c:1797
    __do_sys_sendto net/socket.c:1809 [inline]
    __se_sys_sendto net/socket.c:1805 [inline]
    __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1805
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x455ab9
    Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fcae64d5c68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 00007fcae64d66d4 RCX: 0000000000455ab9
    RDX: 0000000000000001 RSI: 0000000020000200 RDI: 0000000000000013
    RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
    R13: 00000000004c1145 R14: 00000000004d1818 R15: 0000000000000006
    Modules linked in:
    Dumping ftrace buffer:
    (ftrace buffer empty)

    Fixes: ddff00d42043 ("net: Move skb_has_shared_frag check out of GRE code and into segmentation")
    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Reported-by: syzbot
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

25 Jul, 2018

2 commits

  • [ Upstream commit e78bfb0751d4e312699106ba7efbed2bab1a53ca ]

    Commit 8b7008620b84 ("net: Don't copy pfmemalloc flag in
    __copy_skb_header()") introduced a different handling for the
    pfmemalloc flag in copy and clone paths.

    In __skb_clone(), the flag is now set only if it was set in the
    original skb, but it is not cleared if it wasn't. This is wrong and
    might lead to socket buffers being flagged with pfmemalloc even
    if the skb data wasn't allocated from pfmemalloc reserves. Copy
    the flag instead of ORing it.
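
    The clone-path change boils down to something like this (sketch based
    on the description above, not a verbatim hunk):

    /* Before: the flag could only ever be set on the clone, never cleared */
    if (skb->pfmemalloc)
            n->pfmemalloc = 1;

    /* After: plain copy, so a cleared flag stays cleared */
    n->pfmemalloc = skb->pfmemalloc;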

    Reported-by: Sabrina Dubroca
    Fixes: 8b7008620b84 ("net: Don't copy pfmemalloc flag in __copy_skb_header()")
    Signed-off-by: Stefano Brivio
    Tested-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stefano Brivio
     
  • [ Upstream commit 8b7008620b8452728cadead460a36f64ed78c460 ]

    The pfmemalloc flag indicates that the skb was allocated from
    the PFMEMALLOC reserves, and the flag is currently copied on skb
    copy and clone.

    However, an skb copied from an skb flagged with pfmemalloc
    wasn't necessarily allocated from PFMEMALLOC reserves, and on
    the other hand an skb allocated that way might be copied from an
    skb that wasn't.

    So we should not copy the flag on skb copy, and rather decide
    whether to allow an skb to be associated with sockets unrelated
    to page reclaim depending only on how it was allocated.

    Move the pfmemalloc flag before headers_start[0] using an
    existing 1-bit hole, so that __copy_skb_header() doesn't copy
    it.

    When cloning, we'll now take care of this flag explicitly,
    contravening the warning comment of __skb_clone().

    While at it, restore the newline usage introduced by commit
    b19372273164 ("net: reorganize sk_buff for faster
    __copy_skb_header()") to visually separate bytes used in
    bitfields after headers_start[0], that was gone after commit
    a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage
    area"), and describe the pfmemalloc flag in the kernel-doc
    structure comment.

    This doesn't change the size of sk_buff or cacheline boundaries,
    but consolidates the 15-bit hole before tc_index into a 2-byte
    hole before csum, which could now be filled more easily.
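
    For context, this works because __copy_skb_header() copies everything
    between the two markers with a single memcpy, so a field placed before
    headers_start[0] is simply not part of that copy (sketch of the
    existing mechanism, not of this patch):

    memcpy(&new->headers_start, &old->headers_start,
           offsetof(struct sk_buff, headers_end) -
           offsetof(struct sk_buff, headers_start));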

    Reported-by: Patrick Talbert
    Fixes: c93bdd0e03e8 ("netvm: allow skb allocation to use PFMEMALLOC reserves")
    Signed-off-by: Stefano Brivio
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stefano Brivio
     

30 May, 2018

2 commits

  • [ Upstream commit ae4745730cf8e693d354ccd4dbaf59ea440c09a9 ]

    In some situations vlan packets do not have ethernet headers. One example
    is packets from tun devices. Users can specify the vlan protocol in the
    tun_pi field instead of an IP protocol, and skb_vlan_untag() attempts to
    untag such packets.

    skb_vlan_untag() (more precisely, skb_reorder_vlan_header() called by it)
    however did not expect packets without ethernet headers, so in such a case
    the size argument for memmove() underflowed and triggered a crash.

    ====
    BUG: unable to handle kernel paging request at ffff8801cccb8000
    IP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43
    PGD 9cee067 P4D 9cee067 PUD 1d9401063 PMD 1cccb7063 PTE 2810100028101
    Oops: 000b [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 1 PID: 17663 Comm: syz-executor2 Not tainted 4.16.0-rc7+ #368
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:__memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43
    RSP: 0018:ffff8801cc046e28 EFLAGS: 00010287
    RAX: ffff8801ccc244c4 RBX: fffffffffffffffe RCX: fffffffffff6c4c2
    RDX: fffffffffffffffe RSI: ffff8801cccb7ffc RDI: ffff8801cccb8000
    RBP: ffff8801cc046e48 R08: ffff8801ccc244be R09: ffffed0039984899
    R10: 0000000000000001 R11: ffffed0039984898 R12: ffff8801ccc244c4
    R13: ffff8801ccc244c0 R14: ffff8801d96b7c06 R15: ffff8801d96b7b40
    FS: 00007febd562d700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8801cccb8000 CR3: 00000001ccb2f006 CR4: 00000000001606e0
    DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    Call Trace:
    memmove include/linux/string.h:360 [inline]
    skb_reorder_vlan_header net/core/skbuff.c:5031 [inline]
    skb_vlan_untag+0x470/0xc40 net/core/skbuff.c:5061
    __netif_receive_skb_core+0x119c/0x3460 net/core/dev.c:4460
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4627
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4701
    netif_receive_skb+0xae/0x390 net/core/dev.c:4725
    tun_rx_batched.isra.50+0x5ee/0x870 drivers/net/tun.c:1555
    tun_get_user+0x299e/0x3c20 drivers/net/tun.c:1962
    tun_chr_write_iter+0xb9/0x160 drivers/net/tun.c:1990
    call_write_iter include/linux/fs.h:1782 [inline]
    new_sync_write fs/read_write.c:469 [inline]
    __vfs_write+0x684/0x970 fs/read_write.c:482
    vfs_write+0x189/0x510 fs/read_write.c:544
    SYSC_write fs/read_write.c:589 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:581
    do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    RIP: 0033:0x454879
    RSP: 002b:00007febd562cc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 00007febd562d6d4 RCX: 0000000000454879
    RDX: 0000000000000157 RSI: 0000000020000180 RDI: 0000000000000014
    RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
    R13: 00000000000006b0 R14: 00000000006fc120 R15: 0000000000000000
    Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20
    RIP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43 RSP: ffff8801cc046e28
    CR2: ffff8801cccb8000
    ====

    We don't need to copy headers for packets which have no headers
    preceding the vlan header, so skip memmove() in that case.

    Fixes: 4bbb3e0e8239 ("net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off")
    Reported-by: Eric Dumazet
    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Toshiaki Makita
     
  • [ Upstream commit 4bbb3e0e8239f9079bf1fe20b3c0cb598714ae61 ]

    When we have a bridge with vlan_filtering on and a vlan device on top of
    it, packets would be corrupted in skb_vlan_untag() called from
    br_dev_xmit().

    The problem sits in skb_reorder_vlan_header() used in skb_vlan_untag(),
    which makes use of skb->mac_len. In this function mac_len is meant for
    handling rx path with vlan devices with reorder_header disabled, but in
    tx path mac_len is typically 0 and cannot be used, which is the problem
    in this case.

    The current code even does not properly handle rx path (skb_vlan_untag()
    called from __netif_receive_skb_core()) with reorder_header off actually.

    In rx path single tag case, it works as follows:

    - Before skb_reorder_vlan_header()

      mac_header                                data
      v                                         v
      +-------------------+-------------+------+----
      |        ETH        |    VLAN     | ETH  |
      |       ADDRS       | TPID | TCI  | TYPE |
      +-------------------+-------------+------+----
                           \___________/
                           to be removed

    - After skb_reorder_vlan_header()

      mac_header                  data
      v                           v
      +-------------------+------+----
      |        ETH        | ETH  |
      |       ADDRS       | TYPE |
      +-------------------+------+----

    This is ok, but in rx double tag case, it corrupts packets:

    - Before skb_reorder_vlan_header()

      mac_header                                              data
      v                                                       v
      +-------------------+-------------+-------------+------+----
      |        ETH        |    VLAN     |    VLAN     | ETH  |
      |       ADDRS       | TPID | TCI  | TPID | TCI  | TYPE |
      +-------------------+-------------+-------------+------+----
                           \___________/
                         should be removed
                           \_________________________/
                            actually will be removed

    - After skb_reorder_vlan_header()

      mac_header                  data
      v                           v
      +-------------------+------+----
      |        ETH        | ETH  |
      |       ADDRS       | TYPE |
      +-------------------+------+----

    So, both vlan tags are removed while only the inner one should be
    removed, and mac_header (and mac_len) end up broken.

    skb_vlan_untag() is meant for removing the vlan header at (skb->data - 2),
    so use skb->data and skb->mac_header to calculate the right offset.

    Reported-by: Brandon Carpenter
    Fixes: a6e18ff11170 ("vlan: Fix untag operations of stacked vlans with REORDER_HEADER off")
    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Toshiaki Makita
     

16 May, 2018

1 commit

  • commit b13dda9f9aa7caceeee61c080c2e544d5f5d85e5 upstream.

    syzbot reported that __skb_try_recv_from_queue() was using skb->peeked
    while it was potentially uninitialized.

    We need to clear it in __skb_clone().

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

01 Apr, 2018

1 commit

  • [ Upstream commit 6e5d58fdc9bedd0255a8781b258f10bbdc63e975 ]

    When errors are enqueued to the error queue via sock_queue_err_skb()
    function, it is possible that the waiting application is not notified.

    Calling 'sk->sk_data_ready()' would not notify applications that
    selected only POLLERR events in poll() (for example).
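
    For reference, the userspace pattern that relies on this wakeup looks
    roughly like this (hedged sketch, error handling trimmed): block until
    the error event is signalled, then drain the error queue.

    #include <poll.h>
    #include <sys/socket.h>

    static ssize_t wait_and_read_error(int fd, struct msghdr *msg)
    {
            struct pollfd pfd = { .fd = fd, .events = 0 };

            /* POLLERR is reported even when not requested in .events */
            if (poll(&pfd, 1, -1) <= 0 || !(pfd.revents & POLLERR))
                    return -1;

            return recvmsg(fd, msg, MSG_ERRQUEUE);
    }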

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: Randy E. Witt
    Reviewed-by: Eric Dumazet
    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vinicius Costa Gomes
     

22 Feb, 2018

1 commit

  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

03 Jan, 2018

4 commits

  • skb_copy_ubufs must unclone before it is safe to modify its
    skb_shared_info with skb_zcopy_clear.

    Commit b90ddd568792 ("skbuff: skb_copy_ubufs must release uarg even
    without user frags") ensures that all skbs release their zerocopy
    state, even those without frags.

    But I forgot an edge case where such an skb arrives that is cloned.

    The stack does not build such packets. Vhost/tun skbs have their
    frags orphaned before cloning. TCP skbs only attach zerocopy state
    when a frag is added.

    But if TCP packets can be trimmed or linearized, this might occur.
    Tracing the code I found no instance so far (e.g., skb_linearize
    ends up calling skb_zcopy_clear if !skb->data_len).

    Still, it is non-obvious that no path exists. And it is fragile to
    rely on this.

    Fixes: b90ddd568792 ("skbuff: skb_copy_ubufs must release uarg even without user frags")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit b90ddd568792bcb0054eaf0f61785c8f80c3bd1c ]

    skb_copy_ubufs creates a private copy of frags[] to release its hold
    on user frags, then calls uarg->callback to notify the owner.

    Call uarg->callback even when no frags exist. This edge case can
    happen when zerocopy_sg_from_iter finds enough room in skb_headlen
    to copy all the data.

    Fixes: 3ece782693c4 ("sock: skb_copy_ubufs support for compound pages")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit 268b790679422a89e9ab0685d9f291edae780c98 ]

    Call skb_zerocopy_clone after skb_orphan_frags, to avoid duplicate
    calls to skb_uarg(skb)->callback for the same data.

    skb_zerocopy_clone associates skb_shinfo(skb)->uarg from frag_skb
    with each segment. This is only safe for uargs that do refcounting,
    which is those that pass skb_orphan_frags without dropping their
    shared frags. For others, skb_orphan_frags drops the user frags and
    sets the uarg to NULL, after which sock_zerocopy_clone has no effect.

    Qemu hangs were reported due to duplicate vhost_net_zerocopy_callback
    calls for the same data causing the vhost_net_ubuf_ref->refcount to
    drop below zero.

    Link: http://lkml.kernel.org/r/
    Fixes: 1f8b977ab32d ("sock: enable MSG_ZEROCOPY")
    Reported-by: Andreas Hartmann
    Reported-by: David Hill
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit 35b99dffc3f710cafceee6c8c6ac6a98eb2cb4bf ]

    skb_complete_tx_timestamp must ingest the skb it is passed. Call
    kfree_skb if the skb cannot be enqueued.

    Fixes: b245be1f4db1 ("net-timestamp: no-payload only sysctl")
    Fixes: 9ac25fc06375 ("net: fix socket refcounting in skb_complete_tx_timestamp()")
    Reported-by: Richard Cochran
    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

04 Nov, 2017

1 commit

    When running ipvs in two different network namespaces on the same host,
    with one ipvs forwarding network traffic to the ipvs in the other network
    namespace, the 'ipvs_property' flag will make the second ipvs take no
    effect. So we should clear 'ipvs_property' when the skb's network
    namespace changes.

    Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()")
    Signed-off-by: Ye Yin
    Signed-off-by: Wei Zhou
    Signed-off-by: Julian Anastasov
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Ye Yin
     

22 Oct, 2017

1 commit

  • Syzkaller hits WARN_ON(sk->sk_wmem_queued) in sk_stream_kill_queues
    after triggering an EFAULT in __zerocopy_sg_from_iter.

    On this error, skb_zerocopy_iter_stream resets the skb to its state
    before the operation with __pskb_trim. It cannot kfree_skb like
    datagram callers, as the skb may have data from a previous send call.

    __pskb_trim calls skb_condense for unowned skbs, which adjusts their
    truesize. These tcp skbuffs are owned and their truesize must add up
    to sk_wmem_queued, but they match the unowned check anyway because
    their skb->sk is NULL until tcp_transmit_skb.

    Temporarily set skb->sk when calling __pskb_trim to signal that the
    skbuffs are owned and avoid the skb_condense path.
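
    Roughly, the error path becomes something like the following sketch
    (based on the description above, not a verbatim copy of the patch):

    /* On EFAULT: undo the partial append. Mark the skb as owned so that
     * skb_condense(), reached via __pskb_trim() for seemingly unowned skbs,
     * leaves truesize alone. */
    skb->sk = sk;
    __pskb_trim(skb, orig_len);
    skb->sk = NULL;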

    Fixes: 52267790ef52 ("sock: add MSG_ZEROCOPY")
    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

08 Sep, 2017

1 commit

  • After commit 0ddf3fb2c43d ("udp: preserve skb->dst if required
    for IP options processing") we clear the skb head state as soon
    as the skb carrying them is first processed.

    Since the same skb can be processed several times when MSG_PEEK
    is used, we can end up lacking the required head states, and
    eventually oopsing.

    Fix this by clearing the skb head state only when processing the
    last skb reference.

    Reported-by: Eric Dumazet
    Fixes: 0ddf3fb2c43d ("udp: preserve skb->dst if required for IP options processing")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

24 Aug, 2017

1 commit

  • Rename skb_pad() into __skb_pad() and make it take a third argument:
    free_on_error which controls whether kfree_skb() should be called or
    not, skb_pad() directly makes use of it and passes true to preserve its
    existing behavior. Do exactly the same thing with __skb_put_padto() and
    skb_put_padto().
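
    The resulting wrapper pattern looks roughly like this (sketch; the real
    signatures may differ slightly):

    int __skb_pad(struct sk_buff *skb, int pad, bool free_on_error);

    static inline int skb_pad(struct sk_buff *skb, int pad)
    {
            /* historical behavior: free the skb when padding fails */
            return __skb_pad(skb, pad, true);
    }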

    Suggested-by: David Miller
    Signed-off-by: Florian Fainelli
    Reviewed-by: Woojung Huh
    Signed-off-by: David S. Miller

    Florian Fainelli
     

17 Aug, 2017

1 commit

    A couple of fixes to the new skb_send_sock infrastructure. However, no
    users of this code exist yet (a user is added in the next handful of
    patches), so it should not be possible to trigger a panic with existing
    in-kernel code.

    Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage")
    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

10 Aug, 2017

1 commit

  • Only call mm_unaccount_pinned_pages when releasing a struct ubuf_info
    that has initialized its field uarg->mmp.

    Before this patch, a vhost-net with experimental_zcopytx can crash in

    mm_unaccount_pinned_pages
    sock_zerocopy_put
    skb_zcopy_clear
    skb_release_data

    Only sock_zerocopy_alloc initializes this field. Move the unaccount
    call from generic sock_zerocopy_put to its specific callback
    sock_zerocopy_callback.

    Fixes: a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
    Reported-by: David Ahern
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

04 Aug, 2017

6 commits

  • Bound the number of pages that a user may pin.

    Follow the lead of perf tools to maintain a per-user bound on memory
    locked pages; see commit 789f90fcf6b0 ("perf_counter: per user mlock gift").

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • In the simple case, each sendmsg() call generates data and eventually
    a zerocopy ready notification N, where N indicates the Nth successful
    invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

    TCP and corked sockets can cause send() calls to append new data to an
    existing sk_buff and, thus, ubuf_info. In that case the notification
    must hold a range. Modify ubuf_info to store an inclusive range [N..N+m]
    and add skb_zerocopy_realloc() to optionally extend an existing range.

    Also coalesce notifications in this common case: if a notification
    [1, 1] is about to be queued while [0, 0] is the queue tail, just modify
    the head of the queue to read [0, 1].

    Coalescing is limited to a few TSO frames worth of data to bound
    notification latency.
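
    The range ends up in the existing sock_extended_err fields, so a reader
    of the notification queue does roughly the following (sketch of the
    documented interface; cmsg validation omitted):

    #include <linux/errqueue.h>
    #include <sys/socket.h>

    /* msg was filled by recvmsg(fd, &msg, MSG_ERRQUEUE) */
    static void read_notification(struct msghdr *msg, __u32 *lo, __u32 *hi)
    {
            struct cmsghdr *cm = CMSG_FIRSTHDR(msg);
            struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

            if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                    *lo = serr->ee_info;    /* first send covered by this notification */
                    *hi = serr->ee_data;    /* last send covered by this notification */
            }
    }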

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
    skb_zerocopy_clone() wherever needed due to skb split, merge, resize
    or clone.

    Split skb_orphan_frags into two variants. The split, merge, .. paths
    support reference counted zerocopy buffers, so do not do a deep copy.
    Add skb_orphan_frags_rx for paths that may loop packets to receive
    sockets. That is not allowed, as it may cause unbounded latency.
    Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

    The exact locations to modify were chosen by exhaustively searching
    through all code that might modify skb_frag references and/or the
    SKBTX_DEV_ZEROCOPY tx_flags bit.

    The changes err on the safe side, in two ways.

    (1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

    (2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.
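
    The two variants described in the split look roughly like this (sketch
    based on the description; the upstream helpers may differ in detail):

    /* tx paths: refcounted uargs may keep sharing the user frags */
    static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
    {
            if (likely(!skb_zcopy(skb)))
                    return 0;
            if (skb_uarg(skb)->callback == sock_zerocopy_callback)
                    return 0;
            return skb_copy_ubufs(skb, gfp_mask);
    }

    /* paths that may loop the skb to a receive socket: always deep copy */
    static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
    {
            if (likely(!skb_zcopy(skb)))
                    return 0;
            return skb_copy_ubufs(skb, gfp_mask);
    }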

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • The send call ignores unknown flags. Legacy applications may already
    unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
    socket opts in to zerocopy.

    Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
    Processes can also query this socket option to detect kernel support
    for the feature. Older kernels will return ENOPROTOOPT.
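
    From userspace, opting in and probing for kernel support then looks like
    this (sketch; the constant is defined as a fallback for older headers):

    #include <errno.h>
    #include <sys/socket.h>

    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY 60
    #endif

    static int enable_zerocopy(int fd)
    {
            int one = 1;

            if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one))) {
                    if (errno == ENOPROTOOPT)
                            return 0;       /* kernel without MSG_ZEROCOPY support */
                    return -1;
            }
            return 1;       /* MSG_ZEROCOPY will now be honored on fd */
    }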

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • The kernel supports zerocopy sendmsg in virtio and tap. Expand the
    infrastructure to support other socket types. Introduce a completion
    notification channel over the socket error queue. Notifications are
    returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
    blocking the send/recv path on receiving notifications.

    Add reference counting, to support the skb split, merge, resize and
    clone operations possible with SOCK_STREAM and other socket types.

    The patch does not yet modify any datapaths.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Refine skb_copy_ubufs to support compound pages. With upcoming TCP
    zerocopy sendmsg, such fragments may appear.

    The existing code replaces each page one for one. Splitting each
    compound page into an independent number of regular pages can result
    in exceeding the MAX_SKB_FRAGS limit if the data is not exactly page
    aligned.

    Instead, fill all destination pages but the last to PAGE_SIZE.
    Split the existing alloc + copy loop into separate stages:
    1. compute the byte length and the minimum number of pages to store it.
    2. allocate
    3. copy, filling each page except the last to PAGE_SIZE bytes
    4. update skb frag array
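
    Stage 1 amounts to simple arithmetic over the frag sizes (illustrative
    sketch; variable names are made up):

    u32 bytes = 0, new_frags;
    int i;

    /* total bytes held in page frags ... */
    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
            bytes += skb_frag_size(&skb_shinfo(skb)->frags[i]);

    /* ... and the number of full pages needed to hold them */
    new_frags = (bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;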

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

02 Aug, 2017

2 commits

  • Skb frags may contain compound pages. Various operations map frags
    temporarily using kmap_atomic, but this function works on single
    pages, not whole compound pages. The distinction is only relevant
    for high mem pages that require temporary mappings.

    Introduce a looping mechanism that, for compound highmem pages, maps
    one page at a time; behavior on other pages is unchanged.
    Use the loop in the kmap_atomic callers in net/core/skbuff.c.
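
    The loop is used roughly like this in the copy helpers (sketch; details
    of the macro arguments may differ):

    const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
    struct page *p;
    u32 p_off, p_len, copied;
    u8 *vaddr;

    /* maps at most one page of a (possibly compound) frag at a time */
    skb_frag_foreach_page(frag, offset, size, p, p_off, p_len, copied) {
            vaddr = kmap_atomic(p);
            memcpy(to + copied, vaddr + p_off, p_len);
            kunmap_atomic(vaddr);
    }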

    Verified by triggering skb_copy_bits with

    tcpdump -n -c 100 -i ${DEV} -w /dev/null &
    netperf -t TCP_STREAM -H ${HOST}

    and by triggering __skb_checksum with

    ethtool -K ${DEV} tx off

    The tests were repeated with looping on a non-highmem platform
    (x86_64) by making skb_frag_must_loop always return true.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Add skb_send_sock to send an skbuff on a socket within the kernel.
    Arguments include an offset so that an skbuff might be sent in multiple
    calls (e.g. when a send buffer limit is hit).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

25 Jul, 2017

1 commit

  • A null check is needed after all. netlink skbs can have skb->head be
    backed by vmalloc. The netlink destructor vfree()s head, then sets it to
    NULL. We then panic in skb_release_data with a NULL dereference.

    Re-add such a test.

    An alternative would be to switch to kvfree to free skb->head memory
    and remove the special handling in the netlink destructor.

    Reported-by: kernel test robot
    Fixes: 06dc75ab06943 ("net: Revert "net: add function to allocate sk_buff head without data area")
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

13 Jul, 2017

1 commit

    __GFP_REPEAT was designed to allow retry-but-eventually-fail semantics
    in the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantics for those requests and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of the __GFP_REPEAT flag has been removed for !costly requests, we
    can give the original flag a better name and, more importantly, a more
    useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the
    user that the allocator will try really hard but there is no promise of
    success. This will work independent of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which doesn't
    even kick the background reclaim. Should be used carefully because it
    might deplete the memory and the next user might hit the more
    aggressive reclaim.

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.
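
    For example, the __GFP_RETRY_MAYFAIL level above is the natural fit for
    an allocation that has its own fallback (illustrative):

    /* try hard for physically contiguous memory, but tolerate failure */
    table = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
    if (!table)
            table = vmalloc(size);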

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already had their semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than keep retrying in the page allocator.

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Jul, 2017

1 commit

    refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the hint() version must
    be used, we might need to revisit the API.
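
    The conversion pattern itself is mechanical (illustrative sketch, not
    the actual hunks):

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct obj {
            refcount_t users;               /* was: atomic_t users */
    };

    static struct obj *obj_get(struct obj *o)
    {
            /* was: atomic_inc_not_zero_hint(&o->users, hint) */
            return refcount_inc_not_zero(&o->users) ? o : NULL;
    }

    static void obj_put(struct obj *o)
    {
            /* was: atomic_dec_and_test(&o->users) */
            if (refcount_dec_and_test(&o->users))
                    kfree(o);
    }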

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena