17 Dec, 2018

1 commit

  • [ Upstream commit 66033f47ca60294a95fc85ec3a3cc909dab7b765 ]

    Even if we send an IPv6 packet without options, MAX_HEADER might not be
    enough to account for the additional headroom required by alignment of
    hardware headers.

    On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
    sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
    100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
    bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().

    Those would be enough to append our 14 bytes header, but we're going to
    align that to 16 bytes, and write 2 bytes out of the allocated slab in
    neigh_hh_output().
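
    The arithmetic above can be checked with a small userspace sketch. It assumes a 16-byte alignment granule for the hardware header, which matches the numbers quoted in this message; HH_ALIGN, hh_aligned_len() and headroom_shortfall() are illustrative names, not kernel symbols.

```c
#include <stddef.h>

/* Assumed alignment granule for the cached hardware header (illustrative). */
#define HH_ALIGN 16u

/* Round a hardware header length up to the next multiple of HH_ALIGN. */
static size_t hh_aligned_len(size_t hh_len)
{
    return (hh_len + HH_ALIGN - 1) & ~(size_t)(HH_ALIGN - 1);
}

/* Bytes that would land before the start of the buffer, 0 if none. */
static size_t headroom_shortfall(size_t headroom, size_t hh_len)
{
    size_t need = hh_aligned_len(hh_len);

    return need > headroom ? need - headroom : 0;
}
```

    With 14 bytes of headroom and a 14-byte header, headroom_shortfall(14, 14) gives the 2 bytes mentioned above, while 16 bytes of headroom would be sufficient.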

    KASan says:

    [ 264.967848] ==================================================================
    [ 264.967861] BUG: KASAN: slab-out-of-bounds in ip6_finish_output2+0x1aec/0x1c70
    [ 264.967866] Write of size 16 at addr 000000006af1c7fe by task netperf/6201
    [ 264.967870]
    [ 264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
    [ 264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
    [ 264.967887] Call Trace:
    [ 264.967896] ([] show_stack+0x56/0xa0)
    [ 264.967903] [] dump_stack+0x23c/0x290
    [ 264.967912] [] print_address_description+0xf4/0x290
    [ 264.967919] [] kasan_report+0x13c/0x240
    [ 264.967927] [] ip6_finish_output2+0x1aec/0x1c70
    [ 264.967935] [] ip6_finish_output+0x430/0x7f0
    [ 264.967943] [] ip6_output+0x1f4/0x580
    [ 264.967953] [] ip6_xmit+0xfea/0x1ce8
    [ 264.967963] [] inet6_csk_xmit+0x282/0x3f8
    [ 264.968033] [] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
    [ 264.968037] [] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
    [ 264.968041] [] dev_hard_start_xmit+0x268/0x928
    [ 264.968069] [] sch_direct_xmit+0x7ae/0x1350
    [ 264.968071] [] __dev_queue_xmit+0x2b7c/0x3478
    [ 264.968075] [] ip_finish_output2+0xce2/0x11a0
    [ 264.968078] [] ip_finish_output+0x56c/0x8c8
    [ 264.968081] [] ip_output+0x226/0x4c0
    [ 264.968083] [] __ip_queue_xmit+0x894/0x1938
    [ 264.968100] [] sctp_packet_transmit+0x29d4/0x3648 [sctp]
    [ 264.968116] [] sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
    [ 264.968131] [] sctp_outq_flush+0x22e/0x7d8 [sctp]
    [ 264.968146] [] sctp_cmd_interpreter.isra.16+0x530/0x6800 [sctp]
    [ 264.968161] [] sctp_do_sm+0x222/0x648 [sctp]
    [ 264.968177] [] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
    [ 264.968192] [] __sctp_connect+0x830/0xc20 [sctp]
    [ 264.968208] [] sctp_inet_connect+0x2e6/0x378 [sctp]
    [ 264.968212] [] __sys_connect+0x21a/0x450
    [ 264.968215] [] sys_socketcall+0x3d0/0xb08
    [ 264.968218] [] system_call+0x2a2/0x2c0

    [...]

    Just like ip_finish_output2() does for IPv4, check that we have enough
    headroom in ip6_xmit(), and reallocate it if we don't.

    This issue is older than git history.

    Reported-by: Jianlin Shi
    Signed-off-by: Stefano Brivio
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stefano Brivio
     

29 Sep, 2018

1 commit

  • [ Upstream commit bbd6528d28c1b8e80832b3b018ec402b6f5c3215 ]

    In the unlikely case ip6_xmit() has to call skb_realloc_headroom(),
    we need to call skb_set_owner_w() before consuming original skb,
    otherwise we risk a use-after-free.

    Bring IPv6 in line with what we do in IPv4 to fix this.

    Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

28 Jul, 2018

1 commit

  • [ Upstream commit 3dd1c9a1270736029ffca670e9bd0265f4120600 ]

    The skb hash for locally generated ip[v6] fragments belonging
    to the same datagram can vary in several circumstances:
    * for connected UDP[v6] sockets, the first fragment gets its hash
    via set_owner_w()/skb_set_hash_from_sk()
    * for unconnected IPv6 UDPv6 sockets, the first fragment can get
    its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
    auto_flowlabel is enabled

    For the following frags the hash is usually computed via
    skb_get_hash().
    The above can cause OoO delivery for unconnected IPv6 UDPv6 sockets: in that
    scenario the egress tx queue can be selected on a per packet basis
    via the skb hash.
    It may also fool flow-oriented schedulers to place fragments belonging
    to the same datagram in different flows.

    Fix the issue by copying the skb hash from the head frag into
    the others at fragmentation time.
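
    A minimal userspace model of that idea follows. struct toy_skb only mimics the three fields shown in the perf output below, and toy_copy_hash() and toy_fragment_hashes() are hypothetical helpers, not the functions touched by this commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the few sk_buff fields involved (not the kernel struct). */
struct toy_skb {
    uint32_t hash;
    unsigned int l4_hash : 1;
    unsigned int sw_hash : 1;
};

/* Copy the flow hash state of one buffer into another. */
static void toy_copy_hash(struct toy_skb *to, const struct toy_skb *from)
{
    to->hash = from->hash;
    to->l4_hash = from->l4_hash;
    to->sw_hash = from->sw_hash;
}

/* Propagate the head fragment's hash to every following fragment. */
static void toy_fragment_hashes(struct toy_skb *frags, size_t n)
{
    for (size_t i = 1; i < n; i++)
        toy_copy_hash(&frags[i], &frags[0]);
}
```

    After toy_fragment_hashes() runs, every fragment carries the same hash and l4_hash as the head, so a hash-based queue selector would treat them as one flow.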

    Before this commit:
    perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8"
    netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
    perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
    perf script
    probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
    probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0

    After this commit:
    probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
    probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0

    Fixes: b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit")
    Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     

12 Jun, 2018

1 commit


25 May, 2018

1 commit

  • [ Upstream commit 113f99c3358564a0647d444c2ae34e8b1abfd5b9 ]

    Device features may change during transmission. In particular with
    corking, a device may toggle scatter-gather in between allocating
    and writing to an skb.

    Do not unconditionally assume that !NETIF_F_SG at write time implies
    that the same held at alloc time and thus the skb has sufficient
    tailroom.

    This issue predates git history.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

12 Apr, 2018

3 commits

  • [ Upstream commit 71a1c915238c970cd9bdd5bf158b1279d6b6d55b ]

    At the end of ip6_forward(), IPSTATS_MIB_OUTFORWDATAGRAMS and
    IPSTATS_MIB_OUTOCTETS are incremented immediately before the NF_HOOK call
    for NFPROTO_IPV6 / NF_INET_FORWARD. As a result, these counters get
    incremented regardless of whether or not the netfilter hook allows the
    packet to continue being processed. This change increments the counters
    in ip6_forward_finish() so that it will not happen if the netfilter hook
    chooses to terminate the packet, which is similar to how IPv4 works.

    Signed-off-by: Jeff Barnhill
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jeff Barnhill
     
  • [ Upstream commit 10b8a3de603df7b96004179b1b33b1708c76d144 ]

    While building an IPv6 datagram we currently allow arbitrarily large
    extheaders, even beyond pmtu size. The syzbot has found a way
    to exploit the above to trigger the following splat:

    kernel BUG at ./include/linux/skbuff.h:2073!
    invalid opcode: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 1 PID: 4230 Comm: syzkaller672661 Not tainted 4.16.0-rc2+ #326
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:__skb_pull include/linux/skbuff.h:2073 [inline]
    RIP: 0010:__ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636
    RSP: 0018:ffff8801bc18f0f0 EFLAGS: 00010293
    RAX: ffff8801b17400c0 RBX: 0000000000000738 RCX: ffffffff84f01828
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801b415ac18
    RBP: ffff8801bc18f360 R08: ffff8801b4576844 R09: 0000000000000000
    R10: ffff8801bc18f380 R11: ffffed00367aee4e R12: 00000000000000d6
    R13: ffff8801b415a740 R14: dffffc0000000000 R15: ffff8801b45767c0
    FS: 0000000001535880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000002000b000 CR3: 00000001b4123001 CR4: 00000000001606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    ip6_finish_skb include/net/ipv6.h:969 [inline]
    udp_v6_push_pending_frames+0x269/0x3b0 net/ipv6/udp.c:1073
    udpv6_sendmsg+0x2a96/0x3400 net/ipv6/udp.c:1343
    inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:640
    ___sys_sendmsg+0x320/0x8b0 net/socket.c:2046
    __sys_sendmmsg+0x1ee/0x620 net/socket.c:2136
    SYSC_sendmmsg net/socket.c:2167 [inline]
    SyS_sendmmsg+0x35/0x60 net/socket.c:2162
    do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    RIP: 0033:0x4404c9
    RSP: 002b:00007ffdce35f948 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004404c9
    RDX: 0000000000000003 RSI: 0000000020001f00 RDI: 0000000000000003
    RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8
    R10: 0000000020000080 R11: 0000000000000217 R12: 0000000000401df0
    R13: 0000000000401e80 R14: 0000000000000000 R15: 0000000000000000
    Code: ff e8 1d 5e b9 fc e9 15 e9 ff ff e8 13 5e b9 fc e9 44 e8 ff ff e8 29
    5e b9 fc e9 c0 e6 ff ff e8 3f f3 80 fc 0f 0b e8 38 f3 80 fc 0b 49 8d
    87 80 00 00 00 4d 8d 87 84 00 00 00 48 89 85 20 fe
    RIP: __skb_pull include/linux/skbuff.h:2073 [inline] RSP: ffff8801bc18f0f0
    RIP: __ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636 RSP:
    ffff8801bc18f0f0

    As stated by RFC 7112 section 5:

    When a host fragments an IPv6 datagram, it MUST include the entire
    IPv6 Header Chain in the First Fragment.

    So this patch addresses the issue by dropping datagrams with excessive
    extheader length. It also updates the error path to report to the
    calling socket nonnegative pmtu values.
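
    The check can be pictured roughly as below. This is a sketch under assumptions: the exact condition and the way the reported pmtu is derived are guesses modeled on the description, and toy_check_header_chain() is a hypothetical helper rather than kernel code.

```c
#include <errno.h>

#define TOY_IPV6_HDR_LEN 40 /* size of the fixed IPv6 header */

/*
 * Reject a datagram whose header chain cannot fit in the first fragment.
 * headersize: fixed header plus extension headers (assumed meaning)
 * transhdrlen: transport header length
 * On failure, *reported_pmtu is clamped so it is never negative.
 */
static int toy_check_header_chain(int headersize, int transhdrlen, int mtu,
                                  int *reported_pmtu)
{
    if (headersize + transhdrlen > mtu) {
        int pmtu = mtu - headersize + TOY_IPV6_HDR_LEN;

        *reported_pmtu = pmtu > 0 ? pmtu : 0;
        return -EMSGSIZE;
    }
    return 0;
}
```

    For instance, a 2000-byte header chain on a 1500-byte path fails with -EMSGSIZE and reports a pmtu of 0 instead of a negative number.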

    The issue apparently predates git history.

    v1 -> v2: cleanup error path, as per Eric's suggestion

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: syzbot+91e6f9932ff122fa4410@syzkaller.appspotmail.com
    Signed-off-by: Paolo Abeni
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit 09ee9dba9611cd382fd360a99ad1c2fa23bfdca8 ]

    If SNAT modifies the source address the resulting packet might match
    an IPsec policy, reinject the packet if that's the case.

    The exact same thing is already done for IPv4.

    Signed-off-by: Tobias Brunner
    Acked-by: Steffen Klassert
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tobias Brunner
     

31 Jan, 2018

3 commits

  • [ Upstream commit 95ef498d977bf44ac094778fd448b98af158a3e6 ]

    In my last patch, I missed the fact that cork.base.dst was not initialized
    in ip6_make_skb():

    If ip6_setup_cork() returns an error, we might attempt a dst_release()
    on some random pointer.

    Fixes: 862c03ee1deb ("ipv6: fix possible mem leaks in ipv6_make_skb()")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 749439bfac6e1a2932c582e2699f91d329658196 ]

    The logic in __ip6_append_data() assumes that the MTU is at least large
    enough for the headers. A device's MTU may be adjusted after being
    added while sendmsg() is processing data, resulting in
    __ip6_append_data() seeing any MTU. For an mtu smaller than the size of
    the fragmentation header, the math results in a negative 'maxfraglen',
    which causes problems when refragmenting any previous skb in the
    skb_write_queue, leaving it possibly malformed.

    Instead sendmsg returns EINVAL when the mtu is calculated to be less
    than IPV6_MIN_MTU.
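
    To see how the math goes negative, here is a userspace sketch. It assumes maxfraglen is derived by rounding (mtu - fragheaderlen) down to a multiple of 8 and then accounting for an 8-byte fragment header; the toy_ names and the signed types are illustrative only.

```c
#include <errno.h>

#define TOY_IPV6_MIN_MTU 1280 /* minimum IPv6 link MTU */
#define TOY_FRAG_HDR_LEN 8    /* size of the IPv6 fragment header */

/* Assumed form of the maxfraglen computation, using signed arithmetic. */
static int toy_maxfraglen(int mtu, int fragheaderlen)
{
    return ((mtu - fragheaderlen) & ~7) + fragheaderlen - TOY_FRAG_HDR_LEN;
}

/* Guard in the spirit of the fix: refuse to proceed with a tiny MTU. */
static int toy_check_mtu(int mtu)
{
    return mtu < TOY_IPV6_MIN_MTU ? -EINVAL : 0;
}
```

    With a 40-byte fragheaderlen, an MTU of 1280 yields a maxfraglen of 1272, whereas an MTU of 4 yields a negative value, which is the case toy_check_mtu() rejects up front.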

    Found by syzkaller:
    kernel BUG at ./include/linux/skbuff.h:2064!
    invalid opcode: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 1 PID: 14216 Comm: syz-executor5 Not tainted 4.13.0-rc4+ #2
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    task: ffff8801d0b68580 task.stack: ffff8801ac6b8000
    RIP: 0010:__skb_pull include/linux/skbuff.h:2064 [inline]
    RIP: 0010:__ip6_make_skb+0x18cf/0x1f70 net/ipv6/ip6_output.c:1617
    RSP: 0018:ffff8801ac6bf570 EFLAGS: 00010216
    RAX: 0000000000010000 RBX: 0000000000000028 RCX: ffffc90003cce000
    RDX: 00000000000001b8 RSI: ffffffff839df06f RDI: ffff8801d9478ca0
    RBP: ffff8801ac6bf780 R08: ffff8801cc3f1dbc R09: 0000000000000000
    R10: ffff8801ac6bf7a0 R11: 43cb4b7b1948a9e7 R12: ffff8801cc3f1dc8
    R13: ffff8801cc3f1d40 R14: 0000000000001036 R15: dffffc0000000000
    FS: 00007f43d740c700(0000) GS:ffff8801dc100000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f7834984000 CR3: 00000001d79b9000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    ip6_finish_skb include/net/ipv6.h:911 [inline]
    udp_v6_push_pending_frames+0x255/0x390 net/ipv6/udp.c:1093
    udpv6_sendmsg+0x280d/0x31a0 net/ipv6/udp.c:1363
    inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    SYSC_sendto+0x352/0x5a0 net/socket.c:1750
    SyS_sendto+0x40/0x50 net/socket.c:1718
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x4512e9
    RSP: 002b:00007f43d740bc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 00000000007180a8 RCX: 00000000004512e9
    RDX: 000000000000002e RSI: 0000000020d08000 RDI: 0000000000000005
    RBP: 0000000000000086 R08: 00000000209c1000 R09: 000000000000001c
    R10: 0000000000040800 R11: 0000000000000216 R12: 00000000004b9c69
    R13: 00000000ffffffff R14: 0000000000000005 R15: 00000000202c2000
    Code: 9e 01 fe e9 c5 e8 ff ff e8 7f 9e 01 fe e9 4a ea ff ff 48 89 f7 e8 52 9e 01 fe e9 aa eb ff ff e8 a8 b6 cf fd 0f 0b e8 a1 b6 cf fd 0b 49 8d 45 78 4d 8d 45 7c 48 89 85 78 fe ff ff 49 8d 85 ba
    RIP: __skb_pull include/linux/skbuff.h:2064 [inline] RSP: ffff8801ac6bf570
    RIP: __ip6_make_skb+0x18cf/0x1f70 net/ipv6/ip6_output.c:1617 RSP: ffff8801ac6bf570

    Reported-by: syzbot
    Signed-off-by: Mike Maloney
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Mike Maloney
     
  • [ Upstream commit e9191ffb65d8e159680ce0ad2224e1acbde6985c ]

    Commit 513674b5a2c9 ("net: reevalulate autoflowlabel setting after
    sysctl setting") removed the initialisation of
    ipv6_pinfo::autoflowlabel and added a second flag to indicate
    whether this field or the net namespace default should be used.

    The getsockopt() handling for this case was not updated, so it
    currently returns 0 for all sockets for which IPV6_AUTOFLOWLABEL is
    not explicitly enabled. Fix it to return the effective value, whether
    that has been set at the socket or net namespace level.

    Fixes: 513674b5a2c9 ("net: reevalulate autoflowlabel setting after sysctl ...")
    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ben Hutchings
     

17 Jan, 2018

1 commit

  • [ Upstream commit 862c03ee1deb7e19e0f9931682e0294ecd1fcaf9 ]

    ip6_setup_cork() might return an error, while memory allocations have
    been done and must be rolled back.

    Fixes: 6422398c2ab0 ("ipv6: introduce ipv6_make_skb")
    Signed-off-by: Eric Dumazet
    Cc: Vlad Yasevich
    Reported-by: Mike Maloney
    Acked-by: Mike Maloney
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

03 Jan, 2018

1 commit

  • [ Upstream commit 513674b5a2c9c7a67501506419da5c3c77ac6f08 ]

    sysctl.ipv6.auto_flowlabels is 1 by default. On our hosts, we set it to 2.
    If sockopt doesn't set autoflowlabel, outgoing packets from the hosts are
    supposed to not include a flowlabel. This is true for normal packets, but
    not for reset packets.

    The reason is that ipv6_pinfo.autoflowlabel is set at sock creation. If
    we later change sysctl.ipv6.auto_flowlabels, ipv6_pinfo.autoflowlabel isn't
    changed, so the sock keeps the old behavior in terms of auto
    flowlabel. Reset packets suffer from this problem because they are
    sent from a special control socket, which is created at boot
    time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
    socket will always have its ipv6_pinfo.autoflowlabel set, even after the
    user sets sysctl.ipv6.auto_flowlabels to 2, so reset packets will always
    have a flowlabel. Normal socks created before the sysctl setting suffer from
    the same issue. We can't even turn off autoflowlabel unless we kill all
    socks on the hosts.

    To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
    autoflowlabel setting from user, otherwise we always call
    ip6_default_np_autolabel() which has the new settings of sysctl.
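
    The selection logic can be modeled in userspace as follows. The enum values mirror the documented 0..3 modes of the auto_flowlabels sysctl; struct toy_sock, the autoflowlabel_set flag name and both helpers are assumptions for illustration, and the sketch deliberately ignores how the forced mode overrides sockets.

```c
#include <stdbool.h>

/* Modes modeled on the auto_flowlabels sysctl values (assumed semantics). */
enum toy_fl_mode {
    TOY_FL_OFF = 0,    /* automatic flow labels disabled */
    TOY_FL_OPTOUT = 1, /* on by default, sockets may opt out */
    TOY_FL_OPTIN = 2,  /* off by default, sockets may opt in */
    TOY_FL_FORCED = 3  /* always on */
};

struct toy_sock {
    bool autoflowlabel;     /* value chosen via the socket option */
    bool autoflowlabel_set; /* true once the socket option was used */
};

/* Namespace-wide default, re-read from the sysctl on every call. */
static bool toy_default_autolabel(enum toy_fl_mode mode)
{
    return mode == TOY_FL_OPTOUT || mode == TOY_FL_FORCED;
}

/* Effective setting: the socket's own choice if made, else the default. */
static bool toy_autoflowlabel(enum toy_fl_mode mode, const struct toy_sock *sk)
{
    if (!sk->autoflowlabel_set)
        return toy_default_autolabel(mode);
    return sk->autoflowlabel;
}
```

    Because the default is evaluated at use time rather than cached at socket creation, a socket that never touched the option follows later sysctl changes.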

    Note, this changes behavior a little bit. Before commit 42240901f7c4
    (ipv6: Implement different admin modes for automatic flow labels), the
    autoflowlabel behavior of a sock isn't sticky, e.g. if sysctl changes,
    existing connections will change autoflowlabel behavior. After that
    commit, autoflowlabel behavior is sticky for the whole life of the sock.
    With this patch, the behavior is no longer sticky.

    Cc: Martin KaFai Lau
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: Shaohua Li
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     

22 Oct, 2017

1 commit

  • When syzkaller team brought us a C repro for the crash [1] that
    had been reported many times in the past, I finally could find
    the root cause.

    If FlowLabel info is merged by fl6_merge_options(), we leave
    part of the opt_space storage provided by udp/raw/l2tp with random value
    in opt_space.tot_len, unless a control message was provided at sendmsg()
    time.

    Then ip6_setup_cork() would use this random value to perform a kzalloc()
    call. Undefined behavior and crashes.

    Fix is to properly set tot_len in fl6_merge_options()

    At the same time, we can also avoid consuming memory and cpu cycles
    to clear it, if every option is copied via a kmemdup(). This is the
    change in ip6_setup_cork().

    [1]
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 0 PID: 6613 Comm: syz-executor0 Not tainted 4.14.0-rc4+ #127
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    task: ffff8801cb64a100 task.stack: ffff8801cc350000
    RIP: 0010:ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168
    RSP: 0018:ffff8801cc357550 EFLAGS: 00010203
    RAX: dffffc0000000000 RBX: ffff8801cc357748 RCX: 0000000000000010
    RDX: 0000000000000002 RSI: ffffffff842bd1d9 RDI: 0000000000000014
    RBP: ffff8801cc357620 R08: ffff8801cb17f380 R09: ffff8801cc357b10
    R10: ffff8801cb64a100 R11: 0000000000000000 R12: ffff8801cc357ab0
    R13: ffff8801cc357b10 R14: 0000000000000000 R15: ffff8801c3bbf0c0
    FS: 00007f9c5c459700(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020324000 CR3: 00000001d1cf2000 CR4: 00000000001406f0
    DR0: 0000000020001010 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Call Trace:
    ip6_make_skb+0x282/0x530 net/ipv6/ip6_output.c:1729
    udpv6_sendmsg+0x2769/0x3380 net/ipv6/udp.c:1340
    inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    SYSC_sendto+0x358/0x5a0 net/socket.c:1750
    SyS_sendto+0x40/0x50 net/socket.c:1718
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x4520a9
    RSP: 002b:00007f9c5c458c08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004520a9
    RDX: 0000000000000001 RSI: 0000000020fd1000 RDI: 0000000000000016
    RBP: 0000000000000086 R08: 0000000020e0afe4 R09: 000000000000001c
    R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004bb1ee
    R13: 00000000ffffffff R14: 0000000000000016 R15: 0000000000000029
    Code: e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ea 0f 00 00 48 8d 79 04 48 b8 00 00 00 00 00 fc ff df 45 8b 74 24 04 48 89 fa 48 c1 ea 03 b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
    RIP: ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168 RSP: ffff8801cc357550

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Aug, 2017

1 commit


26 Jul, 2017

1 commit

  • RFC 2465 defines ipv6IfStatsOutFragFails as:

    "The number of IPv6 datagrams that have been discarded
    because they needed to be fragmented at this output
    interface but could not be."

    The existing implementation, instead, would increase the counter
    twice in case we fail to allocate room for single fragments:
    once for the fragment, once for the datagram.

    This didn't look intentional though. In one of the two
    affected failure paths, the double increase was simply a result
    of a new 'goto fail' statement, introduced to avoid a skb leak.
    The other path appears to be affected since at least 2.6.12-rc2.

    Reported-by: Sabrina Dubroca
    Fixes: 1d325d217c7f ("ipv6: ip6_fragment: fix headroom tests and skb leak")
    Signed-off-by: Stefano Brivio
    Signed-off-by: David S. Miller

    Stefano Brivio
     

18 Jul, 2017

1 commit


01 Jul, 2017

2 commits


24 Jun, 2017

1 commit

  • Our customer encountered stuck NFS writes for blocks starting at specific
    offsets w.r.t. page boundary caused by networking stack sending packets via
    UFO enabled device with wrong checksum. The problem can be reproduced by
    composing a long UDP datagram from multiple parts using MSG_MORE flag:

    sendto(sd, buff, 1000, MSG_MORE, ...);
    sendto(sd, buff, 1000, MSG_MORE, ...);
    sendto(sd, buff, 3000, 0, ...);

    Assume this packet is to be routed via a device with MTU 1500 and
    NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
    this condition is tested (among others) to decide whether to call
    ip_ufo_append_data():

    ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))

    At the moment, we already have skb with 1028 bytes of data which is not
    marked for GSO so that the test is false (fragheaderlen is usually 20).
    Thus we append second 1000 bytes to this skb without invoking UFO. Third
    sendto(), however, has sufficient length to trigger the UFO path so that we
    end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
    uses udp_csum() to calculate the checksum but that assumes all fragments
    have correct checksum in skb->csum which is not true for UFO fragments.

    When checking against MTU, we need to add skb->len to length of new segment
    if we already have a partially filled skb and fragheaderlen only if there
    isn't one.

    In the IPv6 case, skb can only be null if this is the first segment so that
    we have to use headersize (length of the first IPv6 header) rather than
    fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.
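
    The difference between the two MTU tests can be worked through with the numbers from this message. The sketch models only the length comparison, not the skb_is_gso() part of the condition; exceeds_mtu_old() and exceeds_mtu_new() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Old test: ignores data already queued in a partially filled skb. */
static bool exceeds_mtu_old(size_t length, size_t fragheaderlen, size_t mtu)
{
    return length + fragheaderlen > mtu;
}

/*
 * Revised test: if a partially filled skb exists, count its current
 * length; otherwise count the header length (fragheaderlen for IPv4,
 * headersize for the first IPv6 segment).
 */
static bool exceeds_mtu_new(size_t length, bool have_skb, size_t skb_len,
                            size_t hdrlen, size_t mtu)
{
    return length + (have_skb ? skb_len : hdrlen) > mtu;
}
```

    For the second sendto() above (1000 new bytes, 1028 bytes already queued, MTU 1500), the old test sees 1020 and stays false, while the revised test sees 2028 and takes the UFO path from that point on.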

    Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
    Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for
    ip6 fragment between __ip6_append_data and ip6_finish_output")
    Signed-off-by: Michal Kubecek
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Michal Kubeček
     

18 Jun, 2017

1 commit


16 Jun, 2017

1 commit

  • It seems like a historic accident that these return unsigned char *,
    and in many places that means casts are required, more often than not.

    Make these functions return void * and remove all the casts across
    the tree, adding a (u8 *) cast only where the unsigned char pointer
    was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

    Note that the last part there converts from push(...)[0] to the
    more idiomatic *(u8 *)push(...).

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

11 Jun, 2017

1 commit


10 Jun, 2017

1 commit


22 May, 2017

1 commit

  • Andrey Konovalov and idaifish@gmail.com reported crashes caused by
    one skb shared_info being overwritten from __ip6_append_data().

    Andrey's program leads to the following state:

    copy -4200 datalen 2000 fraglen 2040
    maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200

    The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen,
    fraggap, 0); is overwriting skb->head and skb_shared_info

    Since we apparently detect this rare condition too late, move the
    code earlier to even avoid allocating skb and risking crashes.
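
    The state above is consistent with copy being computed as datalen - transhdrlen - fraggap, which is the assumption behind this sketch. toy_copy_len() is a hypothetical helper showing the check done before any buffer would be allocated.

```c
#include <errno.h>

/*
 * Compute how many bytes of new payload go into the next buffer.
 * Returns 0 and stores the length in *copy, or -EINVAL if the
 * result would be negative, before anything has been allocated.
 */
static int toy_copy_len(int datalen, int transhdrlen, int fraggap, int *copy)
{
    int c = datalen - transhdrlen - fraggap;

    if (c < 0)
        return -EINVAL;
    *copy = c;
    return 0;
}
```

    With datalen 2000, transhdrlen 0 and fraggap 6200 the intermediate value is -4200, so the call fails early instead of feeding a negative length to a later copy.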

    Once again, many thanks to Andrey and syzkaller team.

    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Reported-by:
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 May, 2017

2 commits

  • Do not use unsigned variables to see if it returns a negative
    error or not.
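
    A short illustration of the pitfall, using a made-up helper: toy_find_offset() stands in for any function that returns either an offset or a negative errno.

```c
#include <errno.h>

/* Hypothetical helper: returns an offset on success, negative errno on error. */
static int toy_find_offset(int ok)
{
    return ok ? 40 : -EINVAL;
}

/* Unsigned storage: the negative errno wraps to a huge positive "length". */
static unsigned int toy_len_unsigned(int ok)
{
    return (unsigned int)toy_find_offset(ok);
}

/* Signed storage: the error stays visible to a "< 0" test. */
static int toy_len_signed(int ok)
{
    return toy_find_offset(ok);
}
```

    A caller testing toy_len_signed(0) < 0 catches the failure, whereas the unsigned variant can never compare below zero and would be treated as a very large valid length.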

    Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options")
    Reported-by: Julia Lawall
    Signed-off-by: David S. Miller

    David S. Miller
     
  • The KASAN warning reported below was discovered with a syzkaller
    program. The reproducer is basically:
    int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
    send(s, &one_byte_of_data, 1, MSG_MORE);
    send(s, &more_than_mtu_bytes_data, 2000, 0);

    The socket() call sets the nexthdr field of the v6 header to
    NEXTHDR_HOP, the first send call primes the payload with a non zero
    byte of data, and the second send call triggers the fragmentation path.

    The fragmentation code tries to parse the header options in order
    to figure out where to insert the fragment option. Since nexthdr points
    to an invalid option, the calculation of the size of the network header
    can be made to be much larger than the linear section of the skb and data
    is read outside of it.

    This fix makes ip6_find_1stfrag return an error if it detects
    running out-of-bounds.
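
    A simplified bounds-checked walk is sketched below. It is not the real ip6_find_1stfrag(): it skips only hop-by-hop, routing and destination option headers, and the toy_ names and constants are illustrative.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_IPV6_HDR_LEN    40
#define TOY_NEXTHDR_HOP      0
#define TOY_NEXTHDR_ROUTING 43
#define TOY_NEXTHDR_DEST    60

/*
 * Walk the extension headers in buf[0..len) and return the offset at
 * which a fragment header would be inserted, or -EINVAL if the chain
 * runs past the end of the buffer. first_nexthdr is the Next Header
 * value taken from the fixed IPv6 header.
 */
static int toy_find_1stfrag(const uint8_t *buf, size_t len,
                            uint8_t first_nexthdr)
{
    size_t offset = TOY_IPV6_HDR_LEN;
    uint8_t nexthdr = first_nexthdr;

    while (offset <= len) {
        if (nexthdr != TOY_NEXTHDR_HOP &&
            nexthdr != TOY_NEXTHDR_ROUTING &&
            nexthdr != TOY_NEXTHDR_DEST)
            return (int)offset;

        /* Need the 2-byte (next header, length) prefix in bounds. */
        if (offset + 2 > len)
            return -EINVAL;

        nexthdr = buf[offset];
        /* Length field counts 8-byte units beyond the first 8 bytes. */
        offset += ((size_t)buf[offset + 1] + 1) * 8;
    }
    return -EINVAL;
}
```

    In the reproducer's situation, where the Next Header value claims a hop-by-hop header but only a stray payload byte follows, the walk stops with -EINVAL instead of computing an oversized header length.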

    [ 42.361487] ==================================================================
    [ 42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730
    [ 42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789
    [ 42.366469]
    [ 42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41
    [ 42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
    [ 42.368824] Call Trace:
    [ 42.369183] dump_stack+0xb3/0x10b
    [ 42.369664] print_address_description+0x73/0x290
    [ 42.370325] kasan_report+0x252/0x370
    [ 42.370839] ? ip6_fragment+0x11c8/0x3730
    [ 42.371396] check_memory_region+0x13c/0x1a0
    [ 42.371978] memcpy+0x23/0x50
    [ 42.372395] ip6_fragment+0x11c8/0x3730
    [ 42.372920] ? nf_ct_expect_unregister_notifier+0x110/0x110
    [ 42.373681] ? ip6_copy_metadata+0x7f0/0x7f0
    [ 42.374263] ? ip6_forward+0x2e30/0x2e30
    [ 42.374803] ip6_finish_output+0x584/0x990
    [ 42.375350] ip6_output+0x1b7/0x690
    [ 42.375836] ? ip6_finish_output+0x990/0x990
    [ 42.376411] ? ip6_fragment+0x3730/0x3730
    [ 42.376968] ip6_local_out+0x95/0x160
    [ 42.377471] ip6_send_skb+0xa1/0x330
    [ 42.377969] ip6_push_pending_frames+0xb3/0xe0
    [ 42.378589] rawv6_sendmsg+0x2051/0x2db0
    [ 42.379129] ? rawv6_bind+0x8b0/0x8b0
    [ 42.379633] ? _copy_from_user+0x84/0xe0
    [ 42.380193] ? debug_check_no_locks_freed+0x290/0x290
    [ 42.380878] ? ___sys_sendmsg+0x162/0x930
    [ 42.381427] ? rcu_read_lock_sched_held+0xa3/0x120
    [ 42.382074] ? sock_has_perm+0x1f6/0x290
    [ 42.382614] ? ___sys_sendmsg+0x167/0x930
    [ 42.383173] ? lock_downgrade+0x660/0x660
    [ 42.383727] inet_sendmsg+0x123/0x500
    [ 42.384226] ? inet_sendmsg+0x123/0x500
    [ 42.384748] ? inet_recvmsg+0x540/0x540
    [ 42.385263] sock_sendmsg+0xca/0x110
    [ 42.385758] SYSC_sendto+0x217/0x380
    [ 42.386249] ? SYSC_connect+0x310/0x310
    [ 42.386783] ? __might_fault+0x110/0x1d0
    [ 42.387324] ? lock_downgrade+0x660/0x660
    [ 42.387880] ? __fget_light+0xa1/0x1f0
    [ 42.388403] ? __fdget+0x18/0x20
    [ 42.388851] ? sock_common_setsockopt+0x95/0xd0
    [ 42.389472] ? SyS_setsockopt+0x17f/0x260
    [ 42.390021] ? entry_SYSCALL_64_fastpath+0x5/0xbe
    [ 42.390650] SyS_sendto+0x40/0x50
    [ 42.391103] entry_SYSCALL_64_fastpath+0x1f/0xbe
    [ 42.391731] RIP: 0033:0x7fbbb711e383
    [ 42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    [ 42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383
    [ 42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003
    [ 42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018
    [ 42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad
    [ 42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00
    [ 42.397257]
    [ 42.397411] Allocated by task 3789:
    [ 42.397702] save_stack_trace+0x16/0x20
    [ 42.398005] save_stack+0x46/0xd0
    [ 42.398267] kasan_kmalloc+0xad/0xe0
    [ 42.398548] kasan_slab_alloc+0x12/0x20
    [ 42.398848] __kmalloc_node_track_caller+0xcb/0x380
    [ 42.399224] __kmalloc_reserve.isra.32+0x41/0xe0
    [ 42.399654] __alloc_skb+0xf8/0x580
    [ 42.400003] sock_wmalloc+0xab/0xf0
    [ 42.400346] __ip6_append_data.isra.41+0x2472/0x33d0
    [ 42.400813] ip6_append_data+0x1a8/0x2f0
    [ 42.401122] rawv6_sendmsg+0x11ee/0x2db0
    [ 42.401505] inet_sendmsg+0x123/0x500
    [ 42.401860] sock_sendmsg+0xca/0x110
    [ 42.402209] ___sys_sendmsg+0x7cb/0x930
    [ 42.402582] __sys_sendmsg+0xd9/0x190
    [ 42.402941] SyS_sendmsg+0x2d/0x50
    [ 42.403273] entry_SYSCALL_64_fastpath+0x1f/0xbe
    [ 42.403718]
    [ 42.403871] Freed by task 1794:
    [ 42.404146] save_stack_trace+0x16/0x20
    [ 42.404515] save_stack+0x46/0xd0
    [ 42.404827] kasan_slab_free+0x72/0xc0
    [ 42.405167] kfree+0xe8/0x2b0
    [ 42.405462] skb_free_head+0x74/0xb0
    [ 42.405806] skb_release_data+0x30e/0x3a0
    [ 42.406198] skb_release_all+0x4a/0x60
    [ 42.406563] consume_skb+0x113/0x2e0
    [ 42.406910] skb_free_datagram+0x1a/0xe0
    [ 42.407288] netlink_recvmsg+0x60d/0xe40
    [ 42.407667] sock_recvmsg+0xd7/0x110
    [ 42.408022] ___sys_recvmsg+0x25c/0x580
    [ 42.408395] __sys_recvmsg+0xd6/0x190
    [ 42.408753] SyS_recvmsg+0x2d/0x50
    [ 42.409086] entry_SYSCALL_64_fastpath+0x1f/0xbe
    [ 42.409513]
    [ 42.409665] The buggy address belongs to the object at ffff88000969e780
    [ 42.409665] which belongs to the cache kmalloc-512 of size 512
    [ 42.410846] The buggy address is located 24 bytes inside of
    [ 42.410846] 512-byte region [ffff88000969e780, ffff88000969e980)
    [ 42.411941] The buggy address belongs to the page:
    [ 42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
    [ 42.413298] flags: 0x100000000008100(slab|head)
    [ 42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c
    [ 42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000
    [ 42.415074] page dumped because: kasan: bad access detected
    [ 42.415604]
    [ 42.415757] Memory state around the buggy address:
    [ 42.416222] ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 42.416904] ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 42.418273] ^
    [ 42.418588] ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 42.419273] ffff88000969ea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 42.419882] ==================================================================

    Reported-by: Andrey Konovalov
    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

14 Mar, 2017

1 commit

  • ip6_fragment, in case skb has a fraglist, checks if the
    skb is cloned. If it is, it moves to the 'slow path' and allocates
    new skbs for each fragment.

    However, right before entering the slowpath loop, it updates the
    nexthdr value of the last ipv6 extension header to NEXTHDR_FRAGMENT,
    to account for the fragment header that will be inserted in the new
    ipv6-fragment skbs.

    If the original skb is cloned, this munges the nexthdr value of another
    skb. Avoid this by doing the nexthdr update for each of the new fragment
    skbs separately.

    This was observed with tcpdump on a bridge device where netfilter ipv6
    reassembly is active: tcpdump shows malformed fragment headers as
    the l4 header (icmpv6, tcp, etc.) is decoded as a fragment header.
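
    The difference between patching a shared header and patching a private
    copy can be sketched in user space as below. This is only an illustrative
    model, not the kernel code: struct toy_hdr and both helper functions are
    hypothetical names, and the shared buffer stands in for header data that a
    cloned skb also points at.

```c
#include <stdint.h>
#include <string.h>

#define NEXTHDR_TCP       6   /* IANA protocol number for TCP */
#define NEXTHDR_FRAGMENT  44  /* IANA protocol number for the IPv6 Fragment header */

/* Hypothetical stand-in for the header field of interest. */
struct toy_hdr {
    uint8_t nexthdr;
};

/* Problematic pattern: patch the original header, then copy it.
 * Anyone else sharing 'shared' (e.g. a clone) sees the patched value. */
static void make_fragment_shared_update(struct toy_hdr *shared,
                                        struct toy_hdr *frag)
{
    shared->nexthdr = NEXTHDR_FRAGMENT;
    memcpy(frag, shared, sizeof(*frag));
}

/* Safer pattern: copy first, then patch only the new fragment's copy. */
static void make_fragment_private_update(const struct toy_hdr *shared,
                                         struct toy_hdr *frag)
{
    memcpy(frag, shared, sizeof(*frag));
    frag->nexthdr = NEXTHDR_FRAGMENT;
}
```

    With the second helper the original header still reads NEXTHDR_TCP after
    the fragment is built, which is the property the commit message asks for.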

    Cc: Hannes Frederic Sowa
    Reported-by: Andreas Karis
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

10 Mar, 2017

1 commit

  • commit c146066ab802 ("ipv4: Don't use ufo handling on later transformed
    packets") and commit f89c56ce710a ("ipv6: Don't use ufo handling on
    later transformed packets") added a check that 'rt->dst.header_len' isn't
    zero in order to skip UFO, but it doesn't include IPcomp in transport mode
    where it equals zero.

    Packets, after payload compression, may not require further fragmentation,
    and if original length exceeds MTU, later compressed packets will be
    transmitted incorrectly. This can be reproduced with LTP udp_ipsec.sh test
    on veth device with enabled UFO, MTU is 1500 and UDP payload is 2000:

    * IPv4 case, offset is wrong + unnecessary fragmentation
    udp_ipsec.sh -p comp -m transport -s 2000 &
    tcpdump -ni ltp_ns_veth2
    ...
    IP (tos 0x0, ttl 64, id 45203, offset 0, flags [+],
    proto Compressed IP (108), length 49)
    10.0.0.2 > 10.0.0.1: IPComp(cpi=0x1000)
    IP (tos 0x0, ttl 64, id 45203, offset 1480, flags [none],
    proto UDP (17), length 21) 10.0.0.2 > 10.0.0.1: ip-proto-17

    * IPv6 case, sending small fragments
    udp_ipsec.sh -6 -p comp -m transport -s 2000 &
    tcpdump -ni ltp_ns_veth2
    ...
    IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
    payload length: 37) fd00::2 > fd00::1: IPComp(cpi=0x1000)
    IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
    payload length: 21) fd00::2 > fd00::1: IPComp(cpi=0x1000)

    Fix it by checking 'rt->dst.xfrm' pointer to 'xfrm_state' struct, skip UFO
    if xfrm is set. So the new check will include both cases: IPcomp and IPsec.
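
    As a rough sketch of why the pointer check covers more cases than the
    length check, consider the two predicates below. struct toy_route and the
    function names are hypothetical; only the two fields mirror the
    'header_len' and 'xfrm' members mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the route fields discussed above. */
struct toy_route {
    int header_len;     /* extra header bytes added by a transform */
    const void *xfrm;   /* non-NULL when a transform state is attached */
};

/* Old-style test: skip UFO only if the transform adds header bytes. */
static bool skip_ufo_by_header_len(const struct toy_route *rt)
{
    return rt->header_len != 0;
}

/* New-style test: skip UFO whenever any transform is attached. */
static bool skip_ufo_by_xfrm(const struct toy_route *rt)
{
    return rt->xfrm != NULL;
}
```

    For a route modelling IPcomp in transport mode (header_len of zero, xfrm
    set), only the second predicate returns true.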

    Fixes: c146066ab802 ("ipv4: Don't use ufo handling on later transformed packets")
    Fixes: f89c56ce710a ("ipv6: Don't use ufo handling on later transformed packets")
    Signed-off-by: Alexey Kodanev
    Signed-off-by: David S. Miller

    Alexey Kodanev
     

20 Feb, 2017

1 commit


19 Feb, 2017

1 commit


17 Feb, 2017

1 commit


15 Feb, 2017

1 commit

  • This patch adds a check for the problematic case of an IPv4-mapped IPv6
    source address and a destination address that is neither an IPv4-mapped
    IPv6 address nor in6addr_any, and returns an appropriate error. The
    check is done before returning from looking up the route.
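
    The condition described above can be approximated in user space with the
    standard address-classification macros from <netinet/in.h>. This is a
    sketch under that assumption, not the kernel implementation; the function
    name is hypothetical and the error value returned by the real patch is
    deliberately left out.

```c
#include <stdbool.h>
#include <netinet/in.h>

/* Returns true for the problematic combination: an IPv4-mapped source
 * with a destination that is neither IPv4-mapped nor the unspecified
 * address (in6addr_any). */
static bool v4mapped_src_mismatch(const struct in6_addr *src,
                                  const struct in6_addr *dst)
{
    return IN6_IS_ADDR_V4MAPPED(src) &&
           !IN6_IS_ADDR_V4MAPPED(dst) &&
           !IN6_IS_ADDR_UNSPECIFIED(dst);
}
```

    For example, a source of ::ffff:10.0.0.2 with a destination of fd00::1
    matches the condition, while the same source with a destination of :: or
    ::ffff:10.0.0.1 does not.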

    Signed-off-by: Jonathan T. Leighton
    Signed-off-by: David S. Miller

    Jonathan T. Leighton
     

12 Feb, 2017

1 commit


08 Feb, 2017

2 commits

  • When the same struct dst_entry can be used for many different
    neighbours we can not use it for pending confirmations.

    The datagram protocols can use MSG_CONFIRM to confirm the
    neighbour. When used with MSG_PROBE we do not reach the
    code where neighbour is confirmed, so we have to do the
    same slow lookup by using the dst_confirm_neigh() helper.
    When MSG_PROBE is not used, ip_append_data/ip6_append_data
    will set the skb flag dst_pending_confirm.

    Reported-by: YueHaibing
    Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.")
    Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.")
    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Add new skbuff flag to allow protocols to confirm neighbour.
    When the same struct dst_entry can be used for many different
    neighbours we can not use it for pending confirmations.

    Add sock_confirm_neigh() helper to confirm the neighbour and
    use it for IPv4, IPv6 and VRF before dst_neigh_output.

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     

31 Jan, 2017

1 commit

  • IPv6 will mark data that is smaller than mtu - headersize as
    CHECKSUM_PARTIAL, but if the data will completely fill the mtu,
    the packet checksum will be computed in software instead.
    Extend the conditional to include the data that fills the mtu
    as well.
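
    The boundary case can be illustrated with two small predicates. These are
    a sketch of the comparison only, with hypothetical function names, and
    they assume headersize never exceeds mtu so the unsigned subtraction
    cannot wrap.

```c
#include <stdbool.h>

/* Before: data that exactly fills the MTU is excluded from offload. */
static bool csum_partial_strict(unsigned int length, unsigned int mtu,
                                unsigned int headersize)
{
    return length < mtu - headersize;   /* assumes headersize <= mtu */
}

/* After: data that exactly fills the MTU is included as well. */
static bool csum_partial_inclusive(unsigned int length, unsigned int mtu,
                                   unsigned int headersize)
{
    return length <= mtu - headersize;  /* assumes headersize <= mtu */
}
```

    With mtu = 1500 and headersize = 48, a length of 1452 is rejected by the
    first predicate and accepted by the second; shorter and longer lengths
    are treated the same by both.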

    Signed-off-by: Vladislav Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

27 Jan, 2017

1 commit

  • Unlike ipv4, this control socket is shared by all cpus so we cannot use
    it as scratchpad area to annotate the mark that we pass to ip6_xmit().

    Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket
    family caches the flowi6 structure in the sctp_transport structure, so
    we cannot use it to carry the mark unless we later on reset it back, which
    I discarded since it looks ugly to me.

    Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled")
    Suggested-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
     

30 Dec, 2016

1 commit

  • …_append_data and ip6_finish_output

    There is an inconsistent conditional judgement between the __ip6_append_data
    and ip6_finish_output functions: the variable length in __ip6_append_data
    includes only the length of the application's payload and the udp6 header,
    not the length of the ipv6 header, but ip6_finish_output uses
    (skb->len > ip6_skb_dst_mtu(skb)) as its judgement, and skb->len includes
    the length of the ipv6 header.

    As a result, udp6 payloads whose length is between (MTU - IPv6 Header) and
    MTU are fragmented by ip6_fragment even though the rst->dev supports the
    UFO feature.

    Add the length of the ipv6 header to length in __ip6_append_data to keep
    the conditional judgement consistent with ip6_finish_output for ip6 fragment.
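
    The mismatch is plain arithmetic and can be sketched as follows. The
    function names are hypothetical, and the sketch assumes a bare 40-byte
    IPv6 header with no extension headers; 'length' is the payload plus udp6
    header, as described above.

```c
#include <stdbool.h>

#define IPV6_HDR_LEN 40u  /* fixed IPv6 header size in bytes */

/* Append-side test before the change: the IPv6 header is not counted. */
static bool append_exceeds_mtu_old(unsigned int length, unsigned int mtu)
{
    return length > mtu;
}

/* Append-side test after the change: the IPv6 header is counted. */
static bool append_exceeds_mtu_new(unsigned int length, unsigned int mtu)
{
    return length + IPV6_HDR_LEN > mtu;
}

/* Output-side test: models skb->len, which includes the IPv6 header. */
static bool output_exceeds_mtu(unsigned int length, unsigned int mtu)
{
    return length + IPV6_HDR_LEN > mtu;
}
```

    For mtu = 1500 and length = 1480, the old append-side test says the
    packet fits while the output-side test says it does not; the new
    append-side test agrees with the output side.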

    Signed-off-by: Zheng Li <james.z.li@ericsson.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

    Zheng Li
     

26 Nov, 2016

1 commit

  • If the cgroup associated with the receiving socket has eBPF
    programs installed, run them from ip_output(), ip6_output() and
    ip_mc_output(). From mentioned functions we have two socket contexts
    as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
    okfn()."). We explicitly need to use sk instead of skb->sk here,
    since otherwise the same program would run multiple times on egress
    when encap devices are involved, which is not desired in our case.

    eBPF programs used in this context are expected to either return 1 to
    let the packet pass, or != 1 to drop them. The programs have access to
    the skb through bpf_skb_load_bytes(), and the payload starts at the
    network headers (L3).
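
    The return-value convention can be modelled in user space as below. This
    is a toy sketch, not the kernel hook: toy_prog_fn, toy_run_filter() and
    toy_allow_ipv6_only() are hypothetical names, the 0 / -1 results are an
    arbitrary choice for the sketch, and the program is an ordinary C
    function rather than eBPF.

```c
#include <stddef.h>

/* Hypothetical program type: sees the packet starting at the L3 header
 * and returns 1 to let it pass, anything else to drop it. */
typedef int (*toy_prog_fn)(const unsigned char *l3, size_t len);

/* Returns 0 if the packet may pass, -1 if it must be dropped.
 * With no program attached the packet always passes. */
static int toy_run_filter(toy_prog_fn prog, const unsigned char *l3,
                          size_t len)
{
    if (prog == NULL)
        return 0;
    return prog(l3, len) == 1 ? 0 : -1;
}

/* Example program: pass only packets whose IP version nibble is 6. */
static int toy_allow_ipv6_only(const unsigned char *l3, size_t len)
{
    return len >= 1 && (l3[0] >> 4) == 6;
}
```

    A packet whose first byte is 0x60 passes this example program, while one
    starting with 0x45 (IPv4) is dropped.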

    Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
    for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
    the feature is unused.

    Signed-off-by: Daniel Mack
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Mack