04 Nov, 2018

1 commit

  • [ Upstream commit dc012f3628eaecfb5ba68404a5c30ef501daf63d ]

    syzbot found a use-after-free in inet6_mc_check [1]

    The problem here is that inet6_mc_check() uses rcu
    and read_lock(&iml->sflock)

    So the fact that ip6_mc_leave_src() is called under RTNL
    and the socket lock does not help us, we need to acquire
    iml->sflock in write mode.

    In the future, we should convert all this stuff to RCU.

    [1]
    BUG: KASAN: use-after-free in ipv6_addr_equal include/net/ipv6.h:521 [inline]
    BUG: KASAN: use-after-free in inet6_mc_check+0xae7/0xb40 net/ipv6/mcast.c:649
    Read of size 8 at addr ffff8801ce7f2510 by task syz-executor0/22432

    CPU: 1 PID: 22432 Comm: syz-executor0 Not tainted 4.19.0-rc7+ #280
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
    print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:354 [inline]
    kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
    __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
    ipv6_addr_equal include/net/ipv6.h:521 [inline]
    inet6_mc_check+0xae7/0xb40 net/ipv6/mcast.c:649
    __raw_v6_lookup+0x320/0x3f0 net/ipv6/raw.c:98
    ipv6_raw_deliver net/ipv6/raw.c:183 [inline]
    raw6_local_deliver+0x3d3/0xcb0 net/ipv6/raw.c:240
    ip6_input_finish+0x467/0x1aa0 net/ipv6/ip6_input.c:345
    NF_HOOK include/linux/netfilter.h:289 [inline]
    ip6_input+0xe9/0x600 net/ipv6/ip6_input.c:426
    ip6_mc_input+0x48a/0xd20 net/ipv6/ip6_input.c:503
    dst_input include/net/dst.h:450 [inline]
    ip6_rcv_finish+0x17a/0x330 net/ipv6/ip6_input.c:76
    NF_HOOK include/linux/netfilter.h:289 [inline]
    ipv6_rcv+0x120/0x640 net/ipv6/ip6_input.c:271
    __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4913
    __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5023
    netif_receive_skb_internal+0x12c/0x620 net/core/dev.c:5126
    napi_frags_finish net/core/dev.c:5664 [inline]
    napi_gro_frags+0x75a/0xc90 net/core/dev.c:5737
    tun_get_user+0x3189/0x4250 drivers/net/tun.c:1923
    tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1968
    call_write_iter include/linux/fs.h:1808 [inline]
    do_iter_readv_writev+0x8b0/0xa80 fs/read_write.c:680
    do_iter_write+0x185/0x5f0 fs/read_write.c:959
    vfs_writev+0x1f1/0x360 fs/read_write.c:1004
    do_writev+0x11a/0x310 fs/read_write.c:1039
    __do_sys_writev fs/read_write.c:1112 [inline]
    __se_sys_writev fs/read_write.c:1109 [inline]
    __x64_sys_writev+0x75/0xb0 fs/read_write.c:1109
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x457421
    Code: 75 14 b8 14 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 34 b5 fb ff c3 48 83 ec 08 e8 1a 2d 00 00 48 89 04 24 b8 14 00 00 00 0f 05 8b 3c 24 48 89 c2 e8 63 2d 00 00 48 89 d0 48 83 c4 08 48 3d 01
    RSP: 002b:00007f2d30ecaba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
    RAX: ffffffffffffffda RBX: 000000000000003e RCX: 0000000000457421
    RDX: 0000000000000001 RSI: 00007f2d30ecabf0 RDI: 00000000000000f0
    RBP: 0000000020000500 R08: 00000000000000f0 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000293 R12: 00007f2d30ecb6d4
    R13: 00000000004c4890 R14: 00000000004d7b90 R15: 00000000ffffffff

    Allocated by task 22437:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
    __do_kmalloc mm/slab.c:3718 [inline]
    __kmalloc+0x14e/0x760 mm/slab.c:3727
    kmalloc include/linux/slab.h:518 [inline]
    sock_kmalloc+0x15a/0x1f0 net/core/sock.c:1983
    ip6_mc_source+0x14dd/0x1960 net/ipv6/mcast.c:427
    do_ipv6_setsockopt.isra.9+0x3afb/0x45d0 net/ipv6/ipv6_sockglue.c:743
    ipv6_setsockopt+0xbd/0x170 net/ipv6/ipv6_sockglue.c:933
    rawv6_setsockopt+0x59/0x140 net/ipv6/raw.c:1069
    sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3038
    __sys_setsockopt+0x1ba/0x3c0 net/socket.c:1902
    __do_sys_setsockopt net/socket.c:1913 [inline]
    __se_sys_setsockopt net/socket.c:1910 [inline]
    __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1910
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 22430:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
    kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
    __cache_free mm/slab.c:3498 [inline]
    kfree+0xcf/0x230 mm/slab.c:3813
    __sock_kfree_s net/core/sock.c:2004 [inline]
    sock_kfree_s+0x29/0x60 net/core/sock.c:2010
    ip6_mc_leave_src+0x11a/0x1d0 net/ipv6/mcast.c:2448
    __ipv6_sock_mc_close+0x20b/0x4e0 net/ipv6/mcast.c:310
    ipv6_sock_mc_close+0x158/0x1d0 net/ipv6/mcast.c:328
    inet6_release+0x40/0x70 net/ipv6/af_inet6.c:452
    __sock_release+0xd7/0x250 net/socket.c:579
    sock_close+0x19/0x20 net/socket.c:1141
    __fput+0x385/0xa30 fs/file_table.c:278
    ____fput+0x15/0x20 fs/file_table.c:309
    task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
    tracehook_notify_resume include/linux/tracehook.h:193 [inline]
    exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
    prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
    syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
    do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff8801ce7f2500
    which belongs to the cache kmalloc-192 of size 192
    The buggy address is located 16 bytes inside of
    192-byte region [ffff8801ce7f2500, ffff8801ce7f25c0)
    The buggy address belongs to the page:
    page:ffffea000739fc80 count:1 mapcount:0 mapping:ffff8801da800040 index:0x0
    flags: 0x2fffc0000000100(slab)
    raw: 02fffc0000000100 ffffea0006f6e548 ffffea000737b948 ffff8801da800040
    raw: 0000000000000000 ffff8801ce7f2000 0000000100000010 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
     ffff8801ce7f2400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff8801ce7f2480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    >ffff8801ce7f2500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
     ffff8801ce7f2580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
     ffff8801ce7f2600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

24 Aug, 2018

1 commit

  • [ Upstream commit 6c6da92808442908287fae8ebb0ca041a52469f4 ]

    After receiving MLD queries, we update idev->mc_maxdelay with max_delay
    from the query header. This makes later unsolicited reports use
    mc_maxdelay as their interval, which means we may send unsolicited
    reports at a long interval instead of the default configured interval.

    Also, since we do not call ipv6_mc_reset() after the device comes up,
    this issue persists even after leaving the group and joining other groups.

    Fixes: fc4eba58b4c14 ("ipv6: make unsolicited report intervals configurable for mld")
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
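
    The separation argued for in the commit above can be sketched as
    follows. This is a simplified model under stated assumptions, not
    the kernel code: struct toy_idev and both functions are hypothetical,
    and the only point is that the bound used for unsolicited reports
    comes from the configured interval and is not overwritten by the
    max_delay of a received query.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-interface state; the field names only echo the text. */
struct toy_idev {
    unsigned long mc_maxdelay;          /* refreshed from every query */
    unsigned long unsolicited_interval; /* configured default */
};

/* A received query updates the query-response delay only. */
static void toy_process_query(struct toy_idev *idev, unsigned long max_delay)
{
    assert(idev != NULL);
    idev->mc_maxdelay = max_delay;
}

/* Upper bound for the delay of an unsolicited report. It is taken from
 * the configured interval, never from mc_maxdelay. */
static unsigned long toy_unsolicited_bound(const struct toy_idev *idev)
{
    return idev->unsolicited_interval;
}
```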
     

28 Jul, 2018

1 commit

  • There are two scenarios in which we restore deleted records. The first
    is when the device goes down and up (or unmap/remap). In this scenario
    the new filter mode is the same as the previous one, because we get it
    from in_dev->mc_list and do not touch it while the device goes down and up.

    The other scenario is when a new socket joins a group that was just
    deleted and has not finished sending status reports. In this scenario,
    we should use the current filter mode instead of restoring the old one.
    There are 4 cases in total.

    old_socket    new_socket    before_fix    after_fix
    IN(A)         IN(A)         ALLOW(A)      ALLOW(A)
    IN(A)         EX( )         TO_IN( )      TO_EX( )
    EX( )         IN(A)         TO_EX( )      ALLOW(A)
    EX( )         EX( )         TO_EX( )      TO_EX( )

    Fixes: 24803f38a5c0b (igmp: do not remove igmp souce list info when set link down)
    Fixes: 1666d49e1d416 (mld: do not remove mld souce list info when set link down)
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
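
    The four-row table in the commit above can be written down as two
    small lookup functions. This is only an encoding of the table under
    the assumption that the record type depends on nothing but the two
    filter modes; the enum and function names are hypothetical and do
    not appear in the kernel.

```c
#include <assert.h>

enum toy_fmode { TOY_IN, TOY_EX };
enum toy_record { TOY_ALLOW, TOY_TO_IN, TOY_TO_EX };

/* "before_fix" column: the old socket's mode leaks into the result. */
static enum toy_record toy_record_before_fix(enum toy_fmode old_mode,
                                             enum toy_fmode new_mode)
{
    if (old_mode == TOY_IN)
        return new_mode == TOY_IN ? TOY_ALLOW : TOY_TO_IN;
    return TOY_TO_EX;
}

/* "after_fix" column: only the joining socket's current mode matters. */
static enum toy_record toy_record_after_fix(enum toy_fmode old_mode,
                                            enum toy_fmode new_mode)
{
    (void)old_mode; /* deliberately ignored */
    assert(new_mode == TOY_IN || new_mode == TOY_EX);
    return new_mode == TOY_IN ? TOY_ALLOW : TOY_TO_EX;
}
```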
     

03 Jan, 2018

1 commit

  • [ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ]

    syzkaller reported crashes in IPv6 stack [1]

    Xin Long found that lo MTU was set to silly values.

    IPv6 stack reacts to changes to small MTU, by disabling itself under
    RTNL.

    But there is a window where threads not using RTNL can see a wrong
    device mtu. This can lead to surprises, in mld code where it is assumed
    the mtu is suitable.

    Fix this by reading device mtu once and checking IPv6 minimal MTU.

    [1]
    skbuff: skb_over_panic: text:0000000010b86b8d len:196 put:20
    head:000000003b477e60 data:000000000e85441e tail:0xd4 end:0xc0 dev:lo
    ------------[ cut here ]------------
    kernel BUG at net/core/skbuff.c:104!
    invalid opcode: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
    RSP: 0018:ffff8801db307508 EFLAGS: 00010286
    RAX: 0000000000000082 RBX: ffff8801c517e840 RCX: 0000000000000000
    RDX: 0000000000000082 RSI: 1ffff1003b660e61 RDI: ffffed003b660e95
    RBP: ffff8801db307570 R08: 1ffff1003b660e23 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff85bd4020
    R13: ffffffff84754ed2 R14: 0000000000000014 R15: ffff8801c4e26540
    FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000463610 CR3: 00000001c6698000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    skb_over_panic net/core/skbuff.c:109 [inline]
    skb_put+0x181/0x1c0 net/core/skbuff.c:1694
    add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
    add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
    mld_send_cr net/ipv6/mcast.c:1903 [inline]
    mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
    call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
    expire_timers kernel/time/timer.c:1357 [inline]
    __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
    run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
    __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
    invoke_softirq kernel/softirq.c:365 [inline]
    irq_exit+0x1d3/0x210 kernel/softirq.c:405
    exiting_irq arch/x86/include/asm/apic.h:540 [inline]
    smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
    apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Tested-by: Xin Long
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
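
    The "read the MTU once and check it" idea from the commit above can
    be sketched like this. It is a user-space model, assuming a volatile
    field as a stand-in for a value another thread may change; the
    toy_* names are hypothetical. The constant 1280 is the IPv6 minimum
    link MTU.

```c
#include <assert.h>

#define TOY_IPV6_MIN_MTU 1280u /* IPv6 minimum link MTU */

/* Hypothetical device; mtu may be changed concurrently by another thread. */
struct toy_netdev {
    volatile unsigned int mtu;
};

/* Returns the payload budget for a report, or 0 if the device MTU is too
 * small for IPv6. The MTU is read exactly once into a local, so the check
 * and the arithmetic below always see the same value. */
static unsigned int toy_mld_payload_budget(const struct toy_netdev *dev,
                                           unsigned int hdr_len)
{
    unsigned int mtu = dev->mtu; /* single snapshot */

    if (mtu < TOY_IPV6_MIN_MTU)
        return 0;
    assert(hdr_len < TOY_IPV6_MIN_MTU); /* headers always fit a valid MTU */
    return mtu - hdr_len;
}
```

    Reading dev->mtu twice (once for the check, once for the
    subtraction) would reopen the window the commit describes.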
     

04 Jul, 2017

1 commit

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
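
    The overflow concern in the commit above can be illustrated with a
    toy saturating counter. This is not the kernel's refcount_t
    implementation; it is a C11 sketch with hypothetical toy_* names
    that assumes "saturate at the maximum and stay there" as the
    protective behavior, in contrast to a plain atomic counter that
    would wrap to zero.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_refcount {
    atomic_uint refs;
};

static void toy_refcount_set(struct toy_refcount *r, unsigned int n)
{
    atomic_store(&r->refs, n);
}

/* Increment, but pin the counter at UINT_MAX instead of wrapping to 0. */
static void toy_refcount_inc(struct toy_refcount *r)
{
    unsigned int old = atomic_load(&r->refs);

    do {
        if (old == UINT_MAX)
            return; /* saturated: leak the object rather than free it early */
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
}

/* Decrement; returns true only when the last reference was dropped.
 * A saturated counter is never released. */
static bool toy_refcount_dec_and_test(struct toy_refcount *r)
{
    unsigned int old = atomic_load(&r->refs);

    do {
        if (old == UINT_MAX)
            return false;
        assert(old != 0); /* dropping a reference nobody holds is a bug */
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));
    return old == 1;
}
```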
     

16 Jun, 2017

3 commits

  • It seems like a historic accident that these return unsigned char *,
    and in many places that means casts are required, more often than not.

    Make these functions (skb_put, __skb_put and pskb_put) return void *
    and remove all the casts across the tree, adding a (u8 *) cast only
    where the unsigned char pointer was used directly, all done with the
    following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    which actually doesn't cover pskb_put since there are only three
    users overall.

    A handful of stragglers were converted manually, notably a macro in
    drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
    instances in net/bluetooth/hci_sock.c. In the former file, I also
    had to fix one whitespace problem spatch introduced.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • A common pattern with skb_put() is to just want to memcpy()
    some data into the new space, introduce skb_put_data() for
    this.

    An spatch similar to the one for skb_put_zero() converts many
    of the places using it:

    @@
    identifier p, p2;
    expression len, skb, data;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, len);
    |
    -memcpy(p, data, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb, data;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, sizeof(*p));
    |
    -memcpy(p, data, sizeof(*p));
    )

    @@
    expression skb, len, data;
    @@
    -memcpy(skb_put(skb, len), data, len);
    +skb_put_data(skb, data, len);

    (again, manually post-processed to retain some comments)

    Reviewed-by: Stephen Hemminger
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • There were many places that my previous spatch didn't find,
    as pointed out by yuan linyu in various patches.

    The following spatch found many more and also removes the
    now unnecessary casts:

    @@
    identifier p, p2;
    expression len;
    expression skb;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, len);
    |
    -memset(p, 0, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, sizeof(*p));
    |
    -memset(p, 0, sizeof(*p));
    )

    @@
    expression skb, len;
    @@
    -memset(skb_put(skb, len), 0, len);
    +skb_put_zero(skb, len);

    Apply it to the tree (with one manual fixup to keep the
    comment in vxlan.c, which spatch removed.)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
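
    The three commits above concern skb_put() returning void * and the
    skb_put_data() / skb_put_zero() helpers. The following is a toy
    model of those helpers over a fixed buffer, written only to show why
    the casts and the separate memcpy()/memset() calls become
    unnecessary. struct toy_skb and the toy_* functions are hypothetical
    and much simpler than the real sk_buff API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy buffer: data grows from buf[0], len bytes are in use. */
struct toy_skb {
    unsigned char buf[64];
    size_t len;
};

/* Extend the used area by len bytes and return a pointer to the new
 * space. Returning void * lets callers assign to any pointer type
 * without a cast. */
static void *toy_skb_put(struct toy_skb *skb, size_t len)
{
    void *tail = skb->buf + skb->len;

    assert(len <= sizeof(skb->buf) - skb->len); /* toy overflow check */
    skb->len += len;
    return tail;
}

/* skb_put() followed by memset(p, 0, len), folded into one call. */
static void *toy_skb_put_zero(struct toy_skb *skb, size_t len)
{
    void *p = toy_skb_put(skb, len);

    memset(p, 0, len);
    return p;
}

/* skb_put() followed by memcpy(p, data, len), folded into one call. */
static void *toy_skb_put_data(struct toy_skb *skb, const void *data,
                              size_t len)
{
    void *p = toy_skb_put(skb, len);

    memcpy(p, data, len);
    return p;
}
```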
     

29 Mar, 2017

1 commit


10 Feb, 2017

1 commit

  • In function igmpv3/mld_add_delrec() we allocate pmc and put it in
    idev->mc_tomb, so we should free it when we don't need it in del_delrec().
    But I removed kfree(pmc) incorrectly in latest two patches. Now fix it.

    Fixes: 24803f38a5c0 ("igmp: do not remove igmp souce list info when ...")
    Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when ...")
    Reported-by: Daniel Borkmann
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     

17 Jan, 2017

1 commit

  • This is an IPv6 version of commit 24803f38a5c0 ("igmp: do not remove igmp
    souce list..."). In mld_del_delrec(), we will restore back all source filter
    info instead of flush them.

    Move mld_clear_delrec() from ipv6_mc_down() to ipv6_mc_destroy_dev() since
    we should not remove source list info when set link down. Remove
    igmp6_group_dropped() in ipv6_mc_destroy_dev() since we have called it in
    ipv6_mc_down().

    Also clear all source info after igmp6_group_dropped() instead of in it
    because ipv6_mc_down() will call igmp6_group_dropped().

    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     

21 Oct, 2016

1 commit

  • Baozeng reported this deadlock case:

           CPU0                    CPU1
           ----                    ----
      lock(sk_lock-AF_INET6);
                                   lock(rtnl_mutex);
                                   lock(sk_lock-AF_INET6);
      lock(rtnl_mutex);

    Similar to commit 87e9f0315952
    ("ipv4: fix a potential deadlock in mcast getsockopt() path"),
    this happens because we still have a case, ipv6_sock_mc_close(),
    where we acquire sk_lock before rtnl_lock. Close this deadlock
    with a similar solution: always acquire the rtnl lock first.

    Fixes: baf606d9c9b1 ("ipv4,ipv6: grab rtnl before locking the socket")
    Reported-by: Baozeng Ding
    Tested-by: Baozeng Ding
    Cc: Marcelo Ricardo Leitner
    Signed-off-by: Cong Wang
    Reviewed-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    WANG Cong
     

09 Aug, 2016

1 commit

  • Based on RFC3376 5.1 and RFC3810 6.1

    If the per-interface listening change that triggers the new report is
    a filter mode change, then the next [Robustness Variable] State
    Change Reports will include a Filter Mode Change Record. This
    applies even if any number of source list changes occur in that
    period.

    Old State         New State         State Change Record Sent
    ---------         ---------         ------------------------
    INCLUDE (A)       EXCLUDE (B)       TO_EX (B)
    EXCLUDE (A)       INCLUDE (B)       TO_IN (B)
    So we should not send a source-list change if there is a filter-mode
    change.

    Here are two scenarios:
    1. The group is deleted and the filter mode is EXCLUDE, which means we
       need to send a TO_IN { }.
    2. The group is not deleted but has pmc->crcount, which means we need to
       send a normal filter-mode change.

    At the same time, if the type is ALLOW or BLOCK and psf->sf_crcount is
    set, we stop adding records and decrease sf_crcount directly.

    Reference: https://www.ietf.org/mail-archive/web/magma/current/msg01274.html

    Signed-off-by: Hangbin Liu
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hangbin Liu
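
    The rule quoted from the RFCs in the commit above (a filter-mode
    change takes precedence over source-list changes) can be sketched as
    a small selector. The enums and the function are hypothetical and
    model only the two table rows plus the "no mode change" case; the
    retransmission counters mentioned in the commit are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_mode { TOY_INCLUDE, TOY_EXCLUDE };
enum toy_rec {
    TOY_REC_NONE,       /* nothing to report */
    TOY_REC_TO_IN,      /* filter-mode change to INCLUDE */
    TOY_REC_TO_EX,      /* filter-mode change to EXCLUDE */
    TOY_REC_SRC_CHANGE  /* source-list change (ALLOW/BLOCK) */
};

/* Pick the kind of State Change Record to send. A filter-mode change
 * wins even if the source list changed as well. */
static enum toy_rec toy_state_change_record(enum toy_mode old_mode,
                                            enum toy_mode new_mode,
                                            bool sources_changed)
{
    assert(new_mode == TOY_INCLUDE || new_mode == TOY_EXCLUDE);
    if (old_mode != new_mode)
        return new_mode == TOY_EXCLUDE ? TOY_REC_TO_EX : TOY_REC_TO_IN;
    return sources_changed ? TOY_REC_SRC_CHANGE : TOY_REC_NONE;
}
```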
     

04 Mar, 2016

1 commit

  • The current reserved_tailroom calculation fails to take hlen and tlen into
    account.

    skb:
    [__hlen__|__data____________|__tlen___|__extra__]
    ^                                               ^
    head                                            skb_end_offset

    In this representation, hlen + data + tlen is the size passed to alloc_skb.
    "extra" is the extra space made available in __alloc_skb because of
    rounding up by kmalloc. We can reorder the representation like so:

    [__hlen__|__data____________|__extra__|__tlen___]
    ^                                               ^
    head                                            skb_end_offset

    The maximum space available for ip headers and payload without
    fragmentation is min(mtu, data + extra). Therefore,
    reserved_tailroom
      = data + extra + tlen - min(mtu, data + extra)
      = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
      = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

    Compare the second line to the current expression:
      reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
    and we can see that hlen and tlen are not taken into account.

    The min() in the third line can be expanded into:
      if mtu < skb_tailroom - tlen:
          reserved_tailroom = skb_tailroom - mtu
      else:
          reserved_tailroom = tlen

    Depending on hlen, tlen, mtu and the number of multicast address records,
    the current code may output skbs that have less tailroom than
    dev->needed_tailroom or it may output more skbs than needed because not all
    space available is used.

    Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
    Signed-off-by: Benjamin Poirier
    Acked-by: Hannes Frederic Sowa
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Benjamin Poirier
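
    The final formula derived in the commit above can be checked with a
    few lines of C. This is a direct transcription of
    "skb_tailroom - min(mtu, skb_tailroom - tlen)" into a standalone
    function; toy_min() and toy_reserved_tailroom() are hypothetical
    names, and the inputs are plain integers instead of a real skb.

```c
#include <assert.h>

static unsigned int toy_min(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* reserved_tailroom = skb_tailroom - min(mtu, skb_tailroom - tlen),
 * evaluated after the headroom (hlen) has already been reserved. */
static unsigned int toy_reserved_tailroom(unsigned int skb_tailroom,
                                          unsigned int mtu,
                                          unsigned int tlen)
{
    assert(tlen <= skb_tailroom); /* the allocation included tlen */
    return skb_tailroom - toy_min(mtu, skb_tailroom - tlen);
}
```

    With skb_tailroom = 2000, mtu = 1500 and tlen = 16 the MTU is the
    limit, so 2000 - 1500 = 500 bytes are reserved. With
    skb_tailroom = 1000 the buffer is the limit and exactly tlen = 16
    bytes are reserved, matching the two branches of the expansion.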
     

17 Nov, 2015

1 commit

  • The OUTMCAST stat is double incremented, getting bumped once in the mcast
    code itself, and again in the common ip output path. Remove the mcast
    bump, as it's not needed.

    Validated by the reporter, with good results

    Signed-off-by: Neil Horman
    Reported-by: Claus Jensen
    CC: Claus Jensen
    CC: David Miller
    Signed-off-by: David S. Miller

    Neil Horman
     

08 Oct, 2015

1 commit


18 Sep, 2015

3 commits

  • This is immediately motivated by the bridge code that chains functions that
    call into netfilter. Without passing net into the okfns the bridge code would
    need to guess about the best expression for the network namespace to process
    packets in.

    As net is frequently one of the first things computed in continuation functions
    after netfilter has done its job, passing in the desired network namespace is in
    many cases a code simplification.

    To support this change the function dst_output_okfn is introduced to
    simplify passing dst_output as an okfn. For the moment dst_output_okfn
    just silently drops the struct net.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Pass a network namespace parameter into the netfilter hooks. At the
    call site of the netfilter hooks the path a packet is taking through
    the network stack is well known, which allows the network namespace to
    be easily and reliably determined.

    This allows the replacement of magic code like
    "dev_net(state->in?:state->out)" that appears at the start of most
    netfilter hooks with "state->net".

    In almost all cases the network namespace passed in is derived
    from the first network device passed in, guaranteeing those
    paths will not see any changes in practice.

    The exceptions are:
    xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
    ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
    ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
    ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
    ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
    ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
    ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
    br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

    In all cases these exceptions seem to be a better expression for the
    network namespace the packet is being processed in than the historic
    "dev_net(in?in:out)". I am documenting them in case something odd
    pops up and someone starts trying to track down what happened.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Add a sock parameter to dst_output, making dst_output_sk superfluous.
    Add a skb->sk parameter to all of the callers of dst_output.
    Have the callers of dst_output_sk call dst_output.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

08 Apr, 2015

1 commit

  • On the output paths in particular, we have to sometimes deal with two
    socket contexts. First, and usually skb->sk, is the local socket that
    generated the frame.

    And second, is potentially the socket used to control a tunneling
    socket, such as one that encapsulates using UDP.

    We do not want to disassociate skb->sk when encapsulating in order
    to fix this, because that would break socket memory accounting.

    The most extreme case where this can cause huge problems is an
    AF_PACKET socket transmitting over a vxlan device. We hit code
    paths doing checks that assume they are dealing with an ipv4
    socket, but are actually operating upon the AF_PACKET one.

    Signed-off-by: David S. Miller

    David Miller
     

01 Apr, 2015

2 commits

  • The ipv6 code uses a mixture of coding styles. In some instances check for NULL
    pointer is done as x != NULL and sometimes as x. x is preferred according to
    checkpatch and this patch makes the code consistent by adopting the latter
    form.

    No changes detected by objdiff.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     
  • The ipv6 code uses a mixture of coding styles. In some instances check for NULL
    pointer is done as x == NULL and sometimes as !x. !x is preferred according to
    checkpatch and this patch makes the code consistent by adopting the latter
    form.

    No changes detected by objdiff.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     

19 Mar, 2015

1 commit

  • in favor of their inner __ ones, which doesn't grab rtnl.

    As these functions need to operate on a locked socket, we can't be
    grabbing rtnl by then. It's too late and doing so causes reversed
    locking.

    So this patch:
    - moves rtnl handling to the callers, while also fixing some
      reversed locking situations, such as in the vxlan and ipvs code.
    - renames the __ ones to drop the __ mark:
    __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
    __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

28 Feb, 2015

2 commits

  • Joining multicast group on ethernet level via "ip maddr" command would
    not work if we have an Ethernet switch that does igmp snooping since
    the switch would not replicate multicast packets on ports that did not
    have IGMP reports for the multicast addresses.

    Linux vxlan interfaces created via "ip link add vxlan" have the group option
    that enables them to do the required join.

    By extending ip address command with option "autojoin" we can get similar
    functionality for openvswitch vxlan interfaces as well as other tunneling
    mechanisms that need to receive multicast traffic. The kernel code is
    structured similar to how the vxlan driver does a group join / leave.

    example:
    ip address add 224.1.1.10/24 dev eth5 autojoin
    ip address del 224.1.1.10/24 dev eth5

    Signed-off-by: Madhu Challa
    Signed-off-by: David S. Miller

    Madhu Challa
     
  • Based on the igmp v4 changes from Eric Dumazet.
    959d10f6bbf6("igmp: add __ip_mc_{join|leave}_group()")

    These changes are needed to perform igmp v6 join/leave while
    RTNL is held.

    Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around
    __ipv6_sock_mc_join and __ipv6_sock_mc_drop to avoid
    proliferation of work queues.

    Signed-off-by: Madhu Challa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Madhu Challa
     

06 Nov, 2014

2 commits

  • It has been reported that generating an MLD listener report on
    devices with large MTUs (e.g. 9000) and a high number of IPv6
    addresses can trigger a skb_over_panic():

    skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
    head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
    dev:port1
    ------------[ cut here ]------------
    kernel BUG at net/core/skbuff.c:100!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ixgbe(O)
    CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
    [...]
    Call Trace:

    [] ? skb_put+0x3a/0x3b
    [] ? add_grhead+0x45/0x8e
    [] ? add_grec+0x394/0x3d4
    [] ? mld_ifc_timer_expire+0x195/0x20d
    [] ? mld_dad_timer_expire+0x45/0x45
    [] ? call_timer_fn.isra.29+0x12/0x68
    [] ? run_timer_softirq+0x163/0x182
    [] ? __do_softirq+0xe0/0x21d
    [] ? irq_exit+0x4e/0xd3
    [] ? smp_apic_timer_interrupt+0x3b/0x46
    [] ? apic_timer_interrupt+0x6a/0x70

    mld_newpack() skb allocations are usually requested with dev->mtu
    in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
    we have changed the limit in order to be less likely to fail.

    However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
    macros, which determine if we may end up doing an skb_put() for
    adding another record. To avoid possible fragmentation, we check
    the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
    assumption as the actual max allocation size can be much smaller.

    The IGMP case doesn't have this issue as commit 57e1ab6eaddc
    ("igmp: refine skb allocations") stores the allocation size in
    the cb[].

    Set a reserved_tailroom to make it fit into the MTU and use
    skb_availroom() helper instead. This also allows to get rid of
    igmp_skb_size().

    Reported-by: Wei Liu
    Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
    Signed-off-by: Daniel Borkmann
    Cc: Eric Dumazet
    Cc: Hannes Frederic Sowa
    Cc: David L Stevens
    Acked-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
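
    The difference between the two space checks discussed in the commit
    above can be sketched with plain integers. This is a toy model under
    stated assumptions: struct toy_skb and the toy_* functions are
    hypothetical, the "old" check is a simplified rendering of
    "mtu - len", and the "new" check mirrors the idea of tailroom minus
    a reserved part, not the exact kernel helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy skb bookkeeping; field names only echo the kernel's. */
struct toy_skb {
    unsigned int end;               /* size of the allocated linear buffer */
    unsigned int tail;              /* bytes already used */
    unsigned int reserved_tailroom; /* bytes that must stay free at the end */
};

/* Old-style check: trusts the device MTU and ignores the allocation size. */
static bool toy_fits_by_mtu(const struct toy_skb *skb, unsigned int mtu,
                            unsigned int rec_len)
{
    return mtu > skb->tail && mtu - skb->tail >= rec_len;
}

/* Room that may still be filled: real tailroom minus the reserved part. */
static unsigned int toy_availroom(const struct toy_skb *skb)
{
    assert(skb->tail <= skb->end);
    unsigned int tailroom = skb->end - skb->tail;

    return tailroom > skb->reserved_tailroom
               ? tailroom - skb->reserved_tailroom
               : 0;
}

/* New-style check: bounded by what was actually allocated. */
static bool toy_fits_by_availroom(const struct toy_skb *skb,
                                  unsigned int rec_len)
{
    return toy_availroom(skb) >= rec_len;
}
```

    With a 3776-byte buffer that already holds 3760 bytes on a device
    with MTU 9000, the MTU-based check accepts a 20-byte record although
    only 16 bytes remain, which is the overrun the commit describes.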
     
  • Using a single fixed string is smaller code size than using
    a format and many string arguments.

    Reduces overall code size a little.

    $ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
    text data bss dec hex filename
    34269 7012 14824 56105 db29 net/ipv4/igmp.o.new
    34315 7012 14824 56151 db57 net/ipv4/igmp.o.old
    30078 7869 13200 51147 c7cb net/ipv6/mcast.o.new
    30105 7869 13200 51174 c7e6 net/ipv6/mcast.o.old
    11434 3748 8580 23762 5cd2 net/ipv6/ip6_flowlabel.o.new
    11491 3748 8580 23819 5d0b net/ipv6/ip6_flowlabel.o.old

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

23 Sep, 2014

1 commit

  • RFC2710 (MLDv1), section 3.7. says:

    The length of a received MLD message is computed by taking the
    IPv6 Payload Length value and subtracting the length of any IPv6
    extension headers present between the IPv6 header and the MLD
    message. If that length is greater than 24 octets, that indicates
    that there are other fields present *beyond* the fields described
    above, perhaps belonging to a *future backwards-compatible* version
    of MLD. An implementation of the version of MLD specified in this
    document *MUST NOT* send an MLD message longer than 24 octets and
    MUST ignore anything past the first 24 octets of a received MLD
    message.

    RFC3810 (MLDv2), section 8.2.1. states for *listeners* regarding
    presence of MLDv1 routers:

    In order to be compatible with MLDv1 routers, MLDv2 hosts MUST
    operate in version 1 compatibility mode. [...] When Host
    Compatibility Mode is MLDv2, a host acts using the MLDv2 protocol
    on that interface. When Host Compatibility Mode is MLDv1, a host
    acts in MLDv1 compatibility mode, using *only* the MLDv1 protocol,
    on that interface. [...]

    While section 8.3.1. specifies *router* behaviour regarding presence
    of MLDv1 routers:

    MLDv2 routers may be placed on a network where there is at least
    one MLDv1 router. The following requirements apply:

    If an MLDv1 router is present on the link, the Querier MUST use
    the *lowest* version of MLD present on the network. This must be
    administratively assured. Routers that desire to be compatible
    with MLDv1 MUST have a configuration option to act in MLDv1 mode;
    if an MLDv1 router is present on the link, the system administrator
    must explicitly configure all MLDv2 routers to act in MLDv1 mode.
    When in MLDv1 mode, the Querier MUST send periodic General Queries
    truncated at the Multicast Address field (i.e., 24 bytes long),
    and SHOULD also warn about receiving an MLDv2 Query (such warnings
    must be rate-limited). The Querier MUST also fill in the Maximum
    Response Delay in the Maximum Response Code field, i.e., the
    exponential algorithm described in section 5.1.3. is not used. [...]

    That means that we should not get queries from different versions of
    MLD. When there's an MLDv1 router present, MLDv2 enforces truncation
    and MRC == MRD (both fields are overlapping within the 24 octet range).

    Section 8.3.2. specifies behaviour in the presence of MLDv1 multicast
    address *listeners*:

    MLDv2 routers may be placed on a network where there are hosts
    that have not yet been upgraded to MLDv2. In order to be compatible
    with MLDv1 hosts, MLDv2 routers MUST operate in version 1 compatibility
    mode. MLDv2 routers keep a compatibility mode per multicast address
    record. The compatibility mode of a multicast address is determined
    from the Multicast Address Compatibility Mode variable, which can be
    in one of the two following states: MLDv1 or MLDv2.

    The Multicast Address Compatibility Mode of a multicast address
    record is set to MLDv1 whenever an MLDv1 Multicast Listener Report is
    *received* for that multicast address. At the same time, the Older
    Version Host Present timer for the multicast address is set to Older
    Version Host Present Timeout seconds. The timer is re-set whenever a
    new MLDv1 Report is received for that multicast address. If the Older
    Version Host Present timer expires, the router switches back to
    Multicast Address Compatibility Mode of MLDv2 for that multicast
    address. [...]

    That means the following scenario can happen: hosts act in MLDv1
    compatibility mode because they previously received an MLDv1 query
    (or simply operate in MLDv1-only mode), and at the same time an MLDv2
    router starts up and transmits MLDv2 startup query messages while
    being unaware of the current operational mode.

    Given RFC2710, section 3.7, we would need to answer that with an MLDv1
    listener report, so that the router, according to RFC3810, section
    8.3.2., would receive it and internally switch to MLDv1 compatibility
    as well.

    Right now, and I believe since the initial implementation of MLDv2,
    Linux hosts just silently drop such MLDv2 queries instead of replying
    with an MLDv1 listener report, which prevents an MLDv2 router from
    going into fallback mode (until it receives other MLDv1 queries).

    Since the mapping of MRC to MRD in exactly such cases can make use of
    the exponential algorithm from 5.1.3, we cannot [strictly speaking] be
    aware in MLDv1 of the encoding in MRC; the RFC does not seem to mention
    this case either. Since the encodings are identical up to 32767, we
    treat that value as a hard upper limit in such a situation and clamp
    to it. We asked one of the RFC authors about this, and he mentioned
    that there do not seem to be any implementations that make use of the
    exponential algorithm on startup messages. In any case, this patch
    fixes this MLD interoperability issue.

    Signed-off-by: Daniel Borkmann
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
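
    The MRC handling discussed in the commit above can be illustrated with
    a minimal sketch. The first function follows the Maximum Response Code
    decoding described in RFC 3810, section 5.1.3; the second shows one way
    to read the "hard upper limit" of 32767 mentioned in the message. Both
    function names are hypothetical and are not taken from the kernel
    source, and the clamp reflects only the wording of the commit message,
    not the actual patch.

```c
#include <stdint.h>

/*
 * Decode an MLDv2 Maximum Response Code (MRC) into a Maximum Response
 * Delay (MRD) in milliseconds, per RFC 3810, section 5.1.3.
 * Values below 32768 are taken literally; larger values use a
 * 3-bit exponent and 12-bit mantissa floating-point encoding.
 */
static uint32_t mldv2_mrc_to_mrd(uint16_t mrc)
{
	uint32_t exp, mant;

	if (mrc < 32768)
		return mrc;

	exp = (mrc >> 12) & 0x7;
	mant = mrc & 0x0fff;
	return (mant | 0x1000) << (exp + 3);
}

/*
 * Hypothetical helper (assumption): when a host in MLDv1 mode receives
 * an MLDv2 query, it cannot know whether MRC is exponentially encoded.
 * Both encodings agree up to 32767, so clamp to that value.
 */
static uint32_t mldv1_fallback_mrd(uint16_t mrc)
{
	return mrc < 32767 ? mrc : 32767;
}
```

    For codes below 32768 both helpers return the same value, which is why
    that range is safe to interpret without knowing the sender's version.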
     

14 Sep, 2014

4 commits


10 Sep, 2014

1 commit


08 Sep, 2014

1 commit


06 Sep, 2014

1 commit

  • Calling setsockopt with IPV6_JOIN_ANYCAST or IPV6_LEAVE_ANYCAST
    triggers the assertion in addrconf_join_solict()/addrconf_leave_solict().

    ipv6_sock_ac_join(), ipv6_sock_ac_drop(), ipv6_sock_ac_close() need to
    take RTNL before calling ipv6_dev_ac_inc/dec. Same thing with
    ipv6_sock_mc_join(), ipv6_sock_mc_drop(), ipv6_sock_mc_close() before
    calling ipv6_dev_mc_inc/dec.

    This patch moves ASSERT_RTNL() up a level in the call stack.

    Signed-off-by: Cong Wang
    Signed-off-by: Sabrina Dubroca
    Reported-by: Tommi Rantala
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     

05 Sep, 2014

1 commit

  • This patch adds a new sysctl_mld_qrv knob to configure the mldv1/v2 query
    robustness variable. It specifies how many times unsolicited mld reports
    are retransmitted. Admins might want to tune this on lossy links.

    Also reset mld state on interface down/up, so we pick up new sysctl
    settings during interface up event.

    IPv6 certification requests this knob to be available.

    I didn't make this knob netns specific, as it is mostly a setting in a
    physical environment and should be per host.

    Cc: Flavio Leitner
    Signed-off-by: Hannes Frederic Sowa
    Acked-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
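
    As a rough illustration of how an administrator's tool might inspect
    the knob from the commit above, the sketch below reads an integer from
    a sysctl file. It assumes the knob is exposed as
    /proc/sys/net/ipv6/mld_qrv; that path is not stated in the commit
    message, and the function name is hypothetical.

```c
#include <stdio.h>

/*
 * Read the MLD query robustness variable from a sysctl file.
 * Returns the value on success, or -1 if the file cannot be opened
 * or does not contain an integer.
 * Assumed path on a kernel with this patch: /proc/sys/net/ipv6/mld_qrv
 */
static int read_mld_qrv(const char *path)
{
	FILE *f = fopen(path, "r");
	int qrv;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &qrv) != 1)
		qrv = -1;
	fclose(f);
	return qrv;
}
```

    Since the commit also resets mld state on interface down/up, a changed
    value would be picked up on the next interface up event.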
     

25 Aug, 2014

1 commit

  • This patch makes no changes to the logic of the code but simply addresses
    coding style issues as detected by checkpatch.

    Both objdump and diff -w show no differences.

    A number of items are addressed in this patch:
    * Multiple spaces converted to tabs
    * Spaces before tabs removed.
    * Spaces in pointer typing cleansed (char *)foo etc.
    * Remove space after sizeof
    * Ensure spacing around comparators such as if statements.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     

27 Jun, 2014

1 commit

  • Based on RFC3810 6.2, we also need to check the hop limit and the router
    alert option, in addition to the source address.

    Signed-off-by: Hangbin Liu
    Acked-by: YOSHIFUJI Hideaki
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hangbin Liu
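
    The three checks named in the commit above (source address, hop limit,
    router alert option) can be sketched as a single predicate. This is a
    minimal illustration, assuming the source address must be link-local
    and the hop limit must be 1; the struct and function names are
    hypothetical and do not come from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the header fields relevant to the checks. */
struct mld_query_meta {
	uint8_t src[16];        /* IPv6 source address */
	uint8_t hop_limit;      /* Hop Limit from the IPv6 header */
	bool has_router_alert;  /* Router Alert seen in Hop-by-Hop options */
};

/* True if the address lies in fe80::/10 (link-local unicast). */
static bool is_link_local(const uint8_t a[16])
{
	return a[0] == 0xfe && (a[1] & 0xc0) == 0x80;
}

/* Accept a query only if all three checks pass; otherwise drop it. */
static bool mld_query_acceptable(const struct mld_query_meta *q)
{
	return is_link_local(q->src) &&
	       q->hop_limit == 1 &&
	       q->has_router_alert;
}
```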
     

01 Apr, 2014

1 commit

  • After commit c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify
    processing to workqueue") some counters are now updated in process context
    and thus need to disable bh before doing so, otherwise deadlocks can
    happen on 32-bit archs. Fabio Estevam noticed this while mounting
    an NFS volume on an ARM board.

    To compensate for missing this, I looked through the other *_STATS_BH
    users and found three other calls which need updating:

    1) icmp6_send: ip6_fragment -> icmpv6_send -> icmp6_send (error handling)
    2) ip6_push_pending_frames: rawv6_sendmsg -> rawv6_push_pending_frames -> ...
    (only in case of icmp protocol with raw sockets in error handling)
    3) ping6_v6_sendmsg (error handling)

    Fixes: c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify processing to workqueue")
    Reported-by: Fabio Estevam
    Tested-by: Fabio Estevam
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

18 Jan, 2014

1 commit

  • RFC 3810 defines two types of messages for multicast
    listeners. The "Current State Report" message, as the name
    implies, refreshes the *current* state to the querier.
    Since the querier sends Query messages periodically, there
    is no need to retransmit the report.

    On the other hand, any change should be reported immediately
    using "State Change Report" messages. Since such a report is
    triggered by a change and can be affected by packet loss, the
    RFC states it should be retransmitted [RobVar] times to make
    sure routers receive it in a timely manner.

    Currently, we are sending "Current State Reports" after
    DAD is completed. Before that, we send messages using the
    unspecified address (::), which should be silently discarded
    by routers.

    This patch switches to sending "State Change Report" messages
    after DAD is completed, making the behavior RFC compliant
    and also passing the TAHI IPv6 test suite.

    Signed-off-by: Flavio Leitner
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Flavio Leitner