31 Jan, 2018

40 commits

  • [ upstream commit 290af86629b25ffd1ed6232c4e9107da031705cb ]

    The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.

    A quote from the Google Project Zero blog:
    "At this point, it would normally be necessary to locate gadgets in
    the host kernel code that can be used to actually leak data by reading
    from an attacker-controlled location, shifting and masking the result
    appropriately and then using the result of that as offset to an
    attacker-controlled address for a load. But piecing gadgets together
    and figuring out which ones work in a speculation context seems annoying.
    So instead, we decided to use the eBPF interpreter, which is built into
    the host kernel - while there is no legitimate way to invoke it from inside
    a VM, the presence of the code in the host kernel's text section is sufficient
    to make it usable for the attack, just like with ordinary ROP gadgets."

    To make the attacker's job harder, introduce the BPF_JIT_ALWAYS_ON config
    option that removes the interpreter from the kernel in favor of JIT-only mode.
    So far eBPF JIT is supported by:
    x64, arm64, arm32, sparc64, s390, powerpc64, mips64

    The start of a JITed program is randomized and the code page is marked as read-only.
    In addition, "constant blinding" can be turned on with net.core.bpf_jit_harden.
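
    As a rough userspace illustration of what constant blinding means: an
    attacker-chosen immediate is never emitted verbatim. It is stored XORed
    with a random value and recovered at run time. The struct and function
    names below are hypothetical and are not kernel APIs.

    ```c
    #include <stdint.h>

    /* Hypothetical pair of values a JIT would emit in place of one immediate. */
    struct blinded_imm {
        uint32_t masked; /* imm ^ rnd: what ends up in the instruction stream */
        uint32_t rnd;    /* random mask, emitted as a second immediate */
    };

    /* Split an attacker-controlled immediate so it never appears verbatim. */
    struct blinded_imm blind_imm(uint32_t imm, uint32_t rnd)
    {
        struct blinded_imm b = { .masked = imm ^ rnd, .rnd = rnd };
        return b;
    }

    /* What an emitted "mov tmp, masked; xor tmp, rnd" sequence computes. */
    uint32_t unblind_imm(struct blinded_imm b)
    {
        return b.masked ^ b.rnd;
    }
    ```

    With a non-zero mask the value seen in the code stream differs from the
    original immediate, while the computed result stays the same.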

    v2->v3:
    - move __bpf_prog_ret0 under ifdef (Daniel)

    v1->v2:
    - fix init order, test_bpf and cBPF (Daniel's feedback)
    - fix offloaded bpf (Jakub's feedback)
    - add 'return 0' dummy in case something can invoke prog->bpf_func
    - retarget bpf tree. For bpf-next the patch would need one extra hunk.
    It will be sent when the trees are merged back to net-next

    Considered doing:
    int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
    but it seems better to land the patch as-is and in bpf-next remove
    bpf_jit_enable global variable from all JITs, consolidate in one place
    and remove this jit_init() function.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • commit d5421ea43d30701e03cadc56a38854c36a8b4433 upstream.

    The hrtimer interrupt code contains a hang detection and mitigation
    mechanism, which prevents a long delayed hrtimer interrupt from causing a
    continuous retriggering of interrupts which prevents the system from making
    progress. If a hang is detected then the timer hardware is programmed with
    a certain delay into the future and a flag is set in the hrtimer cpu base
    which prevents newly enqueued timers from reprogramming the timer hardware
    prior to the chosen delay. The subsequent hrtimer interrupt after the delay
    clears the flag and resumes normal operation.

    If such a hang happens in the last hrtimer interrupt before a CPU is
    unplugged then the hang_detected flag is set and stays that way when the
    CPU is plugged in again. At that point the timer hardware is not armed and
    it cannot be armed because the hang_detected flag is still active, so
    nothing clears that flag. As a consequence the CPU does not receive hrtimer
    interrupts and no timers expire on that CPU which results in RCU stalls and
    other malfunctions.

    Clear the flag along with some other less critical members of the hrtimer
    cpu base to ensure starting from a clean state when a CPU is plugged in.
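
    The failure mode and the fix can be modelled in a few lines of userspace
    C. This is only a sketch: struct cpu_timer_base and the three functions
    are hypothetical stand-ins, not the kernel's hrtimer structures.

    ```c
    #include <stdbool.h>

    /* Hypothetical, heavily reduced stand-in for the per-CPU hrtimer base. */
    struct cpu_timer_base {
        bool hang_detected;     /* set when the interrupt handler detects a hang */
        bool hw_armed;          /* whether the timer hardware is programmed */
        unsigned int nr_events; /* example of a less critical member */
    };

    /* New timers may not reprogram the hardware while a hang is flagged. */
    bool try_arm_timer(struct cpu_timer_base *base)
    {
        if (base->hang_detected)
            return false;
        base->hw_armed = true;
        return true;
    }

    /* CPU unplug: the hardware is disarmed, but hang_detected is untouched. */
    void cpu_offline(struct cpu_timer_base *base)
    {
        base->hw_armed = false;
    }

    /* CPU plug-in with the fix applied: start from a clean state. */
    void cpu_online(struct cpu_timer_base *base)
    {
        base->hang_detected = false;
        base->hw_armed = false;
        base->nr_events = 0;
    }
    ```

    Without the reset in cpu_online(), a base that went offline with
    hang_detected set would refuse every try_arm_timer() call afterwards.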

    Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
    root cause of that hard to reproduce heisenbug. Once understood it's
    trivial and certainly justifies a brown paperbag.

    Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
    Reported-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Sewior
    Cc: Anna-Maria Gleixner
    Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 5beda7d54eafece4c974cfa9fbb9f60fb18fd20a upstream.

    Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses
    large amounts of vmalloc space with PTI enabled.

    The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd
    folding code into account, so, on a 4-level kernel, the pgd synchronization
    logic compiles away to exactly nothing.

    Interestingly, the problem doesn't trigger with nopti. I assume this is
    because the kernel is mapped with global pages if we boot with nopti. The
    sequence of operations when we create a new task is that we first load its
    mm while still running on the old stack (which crashes if the old stack is
    unmapped in the new mm unless the TLB saves us), then we call
    prepare_switch_to(), and then we switch to the new stack.
    prepare_switch_to() pokes the new stack directly, which will populate the
    mapping through vmalloc_fault(). I assume that we're getting lucky on
    non-PTI systems -- the old stack's TLB entry stays alive long enough to
    make it all the way through prepare_switch_to() and switch_to() so that we
    make it to a valid stack.

    Fixes: b50858ce3e2a ("x86/mm/vmalloc: Add 5-level paging support")
    Reported-and-tested-by: Neil Berrington
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Cc: Konstantin Khlebnikov
    Cc: Dave Hansen
    Cc: Borislav Petkov
    Link: https://lkml.kernel.org/r/346541c56caed61abbe693d7d2742b4a380c5001.1516914529.git.luto@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit 1d080f096fe33f031d26e19b3ef0146f66b8b0f1 upstream.

    Commit 24c2503255d3 ("x86/microcode: Do not access the initrd after it has
    been freed") fixed attempts to access initrd from the microcode loader
    after it has been freed. However, a similar KASAN warning was reported
    (stack trace edited):

    smpboot: Booting Node 0 Processor 1 APIC 0x11
    ==================================================================
    BUG: KASAN: use-after-free in find_cpio_data+0x9b5/0xa50
    Read of size 1 at addr ffff880035ffd000 by task swapper/1/0

    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.8-slack #7
    Hardware name: System manufacturer System Product Name/A88X-PLUS, BIOS 3003 03/10/2016
    Call Trace:
    dump_stack
    print_address_description
    kasan_report
    ? find_cpio_data
    __asan_report_load1_noabort
    find_cpio_data
    find_microcode_in_initrd
    __load_ucode_amd
    load_ucode_amd_ap
    load_ucode_ap

    After some investigation, it turned out that a merge was done using the
    wrong side to resolve, leading to picking up the previous state, before
    the 24c2503255d3 fix. Therefore the Fixes tag below contains a merge
    commit.

    Revert the mismerge by catching the save_microcode_in_initrd_amd()
    retval and thus letting the function exit with the last return statement
    so that initrd_gone can be set to true.

    Fixes: f26483eaedec ("Merge branch 'x86/urgent' into x86/microcode, to resolve conflicts")
    Reported-by:
    Signed-off-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=198295
    Link: https://lkml.kernel.org/r/20180123104133.918-2-bp@alien8.de
    Signed-off-by: Greg Kroah-Hartman

    Borislav Petkov
     
  • commit 7e702d17ed138cf4ae7c00e8c00681ed464587c7 upstream.

    Commit b94b73733171 ("x86/microcode/intel: Extend BDW late-loading with a
    revision check") reduced the impact of erratum BDF90 for Broadwell model
    79.

    The impact can be reduced further by checking the size of the last level
    cache portion per core.

    Tony: "The erratum says the problem only occurs on the large-cache SKUs.
    So we only need to avoid the update if we are on a big cache SKU that is
    also running old microcode."
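
    The combined condition Tony describes could be expressed roughly as
    follows. The struct, the function name and both threshold parameters are
    placeholders chosen for this sketch; the real values come from the
    erratum and the patch itself.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical summary of the CPU properties the check needs. */
    struct cpu_info {
        unsigned int cores;     /* cores sharing the last level cache */
        uint64_t llc_bytes;     /* total last level cache size */
        uint32_t microcode_rev; /* currently loaded microcode revision */
    };

    /*
     * Block the late update only when both conditions hold: the part is a
     * large-cache SKU and it is still running old microcode.
     */
    bool late_load_blocked(const struct cpu_info *c,
                           uint64_t big_cache_per_core, uint32_t min_safe_rev)
    {
        uint64_t per_core;

        if (c->cores == 0)
            return false; /* nothing sensible to compute */

        per_core = c->llc_bytes / c->cores;
        return per_core > big_cache_per_core && c->microcode_rev < min_safe_rev;
    }
    ```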

    For more details, see erratum BDF90 in document #334165 (Intel Xeon
    Processor E7-8800/4800 v4 Product Family Specification Update) from
    September 2017.

    Fixes: b94b73733171 ("x86/microcode/intel: Extend BDW late-loading with a revision check")
    Signed-off-by: Jia Zhang
    Signed-off-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Acked-by: Tony Luck
    Link: https://lkml.kernel.org/r/1516321542-31161-1-git-send-email-zhang.jia@linux.alibaba.com
    Signed-off-by: Greg Kroah-Hartman

    Jia Zhang
     
  • commit 40d4071ce2d20840d224b4a77b5dc6f752c9ab15 upstream.

    The AMD power module can be loaded on non AMD platforms, but unload fails
    with the following Oops:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: __list_del_entry_valid+0x29/0x90
    Call Trace:
    perf_pmu_unregister+0x25/0xf0
    amd_power_pmu_exit+0x1c/0xd23 [power]
    SyS_delete_module+0x1a8/0x2b0
    ? exit_to_usermode_loop+0x8f/0xb0
    entry_SYSCALL_64_fastpath+0x20/0x83

    Return -ENODEV instead of 0 from the module init function if the CPU does
    not match.
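
    A minimal userspace model of why the return value matters: if init
    reports success without registering anything, the exit path later
    unregisters something that was never registered. All names below are
    hypothetical.

    ```c
    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical stand-in for a registered PMU; NULL means not registered. */
    static const char *registered_pmu;

    /* Module init: refuse to load instead of returning 0 without registering. */
    int power_module_init(bool cpu_matches)
    {
        if (!cpu_matches)
            return -ENODEV; /* load fails, so the exit function never runs */
        registered_pmu = "power";
        return 0;
    }

    /* Module exit: only reachable after a successful init. */
    void power_module_exit(void)
    {
        registered_pmu = NULL;
    }

    /* Helper for inspection: is the PMU currently registered? */
    bool power_module_registered(void)
    {
        return registered_pmu != NULL;
    }
    ```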

    Fixes: c7ab62bfbe0e ("perf/x86/amd/power: Add AMD accumulated power reporting mechanism")
    Signed-off-by: Xiao Liang
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20180122061252.6394-1-xiliang@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Xiao Liang
     
  • [ Upstream commit 848b159835ddef99cc4193083f7e786c3992f580 ]

    With the introduction of commit
    b0eb57cb97e7837ebb746404c2c58c6f536f23fa, it appears that rq->buf_info
    is improperly handled. While it is heap allocated when an rx queue is
    setup, and freed when torn down, an old line of code in
    vmxnet3_rq_destroy was not properly removed, leading to rq->buf_info[0]
    being set to NULL prior to its being freed, causing a memory leak, which
    eventually exhausts the system on repeated create/destroy operations
    (for example, when the mtu of a vmxnet3 interface is changed
    frequently).

    The fix is pretty straightforward: just move the NULL set to after the
    free.
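
    The ordering issue can be shown with a small userspace sketch. The
    reduced struct rx_queue, the assumption that buf_info[1] points into the
    allocation owned by buf_info[0], and the live_allocs counter are all
    illustrative and not taken from the driver.

    ```c
    #include <stdlib.h>

    /* Hypothetical reduced rx queue: buf_info[0] owns one heap allocation
     * and buf_info[1] points into the same allocation. */
    struct rx_queue {
        int *buf_info[2];
    };

    /* Outstanding allocations, only here to make a leak visible. */
    int live_allocs;

    int rq_create(struct rx_queue *rq, size_t ring0, size_t ring1)
    {
        rq->buf_info[0] = calloc(ring0 + ring1, sizeof(int));
        if (!rq->buf_info[0])
            return -1;
        rq->buf_info[1] = rq->buf_info[0] + ring0;
        live_allocs++;
        return 0;
    }

    void rq_destroy(struct rx_queue *rq)
    {
        /* Free first. Clearing the pointer before this point would turn
         * the release into a no-op and leak the buffer. */
        if (rq->buf_info[0]) {
            free(rq->buf_info[0]);
            live_allocs--;
        }
        rq->buf_info[0] = rq->buf_info[1] = NULL;
    }
    ```

    Repeated rq_create()/rq_destroy() cycles, as in the MTU-change case
    described above, should then leave live_allocs at zero.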

    Tested by myself with successful results

    Applies to net, and should likely be queued for stable, please

    Signed-off-by: Neil Horman
    Reported-By: boyang@redhat.com
    CC: boyang@redhat.com
    CC: Shrikrishna Khare
    CC: "VMware, Inc."
    CC: David S. Miller
    Acked-by: Shrikrishna Khare
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neil Horman
     
  • [ Upstream commit 6503a30440962f1e1ccb8868816b4e18201218d4 ]

    Commit 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu
    versions of route lookup") broke "ip route get" in the presence
    of rules that specify iif lo.

    Host-originated traffic always has iif lo, because
    ip_route_output_key_hash and ip6_route_output_flags set the flow
    iif to LOOPBACK_IFINDEX. Thus, putting "iif lo" in an ip rule is a
    convenient way to select only originated traffic and not forwarded
    traffic.

    inet_rtm_getroute used to match these rules correctly because
    even though it sets the flow iif to 0, it called
    ip_route_output_key which overwrites iif with LOOPBACK_IFINDEX.
    But now that it calls ip_route_output_key_hash_rcu, the ifindex
    will remain 0 and not match the iif lo in the rule. As a result,
    "ip route get" will return ENETUNREACH.
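
    The mismatch can be reproduced with a toy rule matcher. This is a sketch
    of the behaviour described above, not the patch: struct flow, struct
    rule and both functions are hypothetical, and only LOOPBACK_IFINDEX
    mirrors the kernel constant.

    ```c
    /* Same value as the kernel's LOOPBACK_IFINDEX. */
    #define LOOPBACK_IFINDEX 1

    /* Hypothetical reduced flow key and policy rule. */
    struct flow { int iif; };
    struct rule { int iif; /* 0 means "match any input interface" */ };

    int rule_matches(const struct rule *r, const struct flow *fl)
    {
        return r->iif == 0 || r->iif == fl->iif;
    }

    /* Lookup for locally originated traffic: an unset iif is treated as
     * loopback before the rules are consulted. */
    int output_lookup_matches(const struct rule *r, struct flow fl)
    {
        if (fl.iif == 0)
            fl.iif = LOOPBACK_IFINDEX;
        return rule_matches(r, &fl);
    }
    ```

    A rule with "iif lo" fails rule_matches() for a flow whose iif is still
    0, but matches once the lookup substitutes the loopback index.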

    Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
    Tested: https://android.googlesource.com/kernel/tests/+/master/net/test/multinetwork_test.py passes again
    Signed-off-by: Lorenzo Colitti
    Acked-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Lorenzo Colitti
     
  • [ Upstream commit 6db959c82eb039a151d95a0f8b7dea643657327a ]

    The current code copies directly from userspace to ctx->crypto_send, but
    doesn't always reinitialize it to 0 on failure. This causes any
    subsequent attempt to use this setsockopt to fail because of the
    TLS_CRYPTO_INFO_READY check, even though crypto_info is not actually
    ready.

    This should result in a correctly set up socket after the 3rd call, but
    currently it does not:

    size_t s = sizeof(struct tls12_crypto_info_aes_gcm_128);
    struct tls12_crypto_info_aes_gcm_128 crypto_good = {
    .info.version = TLS_1_2_VERSION,
    .info.cipher_type = TLS_CIPHER_AES_GCM_128,
    };

    struct tls12_crypto_info_aes_gcm_128 crypto_bad_type = crypto_good;
    crypto_bad_type.info.cipher_type = 42;

    setsockopt(sock, SOL_TLS, TLS_TX, &crypto_bad_type, s);
    setsockopt(sock, SOL_TLS, TLS_TX, &crypto_good, s - 1);
    setsockopt(sock, SOL_TLS, TLS_TX, &crypto_good, s);
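
    A userspace model of the intended behaviour, assuming "ready" means a
    non-zero cipher_type in the stored crypto info. The types, the cipher
    constant and the error codes are choices made for this sketch rather
    than the kernel's definitions.

    ```c
    #include <errno.h>
    #include <stddef.h>
    #include <string.h>

    #define CIPHER_AES_GCM_128 51 /* illustrative value only */

    struct crypto_info { unsigned short version, cipher_type; };
    struct aes_gcm_128 { struct crypto_info info; unsigned char key[16]; };
    struct tx_ctx { struct aes_gcm_128 crypto_send; };

    /* The context counts as configured once cipher_type is non-zero. */
    int crypto_info_ready(const struct crypto_info *i)
    {
        return i->cipher_type != 0;
    }

    int set_tx(struct tx_ctx *ctx, const void *optval, size_t optlen)
    {
        struct crypto_info *info = &ctx->crypto_send.info;
        int rc = -EINVAL;

        if (!optval || optlen < sizeof(*info))
            return -EINVAL;
        if (crypto_info_ready(info))
            return -EBUSY; /* already configured: report it, do not ignore */

        /* Copy the common header straight into the context. */
        memcpy(info, optval, sizeof(*info));

        if (info->cipher_type != CIPHER_AES_GCM_128)
            goto err;
        if (optlen != sizeof(ctx->crypto_send))
            goto err;

        memcpy(&ctx->crypto_send, optval, sizeof(ctx->crypto_send));
        return 0;
    err:
        /* Wipe the partial copy so the next call does not see "ready". */
        memset(&ctx->crypto_send, 0, sizeof(ctx->crypto_send));
        return rc;
    }
    ```

    Replaying the three calls from the commit message against set_tx()
    gives two failures followed by a success, because each failure leaves
    the context zeroed.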

    Fixes: 3c4d7559159b ("tls: kernel TLS support")
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     
  • [ Upstream commit 877d17c79b66466942a836403773276e34fe3614 ]

    do_tls_setsockopt_tx returns 0 without doing anything when crypto_info
    is already set. Silent failure is confusing for users.

    Fixes: 3c4d7559159b ("tls: kernel TLS support")
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     
  • [ Upstream commit cf6d43ef66f416282121f436ce1bee9a25199d52 ]

    During setsockopt(SOL_TCP, TLS_TX), if initialization of the software
    context fails in tls_set_sw_offload(), we leak sw_ctx. We also don't
    reassign ctx->priv_ctx to NULL, so we can't even do another attempt to
    set it up on the same socket, as it will fail with -EEXIST.

    Fixes: 3c4d7559159b ('tls: kernel TLS support')
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     
  • [ Upstream commit d91c3e17f75f218022140dee18cf515292184a8f ]

    Calling accept on a TCP socket with a TLS ulp attached results
    in two sockets that share the same ulp context.
    The ulp context is freed while a socket is destroyed, so
    after one of the sockets is released, the second socket will
    trigger a use after free when it tries to access the ulp context
    attached to it.
    We restrict the TLS ulp to sockets in ESTABLISHED state
    to prevent the scenario above.
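
    The restriction amounts to a state check at attach time. The sketch
    below uses a hypothetical reduced socket and an error code picked for
    illustration.

    ```c
    #include <errno.h>
    #include <stddef.h>

    /* Hypothetical reduced socket states and socket. */
    enum sock_state { STATE_CLOSE, STATE_LISTEN, STATE_ESTABLISHED };

    struct sock {
        enum sock_state state;
        void *ulp_ctx;
    };

    /*
     * Attach a ULP context only to a connected socket. A listening socket
     * is refused, so accept() can never produce a child that shares a
     * context with its parent.
     */
    int ulp_attach(struct sock *sk, void *ctx)
    {
        if (sk->state != STATE_ESTABLISHED)
            return -EOPNOTSUPP; /* error code is an assumption of this sketch */
        if (sk->ulp_ctx)
            return -EEXIST;
        sk->ulp_ctx = ctx;
        return 0;
    }
    ```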

    Fixes: 3c4d7559159b ("tls: kernel TLS support")
    Reported-by: syzbot+904e7cd6c5c741609228@syzkaller.appspotmail.com
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ilya Lesokhin
     
  • [ Upstream commit cd443f1e91ca600a092e780e8250cd6a2954b763 ]

    Move up the extack reset/initialization in netlink_rcv_skb, so that
    those 'goto ack' will not skip it. Otherwise, later on netlink_ack
    may use the uninitialized extack and cause kernel crash.
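
    The control-flow problem is easiest to see in a reduced loop. The
    structures and the rcv_msgs() function below are hypothetical; only the
    shape (a per-message reset that an early 'goto ack' must not skip)
    follows the description above.

    ```c
    #include <string.h>

    /* Hypothetical reduced extended-ack state and message. */
    struct extack { const char *msg; };
    struct nl_msg { int is_request; int fail; };

    /* What the ack path saw for each message, kept for inspection. */
    struct ack_record { const char *msg; };

    void rcv_msgs(const struct nl_msg *msgs, int n, struct ack_record *acks)
    {
        struct extack extack;

        for (int i = 0; i < n; i++) {
            /* Reset first, so every path to "ack" sees initialised data. */
            memset(&extack, 0, sizeof(extack));

            if (!msgs[i].is_request)
                goto ack; /* would skip a reset placed further down */

            if (msgs[i].fail)
                extack.msg = "invalid attribute";
    ack:
            acks[i].msg = extack.msg;
        }
    }
    ```

    With the reset at the top, a non-request message that follows a failed
    request is acknowledged with a clean extack instead of stale data.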

    Fixes: cbbdf8433a5f ("netlink: extack needs to be reset each time through loop")
    Reported-by: syzbot+03bee3680a37466775e7@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Acked-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • [ Upstream commit 0d9c9f0f40ca262b67fc06a702b85f3976f5e1a1 ]

    sts variable is holding link speed as well as state. We should
    be using ls to index into ls_to_ethtool.
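
    To illustrate the difference between the raw status word and the
    extracted link-speed field, here is a sketch with an invented register
    layout and an invented speed table. None of the constants below are the
    driver's real definitions.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical layout: bit 0 is link state, bits 1-4 are the rate code. */
    #define STS_LINK       0x1u
    #define STS_RATE_SHIFT 1
    #define STS_RATE_MASK  0xFu

    /* Hypothetical rate-code to Mb/s table; 0 means unknown. */
    static const unsigned int ls_to_speed[] = {
        0, 0, 1000, 10000, 25000, 40000, 50000, 100000
    };
    #define LS_TABLE_LEN (sizeof(ls_to_speed) / sizeof(ls_to_speed[0]))

    unsigned int link_speed(uint32_t sts)
    {
        uint32_t ls = (sts >> STS_RATE_SHIFT) & STS_RATE_MASK;

        if (!(sts & STS_LINK) || ls >= LS_TABLE_LEN)
            return 0;
        return ls_to_speed[ls]; /* index with ls, never with the raw sts */
    }
    ```

    In this layout a status word of 7 means "link up, rate code 3"; using 7
    itself as the index would pick an unrelated table entry.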

    Fixes: 265aeb511bd5 ("nfp: add support for .get_link_ksettings()")
    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jakub Kicinski
     
  • [ Upstream commit e58edaa4863583b54409444f11b4f80dff0af1cd ]

    Helmut reported a bug about division by zero while
    running traffic and doing physical cable pull test.

    When the cable is unplugged the ppms become zero, so when
    dividing the current ppms by the previous ppms in the
    next dim iteration there is a division by zero.

    This patch prevents this division for both ppms and epms.
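
    A guarded comparison could look like the sketch below. The struct, the
    enum, the 10% threshold and the function names are assumptions made for
    illustration; the point is only that the previous sample is tested for
    zero before it is used as a divisor.

    ```c
    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical sample: packets and events per millisecond. */
    struct sample { int ppms; int epms; };

    enum cmp { STATS_SAME, STATS_BETTER, STATS_WORSE };

    /* Relative change above 10%. Callers must guarantee prev != 0. */
    static bool significant_diff(int curr, int prev)
    {
        return (100 * abs(curr - prev)) / prev > 10;
    }

    enum cmp compare_stats(const struct sample *curr, const struct sample *prev)
    {
        /* Cable pulled in the previous interval: nothing to divide by. */
        if (!prev->ppms)
            return curr->ppms ? STATS_BETTER : STATS_SAME;
        if (significant_diff(curr->ppms, prev->ppms))
            return curr->ppms > prev->ppms ? STATS_BETTER : STATS_WORSE;

        if (!prev->epms)
            return STATS_SAME;
        if (significant_diff(curr->epms, prev->epms))
            return curr->epms < prev->epms ? STATS_BETTER : STATS_WORSE;

        return STATS_SAME;
    }
    ```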

    Fixes: c3164d2fc48f ("net/mlx5e: Added BW check for DIM decision mechanism")
    Reported-by: Helmut Grauer
    Signed-off-by: Talat Batheesh
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Talat Batheesh
     
  • [ Upstream commit cbbdf8433a5f117b1a2119ea30fc651b61ef7570 ]

    syzbot triggered the WARN_ON in netlink_ack testing the bad_attr value.
    The problem is that netlink_rcv_skb loops over the skb repeatedly invoking
    the callback and without resetting the extack leaving potentially stale
    data. Initializing each time through avoids the WARN_ON.

    Fixes: 2d4bc93368f5a ("netlink: extended ACK reporting")
    Reported-by: syzbot+315fa6766d0f7c359327@syzkaller.appspotmail.com
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit 625637bf4afa45204bd87e4218645182a919485a ]

    After introducing sctp_stream structure, sctp uses stream->outcnt as the
    out stream nums instead of c.sinit_num_ostreams.

    However when users use sinit in cmsg, it only updates c.sinit_num_ostreams
    in sctp_sendmsg. At that moment, stream->outcnt is still using previous
    value. If its value is not updated, the sinit_num_ostreams of sinit could
    not really work.

    This patch is to fix it by updating stream->outcnt and reiniting stream
    if stream outcnt has been changed by sinit in sendmsg.

    Fixes: a83863174a61 ("sctp: prepare asoc stream for stream reconf")
    Signed-off-by: Xin Long
    Acked-by: Neil Horman
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • [ Upstream commit d0c081b49137cd3200f2023c0875723be66e7ce5 ]

    syzbot reported yet another crash [1] that is caused by
    insufficient validation of DODGY packets.

    Three issues combine here to trigger the crash.

    1) Flow dissection leaves with incorrect thoff field.

    2) skb_probe_transport_header() sets transport header to this invalid
    thoff, even if pointing after skb valid data.

    3) qdisc_pkt_len_init() reads out-of-bound data because it
    trusts tcp_hdrlen(skb)

    Possible fixes :

    - Full flow dissector validation before injecting bad DODGY packets in
    the stack.
    This approach was attempted here : https://patchwork.ozlabs.org/patch/861874/

    - Have more robust functions in the core.
    This might be needed anyway for stable versions.

    This patch fixes the flow dissection issue.
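
    One way to picture the invariant involved: a dissected transport header
    offset should never point past the valid data, and a consumer should
    only read a header that fits entirely. The helpers below are a
    hypothetical sketch of that idea and do not reproduce the patch.

    ```c
    #include <stdint.h>

    /* Clamp a dissected transport-header offset to the valid data length. */
    uint16_t cap_thoff(uint32_t nhoff, uint32_t data_len)
    {
        uint32_t capped = nhoff < data_len ? nhoff : data_len;

        return (uint16_t)(capped > UINT16_MAX ? UINT16_MAX : capped);
    }

    /* A header of hdr_len bytes at thoff is readable only if it fits. */
    int header_in_bounds(uint16_t thoff, uint32_t hdr_len, uint32_t data_len)
    {
        return (uint64_t)thoff + hdr_len <= data_len;
    }
    ```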

    [1]
    CPU: 1 PID: 3144 Comm: syzkaller271204 Not tainted 4.15.0-rc4-mm1+ #49
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:53
    print_address_description+0x73/0x250 mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:355 [inline]
    kasan_report+0x23b/0x360 mm/kasan/report.c:413
    __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:432
    __tcp_hdrlen include/linux/tcp.h:35 [inline]
    tcp_hdrlen include/linux/tcp.h:40 [inline]
    qdisc_pkt_len_init net/core/dev.c:3160 [inline]
    __dev_queue_xmit+0x20d3/0x2200 net/core/dev.c:3465
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3554
    packet_snd net/packet/af_packet.c:2943 [inline]
    packet_sendmsg+0x3ad5/0x60a0 net/packet/af_packet.c:2968
    sock_sendmsg_nosec net/socket.c:628 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:638
    sock_write_iter+0x31a/0x5d0 net/socket.c:907
    call_write_iter include/linux/fs.h:1776 [inline]
    new_sync_write fs/read_write.c:469 [inline]
    __vfs_write+0x684/0x970 fs/read_write.c:482
    vfs_write+0x189/0x510 fs/read_write.c:544
    SYSC_write fs/read_write.c:589 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:581
    entry_SYSCALL_64_fastpath+0x1f/0x96

    Fixes: 34fad54c2537 ("net: __skb_flow_dissect() must cap its return value")
    Fixes: a6e544b0a88b ("flow_dissector: Jump to exit code in __skb_flow_dissect")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Reported-by: syzbot
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 4df0bfc79904b7169dc77dcce44598b1545721f9 ]

    tfile->tun could be detached before we close the tun fd,
    via tun_detach_all(), so it should not be used to check for
    tfile->tx_array.

    As Jason suggested, we probably have to clean it up
    unconditionally both in __tun_detach() and tun_detach_all(),
    but this requires checking whether it is initialized or not.
    Currently skb_array_cleanup() doesn't have such a check,
    so I check it in the caller and introduce a helper function,
    it is a bit ugly but we can always improve it in net-next.

    Reported-by: Dmitry Vyukov
    Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
    Cc: Jason Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit 1ecdaea02ca6bfacf2ecda500dc1af51e9780c42 ]

    Driver periodically samples all neighbors configured in device
    in order to update the kernel regarding their state. When finding
    an entry configured in HW that doesn't show in neigh_lookup()
    driver logs an error message.
    This introduces a race when removing multiple neighbors -
    it's possible that a given entry would still be configured in HW
    as its removal is still being processed but is already removed
    from the kernel's neighbor tables.

    Simply remove the error message and gracefully accept such events.

    Fixes: c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table")
    Fixes: 60f040ca11b9 ("mlxsw: spectrum_router: Periodically dump active IPv6 neighbours")
    Signed-off-by: Yuval Mintz
    Reviewed-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuval Mintz
     
  • [ Upstream commit 121d57af308d0cf943f08f4738d24d3966c38cd9 ]

    Validate gso_type during segmentation as SKB_GSO_DODGY sources
    may pass packets where the gso_type does not match the contents.

    Syzkaller was able to enter the SCTP gso handler with a packet of
    gso_type SKB_GSO_TCPV4.

    On entry of transport layer gso handlers, verify that the gso_type
    matches the transport protocol.
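
    The check on entry can be sketched as below. The flag values and the
    two handler functions are illustrative only and are not the kernel's
    SKB_GSO_* constants or its segmentation callbacks.

    ```c
    #include <errno.h>

    /* Illustrative flag values, not the kernel's SKB_GSO_* constants. */
    enum {
        GSO_TCPV4 = 1 << 0,
        GSO_TCPV6 = 1 << 1,
        GSO_UDP   = 1 << 2,
        GSO_SCTP  = 1 << 3,
    };

    /* Hypothetical reduced packet carrying only its declared gso_type. */
    struct pkt { unsigned int gso_type; };

    /* SCTP handler: reject packets whose declared type is not SCTP. */
    int sctp_gso_segment(const struct pkt *p)
    {
        if (!(p->gso_type & GSO_SCTP))
            return -EINVAL;
        return 0; /* segmentation itself is out of scope for the sketch */
    }

    /* TCP over IPv4 handler: same idea, different expected type. */
    int tcp4_gso_segment(const struct pkt *p)
    {
        if (!(p->gso_type & GSO_TCPV4))
            return -EINVAL;
        return 0;
    }
    ```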

    Fixes: 90017accff61 ("sctp: Add GSO support")
    Link: http://lkml.kernel.org/r/
    Reported-by: syzbot+fee64147a25aecd48055@syzkaller.appspotmail.com
    Signed-off-by: Willem de Bruijn
    Acked-by: Jason Wang
    Reviewed-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit 128bb975dc3c25d00de04e503e2fe0a780d04459 ]

    Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path,
    call common GRE functions") moved dev->mtu initialization
    from ip6gre_tunnel_setup() to ip6gre_tunnel_init(), as a
    result, the previously set values, before ndo_init(), are
    reset in the following cases:

    * rtnl_create_link() can update dev->mtu from IFLA_MTU
    parameter.

    * ip6gre_tnl_link_config() is invoked before ndo_init() in
    netlink and ioctl setup, so ndo_init() can reset MTU
    adjustments with the lower device MTU as well, dev->mtu
    and dev->hard_header_len.

    Not applicable for ip6gretap because it has one more call
    to ip6gre_tnl_link_config(tunnel, 1) in ip6gre_tap_init().

    Fix the first case by updating dev->mtu with 'tb[IFLA_MTU]'
    parameter if a user sets it manually on a device creation,
    and fix the second one by moving ip6gre_tnl_link_config()
    call after register_netdevice().

    Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
    Fixes: db2ec95d1ba4 ("ip6_gre: Fix MTU setting")
    Signed-off-by: Alexey Kodanev
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kodanev
     
  • [ Upstream commit 52acf06451930eb4cefabd5ecea56e2d46c32f76 ]

    The commit 622190669403 ("be2net: Request RSS capability of Rx interface
    depending on number of Rx rings") modified be_update_queues() so the
    IFACE (HW representation of the netdevice) is destroyed and then
    re-created. This causes a regression because potential promiscuous mode
    is not restored properly during be_open() because the driver thinks
    that the HW has promiscuous mode already enabled.

    Note that Lancer is not affected by this bug because RX-filter flags are
    disabled during be_close() for this chipset.

    Cc: Sathya Perla
    Cc: Ajit Khaparde
    Cc: Sriharsha Basavapatna
    Cc: Somnath Kotur

    Fixes: 622190669403 ("be2net: Request RSS capability of Rx interface depending on number of Rx rings")
    Signed-off-by: Ivan Vecera
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ivan Vecera
     
  • [ Upstream commit 0171c41835591e9aa2e384b703ef9a6ae367c610 ]

    ppp_dev_uninit(), which is the .ndo_uninit() handler of PPP devices,
    needs to lock pn->all_ppp_mutex. Therefore we mustn't call
    register_netdevice() with pn->all_ppp_mutex already locked, or we'd
    deadlock in case register_netdevice() fails and calls .ndo_uninit().

    Fortunately, we can unlock pn->all_ppp_mutex before calling
    register_netdevice(). This lock protects pn->units_idr, which isn't
    used in the device registration process.

    However, keeping pn->all_ppp_mutex locked during device registration
    did ensure that no device in transient state would be published in
    pn->units_idr. In practice, unlocking it before calling
    register_netdevice() doesn't change this property: ppp_unit_register()
    is called with 'ppp_mutex' locked and all searches done in
    pn->units_idr hold this lock too.

    Fixes: 8cb775bc0a34 ("ppp: fix device unregistration upon netns deletion")
    Reported-and-tested-by: syzbot+367889b9c9e279219175@syzkaller.appspotmail.com
    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Guillaume Nault
     
  • [ Upstream commit 05e0cc84e00c54fb152d1f4b86bc211823a83d0c ]

    mlx5_get_vector_affinity used to call pci_irq_get_affinity and after
    reverting the patch that sets the device affinity via PCI_IRQ_AFFINITY
    API, calling pci_irq_get_affinity becomes useless and it breaks RDMA
    mlx5 users. To fix this, this patch provides an alternative way to
    retrieve IRQ vector affinity using legacy IRQ API, following
    smp_affinity read procfs implementation.

    Fixes: 231243c82793 ("Revert mlx5: move affinity hints assignments to generic code")
    Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code")
    Cc: Sagi Grimberg
    Signed-off-by: Saeed Mahameed
    Signed-off-by: Greg Kroah-Hartman

    Saeed Mahameed
     
  • [ Upstream commit 8978cc921fc7fad3f4d6f91f1da01352aeeeff25 ]

    There are systems platform information management interfaces (such as
    HOST2BMC) for which we cannot disable local loopback multicast traffic.

    Separate disable_local_lb_mc and disable_local_lb_uc capability bits so
    driver will not disable multicast loopback traffic if not supported.
    (It is expected that Firmware will not set disable_local_lb_mc if
    HOST2BMC is running for example.)

    Function mlx5_nic_vport_update_local_lb will make a best effort to
    disable/enable UC/MC loopback traffic and return success only if it
    managed to change everything allowed by Firmware.

    Adapt mlx5_ib and mlx5e to support the new cap bits.

    Fixes: 2c43c5a036be ("net/mlx5e: Enable local loopback in loopback selftest")
    Fixes: c85023e153e3 ("IB/mlx5: Add raw ethernet local loopback support")
    Fixes: bded747bb432 ("net/mlx5: Add raw ethernet local loopback firmware command")
    Signed-off-by: Eran Ben Elisha
    Cc: kernel-team@fb.com
    Signed-off-by: Saeed Mahameed
    Signed-off-by: Greg Kroah-Hartman

    Eran Ben Elisha
     
  • [ Upstream commit 59b36613e85fb16ebf9feaf914570879cd5c2a21 ]

    When tipc_node_find_by_name() fails, the nlmsg is not
    freed.

    While on it, switch to a goto label to properly
    free it.
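
    The goto-label pattern mentioned here, in a self-contained form. The
    allocation helpers, the counter and get_link_info() are hypothetical;
    the 'found' parameter stands in for the result of the node lookup.

    ```c
    #include <errno.h>
    #include <stdlib.h>

    /* Outstanding message buffers, only here to make a leak visible. */
    int live_msgs;

    static void *msg_new(void)
    {
        void *m = malloc(64);

        if (m)
            live_msgs++;
        return m;
    }

    static void msg_free(void *m)
    {
        if (m) {
            free(m);
            live_msgs--;
        }
    }

    int get_link_info(int found, void **reply)
    {
        void *msg = msg_new();
        int err;

        if (!msg)
            return -ENOMEM;
        if (!found) {
            err = -EINVAL;
            goto err_free; /* every failure after allocation ends up here */
        }

        *reply = msg; /* ownership passes to the caller */
        return 0;

    err_free:
        msg_free(msg);
        return err;
    }
    ```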

    Fixes: be9c086715c ("tipc: narrow down exposure of struct tipc_node")
    Reported-by: Dmitry Vyukov
    Cc: Jon Maloy
    Cc: Ying Xue
    Signed-off-by: Cong Wang
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit a0ff660058b88d12625a783ce9e5c1371c87951f ]

    After commit cea0cc80a677 ("sctp: use the right sk after waking up from
    wait_buf sleep"), it may change to lock another sk if the asoc has been
    peeled off in sctp_wait_for_sndbuf.

    However, the asoc's new sk could be already closed elsewhere, as it's in
    the sendmsg context of the old sk that can't avoid the new sk's closing.
    If the sk's last refcnt is held by this asoc, later on after putting
    this asoc, the new sk will be freed, while under its own lock.

    This patch is to revert that commit, but fix the old issue by returning
    error under the old sk's lock.

    Fixes: cea0cc80a677 ("sctp: use the right sk after waking up from wait_buf sleep")
    Reported-by: syzbot+ac6ea7baa4432811eb50@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • [ Upstream commit c5006b8aa74599ce19104b31d322d2ea9ff887cc ]

    The check in sctp_sockaddr_af is not robust enough to forbid binding a
    v4mapped v6 addr on a v4 socket.

    Worse, the v4 socket's bind_verify would not convert this
    v4mapped v6 addr to a v4 addr. syzbot even reported a crash as the v4
    socket bound a v6 addr.

    This patch is to fix it by doing the common sa.sa_family check first,
    then AF_INET check for v4mapped v6 addrs.

    Fixes: 7dab83de50c7 ("sctp: Support ipv6only AF_INET6 sockets.")
    Reported-by: syzbot+7b7b518b1228d2743963@syzkaller.appspotmail.com
    Acked-by: Neil Horman
    Signed-off-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • [ Upstream commit a78e93661c5fd30b9e1dee464b2f62f966883ef7 ]

    Hardware statistics retrieval hurts in tight invocation loops.

    Avoid extraneous write and enforce strict ordering of writes targeted to
    the tally counters dump area address registers.

    Signed-off-by: Francois Romieu
    Tested-by: Oliver Freyermuth
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Francois Romieu
     
  • [ Upstream commit 02612bb05e51df8489db5e94d0cf8d1c81f87b0c ]

    In pppoe_sendmsg(), reserving dev->hard_header_len bytes of headroom
    was probably fine before the introduction of ->needed_headroom in
    commit f5184d267c1a ("net: Allow netdevices to specify needed head/tailroom").

    But now, virtual devices typically advertise the size of their overhead
    in dev->needed_headroom, so we must also take it into account in
    skb_reserve().
    Allocation size of skb is also updated to take dev->needed_tailroom
    into account and replace the arbitrary 32 bytes with the real size of
    a PPPoE header.
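
    The size arithmetic described above can be written out as follows. The
    reduced struct netdev and the two helpers are hypothetical, and any
    alignment rounding the kernel may apply is left out of this sketch. The
    6-byte constant is the size of a PPPoE header.

    ```c
    #include <stddef.h>

    /* Hypothetical reduced view of the device fields involved. */
    struct netdev {
        unsigned short hard_header_len;
        unsigned short needed_headroom;
        unsigned short needed_tailroom;
    };

    /* Version/type (1), code (1), session id (2), length (2). */
    #define PPPOE_HDR_LEN 6

    /* Headroom to reserve in front of the PPPoE header. */
    size_t pppoe_headroom(const struct netdev *dev)
    {
        return (size_t)dev->hard_header_len + dev->needed_headroom;
    }

    /* Total buffer size for one outgoing frame. */
    size_t pppoe_alloc_size(const struct netdev *dev, size_t payload_len)
    {
        return pppoe_headroom(dev) + PPPOE_HDR_LEN + payload_len +
               dev->needed_tailroom;
    }
    ```

    For a device that reports hard_header_len == 0 but advertises its
    overhead in needed_headroom, pppoe_headroom() is still non-zero, so a
    lower-layer header can be prepended without running off the buffer.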

    This issue was discovered by syzbot, who connected a pppoe socket to a
    gre device which had dev->header_ops->create == ipgre_header and
    dev->hard_header_len == 0. Therefore, PPPoE didn't reserve any
    headroom, and dev_hard_header() crashed when ipgre_header() tried to
    prepend its header to skb->data.

    skbuff: skb_under_panic: text:000000001d390b3a len:31 put:24
    head:00000000d8ed776f data:000000008150e823 tail:0x7 end:0xc0 dev:gre0
    ------------[ cut here ]------------
    kernel BUG at net/core/skbuff.c:104!
    invalid opcode: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 1 PID: 3670 Comm: syzkaller801466 Not tainted
    4.15.0-rc7-next-20180115+ #97
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:skb_panic+0x162/0x1f0 net/core/skbuff.c:100
    RSP: 0018:ffff8801d9bd7840 EFLAGS: 00010282
    RAX: 0000000000000083 RBX: ffff8801d4f083c0 RCX: 0000000000000000
    RDX: 0000000000000083 RSI: 1ffff1003b37ae92 RDI: ffffed003b37aefc
    RBP: ffff8801d9bd78a8 R08: 1ffff1003b37ae8a R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff86200de0
    R13: ffffffff84a981ad R14: 0000000000000018 R15: ffff8801d2d34180
    FS: 00000000019c4880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000208bc000 CR3: 00000001d9111001 CR4: 00000000001606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    skb_under_panic net/core/skbuff.c:114 [inline]
    skb_push+0xce/0xf0 net/core/skbuff.c:1714
    ipgre_header+0x6d/0x4e0 net/ipv4/ip_gre.c:879
    dev_hard_header include/linux/netdevice.h:2723 [inline]
    pppoe_sendmsg+0x58e/0x8b0 drivers/net/ppp/pppoe.c:890
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:640
    sock_write_iter+0x31a/0x5d0 net/socket.c:909
    call_write_iter include/linux/fs.h:1775 [inline]
    do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653
    do_iter_write+0x154/0x540 fs/read_write.c:932
    vfs_writev+0x18a/0x340 fs/read_write.c:977
    do_writev+0xfc/0x2a0 fs/read_write.c:1012
    SYSC_writev fs/read_write.c:1085 [inline]
    SyS_writev+0x27/0x30 fs/read_write.c:1082
    entry_SYSCALL_64_fastpath+0x29/0xa0

    Admittedly PPPoE shouldn't be allowed to run on non-Ethernet-like
    interfaces, but reserving space for ->needed_headroom is a more
    fundamental issue that needs to be addressed first.

    Same problem exists for __pppoe_xmit(), which also needs to take
    dev->needed_headroom into account in skb_cow_head().

    Fixes: f5184d267c1a ("net: Allow netdevices to specify needed head/tailroom")
    Reported-by: syzbot+ed0838d0fa4c4f2b528e20286e6dc63effc7c14d@syzkaller.appspotmail.com
    Signed-off-by: Guillaume Nault
    Reviewed-by: Xin Long
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Guillaume Nault
     
  • [ Upstream commit 1e19c4d689dc1e95bafd23ef68fbc0c6b9e05180 ]

    Sukumar reported that sends to the local broadcast address
    (255.255.255.255) are broken. Check for the address in vrf driver
    and do not redirect to the VRF device - similar to multicast
    packets.
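
    A minimal user-space sketch of the destination check described
    above. skip_vrf_redirect() is a hypothetical name; the sketch only
    models the decision (multicast or limited broadcast means no
    redirect), not the driver code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * daddr_be is an IPv4 destination in network byte order.
 * Returns true when the packet should NOT be redirected to the VRF
 * device: multicast destinations and 255.255.255.255.
 */
bool skip_vrf_redirect(uint32_t daddr_be)
{
    uint32_t daddr = ntohl(daddr_be);

    return IN_MULTICAST(daddr) || daddr == INADDR_BROADCAST;
}
```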

    With this change sockets can use SO_BINDTODEVICE to specify an
    egress interface and receive responses. Note: the egress interface
    cannot be a VRF device; it must be the enslaved device.

    https://bugzilla.kernel.org/show_bug.cgi?id=198521

    Reported-by: Sukumar Gopalakrishnan
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit 30be8f8dba1bd2aff73e8447d59228471233a3d4 ]

    sendfile() calls can hang endlessly when using Kernel TLS if a socket
    error occurs. Socket error codes must be inverted by Kernel TLS before
    returning because they are stored with a positive sign. If returned
    non-inverted, they are interpreted as the number of bytes sent, causing
    endless looping of the splice mechanism behind sendfile().

    Signed-off-by: Robert Hering
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    r.hering@avm.de
     
  • [ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ]

    When a tcp socket is closed, if it detects that its net namespace is
    exiting, close immediately and do not wait for FIN sequence.

    For normal sockets, a reference is taken to their net namespace, so it will
    never exit while the socket is open. However, kernel sockets do not take a
    reference to their net namespace, so it may begin exiting while the kernel
    socket is still open. In this case if the kernel socket is a tcp socket,
    it will stay open trying to complete its close sequence. The sock's dst(s)
    hold a reference to their interface, which are all transferred to the
    namespace's loopback interface when the real interfaces are taken down.
    When the namespace tries to take down its loopback interface, it hangs
    waiting for all references to the loopback interface to release, which
    results in messages like:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

    These messages continue until the socket finally times out and closes.
    Since the net namespace cleanup holds the net_mutex while calling its
    registered pernet callbacks, any new net namespace initialization is
    blocked until the current net namespace finishes exiting.

    After this change, the tcp socket notices the exiting net namespace, and
    closes immediately, releasing its dst(s) and their reference to the
    loopback interface, which lets the net namespace continue exiting.

    Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
    Signed-off-by: Dan Streetman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dan Streetman
     
  • [ Upstream commit 7c68d1a6b4db9012790af7ac0f0fdc0d2083422a ]

    Without proper validation of DODGY packets, we might very well
    feed qdisc_pkt_len_init() with invalid GSO packets.

    tcp_hdrlen() might access out-of-bound data, so let's use
    skb_header_pointer() and proper checks.
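
    The idea of a bounds-checked header read can be sketched in user
    space as below. header_pointer() and tcp_hdr_len() are hypothetical
    helpers that only mimic the pattern (copy the header out if, and only
    if, it lies fully inside the packet); they are not the kernel
    functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes at offset into buf, or return NULL when the range
 * does not fit inside the packet.
 */
const void *header_pointer(const uint8_t *data, size_t data_len,
                           size_t offset, size_t len, void *buf)
{
    if (offset > data_len || len > data_len - offset)
        return NULL;
    memcpy(buf, data + offset, len);
    return buf;
}

/*
 * TCP header length in bytes, taken from the data-offset field
 * (upper 4 bits of byte 12), or 0 when the fixed 20-byte header
 * cannot be read safely.
 */
unsigned int tcp_hdr_len(const uint8_t *pkt, size_t pkt_len, size_t thoff)
{
    uint8_t hdr[20];

    if (!header_pointer(pkt, pkt_len, thoff, sizeof(hdr), hdr))
        return 0;
    return (unsigned int)(hdr[12] >> 4) * 4;
}
```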

    Whole story is described in commit d0c081b49137 ("flow_dissector:
    properly cap thoff field")

    We have the goal of validating DODGY packets earlier in the stack,
    so we might very well revert this fix in the future.

    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Cc: Jason Wang
    Reported-by: syzbot+9da69ebac7dddd804552@syzkaller.appspotmail.com
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit ad23b750933ea7bf962678972a286c78a8fa36aa ]

    Commit "net: igmp: Use correct source address on IGMPv3 reports"
    introduced a check to validate the source address of locally generated
    IGMPv3 packets.
    Instead of checking the local interface address directly, it uses
    inet_ifa_match(fl4->saddr, ifa), which checks if the address is on the
    local subnet (or equal to the point-to-point address if used).

    This breaks for point-to-point interfaces, so check against
    ifa->ifa_local directly.

    Cc: Kevin Cernekee
    Fixes: a46182b00290 ("net: igmp: Use correct source address on IGMPv3 reports")
    Reported-by: Sebastian Gottschall
    Signed-off-by: Felix Fietkau
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Felix Fietkau
     
  • [ Upstream commit a5b1379afbfabf91e3a689e82ac619a7157336b3 ]

    Initialize the previously uninitialized tx_qlen to an appropriate value
    when USB Full Speed is used.

    Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
    Signed-off-by: Yuiko Oshino
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuiko Oshino
     
  • [ Upstream commit 95ef498d977bf44ac094778fd448b98af158a3e6 ]

    In my last patch, I missed the fact that cork.base.dst was not
    initialized in ip6_make_skb():

    If ip6_setup_cork() returns an error, we might attempt a dst_release()
    on some random pointer.

    Fixes: 862c03ee1deb ("ipv6: fix possible mem leaks in ipv6_make_skb()")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 749439bfac6e1a2932c582e2699f91d329658196 ]

    The logic in __ip6_append_data() assumes that the MTU is at least large
    enough for the headers. A device's MTU may be adjusted after being
    added while sendmsg() is processing data, resulting in
    __ip6_append_data() seeing any MTU. For an mtu smaller than the size of
    the fragmentation header, the math results in a negative 'maxfraglen',
    which causes problems when refragmenting any previous skb in the
    skb_write_queue, leaving it possibly malformed.

    With this change, sendmsg() instead returns EINVAL when the MTU is
    calculated to be less than IPV6_MIN_MTU.
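
    A rough user-space model of the arithmetic and of the guard.
    max_frag_len() and check_append_mtu() are hypothetical names, and
    the fragment-length formula is a simplified rendering of the
    calculation the message refers to, shown only to illustrate how a
    tiny MTU drives the result negative.

```c
#include <assert.h>
#include <errno.h>

#define IPV6_MIN_MTU 1280   /* minimum link MTU that IPv6 requires */
#define FRAG_HDR_LEN 8      /* size of the IPv6 fragment header */

/*
 * Largest fragment length for a given MTU, rounded down to a multiple
 * of 8 bytes of payload. Goes negative when mtu is smaller than the
 * headers.
 */
int max_frag_len(int mtu, int fragheaderlen)
{
    assert(fragheaderlen >= 0);
    return ((mtu - fragheaderlen) & ~7) + fragheaderlen - FRAG_HDR_LEN;
}

/* Guard: reject the send before any fragment-size math is done. */
int check_append_mtu(int mtu)
{
    if (mtu < IPV6_MIN_MTU)
        return -EINVAL;
    return 0;
}
```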

    Found by syzkaller:
    kernel BUG at ./include/linux/skbuff.h:2064!
    invalid opcode: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 1 PID: 14216 Comm: syz-executor5 Not tainted 4.13.0-rc4+ #2
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    task: ffff8801d0b68580 task.stack: ffff8801ac6b8000
    RIP: 0010:__skb_pull include/linux/skbuff.h:2064 [inline]
    RIP: 0010:__ip6_make_skb+0x18cf/0x1f70 net/ipv6/ip6_output.c:1617
    RSP: 0018:ffff8801ac6bf570 EFLAGS: 00010216
    RAX: 0000000000010000 RBX: 0000000000000028 RCX: ffffc90003cce000
    RDX: 00000000000001b8 RSI: ffffffff839df06f RDI: ffff8801d9478ca0
    RBP: ffff8801ac6bf780 R08: ffff8801cc3f1dbc R09: 0000000000000000
    R10: ffff8801ac6bf7a0 R11: 43cb4b7b1948a9e7 R12: ffff8801cc3f1dc8
    R13: ffff8801cc3f1d40 R14: 0000000000001036 R15: dffffc0000000000
    FS: 00007f43d740c700(0000) GS:ffff8801dc100000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f7834984000 CR3: 00000001d79b9000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    ip6_finish_skb include/net/ipv6.h:911 [inline]
    udp_v6_push_pending_frames+0x255/0x390 net/ipv6/udp.c:1093
    udpv6_sendmsg+0x280d/0x31a0 net/ipv6/udp.c:1363
    inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    SYSC_sendto+0x352/0x5a0 net/socket.c:1750
    SyS_sendto+0x40/0x50 net/socket.c:1718
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x4512e9
    RSP: 002b:00007f43d740bc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 00000000007180a8 RCX: 00000000004512e9
    RDX: 000000000000002e RSI: 0000000020d08000 RDI: 0000000000000005
    RBP: 0000000000000086 R08: 00000000209c1000 R09: 000000000000001c
    R10: 0000000000040800 R11: 0000000000000216 R12: 00000000004b9c69
    R13: 00000000ffffffff R14: 0000000000000005 R15: 00000000202c2000
    Code: 9e 01 fe e9 c5 e8 ff ff e8 7f 9e 01 fe e9 4a ea ff ff 48 89 f7 e8 52 9e 01 fe e9 aa eb ff ff e8 a8 b6 cf fd 0f 0b e8 a1 b6 cf fd 0b 49 8d 45 78 4d 8d 45 7c 48 89 85 78 fe ff ff 49 8d 85 ba
    RIP: __skb_pull include/linux/skbuff.h:2064 [inline] RSP: ffff8801ac6bf570
    RIP: __ip6_make_skb+0x18cf/0x1f70 net/ipv6/ip6_output.c:1617 RSP: ffff8801ac6bf570

    Reported-by: syzbot
    Signed-off-by: Mike Maloney
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Mike Maloney
     
  • [ Upstream commit e9191ffb65d8e159680ce0ad2224e1acbde6985c ]

    Commit 513674b5a2c9 ("net: reevalulate autoflowlabel setting after
    sysctl setting") removed the initialisation of
    ipv6_pinfo::autoflowlabel and added a second flag to indicate
    whether this field or the net namespace default should be used.

    The getsockopt() handling for this case was not updated, so it
    currently returns 0 for all sockets for which IPV6_AUTOFLOWLABEL is
    not explicitly enabled. Fix it to return the effective value, whether
    that has been set at the socket or net namespace level.

    Fixes: 513674b5a2c9 ("net: reevalulate autoflowlabel setting after sysctl ...")
    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ben Hutchings