02 Feb, 2019

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2019-01-31

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) disable preemption in sender side of socket filters, from Alexei.

    2) fix two potential deadlocks in syscall bpf lookup and prog_register,
    from Martin and Alexei.

    3) fix BTF to allow typedef on func_proto, from Yonghong.

    4) two bpftool fixes, from Jiri and Paolo.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Jan, 2019

1 commit

  • Assign a default net namespace to netdevs created by init_dummy_netdev().
    Fixes a NULL pointer dereference caused by busy-polling a socket bound to
    an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat
    if napi_poll() received packets:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
    IP: napi_busy_loop+0xd6/0x200
    Call Trace:
    sock_poll+0x5e/0x80
    do_sys_poll+0x324/0x5a0
    SyS_poll+0x6c/0xf0
    do_syscall_64+0x6b/0x1f0
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Fixes: 7db6b048da3b ("net: Commonize busy polling code to focus on napi_id instead of socket")
    Signed-off-by: Josh Elsasser
    Signed-off-by: David S. Miller

    Josh Elsasser
     

29 Jan, 2019

1 commit

  • Despite having stopped the parser, we still need to deinitialize it
    by calling strp_done so that it cancels its work. Otherwise the worker
    thread can run after we have freed the parser, and attempt to access
    its workqueue resulting in a use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in pwq_activate_delayed_work+0x1b/0x1d0
    Read of size 8 at addr ffff888069975240 by task kworker/u2:2/93

    CPU: 0 PID: 93 Comm: kworker/u2:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
    Workqueue: (null) (kstrp)
    Call Trace:
    print_address_description+0x6e/0x2b0
    ? pwq_activate_delayed_work+0x1b/0x1d0
    kasan_report+0xfd/0x177
    ? pwq_activate_delayed_work+0x1b/0x1d0
    ? pwq_activate_delayed_work+0x1b/0x1d0
    pwq_activate_delayed_work+0x1b/0x1d0
    ? process_one_work+0x4aa/0x660
    pwq_dec_nr_in_flight+0x9b/0x100
    worker_thread+0x82/0x680
    ? process_one_work+0x660/0x660
    kthread+0x1b9/0x1e0
    ? __kthread_create_on_node+0x250/0x250
    ret_from_fork+0x1f/0x30

    Allocated by task 111:
    sk_psock_init+0x3c/0x1b0
    sock_map_link.isra.2+0x103/0x4b0
    sock_map_update_common+0x94/0x270
    sock_map_update_elem+0x145/0x160
    __se_sys_bpf+0x152e/0x1e10
    do_syscall_64+0xb2/0x3e0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 112:
    kfree+0x7f/0x140
    process_one_work+0x40b/0x660
    worker_thread+0x82/0x680
    kthread+0x1b9/0x1e0
    ret_from_fork+0x1f/0x30

    The buggy address belongs to the object at ffff888069975180
    which belongs to the cache kmalloc-512 of size 512
    The buggy address is located 192 bytes inside of
    512-byte region [ffff888069975180, ffff888069975380)
    The buggy address belongs to the page:
    page:ffffea0001a65d00 count:1 mapcount:0 mapping:ffff88806d401280 index:0x0 compound_mapcount: 0
    flags: 0x4000000000010200(slab|head)
    raw: 4000000000010200 dead000000000100 dead000000000200 ffff88806d401280
    raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff888069975100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff888069975180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff888069975200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff888069975280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff888069975300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    Reported-by: Marek Majkowski
    Signed-off-by: Jakub Sitnicki
    Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Jakub Sitnicki
     

23 Jan, 2019

1 commit


20 Jan, 2019

2 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2019-01-20

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) Fix an out-of-bounds access in __bpf_redirect_no_mac, from Willem.

    2) Fix bpf_setsockopt to reset sock dst on SO_MARK changes, from Peter.

    3) Fix map in map masking to prevent out-of-bounds access under
    speculative execution, from Daniel.

    4) Fix bpf_setsockopt's SO_MAX_PACING_RATE to support TCP internal
    pacing, from Yuchung.

    5) Fix json writer license in bpftool, from Thomas.

    6) Fix AF_XDP to check if a queue actually exists during umem
    setup, from Krzysztof.

    7) Several fixes to BPF stackmap's build id handling. Another fix
    for bpftool build to account for libbfd variations wrt linking
    requirements, from Stanislav.

    8) Fix BPF samples build with clang by working around missing asm
    goto, from Yonghong.

    9) Fix libbpf to retry program load on signal interrupt, from Lorenz.

    10) Various minor compile warning fixes in BPF code, from Mathieu.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Syzkaller was able to construct a packet of negative length by
    redirecting from bpf_prog_test_run_skb with BPF_PROG_TYPE_LWT_XMIT:

    BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:345 [inline]
    BUG: KASAN: slab-out-of-bounds in skb_copy_from_linear_data include/linux/skbuff.h:3421 [inline]
    BUG: KASAN: slab-out-of-bounds in __pskb_copy_fclone+0x2dd/0xeb0 net/core/skbuff.c:1395
    Read of size 4294967282 at addr ffff8801d798009c by task syz-executor2/12942

    kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
    check_memory_region_inline mm/kasan/kasan.c:260 [inline]
    check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
    memcpy+0x23/0x50 mm/kasan/kasan.c:302
    memcpy include/linux/string.h:345 [inline]
    skb_copy_from_linear_data include/linux/skbuff.h:3421 [inline]
    __pskb_copy_fclone+0x2dd/0xeb0 net/core/skbuff.c:1395
    __pskb_copy include/linux/skbuff.h:1053 [inline]
    pskb_copy include/linux/skbuff.h:2904 [inline]
    skb_realloc_headroom+0xe7/0x120 net/core/skbuff.c:1539
    ipip6_tunnel_xmit net/ipv6/sit.c:965 [inline]
    sit_tunnel_xmit+0xe1b/0x30d0 net/ipv6/sit.c:1029
    __netdev_start_xmit include/linux/netdevice.h:4325 [inline]
    netdev_start_xmit include/linux/netdevice.h:4334 [inline]
    xmit_one net/core/dev.c:3219 [inline]
    dev_hard_start_xmit+0x295/0xc90 net/core/dev.c:3235
    __dev_queue_xmit+0x2f0d/0x3950 net/core/dev.c:3805
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
    __bpf_tx_skb net/core/filter.c:2016 [inline]
    __bpf_redirect_common net/core/filter.c:2054 [inline]
    __bpf_redirect+0x5cf/0xb20 net/core/filter.c:2061
    ____bpf_clone_redirect net/core/filter.c:2094 [inline]
    bpf_clone_redirect+0x2f6/0x490 net/core/filter.c:2066
    bpf_prog_41f2bcae09cd4ac3+0xb25/0x1000

    The generated test constructs a packet with mac header, network
    header, skb->data pointing to network header and skb->len 0.

    Redirecting to a sit0 through __bpf_redirect_no_mac pulls the
    mac length, even though skb->data already is at skb->network_header.
    bpf_prog_test_run_skb has already pulled it as LWT_XMIT !is_l2.

    Update the offset calculation to pull only if skb->data differs
    from skb->network_header, which is not true in this case.
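The "Read of size 4294967282" in the report above equals 2^32 - 14, which is consistent with a 14-byte mac header being pulled from a packet whose length is already 0. The following is a minimal userspace sketch of that arithmetic and of the guarded pull. The struct `toy_skb` and both helper functions are hypothetical stand-ins for illustration, not kernel APIs, and the 14-byte mac length is an assumption inferred from the reported size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields involved.
 * Offsets are measured from the start of the buffer. */
struct toy_skb {
    uint32_t data;            /* offset of the current data pointer */
    uint32_t network_header;  /* offset of the network header */
    uint32_t len;             /* bytes from data to the end of the packet */
};

/* Behaviour described as buggy: always pull the mac length. */
static uint32_t pull_unconditional(struct toy_skb skb, uint32_t mac_len)
{
    skb.data += mac_len;
    skb.len -= mac_len;       /* wraps around when len < mac_len */
    return skb.len;
}

/* Behaviour described by the fix: pull only the distance between data
 * and the network header, which is zero when data is already there. */
static uint32_t pull_if_needed(struct toy_skb skb)
{
    assert(skb.network_header >= skb.data);
    uint32_t off = skb.network_header - skb.data;

    if (off) {
        skb.data += off;
        skb.len -= off;
    }
    return skb.len;
}
```

With data already at the network header and a length of 0, the unconditional variant yields the huge unsigned length seen in the KASAN report, while the guarded variant leaves the length at 0.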

    The test itself can be run only from commit 1cf1cae963c2 ("bpf:
    introduce BPF_PROG_TEST_RUN command"), but the same type of packets
    with skb at network header could already be built from lwt xmit hooks,
    so this fix is more relevant to that commit.

    Also set the mac header on redirect from LWT_XMIT, as even after this
    change to __bpf_redirect_no_mac that field is expected to be set, but
    is not yet in ip_finish_output2.

    Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure")
    Reported-by: syzbot
    Signed-off-by: Willem de Bruijn
    Acked-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann

    Willem de Bruijn
     

18 Jan, 2019

3 commits


17 Jan, 2019

2 commits


16 Jan, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Fix regression in multi-SKB responses to RTM_GETADDR, from Arthur
    Gautier.

    2) Fix ipv6 frag parsing in openvswitch, from Yi-Hung Wei.

    3) Unbounded recursion in ipv4 and ipv6 GUE tunnels, from Stefano
    Brivio.

    4) Use after free in hns driver, from Yonglong Liu.

    5) icmp6_send() needs to handle the case of NULL skb, from Eric
    Dumazet.

    6) Missing rcu read lock in __inet6_bind() when operating on mapped
    addresses, from David Ahern.

    7) Memory leak in tipc_nl_compat_publ_dump(), from Gustavo A. R. Silva.

    8) Fix PHY vs r8169 module loading ordering issues, from Heiner
    Kallweit.

    9) Fix bridge vlan memory leak, from Ido Schimmel.

    10) Dev refcount leak in AF_PACKET, from Jason Gunthorpe.

    11) Infoleak in ipv6_local_error(), flow label isn't completely
    initialized. From Eric Dumazet.

    12) Handle mv88e6390 errata, from Andrew Lunn.

    13) Making vhost/vsock CID hashing consistent, from Zha Bin.

    14) Fix lack of UMH cleanup when it unexpectedly exits, from Taehee Yoo.

    15) Bridge forwarding must clear skb->tstamp, from Paolo Abeni.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
    bnxt_en: Fix context memory allocation.
    bnxt_en: Fix ring checking logic on 57500 chips.
    mISDN: hfcsusb: Use struct_size() in kzalloc()
    net: clear skb->tstamp in bridge forwarding path
    net: bpfilter: disallow to remove bpfilter module while being used
    net: bpfilter: restart bpfilter_umh when error occurred
    net: bpfilter: use cleanup callback to release umh_info
    umh: add exit routine for UMH process
    isdn: i4l: isdn_tty: Fix some concurrency double-free bugs
    vhost/vsock: fix vhost vsock cid hashing inconsistent
    net: stmmac: Prevent RX starvation in stmmac_napi_poll()
    net: stmmac: Fix the logic of checking if RX Watchdog must be enabled
    net: stmmac: Check if CBS is supported before configuring
    net: stmmac: dwxgmac2: Only clear interrupts that are active
    net: stmmac: Fix PCI module removal leak
    tools/bpf: fix bpftool map dump with bitfields
    tools/bpf: test btf bitfield with >=256 struct member offset
    bpf: fix bpffs bitfield pretty print
    net: ethernet: mediatek: fix warning in phy_start_aneg
    tcp: change txhash on SYN-data timeout
    ...

    Linus Torvalds
     

12 Jan, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2019-01-11

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) Fix TCP-BPF support for correctly setting the initial window
    via TCP_BPF_IW on an active TFO sender, from Yuchung.

    2) Fix a panic in BPF's stack_map_get_build_id()'s ELF parsing on
    32 bit archs caused by page_address() returning NULL, from Song.

    3) Fix BTF pretty print in kernel and bpftool when bitfield member
    offset is greater than 256. Also add test cases, from Yonghong.

    4) Fix improper argument handling in xdp1 sample, from Ioana.

    5) Install missing tcp_server.py and tcp_client.py files from
    BPF selftests, from Anders.

    6) Add test_libbpf to gitignore in libbpf and BPF selftests,
    from Stanislav.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jan, 2019

2 commits

  • This fixes false-positive kmemleak reports about leaked neighbour entries:

    unreferenced object 0xffff8885c6e4d0a8 (size 1024):
    comm "softirq", pid 0, jiffies 4294922664 (age 167640.804s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 20 2c f3 83 ff ff ff ff ........ ,......
    08 c0 ef 5f 84 88 ff ff 01 8c 7d 02 01 00 00 00 ..._......}.....
    backtrace:
    [] ip6_finish_output2+0x887/0x1e40
    [] ip6_output+0x1ba/0x600
    [] ip6_send_skb+0x92/0x2f0
    [] udp_v6_send_skb.isra.24+0x680/0x15e0
    [] udpv6_sendmsg+0x18c9/0x27a0
    [] sock_sendmsg+0xb3/0xf0
    [] ___sys_sendmsg+0x745/0x8f0
    [] __sys_sendmsg+0xde/0x170
    [] do_syscall_64+0x9b/0x400
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     
  • The existing BPF TCP initial congestion window (TCP_BPF_IW) does not
    work on an (active) Fast Open sender. This is because it changes the
    (initial) window only if data_segs_out is zero -- but data_segs_out
    is also incremented on SYN-data. This patch fixes the issue by
    properly accounting for SYN-data as well.
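The accounting can be sketched in userspace as follows. This is only an illustration under assumptions: `toy_tcp` is a hypothetical struct with two fields named after the counters the message mentions, and the helper names and exact comparisons are illustrative rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant TCP socket state. */
struct toy_tcp {
    uint32_t data_segs_out; /* data segments sent so far, SYN-data included */
    bool syn_data;          /* true if the SYN carried data (active TFO) */
};

/* Check as described for the old code: any sent data segment,
 * including SYN-data, blocks the initial window update. */
static bool iw_settable_old(const struct toy_tcp *tp, int val)
{
    assert(tp != NULL);
    return val > 0 && tp->data_segs_out == 0;
}

/* Check that accounts for SYN-data: one segment is tolerated when,
 * and only when, that segment was the data carried on the SYN. */
static bool iw_settable_new(const struct toy_tcp *tp, int val)
{
    assert(tp != NULL);
    return val > 0 && tp->data_segs_out <= (tp->syn_data ? 1u : 0u);
}
```

For an active Fast Open sender that has sent only SYN-data, the old check rejects the update and the new one accepts it; once further data has gone out, both reject it.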

    Fixes: fc7478103c84 ("bpf: Adds support for setting initial cwnd")
    Signed-off-by: Yuchung Cheng
    Reviewed-by: Neal Cardwell
    Acked-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    Yuchung Cheng
     

06 Jan, 2019

1 commit

  • Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".

    The jump label is controlled by HAVE_JUMP_LABEL, which is defined
    like this:

    #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
    # define HAVE_JUMP_LABEL
    #endif

    We can improve this by testing 'asm goto' support in Kconfig, then
    make JUMP_LABEL depend on CC_HAS_ASM_GOTO.

    The ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
    match the real kernel capability.

    Signed-off-by: Masahiro Yamada
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Sedat Dilek

    Masahiro Yamada
     

05 Jan, 2019

1 commit

  • Commit dcda9b04713c ("mm, tree wide: replace __GFP_REPEAT by
    __GFP_RETRY_MAYFAIL with more useful semantic") replaced __GFP_REPEAT in
    alloc_skb_with_frags() with __GFP_RETRY_MAYFAIL when the allocation may
    directly reclaim.

    The previous behavior would require reclaim up to 1 << order pages for
    skb aligned header_len of order > PAGE_ALLOC_COSTLY_ORDER before failing,
    otherwise the allocations in alloc_skb() would loop in the page allocator
    looking for memory. __GFP_RETRY_MAYFAIL makes both allocations failable
    under memory pressure, including for the HEAD allocation.

    This can cause, among many other things, write() to fail with ENOTCONN
    during RPC when under memory pressure.

    These allocations should succeed as they did previous to dcda9b04713c
    even if it requires calling the oom killer and additional looping in the
    page allocator to find memory. There is no way to specify the previous
    behavior of __GFP_REPEAT, but it's unlikely to be necessary since the
    previous behavior only guaranteed that 1 << order pages would be reclaimed
    before failing for order > PAGE_ALLOC_COSTLY_ORDER. That reclaim is not
    guaranteed to be contiguous memory, so repeating for such large orders is
    usually not beneficial.

    Remove the setting of __GFP_RETRY_MAYFAIL to restore the previous
    behavior, specifically not allowing alloc_skb() to fail for small orders
    and invoking the oom killer if necessary rather than allowing RPCs to fail.

    Fixes: dcda9b04713c ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic")
    Signed-off-by: David Rientjes
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David Rientjes
     

04 Jan, 2019

1 commit

  • Pull networking fixes from David Miller:
    "Several fixes here. Basically split down the line between newly
    introduced regressions and long existing problems:

    1) Double free in tipc_enable_bearer(), from Cong Wang.

    2) Many fixes to nf_conncount, from Florian Westphal.

    3) op->get_regs_len() can throw an error, check it, from Yunsheng
    Lin.

    4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman
    driver, from Scott Wood.

    5) Infinite loop in fib_empty_table(), from Yue Haibing.

    6) Use after free in ax25_fillin_cb(), from Cong Wang.

    7) Fix socket locking in nr_find_socket(), also from Cong Wang.

    8) Fix WoL wakeup enable in r8169, from Heiner Kallweit.

    9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani.

    10) Fix ptr_ring wrap during queue swap, from Cong Wang.

    11) Missing shutdown callback in hinic driver, from Xue Chaojing.

    12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano
    Brivio.

    13) BPF out of bounds speculation fixes from Daniel Borkmann"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
    ipv6: Consider sk_bound_dev_if when binding a socket to an address
    ipv6: Fix dump of specific table with strict checking
    bpf: add various test cases to selftests
    bpf: prevent out of bounds speculation on pointer arithmetic
    bpf: fix check_map_access smin_value test when pointer contains offset
    bpf: restrict unknown scalars of mixed signed bounds for unprivileged
    bpf: restrict stack pointer arithmetic for unprivileged
    bpf: restrict map value pointer arithmetic for unprivileged
    bpf: enable access to ax register also from verifier rewrite
    bpf: move tmp variable into ax register in interpreter
    bpf: move {prev_,}insn_idx into verifier env
    isdn: fix kernel-infoleak in capi_unlocked_ioctl
    ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
    net/hamradio/6pack: use mod_timer() to rearm timers
    net-next/hinic:add shutdown callback
    net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT
    ip: validate header length on virtual device xmit
    tap: call skb_probe_transport_header after setting skb->dev
    ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()
    net: rds: remove unnecessary NULL check
    ...

    Linus Torvalds
     

02 Jan, 2019

1 commit

  • Al Viro mentioned (Message-ID
    )
    that there is probably a race condition
    lurking in accesses of sk_stamp on 32-bit machines.

    sock->sk_stamp is of type ktime_t which is always an s64.
    On a 32 bit architecture, we might run into situations of
    unsafe access as the access to the field becomes non atomic.

    Use seqlocks for synchronization.
    This allows us to avoid using spinlocks for readers as
    readers do not need mutual exclusion.

    Another approach to solve this is to require sk_lock for all
    modifications of the timestamps. The current approach allows
    for timestamps to have their own lock: sk_stamp_lock.
    This allows for the patch to not compete with already
    existing critical sections, and side effects are limited
    to the paths in the patch.
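The reader/writer protocol behind a seqlock can be sketched in userspace with C11 atomics. This is a minimal illustration, not the kernel's seqlock API: `toy_stamp`, `stamp_write` and `stamp_read` are hypothetical names, the 64-bit value is deliberately split into two 32-bit halves to model a 32-bit machine, and writers are assumed to be serialized by some other means (a real seqlock pairs the counter with a spinlock for that).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit timestamp stored as two 32-bit halves. */
struct toy_stamp {
    atomic_uint seq;       /* even: stable, odd: write in progress */
    _Atomic uint32_t lo;
    _Atomic uint32_t hi;
};

/* Writer: bump seq to odd, update both halves, bump seq back to even. */
static void stamp_write(struct toy_stamp *s, int64_t t)
{
    unsigned seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

    assert((seq & 1u) == 0); /* no other write may be in progress */
    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->lo, (uint32_t)t, memory_order_relaxed);
    atomic_store_explicit(&s->hi, (uint32_t)((uint64_t)t >> 32),
                          memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader: takes no lock; retries if a write overlapped the two loads. */
static int64_t stamp_read(struct toy_stamp *s)
{
    unsigned before, after;
    uint32_t lo, hi;

    do {
        before = atomic_load_explicit(&s->seq, memory_order_acquire);
        lo = atomic_load_explicit(&s->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&s->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);

    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

Readers never block writers and never see a half-updated value: a read that overlaps a write observes either an odd counter or a changed counter and simply retries.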

    The addition of the new field maintains the data locality
    optimizations from
    commit 9115e8cd2a0c ("net: reorganize struct sock for better data
    locality")

    Note that all the instances of the sk_stamp accesses
    are either through the ioctl or the syscall recvmsg.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: David S. Miller

    Deepa Dinamani
     

31 Dec, 2018

1 commit

  • We must have an address to look up, otherwise we'll dereference a null
    pointer in the ndo_fdb_get callbacks.

    CC: Roopa Prabhu
    CC: David Ahern
    Reported-by: syzbot+017b1f61c82a1c3e7efd@syzkaller.appspotmail.com
    Fixes: 5b2f94b27622 ("net: rtnetlink: support for fdb get")
    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

29 Dec, 2018

2 commits

  • The return type of get_regs_len in struct ethtool_ops is int, and
    the hns3 driver may return an error when it fails to get the regs
    len by sending a cmd to firmware.

    Signed-off-by: Yunsheng Lin
    Signed-off-by: David S. Miller

    Yunsheng Lin
     
  • Pull block updates from Jens Axboe:
    "This is the main pull request for block/storage for 4.21.

    Larger than usual, it was a busy round with lots of goodies queued up.
    Most notable is the removal of the old IO stack, which has been a long
    time coming. No new features for a while, everything coming in this
    week has all been fixes for things that were previously merged.

    This contains:

    - Use atomic counters instead of semaphores for mtip32xx (Arnd)

    - Cleanup of the mtip32xx request setup (Christoph)

    - Fix for circular locking dependency in loop (Jan, Tetsuo)

    - bcache (Coly, Guoju, Shenghui)
    * Optimizations for writeback caching
    * Various fixes and improvements

    - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
    * host and target support for NVMe over TCP
    * Error log page support
    * Support for separate read/write/poll queues
    * Much improved polling
    * discard OOM fallback
    * Tracepoint improvements

    - lightnvm (Hans, Hua, Igor, Matias, Javier)
    * Igor added packed metadata to pblk. Now drives without metadata
    per LBA can be used as well.
    * Fix from Geert on uninitialized value on chunk metadata reads.
    * Fixes from Hans and Javier to pblk recovery and write path.
    * Fix from Hua Su to fix a race condition in the pblk recovery
    code.
    * Scan optimization added to pblk recovery from Zhoujie.
    * Small geometry cleanup from me.

    - Conversion of the last few drivers that used the legacy path to
    blk-mq (me)

    - Removal of legacy IO path in SCSI (me, Christoph)

    - Removal of legacy IO stack and schedulers (me)

    - Support for much better polling, now without interrupts at all.
    blk-mq adds support for multiple queue maps, which enables us to
    have a map per type. This in turn enables nvme to have separate
    completion queues for polling, which can then be interrupt-less.
    Also means we're ready for async polled IO, which is hopefully
    coming in the next release.

    - Killing of (now) unused block exports (Christoph)

    - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

    - Support for zoned testing with null_blk (Masato)

    - sx8 conversion to per-host tag sets (Christoph)

    - IO priority improvements (Damien)

    - mq-deadline zoned fix (Damien)

    - Ref count blkcg series (Dennis)

    - Lots of blk-mq improvements and speedups (me)

    - sbitmap scalability improvements (me)

    - Make core inflight IO accounting per-cpu (Mikulas)

    - Export timeout setting in sysfs (Weiping)

    - Cleanup the direct issue path (Jianchao)

    - Export blk-wbt internals in block debugfs for easier debugging
    (Ming)

    - Lots of other fixes and improvements"

    * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
    kyber: use sbitmap add_wait_queue/list_del wait helpers
    sbitmap: add helpers for add/del wait queue handling
    block: save irq state in blkg_lookup_create()
    dm: don't reuse bio for flushes
    nvme-pci: trace SQ status on completions
    nvme-rdma: implement polling queue map
    nvme-fabrics: allow user to pass in nr_poll_queues
    nvme-fabrics: allow nvmf_connect_io_queue to poll
    nvme-core: optionally poll sync commands
    block: make request_to_qc_t public
    nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
    nvme-tcp: fix endianess annotations
    nvmet-tcp: fix endianess annotations
    nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
    nvme-pci: only set nr_maps to 2 if poll queues are supported
    nvmet: use a macro for default error location
    nvmet: fix comparison of a u16 with -1
    blk-mq: enable IO poll if .nr_queues of type poll > 0
    blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
    blk-mq: skip zero-queue maps in blk_mq_map_swqueue
    ...

    Linus Torvalds
     

28 Dec, 2018

1 commit

  • Pull networking updates from David Miller:

    1) New ipset extensions for matching on destination MAC addresses, from
    Stefano Brivio.

    2) Add ipv4 ttl and tos, plus ipv6 flow label and hop limit offloads to
    nfp driver. From Stefano Brivio.

    3) Implement GRO for plain UDP sockets, from Paolo Abeni.

    4) Lots of work from Michał Mirosław to eliminate the VLAN_TAG_PRESENT
    bit so that we could support the entire vlan_tci value.

    5) Rework the IPSEC policy lookups to better optimize more usecases,
    from Florian Westphal.

    6) Infrastructure changes eliminating direct manipulation of SKB lists
    wherever possible, and to always use the appropriate SKB list
    helpers. This work is still ongoing...

    7) Lots of PHY driver and state machine improvements and
    simplifications, from Heiner Kallweit.

    8) Various TSO deferral refinements, from Eric Dumazet.

    9) Add ntuple filter support to aquantia driver, from Dmitry Bogdanov.

    10) Batch dropping of XDP packets in tuntap, from Jason Wang.

    11) Lots of cleanups and improvements to the r8169 driver from Heiner
    Kallweit, including support for ->xmit_more. This driver has been
    getting some much needed love since he started working on it.

    12) Lots of new forwarding selftests from Petr Machata.

    13) Enable VXLAN learning in mlxsw driver, from Ido Schimmel.

    14) Packed ring support for virtio, from Tiwei Bie.

    15) Add new Aquantia AQtion USB driver, from Dmitry Bezrukov.

    16) Add XDP support to dpaa2-eth driver, from Ioana Ciocoi Radulescu.

    17) Implement coalescing on TCP backlog queue, from Eric Dumazet.

    18) Implement carrier change in tun driver, from Nicolas Dichtel.

    19) Support msg_zerocopy in UDP, from Willem de Bruijn.

    20) Significantly improve garbage collection of neighbor objects when
    the table has many PERMANENT entries, from David Ahern.

    21) Remove egdev usage from nfp and mlx5, and remove the facility
    completely from the tree as it no longer has any users. From Oz
    Shlomo and others.

    22) Add a NETDEV_PRE_CHANGEADDR so that drivers can veto the change and
    therefore abort the operation before the commit phase (which is the
    NETDEV_CHANGEADDR event). From Petr Machata.

    23) Add indirect call wrappers to avoid retpoline overhead, and use them
    in the GRO code paths. From Paolo Abeni.

    24) Add support for netlink FDB get operations, from Roopa Prabhu.

    25) Support bloom filter in mlxsw driver, from Nir Dotan.

    26) Add SKB extension infrastructure. This consolidates the handling of
    the auxiliary SKB data used by IPSEC and bridge netfilter, and is
    designed to support the needs of MPTCP which could be integrated in
    the future.

    27) Lots of XDP TX optimizations in mlx5 from Tariq Toukan.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1845 commits)
    net: dccp: fix kernel crash on module load
    drivers/net: appletalk/cops: remove redundant if statement and mask
    bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
    net/net_namespace: Check the return value of register_pernet_subsys()
    net/netlink_compat: Fix a missing check of nla_parse_nested
    ieee802154: lowpan_header_create check must check daddr
    net/mlx4_core: drop useless LIST_HEAD
    mlxsw: spectrum: drop useless LIST_HEAD
    net/mlx5e: drop useless LIST_HEAD
    iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
    net/mlx5e: fix semicolon.cocci warnings
    staging: octeon: fix build failure with XFRM enabled
    net: Revert recent Spectre-v1 patches.
    can: af_can: Fix Spectre v1 vulnerability
    packet: validate address length if non-zero
    nfc: af_nfc: Fix Spectre v1 vulnerability
    phonet: af_phonet: Fix Spectre v1 vulnerability
    net: core: Fix Spectre v1 vulnerability
    net: minor cleanup in skb_ext_add()
    net: drop the unused helper skb_ext_get()
    ...

    Linus Torvalds
     

27 Dec, 2018

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The biggest RCU changes in this cycle were:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions to
    their vanilla RCU counterparts. This series is a step towards
    complete removal of the RCU-bh and RCU-sched update-side functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for rcutorture
    testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein for a
    bag-on-head-class bug.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    rcutorture: Don't do busted forward-progress testing
    rcutorture: Use 100ms buckets for forward-progress callback histograms
    rcutorture: Recover from OOM during forward-progress tests
    rcutorture: Print forward-progress test age upon failure
    rcutorture: Print time since GP end upon forward-progress failure
    rcutorture: Print histogram of CB invocation at OOM time
    rcutorture: Print GP age upon forward-progress failure
    rcu: Print per-CPU callback counts for forward-progress failures
    rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
    rcutorture: Dump grace-period diagnostics upon forward-progress OOM
    rcutorture: Prepare for asynchronous access to rcu_fwd_startat
    torture: Remove unnecessary "ret" variables
    rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
    rcutorture: Break up too-long rcu_torture_fwd_prog() function
    rcutorture: Remove cbflood facility
    torture: Bring any extra CPUs online during kernel startup
    rcutorture: Add call_rcu() flooding forward-progress tests
    rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
    tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
    net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
    ...

    Linus Torvalds
     

25 Dec, 2018

2 commits


24 Dec, 2018

1 commit

  • This reverts:

    50d5258634ae ("net: core: Fix Spectre v1 vulnerability")
    d686026b1e6e ("phonet: af_phonet: Fix Spectre v1 vulnerability")
    a95386f0390a ("nfc: af_nfc: Fix Spectre v1 vulnerability")
    a3ac5817ffe8 ("can: af_can: Fix Spectre v1 vulnerability")

    After some discussion with Alexei Starovoitov these all seem to
    be completely unnecessary.

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Dec, 2018

1 commit

  • flen is indirectly controlled by user-space, hence leading to
    a potential exploitation of the Spectre variant 1 vulnerability.

    This issue was detected with the help of Smatch:

    net/core/filter.c:1101 bpf_check_classic() warn: potential spectre issue 'filter' [w]

    Fix this by sanitizing flen before using it to index filter at line 1101:

    switch (filter[flen - 1].code) {

    and through pc at line 1040:

    const struct sock_filter *ftest = &filter[pc];

    Notice that given that speculation windows are large, the policy is
    to kill the speculation on the first load and not worry if it can be
    completed with a dependent load/store [1].

    [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
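The idea of sanitizing an index before the first load can be sketched in userspace as a branch-free clamp. This is an illustration only: `index_mask_nospec` and `index_nospec` are hypothetical names that mirror the general masking idea, not the kernel's own helper, and a compiler is in principle free to turn this plain C back into a branch, so it should not be read as a hardened implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns all ones when index < size and 0 otherwise, without a
 * conditional branch. Valid for sizes up to half the range of size_t. */
static size_t index_mask_nospec(size_t index, size_t size)
{
    /* The top bit of (index | (size - 1 - index)) is set exactly when
     * index >= size, because the subtraction wraps around. */
    size_t top = (index | (size - 1 - index)) >> (sizeof(size_t) * CHAR_BIT - 1);

    return top - 1; /* top == 0 gives all ones, top == 1 gives 0 */
}

/* Clamps an out-of-range index to 0 so that even a mispredicted bounds
 * check cannot feed an attacker-chosen index into the load. */
static size_t index_nospec(size_t index, size_t size)
{
    assert(size > 0);
    return index & index_mask_nospec(index, size);
}
```

In the spirit of the message above, a bounds-checked value such as `flen - 1` or `pc` would be passed through the clamp once, right before it is used to index `filter`.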

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     

22 Dec, 2018

4 commits

  • David S. Miller
     
  • When the extension to be added is already present, the only
    skb field we may need to update is 'extensions': we can reorder
    the code and avoid a branch.

    v1 -> v2:
    - be sure to flag the newly added extension as active

    Signed-off-by: Paolo Abeni
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • On copy-on-write (COW) we can free the old extension, so we must
    avoid dereferencing it after skb_ext_maybe_cow(). Since the contents
    of 'new' are always equal to those of 'old' after the copy, we can
    fix this by accessing the relevant data through 'new'.

    Fixes: df5042f4c5b9 ("sk_buff: add skb extension infrastructure")
    Signed-off-by: Paolo Abeni
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Paolo Abeni
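
The rule in this entry (read through 'new' once the copy has been made) can be sketched in plain C. This is an assumption-laden illustration, not the skb extension code: `struct ext`, `ext_maybe_cow()` and `ext_append()` are hypothetical, and the reference count is a plain integer where the kernel uses an atomic one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared, reference-counted extension area. */
struct ext {
	int refcnt;
	unsigned char used;
	unsigned char data[16];
};

/*
 * Return an extension the caller may modify. If 'old' is shared, copy it
 * and drop our reference to it; after that 'old' may be freed at any time.
 */
static struct ext *ext_maybe_cow(struct ext *old)
{
	struct ext *new;

	if (old->refcnt == 1)
		return old;

	new = malloc(sizeof(*new));
	if (!new)
		return NULL;
	memcpy(new, old, sizeof(*new));
	new->refcnt = 1;

	if (--old->refcnt == 0)
		free(old);
	return new;
}

/* Append one byte; returns the extension the caller now owns, or NULL. */
static struct ext *ext_append(struct ext *old, unsigned char byte)
{
	struct ext *new = ext_maybe_cow(old);

	if (!new)
		return NULL;
	/* From here on only 'new' is used; it holds the same contents as 'old'. */
	if (new->used < sizeof(new->data))
		new->data[new->used++] = byte;
	return new;
}
```

Reading `old->used` after the call to `ext_maybe_cow()` would be the kind of access this entry removes.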
     
  • Fix sk_msg_clone() to prevent overflow of 'dst' while adding
    pages to scatterlist entries. The overflow of 'dst' causes a crash in the
    kernel tls module during record encryption.

    The crash fixed by this patch:

    [ 78.796119] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
    [ 78.804900] Mem abort info:
    [ 78.807683] ESR = 0x96000004
    [ 78.810744] Exception class = DABT (current EL), IL = 32 bits
    [ 78.816677] SET = 0, FnV = 0
    [ 78.819727] EA = 0, S1PTW = 0
    [ 78.822873] Data abort info:
    [ 78.825759] ISV = 0, ISS = 0x00000004
    [ 78.829600] CM = 0, WnR = 0
    [ 78.832576] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000bf8ee311
    [ 78.839195] [0000000000000008] pgd=0000000000000000
    [ 78.844081] Internal error: Oops: 96000004 [#1] PREEMPT SMP
    [ 78.849642] Modules linked in: tls xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables xt_CHECKSUM cpve cpufreq_conservative lm90 ina2xx crct10dif_ce
    [ 78.865377] CPU: 0 PID: 6007 Comm: openssl Not tainted 4.20.0-rc6-01647-g754d5da63145-dirty #107
    [ 78.874149] Hardware name: LS1043A RDB Board (DT)
    [ 78.878844] pstate: 60000005 (nZCv daif -PAN -UAO)
    [ 78.883632] pc : scatterwalk_copychunks+0x164/0x1c8
    [ 78.888500] lr : scatterwalk_copychunks+0x160/0x1c8
    [ 78.893366] sp : ffff00001d04b600
    [ 78.896668] x29: ffff00001d04b600 x28: ffff80006814c680
    [ 78.901970] x27: 0000000000000000 x26: ffff80006c8de786
    [ 78.907272] x25: ffff00001d04b760 x24: 000000000000001a
    [ 78.912573] x23: 0000000000000006 x22: ffff80006814e440
    [ 78.917874] x21: 0000000000000100 x20: 0000000000000000
    [ 78.923175] x19: 000081ffffffffff x18: 0000000000000400
    [ 78.928476] x17: 0000000000000008 x16: 0000000000000000
    [ 78.933778] x15: 0000000000000100 x14: 0000000000000001
    [ 78.939079] x13: 0000000000001080 x12: 0000000000000020
    [ 78.944381] x11: 0000000000001080 x10: 00000000ffff0002
    [ 78.949683] x9 : ffff80006814c248 x8 : 00000000ffff0000
    [ 78.954985] x7 : ffff80006814c318 x6 : ffff80006c8de786
    [ 78.960286] x5 : 0000000000000f80 x4 : ffff80006c8de000
    [ 78.965588] x3 : 0000000000000000 x2 : 0000000000001086
    [ 78.970889] x1 : ffff7e0001b74e02 x0 : 0000000000000000
    [ 78.976192] Process openssl (pid: 6007, stack limit = 0x00000000291367f9)
    [ 78.982968] Call trace:
    [ 78.985406] scatterwalk_copychunks+0x164/0x1c8
    [ 78.989927] skcipher_walk_next+0x28c/0x448
    [ 78.994099] skcipher_walk_done+0xfc/0x258
    [ 78.998187] gcm_encrypt+0x434/0x4c0
    [ 79.001758] tls_push_record+0x354/0xa58 [tls]
    [ 79.006194] bpf_exec_tx_verdict+0x1e4/0x3e8 [tls]
    [ 79.010978] tls_sw_sendmsg+0x650/0x780 [tls]
    [ 79.015326] inet_sendmsg+0x2c/0xf8
    [ 79.018806] sock_sendmsg+0x18/0x30
    [ 79.022284] __sys_sendto+0x104/0x138
    [ 79.025935] __arm64_sys_sendto+0x24/0x30
    [ 79.029936] el0_svc_common+0x60/0xe8
    [ 79.033588] el0_svc_handler+0x2c/0x80
    [ 79.037327] el0_svc+0x8/0xc
    [ 79.040200] Code: 6b01005f 54fff788 940169b1 f9000320 (b9400801)
    [ 79.046283] ---[ end trace 74db007d069c1cf7 ]---

    Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
    Signed-off-by: Vakul Garg
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Vakul Garg
     

21 Dec, 2018

8 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2018-12-21

    The following pull-request contains BPF updates for your *net-next* tree.

    There is a merge conflict in test_verifier.c. Result looks as follows:

    [...]
          },
          {
                  "calls: cross frame pruning",
                  .insns = {
                  [...]
                  .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                  .errstr_unpriv = "function calls to other bpf functions are allowed for root only",
                  .result_unpriv = REJECT,
                  .errstr = "!read_ok",
                  .result = REJECT,
          },
          {
                  "jset: functional",
                  .insns = {
                  [...]
          {
                  "jset: unknown const compare not taken",
                  .insns = {
                          BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
                                       BPF_FUNC_get_prandom_u32),
                          BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1),
                          BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0),
                          BPF_EXIT_INSN(),
                  },
                  .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                  .errstr_unpriv = "!read_ok",
                  .result_unpriv = REJECT,
                  .errstr = "!read_ok",
                  .result = REJECT,
          },
    [...]
          {
                  "jset: range",
                  .insns = {
                  [...]
                  },
                  .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                  .result_unpriv = ACCEPT,
                  .result = ACCEPT,
          },

    The main changes are:

    1) Various BTF related improvements in order to get line info
    working. Meaning, verifier will now annotate the corresponding
    BPF C code to the error log, from Martin and Yonghong.

    2) Implement support for raw BPF tracepoints in modules, from Matt.

    3) Add several improvements to verifier state logic, namely speeding
    up stacksafe check, optimizations for stack state equivalence
    test and safety checks for liveness analysis, from Alexei.

    4) Teach verifier to make use of BPF_JSET instruction, add several
    test cases to kselftests and remove nfp specific JSET optimization
    now that verifier has awareness, from Jakub.

    5) Improve BPF verifier's slot_type marking logic in order to
    allow more stack slot sharing, from Jiong.

    6) Add sk_msg->size member for context access and add set of fixes
    and improvements to make sock_map with kTLS usable with openssl
    based applications, from John.

    7) Several cleanups and documentation updates in bpftool as well as
    auto-mount of tracefs for "bpftool prog tracelog" command,
    from Quentin.

    8) Include sub-program tags from now on in bpf_prog_info in order to
    have a reliable way for user space to get all tags of the program
    e.g. needed for kallsyms correlation, from Song.

    9) Add BTF annotations for cgroup_local_storage BPF maps and
    implement bpf fs pretty print support, from Roman.

    10) Fix bpftool in order to allow for cross-compilation, from Ivan.

    11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order
    to be compatible with libbfd and allow for Debian packaging,
    from Jakub.

    12) Remove an obsolete prog->aux sanitation in dump and get rid of
    version check for prog load, from Daniel.

    13) Fix a memory leak in libbpf's line info handling, from Prashant.

    14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info
    does not get unaligned, from Jesper.

    15) Fix test_progs kselftest to work with older compilers which are less
    smart in optimizing (and thus throwing build error), from Stanislav.

    16) Cleanup and simplify AF_XDP socket teardown, from Björn.

    17) Fix sk lookup in BPF kselftest's test_sock_addr with regards
    to netns_id argument, from Andrey.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Currently the stray semicolon means that the final term in the addition
    is being missed. Fix this by removing it. Cleans up clang warning:

    net/core/neighbour.c:2821:9: warning: expression result unused [-Wunused-value]

    Fixes: 82cbb5c631a0 ("neighbour: register rtnl doit handler")
    Signed-off-by: Colin Ian King
    Acked-By: Roopa Prabhu
    Signed-off-by: David S. Miller

    Colin Ian King
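
The effect of the stray semicolon is easy to reproduce outside the kernel. The two functions below are a hypothetical reduction, not the code from net/core/neighbour.c; the parameter names are made up for the illustration. In the buggy variant the return statement ends at the first semicolon, and the last line becomes a separate expression statement whose result is unused, which is what the clang warning reports.

```c
#include <assert.h>
#include <stddef.h>

/* Buggy: the ';' after 'addr' ends the sum, so 'flags' is never added. */
static size_t msg_size_buggy(size_t hdr, size_t addr, size_t flags)
{
	return hdr
	     + addr;
	     + flags;	/* separate statement: unary plus, result unused */
}

/* Fixed: with the stray ';' removed, all three terms are summed. */
static size_t msg_size_fixed(size_t hdr, size_t addr, size_t flags)
{
	return hdr
	     + addr
	     + flags;
}
```

With arguments 16, 32 and 4 the buggy variant returns 48 and the fixed one returns 52.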
     
  • In addition to releasing any cork'ed data on a psock when the psock
    is removed we should also release any skb's in the ingress work queue.
    Otherwise the skb's eventually get free'd but late in the tear
    down process so we see the WARNING due to non-zero sk_forward_alloc.

    void sk_stream_kill_queues(struct sock *sk)
    {
            ...
            WARN_ON(sk->sk_forward_alloc);
            ...
    }

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend
     
  • When an skb verdict program is in use and either another BPF program
    redirects to that socket or the new SK_PASS support is used, the
    data_ready callback does not wake up the application. Instead, because
    the stream parser/verdict is using the sk data_ready callback, we wake
    up the stream parser/verdict block.

    Fix this by adding a helper to check if the stream parser block is
    enabled on the sk and, if so, call the saved pointer, which is the
    upper layer's wake-up function.

    This fixes application stalls observed when an application is waiting
    for data in a blocking read().

    Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend
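
The saved-pointer dispatch described in this entry can be sketched with ordinary function pointers. Everything here is hypothetical (`struct demo_sock`, `parser_attach()`, `notify_app()` and the counters); it only models the idea that attaching a parser replaces the socket's data_ready callback and keeps the original, so waking the application means calling the saved one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_sock;
typedef void (*ready_fn)(struct demo_sock *sk);

/* Hypothetical socket with a replaceable data_ready callback. */
struct demo_sock {
	ready_fn data_ready;		/* currently installed callback */
	ready_fn saved_data_ready;	/* upper layer's callback, kept on attach */
	bool parser_enabled;
	int app_wakeups;
	int parser_wakeups;
};

static void app_wake(struct demo_sock *sk)    { sk->app_wakeups++; }
static void parser_wake(struct demo_sock *sk) { sk->parser_wakeups++; }

/* Attaching the parser takes over data_ready and remembers the old one. */
static void parser_attach(struct demo_sock *sk)
{
	sk->saved_data_ready = sk->data_ready;
	sk->data_ready = parser_wake;
	sk->parser_enabled = true;
}

/* Wake the application even when the parser owns data_ready. */
static void notify_app(struct demo_sock *sk)
{
	if (sk->parser_enabled)
		sk->saved_data_ready(sk);
	else
		sk->data_ready(sk);
}
```

Calling `sk->data_ready` directly after `parser_attach()` would wake the parser again, which corresponds to the application stall this entry describes.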
     
  • Add SK_PASS verdict support to SK_SKB_VERDICT programs. Now that
    support for redirects exists we can implement SK_PASS as a redirect
    to the same socket. This simplifies the BPF programs and avoids an
    extra map lookup on RX path for simple visibility cases.

    Further, this reduces user (BPF programmer in this context) confusion
    when their program drops an skb due to lack of support.

    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend
     
  • Enforce comment on structure layout dependency with a BUILD_BUG_ON
    to ensure the condition is maintained.

    Suggested-by: Daniel Borkmann
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend
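
Turning a layout comment into a compile-time check can be sketched in userspace with C11 `_Static_assert`. The macro below only mimics the kernel's BUILD_BUG_ON, and `struct record` and its layout rule are hypothetical; the entry does not say which structure or condition the real check covers.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace approximation: fail the build when 'cond' is true. */
#define BUILD_BUG_ON(cond) \
	_Static_assert(!(cond), "BUILD_BUG_ON failed: " #cond)

/* Hypothetical structure with a layout rule that used to be comment-only. */
struct record {
	unsigned int tag;	/* must stay first: readers cast the struct pointer */
	unsigned int len;
};

static unsigned int record_tag(const struct record *r)
{
	/* The build breaks here if someone moves 'tag' away from offset 0. */
	BUILD_BUG_ON(offsetof(struct record, tag) != 0);
	return *(const unsigned int *)r;
}
```

Placing the check next to the code that depends on the layout keeps the assumption visible where it matters.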
     
  • The check for max offset in sk_msg_is_valid_access uses sizeof(),
    which is incorrect because, when the struct is padded, it would
    allow access past the end of the last field. Further, it doesn't
    preclude accessing any padding that may be added in the middle of
    a struct. All told, this makes it fragile to rely on.

    To fix this explicitly check offsets with fields using the
    bpf_ctx_range() and bpf_ctx_range_till() macros.

    For reference the current structure layout looks as follows (reported
    by pahole)

    struct sk_msg_md {
            union {
                    void * data;              /* 8 */
            };                                /* 0 8 */
            union {
                    void * data_end;          /* 8 */
            };                                /* 8 8 */
            __u32 family;                     /* 16 4 */
            __u32 remote_ip4;                 /* 20 4 */
            __u32 local_ip4;                  /* 24 4 */
            __u32 remote_ip6[4];              /* 28 16 */
            __u32 local_ip6[4];               /* 44 16 */
            __u32 remote_port;                /* 60 4 */
            /* --- cacheline 1 boundary (64 bytes) --- */
            __u32 local_port;                 /* 64 4 */
            __u32 size;                       /* 68 4 */

            /* size: 72, cachelines: 2, members: 10 */
            /* last cacheline: 8 bytes */
    };

    So there should be no padding at the moment but fixing this now
    prevents future errors.

    Reported-by: Alexei Starovoitov
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend
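
The difference between a sizeof()-based bound and per-field ranges shows up as soon as a struct has padding. The sketch below uses a hypothetical `struct demo_ctx` that does have tail padding (unlike sk_msg_md today) and hypothetical helpers in place of the kernel's bpf_ctx_range() macros, so it illustrates the reasoning only, not the verifier code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context: 'b' ends at offset 5, sizeof is 8 on common ABIs. */
struct demo_ctx {
	uint32_t a;	/* bytes 0..3 */
	uint8_t b;	/* byte 4, followed by tail padding */
};

#define FIELD_START(T, m) offsetof(T, m)
#define FIELD_END(T, m)   (offsetof(T, m) + sizeof(((T *)0)->m))

/* Fragile check: anything inside the struct, padding included, is allowed. */
static bool access_ok_sizeof(size_t off, size_t size)
{
	return off <= sizeof(struct demo_ctx) &&
	       size <= sizeof(struct demo_ctx) - off;
}

/* True when [off, off + size) lies entirely inside [start, end). */
static bool within(size_t off, size_t size, size_t start, size_t end)
{
	return off >= start && off <= end && size <= end - off;
}

/* Explicit check: the access must fall inside one named field. */
static bool access_ok_fields(size_t off, size_t size)
{
	return within(off, size, FIELD_START(struct demo_ctx, a),
		      FIELD_END(struct demo_ctx, a)) ||
	       within(off, size, FIELD_START(struct demo_ctx, b),
		      FIELD_END(struct demo_ctx, b));
}
```

A one-byte access at offset 5 lands in padding: the sizeof()-based check accepts it wherever the struct is 8 bytes, while the per-field check rejects it.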
     
  • Lots of conflicts, but happily all cases of overlapping
    changes, parallel adds, things of that nature.

    Thanks to Stephen Rothwell, Saeed Mahameed, and others
    for their guidance in these resolutions.

    Signed-off-by: David S. Miller

    David S. Miller