02 Sep, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Fix some length checks during OGM processing in batman-adv, from
    Sven Eckelmann.

    2) Fix regression that caused netfilter conntrack sysctls to not be
    per-netns any more. From Florian Westphal.

    3) Use after free in netpoll, from Feng Sun.

    4) Guard destruction of pfifo_fast per-cpu qdisc stats with
    qdisc_is_percpu_stats(), from Davide Caratti. Similar bug is fixed
    in pfifo_fast_enqueue().

    5) Fix memory leak in mld_del_delrec(), from Eric Dumazet.

    6) Handle neigh events on internal ports correctly in nfp, from John
    Hurley.

    7) Clear SKB timestamp in NF flow table code so that it does not
    confuse fq scheduler. From Florian Westphal.

    8) taprio destroy can crash if it is invoked in a failure path of
    taprio_init(), because the list head isn't set up properly yet and
    the list del is unconditional. Perform the list add earlier to
    address this. From Vladimir Oltean.

    9) Make sure to reapply vlan filters on device up, in aquantia driver.
    From Dmitry Bogdanov.

    10) sgiseeq driver releases DMA memory using free_page() instead of
    dma_free_attrs(). From Christophe JAILLET.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
    net: seeq: Fix the function used to release some memory in an error handling path
    enetc: Add missing call to 'pci_free_irq_vectors()' in probe and remove functions
    net: bcmgenet: use ethtool_op_get_ts_info()
    tc-testing: don't hardcode 'ip' in nsPlugin.py
    net: dsa: microchip: add KSZ8563 compatibility string
    dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
    net: aquantia: fix out of memory condition on rx side
    net: aquantia: linkstate irq should be oneshot
    net: aquantia: reapply vlan filters on up
    net: aquantia: fix limit of vlan filters
    net: aquantia: fix removal of vlan 0
    net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate
    taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
    taprio: Fix kernel panic in taprio_destroy
    net: dsa: microchip: fill regmap_config name
    rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]
    net: stmmac: dwmac-rk: Don't fail if phy regulator is absent
    amd-xgbe: Fix error path in xgbe_mod_init()
    netfilter: nft_meta_bridge: Fix get NFT_META_BRI_IIFVPROTO in network byteorder
    mac80211: Correctly set noencrypt for PAE frames
    ...

    Linus Torvalds
     

01 Sep, 2019

4 commits

  • The discussion to be made is absolutely the same as in the case of
    previous patch ("taprio: Set default link speed to 10 Mbps in
    taprio_set_picos_per_byte"). Nothing is lost when setting a default.

    Cc: Leandro Dorileo
    Fixes: e0a7683d30e9 ("net/sched: cbs: fix port_rate miscalculation")
    Acked-by: Vinicius Costa Gomes
    Signed-off-by: Vladimir Oltean
    Signed-off-by: David S. Miller

    Vladimir Oltean
     
  • The taprio budget needs to be adapted at runtime according to interface
    link speed. But that handling is problematic.

    For one thing, installing a qdisc on an interface that doesn't have
    carrier is not illegal. But taprio prints the following stack trace:

    [ 31.851373] ------------[ cut here ]------------
    [ 31.856024] WARNING: CPU: 1 PID: 207 at net/sched/sch_taprio.c:481 taprio_dequeue+0x1a8/0x2d4
    [ 31.864566] taprio: dequeue() called with unknown picos per byte.
    [ 31.864570] Modules linked in:
    [ 31.873701] CPU: 1 PID: 207 Comm: tc Not tainted 5.3.0-rc5-01199-g8838fe023cd6 #1689
    [ 31.881398] Hardware name: Freescale LS1021A
    [ 31.885661] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [ 31.893368] [] (show_stack) from [] (dump_stack+0xb4/0xc8)
    [ 31.900555] [] (dump_stack) from [] (__warn+0xe0/0xf8)
    [ 31.907395] [] (__warn) from [] (warn_slowpath_fmt+0x48/0x6c)
    [ 31.914841] [] (warn_slowpath_fmt) from [] (taprio_dequeue+0x1a8/0x2d4)
    [ 31.923150] [] (taprio_dequeue) from [] (__qdisc_run+0x90/0x61c)
    [ 31.930856] [] (__qdisc_run) from [] (net_tx_action+0x12c/0x2bc)
    [ 31.938560] [] (net_tx_action) from [] (__do_softirq+0x130/0x3c8)
    [ 31.946350] [] (__do_softirq) from [] (irq_exit+0xbc/0xd8)
    [ 31.953536] [] (irq_exit) from [] (__handle_domain_irq+0x60/0xb4)
    [ 31.961328] [] (__handle_domain_irq) from [] (gic_handle_irq+0x58/0x9c)
    [ 31.969638] [] (gic_handle_irq) from [] (__irq_svc+0x6c/0x90)
    [ 31.977076] Exception stack(0xe8167b20 to 0xe8167b68)
    [ 31.982100] 7b20: e9d4bd80 00000cc0 000000cf 00000000 e9d4bd80 c1f38958 00000cc0 c1f38960
    [ 31.990234] 7b40: 00000001 000000cf 00000004 e9dc0800 00000000 e8167b70 c0f478ec c0f46d94
    [ 31.998363] 7b60: 60070013 ffffffff
    [ 32.001833] [] (__irq_svc) from [] (netlink_trim+0x18/0xd8)
    [ 32.009104] [] (netlink_trim) from [] (netlink_broadcast_filtered+0x34/0x414)
    [ 32.017930] [] (netlink_broadcast_filtered) from [] (netlink_broadcast+0x20/0x28)
    [ 32.027102] [] (netlink_broadcast) from [] (rtnetlink_send+0x34/0x88)
    [ 32.035238] [] (rtnetlink_send) from [] (notify_and_destroy+0x2c/0x44)
    [ 32.043461] [] (notify_and_destroy) from [] (qdisc_graft+0x398/0x470)
    [ 32.051595] [] (qdisc_graft) from [] (tc_modify_qdisc+0x3a4/0x724)
    [ 32.059470] [] (tc_modify_qdisc) from [] (rtnetlink_rcv_msg+0x260/0x2ec)
    [ 32.067864] [] (rtnetlink_rcv_msg) from [] (netlink_rcv_skb+0xb8/0x110)
    [ 32.076172] [] (netlink_rcv_skb) from [] (netlink_unicast+0x1b4/0x22c)
    [ 32.084392] [] (netlink_unicast) from [] (netlink_sendmsg+0x33c/0x380)
    [ 32.092614] [] (netlink_sendmsg) from [] (sock_sendmsg+0x14/0x24)
    [ 32.100403] [] (sock_sendmsg) from [] (___sys_sendmsg+0x214/0x228)
    [ 32.108279] [] (___sys_sendmsg) from [] (__sys_sendmsg+0x50/0x8c)
    [ 32.116068] [] (__sys_sendmsg) from [] (ret_fast_syscall+0x0/0x54)
    [ 32.123938] Exception stack(0xe8167fa8 to 0xe8167ff0)
    [ 32.128960] 7fa0: b6fa68c8 000000f8 00000003 bea142d0 00000000 00000000
    [ 32.137093] 7fc0: b6fa68c8 000000f8 0052154c 00000128 5d6468a2 00000000 00000028 00558c9c
    [ 32.145224] 7fe0: 00000070 bea14278 00530d64 b6e17e64
    [ 32.150659] ---[ end trace 2139c9827c3e5177 ]---

    This happens because the qdisc ->dequeue callback gets called. Which
    again is not illegal, the qdisc will dequeue even when the interface is
    up but doesn't have carrier (and hence SPEED_UNKNOWN), and the frames
    will be dropped further down the stack in dev_direct_xmit().

    And, at the end of the day, for what? For calculating the initial budget
    of an interface which is non-operational at the moment and where frames
    will get dropped anyway.

    So if we can't figure out the link speed, default to SPEED_10 and move
    along. We can also remove the runtime check now.
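
    The fallback can be illustrated with a small userspace sketch. It is
    not the code from sch_taprio.c: SPEED_UNKNOWN and SPEED_10 are
    redefined locally with the values of the ethtool constants of the same
    name, and picos_per_byte() is a hypothetical helper that converts a
    link speed in Mbps into picoseconds per byte.

    ```c
    #include <stdint.h>

    /* Local stand-ins for the ethtool constants of the same name. */
    #define SPEED_UNKNOWN (-1)
    #define SPEED_10      10

    /*
     * Hypothetical helper: time needed to send one byte, in picoseconds,
     * at the given link speed in Mbps. An unknown or non-positive speed
     * falls back to 10 Mbps instead of being treated as an error.
     */
    static int64_t picos_per_byte(int speed_mbps)
    {
        if (speed_mbps == SPEED_UNKNOWN || speed_mbps <= 0)
            speed_mbps = SPEED_10;

        /* 8 bits per byte * 10^12 ps per second / (Mbps * 10^6 bits per second) */
        return (8 * INT64_C(1000000000000)) / ((int64_t)speed_mbps * 1000000);
    }
    ```

    In this sketch a link without carrier gets the 10 Mbps value of
    800000 ps per byte, and a 1000 Mbps link gets 8000 ps per byte.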

    Cc: Leandro Dorileo
    Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
    Acked-by: Vinicius Costa Gomes
    Signed-off-by: Vladimir Oltean
    Signed-off-by: David S. Miller

    Vladimir Oltean
     
  • taprio_init may fail earlier than this line:

    list_add(&q->taprio_list, &taprio_list);

    i.e. due to the net device not being multi queue.

    Attempting to remove q from the global taprio_list when it is not part
    of it will result in a kernel panic.

    Fix it by matching list_add and list_del better to one another in the
    order of operations. This way we can keep the deletion unconditional
    and with lower complexity - O(1).
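
    A userspace sketch of the ordering issue, assuming the destroy path
    also runs after a failed init. The list helpers are modelled on the
    kernel's list_head but reimplemented here; registry, struct sched,
    sched_init() and sched_destroy() are hypothetical names, not taprio
    code.

    ```c
    #include <stddef.h>

    /* Minimal circular doubly linked list, modelled on the kernel's list_head. */
    struct list_head {
        struct list_head *next, *prev;
    };

    static void list_add(struct list_head *entry, struct list_head *head)
    {
        entry->next = head->next;
        entry->prev = head;
        head->next->prev = entry;
        head->next = entry;
    }

    static void list_del(struct list_head *entry)
    {
        /* Dereferences entry->prev and entry->next: entry must be on a list. */
        entry->prev->next = entry->next;
        entry->next->prev = entry->prev;
        entry->next = entry->prev = NULL;
    }

    /* Hypothetical global registry, standing in for taprio_list. */
    static struct list_head registry = { &registry, &registry };

    struct sched {
        struct list_head node;
        int num_queues;
    };

    static int sched_init(struct sched *s, int num_queues)
    {
        /* Add to the registry before any step that can fail ... */
        list_add(&s->node, &registry);

        if (num_queues < 2)
            return -1;          /* e.g. device is not multi queue */

        s->num_queues = num_queues;
        return 0;
    }

    static void sched_destroy(struct sched *s)
    {
        /* ... so that the unconditional, O(1) removal is always valid. */
        list_del(&s->node);
    }
    ```

    If list_add() came after the early return instead, sched_destroy()
    would follow the uninitialised pointers in s->node.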

    Cc: Leandro Dorileo
    Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
    Signed-off-by: Vladimir Oltean
    Acked-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vladimir Oltean
     
  • Simon Wunderlich says:

    ====================
    Here are two batman-adv bugfixes:

    - Fix OGM and OGMv2 header read boundary check,
    by Sven Eckelmann (2 patches)
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

31 Aug, 2019

4 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Spurious warning when loading rules using the physdev match,
    from Todd Seidelmann.

    2) Fix FTP conntrack helper debugging output, from Thomas Jarosch.

    3) Restore per-netns nf_conntrack_{acct,helper,timeout} sysctl knobs,
    from Florian Westphal.

    4) Clear skbuff timestamp from the flowtable datapath, also from Florian.

    5) Fix incorrect byteorder of NFT_META_BRI_IIFVPROTO, from wenxu.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When a local endpoint ceases to be in use, such as when the kafs module
    is unloaded, the kernel will emit an assertion failure if there are any
    outstanding client connections:

    rxrpc: Assertion failed
    ------------[ cut here ]------------
    kernel BUG at net/rxrpc/local_object.c:433!

    and even beyond that, will evince other oopses if there are service
    connections still present.

    Fix this by:

    (1) Removing the triggering of connection reaping when an rxrpc socket is
    released. These don't actually clean up the connections anyway - and
    further, the local endpoint may still be in use through another
    socket.

    (2) Mark the local endpoint as dead when we start the process of tearing
    it down.

    (3) When destroying a local endpoint, strip all of its client connections
    from the idle list and discard the ref on each that the list was
    holding.

    (4) When destroying a local endpoint, call the service connection reaper
    directly (rather than through a workqueue) to immediately kill off all
    outstanding service connections.

    (5) Make the service connection reaper reap connections for which the
    local endpoint is marked dead.

    Only after destroying the connections can we close the socket lest we get
    an oops in a workqueue that's looking at a connection or a peer.

    Fixes: 3d18cbb7fd0c ("rxrpc: Fix conn expiry timers")
    Signed-off-by: David Howells
    Tested-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells
     
  • David Howells says:

    ====================
    rxrpc: Fix use of skb_cow_data()

    Here's a series of patches that replaces the use of skb_cow_data() in rxrpc
    with skb_unshare() early on in the input process. The problem that is
    being seen is that skb_cow_data() indirectly requires that the maximum
    usage count on an sk_buff be 1, and it may generate an assertion failure in
    pskb_expand_head() if not.

    This can occur because rxrpc_input_data() may be still holding a ref when
    it has just attached the sk_buff to the rx ring and given that attachment
    its own ref. If recvmsg happens fast enough, skb_cow_data() can see the
    ref still held by the softirq handler.

    Further, a packet may contain multiple subpackets, each of which gets its
    own attachment to the ring and its own ref - also making skb_cow_data() go
    bang.

    Fix this by:

    (1) The DATA packet is currently parsed for subpackets twice by the input
    routines. Parse it just once instead and make notes in the sk_buff
    private data.

    (2) Use the notes from (1) when attaching the packet to the ring multiple
    times. Once the packet is attached to the ring, recvmsg can see it
    and start modifying it, so the softirq handler is not permitted to
    look inside it from that point.

    (3) Pass the ref from the input code to the ring rather than getting an
    extra ref. rxrpc_input_data() uses a ref on the second refcount to
    prevent the packet from evaporating under it.

    (4) Call skb_unshare() on secured DATA packets in rxrpc_input_packet()
    before we take call->input_lock. Other sorts of packets don't get
    modified and so can be left.

    A trace is emitted if skb_unshare() eats the skb. Note that
    skb_unshare() is awkward for our accounting in this regard, as we
    can't see the parameters in the packet to log in a trace line if it
    releases it.

    (5) Remove the calls to skb_cow_data(). These are then no longer
    necessary.

    There are also patches to improve the rxrpc_skb tracepoint to make sure
    that Tx-derived buffers are identified separately from Rx-derived buffers
    in the trace.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull two ceph fixes from Ilya Dryomov:
    "A fix for a -rc1 regression in rbd and a trivial static checker fix"

    * tag 'ceph-for-5.3-rc7' of git://github.com/ceph/ceph-client:
    rbd: restore zeroing past the overlap when reading from parent
    libceph: don't call crypto_free_sync_skcipher() on a NULL tfm

    Linus Torvalds
     

30 Aug, 2019

1 commit


29 Aug, 2019

10 commits

  • The noencrypt flag was intended to be set if the "frame was received
    unencrypted" according to include/uapi/linux/nl80211.h. However, the
    current behavior is opposite of this.

    Cc: stable@vger.kernel.org
    Fixes: 018f6fbf540d ("mac80211: Send control port frames over nl80211")
    Signed-off-by: Denis Kenzior
    Link: https://lore.kernel.org/r/20190827224120.14545-3-denkenz@gmail.com
    Signed-off-by: Johannes Berg

    Denis Kenzior
     
  • ieee80211_deliver_skb_to_local_stack intercepts EAPoL frames if
    mac80211 is configured to do so and forwards the contents over nl80211.
    During this process some additional data is also forwarded, including
    whether the frame was received encrypted or not. Unfortunately just
    prior to the call to ieee80211_deliver_skb_to_local_stack, skb->cb is
    cleared, resulting in incorrect data being exposed over nl80211.

    Fixes: 018f6fbf540d ("mac80211: Send control port frames over nl80211")
    Cc: stable@vger.kernel.org
    Signed-off-by: Denis Kenzior
    Link: https://lore.kernel.org/r/20190827224120.14545-2-denkenz@gmail.com
    Signed-off-by: Johannes Berg

    Denis Kenzior
     
  • If 'fq' qdisc is used and a program has requested timestamps,
    skb->tstamp needs to be cleared, else fq will treat these as
    'transmit time'.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
    'TCQ_F_NOLOCK' bit in the parent qdisc, we can't assume anymore that
    per-cpu counters are there in the error path of skb_array_produce().
    Otherwise, the following splat can be seen:

    Unable to handle kernel paging request at virtual address 0000600dea430008
    Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
    user pgtable: 64k pages, 48-bit VAs, pgdp = 000000007b97530e
    [0000600dea430008] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] SMP
    [...]
    pstate: 10000005 (nzcV daif -PAN -UAO)
    pc : pfifo_fast_enqueue+0x524/0x6e8
    lr : pfifo_fast_enqueue+0x46c/0x6e8
    sp : ffff800d39376fe0
    x29: ffff800d39376fe0 x28: 1ffff001a07d1e40
    x27: ffff800d03e8f188 x26: ffff800d03e8f200
    x25: 0000000000000062 x24: ffff800d393772f0
    x23: 0000000000000000 x22: 0000000000000403
    x21: ffff800cca569a00 x20: ffff800d03e8ee00
    x19: ffff800cca569a10 x18: 00000000000000bf
    x17: 0000000000000000 x16: 0000000000000000
    x15: 0000000000000000 x14: ffff1001a726edd0
    x13: 1fffe4000276a9a4 x12: 0000000000000000
    x11: dfff200000000000 x10: ffff800d03e8f1a0
    x9 : 0000000000000003 x8 : 0000000000000000
    x7 : 00000000f1f1f1f1 x6 : ffff1001a726edea
    x5 : ffff800cca56a53c x4 : 1ffff001bf9a8003
    x3 : 1ffff001bf9a8003 x2 : 1ffff001a07d1dcb
    x1 : 0000600dea430000 x0 : 0000600dea430008
    Process ping (pid: 6067, stack limit = 0x00000000dc0aa557)
    Call trace:
    pfifo_fast_enqueue+0x524/0x6e8
    htb_enqueue+0x660/0x10e0 [sch_htb]
    __dev_queue_xmit+0x123c/0x2de0
    dev_queue_xmit+0x24/0x30
    ip_finish_output2+0xc48/0x1720
    ip_finish_output+0x548/0x9d8
    ip_output+0x334/0x788
    ip_local_out+0x90/0x138
    ip_send_skb+0x44/0x1d0
    ip_push_pending_frames+0x5c/0x78
    raw_sendmsg+0xed8/0x28d0
    inet_sendmsg+0xc4/0x5c0
    sock_sendmsg+0xac/0x108
    __sys_sendto+0x1ac/0x2a0
    __arm64_sys_sendto+0xc4/0x138
    el0_svc_handler+0x13c/0x298
    el0_svc+0x8/0xc
    Code: f9402e80 d538d081 91002000 8b010000 (885f7c03)

    Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
    before dereferencing 'qdisc->cpu_qstats'.
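
    The guard can be sketched in userspace as follows. This is a
    simplified model, not the pfifo_fast code: F_CPUSTATS is a
    hypothetical stand-in for TCQ_F_CPUSTATS, cpu_qstats is a plain
    pointer rather than a per-cpu one, and qdisc_count_drop() is an
    invented helper.

    ```c
    #include <stddef.h>

    #define F_CPUSTATS 0x1u  /* hypothetical stand-in for TCQ_F_CPUSTATS */

    struct qstats {
        unsigned int drops;
    };

    struct qdisc {
        unsigned int flags;
        struct qstats *cpu_qstats;  /* only valid when F_CPUSTATS is set */
        struct qstats qstats;       /* shared counters used otherwise */
    };

    static int qdisc_has_percpu_stats(const struct qdisc *q)
    {
        return (q->flags & F_CPUSTATS) != 0;
    }

    static void qdisc_count_drop(struct qdisc *q)
    {
        /* Test the flag first; never assume cpu_qstats was allocated. */
        if (qdisc_has_percpu_stats(q))
            q->cpu_qstats->drops++;
        else
            q->qstats.drops++;
    }
    ```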

    Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
    CC: Paolo Abeni
    CC: Stefano Brivio
    Reported-by: Li Shuang
    Signed-off-by: Davide Caratti
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • TCP associates tx timestamp requests with a byte in the bytestream.
    If merging skbs in tcp_mtu_probe, migrate the tstamp request.

    Similar to MSG_EOR, do not allow moving a timestamp from any segment
    in the probe but the last. This is to avoid merging multiple timestamps.

    Tested with the packetdrill script at
    https://github.com/wdebruij/packetdrill/commits/mtu_probe-1

    Link: http://patchwork.ozlabs.org/patch/1143278/#2232897
    Fixes: 4ed2d765dfac ("net-timestamp: TCP timestamping")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Action sample doesn't properly handle psample_group pointer in overwrite
    case. Following issues need to be fixed:

    - In tcf_sample_init() function RCU_INIT_POINTER() is used to set
    s->psample_group, even though we are neither setting the pointer to NULL
    nor preventing concurrent readers from accessing the pointer in some way.
    Use rcu_swap_protected() instead to safely reset the pointer.

    - Old value of s->psample_group is not released or deallocated in any way,
    which results in a resource leak. Use psample_group_put() on the non-NULL
    value obtained with rcu_swap_protected().

    - The function psample_group_put() that releases the reference to struct
    psample_group pointed to by rcu-pointer s->psample_group doesn't respect the
    rcu grace period when deallocating it. Extend struct psample_group with an
    rcu head and use kfree_rcu when freeing it.
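
    The first two points (swap the pointer, then drop the reference on
    the old value) can be sketched in userspace with C11 atomics. This is
    only an analogy: there is no RCU here, so the grace period that
    kfree_rcu provides is not modelled, the reference count is a plain
    int, and struct group, group_put() and slot_replace() are
    hypothetical names.

    ```c
    #include <stdatomic.h>
    #include <stdlib.h>

    struct group {
        int refcnt;
        int id;
    };

    /* Drop one reference; free the object when the last one goes away. */
    static void group_put(struct group *g)
    {
        if (g && --g->refcnt == 0)
            free(g);
    }

    /*
     * Publish new_group in *slot and release the reference the slot held
     * on the previous value, instead of silently overwriting it.
     */
    static void slot_replace(struct group *_Atomic *slot, struct group *new_group)
    {
        struct group *old = atomic_exchange(slot, new_group);

        group_put(old);  /* no-op when the slot was empty */
    }
    ```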

    Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
    Signed-off-by: Vlad Buslov
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • Only the first fragment in a datagram contains the L4 headers. When the
    Open vSwitch module parses a packet, it always sets the IP protocol
    field in the key, but can only set the L4 fields on the first fragment.
    The original behavior would not clear the L4 portion of the key, so
    garbage values would be sent in the key for "later" fragments. This
    patch clears the L4 fields in that circumstance to prevent sending those
    garbage values as part of the upcall.
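
    A simplified userspace sketch of the idea, assuming a key with only
    three fields. struct flow_key and key_fill() are hypothetical and
    much smaller than the real Open vSwitch key.

    ```c
    #include <stdint.h>

    struct flow_key {
        uint8_t  ip_proto;
        uint16_t tp_src;
        uint16_t tp_dst;
    };

    /*
     * Fill the key for one IP packet. frag_offset is 0 for an unfragmented
     * packet or a first fragment. The port arguments are ignored for later
     * fragments, which carry no L4 header.
     */
    static void key_fill(struct flow_key *key, uint8_t ip_proto,
                         unsigned int frag_offset,
                         uint16_t src_port, uint16_t dst_port)
    {
        key->ip_proto = ip_proto;

        if (frag_offset == 0) {
            key->tp_src = src_port;
            key->tp_dst = dst_port;
        } else {
            /* Clear the L4 part instead of leaving stale values behind. */
            key->tp_src = 0;
            key->tp_dst = 0;
        }
    }
    ```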

    Signed-off-by: Justin Pettit
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Justin Pettit
     
  • When IP fragments are reassembled before being sent to conntrack, the
    key from the last fragment is used. Unless there are reordering
    issues, the last fragment received will not contain the L4 ports, so the
    key for the reassembled datagram won't contain them. This patch updates
    the key once we have a reassembled datagram.

    The handle_fragments() function works on L3 headers so we pull the L3/L4
    flow key update code from key_extract into a new function
    'key_extract_l3l4'. Then we add another new function
    ovs_flow_key_update_l3l4() and export it so that it is accessible by
    handle_fragments() for conntrack packet reassembly.

    Co-authored-by: Justin Pettit
    Signed-off-by: Greg Rose
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Greg Rose
     
  • Similar to the fix done for IPv4 in commit e5b1c6c6277d
    ("igmp: fix memory leak in igmpv3_del_delrec()"), we need to
    make sure mca_tomb and mca_sources are not blindly overwritten.

    Using swap() then a call to ip6_mc_clear_src() will take care
    of the missing free.

    BUG: memory leak
    unreferenced object 0xffff888117d9db00 (size 64):
    comm "syz-executor247", pid 6918, jiffies 4294943989 (age 25.350s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 fe 88 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [] slab_post_alloc_hook mm/slab.h:522 [inline]
    [] slab_alloc mm/slab.c:3319 [inline]
    [] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3548
    [] kmalloc include/linux/slab.h:552 [inline]
    [] kzalloc include/linux/slab.h:748 [inline]
    [] ip6_mc_add1_src net/ipv6/mcast.c:2236 [inline]
    [] ip6_mc_add_src+0x31f/0x420 net/ipv6/mcast.c:2356
    [] ip6_mc_source+0x4a8/0x600 net/ipv6/mcast.c:449
    [] do_ipv6_setsockopt.isra.0+0x1b92/0x1dd0 net/ipv6/ipv6_sockglue.c:748
    [] ipv6_setsockopt+0x89/0xd0 net/ipv6/ipv6_sockglue.c:944
    [] udpv6_setsockopt+0x4e/0x90 net/ipv6/udp.c:1558
    [] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3139
    [] __sys_setsockopt+0x10f/0x220 net/socket.c:2084
    [] __do_sys_setsockopt net/socket.c:2100 [inline]
    [] __se_sys_setsockopt net/socket.c:2097 [inline]
    [] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2097
    [] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
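
    The swap-then-clear pattern can be sketched in userspace as follows.
    All names are hypothetical: struct mc_rec stands in for the multicast
    record, rec_clear() for the cleanup helper, and freed_lists exists
    only so that the effect is observable.

    ```c
    #include <stdlib.h>

    struct src_list {
        int n;
    };

    struct mc_rec {
        struct src_list *tomb;
        struct src_list *sources;
    };

    static int freed_lists;  /* counts lists released, for illustration */

    static void list_free(struct src_list *l)
    {
        if (l) {
            free(l);
            freed_lists++;
        }
    }

    /* Release both lists of a record and leave it empty. */
    static void rec_clear(struct mc_rec *r)
    {
        list_free(r->tomb);
        list_free(r->sources);
        r->tomb = NULL;
        r->sources = NULL;
    }

    static void swap_lists(struct src_list **a, struct src_list **b)
    {
        struct src_list *tmp = *a;

        *a = *b;
        *b = tmp;
    }

    /*
     * Move the saved lists into im. Whatever im held before ends up in
     * saved and is freed by rec_clear(), rather than being overwritten.
     */
    static void rec_restore(struct mc_rec *im, struct mc_rec *saved)
    {
        swap_lists(&im->tomb, &saved->tomb);
        swap_lists(&im->sources, &saved->sources);
        rec_clear(saved);
    }
    ```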

    Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when set link down")
    Fixes: 9c8bb163ae78 ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
    'TCQ_F_NOLOCK' bit in the parent qdisc, we need to be sure that per-cpu
    counters are present when 'reset()' is called for pfifo_fast qdiscs.
    Otherwise, the following script:

    # tc q a dev lo handle 1: root htb default 100
    # tc c a dev lo parent 1: classid 1:100 htb \
    > rate 95Mbit ceil 100Mbit burst 64k
    [...]
    # tc f a dev lo parent 1: protocol arp basic classid 1:100
    [...]
    # tc q a dev lo parent 1:100 handle 100: pfifo_fast
    [...]
    # tc q d dev lo root

    can generate the following splat:

    Unable to handle kernel paging request at virtual address dfff2c01bd148000
    Mem abort info:
    ESR = 0x96000004
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    [dfff2c01bd148000] address between user and kernel address ranges
    Internal error: Oops: 96000004 [#1] SMP
    [...]
    pstate: 80000005 (Nzcv daif -PAN -UAO)
    pc : pfifo_fast_reset+0x280/0x4d8
    lr : pfifo_fast_reset+0x21c/0x4d8
    sp : ffff800d09676fa0
    x29: ffff800d09676fa0 x28: ffff200012ee22e4
    x27: dfff200000000000 x26: 0000000000000000
    x25: ffff800ca0799958 x24: ffff1001940f332b
    x23: 0000000000000007 x22: ffff200012ee1ab8
    x21: 0000600de8a40000 x20: 0000000000000000
    x19: ffff800ca0799900 x18: 0000000000000000
    x17: 0000000000000002 x16: 0000000000000000
    x15: 0000000000000000 x14: 0000000000000000
    x13: 0000000000000000 x12: ffff1001b922e6e2
    x11: 1ffff001b922e6e1 x10: 0000000000000000
    x9 : 1ffff001b922e6e1 x8 : dfff200000000000
    x7 : 0000000000000000 x6 : 0000000000000000
    x5 : 1fffe400025dc45c x4 : 1fffe400025dc357
    x3 : 00000c01bd148000 x2 : 0000600de8a40000
    x1 : 0000000000000007 x0 : 0000600de8a40004
    Call trace:
    pfifo_fast_reset+0x280/0x4d8
    qdisc_reset+0x6c/0x370
    htb_reset+0x150/0x3b8 [sch_htb]
    qdisc_reset+0x6c/0x370
    dev_deactivate_queue.constprop.5+0xe0/0x1a8
    dev_deactivate_many+0xd8/0x908
    dev_deactivate+0xe4/0x190
    qdisc_graft+0x88c/0xbd0
    tc_get_qdisc+0x418/0x8a8
    rtnetlink_rcv_msg+0x3a8/0xa78
    netlink_rcv_skb+0x18c/0x328
    rtnetlink_rcv+0x28/0x38
    netlink_unicast+0x3c4/0x538
    netlink_sendmsg+0x538/0x9a0
    sock_sendmsg+0xac/0xf8
    ___sys_sendmsg+0x53c/0x658
    __sys_sendmsg+0xc8/0x140
    __arm64_sys_sendmsg+0x74/0xa8
    el0_svc_handler+0x164/0x468
    el0_svc+0x10/0x14
    Code: 910012a0 92400801 d343fc03 11000c21 (38fb6863)

    Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
    before dereferencing 'qdisc->cpu_qstats'.

    Changes since v1:
    - coding style improvements, thanks to Stefano Brivio

    Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
    CC: Paolo Abeni
    Reported-by: Li Shuang
    Signed-off-by: Davide Caratti
    Acked-by: Paolo Abeni
    Reviewed-by: Stefano Brivio
    Signed-off-by: David S. Miller

    Davide Caratti
     

28 Aug, 2019

8 commits

  • In set_secret(), key->tfm is assigned to NULL on line 55, and then
    ceph_crypto_key_destroy(key) is executed.

    ceph_crypto_key_destroy(key)
    crypto_free_sync_skcipher(key->tfm)
    crypto_free_skcipher(&tfm->base);

    This happens to work because crypto_sync_skcipher is a trivial wrapper
    around crypto_skcipher: &tfm->base is still 0 and crypto_free_skcipher()
    handles that. Let's not rely on the layout of crypto_sync_skcipher.

    This bug was found by STCheck, a static analysis tool written by us.
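
    A userspace sketch of the explicit guard, using hypothetical stand-in
    types. struct sync_cipher wraps struct cipher in the same way the
    message describes for crypto_sync_skcipher, and free_calls only makes
    the behaviour observable.

    ```c
    #include <stddef.h>

    struct cipher {
        int id;
    };

    /* Trivial wrapper: the inner object happens to sit at offset 0. */
    struct sync_cipher {
        struct cipher base;
    };

    struct crypto_key {
        struct sync_cipher *tfm;
    };

    static int free_calls;  /* counts real releases, for illustration */

    static void cipher_free(struct cipher *c)
    {
        if (c)
            free_calls++;
    }

    static void sync_cipher_free(struct sync_cipher *tfm)
    {
        /* Only yields NULL for a NULL tfm because base is the first member. */
        cipher_free(&tfm->base);
    }

    static void key_destroy(struct crypto_key *key)
    {
        /* Check explicitly rather than relying on the wrapper's layout. */
        if (key->tfm) {
            sync_cipher_free(key->tfm);
            key->tfm = NULL;
        }
    }
    ```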

    Fixes: 69d6302b65a8 ("libceph: Remove VLA usage of skcipher").
    Signed-off-by: Jia-Ju Bai
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jia-Ju Bai
     
  • Vladimir Rutsky reported stuck TCP sessions after memory pressure
    events. An edge-triggered epoll() user would never receive an EPOLLOUT
    notification allowing them to retry a sendmsg().

    Jason tested the case of sk_stream_alloc_skb() returning NULL,
    but there are other paths that could lead both sendmsg() and sendpage()
    to return -1 (EAGAIN), with an empty skb queued on the write queue.

    This patch makes sure we remove this empty skb so that
    Jason's code can detect that the queue is empty, and
    call sk->sk_write_space(sk) accordingly.

    Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
    Signed-off-by: Eric Dumazet
    Cc: Jason Baron
    Reported-by: Vladimir Rutsky
    Cc: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The rds6_inc_info_copy() function has a couple struct members which
    are leaking stack information. The ->tos field should hold actual
    information and the ->flags field needs to be zeroed out.
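
    The general remedy can be sketched in userspace: zero the whole
    structure before filling it, so that any field the function does not
    assign holds zero rather than leftover stack contents. struct
    inc_info and inc_info_fill() are hypothetical and do not mirror the
    real RDS layout.

    ```c
    #include <stdint.h>
    #include <string.h>

    struct inc_info {
        uint64_t seq;
        uint32_t len;
        uint8_t  tos;
        uint8_t  flags;
    };

    static void inc_info_fill(struct inc_info *out, uint64_t seq,
                              uint32_t len, uint8_t tos)
    {
        /* Start from all zeroes so unassigned fields never leak old data. */
        memset(out, 0, sizeof(*out));

        out->seq = seq;
        out->len = len;
        out->tos = tos;  /* carries real information */
        /* out->flags is intentionally left at zero */
    }
    ```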

    Fixes: 3eb450367d08 ("rds: add type of service(tos) infrastructure")
    Fixes: b7ff8b1036f0 ("rds: Extend RDS API for IPv6 support")
    Reported-by: 黄ID蝴蝶
    Signed-off-by: Dan Carpenter
    Signed-off-by: Ka-Cheong Poon
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
     
  • After commit baeababb5b85d5c4e6c917efe2a1504179438d3b
    ("tun: return NET_XMIT_DROP for dropped packets"),
    when tun_net_xmit drops packets, it frees the skb and returns NET_XMIT_DROP.
    netpoll_send_skb_on_dev then runs into the following use-after-free cases:
    1. retry netpoll_start_xmit with the freed skb;
    2. queue the freed skb in npinfo->txq.
    queue_process will also run into a use-after-free case.

    The first netpoll_send_skb_on_dev case was hit with the following kernel log:

    [ 117.864773] kernel BUG at mm/slub.c:306!
    [ 117.864773] invalid opcode: 0000 [#1] SMP PTI
    [ 117.864774] CPU: 3 PID: 2627 Comm: loop_printmsg Kdump: loaded Tainted: P OE 5.3.0-050300rc5-generic #201908182231
    [ 117.864775] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    [ 117.864775] RIP: 0010:kmem_cache_free+0x28d/0x2b0
    [ 117.864781] Call Trace:
    [ 117.864781] ? tun_net_xmit+0x21c/0x460
    [ 117.864781] kfree_skbmem+0x4e/0x60
    [ 117.864782] kfree_skb+0x3a/0xa0
    [ 117.864782] tun_net_xmit+0x21c/0x460
    [ 117.864782] netpoll_start_xmit+0x11d/0x1b0
    [ 117.864788] netpoll_send_skb_on_dev+0x1b8/0x200
    [ 117.864789] __br_forward+0x1b9/0x1e0 [bridge]
    [ 117.864789] ? skb_clone+0x53/0xd0
    [ 117.864790] ? __skb_clone+0x2e/0x120
    [ 117.864790] deliver_clone+0x37/0x50 [bridge]
    [ 117.864790] maybe_deliver+0x89/0xc0 [bridge]
    [ 117.864791] br_flood+0x6c/0x130 [bridge]
    [ 117.864791] br_dev_xmit+0x315/0x3c0 [bridge]
    [ 117.864792] netpoll_start_xmit+0x11d/0x1b0
    [ 117.864792] netpoll_send_skb_on_dev+0x1b8/0x200
    [ 117.864792] netpoll_send_udp+0x2c6/0x3e8
    [ 117.864793] write_msg+0xd9/0xf0 [netconsole]
    [ 117.864793] console_unlock+0x386/0x4e0
    [ 117.864793] vprintk_emit+0x17e/0x280
    [ 117.864794] vprintk_default+0x29/0x50
    [ 117.864794] vprintk_func+0x4c/0xbc
    [ 117.864794] printk+0x58/0x6f
    [ 117.864795] loop_fun+0x24/0x41 [printmsg_loop]
    [ 117.864795] kthread+0x104/0x140
    [ 117.864795] ? 0xffffffffc05b1000
    [ 117.864796] ? kthread_park+0x80/0x80
    [ 117.864796] ret_from_fork+0x35/0x40
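
    The ownership rule behind both cases can be sketched in userspace.
    This is a model, not the netpoll code: the TX_* values are local
    stand-ins mirroring the kernel's return codes, and struct pkt,
    xmit_consumed(), send_with_retry() and the two example drivers are
    hypothetical.

    ```c
    /* Local stand-ins mirroring the kernel's transmit return codes. */
    #define TX_OK   0x00
    #define TX_DROP 0x01  /* packet was dropped and freed by the driver */
    #define TX_MASK 0x0f
    #define TX_BUSY 0x10  /* driver did not take the packet */

    struct pkt {
        int xmit_calls;
        int freed;
        int queued;
    };

    typedef int (*xmit_fn)(struct pkt *);

    /* True when the driver took ownership of the packet (sent or freed). */
    static int xmit_consumed(int rc)
    {
        return rc < TX_MASK;
    }

    /* Example driver that drops and frees every packet. */
    static int drop_xmit(struct pkt *p)
    {
        p->xmit_calls++;
        p->freed = 1;
        return TX_DROP;
    }

    /* Example driver that is always busy and leaves the packet alone. */
    static int busy_xmit(struct pkt *p)
    {
        p->xmit_calls++;
        return TX_BUSY;
    }

    static int send_with_retry(struct pkt *p, xmit_fn xmit, int tries)
    {
        int rc = TX_BUSY;

        while (tries-- > 0) {
            rc = xmit(p);
            if (xmit_consumed(rc))
                return rc;  /* not ours any more: no retry, no queueing */
        }

        p->queued = 1;      /* still ours, so deferring it is safe */
        return rc;
    }
    ```

    Treating only TX_OK as "done" would retry or queue a packet that
    drop_xmit() has already freed, which is the pattern described above.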

    Signed-off-by: Feng Sun
    Signed-off-by: Xiaojun Zhao
    Signed-off-by: David S. Miller

    Feng Sun
     
  • After witnessing the discussion in https://lkml.org/lkml/2019/8/14/151
    w.r.t. ioctl extensibility, it became clear that the 3 RSV bits inside
    the DSA 802.1Q tag might suffer the same fate and be useless for further
    extension.

    So clearly specify that the reserved bits should currently be
    transmitted as zero and ignored on receive. The DSA tagger already does
    this (and always has), and is the only known user so far (no
    Wireshark dissection plugin, etc). So there should be no incompatibility
    to speak of.

    Fixes: 0471dd429cea ("net: dsa: tag_8021q: Create a stable binary format")
    Signed-off-by: Vladimir Oltean
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Vladimir Oltean
     
  • The net pointer in struct xt_tgdtor_param is not explicitly
    initialized and is therefore still NULL when it is dereferenced.
    So we have to find a way to pass the correct net pointer to
    ipt_destroy_target().

    The best way I find is just saving the net pointer inside the per
    netns struct tcf_idrinfo, which could make this patch smaller.

    Fixes: 0c66dc1ea3f0 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
    Reported-and-tested-by: itugrok@yahoo.com
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    Stable fixes:

    - Fix a page lock leak in nfs_pageio_resend()

    - Ensure O_DIRECT reports an error if the bytes read/written is 0

    - Don't handle errors if the bind/connect succeeded

    - Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was
    invalidated"

    Bugfixes:

    - Don't refresh attributes with mounted-on-file information

    - Fix return values for nfs4_file_open() and nfs_finish_open()

    - Fix pnfs layoutstats reporting of I/O errors

    - Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort
    for soft I/O errors when the user specifies a hard mount.

    - Various fixes to the error handling in sunrpc

    - Don't report writepage()/writepages() errors twice"

    * tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS: remove set but not used variable 'mapping'
    NFSv2: Fix write regression
    NFSv2: Fix eof handling
    NFS: Fix writepage(s) error handling to not report errors twice
    NFS: Fix spurious EIO read errors
    pNFS/flexfiles: Don't time out requests on hard mounts
    SUNRPC: Handle connection breakages correctly in call_status()
    Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated"
    SUNRPC: Handle EADDRINUSE and ENOBUFS correctly
    pNFS/flexfiles: Turn off soft RPC calls
    SUNRPC: Don't handle errors if the bind/connect succeeded
    NFS: On fatal writeback errors, we need to call nfs_inode_remove_request()
    NFS: Fix initialisation of I/O result struct in nfs_pgio_rpcsetup
    NFS: Ensure O_DIRECT reports an error if the bytes read/written is 0
    NFSv4/pnfs: Fix a page lock leak in nfs_pageio_resend()
    NFSv4: Fix return value in nfs_finish_open()
    NFSv4: Fix return values for nfs4_file_open()
    NFS: Don't refresh attributes with mounted-on-file information

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Use 32-bit index for tail calls in s390 bpf JIT, from Ilya
    Leoshkevich.

    2) Fix missed EPOLLOUT events in TCP, from Eric Dumazet. Same fix for
    SMC from Jason Baron.

    3) ipv6_mc_may_pull() should return 0 for malformed packets, not
    -EINVAL. From Stefano Brivio.

    4) Don't forget to unpin umem xdp pages in error path of
    xdp_umem_reg(). From Ivan Khoronzhuk.

    5) Fix sta object leak in mac80211, from Johannes Berg.

    6) Fix regression by not configuring PHYLINK on CPU port of bcm_sf2
    switches. From Florian Fainelli.

    7) Revert DMA sync removal from r8169 which was causing regressions on
    some MIPS Loongson platforms. From Heiner Kallweit.

    8) Use after free in flow dissector, from Jakub Sitnicki.

    9) Fix NULL derefs of net devices during ICMP processing across
    collect_md tunnels, from Hangbin Liu.

    10) proto_register() memory leaks, from Zhang Lin.

    11) Set NLM_F_MULTI flag in multipart netlink messages consistently,
    from John Fastabend.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits)
    r8152: Set memory to all 0xFFs on failed reg reads
    openvswitch: Fix conntrack cache with timeout
    ipv4: mpls: fix mpls_xmit for iptunnel
    nexthop: Fix nexthop_num_path for blackhole nexthops
    net: rds: add service level support in rds-info
    net: route dump netlink NLM_F_MULTI flag missing
    s390/qeth: reject oversized SNMP requests
    sock: fix potential memory leak in proto_register()
    MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT
    xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode
    ipv4/icmp: fix rt dst dev null pointer dereference
    openvswitch: Fix log message in ovs conntrack
    bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0
    bpf: fix use after free in prog symbol exposure
    bpf: fix precision tracking in presence of bpf2bpf calls
    flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH
    Revert "r8169: remove not needed call to dma_sync_single_for_device"
    ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev
    net/ncsi: Fix the payload copying for the request coming from Netlink
    qed: Add cleanup in qed_slowpath_start()
    ...

    Linus Torvalds
     

27 Aug, 2019

12 commits

  • When I merged the extension sysctl tables with the main one I forgot to
    reset them on netns creation. They currently read/write init_net settings.

    Fixes: d912dec12428 ("netfilter: conntrack: merge acct and helper sysctl table with main one")
    Fixes: cb2833ed0044 ("netfilter: conntrack: merge ecache and timestamp sysctl tables with main one")
    Reported-by: Shmulik Ladkani
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The find_pattern() debug output was printing the 'skip' character.
    This can be a NULL-byte and messes up further pr_debug() output.

    Output without the fix:
    kernel: nf_conntrack_ftp: Pattern matches!
    kernel: nf_conntrack_ftp: Skipped up to `nf_conntrack_ftp: find_pattern `PORT': dlen = 8
    kernel: nf_conntrack_ftp: find_pattern `EPRT': dlen = 8

    Output with the fix:
    kernel: nf_conntrack_ftp: Pattern matches!
    kernel: nf_conntrack_ftp: Skipped up to 0x0 delimiter!
    kernel: nf_conntrack_ftp: Match succeeded!
    kernel: nf_conntrack_ftp: conntrack_ftp: match `172,17,0,100,200,207' (20 bytes at 4150681645)
    kernel: nf_conntrack_ftp: find_pattern `PORT': dlen = 8

    Signed-off-by: Thomas Jarosch
    Signed-off-by: Pablo Neira Ayuso

    Thomas Jarosch
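
The effect described in this commit can be reproduced in user space. The following is a minimal sketch, not the kernel code: the helper names `format_with_char` and `format_with_hex` are hypothetical, and `snprintf()` stands in for `pr_debug()`. It shows that a NUL 'skip' character printed with `%c` cuts the visible message short, while printing the character's numeric value does not.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the message the old way, printing the
 * 'skip' character itself. Returns the visible length of the result.
 */
static size_t format_with_char(char *buf, size_t len, char skip)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "Skipped up to `%c'!", skip);
    /* A NUL skip character terminates the string right after the backtick. */
    return strlen(buf);
}

/*
 * Hypothetical helper: format the message the new way, printing the
 * numeric value of the 'skip' character instead.
 */
static size_t format_with_hex(char *buf, size_t len, char skip)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "Skipped up to 0x%hhx delimiter!", (unsigned char)skip);
    return strlen(buf);
}
```

Calling `format_with_char()` with `'\0'` leaves only the text up to the backtick visible, which matches the truncated "Skipped up to `" line in the log above. `format_with_hex()` yields the complete "Skipped up to 0x0 delimiter!" message.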
     
  • Simplify the check in physdev_mt_check() to emit an error message
    only when passed an invalid chain (i.e., NF_INET_LOCAL_OUT).
    This avoids cluttering up the log with errors against valid rules.

    For large/heavily modified rulesets, current behavior can quickly
    overwhelm the ring buffer, because this function gets called on
    every change, regardless of the rule that was changed.

    Signed-off-by: Todd Seidelmann
    Acked-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Todd Seidelmann
     
  • The in-place decryption routines in AF_RXRPC's rxkad security module
    currently call skb_cow_data() to make sure the data isn't shared and that
    the skb can be written over. This has a problem, however, as the softirq
    handler may still be holding a ref or the Rx ring may be holding multiple
    refs when skb_cow_data() is called in rxkad_verify_packet() - and so
    skb_shared() returns true and __pskb_pull_tail() dislikes that. If this
    occurs, something like the following report will be generated.

    kernel BUG at net/core/skbuff.c:1463!
    ...
    RIP: 0010:pskb_expand_head+0x253/0x2b0
    ...
    Call Trace:
    __pskb_pull_tail+0x49/0x460
    skb_cow_data+0x6f/0x300
    rxkad_verify_packet+0x18b/0xb10 [rxrpc]
    rxrpc_recvmsg_data.isra.11+0x4a8/0xa10 [rxrpc]
    rxrpc_kernel_recv_data+0x126/0x240 [rxrpc]
    afs_extract_data+0x51/0x2d0 [kafs]
    afs_deliver_fs_fetch_data+0x188/0x400 [kafs]
    afs_deliver_to_call+0xac/0x430 [kafs]
    afs_wait_for_call_to_complete+0x22f/0x3d0 [kafs]
    afs_make_call+0x282/0x3f0 [kafs]
    afs_fs_fetch_data+0x164/0x300 [kafs]
    afs_fetch_data+0x54/0x130 [kafs]
    afs_readpages+0x20d/0x340 [kafs]
    read_pages+0x66/0x180
    __do_page_cache_readahead+0x188/0x1a0
    ondemand_readahead+0x17d/0x2e0
    generic_file_read_iter+0x740/0xc10
    __vfs_read+0x145/0x1a0
    vfs_read+0x8c/0x140
    ksys_read+0x4a/0xb0
    do_syscall_64+0x43/0xf0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fix this by using skb_unshare() instead in the input path for DATA packets
    that have a security index != 0. Non-DATA packets don't need in-place
    decryption and neither do unencrypted DATA packets.

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Reported-by: Julian Wollrath
    Signed-off-by: David Howells

    David Howells
     
  • Use the previously-added transmit-phase skbuff private flag to simplify the
    socket buffer tracing a bit. Which phase the skbuff comes from can now be
    divined from the skb rather than having to be guessed from the call state.

    We can also reduce the number of rxrpc_skb_trace values by eliminating the
    difference between Tx and Rx in the symbols.

    Signed-off-by: David Howells

    David Howells
     
  • Add a flag in the private data on an skbuff to indicate that this is a
    transmission-phase buffer rather than a receive-phase buffer.

    Signed-off-by: David Howells

    David Howells
     
  • Abstract out rxtx ring cleanup into its own function from its two callers.
    This makes it easier to apply the same changes to both.

    Signed-off-by: David Howells

    David Howells
     
  • Pass the reference held on a DATA skb in the rxrpc input handler into the
    Rx ring rather than getting an additional ref for this and then dropping
    the original ref at the end.

    Signed-off-by: David Howells

    David Howells
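
The difference between taking an extra reference and handing over the existing one can be sketched in plain C. This is an illustration only: `struct buf`, `struct ring` and all function names below are hypothetical stand-ins, not the rxrpc or skbuff API, and the counter is not atomic as it would be in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted buffer standing in for an skb. */
struct buf { int refs; };

static void buf_get(struct buf *b) { assert(b && b->refs > 0); b->refs++; }
static void buf_put(struct buf *b) { assert(b && b->refs > 0); b->refs--; }

/* Hypothetical one-slot ring standing in for the Rx ring. */
struct ring { struct buf *slot; };

/*
 * Old pattern: the ring takes its own reference and the caller then
 * drops the one it held. Returns the number of refcount operations.
 */
static int queue_with_extra_ref(struct ring *r, struct buf *b)
{
    int ops = 0;

    buf_get(b);
    ops++;
    r->slot = b;
    buf_put(b); /* caller's original ref, dropped at the end */
    ops++;
    return ops;
}

/*
 * New pattern: the caller's reference is passed into the ring. The
 * caller's pointer is cleared so it cannot drop the ref a second time.
 */
static int queue_by_handoff(struct ring *r, struct buf **bp)
{
    r->slot = *bp;
    *bp = NULL;
    return 0; /* no refcount operations needed */
}
```

Both variants leave the buffer with the same reference count. The hand-off version simply avoids the get/put pair and makes ownership explicit by clearing the caller's pointer.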
     
  • Use the information now cached in the skbuff private data to avoid the need
    to reparse a jumbo packet. We can find all the subpackets by dead
    reckoning, so it's only necessary to note how many there are, whether the
    last one is flagged as LAST_PACKET and whether any have the REQUEST_ACK
    flag set.

    This is necessary as once recvmsg() can see the packet, it can start
    modifying it, such as doing in-place decryption.

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: David Howells

    David Howells
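
Locating subpackets by dead reckoning can be sketched as follows. This is a simplified illustration, not the patch itself. It assumes that every subpacket except the last carries a fixed 1412 bytes of data followed by a 4-byte jumbo header; these sizes and all names below are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes: fixed data length per subpacket plus a jumbo header. */
enum {
    JUMBO_DATALEN   = 1412,
    JUMBO_HDRLEN    = 4,
    JUMBO_SUBPKTLEN = JUMBO_DATALEN + JUMBO_HDRLEN
};

/* Hypothetical result type: where a subpacket's data starts and how long it is. */
struct subpkt_span {
    size_t offset;
    size_t len;
};

/*
 * Compute the position of subpacket 'index' without reparsing the packet.
 * 'data_offset' and 'data_len' describe the whole jumbo payload, and
 * 'nr_subpackets' is the count noted when the packet was first parsed.
 */
static struct subpkt_span locate_subpacket(size_t data_offset, size_t data_len,
                                           unsigned int nr_subpackets,
                                           unsigned int index)
{
    struct subpkt_span span = { data_offset, data_len };

    assert(nr_subpackets > 0 && index < nr_subpackets);
    assert(data_len >= (size_t)(nr_subpackets - 1) * JUMBO_SUBPKTLEN);

    /* Skip over the preceding fixed-size subpackets and their headers. */
    span.offset += (size_t)index * JUMBO_SUBPKTLEN;
    span.len    -= (size_t)index * JUMBO_SUBPKTLEN;

    /* Every subpacket but the last has the fixed data length. */
    if (index < nr_subpackets - 1)
        span.len = JUMBO_DATALEN;

    return span;
}
```

Because the position follows from the index alone, nothing in the packet body has to be read again, which matters once recvmsg() may have decrypted the data in place.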
     
  • Improve the information stored about jumbo packets so that we don't need to
    reparse them so much later.

    Signed-off-by: David Howells
    Reviewed-by: Jeffrey Altman

    David Howells
     
  • If the connection breaks while we're waiting for a reply from the
    server, then we want to immediately try to reconnect.

    Fixes: ec6017d90359 ("SUNRPC fix regression in umount of a secure mount")
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This reverts commit a79f194aa4879e9baad118c3f8bb2ca24dbef765.
    The mechanism for aborting I/O is racy, since we are not guaranteed that
    the request is asleep while we're changing both task->tk_status and
    task->tk_action.

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org # v5.1

    Trond Myklebust