18 Aug, 2016

6 commits

  • Pull networking fixes from David Miller:

    1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
    Fietkau.

    2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

    3) Fix some tg3 ethtool logic bugs, and one that would cause no
    interrupts to be generated when rx-coalescing is set to 0. From
    Satish Baddipadige and Siva Reddy Kallam.

    4) QLCNIC mailbox corruption and napi budget handling fix from Manish
    Chopra.

    5) Fix fib_trie logic when walking the trie during /proc/net/route
    output than can access a stale node pointer. From David Forster.

    6) Several sctp_diag fixes from Phil Sutter.

    7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

    8) Checksum fixup fixes in bpf from Daniel Borkmann.

    9) Memork leaks in nfnetlink, from Liping Zhang.

    10) Use after free in rxrpc, from David Howells.

    11) Use after free in new skb_array code of macvtap driver, from Jason
    Wang.

    12) Calipso resource leak, from Colin Ian King.

    13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

    14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

    15) Fix lockdep splats in macsec, from Sabrina Dubroca.

    16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
    handling.

    17) Various tc-action bug fixes, from CONG Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    net_sched: allow flushing tc police actions
    net_sched: unify the init logic for act_police
    net_sched: convert tcf_exts from list to pointer array
    net_sched: move tc offload macros to pkt_cls.h
    net_sched: fix a typo in tc_for_each_action()
    net_sched: remove an unnecessary list_del()
    net_sched: remove the leftover cleanup_a()
    mlxsw: spectrum: Allow packets to be trapped from any PG
    mlxsw: spectrum: Unmap 802.1Q FID before destroying it
    mlxsw: spectrum: Add missing rollbacks in error path
    mlxsw: reg: Fix missing op field fill-up
    mlxsw: spectrum: Trap loop-backed packets
    mlxsw: spectrum: Add missing packet traps
    mlxsw: spectrum: Mark port as active before registering it
    mlxsw: spectrum: Create PVID vPort before registering netdevice
    mlxsw: spectrum: Remove redundant errors from the code
    mlxsw: spectrum: Don't return upon error in removal path
    i40e: check for and deal with non-contiguous TCs
    ixgbe: Re-enable ability to toggle VLAN filtering
    ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
    ...

    Linus Torvalds
     
  • The act_police uses its own code to walk the
    action hashtable, which leads to that we could
    not flush standalone tc police actions, so just
    switch to tcf_generic_walker() like other actions.

    (Joint work from Roman and Cong.)

    Signed-off-by: Roman Mashak
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Roman Mashak
     
  • Jamal reported a crash when we create a police action
    with a specific index, this is because the init logic
    is not correct, we should always create one for this
    case. Just unify the logic with other tc actions.

    Fixes: a03e6fe56971 ("act_police: fix a crash during removal")
    Reported-by: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • As pointed out by Jamal, an action could be shared by
    multiple filters, so we can't use list to chain them
    any more after we get rid of the original tc_action.
    Instead, we could just save pointers to these actions
    in tcf_exts, since they are refcount'ed, so convert
    the list to an array of pointers.

    The "ugly" part is the action API still accepts list
    as a parameter, I just introduce a helper function to
    convert the array of pointers to a list, instead of
    relying on the C99 feature to iterate the array.

    Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common")
    Reported-by: Jamal Hadi Salim
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • This list_del() for tc action is not needed actually,
    because we only use this list to chain bulk operations,
    therefore should not be carried for latter operations.

    Fixes: ec0595cc4495 ("net_sched: get rid of struct tcf_common")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • After refactoring tc_action into tcf_common, we no
    longer need to cleanup temporary "actions" in list,
    they are permanently stored in the hashtable.

    Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common")
    Reported-by: Jamal Hadi Salim
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

17 Aug, 2016

1 commit


16 Aug, 2016

3 commits

  • tipc_msg_create() can return a NULL skb and if so, we shouldn't try to
    call tipc_node_xmit_skb() on it.

    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 3 PID: 30298 Comm: trinity-c0 Not tainted 4.7.0-rc7+ #19
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff8800baf09980 ti: ffff8800595b8000 task.ti: ffff8800595b8000
    RIP: 0010:[] [] tipc_node_xmit_skb+0x6b/0x140
    RSP: 0018:ffff8800595bfce8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003023b0e0
    RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffffff83d12580
    RBP: ffff8800595bfd78 R08: ffffed000b2b7f32 R09: 0000000000000000
    R10: fffffbfff0759725 R11: 0000000000000000 R12: 1ffff1000b2b7f9f
    R13: ffff8800595bfd58 R14: ffffffff83d12580 R15: dffffc0000000000
    FS: 00007fcdde242700(0000) GS:ffff88011af80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fcddde1db10 CR3: 000000006874b000 CR4: 00000000000006e0
    DR0: 00007fcdde248000 DR1: 00007fcddd73d000 DR2: 00007fcdde248000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
    Stack:
    0000000000000018 0000000000000018 0000000041b58ab3 ffffffff83954208
    ffffffff830bb400 ffff8800595bfd30 ffffffff8309d767 0000000000000018
    0000000000000018 ffff8800595bfd78 ffffffff8309da1a 00000000810ee611
    Call Trace:
    [] tipc_shutdown+0x553/0x880
    [] SyS_shutdown+0x14b/0x170
    [] do_syscall_64+0x19c/0x410
    [] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 90 00 b4 0b 83 c7 00 f1 f1 f1 f1 4c 8d 6d e0 c7 40 04 00 00 00 f4 c7 40 08 f3 f3 f3 f3 48 89 d8 48 c1 e8 03 c7 45 b4 00 00 00 00 3c 30 00 75 78 48 8d 7b 08 49 8d 75 c0 48 b8 00 00 00 00 00
    RIP [] tipc_node_xmit_skb+0x6b/0x140
    RSP
    ---[ end trace 57b0484e351e71f1 ]---

    I feel like we should maybe return -ENOMEM or -ENOBUFS, but I'm not sure
    userspace is equipped to handle that. Anyway, this is better than a GPF
    and looks somewhat consistent with other tipc_msg_create() callers.

    Signed-off-by: Vegard Nossum
    Acked-by: Ying Xue
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Vegard Nossum
     
  • Ensure that the inner_protocol is set on transmit so that GSO segmentation,
    which relies on that field, works correctly.

    This is achieved by setting the inner_protocol in gre_build_header rather
    than each caller of that function. It ensures that the inner_protocol is
    set when gre_fb_xmit() is used to transmit GRE which was not previously the
    case.

    I have observed this is not the case when OvS transmits GRE using
    lwtunnel metadata (which it always does).

    Fixes: 38720352412a ("gre: Use inner_proto to obtain inner header protocol")
    Cc: Pravin Shelar
    Acked-by: Alexander Duyck
    Signed-off-by: Simon Horman
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Simon Horman
     
  • ping_v6_sendmsg does not set flowi6_oif in response to
    sin6_scope_id or sk_bound_dev_if, so it is not possible to use
    these APIs to ping an IPv6 address on a different interface.
    Instead, it sets flowi6_iif, which is incorrect but harmless.

    Stop setting flowi6_iif, and support various ways of setting oif
    in the same priority order used by udpv6_sendmsg.

    Tested: https://android-review.googlesource.com/#/c/254470/
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

15 Aug, 2016

1 commit

  • Remove unnecessary use of enable/disable callback notifications
    and the incorrect more space available check.

    The virtio_transport_tx_work handles when the TX virtqueue
    has more buffers available.

    Signed-off-by: Gerard Garcia
    Acked-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin

    Gerard Garcia
     

14 Aug, 2016

6 commits

  • The idea for type_check in dev_get_nest_level() was to count the number
    of nested devices of the same type (currently, only macvlan or vlan
    devices).
    This prevented the false positive lockdep warning on configurations such
    as:

    eth0
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     
  • If IPv6 is disabled when the option is set to keep IPv6
    addresses on link down, userspace is unaware of this as
    there is no such indication via netlink. The solution is to
    remove the IPv6 addresses in this case, which results in
    netlink messages indicating removal of addresses in the
    usual manner. This fix also makes the behavior consistent
    with the case of having IPv6 disabled first, which stops
    IPv6 addresses from being added.

    Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
    Signed-off-by: Mike Manning
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Mike Manning
     
  • sctp_transport_seq_start() does not currently clear iter->start_fail on
    success, but relies on it being zero when it is allocated (by
    seq_open_net()).

    This can be a problem in the following sequence:

    open() // allocates iter (and implicitly sets iter->start_fail = 0)
    read()
    - iter->start() // fails and sets iter->start_fail = 1
    - iter->stop() // doesn't call sctp_transport_walk_stop() (correct)
    read() again
    - iter->start() // succeeds, but doesn't change iter->start_fail
    - iter->stop() // doesn't call sctp_transport_walk_stop() (wrong)

    We should initialize sctp_ht_iter::start_fail to zero if ->start()
    succeeds, otherwise it's possible that we leave an old value of 1 there,
    which will cause ->stop() to not call sctp_transport_walk_stop(), which
    causes all sorts of problems like not calling rcu_read_unlock() (and
    preempt_enable()), eventually leading to more warnings like this:

    BUG: sleeping function called from invalid context at mm/slab.h:388
    in_atomic(): 0, irqs_disabled(): 0, pid: 16551, name: trinity-c2
    Preemption disabled at:[] rhashtable_walk_start+0x46/0x150

    [] preempt_count_add+0x1fb/0x280
    [] _raw_spin_lock+0x12/0x40
    [] rhashtable_walk_start+0x46/0x150
    [] sctp_transport_walk_start+0x2f/0x60
    [] sctp_transport_seq_start+0x4d/0x150
    [] traverse+0x170/0x850
    [] seq_read+0x7cc/0x1180
    [] proc_reg_read+0xbc/0x180
    [] do_loop_readv_writev+0x134/0x210
    [] do_readv_writev+0x565/0x660
    [] vfs_readv+0x67/0xa0
    [] do_preadv+0x126/0x170
    [] SyS_preadv+0xc/0x10
    [] do_syscall_64+0x19c/0x410
    [] return_from_SYSCALL_64+0x0/0x6a
    [] 0xffffffffffffffff

    Notice that this is a subtly different stacktrace from the one in commit
    5fc382d875 ("net/sctp: terminate rhashtable walk correctly").

    Cc: Xin Long
    Cc: Herbert Xu
    Cc: Eric W. Biederman
    Cc: Marcelo Ricardo Leitner
    Signed-off-by: Vegard Nossum
    Acked-By: Neil Horman
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Vegard Nossum
     
  • If iriap_register_lsap() fails to allocate memory, self->lsap is
    set to NULL. However, none of the callers handle the failure and
    irlmp_connect_request() will happily dereference it:

    iriap_register_lsap: Unable to allocated LSAP!
    ================================================================================
    UBSAN: Undefined behaviour in net/irda/irlmp.c:378:2
    member access within null pointer of type 'struct lsap_cb'
    CPU: 1 PID: 15403 Comm: trinity-c0 Not tainted 4.8.0-rc1+ #81
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org
    04/01/2014
    0000000000000000 ffff88010c7e78a8 ffffffff82344f40 0000000041b58ab3
    ffffffff84f98000 ffffffff82344e94 ffff88010c7e78d0 ffff88010c7e7880
    ffff88010630ad00 ffffffff84a5fae0 ffffffff84d3f5c0 000000000000017a
    Call Trace:
    [] dump_stack+0xac/0xfc
    [] ubsan_epilogue+0xd/0x8a
    [] __ubsan_handle_type_mismatch+0x157/0x411
    [] irlmp_connect_request+0x7ac/0x970
    [] iriap_connect_request+0xa0/0x160
    [] state_s_disconnect+0x88/0xd0
    [] iriap_do_client_event+0x94/0x120
    [] iriap_getvaluebyclass_request+0x3e0/0x6d0
    [] irda_find_lsap_sel+0x1eb/0x630
    [] irda_connect+0x828/0x12d0
    [] SYSC_connect+0x22b/0x340
    [] SyS_connect+0x9/0x10
    [] do_syscall_64+0x1b3/0x4b0
    [] entry_SYSCALL64_slow_path+0x25/0x25
    ================================================================================

    The bug seems to have been around since forever.

    There's more problems with missing error checks in iriap_init() (and
    indeed all of irda_init()), but that's a bigger problem that needs
    very careful review and testing. This patch will fix the most serious
    bug (as it's easily reached from unprivileged userspace).

    I have tested my patch with a reproducer.

    Signed-off-by: Vegard Nossum
    Signed-off-by: David S. Miller

    Vegard Nossum
     
  • Fix the bpf_try_make_writable() helper and all call sites we have in BPF,
    it's currently defect with regards to skbs when the write_len spans into
    non-linear parts, no matter if cloned or not.

    There are multiple issues at once. First, using skb_store_bits() is not
    correct since even if we have a cloned skb, page frags can still be shared.
    To really make them private, we need to pull them in via __pskb_pull_tail()
    first, which also gets us a private head via pskb_expand_head() implicitly.

    This is for helpers like bpf_skb_store_bytes(), bpf_l3_csum_replace(),
    bpf_l4_csum_replace(). Really, the only thing reasonable and working here
    is to call skb_ensure_writable() before any write operation. Meaning, via
    pskb_may_pull() it makes sure that parts we want to access are pulled in and
    if not does so plus unclones the skb implicitly. If our write_len still fits
    the headlen and we're cloned and our header of the clone is not writable,
    then we need to make a private copy via pskb_expand_head(). skb_store_bits()
    is a bit misleading and only safe to store into non-linear data in different
    contexts such as 357b40a18b04 ("[IPV6]: IPV6_CHECKSUM socket option can
    corrupt kernel memory").

    For above BPF helper functions, it means after fixed bpf_try_make_writable(),
    we've pulled in enough, so that we operate always based on skb->data. Thus,
    the call to skb_header_pointer() and skb_store_bits() becomes superfluous.
    In bpf_skb_store_bytes(), the len check is unnecessary too since it can
    only pass in maximum of BPF stack size, so adding offset is guaranteed to
    never overflow. Also bpf_l3/4_csum_replace() helpers must test for proper
    offset alignment since they use __sum16 pointer for writing resulting csum.

    The remaining helpers that change skb data not discussed here yet are
    bpf_skb_vlan_push(), bpf_skb_vlan_pop() and bpf_skb_change_proto(). The
    vlan helpers internally call either skb_ensure_writable() (pop case) and
    skb_cow_head() (push case, for head expansion), respectively. Similarly,
    bpf_skb_proto_xlat() takes care to not mangle page frags.

    Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
    Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
    Fixes: 3697649ff29e ("bpf: try harder on clones when writing into skb")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Currently, if calipso_genopt fails then the error exit path
    does not free the ipv6_opt_hdr new causing a memory leak. Fix
    this by kfree'ing new on the error exit path.

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
     

13 Aug, 2016

2 commits

  • While hashing out BPF's current_task_under_cgroup helper bits, it came
    to discussion that the skb_in_cgroup helper name was suboptimally chosen.

    Tejun says:

    So, I think in_cgroup should mean that the object is in that
    particular cgroup while under_cgroup in the subhierarchy of that
    cgroup. Let's rename the other subhierarchy test to under too. I
    think that'd be a lot less confusing going forward.

    [...]

    It's more intuitive and gives us the room to implement the real
    "in" test if ever necessary in the future.

    Since this touches uapi bits, we need to change this as long as v4.8
    is not yet officially released. Thus, change the helper enum and rename
    related bits.

    Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
    Reference: http://patchwork.ozlabs.org/patch/658500/
    Suggested-by: Sargun Dhillon
    Suggested-by: Tejun Heo
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    - Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user
    needs multiple different security services (e.g. krb5i and krb5p).

    - Stable patch to fix a regression introduced by the use of
    SO_REUSEPORT, and that prevented the use of multiple different NFS
    versions to the same server.

    - TCP socket reconnection timer fixes.

    - Patch from Neil to disable the use of IPv6 temporary addresses"

    * tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: Cap the transport reconnection timer at 1/2 lease period
    NFSv4: Cleanup the setting of the nfs4 lease period
    SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout
    SUNRPC: Fix reconnection timeouts
    NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED
    SUNRPC: disable the use of IPv6 temporary addresses.
    SUNRPC: allow for upcalls for same uid but different gss service
    SUNRPC: Fix up socket autodisconnect
    SUNRPC: Handle EADDRNOTAVAIL on connection failures

    Linus Torvalds
     

12 Aug, 2016

1 commit

  • Pull virtio/vhost fixes and cleanups from Michael Tsirkin:
    "Misc fixes and cleanups all over the place"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio/s390: deprecate old transport
    virtio/s390: keep early_put_chars
    virtio_blk: Fix a slient kernel panic
    virtio-vsock: fix include guard typo
    vhost/vsock: fix vhost virtio_vsock_pkt use-after-free
    9p/trans_virtio: use kvfree() for iov_iter_get_pages_alloc()
    virtio: fix error handling for debug builds
    virtio: fix memory leak in virtqueue_add()

    Linus Torvalds
     

11 Aug, 2016

3 commits

  • The creation of a tunnel vport (geneve, gre, vxlan) brings up a
    corresponding netdev, a multi-step operation which can fail.

    For example, changing a vxlan vport's netdev state to 'up' binds the
    vport's socket to a UDP port - if the binding fails (e.g. due to the
    port being in use), the error is currently ignored giving the
    appearance that the tunnel vport creation completed successfully.

    Signed-off-by: Martynas Pumputis
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Martynas Pumputis
     
  • In commit cf6f7e1d5109 ("tipc: dump monitor attributes"),
    I dereferenced a pointer before checking if its valid.
    This is reported by static check Smatch as:
    net/tipc/monitor.c:733 tipc_nl_add_monitor_peer()
    warn: variable dereferenced before check 'mon' (see line 731)

    In this commit, we check for a valid monitor before proceeding
    with any other operation.

    Fixes: cf6f7e1d5109 ("tipc: dump monitor attributes")
    Reported-by: Dan Carpenter
    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for your net tree,
    they are:

    1) Use mod_timer_pending() to avoid reactivating a dead expectation in
    the h323 conntrack helper, from Liping Zhang.

    2) Oneliner to fix a type in the register name defined in the nf_tables
    header.

    3) Don't try to look further when we find an inactive elements with no
    descendants in the rbtree set implementation, otherwise we crash.

    4) Handle valid zero CSeq in the SIP conntrack helper, from
    Christophe Leroy.

    5) Don't display a trailing slash in conntrack helper with no classes
    via /proc/net/nf_conntrack_expect, from Liping Zhang.

    6) Fix an expectation leak during creation from the nfqueue path, again
    from Liping Zhang.

    7) Validate netlink port ID in verdict message from nfqueue, otherwise
    an injection can be possible. Again from Zhang.

    8) Reject conntrack tuples with different transport protocol on
    original and reply tuples, also from Zhang.

    9) Validate offset and length in nft_exthdr, make sure they are under
    sizeof(u8), from Laura Garcia Liebana.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Aug, 2016

8 commits

  • Fix the direct assignment of offset and length attributes included in
    nft_exthdr structure from u32 data to u8.

    Signed-off-by: Laura Garcia Liebana
    Signed-off-by: Pablo Neira Ayuso

    Laura Garcia Liebana
     
  • Adding fdb entries pointing to the bridge device uses fdb_insert(),
    which lacks various checks and does not respect added_by_user flag.

    As a result, some inconsistent behavior can happen:
    * Adding temporary entries succeeds but results in permanent entries.
    * Same goes for "dynamic" and "use".
    * Changing mac address of the bridge device causes deletion of
    user-added entries.
    * Replacing existing entries looks successful from userspace but actually
    not, regardless of NLM_F_EXCL flag.

    Use the same logic as other entries and fix them.

    Fixes: 3741873b4f73 ("bridge: allow adding of fdb entries pointing to the bridge device")
    Signed-off-by: Toshiaki Makita
    Acked-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • When executing the script included below, the netns delete operation
    hangs with the following message (repeated at 10 second intervals):

    kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

    This occurs because a reference to the lo interface in the "secure" netns
    is still held by a dst entry in the xfrm bundle cache in the init netns.

    Address this problem by garbage collecting the tunnel netns flow cache
    when a cross-namespace vti interface receives a NETDEV_DOWN notification.

    A more detailed description of the problem scenario (referencing commands
    in the script below):

    (1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1

    The vti_test interface is created in the init namespace. vti_tunnel_init()
    attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
    setting the tunnel net to &init_net.

    (2) ip link set vti_test netns secure

    The vti_test interface is moved to the "secure" netns. Note that
    the associated struct ip_tunnel still has tunnel->net set to &init_net.

    (3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

    The first packet sent using the vti device causes xfrm_lookup() to be
    called as follows:

    dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);

    Note that tunnel->net is the init namespace, while skb_dst(skb) references
    the vti_test interface in the "secure" namespace. The returned dst
    references an interface in the init namespace.

    Also note that the first parameter to xfrm_lookup() determines which flow
    cache is used to store the computed xfrm bundle, so after xfrm_lookup()
    returns there will be a cached bundle in the init namespace flow cache
    with a dst referencing a device in the "secure" namespace.

    (4) ip netns del secure

    Kernel begins to delete the "secure" namespace. At some point the
    vti_test interface is deleted, at which point dst_ifdown() changes
    the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
    in the "secure" namespace however).
    Since nothing has happened to cause the init namespace's flow cache
    to be garbage collected, this dst remains attached to the flow cache,
    so the kernel loops waiting for the last reference to lo to go away.

    ip link add br1 type bridge
    ip link set dev br1 up
    ip addr add dev br1 1.1.1.1/8

    ip netns add secure
    ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
    ip link set vti_test netns secure
    ip netns exec secure ip link set vti_test up
    ip netns exec secure ip link s lo up
    ip netns exec secure ip addr add dev lo 192.168.100.1/24
    ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
    ip xfrm policy flush
    ip xfrm state flush
    ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
    proto esp mode tunnel mark 1
    ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
    proto esp mode tunnel mark 1
    ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
    mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
    ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
    mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788

    ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

    ip netns del secure

    Reported-by: Hangbin Liu
    Reported-by: Jan Tluka
    Signed-off-by: Lance Richardson
    Signed-off-by: David S. Miller

    Lance Richardson
     
  • Under certain conditions, the data_ready handler will discard a packet.
    These need to be freed.

    Signed-off-by: David Howells

    David Howells
     
  • Fix a use of a packet after it has been enqueued onto the packet processing
    queue in the data_ready handler. Once on a call's Rx queue, we mustn't
    touch it any more as it may be dequeued and freed by the call processor
    running on a work queue.

    Save the values we need before enqueuing.

    Without this, we can get an oops like the following:

    BUG: unable to handle kernel NULL pointer dereference at 000000000000009c
    IP: [] rxrpc_fast_process_packet+0x724/0xa11 [af_rxrpc]
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
    CPU: 2 PID: 0 Comm: swapper/2 Tainted: G E 4.7.0-fsdevel+ #1336
    Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
    task: ffff88040d6863c0 task.stack: ffff88040d68c000
    RIP: 0010:[] [] rxrpc_fast_process_packet+0x724/0xa11 [af_rxrpc]
    RSP: 0018:ffff88041fb03a78 EFLAGS: 00010246
    RAX: ffffffffffffffff RBX: ffff8803ff195b00 RCX: 0000000000000001
    RDX: ffffffffa01854d1 RSI: 0000000000000008 RDI: ffff8803ff195b00
    RBP: ffff88041fb03ab0 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88041fb038c8 R11: 0000000000000000 R12: ffff880406874800
    R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88041fb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000009c CR3: 0000000001c14000 CR4: 00000000001406e0
    Stack:
    ffff8803ff195ea0 ffff880408348800 ffff880406874800 ffff8803ff195b00
    ffff880408348800 ffff8803ff195ed8 0000000000000000 ffff88041fb03af0
    ffffffffa0186072 0000000000000000 ffff8804054da000 0000000000000000
    Call Trace:

    [] rxrpc_data_ready+0x89d/0xbae [af_rxrpc]
    [] __sock_queue_rcv_skb+0x24c/0x2b2
    [] __udp_queue_rcv_skb+0x4b/0x1bd
    [] udp_queue_rcv_skb+0x281/0x4db
    [] __udp4_lib_rcv+0x7ed/0x963
    [] udp_rcv+0x15/0x17
    [] ip_local_deliver_finish+0x1c3/0x318
    [] ip_local_deliver+0xbb/0xc4
    [] ? inet_del_offload+0x40/0x40
    [] ip_rcv_finish+0x3ce/0x42c
    [] ip_rcv+0x304/0x33d
    [] ? ip_local_deliver_finish+0x318/0x318
    [] __netif_receive_skb_core+0x601/0x6e8
    [] __netif_receive_skb+0x13/0x54
    [] netif_receive_skb_internal+0xbb/0x17c
    [] napi_gro_receive+0xf9/0x1bd
    [] rtl8169_poll+0x32b/0x4a8
    [] net_rx_action+0xe8/0x357
    [] __do_softirq+0x1aa/0x414
    [] irq_exit+0x3d/0xb0
    [] do_IRQ+0xe4/0xfc
    [] common_interrupt+0x93/0x93

    [] ? cpuidle_enter_state+0x1ad/0x2be
    [] ? cpuidle_enter_state+0x1a8/0x2be
    [] cpuidle_enter+0x12/0x14
    [] call_cpuidle+0x39/0x3b
    [] cpu_startup_entry+0x230/0x35d
    [] start_secondary+0xf4/0xf7

    Signed-off-by: David Howells

    David Howells
     
  • Once a packet has been posted to a connection in the data_ready handler, we
    mustn't try reposting if we then find that the connection is dying as the
    refcount has been given over to the dying connection and the packet might
    no longer exist.

    Losing the packet isn't a problem as the peer will retransmit.

    Signed-off-by: David Howells

    David Howells
     
  • The call state machine processor sets up the message parameters for a UDP
    message that it might need to transmit in advance on the basis that there's
    a very good chance it's going to have to transmit either an ACK or an
    ABORT. This requires it to look in the connection struct to retrieve some
    of the parameters.

    However, if the call is complete, the call connection pointer may be NULL
    to dissuade the processor from transmitting a message. However, there are
    some situations where the processor is still going to be called - and it's
    still going to set up message parameters whether it needs them or not.

    This results in a NULL pointer dereference at:

    net/rxrpc/call_event.c:837

    To fix this, skip the message pre-initialisation if there's no connection
    attached.

    Signed-off-by: David Howells

    David Howells
     
  • If rxrpc_new_client_call() fails to make a connection, the call record that
    it allocated needs to be marked as RXRPC_CALL_RELEASED before it is passed
    to rxrpc_put_call() to indicate that it no longer has any attachment to the
    AF_RXRPC socket.

    Without this, an assertion failure may occur at:

    net/rxrpc/call_object:635

    Signed-off-by: David Howells

    David Howells
     

09 Aug, 2016

9 commits

  • The memory allocated by iov_iter_get_pages_alloc() can be allocated with
    vmalloc() if kmalloc() failed -- see get_pages_array().

    In that case we need to free it with vfree(), so let's use kvfree().

    The bug manifests like this:

    BUG: unable to handle kernel paging request at ffffeb0400072da0
    IP: [] kfree+0x4b/0x140
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP KASAN
    CPU: 2 PID: 675 Comm: trinity-c2 Not tainted 4.7.0-rc7+ #14
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff8800badef2c0 ti: ffff880069208000 task.ti: ffff880069208000
    RIP: 0010:[] [] kfree+0x4b/0x140
    RSP: 0000:ffff88006920f3f0 EFLAGS: 00010282
    RAX: ffffea0000000000 RBX: ffffc90001cb6000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffc90001cb6000
    RBP: ffff88006920f410 R08: 0000000000000000 R09: dffffc0000000000
    R10: ffff8800badefa30 R11: 0000056a3d3b0d9f R12: ffff88006920f620
    R13: ffffeb0400072d80 R14: ffff8800baa94078 R15: 0000000000000000
    FS: 00007fbd2b437700(0000) GS:ffff88011af00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffeb0400072da0 CR3: 000000006926d000 CR4: 00000000000006e0
    Stack:
    0000000000000001 ffff88006920f620 ffffed001755280f ffff8800baa94078
    ffff88006920f6a8 ffffffff8310442b dffffc0000000000 ffff8800badefa30
    ffff8800badefa28 ffff88011af1fba0 1ffff1000d241e98 ffff8800ba892150
    Call Trace:
    [] p9_virtio_zc_request+0x72b/0xdb0
    [] p9_client_zc_rpc.constprop.8+0x246/0xb10
    [] p9_client_read+0x4c9/0x750
    [] v9fs_fid_readpage+0x14c/0x320
    [] v9fs_vfs_readpage+0x36/0x50
    [] filemap_fault+0x9a3/0xe60
    [] __do_fault+0x158/0x300
    [] handle_mm_fault+0x1cf1/0x3c80
    [] __do_page_fault+0x30a/0x8e0
    [] do_page_fault+0x2f/0x80
    [] do_async_page_fault+0x27/0xa0
    [] async_page_fault+0x28/0x30
    Code: 00 80 41 54 53 49 01 fd 48 0f 42 05 b0 39 67 02 48 89 fb 49 01 c5 48 b8 00 00 00 00 00 ea ff ff 49 c1 ed 0c 49 c1 e5 06 49 01 c5 8b 45 20 48 8d 50 ff a8 01 4c 0f 45 ea 49 8b 55 20 48 8d 42
    RIP [] kfree+0x4b/0x140
    RSP
    CR2: ffffeb0400072da0
    ---[ end trace f3d59a04bafec038 ]---

    Cc: Al Viro
    Signed-off-by: Vegard Nossum
    Signed-off-by: Michael S. Tsirkin

    Vegard Nossum
     
  • A newly added bugfix caused an uninitialized variable to be
    used for printing debug output. This is harmless as long
    as the debug setting is disabled, but otherwise leads to an
    immediate crash.

    gcc warns about this when -Wmaybe-uninitialized is enabled:

    net/rxrpc/call_object.c: In function 'rxrpc_release_call':
    net/rxrpc/call_object.c:496:163: error: 'sp' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    The initialization was removed but one of the users remains.
    This adds back the initialization.

    Signed-off-by: Arnd Bergmann
    Fixes: 372ee16386bb ("rxrpc: Fix races between skb free, ACK generation and replying")
    Signed-off-by: David Howells

    Arnd Bergmann
     
  • Currently, user can add a conntrack with different l4proto via nfnetlink.
    For example, original tuple is TCP while reply tuple is SCTP. This is
    invalid combination, we should report EINVAL to userspace.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • Like NFQNL_MSG_VERDICT_BATCH do, we should also reject the verdict
    request when the portid is not same with the initial portid(maybe
    from another process).

    Fixes: 97d32cf9440d ("netfilter: nfnetlink_queue: batch verdict support")
    Signed-off-by: Liping Zhang
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • User can use NFQA_EXP to attach expectations to conntracks, but we
    forget to put back nf_conntrack_expect when it is inserted successfully,
    i.e. in this normal case, expect's use refcnt will be 3. So even we
    unlink it and put it back later, the use refcnt is still 1, then the
    memory will be leaked forever.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • The 'name' filed in struct nf_conntrack_expect_policy{} is not a
    pointer, so check it is NULL or not will always return true. Even if the
    name is empty, slash will always be displayed like follows:
    # cat /proc/net/nf_conntrack_expect
    297 l3proto = 2 proto=6 src=1.1.1.1 dst=2.2.2.2 sport=1 dport=1025 ftp/
    ^

    Fixes: 3a8fc53a45c4 ("netfilter: nf_ct_helper: allocate 16 bytes for the helper and policy names")
    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • Commit 52253db924d1 ("sctp: also point GSO head_skb to the sk when
    it's available") used event->chunk->head_skb to get the head_skb in
    sctp_ulpevent_set_owner().

    But at that moment, the event->chunk was NULL, as it cloned the skb
    in sctp_ulpevent_make_rcvmsg(). Therefore, that patch didn't really
    work.

    This patch is to move the event->chunk initialization before calling
    sctp_ulpevent_receive_data() so that it uses event->chunk when it's
    valid.

    Fixes: 52253db924d1 ("sctp: also point GSO head_skb to the sk when it's available")
    Signed-off-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Xin Long
     
  • When having skbs on ingress with CHECKSUM_COMPLETE, tc BPF programs don't
    push rcsum of mac header back in and after BPF run back pull out again as
    opposed to some other subsystems (ovs, for example).

    For cases like q-in-q, meaning when a vlan tag for offloading is already
    present and we're about to push another one, then skb_vlan_push() pushes the
    inner one into the skb, increasing mac header and skb_postpush_rcsum()'ing
    the 4 bytes vlan header diff. Likewise, for the reverse operation in
    skb_vlan_pop() for the case where vlan header needs to be pulled out of the
    skb, we're decreasing the mac header and skb_postpull_rcsum()'ing the 4 bytes
    rcsum of the vlan header that was removed.

    However mangling the rcsum here will lead to hw csum failure for BPF case,
    since we're pulling or pushing data that was not part of the current rcsum.
    Changing tc BPF programs in general to push/pull rcsum around BPF_PROG_RUN()
    is also not really an option since current behaviour is ABI by now, but apart
    from that would also mean to do quite a bit of useless work in the sense that
    usually 12 bytes need to be rcsum pushed/pulled also when we don't need to
    touch this vlan related corner case. One way to fix it would be to push the
    necessary rcsum fixup down into vlan helpers that are (mostly) slow-path
    anyway.

    Fixes: 4e10df9a60d9 ("bpf: introduce bpf_skb_vlan_push/pop() helpers")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • bpf_skb_store_bytes() invocations above L2 header need BPF_F_RECOMPUTE_CSUM
    flag for updates, so that CHECKSUM_COMPLETE will be fixed up along the way.
    Where we ran into an issue with bpf_skb_store_bytes() is when we did a
    single-byte update on the IPv6 hoplimit despite using BPF_F_RECOMPUTE_CSUM
    flag; simple ping via ICMPv6 triggered a hw csum failure as a result. The
    underlying issue has been tracked down to a buffer alignment issue.

    Meaning, that csum_partial() computations via skb_postpull_rcsum() and
    skb_postpush_rcsum() pair invoked had a wrong result since they operated on
    an odd address for the hoplimit, while other computations were done on an
    even address. This mix doesn't work as-is with skb_postpull_rcsum(),
    skb_postpush_rcsum() pair as it always expects at least half-word alignment
    of input buffers, which is normally the case. Thus, instead of these helpers
    using csum_sub() and (implicitly) csum_add(), we need to use csum_block_sub(),
    csum_block_add(), respectively. For unaligned offsets, they rotate the sum
    to align it to a half-word boundary again, otherwise they work the same as
    csum_sub() and csum_add().

    Adding __skb_postpull_rcsum(), __skb_postpush_rcsum() variants that take the
    offset as an input and adapting bpf_skb_store_bytes() to them fixes the hw
    csum failures again. The skb_postpull_rcsum(), skb_postpush_rcsum() helpers
    use a 0 constant for offset so that the compiler optimizes the offset & 1
    test away and generates the same code as with csum_sub()/_add().

    Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann