04 May, 2016

4 commits

  • The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.

    Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module")
    CC: Tom Herbert
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • An arbitration scheme for duelling SYNs is implemented as part of
    commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
    outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
    involved will arrive at the same arbitration decision. However, this
    needs to be synchronized with an outgoing SYN to be generated by
    rds_tcp_conn_connect(). This commit achieves the synchronization
    through the t_conn_lock mutex in struct rds_tcp_connection.

    The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
    the t_conn_lock mutex. A SYN is sent out only if the RDS connection is
    not already UP (an UP would indicate that rds_tcp_accept_one() has
    completed 3WH, so no SYN needs to be generated).

    Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
    acquiring the t_conn_lock mutex. The only acceptable states (to
    allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
    was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
    outgoing SYN before we saw incoming SYN).
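
    The two state checks described above can be sketched in userspace C as
    follows. This is only an illustration under stated assumptions: a pthread
    mutex stands in for t_conn_lock, the state enum is reduced to three
    hypothetical values, and the function names are invented for the example
    rather than taken from the RDS code.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, reduced set of connection states (the kernel has more). */
enum conn_state { CONN_DOWN, CONN_CONNECTING, CONN_UP };

struct tcp_conn {
    pthread_mutex_t t_conn_lock;   /* stands in for t_conn_lock */
    enum conn_state state;
};

void tcp_conn_init(struct tcp_conn *tc, enum conn_state s)
{
    pthread_mutex_init(&tc->t_conn_lock, NULL);
    tc->state = s;
}

/* Connect side: a SYN is needed only if the connection is not already UP. */
bool connect_should_send_syn(struct tcp_conn *tc)
{
    bool send;

    pthread_mutex_lock(&tc->t_conn_lock);
    send = (tc->state != CONN_UP);
    pthread_mutex_unlock(&tc->t_conn_lock);
    return send;
}

/* Accept side: arbitration may continue only from UP or CONNECTING. */
bool accept_may_arbitrate(struct tcp_conn *tc)
{
    bool ok;

    pthread_mutex_lock(&tc->t_conn_lock);
    ok = (tc->state == CONN_UP || tc->state == CONN_CONNECTING);
    pthread_mutex_unlock(&tc->t_conn_lock);
    return ok;
}
```

    In the sketch the lock is dropped before acting on the result; the point
    is only that both paths read the state under the same mutex.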

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • There is a race condition between rds_send_xmit -> rds_tcp_xmit
    and the code that deals with resolution of duelling syns added
    by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
    outgoing socket in rds_tcp_accept_one()").

    Specifically, we may end up dereferencing a null pointer in rds_send_xmit
    if we have the interleaving sequence:
    rds_tcp_accept_one                  rds_send_xmit

                                        conn is RDS_CONN_UP, so
                                        invoke rds_tcp_xmit

                                        tc = conn->c_transport_data
    rds_tcp_restore_callbacks
        /* reset t_sock */
                                        null ptr deref from tc->t_sock

    The race condition can be avoided without adding the overhead of
    additional locking in the xmit path: have rds_tcp_accept_one wait
    for rds_tcp_xmit threads to complete before resetting callbacks.
    The synchronization can be done in the same manner as rds_conn_shutdown().
    First set the rds_conn_state to something other than RDS_CONN_UP
    (so that new threads cannot get into rds_tcp_xmit()), then wait for
    RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
    threads in rds_tcp_xmit are done.
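
    A minimal userspace sketch of that ordering, using C11 atomics. The
    struct, the state values and the function names are assumptions made for
    the example; the busy-wait stands in for whatever sleeping wait the
    kernel code actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { STATE_UP = 1, STATE_RESETTING = 2 };   /* hypothetical values */

struct conn {
    atomic_int  state;
    atomic_flag in_xmit;    /* stands in for the RDS_IN_XMIT bit */
};

void conn_init(struct conn *c)
{
    atomic_init(&c->state, STATE_UP);
    atomic_flag_clear(&c->in_xmit);
}

/* Sender: claim the xmit flag first, then re-check the state. */
bool xmit_enter(struct conn *c)
{
    if (atomic_flag_test_and_set(&c->in_xmit))
        return false;                     /* another sender is active */
    if (atomic_load(&c->state) != STATE_UP) {
        atomic_flag_clear(&c->in_xmit);   /* back off, conn is not UP */
        return false;
    }
    return true;
}

void xmit_exit(struct conn *c)
{
    atomic_flag_clear(&c->in_xmit);
}

/* Accept side: leave UP first, then wait until no sender holds the flag. */
void quiesce_senders(struct conn *c)
{
    atomic_store(&c->state, STATE_RESETTING);
    while (atomic_flag_test_and_set(&c->in_xmit))
        ;   /* spin in this sketch only */
    atomic_flag_clear(&c->in_xmit);
}
```

    Once quiesce_senders() returns, any later xmit_enter() sees a state other
    than UP and backs off, so the callbacks can be reset without a sender
    still holding a stale socket pointer.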

    Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
    outgoing socket in rds_tcp_accept_one()")
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • The mlx4 and mlx5 drivers do not support IPv6 checksum offload for
    tunnels. Because of this, we should disable GSO in addition to the
    checksum offload features when we find that a device cannot perform a
    checksum on a given packet type.
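
    As a rough illustration of the idea, the following sketch clears both the
    checksum and the GSO bits from a feature mask when the device cannot
    checksum the packet. The flag names and bit positions are hypothetical
    and do not match the kernel's NETIF_F_* definitions.

```c
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define F_IP_CSUM    (UINT64_C(1) << 0)
#define F_IPV6_CSUM  (UINT64_C(1) << 1)
#define F_TSO        (UINT64_C(1) << 2)
#define F_TSO6       (UINT64_C(1) << 3)
#define F_SG         (UINT64_C(1) << 4)

#define F_CSUM_MASK  (F_IP_CSUM | F_IPV6_CSUM)
#define F_GSO_MASK   (F_TSO | F_TSO6)

/*
 * If the hardware cannot checksum this packet type, segmentation offload
 * cannot be used either, so both groups of bits are dropped together.
 */
uint64_t features_for_packet(uint64_t features, int hw_can_checksum)
{
    if (!hw_can_checksum)
        features &= ~(F_CSUM_MASK | F_GSO_MASK);
    return features;
}
```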

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

03 May, 2016

3 commits

  • This was recently reported to me, and reproduced on the latest net kernel,
    when attempting to run netperf from a host that had a netem qdisc attached
    to the egress interface:

    [ 788.073771] ---------------------[ cut here ]---------------------------
    [ 788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
    [ 788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
    data_len=0 gso_size=1448 gso_type=1 ip_summed=3
    [ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
    ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
    glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
    i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
    pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
    sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
    i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
    crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
    serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
    dm_mod
    [ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W
    ------------ 3.10.0-327.el7.x86_64 #1
    [ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
    [ 788.542260] ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
    ffffffff816351f1
    [ 788.576332] ffff880437c036a8 ffffffff8107b200 ffff880633e74200
    ffff880231674000
    [ 788.611943] 0000000000000001 0000000000000003 0000000000000000
    ffff880437c03710
    [ 788.647241] Call Trace:
    [ 788.658817] [] dump_stack+0x19/0x1b
    [ 788.686193] [] warn_slowpath_common+0x70/0xb0
    [ 788.713803] [] warn_slowpath_fmt+0x5c/0x80
    [ 788.741314] [] ? ___ratelimit+0x93/0x100
    [ 788.767018] [] skb_warn_bad_offload+0xcd/0xda
    [ 788.796117] [] skb_checksum_help+0x17c/0x190
    [ 788.823392] [] netem_enqueue+0x741/0x7c0 [sch_netem]
    [ 788.854487] [] dev_queue_xmit+0x2a8/0x570
    [ 788.880870] [] ip_finish_output+0x53d/0x7d0
    ...

    The problem occurs because netem is not prepared to handle GSO packets (as it
    uses skb_checksum_help in its enqueue path, which cannot manipulate these
    frames).

    The solution, I think, is to simply segment the skb in a similar fashion to
    the way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor
    changes. When we decide to corrupt an skb, if the frame is GSO, we segment
    it, corrupt the first segment, and enqueue the remaining ones.
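
    The segment-then-corrupt idea can be sketched on a plain byte buffer as
    below. This is not netem's code: the function name, the flat buffer and
    the single flipped bit are assumptions chosen to keep the example small.

```c
#include <stddef.h>

/*
 * Split `len` bytes into chunks of at most `gso_size` bytes, record each
 * chunk length in seg_len[], and corrupt only the first chunk by flipping
 * one bit. Returns the number of segments produced (0 on bad input).
 */
size_t segment_and_corrupt(unsigned char *buf, size_t len, size_t gso_size,
                           size_t *seg_len, size_t max_segs)
{
    size_t n = 0, off = 0;

    if (len == 0 || gso_size == 0 || max_segs == 0)
        return 0;

    while (off < len && n < max_segs) {
        size_t l = (len - off < gso_size) ? len - off : gso_size;

        seg_len[n++] = l;
        off += l;
    }

    buf[0] ^= 0x01;   /* only the first segment is touched */
    return n;
}
```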

    Tested successfully by myself on the latest net kernel, to which this applies.

    Signed-off-by: Neil Horman
    CC: Jamal Hadi Salim
    CC: "David S. Miller"
    CC: netem@lists.linux-foundation.org
    CC: eric.dumazet@gmail.com
    CC: stephen@networkplumber.org
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     
  • Antonio Quartulli says:

    ====================
    In this small batch of patches you have:
    - a fix for our Distributed ARP Table that makes sure that the input
    provided to the hash function during a query is the same as the one
    provided during an insert (so to prevent false negatives), by Antonio
    Quartulli
    - a fix for our new protocol implementation B.A.T.M.A.N. V that ensures
    that a hard interface is properly re-activated when it is brought down
    and then up again, by Antonio Quartulli
    - two fixes, respectively to the reference counting of the tt_local_entry
    and neigh_node objects, by Sven Eckelmann. These bugs are rather severe,
    as they would prevent the netdev objects referenced by batman-adv from
    being released after shutdown.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:

    1) MODULE_FIRMWARE firmware string not correct for iwlwifi 8000 chips,
    from Sara Sharon.

    2) Fix SKB size checks in batman-adv stack on receive, from Sven
    Eckelmann.

    3) Leak fix on mac80211 interface add error paths, from Johannes Berg.

    4) Cannot invoke napi_disable() with BH disabled in myri10ge driver,
    fix from Stanislaw Gruszka.

    5) Fix sign extension problem when computing feature masks in
    net_gso_ok(), from Marcelo Ricardo Leitner.

    6) lan78xx driver doesn't count packets and packet lengths in its
    statistics properly, fix from Woojung Huh.

    7) Fix the buffer allocation sizes in pegasus USB driver, from Petko
    Manolov.

    8) Fix refcount overflows in bpf, from Alexei Starovoitov.

    9) Unified dst cache handling introduced a preempt warning in
    ip_tunnel, fix by resetting rather than setting the cached route.
    From Paolo Abeni.

    10) Listener hash collision test fix in soreuseport, from Craig Gallek.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
    gre: do not pull header in ICMP error processing
    net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case
    tipc: only process unicast on intended node
    cxgb3: fix out of bounds read
    net/smscx5xx: use the device tree for mac address
    soreuseport: Fix TCP listener hash collision
    net: l2tp: fix reversed udp6 checksum flags
    ip_tunnel: fix preempt warning in ip tunnel creation/updating
    samples/bpf: fix trace_output example
    bpf: fix check_map_func_compatibility logic
    bpf: fix refcnt overflow
    drivers: net: cpsw: use of_phy_connect() in fixed-link case
    dt: cpsw: phy-handle, phy_id, and fixed-link are mutually exclusive
    drivers: net: cpsw: don't ignore phy-mode if phy-handle is used
    drivers: net: cpsw: fix segfault in case of bad phy-handle
    drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config
    MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver
    gre: reject GUE and FOU in collect metadata mode
    pegasus: fixes reported packet length
    pegasus: fixes URB buffer allocation size;
    ...

    Linus Torvalds
     

02 May, 2016

4 commits

  • iptunnel_pull_header expects that IP header was already pulled; with this
    expectation, it pulls the tunnel header. This is not true in gre_err.
    Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
    to the IP header.

    We cannot pull the tunnel header in this path. It's just a matter of not
    calling iptunnel_pull_header - we don't need any of its effects.

    Fixes: bda7bb463436 ("gre: Allow multiple protocol listener for gre protocol.")
    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • We have observed complete lock up of broadcast-link transmission due to
    unacknowledged packets never being removed from the 'transmq' queue. This
    is traced to nodes having their ack field set beyond the sequence number
    of packets that have actually been transmitted to them.
    Consider an example where node 1 has sent 10 packets to node 2 on a
    link and node 3 has sent 20 packets to node 2 on another link. We
    see examples of an ack from node 2 destined for node 3 being treated as
    an ack from node 2 at node 1. This leads to the ack on the node 1 to node
    2 link being increased to 20 even though we have only sent 10 packets.
    When node 1 does get around to sending further packets, none of the
    packets with sequence numbers less than 21 are actually removed from the
    transmq.
    To resolve this we reinstate some code lost in commit d999297c3dbb ("tipc:
    reduce locking scope during packet reception") which ensures that only
    messages destined for the receiving node are processed by that node. This
    prevents the sequence numbers from getting out of sync and resolves the
    packet leakage, thereby resolving the broadcast-link transmission
    lock-ups we observed.
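
    A minimal sketch of the reinstated check, using the node numbers from the
    example above. The struct and function are invented for illustration and
    ignore sequence-number wrap-around.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-link state kept by the sending node. */
struct link_state {
    uint32_t self;    /* address of the node that owns this link */
    uint16_t acked;   /* highest sequence number acknowledged so far */
};

/*
 * Accept an ack only if the message is really destined for this node.
 * Returns true if the ack was applied.
 */
bool link_rcv_ack(struct link_state *l, uint32_t dest_node, uint16_t ack)
{
    if (dest_node != l->self)
        return false;   /* meant for another node: ignore */
    l->acked = ack;
    return true;
}
```

    With this check, an ack of 20 addressed to node 3 no longer advances the
    ack state that node 1 keeps for its own link to node 2.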

    While we are aware that this change only patches over a root problem that
    we still haven't identified, this is a sanity test that it is always
    legitimate to do. It will remain in the code even after we identify and
    fix the real problem.

    Reviewed-by: Chris Packham
    Reviewed-by: John Thompson
    Signed-off-by: Hamish Martin
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Hamish Martin
     
  • I forgot to include a check for listener port equality when deciding
    if two sockets should belong to the same reuseport group. This was
    not caught previously because it's only necessary when two listening
    sockets for the same user happen to hash to the same listener bucket.
    The same error does not exist in the UDP path.
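
    A simplified sketch of the comparison with the port check included. The
    struct fields are assumptions for the example, and the real test compares
    more than these three fields.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of a listening socket. */
struct listener {
    unsigned int   uid;
    unsigned short port;
    bool           reuseport;
};

/*
 * Two listeners that land in the same hash bucket belong to the same
 * reuseport group only if their ports match as well.
 */
bool same_reuseport_group(const struct listener *a, const struct listener *b)
{
    return a->reuseport && b->reuseport &&
           a->uid == b->uid &&
           a->port == b->port;   /* the check that was missing */
}
```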

    Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
    Signed-off-by: Craig Gallek
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Craig Gallek
     
  • This patch fixes a bug that makes the handling of udp6 checksums on a
    udp6-encapsulated l2tp tunnel the opposite of what the userspace program
    requests.

    When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, it is
    expected that udp6 checksums of packets received on the l2tp tunnel
    being created are ignored. In `l2tp_netlink.c`:
    `l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set
    according to the flag, and then passed to `l2tp_core.c`:
    `l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In
    `l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set
    to the same value as `cfg.udp6_zero_rx_checksums`. However, if we want
    the checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be
    set to `false`, i.e. to the opposite value. The same should be
    done for `udp_conf.use_udp6_tx_checksums`.
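
    The inversion can be sketched as follows. The two structs are reduced
    stand-ins defined only for this example, and the field
    `udp6_zero_tx_checksums` is assumed by analogy with the rx field named
    above.

```c
#include <stdbool.h>

/* Reduced stand-in for the tunnel configuration. */
struct tunnel_cfg_sketch {
    bool udp6_zero_tx_checksums;   /* assumed name, by analogy with rx */
    bool udp6_zero_rx_checksums;
};

/* Reduced stand-in for the UDP socket configuration. */
struct udp_conf_sketch {
    bool use_udp6_tx_checksums;
    bool use_udp6_rx_checksums;
};

/* "zero checksum" requested means "do not use checksums": negate it. */
void fill_udp_conf(const struct tunnel_cfg_sketch *cfg,
                   struct udp_conf_sketch *udp_conf)
{
    udp_conf->use_udp6_tx_checksums = !cfg->udp6_zero_tx_checksums;
    udp_conf->use_udp6_rx_checksums = !cfg->udp6_zero_rx_checksums;
}
```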

    Signed-off-by: Miao Wang
    Acked-by: James Chapman
    Signed-off-by: David S. Miller

    Wang Shanker
     

30 Apr, 2016

1 commit

  • After commit e09acddf873b ("ip_tunnel: replace dst_cache with generic
    implementation"), a preemption debug warning is triggered when updating
    ip4 tunnels; the dst cache helper needs to be invoked in unpreemptible
    context.

    We don't need to load the cache on tunnel update, so this commit fixes
    the warning by replacing the load with a dst cache reset, which is
    preempt safe.

    Fixes: e09acddf873b ("ip_tunnel: replace dst_cache with generic implementation")
    Reported-by: Eric Dumazet
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     

29 Apr, 2016

10 commits

  • The batadv_neigh_node was specific to a batadv_hardif_neigh_node and held
    an implicit reference to it. But this reference was never stored in the
    form of a pointer in the batadv_neigh_node itself. Instead
    batadv_neigh_node_release depends on a consistent state of
    hard_iface->neigh_list and on batadv_hardif_neigh_get always returning the
    batadv_hardif_neigh_node object which it has a reference for. But
    batadv_hardif_neigh_get cannot guarantee that because it is working only
    with rcu_read_lock on this list. It can therefore happen that a neigh_addr
    is in this list twice or that batadv_hardif_neigh_get cannot find the
    batadv_hardif_neigh_node for a neigh_addr due to some other list
    operations taking place at the same time.

    Instead add a batadv_hardif_neigh_node pointer directly in
    batadv_neigh_node which will be used for the reference counter decremented
    on release of batadv_neigh_node.
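
    The pattern of storing the referenced object directly, instead of looking
    it up again at release time, can be sketched like this. The types and the
    plain integer counters are simplifications for the example, not the
    batman-adv structures.

```c
#include <stddef.h>

struct hardif_neigh {
    int refcount;
};

struct neigh_node {
    int refcount;
    struct hardif_neigh *hardif_neigh;   /* explicit, stored reference */
};

/* Take the reference once and remember exactly which object it is for. */
void neigh_node_init(struct neigh_node *n, struct hardif_neigh *h)
{
    h->refcount++;
    n->refcount = 1;
    n->hardif_neigh = h;
}

/* On final put, drop the reference on the stored object; no list lookup. */
void neigh_node_put(struct neigh_node *n)
{
    if (--n->refcount == 0) {
        n->hardif_neigh->refcount--;
        n->hardif_neigh = NULL;
    }
}
```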

    Fixes: cef63419f7db ("batman-adv: add list of unique single hop neighbors per hard-interface")
    Signed-off-by: Sven Eckelmann
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    Sven Eckelmann
     
  • The batadv_tt_local_entry was specific to a batadv_softif_vlan and held an
    implicit reference to it. But this reference was never stored in the form
    of a pointer in the tt_local_entry itself. Instead batadv_tt_local_remove,
    batadv_tt_local_table_free and batadv_tt_local_purge_pending_clients depend
    on a consistent state of bat_priv->softif_vlan_list and on
    batadv_softif_vlan_get always returning the batadv_softif_vlan object which
    it has a reference for. But batadv_softif_vlan_get cannot guarantee that
    because it is working only with rcu_read_lock on this list. It can
    therefore happen that a vid is in this list twice or that
    batadv_softif_vlan_get cannot find the batadv_softif_vlan for a vid due to
    some other list operations taking place at the same time.

    Instead add a batadv_softif_vlan pointer directly in batadv_tt_local_entry
    which will be used for the reference counter decremented on release of
    batadv_tt_local_entry.

    Fixes: 35df3b298fc8 ("batman-adv: fix TT VLAN inconsistency on VLAN re-add")
    Signed-off-by: Sven Eckelmann
    Acked-by: Antonio Quartulli
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    Sven Eckelmann
     
  • At the moment there is no explicit reactivation of a hard-interface
    upon NETDEV_UP event. In case of B.A.T.M.A.N. IV the interface is
    reactivated as soon as the next OGM is scheduled for sending, but this
    mechanism does not work with B.A.T.M.A.N. V. The latter does not rely
    on the same scheduling mechanism as its predecessor and for this reason
    the hard-interface remains deactivated forever after being brought down
    once.

    This patch fixes the reactivation mechanism by adding a new routing API
    which explicitly allows each algorithm to perform any needed operation
    upon interface re-activation.

    This API is optional and is implemented by B.A.T.M.A.N. V only, where it
    just takes care of setting the iface status to ACTIVE.

    Signed-off-by: Antonio Quartulli
    Signed-off-by: Marek Lindner

    Antonio Quartulli
     
  • Now that DAT is VLAN aware, it must use the VID when
    computing the DHT address of the candidate nodes where
    an entry is going to be stored/retrieved.
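
    A small sketch of why the VID has to be part of the hash input. FNV-1a is
    used here purely as a stand-in hash function, and dat_key() is a
    hypothetical helper; neither is taken from batman-adv.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, a stand-in for whatever hash the real code uses. */
static uint32_t fnv1a(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Build the hash input from both the IPv4 address and the VID, so that an
 * insert and a later query for the same (ip, vid) pair use identical input.
 */
uint32_t dat_key(uint32_t ip, uint16_t vid)
{
    unsigned char in[6] = {
        (unsigned char)(ip >> 24), (unsigned char)(ip >> 16),
        (unsigned char)(ip >> 8),  (unsigned char)ip,
        (unsigned char)(vid >> 8), (unsigned char)vid,
    };

    return fnv1a(in, sizeof in);
}
```

    If only one of the two code paths included the VID, the same address
    could map to different candidate nodes on insert and on query.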

    Fixes: be1db4f6615b ("batman-adv: make the Distributed ARP Table vlan aware")
    Signed-off-by: Antonio Quartulli
    [sven@narfation.org: fix conflicts with current version]
    Signed-off-by: Sven Eckelmann
    Signed-off-by: Marek Lindner

    Antonio Quartulli
     
  • Pull Ceph fixes from Sage Weil:
    "There is a lifecycle fix in the auth code, a fix for a narrow race
    condition on map, and a helpful message in the log when there is a
    feature mismatch (which happens frequently now that the default
    server-side options have changed)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    rbd: report unsupported features to syslog
    rbd: fix rbd map vs notify races
    libceph: make authorizer destruction independent of ceph_auth_client

    Linus Torvalds
     
  • The collect metadata mode does not support GUE nor FOU. This might be
    implemented later; until then, we should reject such config.

    I think this is okay to be changed. It's unlikely anyone has such
    configuration (as it doesn't work anyway) and we may need a way to
    distinguish whether it's supported or not by the kernel later.

    For backwards compatibility with iproute2, it's not possible to just check
    the attribute presence (iproute2 always includes the attribute), the actual
    value has to be checked, too.

    Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to
    contain an Ethernet header and was encapsulated as ETH_P_TEB. This is not
    the case: the interface is ARPHRD_IPGRE and the protocol to be used for
    encapsulation is skb->protocol.

    Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
    Signed-off-by: Jiri Benc
    Acked-by: Pravin B Shelar
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel
    is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7a048c).
    This is not the case, we're controlling the encapsulation addresses by
    lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata
    mode does not make sense.

    Although it would be more user friendly to reject requests that specify
    both the collect metadata flag and a remote/local IP address, this would
    break current users of gretap or introduce ugly code and differences in
    handling ipgre and gretap configuration. Keep the current behavior of
    remote/local IP address being ignored in such case.

    v3: Back to v1, added explanation paragraph.
    v2: Reject configuration specifying both remote/local address and collect
    metadata flag.

    Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • …kernel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    Just a single fix, for a per-CPU memory leak in a
    (root user triggerable) error case.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Antonio Quartulli says:

    ====================
    In this patchset you can find the following fixes:

    1) check skb size to avoid reading beyond its border when delivering
    payloads, by Sven Eckelmann
    2) initialize last_seen time in neigh_node object to prevent the cleanup
    routine from accidentally purging it, by Marek Lindner
    3) release "recently added" slave interfaces upon virtual/batman
    interface shutdown, by Sven Eckelmann
    4) properly decrease router object reference counter upon routing table
    update, by Sven Eckelmann
    5) release queue slots when purging OGM packets of deactivating slave
    interface, by Linus Lüssing

    Patches 2 and 3 have no "Fixes:" tag because the offending commits date
    back to when batman-adv was not yet officially in the net tree.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Apr, 2016

1 commit


26 Apr, 2016

4 commits

  • It was a simple idea -- save IPv6 configured addresses on a link down
    so that IPv6 behaves similarly to IPv4. As always, the devil is in the
    details, and the IPv6 stack has too many behavioral differences from IPv4,
    making the simple idea more complicated than it needs to be.

    The current implementation for keeping IPv6 addresses can panic or spit
    out a warning in one of many paths:

    1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
    rt6_fill_node while handling a route dump request.

    2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
    fib6_del

    3. Panic in fib6_purge_rt because rt6i_ref count is not 1.

    The root cause of all these is references related to the host route for
    an address that is retained.

    So, this patch deletes the host route every time the ifdown loop runs.
    Since the host route is deleted and will be re-generated on up, there is
    no longer a need for the l3mdev fix up. On the 'admin up' side move
    addrconf_permanent_addr into the NETDEV_UP event handling so that it
    runs only once versus on UP and CHANGE events.

    All of the current panics and warnings appear to be related to
    addresses on the loopback device, but given the catastrophic nature when
    a bug is triggered this patch takes the conservative approach and evicts
    all host routes rather than trying to determine when they can be re-used
    and when they cannot. That can be a later optimization if desired.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • This reverts commit 841645b5f2dfceac69b78fcd0c9050868d41ea61.

    Ok, this puts the feature back. I've decided to apply David A.'s
    bug fix and run with that rather than make everyone wait another
    whole release for this feature.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This reverts the following three commits:

    70af921db6f8835f4b11c65731116560adb00c14
    799977d9aafbf0ca0b9c39b04cbfb16db71302c9
    f1705ec197e705b79ea40fe7a2cc5acfa1d3bfac

    The feature was ill conceived, has terrible semantics, and has added
    nothing but regressions to the already fragile ipv6 stack.

    Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
    Signed-off-by: David S. Miller

    David S. Miller
     
  • Starting the kernel client with cephx disabled and then enabling cephx
    and restarting userspace daemons can result in a crash:

    [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
    [262671.531460] IP: [] kfree+0x5a/0x130
    [262671.584334] PGD 0
    [262671.635847] Oops: 0000 [#1] SMP
    [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
    [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
    [262672.268937] Workqueue: ceph-msgr con_work [libceph]
    [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
    [262672.428330] RIP: 0010:[] [] kfree+0x5a/0x130
    [262672.535880] RSP: 0018:ffff880149aeba58 EFLAGS: 00010286
    [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
    [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
    [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
    [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
    [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
    [262673.131722] FS: 0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
    [262673.245377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
    [262673.417556] Stack:
    [262673.472943] ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
    [262673.583767] ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
    [262673.694546] ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
    [262673.805230] Call Trace:
    [262673.859116] [] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
    [262673.968705] [] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
    [262674.078852] [] put_osd+0x45/0x80 [libceph]
    [262674.134249] [] remove_osd+0xae/0x140 [libceph]
    [262674.189124] [] __reset_osd+0x103/0x150 [libceph]
    [262674.243749] [] kick_requests+0x223/0x460 [libceph]
    [262674.297485] [] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
    [262674.350813] [] dispatch+0x4e/0x720 [libceph]
    [262674.403312] [] try_read+0x3d1/0x1090 [libceph]
    [262674.454712] [] ? dequeue_entity+0x152/0x690
    [262674.505096] [] con_work+0xcb/0x1300 [libceph]
    [262674.555104] [] process_one_work+0x14e/0x3d0
    [262674.604072] [] worker_thread+0x11a/0x470
    [262674.652187] [] ? rescuer_thread+0x310/0x310
    [262674.699022] [] kthread+0xd2/0xf0
    [262674.744494] [] ? kthread_create_on_node+0x1c0/0x1c0
    [262674.789543] [] ret_from_fork+0x3f/0x70
    [262674.834094] [] ? kthread_create_on_node+0x1c0/0x1c0

    What happens is the following:

    (1) new MON session is established
    (2) old "none" ac is destroyed
    (3) new "cephx" ac is constructed
    ...
    (4) old OSD session (w/ "none" authorizer) is put
    ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

    osd->o_auth.authorizer in the "none" case is just a bare pointer into
    ac, which contains a single static copy for all services. By the time
    we get to (4), "none" ac, freed in (2), is long gone. On top of that,
    a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
    so we end up trying to destroy a "none" authorizer with a "cephx"
    destructor operating on invalid memory!

    To fix this, decouple authorizer destruction from ac and do away with
    a single static "none" authorizer by making a copy for each OSD or MDS
    session. Authorizers themselves are independent of ac and so there is
    no reason for destroy_authorizer() to be an ac op. Make it an op on
    the authorizer itself by turning ceph_authorizer into a real struct.
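
    The shape of that change can be sketched as follows: each session gets
    its own heap-allocated authorizer, and the destructor travels with the
    object. The names and the destroy counter are assumptions for the
    example, not the libceph definitions.

```c
#include <stdlib.h>

/* The authorizer carries its own destructor. */
struct authorizer {
    void (*destroy)(struct authorizer *a);
};

/* A "none"-style authorizer; one copy is allocated per session. */
struct none_authorizer {
    struct authorizer base;   /* first member, so the cast below is valid */
    int *destroy_count;       /* hypothetical, used only to observe frees */
};

static void none_destroy(struct authorizer *a)
{
    struct none_authorizer *au = (struct none_authorizer *)a;

    if (au->destroy_count)
        (*au->destroy_count)++;
    free(au);
}

struct authorizer *none_create(int *destroy_count)
{
    struct none_authorizer *au = malloc(sizeof(*au));

    if (!au)
        return NULL;
    au->base.destroy = none_destroy;
    au->destroy_count = destroy_count;
    return &au->base;
}

/* No auth-client argument: the right destructor is found on the object. */
void authorizer_destroy(struct authorizer *a)
{
    if (a)
        a->destroy(a);
}
```

    Because the destructor is chosen when the authorizer is created, swapping
    the auth client later cannot pair an old authorizer with the wrong
    destructor.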

    Fixes: http://tracker.ceph.com/issues/15447

    Reported-by: Alan Zhang
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     

25 Apr, 2016

4 commits

  • After commit fbd40ea0180a ("ipv4: Don't do expensive useless work
    during inetdev destroy.") when deleting an interface,
    fib_del_ifaddr() can be executed without any primary address
    present on the dead interface.

    The above is safe, but triggers some "bug: prim == NULL" warnings.

    This commit avoids the warning if the in_dev is dead.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • There is a race condition when updating the mdb offload flag without using
    the multicast_lock. This reverts commit 9e8430f8d60d98 ("bridge: mdb:
    Passing the port-group pointer to br_mdb module").

    This patch marks an offloaded MDB entry as "offload" by setting
    MDB_PG_FLAGS_OFFLOAD in the port-group flags.

    When switchdev PORT_MDB succeeds and adds a multicast group, the completion
    callback "br_mdb_complete" is invoked. The completion function
    locks the multicast_lock, finds the right net_bridge_port_group, and
    marks it as offloaded.

    Fixes: 9e8430f8d60d98 ("bridge: mdb: Passing the port-group pointer to br_mdb module")
    Reported-by: Nikolay Aleksandrov
    Signed-off-by: Elad Raz
    Signed-off-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Elad Raz
     
  • There is duplicate code that translates br_mdb_entry to br_ip. Let's wrap
    it in a common function.

    Signed-off-by: Elad Raz
    Signed-off-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Elad Raz
     
  • When using a switchdev deferred operation (SWITCHDEV_F_DEFER), the operation
    is executed in a different context and the application has no way
    to get the real status of the operation.

    Adding a completion callback fixes that. This patch adds a "complete"
    callback to switchdev_attr and switchdev_obj, along with a "complete_priv"
    field which is used by that callback.

    The application can set a complete function which will be called once the
    operation has executed.
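
    A generic sketch of the deferred-operation-plus-completion pattern. The
    struct layout, the callback signature and the helper functions are
    hypothetical and simplified; they are not the switchdev definitions.

```c
#include <stddef.h>

/* Hypothetical completion signature: status plus caller-private data. */
typedef void (*complete_fn)(int err, void *priv);

struct deferred_op {
    int (*run)(void *ctx);    /* the work, executed later */
    void *ctx;
    void *complete_priv;      /* handed back to the completion callback */
    complete_fn complete;     /* optional */
};

/* Runs in the deferred context and reports the real status back. */
void deferred_process(struct deferred_op *op)
{
    int err = op->run(op->ctx);

    if (op->complete)
        op->complete(err, op->complete_priv);
}

/* Example operations and a callback that records the status. */
int op_ok(void *ctx)   { (void)ctx; return 0; }
int op_fail(void *ctx) { (void)ctx; return -95; }   /* arbitrary error */

void record_status(int err, void *priv)
{
    *(int *)priv = err;
}
```

    The caller that queued the operation learns the outcome through
    record_status() even though it has long since returned.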

    Signed-off-by: Elad Raz
    Signed-off-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Elad Raz
     

24 Apr, 2016

5 commits

  • When removing a single interface while a broadcast or ogm packet is
    still pending then we will free the forward packet without releasing the
    queue slots again.

    This patch is supposed to fix this issue.

    Fixes: 6d5808d4ae1b ("batman-adv: Add missing hardif_free_ref in forw_packet_free")
    Signed-off-by: Linus Lüssing
    [sven@narfation.org: fix conflicts with current version]
    Signed-off-by: Sven Eckelmann
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    Linus Lüssing
     
  • _batadv_update_route rcu_dereferences orig_ifinfo->router outside of a
    spinlock protected region to print some information messages to the debug
    log. But this pointer is not checked again when the new pointer is assigned
    in the spinlock protected region. Thus it can happen that the value of
    orig_ifinfo->router changed in the meantime and thus the reference counter
    of the wrong router gets reduced after the spinlock protected region.

    Just rcu_dereferencing the value of orig_ifinfo->router inside the spinlock
    protected region (which also sets the new pointer) is enough to get the
    correct old router object.
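
    The fix amounts to reading the old pointer and writing the new one under
    the same lock. A userspace sketch, assuming a pthread mutex in place of
    the spinlock and plain integer reference counts:

```c
#include <pthread.h>
#include <stddef.h>

struct router {
    int refcount;   /* plain int: this sketch ignores atomicity of refs */
};

struct ifinfo {
    pthread_mutex_t lock;     /* stands in for the spinlock */
    struct router *router;
};

void ifinfo_init(struct ifinfo *i)
{
    pthread_mutex_init(&i->lock, NULL);
    i->router = NULL;
}

void update_route(struct ifinfo *i, struct router *new_router)
{
    struct router *old;

    if (new_router)
        new_router->refcount++;

    pthread_mutex_lock(&i->lock);
    old = i->router;            /* read inside the locked region ... */
    i->router = new_router;     /* ... where the new pointer is also set */
    pthread_mutex_unlock(&i->lock);

    if (old)
        old->refcount--;        /* guaranteed to be the replaced router */
}
```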

    Fixes: e1a5382f978b ("batman-adv: Make orig_node->router an rcu protected pointer")
    Signed-off-by: Sven Eckelmann
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    Sven Eckelmann
     
  • The shutdown of a batman-adv interface can happen with one of its slave
    interfaces still being in the BATADV_IF_TO_BE_ACTIVATED state. A possible
    reason for it is that the routing algorithm BATMAN_V was selected and
    batadv_schedule_bat_ogm was not yet called for this interface. This slave
    interface still has to be set to BATADV_IF_INACTIVE or the batman-adv
    interface will never reduce its usage counter and thus never gets shut down.

    This problem can be simulated via:

    $ modprobe dummy
    $ modprobe batman-adv routing_algo=BATMAN_V
    $ ip link add bat0 type batadv
    $ ip link set dummy0 master bat0
    $ ip link set dummy0 up
    $ ip link del bat0
    unregister_netdevice: waiting for bat0 to become free. Usage count = 3

    Reported-by: Matthias Schiffer
    Signed-off-by: Sven Eckelmann
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    Sven Eckelmann
     
  • Signed-off-by: Marek Lindner
    [sven@narfation.org: fix conflicts with current version]
    Signed-off-by: Sven Eckelmann
    Signed-off-by: Antonio Quartulli

    Marek Lindner
     
  • The encapsulated ethernet and VLAN header may be outside the received
    ethernet frame. Thus the skb buffer size has to be checked before it can be
    parsed to find out if it encapsulates another batman-adv packet.
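
    The kind of length check described above can be sketched as follows. This
    is an illustration in Python, not the batman-adv implementation: the
    function name encapsulated_proto and the offset parameter are
    hypothetical, and the caller is assumed to know where the encapsulated
    frame starts.

```python
import struct

ETH_HLEN = 14          # destination MAC, source MAC, EtherType
VLAN_HLEN = 4          # 802.1Q tag
ETH_P_8021Q = 0x8100   # EtherType announcing a VLAN tag
ETH_P_BATMAN = 0x4305  # EtherType of batman-adv packets


def encapsulated_proto(buf, offset):
    """Return the EtherType of the frame starting at offset, or None
    when the buffer is too short to contain the required headers."""
    if len(buf) < offset + ETH_HLEN:
        return None  # Ethernet header lies outside the buffer
    (proto,) = struct.unpack_from("!H", buf, offset + 12)
    if proto == ETH_P_8021Q:
        if len(buf) < offset + ETH_HLEN + VLAN_HLEN:
            return None  # VLAN header lies outside the buffer
        (proto,) = struct.unpack_from("!H", buf, offset + 16)
    return proto
```

    A caller would compare the result with ETH_P_BATMAN only after the
    function has confirmed that the headers are fully inside the buffer.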

    Fixes: 420193573f11 ("batman-adv: softif bridge loop avoidance")
    Signed-off-by: Sven Eckelmann
    Signed-off-by: Marek Lindner
    Signed-off-by: Antonio Quartulli

    Sven Eckelmann
     

22 Apr, 2016

4 commits

  • Pull networking fixes from David Miller:

    1) Fix memory leak in iwlwifi, from Matti Gottlieb.

    2) Add missing registration of netfilter arp_tables into initial
    namespace, from Florian Westphal.

    3) Fix potential NULL deref in DecNET routing code.

    4) Restrict NETLINK_URELEASE to truly bound sockets only, from Dmitry
    Ivanov.

    5) Fix dst ref counting in VRF, from David Ahern.

    6) Fix TSO segmenting limits in i40e driver, from Alexander Duyck.

    7) Fix heap leak in PACKET_DIAG_MCLIST, from Mathias Krause.

    8) Revalidate IPv6 datagram socket cached routes properly, particularly
    with UDP, from Martin KaFai Lau.

    9) Fix endian bug in RDS dp_ack_seq handling, from Qing Huang.

    10) Fix stats typing in bcmgenet driver, from Eric Dumazet.

    11) Openvswitch needs to orphan SKBs before ipv6 fragmentation handling,
    from Joe Stringer.

    12) SPI device reference leak in spi_ks8895 PHY driver, from Mark Brown.

    13) atl2 doesn't actually support scatter-gather, so don't advertise the
    feature. From Ben Hutchings.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
    openvswitch: use flow protocol when recalculating ipv6 checksums
    Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets
    atl2: Disable unimplemented scatter/gather feature
    net/mlx4_en: Split SW RX dropped counter per RX ring
    net/mlx4_core: Don't allow to VF change global pause settings
    net/mlx4_core: Avoid repeated calls to pci enable/disable
    net/mlx4_core: Implement pci_resume callback
    net: phy: spi_ks8895: Don't leak references to SPI devices
    net: ethernet: davinci_emac: Fix platform_data overwrite
    net: ethernet: davinci_emac: Fix Unbalanced pm_runtime_enable
    qede: Fix single MTU sized packet from firmware GRO flow
    qede: Fix setting Skb network header
    qede: Fix various memory allocation error flows for fastpath
    tcp: Merge tx_flags and tskey in tcp_shifted_skb
    tcp: Merge tx_flags and tskey in tcp_collapse_retrans
    drivers: net: cpsw: fix wrong regs access in cpsw_ndo_open
    tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
    openvswitch: Orphan skbs before IPv6 defrag
    Revert "Prevent NUll pointer dereference with two PHYs on cpsw"
    VSOCK: Only check error on skb_recv_datagram when skb is NULL
    ...

    Linus Torvalds
     
  • When using masked actions, the ipv6_proto field of an action
    to set IPv6 fields may be zero rather than the prevailing protocol,
    which results in checksum recalculation being skipped.

    This patch resolves the problem by relying on the protocol
    in the flow key rather than that in the set field action.
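
    The effect of choosing the protocol source can be shown with a toy model.
    This is a sketch in Python, not the openvswitch code: checksum_proto and
    recalculates_checksum are hypothetical names, and the set of protocols is
    simply those whose checksum covers the IPv6 pseudo-header.

```python
IPPROTO_TCP = 6
IPPROTO_UDP = 17
IPPROTO_ICMPV6 = 58

# Protocols whose checksum includes the IPv6 addresses.
CHECKSUMMED = {IPPROTO_TCP, IPPROTO_UDP, IPPROTO_ICMPV6}


def checksum_proto(flow_key_proto, action_proto, use_flow_key):
    """Pick the protocol that decides whether a checksum is recalculated."""
    return flow_key_proto if use_flow_key else action_proto


def recalculates_checksum(flow_key_proto, action_proto, use_flow_key):
    proto = checksum_proto(flow_key_proto, action_proto, use_flow_key)
    return proto in CHECKSUMMED
```

    For a TCP flow whose masked set action carries a zero protocol,
    recalculates_checksum(IPPROTO_TCP, 0, False) is False, while
    recalculates_checksum(IPPROTO_TCP, 0, True) is True.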

    Fixes: 83d2b9ba1abc ("net: openvswitch: Support masked set actions.")
    Cc: Jarno Rajahalme
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Simon Horman
     
  • After receiving SACKs, tcp_shifted_skb() will collapse
    skbs if possible. tx_flags and tskey also have to be
    merged.

    This patch reuses tcp_skb_collapse_tstamp() to handle
    them.

    BPF Output Before:
    ~~~~~

    BPF Output After:
    ~~~~~
    -2024 [007] d.s. 88.644374: : ee_data:14599

    Packetdrill Script:
    ~~~~~
    +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
    +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

    0.200 write(4, ..., 1460) = 1460
    +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
    0.200 write(4, ..., 13140) = 13140

    0.200 > P. 1:1461(1460) ack 1
    0.200 > . 1461:8761(7300) ack 1
    0.200 > P. 8761:14601(5840) ack 1

    0.300 < . 1:1(0) ack 1 win 257
    0.300 > P. 1:1461(1460) ack 1
    0.400 < . 1:1(0) ack 14601 win 257

    0.400 close(4) = 0
    0.400 > F. 14601:14601(0) ack 1
    0.500 < F. 1:1(0) ack 14602 win 257
    0.500 > . 14602:14602(0) ack 2
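
    The numeric values in the script can be decoded as follows. This is a
    reading aid in Python, not part of the patch: it assumes the generic
    socket option numbering, where option 37 is SO_TIMESTAMPING, and it
    assumes the timestamp key counts bytes from the first data byte of the
    stream, which fits this script because nothing has been acknowledged
    when the option is set. The helper names are hypothetical.

```python
SO_TIMESTAMPING = 37  # generic socket option numbering (assumption)

FLAGS = {
    "SOF_TIMESTAMPING_OPT_ID": 1 << 7,
    "SOF_TIMESTAMPING_TX_ACK": 1 << 9,
    "SOF_TIMESTAMPING_OPT_TSONLY": 1 << 11,
}


def decode(value):
    """Return the names of the known flags that are set in value."""
    return sorted(name for name, bit in FLAGS.items() if value & bit)


def last_byte_key(write_sizes):
    """Byte offset of the last byte written so far, counted from zero."""
    return sum(write_sizes) - 1


print(decode(2688))
print(last_byte_key([1460, 13140]))
```

    Under these assumptions, 2688 is the combination of OPT_ID, TX_ACK and
    OPT_TSONLY, and the two writes of 1460 and 13140 bytes end at offset
    14599, which matches the ee_data value in the BPF output.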

    Signed-off-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Soheil Hassas Yeganeh
    Cc: Willem de Bruijn
    Cc: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Tested-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • If two skbs are merged/collapsed during retransmission, the current
    logic does not merge the tx_flags and tskey. As a result,
    the SCM_TSTAMP_ACK timestamp could be missing for a packet.

    The patch:
    1. Merge the tx_flags
    2. Overwrite the prev_skb's tskey with the next_skb's tskey
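
    The two steps can be sketched on a simplified model. This is an
    illustration in Python, not the kernel code: Skb, collapse and the
    TX_ACK_TSTAMP bit are hypothetical, and the sketch assumes the key is
    only taken over when the next skb actually requested a timestamp.

```python
from dataclasses import dataclass

TX_ACK_TSTAMP = 0x1  # hypothetical flag bit: timestamp wanted on ACK


@dataclass
class Skb:
    length: int
    tx_flags: int = 0
    tskey: int = 0


def collapse(prev_skb, next_skb):
    """Merge next_skb into prev_skb, keeping its timestamp request."""
    prev_skb.length += next_skb.length
    if next_skb.tx_flags:
        prev_skb.tx_flags |= next_skb.tx_flags  # 1. merge the tx_flags
        prev_skb.tskey = next_skb.tskey         # 2. take over the tskey
    return prev_skb
```

    For example, collapsing a 730-byte skb without flags and a 730-byte skb
    with TX_ACK_TSTAMP and key 1459 gives one 1460-byte skb that still
    carries the flag and the key 1459.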

    BPF Output Before:
    ~~~~~~

    BPF Output After:
    ~~~~~~
    packetdrill-2092 [001] d.s. 453.998486: : ee_data:1459

    Packetdrill Script:
    ~~~~~~
    +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
    +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

    0.200 write(4, ..., 730) = 730
    +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
    0.200 write(4, ..., 730) = 730
    +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
    0.200 write(4, ..., 11680) = 11680
    +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0

    0.200 > P. 1:731(730) ack 1
    0.200 > P. 731:1461(730) ack 1
    0.200 > . 1461:8761(7300) ack 1
    0.200 > P. 8761:13141(4380) ack 1

    0.300 < . 1:1(0) ack 1 win 257
    0.300 < . 1:1(0) ack 1 win 257
    0.300 < . 1:1(0) ack 1 win 257
    0.300 > P. 1:1461(1460) ack 1
    0.400 < . 1:1(0) ack 13141 win 257

    0.400 close(4) = 0
    0.400 > F. 13141:13141(0) ack 1
    0.500 < F. 1:1(0) ack 13142 win 257
    0.500 > . 13142:13142(0) ack 2

    Signed-off-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Soheil Hassas Yeganeh
    Cc: Willem de Bruijn
    Cc: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Tested-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Martin KaFai Lau