24 Aug, 2016

13 commits

  • In hci_req_sync_complete the event skb is referenced in hdev->req_skb.
    It is used (via hci_req_run_skb) from either __hci_cmd_sync_ev which will
    pass the skb to the caller, or __hci_req_sync which leaks.

    unreferenced object 0xffff880005339a00 (size 256):
    comm "kworker/u3:1", pid 1011, jiffies 4294671976 (age 107.389s)
    backtrace:
    [] kmemleak_alloc+0x49/0xa0
    [] kmem_cache_alloc+0x128/0x180
    [] skb_clone+0x4f/0xa0
    [] hci_event_packet+0xc1/0x3290
    [] hci_rx_work+0x18b/0x360
    [] process_one_work+0x14a/0x440
    [] worker_thread+0x43/0x4d0
    [] kthread+0xc4/0xe0
    [] ret_from_fork+0x1f/0x40
    [] 0xffffffffffffffff

    Signed-off-by: Frédéric Dalleau
    Signed-off-by: Marcel Holtmann

    Frederic Dalleau
     
  • inet_diag_find_one_icsk takes a reference to a socket that is not
    released if sock_diag_destroy returns an error. Fix by changing
    tcp_diag_destroy to manage the refcnt for all cases and remove
    the sock_put calls from tcp_abort.

    Fixes: c1e64e298b8ca ("net: diag: Support destroying TCP sockets")
    Reported-by: Lorenzo Colitti
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Instead of using sock_tx_timestamp, use skb_tx_timestamp to record
    software transmit timestamp of a packet.

    sock_tx_timestamp resets and overrides the tx_flags of the skb.
    The function is intended to be called from within the protocol
    layer when creating the skb, not from a device driver. This is
    inconsistent with other drivers and will cause issues for TCP.

    In TCP, we intend to sample the timestamps for the last byte
    for each sendmsg/sendpage. For that reason, tcp_sendmsg calls
    tcp_tx_timestamp only with the last skb that it generates.
    For example, if a 128KB message is split into two 64KB packets
    we want to sample the SND timestamp of the last packet. The current
    code in the tun driver, however, will result in sampling the SND
    timestamp for both packets.

    Also, when the last packet is split into smaller packets for
    retransmission (see tcp_fragment), the tun driver will record
    timestamps for all of the retransmitted packets and not only the
    last packet.

    Fixes: eda297729171 (tun: Support software transmit time stamping.)
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Francis Yan
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     
  • After commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
    we do not need this special allocation mode anymore, even if it is
    harmless.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The function sctp_diag_dump_one() currently performs a memcpy()
    of 64 bytes from a 16 byte field into another 16 byte field. Fix
    by using correct size, use sizeof to obtain correct size instead
    of using a hard-coded constant.

    Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file")
    Signed-off-by: Lance Richardson
    Reviewed-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Lance Richardson
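
    The sizeof-based copy described in the sctp_diag entry above can be
    sketched as follows. This is a minimal user-space illustration, not the
    kernel code: struct diag_req, its guard field, and copy_addr() are
    hypothetical names invented for the example.

```c
#include <string.h>

/* Hypothetical record with a 16-byte address field; the real kernel
 * structures differ. */
struct diag_req {
    unsigned char addr[16];
    unsigned char guard[48];   /* stands in for whatever follows the field */
};

/* Copy the address using the destination field's own size rather than a
 * hard-coded byte count, so the copy cannot run past the field. */
static void copy_addr(struct diag_req *dst, const struct diag_req *src)
{
    memcpy(dst->addr, src->addr, sizeof(dst->addr));
}
```

    With a hard-coded 64 the copy would also overwrite guard; with sizeof it
    stays inside the 16-byte field even if the field is later resized.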
     
  • We currently enable interrupts before we enable NAPI. If an RX interrupt
    hits before we enabled NAPI then the NAPI callback is never called and
    we leave the hardware with RX interrupts disabled, which of course leads
    us to never handling received packets. Fix this by moving the interrupt
    enable to after we've enabled NAPI and the reclaim tasklet.

    Fixes: cd5e41234729 ("dwc_eth_qos: do phy_start before resetting hardware")
    Signed-off-by: Rabin Vincent
    Signed-off-by: Lars Persson
    Signed-off-by: David S. Miller

    Rabin Vincent
     
  • clk_prepare_enable() may fail, so we should check its return
    value and propagate it in case of failure.

    While at it, replace __lpc_eth_clock_enable() with a plain
    clk_prepare_enable/clk_disable_unprepare() call in order to
    simplify the code.

    Signed-off-by: Fabio Estevam
    Acked-by: Vladimir Zapolskiy
    Signed-off-by: David S. Miller

    Fabio Estevam
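
    The check-and-propagate pattern from the entry above, as a rough
    user-space sketch. fake_clk_prepare_enable() is a hypothetical stand-in
    for clk_prepare_enable(), and open_device() is an invented caller; only
    the error-handling shape is the point.

```c
#include <errno.h>

/* Hypothetical stand-in for clk_prepare_enable(): returns 0 on success or
 * a negative errno value on failure. */
static int fake_clk_prepare_enable(int should_fail)
{
    return should_fail ? -EIO : 0;
}

/* Invented caller: enable the clock first and hand any error straight back
 * to the caller instead of carrying on with a disabled clock. */
static int open_device(int clk_should_fail, int *clk_enabled)
{
    int ret = fake_clk_prepare_enable(clk_should_fail);

    if (ret)
        return ret;

    *clk_enabled = 1;
    return 0;
}
```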
     
  • The PORT_RATE_CONTROL register works differently on 88e6095/6095f/6131
    in comparison to 6123/61/65, and 0x0 disables. The distinction was lost
    between Linux 4.1 and 4.2.

    Signed-off-by: Jamie Lentin
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Jamie Lentin
     
  • Like the ksz8081, the ksz9031 has the behavior where it will clear the
    interrupt enable bits when leaving power down. This takes advantage of the
    solution provided by f5aba91.

    Signed-off-by: Xander Huff
    Signed-off-by: Nathan Sullivan
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Xander Huff
     
  • When sending an ack in SYN_RECV state, we must scale the offered
    window if wscale option was negotiated and accepted.

    Tested:
    Following packetdrill test demonstrates the issue :

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0

    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    // Establish a connection.
    +0 < S 0:0(0) win 20000
    +0 > S. 0:0(0) ack 1 win 28960

    +0 < . 1:11(10) ack 1 win 156
    // check that window is properly scaled !
    +0 > . 1:1(0) ack 1 win 226

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
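
    The window arithmetic behind the packetdrill test above can be sketched
    as follows. The test as reproduced here does not show the TCP options,
    so the scale factor of 7 used in the usage note is an assumption, chosen
    because 28960 >> 7 = 226 matches the expected window. The function name
    and parameters are hypothetical.

```c
#include <stdint.h>

/* Value to place in the 16-bit TCP window field: the receive window in
 * bytes, shifted right by the negotiated scale when window scaling was
 * accepted, and capped at 65535. Assumes wscale <= 14. */
static uint16_t advertised_window(uint32_t rcv_wnd, unsigned int wscale,
                                  int wscale_ok)
{
    uint32_t win = wscale_ok ? (rcv_wnd >> wscale) : rcv_wnd;

    return (uint16_t)(win > 65535u ? 65535u : win);
}
```

    Under that assumption, advertised_window(28960, 7, 1) gives 226, while
    leaving the value unscaled would advertise 28960 in a field the peer
    then multiplies by 128.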
     
  • The current scatter-gather logic in gianfar is flawed, since
    it does not consider that the eTSEC's RxBD 'Data Length' field is
    context dependent: for the last fragment it contains the full
    frame size, while for the other fragments it contains the fragment
    size, which equals the value written to register MRBLR.

    This causes data corruption as soon as the hardware starts
    to fragment received frames. As a result, the size of
    fragmented frames is increased by
    (nr_frags - 1) * MRBLR

    We first noticed this issue working with DSA, where an ICMP
    request sized 1472 bytes causes the scatter-gather logic to
    kick in. The full Ethernet frame (1518) gets increased by
    DSA (4), GMAC_FCB_LEN (8), and FSL_GIANFAR_DEV_HAS_TIMER
    (priv->padding=8) to a total of 1538 octets, which is
    fragmented by the hardware and reconstructed by the driver
    to a 3074 octet frame.

    This patch fixes the problem by adjusting the size of
    the last fragment.

    It was tested by setting MRBLR to different multiples of
    64, proving correct scatter-gather operation on frames
    with up to 9000 octets in size.

    Signed-off-by: Zefir Kurtisi
    Signed-off-by: David S. Miller

    Zefir Kurtisi
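
    The numbers in the gianfar entry above can be checked with a small
    sketch. The function names are hypothetical and this is plain
    arithmetic, not the driver code; it assumes nr_frags >= 1.

```c
/* Size that results from summing the raw 'Data Length' fields: every
 * fragment before the last reports MRBLR, the last reports the full
 * frame length. */
static unsigned int naive_total_len(unsigned int full_frame_len,
                                    unsigned int nr_frags,
                                    unsigned int mrblr)
{
    return (nr_frags - 1) * mrblr + full_frame_len;
}

/* Actual payload carried by the last fragment: the full frame length
 * minus what the earlier fragments already delivered. */
static unsigned int last_frag_len(unsigned int full_frame_len,
                                  unsigned int nr_frags,
                                  unsigned int mrblr)
{
    return full_frame_len - (nr_frags - 1) * mrblr;
}
```

    For the 1538 octet frame with MRBLR = 1536 and two fragments, the naive
    sum is 3074 and the adjusted last fragment is 2 octets, so the
    reconstructed frame is 1536 + 2 = 1538 again.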
     
  • The eTSEC register MRBLR defines the maximum space in
    the RX buffers and is set to 1536 by gianfar. This
    reasonably covers the common use case where the MTU
    is kept at default 1500. In that case, the largest
    Ethernet frame size of 1518 plus an optional
    GMAC_FCB_LEN of 8, and an additional padding of 8
    to handle FSL_GIANFAR_DEV_HAS_TIMER totals to 1534
    and nicely fits within the chosen MRBLR.

    Alas, if the eTSEC is attached to a DSA enabled switch,
    the (E)DSA header extension (4 or 8 bytes) causes every
    maximum sized frame to be fragmented by the hardware.

    This patch increases the maximum RX buffer size by 8
    and rounds up to the next multiple of 64, which the
    hardware defines as the RX buffer granularity.

    Signed-off-by: Zefir Kurtisi
    Signed-off-by: David S. Miller

    Zefir Kurtisi
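
    The buffer size calculation described in the entry above works out as
    follows. The helper names and the RX_BUF_GRANULARITY macro are
    hypothetical; the values 1536, 8 and 64 come from the commit message.

```c
#define RX_BUF_GRANULARITY 64u  /* RX buffer granularity named in the text */

/* Round value up to the next multiple of `multiple` (multiple > 0). */
static unsigned int round_up_to(unsigned int value, unsigned int multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

/* New RX buffer size: old size plus the extra header room, rounded up to
 * the hardware granularity. */
static unsigned int rx_buffer_size(unsigned int old_size, unsigned int extra)
{
    return round_up_to(old_size + extra, RX_BUF_GRANULARITY);
}
```

    rx_buffer_size(1536, 8) yields 1600: 1544 is not a multiple of 64, so it
    rounds up to 25 * 64. A maximum sized frame with an 8 byte EDSA header
    (1518 + 8 + 8 + 8 = 1542) then fits in a single buffer.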
     
  • Laura tracked poll() [and friends] regression caused by commit
    e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

    udp_poll() needs to know if there is a valid packet in receive queue,
    even if its payload length is 0.

    Change first_packet_length() to return a signed int, and use -1
    as the indication of an empty queue.

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Laura Abbott
    Signed-off-by: Eric Dumazet
    Tested-by: Laura Abbott
    Signed-off-by: David S. Miller

    Eric Dumazet
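
    Why the return type needs to be signed can be seen in a simplified
    user-space model. The structures and function names below are
    hypothetical and only mimic the idea; they are not the kernel's
    first_packet_length() or udp_poll().

```c
#include <stddef.h>

/* Hypothetical queue model: an array of packets and a count. */
struct pkt {
    int payload_len;
};

struct rx_queue {
    const struct pkt *pkts;
    size_t count;
};

/* Payload length of the first queued packet, or -1 when the queue is
 * empty. A zero-length packet therefore returns 0, not "empty". */
static int model_first_packet_length(const struct rx_queue *q)
{
    if (q->count == 0)
        return -1;
    return q->pkts[0].payload_len;
}

/* Poll-style readiness test: readable whenever any packet is queued,
 * including one with an empty payload. */
static int model_queue_readable(const struct rx_queue *q)
{
    return model_first_packet_length(q) >= 0;
}
```

    With an unsigned return value and 0 meaning "empty", a queued
    zero-length datagram would be indistinguishable from an empty queue.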
     

23 Aug, 2016

12 commits

  • Encoding of the metadata was using the padded length as opposed to
    the real length of the data, which is a bug per the specification.
    This has not been an issue to date because all metadata specified
    so far have been 32 bits wide, where the aligned and real data lengths
    are the same. This also includes a bug fix for validating the length
    of a u16 field. Since there is no metadata of size u16 yet, we are
    fine to include it here.

    While at it get rid of magic numbers.

    Fixes: ef6980b6becb ("net sched: introduce IFE action")
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
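
    The difference between the real and the padded length in the entry
    above can be illustrated with a generic TLV encoder. The layout used
    here (a 4-byte header holding a 16-bit type and a 16-bit length in
    network byte order, a length that covers header plus data, and 4-byte
    alignment) is an assumption made for the sketch, and tlv_encode() and
    align4() are hypothetical helpers, not the kernel's IFE functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_HDR_LEN 4u  /* assumed header: 16-bit type + 16-bit length */

/* Round n up to the next multiple of 4 (assumed TLV alignment). */
static size_t align4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/*
 * Encode one TLV into buf in network byte order. buf must hold at least
 * align4(TLV_HDR_LEN + dlen) bytes. Returns the bytes consumed, padding
 * included.
 */
static size_t tlv_encode(uint8_t *buf, uint16_t type, const void *data,
                         uint16_t dlen)
{
    uint16_t len = (uint16_t)(TLV_HDR_LEN + dlen); /* real length, unpadded */
    size_t total = align4(len);

    memset(buf, 0, total);              /* zero the padding bytes too */
    buf[0] = (uint8_t)(type >> 8);
    buf[1] = (uint8_t)(type & 0xff);
    buf[2] = (uint8_t)(len >> 8);
    buf[3] = (uint8_t)(len & 0xff);
    memcpy(buf + TLV_HDR_LEN, data, dlen);
    return total;
}
```

    For a 2-byte value the length field holds 6 while 8 bytes are written;
    for a 4-byte value both are 8, which is why the bug stayed hidden while
    all metadata were 32 bits wide.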
     
  • Driver never bothered marking the VF's vport with the VF's sw_fid.
    As a result, FLR flows are not going to clean those vports.

    If the vport was active when FLRed, re-activating it would lead
    to a FW assertion.

    Fixes: dacd88d6f6851 ("qed: IOV l2 functionality")
    Signed-off-by: Yuval Mintz
    Signed-off-by: David S. Miller

    Yuval Mintz
     
  • In b8247f095e,

    "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"

    gso skbs arriving from an ingress interface that go through UDP
    tunneling, are allowed to be fragmented if the resulting encapsulated
    segments exceed the dst mtu of the egress interface.

    This aligned the behavior of gso skbs to non-gso skbs going through udp
    encapsulation path.

    However the non-gso vs gso anomaly is present also in the following
    cases of a GRE tunnel:
    - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
    (e.g. OvS vport-gre with df_default=false)
    - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set

    In both of the above cases, the non-gso skbs get fragmented, whereas the
    gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
    as they don't go through the segment+fragment code path.

    Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set.

    Tunnels that do set IP_DF, will not go to fragmentation of segments.
    This preserves behavior of ip_gre in (the default) pmtudisc mode.

    Fixes: b8247f095e ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
    Reported-by: wenxu
    Cc: Hannes Frederic Sowa
    Signed-off-by: Shmulik Ladkani
    Tested-by: wenxu
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Shmulik Ladkani
     
  • If DAD fails with accept_dad set to 2, global addresses and host routes
    are incorrectly left in place. Even though disable_ipv6 is set,
    contrary to documentation, the addresses are not dynamically deleted
    from the interface. It is only on a subsequent link down/up that these
    are removed. The fix is not only to set the disable_ipv6 flag, but
    also to call addrconf_ifdown(), which is the action to carry out when
    disabling IPv6. This results in the addresses and routes being deleted
    immediately. The DAD failure for the LL addr is determined as before
    via netlink, or by the absence of the LL addr (which also previously
    would have had to be checked for in case of an intervening link down
    and up). As the call to addrconf_ifdown() requires an rtnl lock, the
    logic to disable IPv6 when DAD fails is moved to addrconf_dad_work().

    Previous behavior:

    root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
    net.ipv6.conf.eth3.accept_dad = 2
    root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
    root@vm1:/# ip link set up eth3
    root@vm1:/# ip -6 addr show dev eth3
    5: eth3: mtu 1500 qlen 1000
    inet6 2000::10/64 scope global
    valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe43:dd5a/64 scope link tentative dadfailed
    valid_lft forever preferred_lft forever
    root@vm1:/# ip -6 route show dev eth3
    2000::/64 proto kernel metric 256
    fe80::/64 proto kernel metric 256
    root@vm1:/# ip link set down eth3
    root@vm1:/# ip link set up eth3
    root@vm1:/# ip -6 addr show dev eth3
    root@vm1:/# ip -6 route show dev eth3
    root@vm1:/#

    New behavior:

    root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
    net.ipv6.conf.eth3.accept_dad = 2
    root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
    root@vm1:/# ip link set up eth3
    root@vm1:/# ip -6 addr show dev eth3
    root@vm1:/# ip -6 route show dev eth3
    root@vm1:/#

    Signed-off-by: Mike Manning
    Signed-off-by: David S. Miller

    Mike Manning
     
  • Fixes these compiler errors via libc-compat.h when glibc netipx/ipx.h is
    included before linux/ipx.h:

    ./linux/ipx.h:9:8: error: redefinition of ‘struct sockaddr_ipx’
    ./linux/ipx.h:26:8: error: redefinition of ‘struct ipx_route_definition’
    ./linux/ipx.h:32:8: error: redefinition of ‘struct ipx_interface_definition’
    ./linux/ipx.h:49:8: error: redefinition of ‘struct ipx_config_data’
    ./linux/ipx.h:58:8: error: redefinition of ‘struct ipx_route_def’

    Signed-off-by: Mikko Rapeli
    Signed-off-by: David S. Miller

    Mikko Rapeli
     
  • Kernel uapi headers are supposed to use them. Fixes userspace compile error:

    linux/openvswitch.h:583:2: error: unknown type name ‘uint32_t’

    Signed-off-by: Mikko Rapeli
    Signed-off-by: David S. Miller

    Mikko Rapeli
     
  • Fixes userspace compile error:

    error: field ‘real’ has incomplete type
    struct timeval real; /* real (wall-clock) time */

    Signed-off-by: Mikko Rapeli
    Signed-off-by: David S. Miller

    Mikko Rapeli
     
  • Fixes userspace compiler error:

    error: unknown type name ‘uint32_t’

    Signed-off-by: Mikko Rapeli
    Signed-off-by: David S. Miller

    Mikko Rapeli
     
  • Fixes userspace compilation errors:

    error: field ‘addr’ has incomplete type
    struct sockaddr_in addr; /* IP address and port to send to */

    error: field ‘addr’ has incomplete type
    struct sockaddr_in6 addr; /* IP address and port to send to */

    Signed-off-by: Mikko Rapeli
    Signed-off-by: David S. Miller

    Mikko Rapeli
     
  • Fixes userspace compilation errors like:

    error: field ‘addr’ has incomplete type
    struct sockaddr_in addr; /* IP address and port to send to */
    ^
    error: field ‘addr’ has incomplete type
    struct sockaddr_in6 addr; /* IP address and port to send to */

    Signed-off-by: Mikko Rapeli
    Signed-off-by: David S. Miller

    Mikko Rapeli
     
  • Fixes userspace compilation errors like:

    error: field ‘iph’ has incomplete type
    error: field ‘prefix’ has incomplete type

    Signed-off-by: Mikko Rapeli
    Signed-off-by: David S. Miller

    Mikko Rapeli
     
  • Fixes userspace compilation error:

    error: ‘IFNAMSIZ’ undeclared here (not in a function)

    Signed-off-by: Mikko Rapeli
    Signed-off-by: David S. Miller

    Mikko Rapeli
     

22 Aug, 2016

1 commit


21 Aug, 2016

1 commit


20 Aug, 2016

13 commits

  • Commit 3c8b3efc061a ("vmxnet3: allow variable length transmit data ring
    buffer") changed the size of the buffers in the tx data ring from a
    fixed size of 128 bytes to a variable size.

    However, while copying data to the data ring, vmxnet3_copy_hdr continues
    to carry the old code that assumes fixed buffer size of 128. This patch
    fixes it by adding correct offset based on the actual data ring buffer
    size.

    Signed-off-by: Guolin Yang
    Signed-off-by: Shrikrishna Khare
    Signed-off-by: David S. Miller

    Shrikrishna Khare
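
    The offset calculation at issue in the vmxnet3 entry above, in
    isolation. data_ring_slot() and its parameters are hypothetical; the
    sketch only shows that the slot address must scale with the configured
    buffer size instead of a fixed 128.

```c
#include <stddef.h>
#include <stdint.h>

/* Address of slot `index` in a data ring whose slots are `buf_size` bytes
 * each. The caller guarantees the ring is large enough. */
static uint8_t *data_ring_slot(uint8_t *ring_base, size_t index,
                               size_t buf_size)
{
    return ring_base + index * buf_size;
}
```

    With 256-byte buffers, slot 3 starts at offset 768; code that still
    assumes 128 would land at offset 384, in the middle of slot 1.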
     
  • The RAR entry for the SAN MAC address was being cleared when we were
    clearing the VMDq pool bits. In order to prevent this we need to add
    an extra check to protect the SAN MAC from being cleared.

    Fixes: 6e982aeae ("ixgbe: Clear stale pool mappings")
    Signed-off-by: Alexander Duyck
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Pool index has to be converted by get_pool helper to work correctly for
    egress pool. In mlxsw the egress pool index starts from 0.

    Fixes: 0f433fa0ecc ("mlxsw: spectrum_buffers: Implement shared buffer configuration")
    Signed-off-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • sk->sk_state holds bit flags, so the check must use a bit operation
    instead of a value comparison.

    Signed-off-by: Gao Feng
    Tested-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Gao Feng
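
    The difference between a value check and a bit check, as described in
    the entry above, in a minimal sketch. The flag names and values are
    hypothetical and chosen only for illustration.

```c
/* Hypothetical state flags; several may be set at the same time. */
enum conn_state_bits {
    ST_CONNECTED = 1 << 0,
    ST_BOUND     = 1 << 1
};

/* Value check: true only when CONNECTED is the sole bit set. */
static int is_connected_by_value(int state)
{
    return state == ST_CONNECTED;
}

/* Bit check: true whenever the CONNECTED bit is set, whatever else is. */
static int is_connected_by_bit(int state)
{
    return (state & ST_CONNECTED) != 0;
}
```

    For a state of ST_CONNECTED | ST_BOUND the value check reports "not
    connected" while the bit check reports "connected".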
     
  • Because otherwise when crc computation is still needed it's way more
    expensive than on a linear buffer to the point that it affects
    performance.

    It's so expensive that netperf test gives a perf output as below:

    Overhead Command Shared Object Symbol
    18,62% netserver [kernel.vmlinux] [k] crc32_generic_shift
    2,57% netserver [kernel.vmlinux] [k] __pskb_pull_tail
    1,94% netserver [kernel.vmlinux] [k] fib_table_lookup
    1,90% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string
    1,66% swapper [kernel.vmlinux] [k] intel_idle
    1,63% netserver [kernel.vmlinux] [k] _raw_spin_lock
    1,59% netserver [sctp] [k] sctp_packet_transmit
    1,55% netserver [kernel.vmlinux] [k] memcpy_erms
    1,42% netserver [sctp] [k] sctp_rcv

    # netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
    SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
    Recv Send Send Utilization Service Demand
    Socket Socket Message Elapsed Send Recv Send Recv
    Size Size Size Time Throughput local remote local remote
    bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

    212992 212992 12000 10.00 3016.42 2.88 3.78 1.874 2.462

    After patch:
    Overhead Command Shared Object Symbol
    2,75% netserver [kernel.vmlinux] [k] memcpy_erms
    2,63% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string
    2,39% netserver [kernel.vmlinux] [k] fib_table_lookup
    2,04% netserver [kernel.vmlinux] [k] __pskb_pull_tail
    1,91% netserver [kernel.vmlinux] [k] _raw_spin_lock
    1,91% netserver [sctp] [k] sctp_packet_transmit
    1,72% netserver [mlx4_en] [k] mlx4_en_process_rx_cq
    1,68% netserver [sctp] [k] sctp_rcv

    # netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
    SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
    Recv Send Send Utilization Service Demand
    Socket Socket Message Elapsed Send Recv Send Recv
    Size Size Size Time Throughput local remote local remote
    bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

    212992 212992 12000 10.00 3681.77 3.83 3.46 2.045 1.849

    Fixes: 3acb50c18d8d ("sctp: delay as much as possible skb_linearize")
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • Saeed Mahameed says:

    ====================
    Mellanox 100G mlx5 fixes 2016-08-16

    This series includes some bug fixes for mlx5e driver.

    From Saeed and Tariq, Optimize MTU change to not reset when it is not required.

    From Paul, Command interface message length check to speedup firmware
    command preparation.

    From Mohamad, Save pci state when pci error is detected.

    From Amir, Flow counters "lastuse" update fix.

    From Hadar, Use correct flow dissector key on flower offloading.
    Plus a small optimization for switchdev hardware id query.

    From Or, three patches to address some E-Switch offloads issues.

    For -stable of 4.6.y and 4.7.y:
    net/mlx5e: Use correct flow dissector key on flower offloading
    net/mlx5: Fix pci error recovery flow
    net/mlx5: Added missing check of msg length in verifying its signature
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When we are in the switchdev/offloads mode, HW matching is done as
    dictated by the offloaded rules and hence we don't need to enable
    the ACLs mechanism used by the legacy mode.

    Signed-off-by: Or Gerlitz
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • While adding actual offloading support to the new switchdev mode, we didn't
    change the setup of the send-to-vport rules to put them in the slow path
    table, fix that.

    Fixes: 1033665e63b6 ('net/mlx5: E-Switch, Use two priorities for SRIOV offloads mode')
    Signed-off-by: Or Gerlitz
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • Since mlx5 has also the NONE e-switch mode, we must translate from mlx5
    mode to devlink mode on the devlink eswitch mode get call, do that.

    While here, remove the mlx5_ prefix from the static function helpers
    that deal with the mode to comply with the rest of the code.

    Fixes: c930a3ad7453 ('net/mlx5e: Add devlink based SRIOV mode change')
    Signed-off-by: Or Gerlitz
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Or Gerlitz
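
    A mode translation of the kind described in the entry above might look
    like this. The enums and drv_to_api_mode() are hypothetical stand-ins
    for the mlx5 and devlink definitions; returning -EINVAL for the NONE
    mode is an assumption about how an untranslatable mode could be
    reported.

```c
#include <errno.h>

/* Hypothetical driver-side modes, including one with no API equivalent. */
enum drv_mode { DRV_MODE_NONE, DRV_MODE_LEGACY, DRV_MODE_OFFLOADS };

/* Hypothetical API-side modes. */
enum api_mode { API_MODE_LEGACY, API_MODE_SWITCHDEV };

/* Translate a driver mode to the API mode. Returns 0 and fills *out on
 * success, or -EINVAL when the driver mode has no counterpart. */
static int drv_to_api_mode(enum drv_mode in, enum api_mode *out)
{
    switch (in) {
    case DRV_MODE_LEGACY:
        *out = API_MODE_LEGACY;
        return 0;
    case DRV_MODE_OFFLOADS:
        *out = API_MODE_SWITCHDEV;
        return 0;
    default:
        return -EINVAL;
    }
}
```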
     
  • Avoid firmware command execution each time the switchdev HW ID attr get
    call is made. We do that by reading the ID (PF NIC MAC) only once at
    load time and store it on the representor structure.

    Signed-off-by: Hadar Hen Zion
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Hadar Hen Zion
     
  • The wrong key is used when extracting the address type field set by
    the flower offload code. We have to use the control key and not the
    basic key, fix that.

    Fixes: e3a2b7ed018e ('net/mlx5e: Support offload cls_flower with drop action')
    Signed-off-by: Hadar Hen Zion
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Hadar Hen Zion
     
  • Set lastuse statistic, when number of packets is changed compared to
    last query. This was wrongly dropped when bulk counter reading was added.

    Fixes: a351a1b03bf1 ('net/mlx5: Introduce bulk reading of flow counters')
    Signed-off-by: Amir Vadai
    Reported-by: Paul Blakey
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Amir Vadai
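
    The lastuse rule from the entry above, reduced to a sketch. struct
    flow_counter and counter_update() are hypothetical; the timestamp is
    passed in by the caller so the example does not depend on any clock API.

```c
#include <stdint.h>

/* Hypothetical cached counter state. */
struct flow_counter {
    uint64_t packets;   /* packet count seen at the last query */
    uint64_t lastuse;   /* time the count last changed */
};

/* Store the freshly queried packet count and refresh lastuse only when
 * the count differs from the previous query. */
static void counter_update(struct flow_counter *c, uint64_t new_packets,
                           uint64_t now)
{
    if (new_packets != c->packets) {
        c->packets = new_packets;
        c->lastuse = now;
    }
}
```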
     
  • Set and verify signature calculates the signature for each of the
    mailbox nodes, even for those that are unused (from cache). Added
    a missing length check to set and verify only those which are used.

    While here, also moved the setting of msg's nodes token to where we
    already go over them. This saves a pass because checksum is disabled,
    and the only useful thing remaining that set signature does is setting
    the token.

    Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB adapters')
    Signed-off-by: Paul Blakey
    Signed-off-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Paul Blakey