28 Dec, 2016

1 commit

  • Different namespaces might have different requirements for reusing
    TIME_WAIT sockets for new connections. This may be needed when
    applications running in different namespaces require the number of
    TIME_WAIT sockets to be reduced independently of the host.

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     

27 Dec, 2016

2 commits

  • Commit beb0babfb77e ("korina: disable napi on close and restart")
    introduced calls to napi_disable() that were missing before.
    Unfortunately, this leaves a small window during which NAPI has a chance
    to run even though we have just freed resources, since korina_free_ring()
    has already been called.

    Fix this by disabling NAPI first and then freeing resources, and make sure
    that we also cancel the restart task before freeing the resources.

    Fixes: beb0babfb77e ("korina: disable napi on close and restart")
    Reported-by: Alexandros C. Couloumbis
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Shahar reported a soft lockup in tc_classify(), where we run into an
    endless loop when walking the classifier chain due to tp->next == tp
    which is a state we should never run into. The issue only seems to
    trigger under load in the tc control path.

    What happens is that in tc_ctl_tfilter(), thread A allocates a new
    tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
    with it. In that classifier callback we had to unlock/lock the rtnl
    mutex and returned with -EAGAIN. One reason why we need to drop there
    is, for example, that we need to request an action module to be loaded.

    This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
    after we loaded and found the requested action, we need to redo the
    whole request so we don't race against others. While we had to unlock
    rtnl in that time, thread B's request was processed next on that CPU.
    Thread B added a new tp instance successfully to the classifier chain.
    When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
    and destroying its tp instance which never got linked, we goto replay
    and redo A's request.

    This time when walking the classifier chain in tc_ctl_tfilter() for
    checking for existing tp instances we had a priority match and found
    the tp instance that was created and linked by thread B. Now calling
    again into tp->ops->change() with that tp was successful and returned
    without error.

    tp_created was never cleared in the second round, thus kernel thinks
    that we need to link it into the classifier chain (once again). tp and
    *back point to the same object due to the match we had earlier on. Thus
    for thread B's already public tp, we reset tp->next to tp itself and
    link it into the chain, which eventually causes the mentioned endless
    loop in tc_classify() once a packet hits the data path.

    Fix is to clear tp_created at the beginning of each request, also when
    we replay it. On the paths that can cause -EAGAIN we already destroy
    the original tp instance we had and on replay we really need to start
    from scratch. It seems that this issue was first introduced in commit
    12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
    and avoid kernel panic when we use cls_cgroup").

    Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
    Reported-by: Shahar Klein
    Signed-off-by: Daniel Borkmann
    Cc: Cong Wang
    Acked-by: Eric Dumazet
    Tested-by: Shahar Klein
    Signed-off-by: David S. Miller

    Daniel Borkmann
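
    The replay problem described in this commit can be illustrated with a
    small userspace model. This is only a sketch under stated assumptions:
    struct tp_model, link_if_created(), chain_terminates() and
    replay_is_safe() are hypothetical names, not kernel code, and the model
    assumes the linking step sets tp->next to *back and then points *back
    at tp.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tcf_proto: only the chain link and prio. */
struct tp_model {
    struct tp_model *next;
    int prio;
};

/*
 * Models the final linking step: if tp_created is set, link tp in front
 * of *back. When tp was found on the chain, *back already equals tp.
 */
static void link_if_created(struct tp_model **back, struct tp_model *tp,
                            int tp_created)
{
    if (tp_created) {
        tp->next = *back;
        *back = tp;
    }
}

/* Returns 1 if walking the chain from head ends within limit steps. */
static int chain_terminates(const struct tp_model *head, int limit)
{
    while (head && limit-- > 0)
        head = head->next;
    return head == NULL;
}

/*
 * Replays thread A's request after thread B has linked its own tp.
 * clear_on_replay selects the fixed (1) or the old (0) behaviour.
 */
static int replay_is_safe(int clear_on_replay)
{
    struct tp_model b = { NULL, 10 };   /* already linked by thread B */
    struct tp_model *chain = &b;
    int tp_created = 1;                 /* stale value from A's first pass */

    if (clear_on_replay)
        tp_created = 0;

    /* Second pass: the priority match finds B's tp, so *back is that tp. */
    struct tp_model **back = &chain;
    struct tp_model *tp = *back;
    assert(tp->prio == 10);

    link_if_created(back, tp, tp_created);
    return chain_terminates(chain, 1000);
}
```

    In this model replay_is_safe(0) leaves the single chain entry pointing
    at itself, so a walk over the chain never ends, while replay_is_safe(1)
    leaves the chain untouched.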
     

24 Dec, 2016

33 commits

  • The user prio field is wrong (and overflows) in the XDP forward
    flow.
    This is a result of a bad value for num_tx_rings_p_up, which should
    account for all XDP TX rings, as they operate for the same user prio.

    Signed-off-by: Tariq Toukan
    Reported-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Tariq Toukan
     
  • In commit 6f00089c7372 ("tipc: remove SS_DISCONNECTING state") the
    check for socket type is in the wrong place, causing a closing socket
    to always send out a FIN message even when the socket was never
    connected. This is normally harmless, since the destination node for
    such messages most often is zero, and the message will be dropped, but
    it is still a wrong and confusing behavior.

    We fix this in this commit.

    Reviewed-by: Parthasarathy Bhuvaragan
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • A crash occurs in an IPvlan setup when the master is set in loopback
    mode, e.g.

    ethtool -K eth0 set loopback on

    where eth0 is the master device for the IPvlan setup.

    The failure is caused by the faulty logic that determines if the
    packet is from TX-path vs. RX-path by just looking at the mac-
    addresses on the packet while processing multicast packets.

    In the loopback mode where this crash was happening, the packets
    that are sent out are reflected by the NIC and are processed on
    the RX path, but the mac-address check tricks the code into thinking
    this packet is from the TX path, so it wrongly uses dev_forward_skb()
    to pass packets to the slave (virtual) devices.

    This patch records the path while queueing packets and eliminates
    the logic of looking at mac-addresses for the same decision.

    ------------[ cut here ]------------
    kernel BUG at include/linux/skbuff.h:1737!
    Call Trace:
    dev_forward_skb+0x92/0xd0
    ipvlan_process_multicast+0x395/0x4c0 [ipvlan]
    ? ipvlan_process_multicast+0xd7/0x4c0 [ipvlan]
    ? process_one_work+0x147/0x660
    process_one_work+0x1a9/0x660
    ? process_one_work+0x147/0x660
    worker_thread+0x11d/0x360
    ? rescuer_thread+0x350/0x350
    kthread+0xdb/0xe0
    ? _raw_spin_unlock_irq+0x30/0x50
    ? flush_kthread_worker+0xc0/0xc0
    ret_from_fork+0x9a/0xd0
    ? flush_kthread_worker+0xc0/0xc0

    Fixes: ba35f8588f47 ("ipvlan: Defer multicast / broadcast processing to a work-queue")
    Signed-off-by: Mahesh Bandewar
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Mahesh Bandewar
     
  • 1) netif_rx() / dev_forward_skb() should not be called from process
    context.

    2) ipvlan_count_rx() should be called with preemption disabled.

    3) We should check if ipvlan->dev is up before feeding packets
    to netif_rx()

    4) We need to prevent device from disappearing if some packets
    are in the multicast backlog.

    5) One kfree_skb() should be a consume_skb() eventually

    Fixes: ba35f8588f47 ("ipvlan: Defer multicast / broadcast processing to
    a work-queue")
    Signed-off-by: Eric Dumazet
    Cc: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Pull networking fixes from David Miller:

    1) We have to be careful to not try and place a checksum after the end
    of a rawv6 packet, fix from Dave Jones with help from Hannes
    Frederic Sowa.

    2) Missing memory barriers in tcp_tasklet_func() lead to crashes, from
    Eric Dumazet.

    3) Several bug fixes for the new XDP support in virtio_net, from Jason
    Wang.

    4) Increase headroom in RX skbs in be2net driver to accommodate
    encapsulations such as geneve. From Kalesh A P.

    5) Fix SKB frag unmapping on TX in mvpp2, from Thomas Petazzoni.

    6) Pre-pulling UDP headers created a regression in RECVORIGDSTADDR
    socket option support, from Willem de Bruijn.

    7) UID based routing added a potential OOPS in ip_do_redirect() when we
    see an SKB without a socket attached. We just need it for the
    network namespace which we can get from skb->dev instead. Fix from
    Lorenzo Colitti.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
    sctp: fix recovering from 0 win with small data chunks
    sctp: do not loose window information if in rwnd_over
    virtio-net: XDP support for small buffers
    virtio-net: remove big packet XDP codes
    virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
    virtio-net: make rx buf size estimation works for XDP
    virtio-net: unbreak csumed packets for XDP_PASS
    virtio-net: correctly handle XDP_PASS for linearized packets
    virtio-net: fix page miscount during XDP linearizing
    virtio-net: correctly xmit linearized page on XDP_TX
    virtio-net: remove the warning before XDP linearizing
    mlxsw: spectrum_router: Correctly remove nexthop groups
    mlxsw: spectrum_router: Don't reflect dead neighs
    neigh: Send netevent after marking neigh as dead
    ipv6: handle -EFAULT from skb_copy_bits
    inet: fix IP(V6)_RECVORIGDSTADDR for udp sockets
    net/sched: cls_flower: Mandate mask when matching on flags
    net/sched: act_tunnel_key: Fix setting UDP dst port in metadata under IPv6
    stmmac: CSR clock configuration fix
    net: ipv4: Don't crash if passing a null sk to ip_do_redirect.
    ...

    Linus Torvalds
     
  • Currently, if SCTP closes the receive window due to window pressure, mostly
    caused by excessive skb overhead relative to payload, SCTP will close the
    window abruptly while saving the delta in rwnd_press. It will start
    recovering rwnd as the chunks are consumed by the application, and
    rwnd_press will only be recovered after rwnd reaches the same value as
    rwnd_press, mostly to prevent silly window syndrome.

    The problem is that this is very inefficient with small data chunks: with
    those, rwnd will never get back to that value, and thus it will never
    recover from such pressure. This means that we will not issue window
    updates when recovering from 0 window and will rely on a sender retransmit
    to notice it.

    The fix here is to remove this threshold, as no value is good enough: it
    depends on the (avg) chunk sizes being used.

    Test with netperf -t SCTP_STREAM -- -m 1, and trigger 0 window by
    sending SIGSTOP to netserver, sleep 1.2, and SIGCONT.
    Rate limited to 845kbps, for visibility. Capture done at netserver side.

    Previously:
    01.500751 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632372996] [a_rwnd 99153] [
    01.500752 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632372997] [SID: 0] [SS
    01.517471 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
    01.517483 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    01.517485 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
    01.517488 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    01.534168 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373096] [SID: 0] [SS
    01.534180 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    01.534181 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373169] [SID: 0] [SS
    01.534185 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    02.525978 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
    02.526021 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    (window update missed)
    04.573807 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
    04.779370 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373082] [a_rwnd 859] [#g
    04.789162 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
    04.789323 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373156] [SID: 0] [SS
    04.789372 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373228] [a_rwnd 786] [#g

    After:
    02.568957 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098728] [a_rwnd 99153]
    02.568961 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098729] [SID: 0] [S
    02.585631 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
    02.585666 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    02.585671 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
    02.585683 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    02.602330 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098828] [SID: 0] [S
    02.602359 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    02.602363 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098901] [SID: 0] [S
    02.602372 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    03.600788 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
    03.600830 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    03.619455 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 13508]
    03.619479 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 27017]
    03.619497 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 40526]
    03.619516 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 54035]
    03.619533 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 67544]
    03.619552 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 81053]
    03.619570 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 94562]
    (following data transmission triggered by window updates above)
    03.633504 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
    03.836445 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098814] [a_rwnd 100000]
    03.843125 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
    03.843285 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098888] [SID: 0] [S
    03.843345 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098960] [a_rwnd 99894]
    03.856546 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098961] [SID: 0] [S
    03.866450 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490099011] [SID: 0] [S

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
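
    The effect of the threshold described in this commit can be sketched
    with a simplified userspace model. Everything here is an assumption made
    for illustration: struct rwnd_model, rwnd_increase() and recover() are
    hypothetical names, the recovery step of at most one path MTU per call
    is assumed, and the numbers (99000 bytes withheld, MTU 1500, one-byte
    chunks) are made up.

```c
#include <assert.h>

/* Hypothetical, simplified receive-window state; not the kernel struct. */
struct rwnd_model {
    unsigned int rwnd;        /* currently advertised window */
    unsigned int rwnd_press;  /* amount withheld due to buffer pressure */
    unsigned int pathmtu;     /* upper bound for one recovery step */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/*
 * Called when the application consumes len payload bytes.
 * use_threshold = 1 models the old rule (recover rwnd_press only once
 * rwnd has grown back to rwnd_press); 0 models the rule after the fix.
 */
static void rwnd_increase(struct rwnd_model *m, unsigned int len,
                          int use_threshold)
{
    assert(m->pathmtu > 0);
    m->rwnd += len;
    if (m->rwnd_press && (!use_threshold || m->rwnd >= m->rwnd_press)) {
        unsigned int change = min_u(m->pathmtu, m->rwnd_press);

        m->rwnd += change;
        m->rwnd_press -= change;
    }
}

/* Consume `chunks` one-byte chunks and return the resulting window. */
static unsigned int recover(unsigned int chunks, int use_threshold)
{
    struct rwnd_model m = { 0, 99000, 1500 };

    for (unsigned int i = 0; i < chunks; i++)
        rwnd_increase(&m, 1, use_threshold);
    return m.rwnd;
}
```

    With the threshold, consuming 100 one-byte chunks only reopens the
    window by 100 bytes in this model; without it, the withheld amount is
    handed back in MTU-sized steps as chunks are consumed.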
     
  • It's possible that we receive a packet that is larger than the current
    window. If it's the first such packet, it will increase rwnd_over.
    Then, if we receive another data chunk (especially as
    SCTP allows you to have one data chunk in flight even during 0 window),
    rwnd_over will be overwritten instead of added to.

    In the long run, this could cause the window to grow bigger than its
    initial size, as rwnd_over would be charged only for the last received
    data chunk while the code will try to open the window for all packets that
    were received and had their value in rwnd_over overwritten. This, in turn,
    can worsen the payload/buffer ratio and cause rwnd_press
    to kick in more often.

    The fix is to sum it too, same as is done for rwnd_press, so that if we
    receive 3 chunks after closing the window, we still have to release that
    same amount before re-opening it.

    Log snippet from sctp_test exhibiting the issue:
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
    rwnd decreased by 1 to (0, 1, 114221)
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease:
    association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
    rwnd decreased by 1 to (0, 1, 114221)
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease:
    association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
    rwnd decreased by 1 to (0, 1, 114221)
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease:
    association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
    rwnd decreased by 1 to (0, 1, 114221)

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
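
    The difference between overwriting and summing rwnd_over can be shown
    with a minimal userspace sketch. The names struct win_model,
    rwnd_decrease() and overshoot_after() are hypothetical and the model
    keeps only the two counters mentioned in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model holding only the two counters discussed above. */
struct win_model {
    unsigned int rwnd;       /* remaining advertised window */
    unsigned int rwnd_over;  /* bytes received beyond the window */
};

/*
 * Charge len received bytes against the window.
 * sum_over = 1 accumulates the overshoot (the fix),
 * sum_over = 0 overwrites it (the old behaviour).
 */
static void rwnd_decrease(struct win_model *w, unsigned int len, int sum_over)
{
    assert(w != NULL);
    if (w->rwnd >= len) {
        w->rwnd -= len;
    } else {
        unsigned int excess = len - w->rwnd;

        w->rwnd_over = sum_over ? w->rwnd_over + excess : excess;
        w->rwnd = 0;
    }
}

/* Receive `chunks` chunks of `len` bytes on an already closed window. */
static unsigned int overshoot_after(unsigned int chunks, unsigned int len,
                                    int sum_over)
{
    struct win_model w = { 0, 0 };

    for (unsigned int i = 0; i < chunks; i++)
        rwnd_decrease(&w, len, sum_over);
    return w.rwnd_over;
}
```

    For three 100-byte chunks received after the window closed, the
    overwriting variant records an overshoot of 100 bytes, while the summing
    variant records all 300 bytes that must be released before re-opening.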
     
  • Pull final vfs updates from Al Viro:
    "Assorted cleanups and fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    sg_write()/bsg_write() is not fit to be called under KERNEL_DS
    ufs: fix function declaration for ufs_truncate_blocks
    fs: exec: apply CLOEXEC before changing dumpable task flags
    seq_file: reset iterator to first record for zero offset
    vfs: fix isize/pos/len checks for reflink & dedupe
    [iov_iter] fix iterate_all_kinds() on empty iterators
    move aio compat to fs/aio.c
    reorganize do_make_slave()
    clone_private_mount() doesn't need to touch namespace_sem
    remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()

    Linus Torvalds
     
  • Jason Wang says:

    ====================
    several fixups for virtio-net XDP

    Merry Xmas and a Happy New year to all:

    This series tries to fix several issues for virtio-net XDP, which
    can be categorized into several parts:

    - fix several issues during XDP linearizing
    - allow csumed packets to work for XDP_PASS
    - make EWMA rxbuf size estimation work for XDP
    - forbid XDP when GUEST_UFO is supported
    - remove big packet XDP support
    - add XDP support for small buffers

    Please see individual patches for details.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
    small receive buffers untouched. This will confuse users who want to
    set XDP but use small buffers. Rather than forbidding XDP in small buffer
    mode, let's make it work. XDP can then only work at skb->data since
    virtio-net creates skbs during refill. This is suboptimal and could
    be optimized in the future.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
     
  • Now that we in fact don't allow XDP for big packets, remove the code for it.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • When VIRTIO_NET_F_GUEST_UFO is negotiated, the host could still send a UFO
    packet that exceeds a single page, which cannot be handled
    correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
    supported. While at it, forbid XDP for ECN (which comes only from GRO)
    too, to prevent users from misconfiguring it.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
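
    The rule described in this commit amounts to a check on the negotiated
    feature bits. The following is only a sketch: has_feature() and
    xdp_allowed() are hypothetical helpers, the bit numbers are the ones
    assigned to these features in the virtio specification, and only the
    two features named in the commit message are covered.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bit numbers as assigned in the virtio specification. */
#define VIRTIO_NET_F_GUEST_ECN  9
#define VIRTIO_NET_F_GUEST_UFO 10

/* Hypothetical helper: test one bit in a negotiated feature mask. */
static int has_feature(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return (int)((features >> bit) & 1);
}

/* Returns 1 if XDP may be attached under the rule described above. */
static int xdp_allowed(uint64_t features)
{
    return !has_feature(features, VIRTIO_NET_F_GUEST_UFO) &&
           !has_feature(features, VIRTIO_NET_F_GUEST_ECN);
}
```

    A device that negotiated either feature would be refused in this model,
    while unrelated feature bits have no influence on the decision.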
     
  • We don't update the EWMA rx buf size in the case of XDP. This leads to
    underestimation of the rx buf size, which causes the host to produce
    more than one buffer per packet. This greatly increases the likelihood
    of XDP page linearization.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
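
    Why a missing update leads to underestimation can be seen from how an
    exponentially weighted moving average behaves. This is a generic integer
    EWMA written for illustration, not the kernel helper: ewma_add() and
    estimate_after() are hypothetical names, and the weight of 64 used in
    the usage note is an assumption.

```c
#include <assert.h>

/*
 * One integer EWMA step: the new average moves roughly 1/weight of the
 * way from the old average toward the sample. An average of 0 is treated
 * as "no samples yet" and is replaced by the first sample.
 */
static unsigned long ewma_add(unsigned long avg, unsigned long sample,
                              unsigned long weight)
{
    assert(weight > 0);
    if (avg == 0)
        return sample;
    return (avg * (weight - 1) + sample) / weight;
}

/* Feed n equal-sized packets into the estimate and return the result. */
static unsigned long estimate_after(unsigned long avg, unsigned long pkt_len,
                                    unsigned int n, unsigned long weight)
{
    while (n--)
        avg = ewma_add(avg, pkt_len, weight);
    return avg;
}
```

    If a receive path never feeds its packet lengths into the average, the
    estimate stays at its old value (estimate_after(256, 4096, 0, 64)
    returns 256), whereas feeding the samples lets it approach the real
    packet size.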
     
  • We drop csumed packets when doing XDP. This breaks
    XDP_PASS when GUEST_CSUM is supported. Fix this by allowing the csum flag
    to be set. With this patch, simple TCP works for XDP_PASS.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
     
  • When XDP_PASS was determined for linearized packets, we tried to get
    new buffers from the virtqueue and build skbs from them. This is wrong;
    we should create skbs based on the existing buffers instead. Fix this by
    creating the skb based on xdp_page.

    With this patch "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
     
  • We don't put the page during linearizing, which would cause a leak when
    transmitting through XDP_TX or when the packet exceeds PAGE_SIZE. Fix
    this by putting the page accordingly. Also decrease the number of buffers
    during linearizing to make sure the caller can free buffers correctly
    when the packet exceeds PAGE_SIZE. With this patch, we won't get OOM
    after linearizing a huge number of packets.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
     
  • After we linearize a page, we should xmit this page instead of the page
    of the first buffer, which may lead to unexpected results. With this
    patch, we can see the correct packet during XDP_TX.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
     
  • Since we use EWMA to estimate the size of the rx buffer, it's usual to
    have a packet with more than one buffer when the rx buffer size is
    underestimated. Since this is not a bug, remove the warning and correct
    the comment before XDP linearizing.

    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • Pull befs updates from Luis de Bethencourt:
    "A series of small fixes and adding NFS export support"

    * tag 'befs-v4.10-rc1' of git://github.com/luisbg/linux-befs:
    befs: add NFS export support
    befs: remove trailing whitespaces
    befs: remove signatures from comments
    befs: fix style issues in header files
    befs: fix style issues in linuxvfs.c
    befs: fix typos in linuxvfs.c
    befs: fix style issues in io.c
    befs: fix style issues in inode.c
    befs: fix style issues in debug.c

    Linus Torvalds
     
  • Pull drm fixes from Dave Airlie:
    "Some fixes came in while I was out, mostly intel and amdgpu ones, with
    one ast fix"

    Daniel Vetter says:
    "This should also shut up the WARN_ON(!intel_dp->lane_count) noise"

    * tag 'drm-fixes-for-4.10-rc1' of git://people.freedesktop.org/~airlied/linux: (35 commits)
    drm/amdgpu: update tile table for oland/hainan
    drm/amdgpu: update tile table for verde
    drm/amdgpu: update rev id for verde
    drm/amdgpu: update golden setting for verde
    drm/amdgpu: update rev id for oland
    drm/amdgpu: update golden setting for oland
    drm/amdgpu: update rev id for hainan
    drm/amdgpu: update golden setting for hainan
    drm/amdgpu: update rev id for pitcairn
    drm/amdgpu: update golden setting for pitcairn
    drm/amdgpu: update golden setting/tiling table of tahiti
    drm/i915: skip the first 4k of stolen memory on everything >= gen8
    drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping
    drm/i915: Fix use after free in logical_render_ring_init
    drm/i915: disable PSR by default on HSW/BDW
    drm/i915: Fix setting of boost freq tunable
    drm/i915: tune down the fast link training vs boot fail
    drm/i915: Reorder phys backing storage release
    drm/i915/gen9: Fix PCODE polling during SAGV disabling
    drm/i915/gen9: Fix PCODE polling during CDCLK change notification
    ...

    Linus Torvalds
     
  • Pull rdma fixes from Doug Ledford:
    "First round of -rc fixes for 4.10 kernel:

    - a series of qedr fixes
    - a series of rxe fixes
    - one i40iw fix
    - one cma fix
    - one cxgb4 fix"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
    IB/rxe: Don't check for null ptr in send()
    IB/rxe: Drop future atomic/read packets rather than retrying
    IB/rxe: Use BTH_PSN_MASK when ACKing duplicate sends
    qedr: Always notify the verb consumer of flushed CQEs
    qedr: clear the vendor error field in the work completion
    qedr: post_send/recv according to QP state
    qedr: ignore inline flag in read verbs
    qedr: modify QP state to error when destroying it
    qedr: return correct value on modify qp
    qedr: return error if destroy CQ failed
    qedr: configure the number of CQEs on CQ creation
    i40iw: Set 128B as the only supported RQ WQE size
    IB/cma: Fix a race condition in iboe_addr_get_sgid()
    IB/rxe: Fix a memory leak in rxe_qp_cleanup()
    iw_cxgb4: set correct FetchBurstMax for QPs

    Linus Torvalds
     
  • Pull late SCSI updates from James Bottomley:
    "This is mostly stuff which missed the initial pull.

    There's a new driver: qedi, and some ufs, ibmvscsis and ncr5380
    updates plus some assorted driver fixes and also a fix for the bug
    where if a device goes into a blocked state between configuration and
    sysfs device add (which can be a long time under async probing) it
    would become permanently blocked"

    * tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (30 commits)
    scsi: avoid a permanent stop of the scsi device's request queue
    scsi: mpt3sas: Recognize and act on iopriority info
    scsi: qla2xxx: Fix Target mode handling with Multiqueue changes.
    scsi: qla2xxx: Add Block Multi Queue functionality.
    scsi: qla2xxx: Add multiple queue pair functionality.
    scsi: qla2xxx: Utilize pci_alloc_irq_vectors/pci_free_irq_vectors calls.
    scsi: qla2xxx: Only allow operational MBX to proceed during RESET.
    scsi: hpsa: remove memory allocate failure message
    scsi: Update 3ware driver email addresses
    scsi: zfcp: fix rport unblock race with LUN recovery
    scsi: zfcp: do not trace pure benign residual HBA responses at default level
    scsi: zfcp: fix use-after-"free" in FC ingress path after TMF
    scsi: libcxgbi: return error if interface is not up
    scsi: cxgb4i: libcxgbi: add missing module_put()
    scsi: cxgb4i: libcxgbi: cxgb4: add T6 iSCSI completion feature
    scsi: cxgb4i: libcxgbi: add active open cmd for T6 adapters
    scsi: cxgb4i: use cxgb4_tp_smt_idx() to get smt_idx
    scsi: qedi: Add QLogic FastLinQ offload iSCSI driver framework.
    scsi: aacraid: remove wildcard for series 9 controllers
    scsi: ibmvscsi: add write memory barrier to CRQ processing
    ...

    Linus Torvalds
     
  • Pull more ARC updates from Vineet Gupta:

    - Fix for aliasing VIPT dcache in old ARC700 cores

    - micro-optimization in ARC700 ProtV handler

    - Enable SG_CHAIN [Vladimir]

    - ARC HS38 core intc default to prio 1

    * tag 'arc-4.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
    ARC: mm: arc700: Don't assume 2 colours for aliasing VIPT dcache
    ARC: mm: No need to save cache version in @cpuinfo
    ARC: enable SG chaining
    ARCv2: intc: default all interrupts to priority 1
    ARCv2: entry: document intr disable in hard isr
    ARC: ARCompact entry: elide re-reading ECR in ProtV handler

    Linus Torvalds
     
  • Jiri Pirko says:

    ====================
    mlxsw: Router fixes

    Ido says:

    First two patches ensure we remove from the device's table neighbours
    that are considered to be dead by the neighbour core.

    The last patch removes nexthop groups from the device when they are no
    longer valid.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • At the end of the nexthop initialization process we determine whether
    the nexthop should be offloaded or not based on the NUD state of the
    neighbour representing it. After all the nexthops were initialized we
    refresh the nexthop group and potentially offload it to the device, in
    case some of the nexthops were resolved.

    Make the destruction of a nexthop group symmetric with its creation by
    marking all nexthops as invalid and then refresh the nexthop group to
    make sure it was removed from the device's tables.

    Fixes: b2157149b0b0 ("mlxsw: spectrum_router: Add the nexthop neigh activity update")
    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • When a neighbour is considered to be dead, we should remove it from the
    device's table regardless of its NUD state.

    Without this patch, after setting a port to be administratively down we
    get the following errors when we periodically try to update the kernel
    about neighbours activity:

    [ 461.947268] mlxsw_spectrum 0000:03:00.0 sw1p3: Failed to find
    matching neighbour for IP=192.168.100.2

    Fixes: a6bf9e933daf ("mlxsw: spectrum_router: Offload neighbours based on NUD state change")
    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • neigh_cleanup_and_release() is always called after marking a neighbour
    as dead, but it only notifies user space and not in-kernel listeners of
    the netevent notification chain.

    This can cause multiple problems. In my specific use case, it causes the
    listener (a switch driver capable of L3 offloads) to believe a neighbour
    entry is still valid, and is thus erroneously kept in the device's
    table.

    Fix that by sending a netevent after marking the neighbour as dead.

    Fixes: a6bf9e933daf ("mlxsw: spectrum_router: Offload neighbours based on NUD state change")
    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • By setting certain socket options on ipv6 raw sockets, we can confuse the
    length calculation in rawv6_push_pending_frames triggering a BUG_ON.

    RIP: 0010: rawv6_sendmsg+0xc30/0xc40
    RSP: 0018:ffff881f6c4a7c18 EFLAGS: 00010282
    RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
    RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
    RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
    R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
    R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80

    Call Trace:
    ? unmap_page_range+0x693/0x830
    inet_sendmsg+0x67/0xa0
    sock_sendmsg+0x38/0x50
    SYSC_sendto+0xef/0x170
    SyS_sendto+0xe/0x10
    do_syscall_64+0x50/0xa0
    entry_SYSCALL64_slow_path+0x25/0x25

    Handle by jumping to the failure path if skb_copy_bits gets an EFAULT.

    Reproducer:

    /* headers required by the calls used below */
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define LEN 504

    int main(int argc, char* argv[])
    {
            int fd;
            int zero = 0;
            char buf[LEN];

            memset(buf, 0, LEN);

            fd = socket(AF_INET6, SOCK_RAW, 7);

            setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4);
            setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);

            sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);
    }

    Signed-off-by: Dave Jones
    Signed-off-by: David S. Miller

    Dave Jones
     
  • Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
    the packet. For sockets that have transport headers pulled, transport
    offset can be negative. Use signed comparison to avoid overflow.

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Nisar Jagabar
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
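
    The overflow mentioned in this commit is the usual signed/unsigned
    comparison pitfall in C. The sketch below is a userspace illustration
    only: ports_fit_unsigned() and ports_fit_signed() are hypothetical
    names, and the 4-byte size (two 16-bit ports) is an assumption of the
    example.

```c
#include <assert.h>
#include <limits.h>

/*
 * Bounds check as it behaves when the length is unsigned: the signed sum
 * is converted to unsigned, so a negative offset becomes a huge value.
 * The cast spells out the conversion C would otherwise apply implicitly.
 */
static int ports_fit_unsigned(int transport_offset, unsigned int len)
{
    return !((unsigned int)(transport_offset + 4) > len);
}

/* Same check with the length converted to a signed value first. */
static int ports_fit_signed(int transport_offset, unsigned int len)
{
    assert(len <= INT_MAX);   /* the cast below is only safe in this range */
    return !(transport_offset + 4 > (int)len);
}
```

    With a transport offset of -8 and a length of 100, the unsigned variant
    rejects the packet because -4 is converted to a very large unsigned
    number, while the signed variant accepts it. Both variants agree for
    non-negative offsets.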
     
  • Or Gerlitz says:

    ====================
    net/sched fixes for cls_flower and act_tunnel_key

    This small series contains a fix to the matching flags support
    in flower and to the tunnel key action MD prep for IPv6.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When matching on flags, we should require the user to provide the
    mask and avoid using an all-ones mask. Not doing so causes matching
    on flags provided without a mask to hit on the value being unset for all
    flags, which may not be what the user wanted to happen.

    Fixes: faa3ffce7829 ('net/sched: cls_flower: Add support for matching on flags')
    Signed-off-by: Or Gerlitz
    Reported-by: Paul Blakey
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Or Gerlitz
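
    The effect of defaulting to an all-ones mask can be shown with a generic
    masked comparison. This is a sketch, not the cls_flower code:
    flags_match() and the two flag bits are hypothetical and exist only for
    the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits, used only for this illustration. */
#define FLAG_IS_FRAGMENT 0x1u
#define FLAG_OTHER       0x2u

/* Masked match: only the bits set in mask take part in the comparison. */
static int flags_match(uint32_t pkt_flags, uint32_t key, uint32_t mask)
{
    assert((key & ~mask) == 0);   /* key bits outside the mask make no sense */
    return (pkt_flags & mask) == (key & mask);
}
```

    With an all-ones mask, a key of FLAG_IS_FRAGMENT only matches packets in
    which every other flag is clear, so a packet carrying FLAG_IS_FRAGMENT
    and FLAG_OTHER is missed. With an explicit mask of FLAG_IS_FRAGMENT the
    same packet matches.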
     
  • The UDP dst port was provided to the helper function which sets the
    IPv6 IP tunnel meta-data under a wrong param order, fix that.

    Fixes: 75bfbca01e48 ('net/sched: act_tunnel_key: Add UDP dst port option')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Hadar Hen Zion
    Signed-off-by: David S. Miller

    Or Gerlitz
     
  • When testing stmmac with my QoS reference design I found a problem in the
    CSR clock configuration that was preventing phy discovery, since
    every read operation returned 0x0000ffff. This patch fixes the issue.

    Signed-off-by: Joao Pinto
    Signed-off-by: David S. Miller

    jpinto
     

23 Dec, 2016

4 commits

  • Al Viro
     
  • Both damn things interpret userland pointers embedded into the payload;
    worse, they are actually traversing those. Leaving aside the bad
    API design, this is very much _not_ safe to call with KERNEL_DS.
    Bail out early if that happens.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • sparse says:

    fs/ufs/inode.c:1195:6: warning: symbol 'ufs_truncate_blocks' was not declared. Should it be static?

    Note that the forward declaration in the file is already marked static.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • If you have a process that has set itself to be non-dumpable, and it
    then undergoes exec(2), any CLOEXEC file descriptors it has open are
    "exposed" during a race window between the dumpable flags of the process
    being reset for exec(2) and CLOEXEC being applied to the file
    descriptors. This can be exploited by a process by attempting to access
    /proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE.

    The race in question is after set_dumpable has been called (for get_link,
    though the trace is basically the same for readlink):

    [vfs]
    -> proc_pid_link_inode_operations.get_link
    -> proc_pid_get_link
    -> proc_fd_access_allowed
    -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);

    This will return 0 during the race window, and CLOEXEC file descriptors
    will still be open during this window because do_close_on_exec has not
    been called yet. As a result, the ordering of these calls should be
    reversed to avoid this race window.

    This is of particular concern to container runtimes, where joining a
    PID namespace with file descriptors referring to the host filesystem
    can result in security issues (since PR_SET_DUMPABLE doesn't protect
    against access of CLOEXEC file descriptors -- file descriptors which may
    reference filesystem objects the container shouldn't have access to).

    Cc: dev@opencontainers.org
    Cc: # v3.2+
    Reported-by: Michael Crosby
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
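
    For context, the userspace side of the setup described in this commit
    can be sketched as follows. This only shows how a process might mark a
    descriptor close-on-exec and make itself non-dumpable before calling
    exec(2); it does not reproduce the race. set_cloexec() and
    make_nondumpable() are hypothetical helper names, and the code assumes
    Linux.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Mark fd close-on-exec; returns 0 on success, -1 on error. */
static int set_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Make the calling process non-dumpable; returns 0 on success. */
static int make_nondumpable(void)
{
    return prctl(PR_SET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);
}
```

    A runtime would typically call both helpers before exec(2). The commit
    above concerns what the kernel does between resetting the dumpable flag
    and closing such descriptors during exec.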