04 Jan, 2017

1 commit


03 Jan, 2017

3 commits

  • In the case of custom rules being present we need to handle the case of the
    LOCAL table being initialized after the new rule has been added. To address
    that I am adding a new check so that we can make certain we don't use an
    alias of MAIN for LOCAL when allocating a new table.

    Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
    Reported-by: Oliver Brunel
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • 5.2. Action on Reception of a Query

    When a system receives a Query, it does not respond immediately.
    Instead, it delays its response by a random amount of time, bounded
    by the Max Resp Time value derived from the Max Resp Code in the
    received Query message. A system may receive a variety of Queries on
    different interfaces and of different kinds (e.g., General Queries,
    Group-Specific Queries, and Group-and-Source-Specific Queries), each
    of which may require its own delayed response.

    Before scheduling a response to a Query, the system must first
    consider previously scheduled pending responses and in many cases
    schedule a combined response. Therefore, the system must be able to
    maintain the following state:

    o A timer per interface for scheduling responses to General Queries.

    o A per-group and interface timer for scheduling responses to Group-
    Specific and Group-and-Source-Specific Queries.

    o A per-group and interface list of sources to be reported in the
    response to a Group-and-Source-Specific Query.

    When a new Query with the Router-Alert option arrives on an
    interface, provided the system has state to report, a delay for a
    response is randomly selected in the range (0, [Max Resp Time]) where
    Max Resp Time is derived from Max Resp Code in the received Query
    message. The following rules are then used to determine if a Report
    needs to be scheduled and the type of Report to schedule. The rules
    are considered in order and only the first matching rule is applied.

    1. If there is a pending response to a previous General Query
    scheduled sooner than the selected delay, no additional response
    needs to be scheduled.

    2. If the received Query is a General Query, the interface timer is
    used to schedule a response to the General Query after the
    selected delay. Any previously pending response to a General
    Query is canceled.
    --8<--
    Signed-off-by: David S. Miller

    Michal Tesar
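
    The two rules quoted in the entry above can be sketched in user-space C.
    This is only an illustration under assumptions: struct iface_state,
    on_general_query() and the millisecond time base are hypothetical names,
    not the kernel's IGMP code, and the delay is passed in rather than drawn
    at random so the decision logic is easy to follow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-interface state: one timer for General Query responses. */
struct iface_state {
    bool     gq_pending;   /* is a General Query response already scheduled? */
    uint64_t gq_expires;   /* absolute expiry time of that response, in ms */
};

/*
 * Apply rules 1 and 2 to a General Query received at time `now`, with
 * `delay` already selected from (0, Max Resp Time).
 * Returns true if the interface timer was (re)armed.
 */
bool on_general_query(struct iface_state *st, uint64_t now, uint64_t delay)
{
    uint64_t candidate = now + delay;

    /* Rule 1: a pending response that fires sooner wins; schedule nothing. */
    if (st->gq_pending && st->gq_expires < candidate)
        return false;

    /* Rule 2: cancel any previously pending response, use the new delay. */
    st->gq_pending = true;
    st->gq_expires = candidate;
    return true;
}
```

    With a response pending at t=1500, a later Query whose selected delay
    would fire at t=2000 leaves the timer alone, while one that would fire
    at t=1300 replaces it.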
     
  • __skb_flow_dissect can be called with a skb or a data packet, either
    can be NULL. All calls seem to have been moved to __skb_header_pointer
    except the pptp handling which is still calling skb_header_pointer.

    skb_header_pointer will use skb->data and thus:
    [ 109.556866] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
    [ 109.557102] IP: [] __skb_flow_dissect+0xa88/0xce0
    [ 109.557263] PGD 0
    [ 109.557338]
    [ 109.557484] Oops: 0000 [#1] SMP
    [ 109.557562] Modules linked in: chaoskey
    [ 109.557783] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.0 #79
    [ 109.557867] Hardware name: Supermicro A1SRM-LN7F/LN5F/A1SRM-LN7F-2758, BIOS 1.0c 11/04/2015
    [ 109.557957] task: ffff94085c27bc00 task.stack: ffffb745c0068000
    [ 109.558041] RIP: 0010:[] [] __skb_flow_dissect+0xa88/0xce0
    [ 109.558203] RSP: 0018:ffff94087fc83d40 EFLAGS: 00010206
    [ 109.558286] RAX: 0000000000000130 RBX: ffffffff8975bf80 RCX: ffff94084fab6800
    [ 109.558373] RDX: 0000000000000010 RSI: 000000000000000c RDI: 0000000000000000
    [ 109.558460] RBP: 0000000000000b88 R08: 0000000000000000 R09: 0000000000000022
    [ 109.558547] R10: 0000000000000008 R11: ffff94087fc83e04 R12: 0000000000000000
    [ 109.558763] R13: ffff94084fab6800 R14: ffff94087fc83e04 R15: 000000000000002f
    [ 109.558979] FS: 0000000000000000(0000) GS:ffff94087fc80000(0000) knlGS:0000000000000000
    [ 109.559326] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 109.559539] CR2: 0000000000000080 CR3: 0000000281809000 CR4: 00000000001026e0
    [ 109.559753] Stack:
    [ 109.559957] 000000000000000c ffff94084fab6822 0000000000000001 ffff94085c2b5fc0
    [ 109.560578] 0000000000000001 0000000000002000 0000000000000000 0000000000000000
    [ 109.561200] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    [ 109.561820] Call Trace:
    [ 109.562027]
    [ 109.562108] [] ? eth_get_headlen+0x7a/0xf0
    [ 109.562522] [] ? igb_poll+0x96a/0xe80
    [ 109.562737] [] ? net_rx_action+0x20b/0x350
    [ 109.562953] [] ? __do_softirq+0xe8/0x280
    [ 109.563169] [] ? irq_exit+0xaa/0xb0
    [ 109.563382] [] ? do_IRQ+0x4b/0xc0
    [ 109.563597] [] ? common_interrupt+0x7f/0x7f
    [ 109.563810]
    [ 109.563890] [] ? cpuidle_enter_state+0x130/0x2c0
    [ 109.564304] [] ? cpuidle_enter_state+0x120/0x2c0
    [ 109.564520] [] ? cpu_startup_entry+0x19f/0x1f0
    [ 109.564737] [] ? start_secondary+0x12a/0x140
    [ 109.564950] Code: 83 e2 20 a8 80 0f 84 60 01 00 00 c7 04 24 08 00
    00 00 66 85 d2 0f 84 be fe ff ff e9 69 fe ff ff 8b 34 24 89 f2 83 c2
    04 66 85 c0 8b 84 24 80 00 00 00 0f 49 d6 41 8d 31 01 d6 41 2b 84
    24 84
    [ 109.569959] RIP [] __skb_flow_dissect+0xa88/0xce0
    [ 109.570245] RSP
    [ 109.570453] CR2: 0000000000000080

    Fixes: ab10dccb1160 ("rps: Inspect PPTP encapsulated by GRE to get flow hash")
    Signed-off-by: Ian Kumlien
    Signed-off-by: David S. Miller

    Ian Kumlien
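
    The distinction described in the entry above (a helper that always
    dereferences the packet object versus one that also accepts a raw data
    pointer) can be sketched as follows. This is a user-space analogy under
    assumptions: struct packet and header_pointer() are hypothetical and do
    not reproduce the kernel helpers' signatures.

```c
#include <stddef.h>

/* Hypothetical stand-in for a packet object; not the kernel's struct sk_buff. */
struct packet {
    const unsigned char *data;
    size_t len;
};

/*
 * Return a pointer to `len` bytes at `offset`, or NULL if out of range.
 * Either `pkt` or `data` may be NULL: the raw buffer is used when given,
 * and `pkt` is only dereferenced as a fallback.
 */
const unsigned char *header_pointer(const struct packet *pkt,
                                    const unsigned char *data, size_t data_len,
                                    size_t offset, size_t len)
{
    if (!data) {
        if (!pkt)
            return NULL;        /* nothing to read from */
        data = pkt->data;
        data_len = pkt->len;
    }
    /* Bounds check written so that it cannot overflow. */
    if (offset > data_len || len > data_len - offset)
        return NULL;
    return data + offset;
}
```

    A caller that only has a raw buffer passes NULL for the packet and is
    never exposed to a NULL dereference.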
     

02 Jan, 2017

5 commits

  • In ieee80211_xmit_fast(), 'info' is initialized to point to the skb
    that's passed in, but that skb may later be replaced by a clone (if
    it was shared), leading to an invalid pointer.

    This can lead to use-after-free and also later crashes since the
    real SKB's info->hw_queue doesn't get initialized properly.

    Fix this by assigning info only later, when it's needed, after the
    skb replacement (may have) happened.

    Cc: stable@vger.kernel.org
    Reported-by: Ben Greear
    Signed-off-by: Johannes Berg

    Johannes Berg
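
    The pattern behind the fix above, deriving a pointer into a buffer only
    after the buffer may have been replaced by a private copy, can be
    sketched like this. All names (struct buf, unshare(), prepare_xmit())
    are hypothetical; this is not the mac80211 code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer with embedded control data; not struct sk_buff. */
struct buf {
    int shared;     /* non-zero if someone else also holds this buffer */
    int hw_queue;   /* control info that must be set on the buffer we send */
};

/* Return a private copy if the buffer is shared, otherwise the buffer itself. */
struct buf *unshare(struct buf *b)
{
    struct buf *copy;

    if (!b->shared)
        return b;
    copy = malloc(sizeof(*copy));
    if (!copy)
        return NULL;
    memcpy(copy, b, sizeof(*copy));
    copy->shared = 0;
    return copy;
}

/* Set the queue on whichever buffer is actually going to be transmitted. */
struct buf *prepare_xmit(struct buf *b, int queue)
{
    int *hw_queue;

    b = unshare(b);
    if (!b)
        return NULL;
    hw_queue = &b->hw_queue;  /* taken late, so it refers to the final buffer */
    *hw_queue = queue;
    return b;
}
```

    Taking the pointer before the call to unshare() would leave it aimed at
    the original buffer, which is the class of bug the entry describes.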
     
  • For connected sockets, __l2tp_ip{,6}_bind_lookup() needs to check the
    remote IP when looking for a matching socket. Otherwise a connected
    socket can receive traffic not originating from its peer.

    Drop l2tp_ip_bind_lookup() and l2tp_ip6_bind_lookup() instead of
    updating their prototype, as these functions aren't used.

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
     
  • An L2TP socket bound to the unspecified address should match with any
    address. If not, it can't receive any packet and __l2tp_ip6_bind_lookup()
    can't prevent another socket from binding on the same device/tunnel ID.

    While there, rename the 'addr' variable to 'sk_laddr' (local addr), to
    make following patch clearer.

    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault
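
    The matching rules from the two L2TP lookup entries above (a socket bound
    to the unspecified address matches any local address, and a connected
    socket only matches its peer) can be sketched as below. This is an
    illustration with 32-bit stand-in addresses; struct tunnel_sock,
    ADDR_ANY and sock_matches() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADDR_ANY 0u   /* stand-in for the unspecified address */

/* Hypothetical socket record; field names are illustrative only. */
struct tunnel_sock {
    uint32_t laddr;     /* bound local address, ADDR_ANY if unspecified */
    uint32_t raddr;     /* peer address, ADDR_ANY if not connected */
    uint32_t conn_id;   /* tunnel/connection ID */
};

bool sock_matches(const struct tunnel_sock *sk,
                  uint32_t pkt_daddr, uint32_t pkt_saddr, uint32_t conn_id)
{
    if (sk->conn_id != conn_id)
        return false;
    /* A socket bound to the unspecified address accepts any local address. */
    if (sk->laddr != ADDR_ANY && sk->laddr != pkt_daddr)
        return false;
    /* A connected socket only accepts traffic from its peer. */
    if (sk->raddr != ADDR_ANY && sk->raddr != pkt_saddr)
        return false;
    return true;
}
```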
     
  • Update nlmsg_len field with genlmsg_end to enable userspace processing
    using nlmsg_next helper. Also adds error handling.

    Signed-off-by: Reiter Wolfgang
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Reiter Wolfgang
     
  • ->setattr() was recently implemented for socket files to sync the socket
    inode's uid to the new 'sk_uid' member of struct sock. It does this by
    copying over the ia_uid member of struct iattr. However, ia_uid is
    actually only valid when ATTR_UID is set in ia_valid, indicating that
    the uid is being changed, e.g. by chown. Other metadata operations such
    as chmod or utimes leave ia_uid uninitialized. Therefore, sk_uid could
    be set to a "garbage" value from the stack.

    Fix this by only copying the uid over when ATTR_UID is set.

    Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
    Signed-off-by: Eric Biggers
    Tested-by: Lorenzo Colitti
    Acked-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Eric Biggers
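
    The rule in the entry above, that a field is only meaningful when its
    validity bit is set, can be sketched as follows. The flag values and the
    names struct attr_change, struct sock_owner and sync_uid() are
    hypothetical stand-ins, not the kernel's struct iattr or struct sock.

```c
#include <stdint.h>

#define ATTR_MODE_FLAG (1u << 0)   /* hypothetical flag values */
#define ATTR_UID_FLAG  (1u << 1)

/* Hypothetical attribute-change request. */
struct attr_change {
    unsigned int valid;   /* bitmask: which fields below are meaningful */
    uint32_t uid;
    uint32_t mode;
};

struct sock_owner {
    uint32_t uid;
};

/* Copy the uid only when the request actually changes it. */
void sync_uid(struct sock_owner *sk, const struct attr_change *attr)
{
    if (attr->valid & ATTR_UID_FLAG)
        sk->uid = attr->uid;
}
```

    A chmod-style request that sets only the mode bit leaves the stored uid
    untouched, whatever value happens to sit in the uid field.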
     

30 Dec, 2016

4 commits

  • IPv4 output routes already use l3mdev device instead of loopback for dst's
    if it is applicable. Change local input routes to do the same.

    This fixes icmp responses for unreachable UDP ports which are directed
    to the wrong table after commit 9d1a6c4ea43e4 because local_input
    routes use the loopback device. Moving from ingress device to loopback
    loses the L3 domain, causing responses based on the dst to get lost.

    Fixes: 9d1a6c4ea43e4 ("net: icmp_route_lookup should use rt dev to
    determine L3 domain")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • When we send a packet to our own local address on a non-loopback
    interface (e.g. eth0), the change introduced by
    commit 0b922b7a829c ("net: original ingress device index in PKTINFO")
    sets the original ingress device index to the loopback interface.
    However, the packet should be treated as if it arrived via the
    sending interface (eth0); otherwise it breaks the expectations of
    userspace applications (e.g. the DHCPRELEASE message from the dhcp_release
    binary would be ignored by the dnsmasq daemon, since it comes from lo, which
    is not the interface dnsmasq is bound to).

    Fixes: 0b922b7a829c ("net: original ingress device index in PKTINFO")
    Acked-by: David Ahern
    Signed-off-by: Wei Zhang
    Signed-off-by: David S. Miller

    Wei Zhang
     
  • We fail to check whether the netlink message is actually big enough to contain
    a struct if_stats_msg.

    Add a check to prevent userland from sending us short messages that would
    make us access memory beyond the end of the message.

    Fixes: 10c9ead9f3c6 ("rtnetlink: add new RTM_GETSTATS message to dump...")
    Signed-off-by: Mathias Krause
    Cc: Roopa Prabhu
    Signed-off-by: David S. Miller

    Mathias Krause
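
    The kind of length check the entry above adds can be sketched in
    user-space C. The structures below are simplified stand-ins, not the
    kernel's struct nlmsghdr or struct if_stats_msg, and payload_fits() is a
    hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified message header: `len` counts the header plus the payload. */
struct msg_hdr {
    uint32_t len;
    uint16_t type;
    uint16_t flags;
};

/* Simplified stand-in for the fixed-size request the handler expects. */
struct stats_req {
    uint8_t  family;
    uint8_t  pad1;
    uint16_t pad2;
    uint32_t ifindex;
    uint32_t filter_mask;
};

/* True if the message is long enough to carry `payload_size` bytes. */
bool payload_fits(const struct msg_hdr *h, size_t payload_size)
{
    return h->len >= sizeof(*h) && h->len - sizeof(*h) >= payload_size;
}
```

    A handler would call payload_fits(h, sizeof(struct stats_req)) before
    casting the payload, and reject the message otherwise.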
     
  • …_append_data and ip6_finish_output

    The length checks in __ip6_append_data and ip6_finish_output are
    inconsistent. The variable length in __ip6_append_data includes only the
    application's payload and the udp6 header, not the ipv6 header, whereas
    ip6_finish_output tests (skb->len > ip6_skb_dst_mtu(skb)), and skb->len
    includes the ipv6 header.

    As a result, udp6 payloads whose length is between (MTU - IPv6 Header)
    and MTU are fragmented by ip6_fragment even though the rst->dev supports
    the UFO feature.

    Add the length of the ipv6 header to length in __ip6_append_data so that
    its check is consistent with the one in ip6_finish_output.

    Signed-off-by: Zheng Li <james.z.li@ericsson.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

    Zheng Li
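
    The mismatch described in the entry above can be made concrete with a
    little arithmetic. The sketch below is a simplified model, not the kernel
    code: the function names are hypothetical, and a fixed 40-byte IPv6
    header with no extension headers is assumed.

```c
#include <stdbool.h>
#include <stddef.h>

#define IPV6_HDR_LEN 40u   /* fixed IPv6 header, no extension headers assumed */

/* Check at output time: the full packet length is compared with the MTU. */
bool exceeds_mtu_on_output(size_t payload_and_udp, size_t mtu)
{
    return payload_and_udp + IPV6_HDR_LEN > mtu;
}

/* Check while appending data, before the fix: the IPv6 header is not counted. */
bool exceeds_mtu_on_append_old(size_t payload_and_udp, size_t mtu)
{
    return payload_and_udp > mtu;
}

/* Check while appending data, after the fix: same quantity as at output time. */
bool exceeds_mtu_on_append_fixed(size_t payload_and_udp, size_t mtu)
{
    return payload_and_udp + IPV6_HDR_LEN > mtu;
}
```

    With an MTU of 1500 and 1480 bytes of payload plus UDP header, the old
    append-time check says the packet fits while the output-time check says
    it does not; that window is where the entry reports the unwanted
    fragmentation.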
     

29 Dec, 2016

2 commits

  • This patch fixes the following warnings when CONFIG_PROC_FS is not set:

    linux/net/atm/lec.c: In function ‘lane_module_cleanup’:
    linux/net/atm/lec.c:1062:27: error: ‘atm_proc_root’ undeclared (first
    use in this function)
    remove_proc_entry("lec", atm_proc_root);
    ^
    linux/net/atm/lec.c:1062:27: note: each undeclared identifier is
    reported only once for each function it appears in

    Signed-off-by: Augusto Mecking Caringi
    Signed-off-by: David S. Miller

    Augusto Mecking Caringi
     
  • Since we now use a non zero mask on addr_type, we are matching on its
    value (IPV4/IPV6). So before this fix, matching on enc_src_ip/enc_dst_ip
    failed in SW/classify path since its value was zero.
    This patch sets the proper value of addr_type for encapsulated packets.

    Fixes: 970bfcd09791 ('net/sched: cls_flower: Use mask for addr_type')
    Signed-off-by: Paul Blakey
    Reviewed-by: Hadar Hen Zion
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Paul Blakey
     

28 Dec, 2016

4 commits

  • Pull networking fixes from David Miller:

    1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.

    The most important is to not assume the packet is RX just because
    the destination address matches that of the device. Such an
    assumption causes problems when an interface is put into loopback
    mode.

    2) If we retry when creating a new tc entry (because we dropped the
    RTNL mutex in order to load a module, for example) we end up with
    -EAGAIN and then loop trying to replay the request. But we didn't
    reset some state when looping back to the top like this, and if
    another thread meanwhile inserted the same tc entry we were trying
    to, we re-link it creating an endless loop in the tc chain. Fix from
    Daniel Borkmann.

    3) There are two different WRITE bits in the MDIO address register for
    the stmmac chip, depending upon the chip variant. Due to a bug we
    could set them both, fix from Hock Leong Kweh.

    4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    net: stmmac: fix incorrect bit set in gmac4 mdio addr register
    r8169: add support for RTL8168 series add-on card.
    net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
    openvswitch: upcall: Fix vlan handling.
    ipv4: Namespaceify tcp_tw_reuse knob
    net: korina: Fix NAPI versus resources freeing
    net, sched: fix soft lockup in tc_classify
    net/mlx4_en: Fix user prio field in XDP forward
    tipc: don't send FIN message from connectionless socket
    ipvlan: fix multicast processing
    ipvlan: fix various issues in ipvlan_process_multicast()

    Linus Torvalds
     
  • After commit 73b62bd085f4737679ea9afc7867fa5f99ba7d1b ("virtio-net:
    remove the warning before XDP linearizing"), there are no users of
    bpf_warn_invalid_xdp_buffer(), so remove it. This reverts
    commit f23bc46c30ca5ef58b8549434899fcbac41b2cfc.

    Cc: Daniel Borkmann
    Cc: John Fastabend
    Signed-off-by: Jason Wang
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jason Wang
     
  • The networking stack accelerates vlan tag handling by
    keeping the topmost vlan header in the skb. This works as
    long as the packet remains in the OVS datapath. But during
    an OVS upcall the vlan header is pushed onto the packet.
    When such a packet is sent back to the OVS datapath, the core
    networking stack might not handle it correctly. This
    patch avoids the issue by accelerating the vlan tag
    during flow key extraction. This simplifies the datapath by
    making packet processing uniform for packets from
    all code paths.

    Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets").
    CC: Jarno Rajahalme
    CC: Jiri Benc
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    pravin shelar
     
  • Different namespaces might have different requirements to reuse
    TIME-WAIT sockets for new connections. This might be required in
    cases where different namespace applications are in place which
    require TIME_WAIT socket connections to be reduced independently
    of the host.

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     

27 Dec, 2016

1 commit

  • Shahar reported a soft lockup in tc_classify(), where we run into an
    endless loop when walking the classifier chain due to tp->next == tp
    which is a state we should never run into. The issue only seems to
    trigger under load in the tc control path.

    What happens is that in tc_ctl_tfilter(), thread A allocates a new
    tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
    with it. In that classifier callback we had to unlock/lock the rtnl
    mutex and returned with -EAGAIN. One reason why we need to drop there
    is, for example, that we need to request an action module to be loaded.

    This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
    after we loaded and found the requested action, we need to redo the
    whole request so we don't race against others. While we had to unlock
    rtnl in that time, thread B's request was processed next on that CPU.
    Thread B added a new tp instance successfully to the classifier chain.
    When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
    and destroying its tp instance which never got linked, we goto replay
    and redo A's request.

    This time when walking the classifier chain in tc_ctl_tfilter() for
    checking for existing tp instances we had a priority match and found
    the tp instance that was created and linked by thread B. Now calling
    again into tp->ops->change() with that tp was successful and returned
    without error.

    tp_created was never cleared in the second round, thus kernel thinks
    that we need to link it into the classifier chain (once again). tp and
    *back point to the same object due to the match we had earlier on. Thus
    for thread B's already public tp, we reset tp->next to tp itself and
    link it into the chain, which eventually causes the mentioned endless
    loop in tc_classify() once a packet hits the data path.

    Fix is to clear tp_created at the beginning of each request, also when
    we replay it. On the paths that can cause -EAGAIN we already destroy
    the original tp instance we had and on replay we really need to start
    from scratch. It seems that this issue was first introduced in commit
    12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
    and avoid kernel panic when we use cls_cgroup").

    Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
    Reported-by: Shahar Klein
    Signed-off-by: Daniel Borkmann
    Cc: Cong Wang
    Acked-by: Eric Dumazet
    Tested-by: Shahar Klein
    Signed-off-by: David S. Miller

    Daniel Borkmann
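
    The replay problem described in the entry above can be modelled with a
    small singly linked chain. This is only a sketch under assumptions:
    struct node, find() and handle_request() are hypothetical, and the
    competing thread is simulated by linking `rival` at the point where the
    lock would have been dropped.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
    int prio;
    struct node *next;
};

struct node *find(struct node *head, int prio)
{
    for (; head; head = head->next)
        if (head->prio == prio)
            return head;
    return NULL;
}

/*
 * Use the existing node for fresh->prio, or link `fresh` if there is none.
 * If `eagain_once` is set, the first attempt is abandoned after `rival`
 * has been linked by a simulated competitor, and the request is replayed.
 */
struct node *handle_request(struct node **head, struct node *fresh,
                            struct node *rival, bool eagain_once)
{
    bool created;
    struct node *n;

replay:
    created = false;    /* the fix: reset on every attempt, replay included */
    n = find(*head, fresh->prio);
    if (!n) {
        n = fresh;
        created = true;
    }
    if (eagain_once) {
        eagain_once = false;
        rival->next = *head;    /* competitor links its node meanwhile */
        *head = rival;
        goto replay;
    }
    if (created) {              /* only link nodes created in this attempt */
        n->next = *head;
        *head = n;
    }
    return n;
}
```

    If `created` were set once before the replay label and never cleared,
    the second pass would find `rival`, still believe it had created it, and
    execute rival->next = rival, which is the self-loop the entry describes.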
     

26 Dec, 2016

3 commits

  • No point in going through loops and hoops instead of just comparing the
    values.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     
  • ktime_set(S,N) was required for the timespec storage type and is still
    useful for situations where a Seconds and Nanoseconds part of a time value
    needs to be converted. For anything where the Seconds argument is 0, this
    is pointless and can be replaced with a simple assignment.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
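
    Why ktime_set(0, N) reduces to a plain assignment can be seen from a
    simplified user-space stand-in. The sketch assumes ktime_t is a signed
    64-bit nanosecond count and omits any range checking the kernel helper
    may perform.

```c
#include <stdint.h>

typedef int64_t ktime_t;            /* scalar nanoseconds (assumed) */
#define NSEC_PER_SEC 1000000000LL

/* Simplified stand-in: combine a seconds part and a nanoseconds part. */
ktime_t ktime_set(int64_t secs, unsigned long nsecs)
{
    return secs * NSEC_PER_SEC + (int64_t)nsecs;
}
```

    With secs equal to 0 the multiplication contributes nothing, so
    `t = ktime_set(0, n);` and `t = n;` yield the same value.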
     
  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64-bit machines and in an endianness-optimized timespec
    variant for 32-bit machines. The Y2038 cleanup removed the timespec variant
    and switched everything to scalar nanoseconds. The union remained, but
    became completely pointless.

    Get rid of the union and just keep ktime_t as a simple typedef of type s64.

    The conversion was done with coccinelle and some manual mopping up.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

25 Dec, 2016

1 commit


24 Dec, 2016

8 commits

  • In commit 6f00089c7372 ("tipc: remove SS_DISCONNECTING state") the
    check for socket type is in the wrong place, causing a closing socket
    to always send out a FIN message even when the socket was never
    connected. This is normally harmless, since the destination node for
    such messages most often is zero, and the message will be dropped, but
    it is still a wrong and confusing behavior.

    We fix this in this commit.

    Reviewed-by: Parthasarathy Bhuvaragan
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Currently if SCTP closes the receive window with window pressure, mostly
    caused by excessive skb overhead on payload/overheads ratio, SCTP will
    close the window abruptly while saving the delta on rwnd_press. It will
    start recovering rwnd as the chunks are consumed by the application and
    the rwnd_press will be only recovered after rwnd reaches the same value as
    rwnd_press, mostly to prevent silly window syndrome.

    Thing is, this is very inefficient with small data chunks, as with those
    it will never reach back that value, and thus it will never recover from
    such pressure. This means that we will not issue window updates when
    recovering from 0 window and will rely on a sender retransmit to notice
    it.

    The fix here is to remove such threshold, as no value is good enough: it
    depends on the (avg) chunk sizes being used.

    Test with netperf -t SCTP_STREAM -- -m 1, and trigger 0 window by
    sending SIGSTOP to netserver, sleep 1.2, and SIGCONT.
    Rate limited to 845kbps, for visibility. Capture done at netserver side.

    Previously:
    01.500751 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632372996] [a_rwnd 99153] [
    01.500752 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632372997] [SID: 0] [SS
    01.517471 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
    01.517483 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    01.517485 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
    01.517488 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    01.534168 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373096] [SID: 0] [SS
    01.534180 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    01.534181 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373169] [SID: 0] [SS
    01.534185 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    02.525978 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
    02.526021 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
    (window update missed)
    04.573807 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
    04.779370 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373082] [a_rwnd 859] [#g
    04.789162 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
    04.789323 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373156] [SID: 0] [SS
    04.789372 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373228] [a_rwnd 786] [#g

    After:
    02.568957 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098728] [a_rwnd 99153]
    02.568961 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098729] [SID: 0] [S
    02.585631 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
    02.585666 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    02.585671 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
    02.585683 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    02.602330 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098828] [SID: 0] [S
    02.602359 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    02.602363 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098901] [SID: 0] [S
    02.602372 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    03.600788 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
    03.600830 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
    03.619455 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 13508]
    03.619479 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 27017]
    03.619497 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 40526]
    03.619516 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 54035]
    03.619533 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 67544]
    03.619552 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 81053]
    03.619570 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 94562]
    (following data transmission triggered by window updates above)
    03.633504 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
    03.836445 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098814] [a_rwnd 100000]
    03.843125 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
    03.843285 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098888] [SID: 0] [S
    03.843345 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098960] [a_rwnd 99894]
    03.856546 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098961] [SID: 0] [S
    03.866450 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490099011] [SID: 0] [S

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • It's possible that we receive a packet that is larger than the current
    window. The first such packet increases rwnd_over. Then, if we receive
    another data chunk (especially as
    SCTP allows you to have one data chunk in flight even during 0 window),
    rwnd_over will be overwritten instead of added to.

    In the long run, this could cause the window to grow bigger than its
    initial size, as rwnd_over would be charged only for the last received
    data chunk while the code will try to open the window for all packets that
    were received and had their value in rwnd_over overwritten. This, then,
    can lead to the worsening of payload/buffer ratio and cause rwnd_press
    to kick in more often.

    The fix is to sum it too, same as is done for rwnd_press, so that if we
    receive 3 chunks after closing the window, we still have to release that
    same amount before re-opening it.

    Log snippet from sctp_test exhibiting the issue:
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
    rwnd decreased by 1 to (0, 1, 114221)
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease:
    association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
    rwnd decreased by 1 to (0, 1, 114221)
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease:
    association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
    rwnd decreased by 1 to (0, 1, 114221)
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease:
    association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
    [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
    rwnd decreased by 1 to (0, 1, 114221)

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
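
    The accounting change in the entry above (accumulating the overshoot
    instead of overwriting it) can be sketched as follows. struct rwnd_state
    and both functions are hypothetical simplifications that ignore
    rwnd_press and everything else the real code tracks.

```c
#include <stdint.h>

/* Hypothetical receive-window state. */
struct rwnd_state {
    uint32_t rwnd;        /* currently advertised window */
    uint32_t rwnd_over;   /* bytes accepted beyond the window */
};

/* Account for `len` received bytes. */
void rwnd_decrease(struct rwnd_state *s, uint32_t len)
{
    if (s->rwnd >= len) {
        s->rwnd -= len;
    } else {
        /* Sum the overshoot; assigning here would forget earlier chunks. */
        s->rwnd_over += len - s->rwnd;
        s->rwnd = 0;
    }
}

/* Account for `len` bytes consumed by the application. */
void rwnd_increase(struct rwnd_state *s, uint32_t len)
{
    /* Pay back the overshoot before re-opening the window. */
    if (s->rwnd_over >= len) {
        s->rwnd_over -= len;
        return;
    }
    len -= s->rwnd_over;
    s->rwnd_over = 0;
    s->rwnd += len;
}
```

    After three one-byte chunks arrive on a closed window, three bytes must
    be consumed before the window starts to re-open.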
     
  • neigh_cleanup_and_release() is always called after marking a neighbour
    as dead, but it only notifies user space and not in-kernel listeners of
    the netevent notification chain.

    This can cause multiple problems. In my specific use case, it causes the
    listener (a switch driver capable of L3 offloads) to believe a neighbour
    entry is still valid, and is thus erroneously kept in the device's
    table.

    Fix that by sending a netevent after marking the neighbour as dead.

    Fixes: a6bf9e933daf ("mlxsw: spectrum_router: Offload neighbours based on NUD state change")
    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • By setting certain socket options on ipv6 raw sockets, we can confuse the
    length calculation in rawv6_push_pending_frames triggering a BUG_ON.

    RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
    RSP: 0018:ffff881f6c4a7c18 EFLAGS: 00010282
    RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
    RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
    RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
    R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
    R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80

    Call Trace:
    [] ? unmap_page_range+0x693/0x830
    [] inet_sendmsg+0x67/0xa0
    [] sock_sendmsg+0x38/0x50
    [] SYSC_sendto+0xef/0x170
    [] SyS_sendto+0xe/0x10
    [] do_syscall_64+0x50/0xa0
    [] entry_SYSCALL64_slow_path+0x25/0x25

    Handle by jumping to the failure path if skb_copy_bits gets an EFAULT.

    Reproducer:

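    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define LEN 504

    int main(int argc, char* argv[])
    {
        int fd;
        int zero = 0;
        char buf[LEN];

        memset(buf, 0, LEN);

        /* raw IPv6 socket, protocol number 7 */
        fd = socket(AF_INET6, SOCK_RAW, 7);

        /* checksum offset 0 plus a large destination-options header */
        setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4);
        setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);

        /* one byte of payload is enough to hit the BUG_ON */
        sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);
    }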

    Signed-off-by: Dave Jones
    Signed-off-by: David S. Miller

    Dave Jones
     
  • Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
    the packet. For sockets that have transport headers pulled, transport
    offset can be negative. Use signed comparison to avoid overflow.

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Reported-by: Nisar Jagabar
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
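
    The signed-versus-unsigned pitfall behind the entry above can be shown
    in isolation. The sketch is a simplified model with hypothetical function
    names; the constant 4 stands for the two 16-bit ports.

```c
#include <stdbool.h>

/* Unsigned comparison: a negative offset wraps around to a huge value. */
bool ports_available_unsigned(int transport_offset, unsigned int len)
{
    return !((unsigned int)(transport_offset + 4) > len);
}

/* Signed comparison: a negative offset stays negative and passes the check. */
bool ports_available_signed(int transport_offset, unsigned int len)
{
    return !(transport_offset + 4 > (int)len);
}
```

    With an offset of -8 (transport header already pulled) and a 100-byte
    packet, the unsigned form compares 4294967292 against 100 on platforms
    with 32-bit unsigned int and wrongly concludes that the ports lie
    outside the packet.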
     
  • When matching on flags, we should require the user to provide the
    mask and avoid using an all-ones mask. Not doing so causes matching
    on flags provided without a mask to hit on the value being unset for all
    flags, which may not be what the user wanted to happen.

    Fixes: faa3ffce7829 ('net/sched: cls_flower: Add support for matching on flags')
    Signed-off-by: Or Gerlitz
    Reported-by: Paul Blakey
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Or Gerlitz
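
    The effect of the mask in the entry above can be seen with a plain
    masked comparison. flags_match() is a hypothetical helper, not the
    cls_flower implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare only the bits selected by `mask`. */
bool flags_match(uint32_t pkt_flags, uint32_t key, uint32_t mask)
{
    return (pkt_flags & mask) == (key & mask);
}
```

    With an all-ones mask, a key that sets a single flag also demands that
    every other flag be clear, so a packet carrying an additional flag no
    longer matches.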
     
  • The UDP dst port was provided to the helper function which sets the
    IPv6 IP tunnel meta-data in the wrong parameter order; fix that.

    Fixes: 75bfbca01e48 ('net/sched: act_tunnel_key: Add UDP dst port option')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Hadar Hen Zion
    Signed-off-by: David S. Miller

    Or Gerlitz
     

23 Dec, 2016

1 commit

  • Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP
    protocols.") made ip_do_redirect call sock_net(sk) to determine
    the network namespace of the passed-in socket. This crashes if sk
    is NULL.

    Fix this by getting the network namespace from the skb instead.

    Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.")
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

22 Dec, 2016

1 commit

  • Madalin reported crashes happening in tcp_tasklet_func() on powerpc64

    Before TSQ_QUEUED bit is cleared, we must ensure the changes done
    by list_del(&tp->tsq_node); are committed to memory, otherwise
    corruption might happen, as another cpu could catch TSQ_QUEUED
    clearance too soon.

    We can notice that old kernels were immune to this bug, because
    TSQ_QUEUED was cleared after a bh_lock_sock(sk)/bh_unlock_sock(sk)
    section, but they could have missed a kick to write additional bytes,
    when NIC interrupts for a given flow are spread to multiple cpus.

    Affected TCP flows would need an incoming ACK or RTO timer to add more
    packets to the pipe. So overall situation should be better now.

    Fixes: b223feb9de2a ("tcp: tsq: add shortcut in tcp_tasklet_func()")
    Signed-off-by: Eric Dumazet
    Reported-by: Madalin Bucur
    Tested-by: Madalin Bucur
    Tested-by: Xing Lei
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Dec, 2016

6 commits

  • To make the code clearer, use rb_entry() instead of container_of() to
    deal with rbtree.

    Signed-off-by: Geliang Tang
    Reviewed-by: Leon Romanovsky
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Geliang Tang
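
    For readers unfamiliar with the two macros named above, the following
    user-space sketch shows why they are interchangeable. The container_of()
    here is a simplified version without the kernel's type checking, and
    struct flow and flow_id_from_node() are hypothetical.

```c
#include <stddef.h>

/* Simplified container_of(): step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* rb_entry() is the same operation under a name that states the intent. */
#define rb_entry(ptr, type, member) container_of(ptr, type, member)

struct rb_node {
    struct rb_node *left, *right;
};

/* Hypothetical object that embeds a tree node. */
struct flow {
    int id;
    struct rb_node node;
};

int flow_id_from_node(struct rb_node *n)
{
    return rb_entry(n, struct flow, node)->id;
}
```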
     
  • To make the code clearer, use rb_entry() instead of container_of() to
    deal with rbtree.

    Signed-off-by: Geliang Tang
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • To make the code clearer, use rb_entry() instead of container_of() to
    deal with rbtree.

    Signed-off-by: Geliang Tang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • sctp.local_addr_list is a global address list that is supposed to include
    all the local addresses. sctp updates this list according to NETDEV_UP/
    NETDEV_DOWN notifications.

    However, if multiple NICs have the same address, the global list would
    have duplicate addresses. Even for a single NIC, promoting secondaries in
    __inet_del_ifa can also lead to accumulating duplicate addresses.

    When sctp binds address 'ANY' and creates a connection, it copies all
    the addresses from global list into asoc's bind addr list, which makes
    sctp pack the duplicate addresses into INIT/INIT_ACK packets.

    This patch is to filter the duplicate addresses when copying the addrs
    from global list in sctp_copy_local_addr_list and unpacking addr_param
    from cookie in sctp_raw_to_bind_addrs to asoc's bind addr list.

    Note that we can't filter the duplicate addrs when global address list
    gets updated, as NETDEV_DOWN event may remove an addr that still exists
    in another NIC.

    Signed-off-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Xin Long
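
    Filtering duplicates at copy time, as the entry above describes, can be
    sketched with plain arrays. This is an illustration only: addresses are
    32-bit stand-ins and copy_unique_addrs() is a hypothetical helper, not
    sctp_copy_local_addr_list().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy addresses from `src` to `dst`, skipping any already copied.
 * Returns the number of entries written; stops when `cap` is reached.
 */
size_t copy_unique_addrs(const uint32_t *src, size_t n,
                         uint32_t *dst, size_t cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        bool seen = false;

        for (size_t j = 0; j < count; j++) {
            if (dst[j] == src[i]) {
                seen = true;
                break;
            }
        }
        if (seen)
            continue;       /* duplicate: leave the source list untouched */
        if (count == cap)
            break;
        dst[count++] = src[i];
    }
    return count;
}
```

    The source list keeps its duplicates, so removing one NIC's entry later
    does not drop an address that another NIC still owns.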
     
  • This patch is to reduce indent level by using continue when the addr
    is not allowed, and also drop end_copy by using break.

    Signed-off-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Xin Long
     
  • Add a break statement to prevent fall-through from
    OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL. Without the break
    actions setting ethernet addresses fail to validate with log messages
    complaining about invalid tunnel attributes.

    Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets")
    Signed-off-by: Jarno Rajahalme
    Acked-by: Pravin B Shelar
    Acked-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jarno Rajahalme
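
    The fall-through described in the entry above is easy to reproduce in a
    small switch statement. The enum values and validate_set() are
    hypothetical; they only mirror the shape of the problem.

```c
enum key_attr {
    KEY_ATTR_ETHERNET,
    KEY_ATTR_TUNNEL
};

/* Return 0 if the attribute is acceptable, -1 otherwise. */
int validate_set(enum key_attr attr, int tunnel_ok)
{
    switch (attr) {
    case KEY_ATTR_ETHERNET:
        /* Nothing further to check for Ethernet addresses in this sketch. */
        break;  /* without this break, control falls into the tunnel case */
    case KEY_ATTR_TUNNEL:
        if (!tunnel_ok)
            return -1;
        break;
    default:
        return -1;
    }
    return 0;
}
```

    Removing the first break makes validate_set(KEY_ATTR_ETHERNET, 0) return
    -1, i.e. an Ethernet action rejected because of tunnel validation.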