05 Jun, 2017

1 commit

  • Alexander reported various KASAN messages triggered in recent kernels

    The problem is that ping sockets should not use udp_poll() in the first
    place, and recent changes in UDP stack finally exposed this old bug.

    Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
    Fixes: 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.")
    Signed-off-by: Eric Dumazet
    Reported-by: Sasha Levin
    Cc: Solar Designer
    Cc: Vasiliy Kulikov
    Cc: Lorenzo Colitti
    Acked-By: Lorenzo Colitti
    Tested-By: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Jun, 2017

1 commit

  • When the sender switches its congestion control during loss
    recovery, if the recovery is spurious then it may incorrectly
    revert cwnd and ssthresh to the older values set by a previous
    congestion control. Consider a congestion control (like BBR)
    that does not use ssthresh and keeps it infinite: the connection
    may incorrectly revert cwnd to an infinite value when switching
    from BBR to another congestion control.

    This patch fixes it by disallowing such cwnd undo operation
    upon switching congestion control. Note that undo_marker
    is not reset s.t. the packets that were incorrectly marked
    lost would be corrected. We only avoid undoing the cwnd in
    tcp_undo_cwnd_reduction().
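
    The effect described above can be sketched in user-space C. This is a
    minimal illustration, not the kernel code: struct conn and the three
    helpers are hypothetical, and clearing prior_ssthresh on a congestion
    control switch is an assumption about how the undo could be disabled.

```c
/* "Infinite" ssthresh, as kept by a congestion control that ignores it. */
#define INFINITE_SSTHRESH 0x7fffffffU

/* Hypothetical, simplified per-connection state. */
struct conn {
	unsigned int cwnd;
	unsigned int ssthresh;
	unsigned int prior_cwnd;     /* saved when loss recovery starts */
	unsigned int prior_ssthresh; /* 0 means "cwnd undo not allowed" */
};

/* Save the current values, then halve the window on entering recovery. */
static void enter_recovery(struct conn *c)
{
	c->prior_cwnd = c->cwnd;
	c->prior_ssthresh = c->ssthresh;
	c->ssthresh = c->cwnd / 2;
	c->cwnd = c->ssthresh;
}

/* Assumed fix: forget the saved ssthresh so a later undo is a no-op. */
static void switch_congestion_control(struct conn *c)
{
	c->prior_ssthresh = 0;
}

/* Revert cwnd and ssthresh only if an undo is still allowed. */
static void undo_cwnd_reduction(struct conn *c)
{
	if (!c->prior_ssthresh)
		return;
	c->cwnd = c->prior_cwnd;
	if (c->prior_ssthresh > c->ssthresh)
		c->ssthresh = c->prior_ssthresh;
}
```

    Without the switch, a spurious recovery restores the old values as
    before; after a switch, the values chosen during recovery stay in place.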

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

01 Jun, 2017

1 commit

  • MTU probing initialization occurred only at connect() and at SYN or
    SYN-ACK reception, but the former sets MSS to either the default or the
    user set value (through TCP_MAXSEG sockopt) and the latter never happens
    with repaired sockets.

    The result was that, with MTU probing enabled and unless TCP_MAXSEG
    sockopt was used before connect(), probing would be stuck at
    tcp_base_mss value until tcp_probe_interval seconds have passed.

    Signed-off-by: Douglas Caetano dos Santos
    Signed-off-by: David S. Miller

    Douglas Caetano dos Santos
     

27 May, 2017

1 commit

  • Andrey Konovalov reported crashes in ipv4_mtu()

    I could reproduce the issue with KASAN kernels, between
    10.246.7.151 and 10.246.7.152 :

    1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &

    2) At the same time run the following loop :
    while :
    do
    ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
    ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
    done

    Cong Wang attempted to add back rt->fi in commit
    82486aa6f1b9 ("ipv4: restore rt->fi for reference counting")
    but this proved to add some issues that were complex to solve.

    Instead, I suggested to add a refcount to the metrics themselves,
    being a standalone object (in particular, no reference to other objects)
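
    A reference count carried by the metrics object itself might look like
    the following user-space sketch. It uses C11 atomics rather than the
    kernel's refcount primitives, and struct metrics, metrics_alloc(),
    metrics_get() and metrics_put() are hypothetical names for illustration.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical standalone metrics object; it references no other object. */
struct metrics {
	atomic_int refcnt;
	unsigned int mtu;
};

/* Allocate a metrics object holding one reference for the creator. */
static struct metrics *metrics_alloc(unsigned int mtu)
{
	struct metrics *m = malloc(sizeof(*m));

	if (!m)
		return NULL;
	atomic_init(&m->refcnt, 1);
	m->mtu = mtu;
	return m;
}

/* Take an extra reference, e.g. when a dst starts pointing at the metrics. */
static struct metrics *metrics_get(struct metrics *m)
{
	atomic_fetch_add(&m->refcnt, 1);
	return m;
}

/* Drop a reference; returns 1 if this call freed the object, else 0. */
static int metrics_put(struct metrics *m)
{
	if (atomic_fetch_sub(&m->refcnt, 1) == 1) {
		free(m);
		return 1;
	}
	return 0;
}
```

    With this shape, the route that created the metrics can drop its
    reference while a dst still holds one, and the memory stays valid until
    the last holder calls metrics_put().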

    I tried to make this patch as small as possible to ease its backport,
    instead of being super clean. Note that we believe that only ipv4 dst
    needs to take care of the metric refcount. But if this is wrong,
    this patch adds the basic infrastructure to extend this to other
    families.

    Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
    for his efforts on this problem.

    Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Reviewed-by: Julian Anastasov
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 May, 2017

2 commits

  • Commit 7d472a59c0e5ec117220a05de6b370447fb6cb66 ("arp: always override
    existing neigh entries with gratuitous ARP") introduced a compiler
    warning:

    net/ipv4/arp.c:880:35: warning: 'addr_type' may be used uninitialized in
    this function [-Wmaybe-uninitialized]

    While the code logic seems to be correct and doesn't allow the variable
    to be used uninitialized, and the warning is not consistently
    reproducible, it's still worth fixing it for other people not to waste
    time looking at the warning in case it pops up in the build environment.
    Yes, compiler is probably at fault, but we will need to accommodate.

    Fixes: 7d472a59c0e5 ("arp: always override existing neigh entries with gratuitous ARP")
    Signed-off-by: Ihar Hrachyshka
    Signed-off-by: David S. Miller

    Ihar Hrachyshka
     
  • Fastopen API should be used to perform fastopen operations on the TCP
    socket. It does not make sense to use fastopen API to perform disconnect
    by calling it with AF_UNSPEC. The fastopen data path is also prone to
    race conditions and bugs when used with AF_UNSPEC.

    One issue reported and analyzed by Vegard Nossum is as follows:
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Thread A:                        Thread B:
    ------------------------------------------------------------------------
    sendto()
    - tcp_sendmsg()
    - sk_stream_memory_free() = 0
    - goto wait_for_sndbuf
    - sk_stream_wait_memory()
    - sk_wait_event() // sleep
    |                                sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
    |                                - tcp_sendmsg()
    |                                - tcp_sendmsg_fastopen()
    |                                - __inet_stream_connect()
    |                                - tcp_disconnect() // because of AF_UNSPEC
    |                                - tcp_transmit_skb() // send RST
    |                                - return 0; // no reconnect!
    |                                - sk_stream_wait_connect()
    |                                - sock_error()
    |                                - xchg(&sk->sk_err, 0)
    |                                - return -ECONNRESET
    - ... // wake up, see sk->sk_err == 0
    - skb_entail() on TCP_CLOSE socket

    If the connection is reopened then we will send a brand new SYN packet
    after thread A has already queued a buffer. At this point I think the
    socket internal state (sequence numbers etc.) becomes messed up.

    When the new connection is closed, the FIN-ACK is rejected because the
    sequence number is outside the window. The other side tries to
    retransmit, but __tcp_retransmit_skb() calls tcp_trim_head() on an
    empty skb, which corrupts the skb data length and hits a BUG() in
    copy_and_csum_bits().
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
    and returns EOPNOTSUPP to the user in that case.
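
    The shape of such a check can be sketched in user-space C. The helper
    name fastopen_check_dest() is hypothetical and stands in for the point
    in the fastopen send path where the destination address is examined.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical stand-in for the check on the fastopen data path:
 * refuse a destination whose family is AF_UNSPEC, so that a fastopen
 * send can never be turned into a disconnect.
 */
static int fastopen_check_dest(const struct sockaddr *dest)
{
	if (dest != NULL && dest->sa_family == AF_UNSPEC)
		return -EOPNOTSUPP;
	return 0;
}
```

    Callers that really want to disconnect would still go through
    connect() with AF_UNSPEC; only the fastopen path rejects it.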

    Fixes: cf60af03ca4e7 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
    Reported-by: Vegard Nossum
    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

23 May, 2017

1 commit

  • Steffen Klassert says:

    ====================
    pull request (net): ipsec 2017-05-23

    1) Fix wrong header offset for esp4 udpencap packets.

    2) Fix a stack access out of bounds when creating a bundle
    with sub policies. From Sabrina Dubroca.

    3) Fix slab-out-of-bounds in pfkey due to an incorrect
    sadb_x_sec_len calculation.

    4) We checked the wrong feature flags when taking down
    an interface with IPsec offload enabled.
    Fix from Ilan Tayari.

    5) Copy the anti replay sequence numbers when doing a state
    migration, otherwise we get out of sync with the sequence
    numbers. Fix from Antony Antony.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

22 May, 2017

5 commits

  • Currently, when arp_accept is 1, we always override existing neigh
    entries with incoming gratuitous ARP replies. Otherwise, we override
    them only if new replies satisfy _locktime_ conditional (packets arrive
    not earlier than _locktime_ seconds since the last update to the neigh
    entry).

    The idea behind locktime is to pick the very first (=> close) reply
    received in a unicast burst when ARP proxies are used. This helps to
    avoid ARP thrashing where Linux would switch back and forth from one
    proxy to another.

    This logic has nothing to do with gratuitous ARP replies that are
    generally not aligned in time when multiple IP address carriers send
    them into network.

    This patch enforces overriding of existing neigh entries by all incoming
    gratuitous ARP packets, irrespective of their time of arrival. This will
    make the kernel honour all incoming gratuitous ARP packets.
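
    The resulting decision can be summarized by a small sketch. The function
    should_override() and its parameters are hypothetical; the time values
    are plain counters rather than kernel jiffies.

```c
#include <stdbool.h>

/*
 * Hypothetical decision helper: a gratuitous ARP always overrides the
 * existing neighbour entry; any other reply overrides it only once at
 * least `locktime` ticks have passed since the entry was last updated.
 */
static bool should_override(bool is_garp, unsigned long now,
			    unsigned long updated, unsigned long locktime)
{
	if (is_garp)
		return true;
	/* Unsigned subtraction stays correct across counter wraparound. */
	return now - updated >= locktime;
}
```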

    Signed-off-by: Ihar Hrachyshka
    Signed-off-by: David S. Miller

    Ihar Hrachyshka
     
  • The addr_type retrieval can be costly, so it's worth trying to avoid its
    calculation as much as possible. This patch makes it calculated only
    for gratuitous ARP packets. This is especially important since later we
    may want to move is_garp calculation outside of arp_accept block, at
    which point the costly operation will be executed for all setups.

    The patch is the result of a discussion in net-dev:
    http://marc.info/?l=linux-netdev&m=149506354216994

    Suggested-by: Julian Anastasov
    Signed-off-by: Ihar Hrachyshka
    Signed-off-by: David S. Miller

    Ihar Hrachyshka
     
  • The code is already involved enough to earn a separate function for
    itself. If anything, it helps arp_process readability.

    Signed-off-by: Ihar Hrachyshka
    Signed-off-by: David S. Miller

    Ihar Hrachyshka
     
  • The is_garp code deals just with gratuitous ARP packets, not every
    unsolicited packet.

    This patch is a result of a discussion in netdev:
    http://marc.info/?l=linux-netdev&m=149506354216994

    Suggested-by: Julian Anastasov
    Signed-off-by: Ihar Hrachyshka
    Signed-off-by: David S. Miller

    Ihar Hrachyshka
     
  • When tcp_disconnect() is called, inet_csk_delack_init() sets
    icsk->icsk_ack.rcv_mss to 0.
    This could potentially cause tcp_recvmsg() => tcp_cleanup_rbuf() =>
    __tcp_select_window() call path to have division by 0 issue.
    So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0.
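
    The arithmetic at risk is a division by the receive MSS when a window is
    rounded to a whole number of segments. The sketch below is a user-space
    illustration: round_window() and rcv_mss_after_disconnect() are
    hypothetical helpers, and TCP_MIN_MSS is redefined locally with the
    kernel's value of 88.

```c
/* Local copy of the kernel constant, for illustration. */
#define TCP_MIN_MSS 88U

/*
 * Hypothetical helper: round a window down to a multiple of the
 * receive MSS. The division faults if rcv_mss is 0.
 */
static unsigned int round_window(unsigned int free_space, unsigned int rcv_mss)
{
	return (free_space / rcv_mss) * rcv_mss;
}

/* Value the receive MSS is reset to on disconnect: never 0. */
static unsigned int rcv_mss_after_disconnect(void)
{
	return TCP_MIN_MSS;
}
```

    With a non-zero floor, a receive call that follows a disconnect rounds
    the window against 88 bytes instead of dividing by zero.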

    Reported-by: Andrey Konovalov
    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Wei Wang
     

18 May, 2017

1 commit

  • Since the udp memory accounting refactor, we no longer need
    to export the *udp*_queue_rcv_skb(). Make them static and fix
    a couple of sparse warnings:

    net/ipv4/udp.c:1615:5: warning: symbol 'udp_queue_rcv_skb' was not
    declared. Should it be static?
    net/ipv6/udp.c:572:5: warning: symbol 'udpv6_queue_rcv_skb' was not
    declared. Should it be static?

    Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
    Fixes: c915fe13cbaa ("udplite: fix NULL pointer dereference")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

17 May, 2017

4 commits

  • When arp_accept is 1, gratuitous ARPs are supposed to override matching
    entries irrespective of whether they arrive during locktime. This was
    implemented in commit 56022a8fdd87 ("ipv4: arp: update neighbour address
    when a gratuitous arp is received and arp_accept is set")

    There is a glitch in the patch though. RFC 2002, section 4.6, "ARP,
    Proxy ARP, and Gratuitous ARP", defines gratuitous ARPs so that they can
    be either of Request or Reply type. Those Reply gratuitous ARPs can be
    triggered with standard tooling, for example, arping -A option does just
    that.

    This patch fixes the glitch, making both Request and Reply flavours of
    gratuitous ARPs behave identically.

    As per RFC, if gratuitous ARPs are of Reply type, their Target Hardware
    Address field should also be set to the link-layer address to which this
    cache entry should be updated. The field is present in ARP over Ethernet
    but not in IEEE 1394. In this patch, I don't consider any broadcasted
    ARP replies as gratuitous if the field is not present, to conform to the
    standard. It's not clear whether there is such a thing for IEEE 1394 as
    a gratuitous ARP reply; until it's cleared up, we will ignore such
    broadcasts. Note that they will still update existing ARP cache entries,
    assuming they arrive outside the locktime interval.

    Signed-off-by: Ihar Hrachyshka
    Signed-off-by: David S. Miller

    Ihar Hrachyshka
     
  • In general, rtnetlink dumps do not anticipate failure to dump a single
    object (e.g., link or route) on a single pass. As both route and link
    objects have grown via more attributes, that is no longer a given.

    netlink dumps can handle a failure if the dump function returns an
    error; specifically, netlink_dump adds the return code to the response
    if it is len != 0). IPv6 route dumps
    (rt6_dump_route) already return the error; this patch updates IPv4 and
    link dumps. Other dump functions may need to be adjusted as well.

    Reported-by: Jan Moskyto Matejka
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • The skb->dev that is passed into ip_mr_input is
    the loX device for VRFs. When we look up a vif
    for this dev, none is found as we do not create
    vifs for loopbacks. Instead look up a vif for the
    actual device that the packet was received on,
    e.g. the vlan.

    Signed-off-by: Thomas Winter
    cc: David Ahern
    cc: Nikolay Aleksandrov
    cc: roopa
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Thomas Winter
     
  • tcp_ack() can call tcp_fragment() which may deduct the
    value of tp->fackets_out when MSS changes. When prior_fackets
    is larger than tp->fackets_out, tcp_clean_rtx_queue() can
    invoke tcp_update_reordering() with negative values. This
    results in absurd tp->reordering values higher than
    sysctl_tcp_max_reordering.

    Note that tcp_update_reordering indeed sets tp->reordering
    to min(sysctl_tcp_max_reordering, metric), but because
    the comparison is signed, a negative metric always wins.
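
    The signed-comparison pitfall is easy to reproduce in user-space C. In
    this sketch MAX_REORDERING stands in for sysctl_tcp_max_reordering (300
    is only an example value), and both helper functions are hypothetical.

```c
#include <stdint.h>

/* Stand-in for sysctl_tcp_max_reordering; 300 is an example value. */
#define MAX_REORDERING 300

/* Signed minimum, matching the comparison described above. */
static int min_signed(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Clamp with no guard: a negative metric is smaller than the limit,
 * so it wins the min() and wraps to a huge unsigned value when stored.
 */
static uint32_t reordering_unguarded(int metric)
{
	return (uint32_t)min_signed(MAX_REORDERING, metric);
}

/* Illustrative guard: keep the current value if the metric is negative. */
static uint32_t reordering_guarded(int metric, uint32_t current)
{
	if (metric < 0)
		return current;
	return (uint32_t)min_signed(MAX_REORDERING, metric);
}
```

    A metric of -1 passes through the unguarded clamp and ends up as the
    largest 32-bit value, far above the intended limit.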

    Fixes: c7caf8d3ed7a ("[TCP]: Fix reord detection due to snd_una covered holes")
    Reported-by: Rebecca Isaacs
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

16 May, 2017

1 commit

  • Pull networking fixes from David Miller:

    1) Track alignment in BPF verifier so that legitimate programs won't be
    rejected on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.

    2) Make tail calls work properly in arm64 BPF JIT, from Daniel
    Borkmann.

    3) Make the configuration and semantics of Generic XDP make more sense
    and don't allow both generic XDP and a driver specific instance to be
    active at the same time. Also from Daniel.

    4) Don't crash on resume in xen-netfront, from Vitaly Kuznetsov.

    5) Fix use-after-free in VRF driver, from Gao Feng.

    6) Use netdev_alloc_skb_ip_align() to avoid unaligned IP headers in
    qca_spi driver, from Stefan Wahren.

    7) Always run cleanup routines in BPF samples when we get SIGTERM, from
    Andy Gospodarek.

    8) The mdio phy code should bring PHYs out of reset using the shared
    GPIO lines before invoking bus->reset(). From Florian Fainelli.

    9) Some USB descriptor access endian fixes in various drivers from
    Johan Hovold.

    10) Handle PAUSE advertisements properly in mlx5 driver, from Gal
    Pressman.

    11) Fix reversed test in mlx5e_setup_tc(), from Saeed Mahameed.

    12) Cure netdev leak in AF_PACKET when using timestamping via control
    messages. From Douglas Caetano dos Santos.

    13) netcp doesn't support HWTSTAMP_FILTER_ALL, reject it. From Miroslav
    Lichvar.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
    ldmvsw: stop the clean timer at beginning of remove
    ldmvsw: unregistering netdev before disable hardware
    net: netcp: fix check of requested timestamping filter
    ipv6: avoid dad-failures for addresses with NODAD
    qed: Fix uninitialized data in aRFS infrastructure
    mdio: mux: fix device_node_continue.cocci warnings
    net/packet: fix missing net_device reference release
    net/mlx4_core: Use min3 to select number of MSI-X vectors
    macvlan: Fix performance issues with vlan tagged packets
    net: stmmac: use correct pointer when printing normal descriptor ring
    net/mlx5: Use underlay QPN from the root name space
    net/mlx5e: IPoIB, Only support regular RQ for now
    net/mlx5e: Fix setup TC ndo
    net/mlx5e: Fix ethtool pause support and advertise reporting
    net/mlx5e: Use the correct pause values for ethtool advertising
    vmxnet3: ensure that adapter is in proper state during force_close
    sfc: revert changes to NIC revision numbers
    net: ch9200: add missing USB-descriptor endianness conversions
    net: irda: irda-usb: fix firmware name on big-endian hosts
    net: dsa: mv88e6xxx: add default case to switch
    ...

    Linus Torvalds
     

12 May, 2017

1 commit

  • This patch fixes a bug in splitting an SKB during SACK
    processing. Specifically if an skb contains multiple
    packets and is only partially sacked in the higher sequences,
    tcp_match_sack_to_skb() splits the skb and marks the second fragment
    as SACKed.

    The current code further attempts rounding up the first fragment
    to MSS boundaries. But it misses a boundary condition when the
    rounded-up fragment size (pkt_len) is exactly skb size. Splitting
    such an skb is pointless and causes a kernel warning and aborts
    the SACK processing. This patch universally checks such over-split
    before calling tcp_fragment to prevent these unnecessary warnings.
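
    The boundary condition can be shown with a small user-space sketch.
    split_point() is a hypothetical helper, simplified from the description
    above: it rounds the first fragment up to an MSS boundary and reports 0
    when the split would cover the whole skb. It assumes mss is non-zero.

```c
/*
 * Hypothetical helper: return the length of the first fragment after
 * rounding it up to an MSS boundary, or 0 if no split should be done
 * because the rounded length reaches or exceeds the skb length.
 */
static unsigned int split_point(unsigned int pkt_len, unsigned int mss,
				unsigned int skb_len)
{
	unsigned int rem = pkt_len % mss;

	if (rem)
		pkt_len += mss - rem; /* round up to the next MSS boundary */

	if (pkt_len >= skb_len)
		return 0; /* over-split: nothing would remain in the tail */

	return pkt_len;
}
```

    For a 2000-byte skb with a 1000-byte MSS, a split requested at byte 1500
    rounds up to 2000, which is the whole skb, so no split is attempted.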

    Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

10 May, 2017

2 commits

  • Pull networking fixes from David Miller:

    1) Fix multiqueue in stmmac driver on PCI, from Andy Shevchenko.

    2) cdc_ncm doesn't actually fully zero out the padding area it
    allocates on TX, from Jim Baxter.

    3) Don't leak map addresses in BPF verifier, from Daniel Borkmann.

    4) If we randomize TCP timestamps, we have to do it everywhere
    including SYN cookies. From Eric Dumazet.

    5) Fix "ethtool -S" crash in aquantia driver, from Pavel Belous.

    6) Fix allocation size for ntp filter bitmap in bnxt_en driver, from
    Dan Carpenter.

    7) Add missing memory allocation return value check to DSA loop driver,
    from Christophe Jaillet.

    8) Fix XDP leak on driver unload in qed driver, from Suddarsana Reddy
    Kalluru.

    9) Don't inherit MC list from parent inet connection sockets, another
    syzkaller spotted gem. Fix from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
    dccp/tcp: do not inherit mc_list from parent
    qede: Split PF/VF ndos.
    qed: Correct doorbell configuration for !4Kb pages
    qed: Tell QM the number of tasks
    qed: Fix VF removal sequence
    qede: Fix XDP memory leak on unload
    net/mlx4_core: Reduce harmless SRIOV error message to debug level
    net/mlx4_en: Avoid adding steering rules with invalid ring
    net/mlx4_en: Change the error print to debug print
    drivers: net: wimax: i2400m: i2400m-usb: Use time_after for time comparison
    DECnet: Use container_of() for embedded struct
    Revert "ipv4: restore rt->fi for reference counting"
    net: mdio-mux: bcm-iproc: call mdiobus_free() in error path
    net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control
    ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf
    net: cdc_ncm: Fix TX zero padding
    stmmac: pci: split out common_default_data() helper
    stmmac: pci: RX queue routing configuration
    stmmac: pci: TX and RX queue priority configuration
    stmmac: pci: set default number of rx and tx queues
    ...

    Linus Torvalds
     
  • syzkaller found a way to trigger double frees from ip_mc_drop_socket()

    It turns out that we leave a copy of parent mc_list at accept() time,
    which is very bad.

    Very similar to commit 8b485ce69876 ("tcp: do not inherit
    fastopen_req from parent")

    Initial report from Pray3r, completed by Andrey's one.
    Thanks a lot to them !

    Signed-off-by: Eric Dumazet
    Reported-by: Pray3r
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 May, 2017

5 commits

  • This reverts commit 82486aa6f1b9bc8145e6d0fa2bc0b44307f3b875.

    As implemented, this causes dangling netdevice refs.

    Reported-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     
  • There are many code paths opencoding kvmalloc. Let's use the helper
    instead. The main difference to kvmalloc is that those users are
    usually not considering all the aspects of the memory allocator. E.g.
    allocation requests
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdim
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     
  • Congestion control modules that want full control over congestion
    control behavior do not want the cwnd modifications controlled by
    the sysctl_tcp_slow_start_after_idle code path.
    So skip those code paths for CC modules that use the cong_control()
    API.
    As an example, those cwnd effects are not desired for the BBR congestion
    control algorithm.

    Fixes: c0402760f565 ("tcp: new CC hook to set sending rate with rate_sample in any CA state")
    Signed-off-by: Wei Wang
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Wei Wang
     
  • IPv4 dst could use fi->fib_metrics to store metrics but fib_info
    itself is refcnt'ed, so without taking a refcnt fi and
    fi->fib_metrics could be freed while dst metrics still points to
    it. This triggers use-after-free as reported by Andrey twice.

    This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
    restore this reference counting. It is a quick fix for -net and
    -stable, for -net-next, as Eric suggested, we can consider doing
    reference counting for metrics itself instead of relying on fib_info.

    IPv6 is very different, it copies or steals the metrics from mx6_config
    in fib6_commit_metrics() so probably doesn't need a refcnt.

    Decnet has already done the refcnt'ing, see dn_fib_semantic_match().

    Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: Cong Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    WANG Cong
     

06 May, 2017

1 commit

  • The whole point of randomization was to hide server uptime, but an
    attacker can simply start a syn flood and TCP generates 'old style'
    timestamps, directly revealing server jiffies value.

    Also, TSval sent by the server to a particular remote address varies
    depending on syncookies being sent or not, potentially triggering PAWS
    drops for innocent clients.

    Let's implement proper randomization, including for SYNcookies.

    Also we do not need to export sysctl_tcp_timestamps, since it is not
    used from a module.

    In v2, I added Florian's feedback and contribution, adding tsoff to
    tcp_get_cookie_sock().

    v3 removed one unused variable in tcp_v4_connect() as Florian spotted.

    Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
    Signed-off-by: Eric Dumazet
    Reviewed-by: Florian Westphal
    Tested-by: Florian Westphal
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 May, 2017

3 commits

  • raw_send_hdrinc() and rawv6_send_hdrinc() expect that the buffer copied
    from the userspace contains the IPv4/IPv6 header, so if too few bytes are
    copied, parts of the header may remain uninitialized.
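
    A length check of the kind this implies can be sketched as follows. The
    helper hdrincl_check_len() is hypothetical, the header sizes are the
    fixed IPv4 (no options) and IPv6 header lengths, and returning -EINVAL
    is an assumption made for the example.

```c
#include <errno.h>
#include <stddef.h>

#define IPV4_MIN_HDR_LEN 20 /* IPv4 header without options */
#define IPV6_HDR_LEN     40 /* fixed IPv6 header */

/*
 * Hypothetical check: reject a user buffer too short to contain the
 * IP header that the header-included send path expects to find in it.
 */
static int hdrincl_check_len(size_t length, size_t hdr_len)
{
	if (length < hdr_len)
		return -EINVAL;
	return 0;
}
```

    The zero-length sendto() from the report would fail this check before
    any header field is read from the freshly allocated buffer.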

    This bug has been detected with KMSAN.

    For the record, the KMSAN report:

    ==================================================================
    BUG: KMSAN: use of unitialized memory in nf_ct_frag6_gather+0xf5a/0x44a0
    inter: 0
    CPU: 0 PID: 1036 Comm: probe Not tainted 4.11.0-rc5+ #2455
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16
    dump_stack+0x143/0x1b0 lib/dump_stack.c:52
    kmsan_report+0x16b/0x1e0 mm/kmsan/kmsan.c:1078
    __kmsan_warning_32+0x5c/0xa0 mm/kmsan/kmsan_instr.c:510
    nf_ct_frag6_gather+0xf5a/0x44a0 net/ipv6/netfilter/nf_conntrack_reasm.c:577
    ipv6_defrag+0x1d9/0x280 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
    nf_hook_entry_hookfn ./include/linux/netfilter.h:102
    nf_hook_slow+0x13f/0x3c0 net/netfilter/core.c:310
    nf_hook ./include/linux/netfilter.h:212
    NF_HOOK ./include/linux/netfilter.h:255
    rawv6_send_hdrinc net/ipv6/raw.c:673
    rawv6_sendmsg+0x2fcb/0x41a0 net/ipv6/raw.c:919
    inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762
    sock_sendmsg_nosec net/socket.c:633
    sock_sendmsg net/socket.c:643
    SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696
    SyS_sendto+0xbc/0xe0 net/socket.c:1664
    do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285
    entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246
    RIP: 0033:0x436e03
    RSP: 002b:00007ffce48baf38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000436e03
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
    RBP: 00007ffce48baf90 R08: 00007ffce48baf50 R09: 000000000000001c
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000401790 R14: 0000000000401820 R15: 0000000000000000
    origin: 00000000d9400053
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:362
    kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:257
    kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:270
    slab_alloc_node mm/slub.c:2735
    __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4341
    __kmalloc_reserve net/core/skbuff.c:138
    __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231
    alloc_skb ./include/linux/skbuff.h:933
    alloc_skb_with_frags+0x209/0xbc0 net/core/skbuff.c:4678
    sock_alloc_send_pskb+0x9ff/0xe00 net/core/sock.c:1903
    sock_alloc_send_skb+0xe4/0x100 net/core/sock.c:1920
    rawv6_send_hdrinc net/ipv6/raw.c:638
    rawv6_sendmsg+0x2918/0x41a0 net/ipv6/raw.c:919
    inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762
    sock_sendmsg_nosec net/socket.c:633
    sock_sendmsg net/socket.c:643
    SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696
    SyS_sendto+0xbc/0xe0 net/socket.c:1664
    do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285
    return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246
    ==================================================================

    , triggered by the following syscalls:
    socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
    sendto(3, NULL, 0, 0, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "ff00::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EPERM

    A similar report is triggered in net/ipv4/raw.c if we use a PF_INET socket
    instead of a PF_INET6 one.

    Signed-off-by: Alexander Potapenko
    Signed-off-by: David S. Miller

    Alexander Potapenko
     
  • Under fuzzer stress, it is possible that a child gets a non NULL
    fastopen_req pointer from its parent at accept() time, when/if parent
    morphs from listener to active session.

    We need to make sure this can not happen, by clearing the field after
    socket cloning.
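
    The pattern of clearing an owned pointer after a byte-wise clone can be
    sketched in user-space C. struct sock_sketch, struct fastopen_req_sketch
    and clone_sock() are hypothetical stand-ins, not the kernel structures.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the request and socket structures. */
struct fastopen_req_sketch {
	int cookie_len;
};

struct sock_sketch {
	int state;
	struct fastopen_req_sketch *fastopen_req; /* owned by one socket only */
};

/*
 * Clone a socket by copying it byte for byte, then clear the owned
 * pointer so parent and child never both try to free the same request.
 */
static struct sock_sketch *clone_sock(const struct sock_sketch *parent)
{
	struct sock_sketch *child = malloc(sizeof(*child));

	if (!child)
		return NULL;
	memcpy(child, parent, sizeof(*child));
	child->fastopen_req = NULL;
	return child;
}
```

    Without the final assignment, both sockets would hold the same pointer,
    and tearing down both would free it twice, as in the report below.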

    BUG: Double free or freeing an invalid pointer
    Unexpected shadow byte: 0xFB
    CPU: 3 PID: 20933 Comm: syz-executor3 Not tainted 4.11.0+ #306
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
    01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x292/0x395 lib/dump_stack.c:52
    kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
    kasan_report_double_free+0x5c/0x70 mm/kasan/report.c:185
    kasan_slab_free+0x9d/0xc0 mm/kasan/kasan.c:580
    slab_free_hook mm/slub.c:1357 [inline]
    slab_free_freelist_hook mm/slub.c:1379 [inline]
    slab_free mm/slub.c:2961 [inline]
    kfree+0xe8/0x2b0 mm/slub.c:3882
    tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline]
    tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328
    inet_child_forget+0xb8/0x600 net/ipv4/inet_connection_sock.c:898
    inet_csk_reqsk_queue_add+0x1e7/0x250
    net/ipv4/inet_connection_sock.c:928
    tcp_get_cookie_sock+0x21a/0x510 net/ipv4/syncookies.c:217
    cookie_v4_check+0x1a19/0x28b0 net/ipv4/syncookies.c:384
    tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1384 [inline]
    tcp_v4_do_rcv+0x731/0x940 net/ipv4/tcp_ipv4.c:1421
    tcp_v4_rcv+0x2dc0/0x31c0 net/ipv4/tcp_ipv4.c:1715
    ip_local_deliver_finish+0x4cc/0xc20 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ip_local_deliver+0x1ce/0x700 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:492 [inline]
    ip_rcv_finish+0xb1d/0x20b0 net/ipv4/ip_input.c:396
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ip_rcv+0xd8c/0x19c0 net/ipv4/ip_input.c:487
    __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4210
    __netif_receive_skb+0x2a/0x1a0 net/core/dev.c:4248
    process_backlog+0xe5/0x6c0 net/core/dev.c:4868
    napi_poll net/core/dev.c:5270 [inline]
    net_rx_action+0xe70/0x18e0 net/core/dev.c:5335
    __do_softirq+0x2fb/0xb99 kernel/softirq.c:284
    do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:899

    do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
    do_softirq kernel/softirq.c:176 [inline]
    __local_bh_enable_ip+0x1cf/0x1e0 kernel/softirq.c:181
    local_bh_enable include/linux/bottom_half.h:31 [inline]
    rcu_read_unlock_bh include/linux/rcupdate.h:931 [inline]
    ip_finish_output2+0x9ab/0x15e0 net/ipv4/ip_output.c:230
    ip_finish_output+0xa35/0xdf0 net/ipv4/ip_output.c:316
    NF_HOOK_COND include/linux/netfilter.h:246 [inline]
    ip_output+0x1f6/0x7b0 net/ipv4/ip_output.c:404
    dst_output include/net/dst.h:486 [inline]
    ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
    ip_queue_xmit+0x9a8/0x1a10 net/ipv4/ip_output.c:503
    tcp_transmit_skb+0x1ade/0x3470 net/ipv4/tcp_output.c:1057
    tcp_write_xmit+0x79e/0x55b0 net/ipv4/tcp_output.c:2265
    __tcp_push_pending_frames+0xfa/0x3a0 net/ipv4/tcp_output.c:2450
    tcp_push+0x4ee/0x780 net/ipv4/tcp.c:683
    tcp_sendmsg+0x128d/0x39b0 net/ipv4/tcp.c:1342
    inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    SYSC_sendto+0x660/0x810 net/socket.c:1696
    SyS_sendto+0x40/0x50 net/socket.c:1664
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x446059
    RSP: 002b:00007faa6761fb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 0000000000446059
    RDX: 0000000000000001 RSI: 0000000020ba3fcd RDI: 0000000000000017
    RBP: 00000000006e40a0 R08: 0000000020ba4ff0 R09: 0000000000000010
    R10: 0000000020000000 R11: 0000000000000282 R12: 0000000000708150
    R13: 0000000000000000 R14: 00007faa676209c0 R15: 00007faa67620700
    Object at ffff88003b5bbcb8, in cache kmalloc-64 size: 64
    Allocated:
    PID = 20909
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:513
    set_track mm/kasan/kasan.c:525 [inline]
    kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
    kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2745
    kmalloc include/linux/slab.h:490 [inline]
    kzalloc include/linux/slab.h:663 [inline]
    tcp_sendmsg_fastopen net/ipv4/tcp.c:1094 [inline]
    tcp_sendmsg+0x221a/0x39b0 net/ipv4/tcp.c:1139
    inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    SYSC_sendto+0x660/0x810 net/socket.c:1696
    SyS_sendto+0x40/0x50 net/socket.c:1664
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    Freed:
    PID = 20909
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:513
    set_track mm/kasan/kasan.c:525 [inline]
    kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
    slab_free_hook mm/slub.c:1357 [inline]
    slab_free_freelist_hook mm/slub.c:1379 [inline]
    slab_free mm/slub.c:2961 [inline]
    kfree+0xe8/0x2b0 mm/slub.c:3882
    tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline]
    tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328
    __inet_stream_connect+0x20c/0xf90 net/ipv4/af_inet.c:593
    tcp_sendmsg_fastopen net/ipv4/tcp.c:1111 [inline]
    tcp_sendmsg+0x23a8/0x39b0 net/ipv4/tcp.c:1139
    inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    SYSC_sendto+0x660/0x810 net/socket.c:1696
    SyS_sendto+0x40/0x50 net/socket.c:1664
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 7db92362d2fe ("tcp: fix potential double free issue for fastopen_req")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Acked-by: Wei Wang
    Signed-off-by: David S. Miller

    Eric Dumazet
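
The remedy described above (clear the field after cloning) can be illustrated outside the kernel. The following is a minimal sketch, assuming a byte-for-byte clone of a structure that owns a heap pointer; `struct sock_sketch`, `struct fastopen_req_sketch`, `sock_clone_sketch()` and `sock_release_req()` are hypothetical names for illustration, not the kernel's API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a request object owned by a socket. */
struct fastopen_req_sketch {
    int cookie_len;
};

/* Hypothetical socket-like structure that owns `req`. */
struct sock_sketch {
    int state;
    struct fastopen_req_sketch *req; /* owned: freed by sock_release_req() */
};

/* Clone the parent byte for byte, then drop the inherited owned pointer.
 * Without the reset, parent and child would both free the same object. */
static struct sock_sketch *sock_clone_sketch(const struct sock_sketch *parent)
{
    struct sock_sketch *child = malloc(sizeof(*child));

    if (!child)
        return NULL;
    memcpy(child, parent, sizeof(*child));
    child->req = NULL; /* the child must not inherit ownership */
    return child;
}

/* Free the owned request, if any, and make a second call harmless. */
static void sock_release_req(struct sock_sketch *sk)
{
    free(sk->req);
    sk->req = NULL;
}
```

With the reset in place, releasing the request on both the parent and the clone frees the object exactly once.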
     
  • Locally generated TCP packets are usually cloned, so we
    do skb_cow_data() on these packets. After that we need to
    reload the pointer to the esp header. On udpencap this
    header has an offset to skb_transport_header, so take this
    offset into account.

    Fixes: 67d349ed603 ("net/esp4: Fix invalid esph pointer crash")
    Fixes: fca11ebde3f0 ("esp4: Reorganize esp_output")
    Reported-by: Don Bowman
    Signed-off-by: Steffen Klassert

    Steffen Klassert
     

03 May, 2017

2 commits

  • Pull networking updates from David Miller:
    "Here are some highlights from the 2065 networking commits that
    happened this development cycle:

    1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

    2) Add a generic XDP driver, so that anyone can test XDP even if they
    lack a networking device whose driver has explicit XDP support
    (me).

    3) Sparc64 now has an eBPF JIT too (me)

    4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
    Starovoitov)

    5) Make netfilter network namespace teardown less expensive (Florian
    Westphal)

    6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

    7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

    8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

    9) Multiqueue support in stmmac driver (Joao Pinto)

    10) Remove TCP timewait recycling, it never really could possibly work
    well in the real world and timestamp randomization really zaps any
    hint of usability this feature had (Soheil Hassas Yeganeh)

    11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
    Aleksandrov)

    12) Add socket busy poll support to epoll (Sridhar Samudrala)

    13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
    and several others)

    14) IPSEC hw offload infrastructure (Steffen Klassert)"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
    tipc: refactor function tipc_sk_recv_stream()
    tipc: refactor function tipc_sk_recvmsg()
    net: thunderx: Optimize page recycling for XDP
    net: thunderx: Support for XDP header adjustment
    net: thunderx: Add support for XDP_TX
    net: thunderx: Add support for XDP_DROP
    net: thunderx: Add basic XDP support
    net: thunderx: Cleanup receive buffer allocation
    net: thunderx: Optimize CQE_TX handling
    net: thunderx: Optimize RBDR descriptor handling
    net: thunderx: Support for page recycling
    ipx: call ipxitf_put() in ioctl error path
    net: sched: add helpers to handle extended actions
    qed*: Fix issues in the ptp filter config implementation.
    qede: Fix concurrency issue in PTP Tx path processing.
    stmmac: Add support for SIMATIC IOT2000 platform
    net: hns: fix ethtool_get_strings overflow in hns driver
    tcp: fix wraparound issue in tcp_lp
    bpf, arm64: fix jit branch offset related to ldimm64
    bpf, arm64: implement jiting of BPF_XADD
    ...

    Linus Torvalds
     
  • Be careful when comparing tcp_time_stamp to some u32 quantity,
    otherwise the result can be surprising.

    Fixes: 7c106d7e782b ("[TCP]: TCP Low Priority congestion control")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
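
The surprise mentioned above comes from 32-bit timestamps wrapping around. The following is a minimal sketch of wraparound-safe comparisons on unsigned 32-bit tick counters; `ts_before()` and `ts_within()` are hypothetical helpers written for illustration, not the code changed by this commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if timestamp `a` is earlier than `b`, modulo 2^32.
 * Valid while the two values are less than 2^31 ticks apart. */
static bool ts_before(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(a - b) < 0;
}

/* True if fewer than `window` ticks have elapsed since `start`.
 * The subtraction is done in unsigned 32-bit arithmetic, so it stays
 * correct when `now` has wrapped past zero and `start` has not. */
static bool ts_within(uint32_t now, uint32_t start, uint32_t window)
{
    return (uint32_t)(now - start) < window;
}
```

A plain `a < b` on the raw values gives the wrong answer once one of the two timestamps has wrapped; taking the difference first keeps the comparison meaningful.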
     

02 May, 2017

1 commit

  • Both esp_output and esp_xmit take a pointer to the ESP header
    and place it in esp_info struct prior to calling esp_output_head.

    Inside esp_output_head, the call to esp_output_udp_encap
    makes sure to update the pointer if it gets invalid.
    However, if esp_output_head itself calls skb_cow_data, the
    pointer is not updated and stays invalid, causing a crash
    after esp_output_head returns.

    Update the pointer if it becomes invalid in esp_output_head.

    Fixes: fca11ebde3f0 ("esp4: Reorganize esp_output")
    Signed-off-by: Ilan Tayari
    Signed-off-by: David S. Miller

    Ilan Tayari
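
The failure mode described above is a pointer into a buffer that outlives a copy-on-write of that buffer. The following is a minimal sketch of the usual remedy: remember the header as an offset and recompute the pointer after the data has moved. `struct buf`, `buf_cow()` and `header_after_cow()` are hypothetical stand-ins, not the skb API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical packet buffer: `data` may be reallocated by buf_cow(). */
struct buf {
    unsigned char *data;
    size_t len;
};

/* Stand-in for a copy-on-write step: the payload moves to new storage,
 * so every pointer into the old storage becomes invalid. */
static int buf_cow(struct buf *b)
{
    unsigned char *copy = malloc(b->len);

    if (!copy)
        return -1;
    memcpy(copy, b->data, b->len);
    free(b->data);
    b->data = copy;
    return 0;
}

/* Record the header position as an offset before the copy, then
 * rebuild the pointer from the new base afterwards. */
static unsigned char *header_after_cow(struct buf *b, const unsigned char *hdr)
{
    size_t off = (size_t)(hdr - b->data);

    if (buf_cow(b) < 0)
        return NULL;
    return b->data + off;
}
```

The same idea covers an encapsulation offset: whatever distance separates the header from the start of the data is captured in `off` and survives the move.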
     

01 May, 2017

4 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. A large bunch of code cleanups, simplify the conntrack extension
    codebase, get rid of the fake conntrack object, speed up netns by
    selective synchronize_net() calls. More specifically, they are:

    1) Check for ct->status bit instead of using nfct_nat() from IPVS and
    Netfilter codebase, patch from Florian Westphal.

    2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.

    3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.

    4) Introduce nft_is_base_chain() helper function.

    5) Enforce expectation limit from userspace conntrack helper,
    from Gao Feng.

    6) Add nf_ct_remove_expect() helper function, from Gao Feng.

    7) NAT mangle helper function return boolean, from Gao Feng.

    8) ctnetlink_alloc_expect() should only work for conntrack with
    helpers, from Gao Feng.

    9) Add nfnl_msg_type() helper function to nfnetlink to build the
    netlink message type.

    10) Get rid of unnecessary cast on void, from simran singhal.

    11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
    also from simran singhal.

    12) Use list_prev_entry() from nf_tables, from simran singhal.

    13) Remove unnecessary & on pointer function in the Netfilter and IPVS
    code.

    14) Remove obsolete comment on set of rules per CPU in ip6_tables,
    no longer true. From Arushi Singhal.

    15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.

    16) Remove unnecessary nested rcu_read_lock() in
    __nf_nat_decode_session(). Code running from hooks is already
    guaranteed to run under RCU read side.

    17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.

    18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
    also from Aaron.

    19) Get rid of unused __ip_set_get_netlink(), from Aaron Conole.

    20) Don't propagate NF_DROP error to userspace via ctnetlink in
    __nf_nat_alloc_null_binding() function, from Gao Feng.

    21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
    from Gao Feng.

    22) Kill the fake untracked conntrack objects, use ctinfo instead to
    annotate that a conntrack object is untracked, from Florian Westphal.

    23) Remove nf_ct_is_untracked(), now obsolete since we have no
    conntrack template anymore, from Florian.

    24) Add event mask support to nft_ct, also from Florian.

    25) Move nf_conn_help structure to
    include/net/netfilter/nf_conntrack_helper.h.

    26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
    Thus, we don't deal with variable conntrack extensions anymore.
    Make sure userspace conntrack helper doesn't go over that size.
    Remove variable size ct extension infrastructure now this code
    got no more clients. From Florian Westphal.

    27) Restore offset and length of nf_ct_ext structure to 8 bytes now
    that wraparound is not possible any longer, also from Florian.

    28) Allow to get rid of unassured flows under stress in conntrack,
    this applies to DCCP, SCTP and TCP protocols, from Florian.

    29) Shrink size of nf_conntrack_ecache structure, from Florian.

    30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
    from Gao Feng.

    31) Register SYNPROXY hooks on demand, from Florian Westphal.

    32) Use pernet hook whenever possible, instead of global hook
    registration, from Florian Westphal.

    33) Pass hook structure to ebt_register_table() to consolidate some
    infrastructure code, from Florian Westphal.

    34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
    SYNPROXY code, to make sure device stats are not fooled, patch
    from Gao Feng.

    35) Remove NF_CT_EXT_F_PREALLOC; this kills quite some code that we
    don't need anymore if we just select a fixed size instead of an
    expensive runtime calculation of this. From Florian.

    36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
    from Florian.

    37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
    Florian.

    38) Attach NAT extension on-demand from masquerade and pptp helper
    path, from Florian.

    39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.

    40) Speed up netns by selective calls of synchronize_net(), from
    Florian Westphal.

    41) Silence gcc stack size warning on 32-bit arch in snmp helper,
    from Florian.

    42) Unconditionally call nf_ct_ext_destroy(), even if we have no
    extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
    Liping Zhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • net/ipv4/netfilter/nf_nat_snmp_basic.c:1158:1: warning: the frame size
    of 1160 bytes is larger than 1024 bytes

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
    we always take RTNL lock for ip_ra_control() which is the only place
    we update the list ip_ra_chain, so the ip_ra_lock is no longer needed.

    As Eric points out, BH does not need to be disabled either, RCU readers
    don't care.

    Signed-off-by: Cong Wang
    Acked-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    WANG Cong
     
  • Avoid direct access to sk->sk_state when tcp_poll() is called on a socket
    using active TCP fastopen with deferred connect. Use local variable
    'state', which stores the result of sk_state_load(), like it was done in
    commit 00fd38d938db ("tcp: ensure proper barriers in lockless contexts").

    Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
    Signed-off-by: Davide Caratti
    Acked-by: Wei Wang
    Signed-off-by: David S. Miller

    Davide Caratti
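
The pattern described above is to read a concurrently updated state once, with proper ordering, and base every later decision on that local copy. The following is a minimal user-space sketch, assuming a C11 acquire load as a stand-in for sk_state_load(); `struct conn`, the `ST_*` states, the `EV_*` flags and `conn_poll_mask()` are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical connection states and event flags for illustration. */
enum { ST_CLOSE = 0, ST_SYN_SENT = 1, ST_ESTABLISHED = 2 };

#define EV_IN  0x1u
#define EV_OUT 0x4u

struct conn {
    _Atomic int state; /* written by another context without a lock */
};

/* Load the state exactly once and use only the local copy below.
 * Re-reading c->state for each test could observe two different values
 * within the same call and produce an inconsistent mask. */
static unsigned int conn_poll_mask(struct conn *c)
{
    int state = atomic_load_explicit(&c->state, memory_order_acquire);
    unsigned int mask = 0;

    if (state != ST_CLOSE && state != ST_SYN_SENT)
        mask |= EV_IN | EV_OUT;
    return mask;
}
```

Every branch in the function tests `state`, never `c->state`, so the result always reflects a single observed value.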
     

29 Apr, 2017

2 commits

  • Andrey found a way to trigger the WARN_ON_ONCE(delta < len) in
    skb_try_coalesce() using syzkaller and a filter attached to a TCP
    socket over loopback interface.

    I believe one issue with looped skbs is that tcp_trim_head() can end up
    producing an skb with an underestimated truesize.

    It hardly matters for normal conditions, since packets sent over
    loopback are never truncated.

    Bytes trimmed from skb->head should not change skb truesize, since
    skb->head is not reallocated.

    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Upper layer GRO handlers can not handle IP fragments, so
    exit GRO processing in this case.

    This fixes ESP GRO because the packet must be reassembled
    before we can decapsulate, otherwise we get authentication
    failures.

    It also aligns IPv4 to IPv6 where packets with fragmentation
    headers are not passed to upper layer GRO handlers.

    Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert
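
The check described above amounts to testing the IPv4 fragmentation fields before handing a packet to an upper-layer handler. The following is a minimal sketch operating on a raw header in network byte order; `ipv4_is_fragment()` and the `IPV4_*` constants are hypothetical names, while the bit layout (More Fragments flag 0x2000, 13-bit fragment offset) follows the IPv4 header format.

```c
#include <stdbool.h>
#include <stdint.h>

#define IPV4_MF      0x2000u /* More Fragments flag */
#define IPV4_OFFMASK 0x1FFFu /* 13-bit fragment offset */

/* `hdr` must point to at least 20 bytes of IPv4 header.
 * Bytes 6 and 7 hold the flags and fragment offset, big-endian.
 * A packet is a fragment if MF is set or the offset is non-zero;
 * the Don't Fragment bit (0x4000) alone does not make it one. */
static bool ipv4_is_fragment(const uint8_t *hdr)
{
    uint16_t frag = (uint16_t)((hdr[6] << 8) | hdr[7]);

    return (frag & (IPV4_MF | IPV4_OFFMASK)) != 0;
}
```

A receive path following this commit's approach would skip aggregation whenever such a test returns true and let reassembly happen first.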