14 Nov, 2016

3 commits

  • With syzkaller's help, Marco Grassi found a bug in the TCP stack,
    crashing in tcp_collapse().

    Root cause is that sk_filter() can truncate the incoming skb,
    but TCP stack was not really expecting this to happen.
    It probably was expecting a simple DROP or ACCEPT behavior.

    We first need to make sure no part of TCP header could be removed.
    Then we need to adjust TCP_SKB_CB(skb)->end_seq

    Many thanks to syzkaller team and Marco for giving us a reproducer.

    Signed-off-by: Eric Dumazet
    Reported-by: Marco Grassi
    Reported-by: Vladis Dronov
    Signed-off-by: David S. Miller

    Eric Dumazet
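
The two steps described above can be sketched in user space as follows. This is a simplified model, not the kernel patch: the `Skb` class and `tcp_filter_model()` are hypothetical names, and SYN/FIN flags are ignored.

```python
from dataclasses import dataclass

SEQ_MASK = 0xFFFFFFFF  # TCP sequence numbers wrap at 2**32


@dataclass
class Skb:
    length: int      # bytes in the buffer: TCP header + payload
    header_len: int  # TCP header length in bytes
    end_seq: int     # sequence number just past the last payload byte


def tcp_filter_model(skb: Skb, keep: int) -> bool:
    """Apply a filter verdict that keeps at most `keep` bytes (0 means drop).

    Returns False if the segment is dropped, True if it is accepted.
    """
    if keep == 0:
        return False
    # Step 1: never let the filter remove part of the TCP header.
    new_len = min(skb.length, max(keep, skb.header_len))
    trimmed = skb.length - new_len
    skb.length = new_len
    # Step 2: the trimmed payload bytes are gone, so end_seq must shrink too.
    skb.end_seq = (skb.end_seq - trimmed) & SEQ_MASK
    return True
```

In this model, a 120-byte segment with a 20-byte header that the filter truncates to 50 bytes loses 70 payload bytes, and its end_seq moves back by the same amount.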
     
  • In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
    and then the state of the neigh for the new_gw is checked. If the state
    isn't valid then the redirected route is deleted. This behavior is
    maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
    is assigned to peer->redirect_learned.a4 before calling
    ipv4_neigh_lookup().

    After commit 5943634fc559 ("ipv4: Maintain redirect and PMTU info in
    struct rtable again."), ipv4_neigh_lookup() is performed without the
    rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
    isn't zero, the function uses it as the key. The neigh is most likely
    valid since the old_gw is the one that sends the ICMP redirect message.
    Then the new_gw is assigned to fib_nh_exception. The problem is: the
    new_gw ARP may never get resolved and the traffic is blackholed.

    So, use the new_gw for neigh lookup.

    Changes from v1:
    - use __ipv4_neigh_lookup instead (per Eric Dumazet).

    Fixes: 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
    Signed-off-by: Stephen Suryaputra Lin
    Signed-off-by: David S. Miller

    Stephen Suryaputra Lin
     
  • If usb_submit_urb() called from the open function fails, the following
    crash may be observed.

    r8152 8-1:1.0 eth0: intr_urb submit failed: -19
    ...
    r8152 8-1:1.0 eth0: v1.08.3
    Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b
    pgd = ffffffc0e7305000
    [6b6b6b6b6b6b6b7b] *pgd=0000000000000000, *pud=0000000000000000
    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    ...
    PC is at notifier_chain_register+0x2c/0x58
    LR is at blocking_notifier_chain_register+0x54/0x70
    ...
    Call trace:
    notifier_chain_register+0x2c/0x58
    blocking_notifier_chain_register+0x54/0x70
    register_pm_notifier+0x24/0x2c
    rtl8152_open+0x3dc/0x3f8 [r8152]
    __dev_open+0xac/0x104
    __dev_change_flags+0xb0/0x148
    dev_change_flags+0x34/0x70
    do_setlink+0x2c8/0x888
    rtnl_newlink+0x328/0x644
    rtnetlink_rcv_msg+0x1a8/0x1d4
    netlink_rcv_skb+0x68/0xd0
    rtnetlink_rcv+0x2c/0x3c
    netlink_unicast+0x16c/0x234
    netlink_sendmsg+0x340/0x364
    sock_sendmsg+0x48/0x60
    SyS_sendto+0xe0/0x120
    SyS_send+0x40/0x4c
    el0_svc_naked+0x24/0x28

    Clean up error handling to avoid registering the notifier if the open
    function is going to fail.

    Signed-off-by: Guenter Roeck
    Signed-off-by: David S. Miller

    Guenter Roeck
     

13 Nov, 2016

5 commits

  • __LINUX_IF_ETHER_H is not defined anywhere, and if_ether.h can keep itself from
    double inclusion, though it uses a single underscore prefix.

    Signed-off-by: Baruch Siach
    Signed-off-by: David S. Miller

    Baruch Siach
     
  • After Tom's patch, the thoff field could point past the end of the buffer,
    which could fool some callers.

    If an skb was provided, skb->len should be the upper limit.
    If not, hlen is supposed to be the upper limit.

    Fixes: a6e544b0a88b ("flow_dissector: Jump to exit code in __skb_flow_dissect")
    Signed-off-by: Eric Dumazet
    Reported-by: Yibin Yang
    Acked-by: Willem de Bruijn
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
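
The upper-limit rule stated above amounts to a single clamp. A minimal sketch, where `clamp_thoff()` is a hypothetical helper name and `skb_len=None` stands for "no skb was provided":

```python
from typing import Optional


def clamp_thoff(nhoff: int, skb_len: Optional[int], hlen: int) -> int:
    """Return a transport-header offset that never exceeds the buffer.

    skb_len is None when the caller passed raw data instead of an skb;
    in that case hlen is the upper limit.
    """
    limit = skb_len if skb_len is not None else hlen
    return min(nhoff, limit)
```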
     
  • Martin KaFai Lau says:

    ====================
    bpf: Fix bpf_redirect to an ipip/ip6tnl dev

    This patch set fixes a bug in bpf_redirect(dev, flags) when dev is an
    ipip/ip6tnl. The current problem is IP-EthHdr-IP is sent out instead of
    IP-IP.

    Patch 1 adds a dev->type test similar to dev_is_mac_header_xmit()
    in act_mirred.c, which is only available in net-next. We can consider
    refactoring it once this patch is pulled into net-next from net.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The test creates two netns, ns1 and ns2. The host (the default netns)
    has an ipip or ip6tnl dev configured for tunneling traffic to the ns2.

    ping VIPS from ns1 host ns2 (VIPs at loopback)

    The test is to have ns1 pinging VIPs configured at the loopback
    interface in ns2.

    The VIPs are 10.10.1.102 and 2401:face::66 (which are configured
    at lo@ns2). [Note: 0x66 => 102].

    At ns1, the VIPs are routed _via_ the host.

    At the host, bpf programs are installed at the veth to redirect packets
    from a veth to the ipip/ip6tnl. The test is configured in a way so
    that both ingress and egress can be tested.

    At ns2, the ipip/ip6tnl dev is configured with the local and remote address
    specified. The return path is routed to the dev ipip/ip6tnl.

    During egress test, the host also locally tests pinging the VIPs to ensure
    that bpf_redirect at egress also works for the direct egress (i.e. not
    forwarding from dev ve1 to ve2).

    Acked-by: Alexei Starovoitov
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • If the bpf program calls bpf_redirect(dev, 0) and dev is
    an ipip/ip6tnl, it currently includes the mac header.
    e.g. If dev is ipip, the end result is IP-EthHdr-IP instead
    of IP-IP.

    The fix is to pull the mac header. At ingress, skb_postpull_rcsum()
    is not needed because the ethhdr should have been pulled once already
    and then got pushed back just before calling the bpf_prog.
    At egress, this patch calls skb_postpull_rcsum().

    If bpf_redirect(dev, BPF_F_INGRESS) is called,
    it also fails now because it calls dev_forward_skb() which
    eventually calls eth_type_trans(skb, dev). The eth_type_trans()
    will set skb->pkt_type = PACKET_OTHERHOST because the mac address
    does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST
    will eventually cause ip_rcv() to error out. To fix this,
    ____dev_forward_skb() is added.

    Joint work with Daniel Borkmann.

    Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
    Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Martin KaFai Lau
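
The egress half of the fix can be illustrated with plain byte strings. This is a sketch under assumptions: `frame_for_redirect()` is a hypothetical helper, the set of header-less device types is limited to the two named in the commit, and checksum adjustment (skb_postpull_rcsum()) is not modelled.

```python
ETH_HLEN = 14          # length of an Ethernet header
ARPHRD_ETHER = 1       # device type values from <linux/if_arp.h>
ARPHRD_TUNNEL = 768    # ipip
ARPHRD_TUNNEL6 = 769   # ip6tnl

# Assumption: only these two types are treated as "no MAC header on xmit".
NO_MAC_HEADER_TYPES = {ARPHRD_TUNNEL, ARPHRD_TUNNEL6}


def frame_for_redirect(frame: bytes, dev_type: int) -> bytes:
    """Return the bytes handed to the target device.

    For ipip/ip6tnl the Ethernet header is pulled, so the tunnel
    encapsulates IP-IP rather than IP-EthHdr-IP.
    """
    if dev_type in NO_MAC_HEADER_TYPES:
        return frame[ETH_HLEN:]
    return frame
```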
     

11 Nov, 2016

7 commits

  • Jiri Pirko says:

    ====================
    mlxsw: Couple of router fixes

    v1->v2:
    - patch2:
    - use net_eq
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Until now, tables with the same id in multiple network namespaces were
    squashed into a single virtual router. That is not only incorrect, it also
    causes error messages when trying to use the RALUE register to do a double
    remove of FIB entries, like this one:

    mlxsw_spectrum 0000:03:00.0: EMAD reg access failed (tid=facb831c00007b20,reg_id=8013(ralue),type=write,status=7(bad parameter))

    Since we don't allow ports to change namespaces (NETIF_F_NETNS_LOCAL),
    and the infrastructure is not yet prepared to handle netnamespaces, just
    ignore FIB notification events for non-init namespaces. That is safe to
    do since we don't need to offload them.

    Fixes: b45f64d16d45 ("mlxsw: spectrum_router: Use FIB notifications instead of switchdev calls")
    Signed-off-by: Jiri Pirko
    Acked-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • __neigh_create function works in a different way than assumed.
    It passes "n" as a parameter to ndo_neigh_construct. But this "n" might
    be destroyed right away before __neigh_create() returns in case there is
    already another neighbour struct in the hashtable with the same dev and
    primary key. That is not expected by mlxsw_sp_router_neigh_construct()
    and the stored "n" points to freed memory, eventually leading to crash.

    Fix this by tightly coupling, 1:1, the neighbour struct and the
    internal driver neigh_entry. That allows narrowing down the key in
    the internal driver hashtable so that lookups are done by "n" only.

    Fixes: 6cf3c971dc84 ("mlxsw: spectrum_router: Add private neigh table")
    Signed-off-by: Jiri Pirko
    Acked-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Yuval Mintz says:

    ====================
    qed: Fix RoCE infrastructure

    This series fixes 2 basic issues with RoCE support,
    one handles a missing configuration in the initial infrastructure
    support while the other is a regression introduced by one of the
    initial fix submissions.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The previous fix broke RoCE support, as the rdma_pf_params are now
    being set into the parameters only after the params are already assigned
    into the hw-function.

    Fixes: 0189efb8f4f8 ("qed*: Fix Kconfig dependencies with INFINIBAND_QEDR")
    Signed-off-by: Ram Amrani
    Signed-off-by: Yuval Mintz
    Signed-off-by: David S. Miller

    Ram Amrani
     
  • Currently RoCE v2 won't operate with RDMA CM due to missing setting of
    the roce-flavour in the ll2 configuration.
    This patch properly sets the flavour, and deletes incorrect HSI
    that doesn't [yet] exist.

    Fixes: abd49676c707 ("qed: Add RoCE ll2 & GSI support")
    Signed-off-by: Ram Amrani
    Signed-off-by: Yuval Mintz
    Signed-off-by: David S. Miller

    Ram Amrani
     
  • This is a follow-up to commit 9ee6c5dc816a ("ipv4: allow local
    fragmentation in ip_finish_output_gso()"), updating the comment
    documenting cases in which fragmentation is needed for egress
    GSO packets.

    Suggested-by: Shmulik Ladkani
    Reviewed-by: Shmulik Ladkani
    Signed-off-by: Lance Richardson
    Signed-off-by: David S. Miller

    Lance Richardson
     

10 Nov, 2016

16 commits

  • Lorenzo noted an Android unit test failed due to e0d56fdd7342:
    "The expectation in the test was that the RST replying to a SYN sent to a
    closed port should be generated with oif=0. In other words it should not
    prefer the interface where the SYN came in on, but instead should follow
    whatever the routing table says it should do."

    Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
    that the oif in the flow is set to the skb_iif only if skb_iif is an L3
    master.

    Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls")
    Reported-by: Lorenzo Colitti
    Signed-off-by: David Ahern
    Tested-by: Lorenzo Colitti
    Acked-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    David Ahern
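
The restored rule for choosing the output interface of the reply can be written as a one-line decision. A sketch only: `reply_oif()` and the set of L3 master ifindexes are hypothetical stand-ins for the kernel's own lookup.

```python
def reply_oif(skb_iif: int, l3_master_ifindexes: set) -> int:
    """Pick the oif used in the flow for a reply (for example a RST).

    0 means "do not prefer any interface, follow the routing table".
    The incoming interface is used only when it is an L3 master device.
    """
    return skb_iif if skb_iif in l3_master_ifindexes else 0
```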
     
  • Add support for Cypress GX3 SuperSpeed to Gigabit Ethernet
    Bridge Controller (Vendor=04b4 ProdID=3610).

    Patch verified on x64 linux kernel 4.7.4, 4.8.6, 4.9-rc4 systems
    with the Kensington SD4600P USB-C Universal Dock with Power,
    which uses the Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge
    Controller.

    A similar patch was signed-off and tested-by Allan Chou
    on 2015-12-01.

    Allan verified his similar patch on x86 Linux kernel 4.1.6 system
    with Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge Controller.

    Tested-by: Allan Chou
    Tested-by: Chris Roth
    Tested-by: Artjom Simon

    Signed-off-by: Allan Chou
    Signed-off-by: Chris Roth
    Signed-off-by: David S. Miller

    Allan Chou
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains a larger than usual batch of Netfilter
    fixes for your net tree. This series contains a mixture of old bugs and
    recently introduced bugs, they are:

    1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
    support the set element updates from the packet path. From Liping
    Zhang.

    2) Fix leak when nft_expr_clone() fails, from Liping Zhang.

    3) Fix a race when inserting new elements to the set hash from the
    packet path, also from Liping.

    4) Handle segmented TCP SIP packets properly: prevent the INVITE in
    the Allow header from creating bogus expectations by performing
    stricter SIP message parsing, from Ulrich Weber.

    5) nft_parse_u32_check() should return signed integer for errors, from
    John Linville.

    6) Fix wrong allocation size for connlabels, allocate 16 instead of
    32 bytes, from Florian Westphal.

    7) Fix compilation breakage when building the ip_vs_sync code with
    CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.

    8) Destroy the new set if the transaction object cannot be allocated,
    also from Liping Zhang.

    9) Use device to route duplicated packets via nft_dup only when set by
    the user, otherwise packets may not follow the right route, again
    from Liping.

    10) Fix wrong maximum genetlink attribute definition in IPVS, from
    WANG Cong.

    11) Ignore untracked conntrack objects from xt_connmark, from Florian
    Westphal.

    12) Allow the use of conntrack helpers that are registered as
    NFPROTO_UNSPEC via the CT target, otherwise we cannot use the H.245
    helper, from Florian.

    13) Revisit garbage collection heuristic in the new workqueue-based
    timer approach for conntrack to evict objects earlier, again from
    Florian.

    14) Fix crash in nf_tables when inserting an element into a verdict map,
    from Liping Zhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • To avoid having dangling function pointers left behind, reset calcit in
    rtnl_unregister(), too.

    This is no issue so far, as only the rtnl core registers a netlink
    handler with a calcit hook which won't be unregistered, but may become
    one if new code makes use of the calcit hook.

    Fixes: c7ac8679bec9 ("rtnetlink: Compute and store minimum ifinfo...")
    Cc: Jeff Kirsher
    Cc: Greg Rose
    Signed-off-by: Mathias Krause
    Signed-off-by: David S. Miller

    Mathias Krause
     
  • A bugfix introduced a harmless warning in v4.9-rc4:

    drivers/net/vxlan.c: In function 'vxlan_group_used':
    drivers/net/vxlan.c:947:21: error: unused variable 'sock6' [-Werror=unused-variable]

    This hides the variable inside of the same #ifdef that is
    around its user. The extraneous initialization is removed
    at the same time; it was accidentally introduced in the
    same commit.

    Fixes: c6fcc4fc5f8b ("vxlan: avoid using stale vxlan socket.")
    Signed-off-by: Arnd Bergmann
    Acked-by: Jiri Benc
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • Use the opt_* fields to determine the starting point for negotiating the
    number of tx/rx completion queues with the vnic server. These contain the
    number of queues that the vnic server estimates that it will be able to
    allocate. While renegotiation may still occur, using the opt_* fields will
    reduce the number of times this needs to happen and will prevent driver
    probe timeout on systems using large numbers of ibmvnic client devices per
    vnic port.

    Signed-off-by: John Allen
    Signed-off-by: David S. Miller

    John Allen
     
  • icmp_send is called in response to some event. The skb may not have
    the device set (skb->dev is NULL), but it is expected to have an rt.
    Update icmp_route_lookup to use the rt on the skb to determine L3
    domain.

    Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Timur Tabi says:

    ====================
    net: qcom/emac: ensure that pause frames are enabled

    The qcom emac driver experiences significant packet loss (through frame
    check sequence errors) if flow control is not enabled and the phy is
    not configured to allow pause frames to pass through it. Therefore, we
    need to enable flow control and force the phy to pass pause frames.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • If the PHY has been configured to allow pause frames, then the MAC
    should be configured to generate and/or accept those frames.

    Signed-off-by: Timur Tabi
    Signed-off-by: David S. Miller

    Timur Tabi
     
  • Pause frames are used to enable flow control. A MAC can send and
    receive pause frames in order to throttle traffic. However, the PHY
    must be configured to allow those frames to pass through.

    Reviewed-by: Florian Fainelli
    Signed-off-by: Timur Tabi
    Signed-off-by: David S. Miller

    Timur Tabi
     
  • This fixes a regression introduced by the patch adding feature flags. It
    was already reported and a patch followed (it got accepted), but it appears
    it was incorrect. Instead of fixing the reversed condition it broke a good one.

    This patch was verified to actually fix SoC hangs caused by bgmac on
    BCM47186B0.

    Fixes: db791eb2970b ("net: ethernet: bgmac: convert to feature flags")
    Fixes: 4af1474e6198 ("net: bgmac: Fix errant feature flag check")
    Cc: Jon Mason
    Signed-off-by: Rafał Miłecki
    Signed-off-by: David S. Miller

    Rafał Miłecki
     
  • We received two reports of BUG_ON in bnad_txcmpl_process() where
    hw_consumer_index appeared to be ahead of producer_index. Out of order
    write/read of these variables could explain these reports.

    bnad_start_xmit(), as a producer of tx descriptors, has a few memory
    barriers sprinkled around writes to producer_index and the device's
    doorbell but they're not paired with anything in bnad_txcmpl_process(), a
    consumer.

    Since we are synchronizing with a device, we must use mandatory barriers,
    not smp_*. Also, I didn't see the purpose of the last smp_mb() in
    bnad_start_xmit().

    Signed-off-by: Benjamin Poirier
    Signed-off-by: David S. Miller

    Benjamin Poirier
     
  • This reverts commit 9d2afba058722d40cc02f430229c91611c0e8d16.

    The original issue would possibly exist if an external module
    tried calling our "ethtool_ops" without checking if it still
    exists.

    The right way of solving it is by simply doing the check on
    the caller side.
    Currently, no action is required as there's no such use case.

    Signed-off-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Tariq Toukan
     
  • Routes can specify an mtu explicitly or inherit the mtu from
    the underlying device - this inheritance is implemented in
    dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().

    Currently changing the mtu of a device adds mtu explicitly
    to routes using that device.

    For example:
    # ip link set dev lo mtu 65536
    # ip -6 route add local 2000::1 dev lo
    # ip -6 route get 2000::1
    local 2000::1 dev lo table local src ... metric 1024 pref medium

    # ip link set dev lo mtu 65535
    # ip -6 route get 2000::1
    local 2000::1 dev lo table local src ... metric 1024 mtu 65535 pref medium

    # ip link set dev lo mtu 65536
    # ip -6 route get 2000::1
    local 2000::1 dev lo table local src ... metric 1024 mtu 65536 pref medium

    # ip -6 route del local 2000::1

    After this patch the route entry no longer changes unless it already has an mtu.
    There is no need: this inheritance is already done in ip6_mtu()

    # ip link set dev lo mtu 65536
    # ip -6 route add local 2000::1 dev lo
    # ip -6 route add local 2000::2 dev lo mtu 2000
    # ip -6 route get 2000::1; ip -6 route get 2000::2
    local 2000::1 dev lo table local src ... metric 1024 pref medium
    local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium

    # ip link set dev lo mtu 65535
    # ip -6 route get 2000::1; ip -6 route get 2000::2
    local 2000::1 dev lo table local src ... metric 1024 pref medium
    local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium

    # ip link set dev lo mtu 1501
    # ip -6 route get 2000::1; ip -6 route get 2000::2
    local 2000::1 dev lo table local src ... metric 1024 pref medium
    local 2000::2 dev lo table local src ... metric 1024 mtu 1501 pref medium

    # ip link set dev lo mtu 65536
    # ip -6 route get 2000::1; ip -6 route get 2000::2
    local 2000::1 dev lo table local src ... metric 1024 pref medium
    local 2000::2 dev lo table local src ... metric 1024 mtu 65536 pref medium

    # ip -6 route del local 2000::1
    # ip -6 route del local 2000::2

    This is desirable because changing device mtu and then resetting it
    to the previous value shouldn't change the user visible routing table.

    Signed-off-by: Maciej Żenczykowski
    CC: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Maciej Żenczykowski
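
The behaviour shown in the "after this patch" transcript can be reproduced with a small model. The update rule below is an assumption inferred from that transcript (lower an explicit mtu that exceeds the new device mtu; raise it only if it was equal to the old device mtu); `Route` and `Device` are hypothetical classes, not kernel structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Route:
    dst: str
    mtu: Optional[int] = None  # None: no explicit mtu, inherit from the device


@dataclass
class Device:
    mtu: int
    routes: List[Route] = field(default_factory=list)

    def effective_mtu(self, route: Route) -> int:
        """Inheritance happens at lookup time, as ip6_mtu() does."""
        return route.mtu if route.mtu is not None else self.mtu

    def set_mtu(self, new_mtu: int) -> None:
        old_mtu = self.mtu
        for route in self.routes:
            if route.mtu is None:
                continue  # nothing stored, nothing to rewrite
            too_big = route.mtu > new_mtu
            tracked_device = route.mtu < new_mtu and route.mtu == old_mtu
            if too_big or tracked_device:
                route.mtu = new_mtu
        self.mtu = new_mtu
```

Walking the model through the same device mtu sequence (65535, 1501, 65536) leaves the first route without an explicit mtu throughout, while the second route goes 2000, 1501, 65536, matching the output above.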
     
  • Do not send the next message in sendmmsg for partial sendmsg
    invocations.

    sendmmsg assumes that it can continue sending the next message
    when the return value of the individual sendmsg invocations
    is positive. It results in corrupting the data for TCP,
    SCTP, and UNIX streams.

    For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
    of "aefgh" if the first sendmsg invocation sends only the first
    byte while the second sendmsg goes through.

    Datagram sockets either send the entire datagram or fail, so
    this patch affects only sockets of type SOCK_STREAM and
    SOCK_SEQPACKET.

    Fixes: 228e548e6020 ("net: Add sendmmsg socket system call")
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Neal Cardwell
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
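
The "aefgh" corruption described above is easy to reproduce in a toy model of a byte stream. This is a sketch, not the syscall implementation: `sendmmsg_model()` is a hypothetical function, and `limits` is an assumed per-call cap on how many bytes the stream accepts.

```python
from typing import List


def sendmmsg_model(stream: bytearray, messages: List[bytes],
                   limits: List[int], stop_on_partial: bool = True) -> List[int]:
    """Send messages in order; limits[i] caps the bytes accepted for message i.

    Returns the number of bytes sent for each message that was attempted.
    """
    sent_lens = []
    for msg, limit in zip(messages, limits):
        n = min(len(msg), limit)
        stream.extend(msg[:n])
        sent_lens.append(n)
        # The fix: a partial send ends the batch, so later messages cannot
        # be appended in the middle of an unfinished one.
        if stop_on_partial and n < len(msg):
            break
    return sent_lens
```

With `stop_on_partial=False` and limits `[1, 4]`, the messages `b"abcd"` and `b"efgh"` produce the stream `b"aefgh"`. With the default, the stream holds only `b"a"`, and the caller can resend the remaining bytes of the first message.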
     
  • When there is no existing macvlan port in lowdev, a new macvlan port
    is created. But it is not destroyed when something fails later,
    which causes a memory leak.

    Now add a flag to indicate whether a new macvlan port was created.

    Signed-off-by: Gao Feng
    Signed-off-by: David S. Miller

    Gao Feng
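
The flag-based cleanup can be sketched as below. `LowerDev`, `newlink()` and the `register` callback are hypothetical names; the point is only that a port is torn down on failure if, and only if, this call created it.

```python
class LowerDev:
    def __init__(self, port=None):
        self.port = port  # shared macvlan port, or None if not created yet


def newlink(lowerdev: LowerDev, register) -> bool:
    """Attach a new macvlan; `register` returns False to simulate a later failure."""
    created = False
    if lowerdev.port is None:
        lowerdev.port = {"vlans": 0}
        created = True  # remember that this call owns the new port
    if not register():
        if created:
            lowerdev.port = None  # undo only what this call created
        return False
    lowerdev.port["vlans"] += 1
    return True
```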
     

09 Nov, 2016

5 commits

  • Dalegaard says:
    The following ruleset, when loaded with 'nft -f bad.txt'
    ----snip----
    flush ruleset
    table ip inlinenat {
    map sourcemap {
    type ipv4_addr : verdict;
    }

    chain postrouting {
    ip saddr vmap @sourcemap accept
    }
    }
    add chain inlinenat test
    add element inlinenat sourcemap { 100.123.10.2 : jump test }
    ----snip----

    results in a kernel oops:
    BUG: unable to handle kernel paging request at 0000000000001344
    IP: nf_tables_check_loops+0x114/0x1f0 [nf_tables]
    [...]
    Call Trace:
    ? nft_data_init+0x13e/0x1a0 [nf_tables]
    nft_validate_register_store+0x60/0xb0 [nf_tables]
    nft_add_set_elem+0x545/0x5e0 [nf_tables]
    ? nft_table_lookup+0x30/0x60 [nf_tables]
    ? nla_strcmp+0x40/0x50
    nf_tables_newsetelem+0x11e/0x210 [nf_tables]
    ? nla_validate+0x60/0x80
    nfnetlink_rcv+0x354/0x5a7 [nfnetlink]

    We forget to fill the net pointer in bind_ctx, so dereferencing
    it may cause a kernel crash.

    Reported-by: Dalegaard
    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • Nicolas Dichtel says:
    After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
    remove timed-out entries"), netlink conntrack deletion events may be
    sent with a huge delay.

    Nicolas further points at this line:

    goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

    and indeed, this isn't optimal at all. Rationale here was to ensure that
    we don't block other work items for too long, even if
    nf_conntrack_htable_size is huge. But in order to have some guarantee
    about maximum time period where a scan of the full conntrack table
    completes we should always use a fixed slice size, so that once every
    N scans the full table has been examined at least once.

    We also need to balance this vs. the case where the system is either idle
    (i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
    from packet path).

    So, after some discussion with Nicolas:

    1. want hard guarantee that we scan entire table at least once every X s
    -> need to scan fraction of table (get rid of upper bound)

    2. don't want to eat cycles on idle or very busy system
    -> increase interval if we did not evict any entries

    3. don't want to block other worker items for too long
    -> make fraction really small, and prefer small scan interval instead

    4. Want reasonably short time where we detect timed-out entry when
    system went idle after a burst of traffic, while not doing scans
    all the time.
    -> Store next gc scan in worker, increasing delays when no eviction
    happened and shrinking delay when we see timed out entries.

    The old gc interval is turned into a max number; scans can now happen
    every jiffy if stale entries are present.

    Longest possible time period until an entry is evicted is now 2 minutes
    in worst case (entry expires right after it was deemed 'not expired').

    Reported-by: Nicolas Dichtel
    Signed-off-by: Florian Westphal
    Acked-by: Nicolas Dichtel
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
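
The four goals above boil down to a fixed scan fraction plus an adaptive delay. The sketch below is a model under assumptions, not the kernel worker: both constants and both function names are hypothetical, and the constants are chosen so that 64 runs at the maximum delay take about two minutes, in line with the worst case quoted above.

```python
GC_BUCKETS_DIV = 64     # hypothetical: scan 1/64 of the table per run
GC_DELAY_MAX_MS = 2000  # hypothetical upper bound on the delay between runs


def scan_slice(htable_size: int) -> int:
    """Buckets to scan per run: a fixed fraction of the table, with no upper bound."""
    return max(htable_size // GC_BUCKETS_DIV, 1)


def next_delay_ms(current_ms: int, evicted: int) -> int:
    """Shrink the delay when stale entries were found, grow it when none were."""
    if evicted > 0:
        return current_ms // 2
    return min(current_ms + 1, GC_DELAY_MAX_MS)
```

Because every run covers the same fraction of the table, a full pass is guaranteed after a bounded number of runs regardless of table size, while an idle system drifts toward the maximum delay.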
     
  • Thomas reports it's not possible to attach the H.245 helper:

    iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
    iptables: No chain/target/match by that name.
    xt_CT: No such helper "H.245"

    This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
    passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.

    We should treat UNSPEC as wildcard and ignore the l3num instead.

    Reported-by: Thomas Woerner
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
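
Treating UNSPEC as a wildcard changes only the l3num comparison in the helper lookup. A sketch with hypothetical names (`Helper`, `find_helper()`); the NFPROTO_* and IPPROTO_UDP values are the ones from the kernel uapi headers.

```python
from typing import List, NamedTuple, Optional

NFPROTO_UNSPEC = 0
NFPROTO_IPV4 = 2
NFPROTO_IPV6 = 10
IPPROTO_UDP = 17


class Helper(NamedTuple):
    name: str
    l3num: int
    protonum: int


def find_helper(helpers: List[Helper], name: str,
                l3num: int, protonum: int) -> Optional[Helper]:
    for helper in helpers:
        if helper.name != name or helper.protonum != protonum:
            continue
        # A helper registered as UNSPEC matches any requested l3num.
        if helper.l3num == NFPROTO_UNSPEC or helper.l3num == l3num:
            return helper
    return None
```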
     
  • The (percpu) untracked conntrack entries can end up with nonzero connmarks.

    The 'untracked' conntrack objects are merely a way to distinguish INVALID
    (i.e. protocol connection tracker says payload doesn't meet some
    requirements or packet was never seen by the connection tracking code)
    from packets that are intentionally not tracked (some icmpv6 types such as
    neigh solicitation, or by using 'iptables -j CT --notrack' option).

    Untracked conntrack objects are an implementation detail; we might as well
    use an invalid magic address instead to tell INVALID and UNTRACKED apart.

    Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.

    Reported-by: XU Tianwen
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • family.maxattr is the max index for policy[]; the size of
    ops[] is determined with ARRAY_SIZE().

    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Cc: Pablo Neira Ayuso
    Signed-off-by: Cong Wang
    Signed-off-by: Simon Horman
    Signed-off-by: Pablo Neira Ayuso

    WANG Cong
     

08 Nov, 2016

4 commits

  • The display of /proc/net/route has had a couple issues due to the fact that
    when I originally rewrote most of fib_trie I made it so that the iterator
    was tracking the next value to use instead of the current.

    In addition it had an off by 1 error where I was tracking the first piece
    of data as position 0, even though in reality that belonged to the
    SEQ_START_TOKEN.

    This patch updates the code so the iterator tracks the last reported
    position and key instead of the next expected position and key. In
    addition it shifts things so that all of the leaves start at 1 instead of
    trying to report leaves starting with offset 0 as being valid. With these
    two issues addressed this should resolve any off by one errors that were
    present in the display of /proc/net/route.

    Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in /proc/net/route")
    Cc: Andy Whitcroft
    Reported-by: Jason Baron
    Tested-by: Jason Baron
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
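
The position convention described above (header at position 0, leaves from 1, iterator remembers the last reported position) can be modelled in a few lines. This is an illustrative sketch: `entry_at()` and `read_all()` are hypothetical, and a plain list stands in for the trie leaves.

```python
SEQ_START_TOKEN = "HEADER"  # stand-in for the header line at position 0


def entry_at(leaves, pos):
    """Map a position to an entry: 0 is the header, leaves start at 1."""
    if pos == 0:
        return SEQ_START_TOKEN
    index = pos - 1
    return leaves[index] if index < len(leaves) else None


def read_all(leaves, chunk):
    """Read everything in reads of `chunk` entries, resuming after the last reported position."""
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    out = []
    last_pos = -1  # nothing reported yet
    while True:
        for _ in range(chunk):
            entry = entry_at(leaves, last_pos + 1)
            if entry is None:
                return out
            out.append(entry)
            last_pos += 1
```

Whatever the read size, each leaf appears exactly once, which is the property the off-by-one errors used to break.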
     
  • Add Qualcomm QCA tagging introduced in cafdc45c9 to the
    list of supported protocols.

    Signed-off-by: Fabian Mewes
    Reviewed-by: Andrew Lunn
    Acked-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Fabian Mewes
     
  • Virtio 1.0 spec says VIRTIO_F_ANY_LAYOUT and VIRTIO_NET_F_GSO are
    legacy-only feature bits. Do not negotiate them in virtio 1 mode. Note
    that the current behavior is a spec violation, so this fix needs to be
    backported to stable/downstream kernels.

    Cc: stable@vger.kernel.org
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Cornelia Huck
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
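
The negotiation rule can be sketched with sets of feature bit numbers. `negotiate()` is a hypothetical function that models the outcome only; the bit numbers are those defined in the virtio headers.

```python
VIRTIO_NET_F_GSO = 6
VIRTIO_F_ANY_LAYOUT = 27
VIRTIO_F_VERSION_1 = 32

LEGACY_ONLY = {VIRTIO_NET_F_GSO, VIRTIO_F_ANY_LAYOUT}


def negotiate(device_features: set, driver_features: set) -> set:
    """Return the feature bits both sides agree on.

    When VIRTIO_F_VERSION_1 is agreed, legacy-only bits are dropped.
    """
    common = device_features & driver_features
    if VIRTIO_F_VERSION_1 in common:
        common -= LEGACY_ONLY
    return common
```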
     
  • icmp6_send is called in response to some event. The skb may not have
    the device set (skb->dev is NULL), but it is expected to have a dst set.
    Update icmp6_send to use the dst on the skb to determine L3 domain.

    Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern