26 Mar, 2020

1 commit

  • net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
    net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
    pkt->skb->tc_redirected = 1;
    ^~
    net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
    pkt->skb->tc_from_ingress = 1;
    ^~

    To avoid a direct dependency with tc actions from netfilter, wrap the
    redirect bits around CONFIG_NET_REDIRECT and move helpers to
    include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
    only existing client of these bits in the tree.

    This patch adds skb_set_redirected(), which sets the redirected bit
    on the skbuff, specifies whether the packet was redirected from ingress,
    and resets the timestamp (the timestamp reset was originally missing in
    the netfilter bugfix).
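
    A minimal userspace sketch of what such a helper might look like. The
    struct mock_skb type and the mock_ prefix are hypothetical stand-ins, not
    the kernel's struct sk_buff, and resetting the timestamp only for ingress
    redirects is an assumption of this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff: only the fields discussed here. */
struct mock_skb {
	uint64_t tstamp;
	unsigned int redirected:1;
	unsigned int from_ingress:1;
};

/* Mark the buffer as redirected and record where it came from. */
static inline void mock_skb_set_redirected(struct mock_skb *skb, bool from_ingress)
{
	assert(skb != NULL);
	skb->redirected = 1;
	skb->from_ingress = from_ingress;
	/* Assumption: the timestamp is only cleared for ingress redirects. */
	if (skb->from_ingress)
		skb->tstamp = 0;
}
```

    Keeping all three steps in one helper means callers such as netfilter
    cannot set the bits and forget the timestamp reset.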

    Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
    Reported-by: noreply@ellerman.id.au
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

14 Mar, 2020

2 commits

  • Fix the handling of signals in client rxrpc calls made by the afs
    filesystem. Ignore signals completely, leaving call abandonment or
    connection loss to be detected by timeouts inside AF_RXRPC.

    Allowing a filesystem call to be interrupted after the entire request has
    been transmitted and an abort sent means that the server may or may not
    have done the action - and we don't know. It may even be worse than that
    for older servers.

    Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
    Signed-off-by: David Howells

    David Howells
     
  • Fix the interruptibility of kernel-initiated client calls so that they're
    either only interruptible when they're waiting for a call slot to come
    available or they're not interruptible at all. Either way, they're not
    interruptible during transmission.

    This should help prevent StoreData calls from being interrupted when
    writeback is in progress. It doesn't, however, handle interruption during
    the receive phase.

    Userspace-initiated calls are still interruptible. After the signal has
    been handled, sendmsg() will return the amount of data copied out of the
    buffer and userspace can perform another sendmsg() call to continue
    transmission.

    Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
    Signed-off-by: David Howells

    David Howells
     

04 Mar, 2020

1 commit


18 Feb, 2020

1 commit

  • tc flower rules that are based on src or dst port blocking are sometimes
    ineffective due to uninitialized stack data. __skb_flow_dissect() extracts
    ports from the skb for tc flower to match against. However, the port
    dissection is not done when the FLOW_DIS_IS_FRAGMENT bit is set in
    key_control->flags. All callers of __skb_flow_dissect() zero out the
    key_control field except for fl_classify() as used by the flower
    classifier. Thus, the FLOW_DIS_IS_FRAGMENT bit may be set on entry to
    __skb_flow_dissect(), since key_control is allocated on the stack
    and may not be initialized.

    Since key_basic and key_control are present for all flow keys, let's
    make sure they are initialized.
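
    The failure mode can be reproduced in a small userspace sketch. All
    mock_ types and the flag value below are hypothetical simplifications of
    the flow dissector structures, chosen only to show why a stale flag on
    the stack suppresses port extraction and how zeroing the two keys
    avoids it:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MOCK_FLOW_DIS_IS_FRAGMENT 0x1u /* hypothetical flag value */

struct mock_key_control { uint16_t thoff; uint16_t addr_type; uint32_t flags; };
struct mock_key_basic   { uint16_t n_proto; uint8_t ip_proto; uint8_t padding; };

struct mock_flow_keys {
	struct mock_key_control control;
	struct mock_key_basic basic;
	uint16_t src_port;
	uint16_t dst_port;
};

/* Like the dissector described above: ports are skipped for fragments. */
static void mock_dissect(struct mock_flow_keys *keys, uint16_t sport, uint16_t dport)
{
	assert(keys != NULL);
	if (keys->control.flags & MOCK_FLOW_DIS_IS_FRAGMENT)
		return;
	keys->src_port = sport;
	keys->dst_port = dport;
}

/* The fix: initialize control and basic before dissecting. */
static void mock_classify(struct mock_flow_keys *keys, uint16_t sport, uint16_t dport)
{
	memset(&keys->control, 0, sizeof(keys->control));
	memset(&keys->basic, 0, sizeof(keys->basic));
	mock_dissect(keys, sport, dport);
}
```

    Filling a struct mock_flow_keys with 0xff bytes to imitate stack garbage
    and calling mock_dissect() directly leaves the ports untouched, while
    mock_classify() extracts them.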

    Fixes: 62230715fd24 ("flow_dissector: do not dissect l4 ports for fragments")
    Co-developed-by: Eric Dumazet
    Signed-off-by: Eric Dumazet
    Acked-by: Cong Wang
    Signed-off-by: Jason Baron
    Signed-off-by: David S. Miller

    Jason Baron
     

17 Feb, 2020

1 commit

  • Fix all kernel-doc warnings for <net/sock.h>.
    Fixes these warnings:

    ../include/net/sock.h:232: warning: Function parameter or member 'skc_addrpair' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_portpair' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_ipv6only' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_net_refcnt' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_v6_daddr' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_v6_rcv_saddr' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_cookie' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_listener' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_tw_dr' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_rcv_wnd' not described in 'sock_common'
    ../include/net/sock.h:232: warning: Function parameter or member 'skc_tw_rcv_nxt' not described in 'sock_common'

    ../include/net/sock.h:498: warning: Function parameter or member 'sk_rx_skb_cache' not described in 'sock'
    ../include/net/sock.h:498: warning: Function parameter or member 'sk_wq_raw' not described in 'sock'
    ../include/net/sock.h:498: warning: Function parameter or member 'tcp_rtx_queue' not described in 'sock'
    ../include/net/sock.h:498: warning: Function parameter or member 'sk_tx_skb_cache' not described in 'sock'
    ../include/net/sock.h:498: warning: Function parameter or member 'sk_route_forced_caps' not described in 'sock'
    ../include/net/sock.h:498: warning: Function parameter or member 'sk_txtime_report_errors' not described in 'sock'
    ../include/net/sock.h:498: warning: Function parameter or member 'sk_validate_xmit_skb' not described in 'sock'
    ../include/net/sock.h:498: warning: Function parameter or member 'sk_bpf_storage' not described in 'sock'

    ../include/net/sock.h:2024: warning: No description found for return value of 'sk_wmem_alloc_get'
    ../include/net/sock.h:2035: warning: No description found for return value of 'sk_rmem_alloc_get'
    ../include/net/sock.h:2046: warning: No description found for return value of 'sk_has_allocations'
    ../include/net/sock.h:2082: warning: No description found for return value of 'skwq_has_sleeper'
    ../include/net/sock.h:2244: warning: No description found for return value of 'sk_page_frag'
    ../include/net/sock.h:2444: warning: Function parameter or member 'tcp_rx_skb_cache_key' not described in 'DECLARE_STATIC_KEY_FALSE'
    ../include/net/sock.h:2444: warning: Excess function parameter 'sk' description in 'DECLARE_STATIC_KEY_FALSE'
    ../include/net/sock.h:2444: warning: Excess function parameter 'skb' description in 'DECLARE_STATIC_KEY_FALSE'

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Randy Dunlap
     

14 Feb, 2020

3 commits

  • …rnel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    Just a few fixes:
    * avoid running out of tracking space for frames that need
    to be reported to userspace by using more bits
    * fix beacon handling suppression by adding some relevant
    elements to the CRC calculation
    * fix quiet mode in action frames
    * fix crash in ethtool for virt_wifi and similar
    * add a missing policy entry
    * fix 160 & 80+80 bandwidth to take local capabilities into
    account
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • This introduces a helper function to be called only by network drivers
    that wraps calls to icmp[v6]_send in a conntrack transformation, in case
    NAT has been used. We don't want to pollute the non-driver path, though,
    so we introduce this as a helper to be called by places that actually
    make use of this, as suggested by Florian.

    Signed-off-by: Jason A. Donenfeld
    Cc: Florian Westphal
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     
  • @thoff has moved to struct flow_dissector_key_control.

    Fixes: 42aecaa9bb2b ("net: Get skb hash over flow_keys structure")
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     

07 Feb, 2020

1 commit

  • It turns out that this wasn't a good idea, I hit a test failure in
    hwsim due to this. That particular failure was easily worked around,
    but it raised questions: if an AP needs to, for example, send action
    frames to each connected station, the current limit is nowhere near
    enough (especially if those stations are sleeping and the frames are
    queued for a while.)

    Shuffle around some bits to make more room for ack_frame_id to allow
    up to 8192 queued up frames, that's enough for queueing 4 frames to
    each connected station, even at the maximum of 2007 stations on a
    single AP.

    We take the bits from band (which currently only needs 2, but I leave 3
    in case we add another band) and from hw_queue, which needs only 4
    since it has a limit of 16 queues.
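
    The arithmetic can be checked with a short sketch. The struct below is a
    hypothetical illustration rather than the real mac80211 layout: it only
    contains the three fields named above, with ack_frame_id assumed to be
    13 bits wide because 2^13 = 8192:

```c
#include <assert.h>

enum {
	MOCK_ACK_ID_BITS    = 13,   /* 2^13 = 8192 frame IDs */
	MOCK_MAX_STATIONS   = 2007, /* maximum stations on a single AP */
	MOCK_FRAMES_PER_STA = 4,
};

/* Hypothetical packing: 3 + 13 + 4 = 20 bits, leaving room for other fields. */
struct mock_tx_info_bits {
	unsigned int band:3;
	unsigned int ack_frame_id:MOCK_ACK_ID_BITS;
	unsigned int hw_queue:4;
};

/* Number of distinct values the ack_frame_id field can hold. */
static inline unsigned int mock_ack_id_space(void)
{
	return 1u << MOCK_ACK_ID_BITS;
}

/* Frames that need tracking if every station has the same number queued. */
static inline unsigned int mock_ack_ids_needed(void)
{
	return MOCK_FRAMES_PER_STA * MOCK_MAX_STATIONS;
}
```

    With these numbers, 4 frames for each of 2007 stations need 8028 IDs,
    which fits in the 8192 available.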

    Fixes: 6912daed05e1 ("mac80211: Shrink the size of ack_frame_id to make room for tx_time_est")
    Signed-off-by: Johannes Berg
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/r/20200115122549.b9a4ef9f4980.Ied52ed90150220b83a280009c590b65d125d087c@changeid
    Signed-off-by: Johannes Berg

    Johannes Berg
     

05 Feb, 2020

1 commit

  • syzbot managed to send an IPX packet through bond_alb_xmit()
    and af_packet and triggered a use-after-free.

    First, bond_alb_xmit() was using ipx_hdr() helper to reach
    the IPX header, but ipx_hdr() was using the transport offset
    instead of the network offset. In the particular syzbot
    report, the transport offset was 0xFFFF.

    This patch removes ipx_hdr() since it was only (mis)used from bonding.

    Then we need to make sure IPv4/IPv6/IPX headers are pulled
    in skb->head before dereferencing anything.

    BUG: KASAN: use-after-free in bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
    Read of size 2 at addr ffff8801ce56dfff by task syz-executor.2/18108
    (if (ipx_hdr(skb)->ipx_checksum != IPX_NO_CHECKSUM) ...)

    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    [] __dump_stack lib/dump_stack.c:17 [inline]
    [] dump_stack+0x14d/0x20b lib/dump_stack.c:53
    [] print_address_description+0x6f/0x20b mm/kasan/report.c:282
    [] kasan_report_error mm/kasan/report.c:380 [inline]
    [] kasan_report mm/kasan/report.c:438 [inline]
    [] kasan_report.cold+0x8c/0x2a0 mm/kasan/report.c:422
    [] __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:469
    [] bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
    [] __bond_start_xmit drivers/net/bonding/bond_main.c:4199 [inline]
    [] bond_start_xmit+0x4f4/0x1570 drivers/net/bonding/bond_main.c:4224
    [] __netdev_start_xmit include/linux/netdevice.h:4525 [inline]
    [] netdev_start_xmit include/linux/netdevice.h:4539 [inline]
    [] xmit_one net/core/dev.c:3611 [inline]
    [] dev_hard_start_xmit+0x168/0x910 net/core/dev.c:3627
    [] __dev_queue_xmit+0x1f55/0x33b0 net/core/dev.c:4238
    [] dev_queue_xmit+0x18/0x20 net/core/dev.c:4278
    [] packet_snd net/packet/af_packet.c:3226 [inline]
    [] packet_sendmsg+0x4919/0x70b0 net/packet/af_packet.c:3252
    [] sock_sendmsg_nosec net/socket.c:673 [inline]
    [] sock_sendmsg+0x12c/0x160 net/socket.c:684
    [] __sys_sendto+0x262/0x380 net/socket.c:1996
    [] SYSC_sendto net/socket.c:2008 [inline]
    [] SyS_sendto+0x40/0x60 net/socket.c:2004

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Cc: Jay Vosburgh
    Cc: Veaceslav Falico
    Cc: Andy Gospodarek
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Jan, 2020

2 commits

  • If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:

    ERROR: "mptcp_handle_ipv6_mapped" [net/ipv6/ipv6.ko] undefined!

    This does not happen if CONFIG_MPTCP_IPV6=y, as CONFIG_MPTCP_IPV6
    selects CONFIG_IPV6, and thus forces CONFIG_IPV6 builtin.

    As exporting a symbol for an empty function would be a bit wasteful, fix
    this by providing a dummy version of mptcp_handle_ipv6_mapped() for the
    CONFIG_MPTCP_IPV6=n case.
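
    The stub pattern described here can be sketched as follows. The
    MOCK_CONFIG_MPTCP_IPV6 macro, struct mock_sock and both function names
    are hypothetical; the point is only that the caller stays free of
    #ifdef and the disabled case needs no exported symbol:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_sock { bool mapped; }; /* hypothetical stand-in for struct sock */

#ifdef MOCK_CONFIG_MPTCP_IPV6
/* Real implementation lives elsewhere when the option is enabled. */
void mock_mptcpv6_handle_mapped(struct mock_sock *sk, bool mapped);
#else
/* Empty inline stub: compiles away, nothing to export. */
static inline void mock_mptcpv6_handle_mapped(struct mock_sock *sk, bool mapped)
{
	(void)sk;
	(void)mapped;
}
#endif

/* Caller is identical in both configurations. */
static void mock_ipv6_set_mapped(struct mock_sock *sk, bool mapped)
{
	assert(sk != NULL);
	sk->mapped = mapped;
	mock_mptcpv6_handle_mapped(sk, mapped);
}
```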

    Rename mptcp_handle_ipv6_mapped() to mptcpv6_handle_mapped(), to make it
    clear this is a pure-IPV6 function, just like mptcpv6_init().

    Fixes: cec37a6e41aae7bf ("mptcp: Handle MP_CAPABLE options for outgoing connections")
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: David S. Miller

    Geert Uytterhoeven
     
  • Commit 6cd021a58c18a ("udp: segment looped gso packets correctly")
    fixes an issue with rare udp gso multicast packets looped onto the
    receive path.

    The stable backport makes the narrowest change to target only these
    packets, when needed. As opposed to, say, expanding __udp_gso_segment,
    which is harder to reason to be free from unintended side-effects.

    But the resulting code is hardly self-describing.
    Document its purpose and rationale.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

28 Jan, 2020

1 commit

  • Multicast and broadcast packets can be looped from egress to ingress
    pre segmentation with dev_loopback_xmit. That function unconditionally
    sets ip_summed to CHECKSUM_UNNECESSARY.

    udp_rcv_segment segments gso packets in the udp rx path. Segmentation
    usually executes on egress, and does not expect packets of this type.
    __udp_gso_segment interprets !CHECKSUM_PARTIAL as CHECKSUM_NONE. But
    the offsets are not correct for gso_make_checksum.

    UDP GSO packets are of type CHECKSUM_PARTIAL, with their uh->check set
    to the correct pseudo header checksum. Reset ip_summed to this type.
    (CHECKSUM_PARTIAL is allowed on ingress, see comments in skbuff.h)

    Reported-by: syzbot
    Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

27 Jan, 2020

7 commits

  • Add definition and documentation for the new generic info "fw.roce".

    v2: Remove board.nvm_cfg since fw.psid is similar.

    Cc: Jiri Pirko
    Cc: Jakub Kicinski
    Signed-off-by: Vasundhara Volam
    Signed-off-by: Michael Chan
    Signed-off-by: David S. Miller

    Vasundhara Volam
     
  • …etooth/bluetooth-next

    Johan Hedberg says:

    ====================
    pull request: bluetooth-next 2020-01-26

    Here's (probably) the last bluetooth-next pull request for the 5.6 kernel.

    - Initial pieces of Bluetooth 5.2 Isochronous Channels support
    - mgmt: Various cleanups and a new Set Blocked Keys command
    - btusb: Added support for 04ca:3021 QCA_ROME device
    - hci_qca: Multiple fixes & cleanups
    - hci_bcm: Fixes & improved device tree support
    - Fixed attempts to create duplicate debugfs entries

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • This patch extends UDP GRO to support fraglist GRO/GSO
    by using the previously introduced infrastructure.
    If the feature is enabled, all UDP packets are going to
    fraglist GRO (local input and forward).

    After validating the csum, we mark ip_summed as
    CHECKSUM_UNNECESSARY for fraglist GRO packets to
    make sure that the csum is not touched.

    Signed-off-by: Steffen Klassert
    Reviewed-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Steffen Klassert
     
  • The current implementations of ops->bind_class() are merely
    searching for classid and updating class in the struct tcf_result,
    without invoking either of cl_ops->bind_tcf() or
    cl_ops->unbind_tcf(). This breaks their design, as qdiscs like cbq
    use them to count filters too. This is why syzbot triggered
    the warning in cbq_destroy_class().

    In order to fix this, we have to call cl_ops->bind_tcf() and
    cl_ops->unbind_tcf() like the filter binding path. This patch does
    so by refactoring out two helper functions __tcf_bind_filter()
    and __tcf_unbind_filter(), which are lockless and accept a Qdisc
    pointer, then teaching each implementation to call them correctly.

    Note, we merely pass the Qdisc pointer as an opaque pointer to
    each filter, they only need to pass it down to the helper
    functions without understanding it at all.

    Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class")
    Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • This new set type allows for intervals in concatenated fields,
    which are expressed in the usual way, that is, simple byte
    concatenation with padding to 32 bits for single fields, and
    given as ranges by specifying start and end elements containing,
    each, the full concatenation of start and end values for the
    single fields.

    Ranges are expanded to composing netmasks, for each field: these
    are inserted as rules in per-field lookup tables. Bits to be
    classified are divided in 4-bit groups, and for each group, the
    lookup table contains 2^4 = 16 buckets, representing all the possible
    values of a bit group. This approach was inspired by the Grouper
    algorithm:
    http://www.cse.usf.edu/~ligatti/projects/grouper/

    Matching is performed by a sequence of AND operations between
    bucket values, with buckets selected according to the value of
    packet bits, for each group. The result of this sequence tells
    us which rules matched for a given field.
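
    As a rough illustration of the bucket lookup for a single field, here is
    a minimal sketch for an 8-bit field (two 4-bit groups) with up to 32
    rules kept in a bitmap. All names are hypothetical and this is not the
    nft_set_pipapo.c implementation; the caller is assumed to have already
    expanded each range into value/mask pairs:

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_GROUPS  2  /* an 8-bit field split into two 4-bit groups */
#define MOCK_BUCKETS 16 /* one bucket per possible value of a group */

struct mock_field_table {
	/* Bit r of a bucket is set if rule r accepts that group value. */
	uint32_t bucket[MOCK_GROUPS][MOCK_BUCKETS];
};

/* Insert rule number `rule`, matching (key & mask) == (value & mask). */
static void mock_insert(struct mock_field_table *t, unsigned int rule,
			uint8_t value, uint8_t mask)
{
	assert(t != 0 && rule < 32);
	for (int g = 0; g < MOCK_GROUPS; g++) {
		int shift = 4 * (MOCK_GROUPS - 1 - g);
		unsigned int v = (value >> shift) & 0xf;
		unsigned int m = (mask >> shift) & 0xf;

		for (unsigned int b = 0; b < MOCK_BUCKETS; b++)
			if ((b & m) == (v & m))
				t->bucket[g][b] |= 1u << rule;
	}
}

/* AND together the buckets selected by each 4-bit group of the key. */
static uint32_t mock_lookup(const struct mock_field_table *t, uint8_t key)
{
	uint32_t res = ~0u;

	for (int g = 0; g < MOCK_GROUPS; g++) {
		int shift = 4 * (MOCK_GROUPS - 1 - g);

		res &= t->bucket[g][(key >> shift) & 0xf];
	}
	return res; /* bitmap of matching rules, 0 if none */
}
```

    For example, the range 16-25 expands to 0x10/0xf8 (16-23) and 0x18/0xfe
    (24-25). Inserted as rules 0 and 1, a lookup of 20 returns bit 0, a
    lookup of 25 returns bit 1, and a lookup of 26 returns no bits.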

    In order to concatenate several ranged fields, per-field rules
    are mapped using mapping arrays, one per field, that specify
    which rules should be considered while matching the next field.
    The mapping array for the last field contains a reference to
    the element originally inserted.

    The notes in nft_set_pipapo.c cover the algorithm in deeper
    detail.

    A pure hash-based approach is of no use here, as ranges need
    to be classified. An implementation based on "proxying" the
    existing red-black tree set type, creating a tree for each
    field, was considered, but deemed impractical due to the fact
    that elements would need to be shared between trees, at least
    as long as we want to keep UAPI changes to a minimum.

    A stand-alone implementation of this algorithm is available at:
    https://pipapo.lameexcu.se
    together with notes about possible future optimisations
    (in pipapo.c).

    This algorithm was designed with data locality in mind, and can
    be highly optimised for SIMD instruction sets, as the bulk of
    the matching work is done with repetitive, simple bitwise
    operations.

    At this point, without further optimisations, nft_concat_range.sh
    reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
    L2$):

    TEST: performance
    net,port [ OK ]
    baseline (drop from netdev hook): 10190076pps
    baseline hash (non-ranged entries): 6179564pps
    baseline rbtree (match on first field only): 2950341pps
    set with 1000 full, ranged entries: 2304165pps
    port,net [ OK ]
    baseline (drop from netdev hook): 10143615pps
    baseline hash (non-ranged entries): 6135776pps
    baseline rbtree (match on first field only): 4311934pps
    set with 100 full, ranged entries: 4131471pps
    net6,port [ OK ]
    baseline (drop from netdev hook): 9730404pps
    baseline hash (non-ranged entries): 4809557pps
    baseline rbtree (match on first field only): 1501699pps
    set with 1000 full, ranged entries: 1092557pps
    port,proto [ OK ]
    baseline (drop from netdev hook): 10812426pps
    baseline hash (non-ranged entries): 6929353pps
    baseline rbtree (match on first field only): 3027105pps
    set with 30000 full, ranged entries: 284147pps
    net6,port,mac [ OK ]
    baseline (drop from netdev hook): 9660114pps
    baseline hash (non-ranged entries): 3778877pps
    baseline rbtree (match on first field only): 3179379pps
    set with 10 full, ranged entries: 2082880pps
    net6,port,mac,proto [ OK ]
    baseline (drop from netdev hook): 9718324pps
    baseline hash (non-ranged entries): 3799021pps
    baseline rbtree (match on first field only): 1506689pps
    set with 1000 full, ranged entries: 783810pps
    net,mac [ OK ]
    baseline (drop from netdev hook): 10190029pps
    baseline hash (non-ranged entries): 5172218pps
    baseline rbtree (match on first field only): 2946863pps
    set with 1000 full, ranged entries: 1279122pps

    v4:
    - fix build for 32-bit architectures: 64-bit division needs
    div_u64() (kbuild test robot)
    v3:
    - rework interface for field length specification,
    NFT_SET_SUBKEY disappears and information is stored in
    description
    - remove scratch area to store closing element of ranges,
    as elements now come with an actual attribute to specify
    the upper range limit (Pablo Neira Ayuso)
    - also remove pointer to 'start' element from mapping table,
    closing key is now accessible via extension data
    - use bytes right away instead of bits for field lengths,
    this way we can also double the inner loop of the lookup
    function to take care of upper and lower bits in a single
    iteration (minor performance improvement)
    - make it clearer that set operations are actually atomic
    API-wise, but we can't e.g. implement flush() as one-shot
    action
    - fix type for 'dup' in nft_pipapo_insert(), check for
    duplicates only in the next generation, and in general take
    care of differentiating generation mask cases depending on
    the operation (Pablo Neira Ayuso)
    - report C implementation matching rate in commit message, so
    that AVX2 implementation can be compared (Pablo Neira Ayuso)
    v2:
    - protect access to scratch maps in nft_pipapo_lookup() with
    local_bh_disable/enable() (Florian Westphal)
    - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
    already implied (Florian Westphal)
    - explain why partial allocation failures don't need handling
    in pipapo_realloc_scratch(), rename 'm' to clone and update
    related kerneldoc to make it clear we're not operating on
    the live copy (Florian Westphal)
    - add explicit check for priv->start_elem in
    nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
    with a NULL start element, and also zero it out in every
    operation that might make it invalid, so that insertion
    doesn't proceed with an invalid element (Florian Westphal)

    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
     
  • Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
    to specify the length of each field in a set concatenation.

    This allows set implementations to support concatenation of multiple
    ranged items, as they can divide the input key into matching data for
    every single field. Such set implementations would be selected as
    they specify support for NFT_SET_INTERVAL and allow desc->field_count
    to be greater than one. Explicitly disallow this for nft_set_rbtree.

    In order to specify the interval for a set entry, userspace would
    include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass
    range endpoints as two separate keys, represented by attributes
    NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.

    While at it, export the number of 32-bit registers available for
    packet matching, as nftables will need this to know the maximum
    number of field lengths that can be specified.

    For example, "packets with an IPv4 address between 192.0.2.0 and
    192.0.2.42, with destination port between 22 and 25", can be
    expressed as two concatenated elements:

    NFTA_SET_ELEM_KEY: 192.0.2.0 . 22
    NFTA_SET_ELEM_KEY_END: 192.0.2.42 . 25

    and NFTA_SET_DESC_CONCAT attribute would contain:

    NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN: 4
    NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN: 2
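
    To make the key layout concrete, the sketch below checks a concatenated
    key against the two endpoint keys from the example, field by field. It
    assumes fields in network byte order, each padded to 32 bits; the
    function and array names are hypothetical, and a plain per-field
    comparison like this is only an illustration, not how a set backend
    performs lookups:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_FIELDS 2

/* Field lengths in bytes, as carried by NFTA_SET_FIELD_LEN: IPv4, port. */
static const uint8_t mock_field_len[MOCK_FIELDS] = { 4, 2 };

/* Each field occupies a multiple of 4 bytes in the concatenated key. */
static size_t mock_padded_len(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* True if every field of key lies within [start, end] for that field. */
static bool mock_key_in_range(const uint8_t *key, const uint8_t *start,
			      const uint8_t *end)
{
	size_t off = 0;

	assert(key != NULL && start != NULL && end != NULL);
	for (int i = 0; i < MOCK_FIELDS; i++) {
		size_t len = mock_field_len[i];

		/* Big-endian bytes compare numerically with memcmp(). */
		if (memcmp(key + off, start + off, len) < 0 ||
		    memcmp(key + off, end + off, len) > 0)
			return false;
		off += mock_padded_len(len);
	}
	return true;
}
```

    With start = 192.0.2.0 . 22 and end = 192.0.2.42 . 25, the key
    192.0.2.10 . 23 is inside the interval, while 192.0.2.43 . 23 and
    192.0.2.10 . 26 are not.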

    v4: No changes
    v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
    v2: No changes

    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
     
  • Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
    interval between kernel and userspace.

    This patch also adds the NFT_SET_EXT_KEY_END extension to store the
    closing element value in this interval.

    v4: No changes
    v3: New patch

    [sbrivio: refactor error paths and labels; add corresponding
    nft_set_ext_type for new key; rebase]
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

26 Jan, 2020

1 commit


25 Jan, 2020

3 commits

  • Invoke ndo_setup_tc as appropriate to signal init / replacement, destroying
    and dumping of TBF Qdisc.

    Signed-off-by: Petr Machata
    Acked-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     
  • We need to initialise the struct ourselves, else we expose tcp-specific
    callbacks such as tcp_splice_read, which will then trigger a splat because
    the socket is an mptcp one:

    BUG: KASAN: slab-out-of-bounds in tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
    Write of size 8 at addr ffff888116aa21d0 by task syz-executor.0/5478

    CPU: 1 PID: 5478 Comm: syz-executor.0 Not tainted 5.5.0-rc6 #3
    Call Trace:
    tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
    tcp_rcv_space_adjust+0x72/0x7f0 net/ipv4/tcp_input.c:612
    tcp_read_sock+0x622/0x990 net/ipv4/tcp.c:1674
    tcp_splice_read+0x20b/0xb40 net/ipv4/tcp.c:791
    do_splice+0x1259/0x1560 fs/splice.c:1205

    To prevent build error with ipv6, add the recv/sendmsg function
    declaration to ipv6.h. The functions are already accessible "thanks"
    to retpoline related work, but they are currently only made visible
    by socket.c specific INDIRECT_CALLABLE macros.

    Reported-by: Christoph Paasch
    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • This patch introduces a list of pending module requests. This new module
    list is composed of nft_module_request objects that contain the module
    name and one status field that tells if the module has been already
    loaded (the 'done' field).

    In the first pass, from the preparation phase, the netlink command finds
    that a module is missing on this list. Then, a module request is
    allocated and added to this list and nft_request_module() returns
    -EAGAIN. This triggers the abort path with the autoload parameter set on
    from nfnetlink, request_module() is called and the module request enters
    the 'done' state. Since the mutex is released when loading modules from
    the abort phase, the module list is zapped so that this iteration occurs
    over a local list. Therefore, the request_module() calls happen when
    object lists are in a consistent state (after fully aborting the
    transaction) and the commit list is empty.

    On the second pass, the netlink command will find that it already tried
    to load the module, so it does not request it again and
    nft_request_module() returns 0. Then, there is a look up to find the
    object that the command was missing. If the module was successfully
    loaded, the command proceeds normally since it finds the missing object
    in place, otherwise -ENOENT is reported to userspace.
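
    The two passes can be modelled with a small userspace sketch. Everything
    here is a hypothetical simplification: a fixed-size array replaces the
    kernel list, the mock_ functions are illustrative names, and the actual
    module load is reduced to flipping the 'done' flag:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MOCK_MAX_REQ  8
#define MOCK_NAME_LEN 32

struct mock_module_request {
	char name[MOCK_NAME_LEN];
	bool done; /* set once a load has been attempted */
};

struct mock_module_list {
	struct mock_module_request req[MOCK_MAX_REQ];
	int count;
};

/* First pass returns -EAGAIN and queues the request; second pass returns 0. */
static int mock_request_module(struct mock_module_list *l, const char *name)
{
	assert(l != NULL && name != NULL);
	for (int i = 0; i < l->count; i++) {
		if (strcmp(l->req[i].name, name) == 0)
			return l->req[i].done ? 0 : -EAGAIN;
	}
	if (l->count == MOCK_MAX_REQ)
		return -ENOMEM;
	snprintf(l->req[l->count].name, MOCK_NAME_LEN, "%s", name);
	l->req[l->count].done = false;
	l->count++;
	return -EAGAIN;
}

/* Abort path with autoload set: attempt every pending load, mark it done. */
static int mock_abort_autoload(struct mock_module_list *l)
{
	int attempted = 0;

	for (int i = 0; i < l->count; i++) {
		if (l->req[i].done)
			continue;
		/* A real implementation would call request_module() here. */
		l->req[i].done = true;
		attempted++;
	}
	return attempted;
}
```

    After the second pass returns 0, the caller still has to look up the
    missing object; whether it is found decides between success and -ENOENT.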

    This patch also updates nfnetlink to include the reason to enter the
    abort phase, which is required for this new autoload module rationale.

    Fixes: ec7470b834fe ("netfilter: nf_tables: store transaction list locally while requesting module")
    Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

24 Jan, 2020

6 commits

  • This implements MP_CAPABLE options parsing and writing according
    to RFC 6824 bis / RFC 8684: MPTCP v1.

    Local key is sent on syn/ack, and both keys are sent on 3rd ack.
    MP_CAPABLE message lengths are updated accordingly. We need the skbuff to
    correctly emit the above, so we push the skbuff struct as an argument
    all the way from tcp code to the relevant mptcp callbacks.

    When processing incoming MP_CAPABLE + data, build a full blown DSS-like
    map info, to simplify later processing. On child socket creation, we
    need to record the remote key, if available.

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     
  • Parses incoming DSS options and populates outgoing MPTCP ACK
    fields. MPTCP fields are parsed from the TCP option header and placed in
    an skb extension, allowing the upper MPTCP layer to access MPTCP
    options after the skb has gone through the TCP stack.

    The subflow implements its own data_ready() ops, which ensures that the
    pending data is in sequence - according to MPTCP seq number - dropping
    out-of-seq skbs. The DATA_READY bit flag is set if this is the case.
    This allows the MPTCP socket layer to determine if more data is
    available without having to consult the individual subflows.

    It additionally validates the current mapping and propagates EoF events
    to the connection socket.

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Peter Krystad
    Signed-off-by: Peter Krystad
    Co-developed-by: Davide Caratti
    Signed-off-by: Davide Caratti
    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • Per-packet metadata required to write the MPTCP DSS option is written to
    the skb_ext area. One write to the socket may contain more than one
    packet of data, which is copied to page fragments and mapped in to MPTCP
    DSS segments with size determined by the available page fragments and
    the maximum mapping length allowed by the MPTCP specification. If
    do_tcp_sendpages() splits a DSS segment in to multiple skbs, that's ok -
    the later skbs can either have duplicated DSS mapping information or
    none at all, and the receiver can handle that.

    The current implementation uses the subflow frag cache and tcp
    sendpages to avoid excessive code duplication. More work is required to
    ensure that it works correctly under memory pressure and to support
    MPTCP-level retransmissions.

    The MPTCP DSS checksum is not yet implemented.

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Peter Krystad
    Signed-off-by: Peter Krystad
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
    to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to
    the final ACK of the three-way handshake.

    Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
    responding SYN-ACK is received and notify the MPTCP connection layer.

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Peter Krystad
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Add hooks to parse and format the MP_CAPABLE option.

    This option is handled according to MPTCP version 0 (RFC6824).
    MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in
    coordination with related code changes.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Davide Caratti
    Signed-off-by: Davide Caratti
    Signed-off-by: Peter Krystad
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Implements the infrastructure for MPTCP sockets.

    MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
    sockets are only managed by the MPTCP socket that owns them and are not
    visible from userspace. This commit allows a userspace program to open
    an MPTCP socket with:

    sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

    The resulting socket is simply a wrapper around a single regular TCP
    socket, without any of the MPTCP protocol implemented over the wire.
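
    A minimal userspace sketch of that call, assuming a libc that may not
    yet define IPPROTO_MPTCP (the value 262 is taken from the kernel UAPI
    headers). The helper open_stream_socket() and its fallback to plain
    TCP are illustrative choices, not part of this commit.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262 /* from the kernel UAPI headers; older libcs lack it */
#endif

/*
 * Hypothetical helper: try to open an MPTCP socket and fall back to a
 * regular TCP socket if the kernel rejects the protocol.
 * Sets *is_mptcp to 1 when the MPTCP socket was created, 0 otherwise.
 * Returns the file descriptor, or -1 with errno set if both attempts fail.
 */
static int open_stream_socket(int *is_mptcp)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

    if (fd >= 0) {
        *is_mptcp = 1;
        return fd;
    }

    /* MPTCP is unavailable or disabled: use plain TCP instead. */
    *is_mptcp = 0;
    return socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
}
```

    The fallback keeps an application usable on kernels built without
    MPTCP support; the caller should still close() the descriptor when
    done.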

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Peter Krystad
    Signed-off-by: Peter Krystad
    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Mat Martineau
     

23 Jan, 2020

9 commits

  • Principles:
    - Packets are classified on flows.
    - This is a Stochastic model (as we use a hash, several flows might
    be hashed to the same slot)
    - Each flow has a PIE managed queue.
    - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.
    - For a given flow, packets are not reordered.
    - Drops during enqueue only.
    - ECN capability is off by default.
    - ECN threshold (if ECN is enabled) is at 10% by default.
    - Uses timestamps to calculate queue delay by default.

    Usage:
    tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
    [ target TIME ] [ tupdate TIME ]
    [ alpha NUMBER ] [ beta NUMBER ]
    [ quantum BYTES ] [ memory_limit BYTES ]
    [ ecnprob PERCENTAGE ] [ [no]ecn ]
    [ [no]bytemode ] [ [no_]dq_rate_estimator ]

    defaults:
    limit: 10240 packets, flows: 1024
    target: 15 ms, tupdate: 15 ms (in jiffies)
    alpha: 1/8, beta: 5/4
    quantum: device MTU, memory_limit: 32 MB
    ecnprob: 10%, ecn: off
    bytemode: off, dq_rate_estimator: off
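
    For orientation, the role of target, alpha and beta can be sketched
    with the basic PIE control law from RFC 8033. This is a floating-point
    simplification with delays in seconds; the in-kernel code uses
    fixed-point arithmetic and differs in detail. struct pie_sketch and
    pie_update_prob() are hypothetical names.

```c
/* Hypothetical per-flow PIE state for this sketch. */
struct pie_sketch {
    double prob;       /* current drop probability, kept in [0, 1] */
    double qdelay_old; /* queue delay seen at the previous update (s) */
};

/*
 * One periodic update (run every tupdate) of the drop probability:
 *   prob += alpha * (qdelay - target) + beta * (qdelay - qdelay_old)
 * The result is clamped to [0, 1] and returned.
 */
static double pie_update_prob(struct pie_sketch *s, double qdelay,
                              double target, double alpha, double beta)
{
    double delta = alpha * (qdelay - target) +
                   beta * (qdelay - s->qdelay_old);

    s->prob += delta;
    if (s->prob < 0.0)
        s->prob = 0.0;
    if (s->prob > 1.0)
        s->prob = 1.0;

    s->qdelay_old = qdelay;
    return s->prob;
}
```

    With the defaults above (target 15 ms, alpha 1/8, beta 5/4), a queue
    delay that rises from 0 to 30 ms in one interval raises the
    probability from 0 to about 0.039 in this sketch.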

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Sachin D. Patil
    Signed-off-by: V. Saicharan
    Signed-off-by: Mohit Bhasi
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     
  • This patch makes the drop_early(), calculate_probability() and
    pie_process_dequeue() functions generic enough to be used by
    both PIE and FQ-PIE (to be added in a future commit). The major
    change here is in the way the functions take in arguments. This
    patch exports these functions and makes FQ-PIE dependent on
    sch_pie.

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     
  • Improve the comments along with the commenting style used to
    describe the members of the structures and their initial values
    in the init functions.

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     
  • Rearrange the members of the structure such that closely
    referenced members appear together and/or fit in the same
    cacheline. Also, change the order of their initializations to
    match the order in which they appear in the structure.
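
    The effect of member ordering can be illustrated with two made-up
    structures (not the actual PIE structures). Grouping members by size
    removes padding, which makes it easier to keep related members within
    one cacheline.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with small members scattered between large ones. */
struct scattered {
    uint8_t  flag_a;
    uint64_t counter;
    uint8_t  flag_b;
    uint32_t small;
};

/* Same members, ordered from largest to smallest. */
struct grouped {
    uint64_t counter;
    uint32_t small;
    uint8_t  flag_a;
    uint8_t  flag_b;
};

/* Bytes saved by the grouped ordering on the current ABI. */
static size_t layout_saving(void)
{
    return sizeof(struct scattered) - sizeof(struct grouped);
}
```

    On a typical 64-bit ABI the first layout occupies 24 bytes and the
    second 16; the exact figures depend on the platform's alignment
    rules.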

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     
  • Linux best practice recommends using u8 for true/false values in
    structures.

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     
  • Rearrange macros in order of length and align the values to
    improve readability.

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     
  • Use the U64_MAX macro to denote the constant (2^64 - 1).

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     
  • This patch moves macros, structures and small functions common
    to PIE and FQ-PIE (to be added in a future commit) from the file
    net/sched/sch_pie.c to the header file include/net/pie.h.
    All the moved functions are made inline.

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-01-22

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 92 non-merge commits during the last 16 day(s) which contain
    a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).

    The main changes are:

    1) function by function verification and program extensions from Alexei.

    2) massive cleanup of selftests/bpf from Toke and Andrii.

    3) batched bpf map operations from Brian and Yonghong.

    4) tcp congestion control in bpf from Martin.

    5) bulking for non-map xdp_redirect from Toke.

    6) bpf_send_signal_thread helper from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller