28 Sep, 2016

4 commits


26 Sep, 2016

6 commits

  • …etooth/bluetooth-next

    Johan Hedberg says:

    ====================
    pull request: bluetooth-next 2016-09-25

    Here are a few more Bluetooth & 802.15.4 patches for the 4.9 kernel that
    have popped up during the past week:

    - New USB ID for QCA_ROME Bluetooth device
    - NULL pointer dereference fix for Bluetooth mgmt sockets
    - Fixes for BCSP driver
    - Fix for updating LE scan response

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Conflicts:
    net/netfilter/core.c
    net/netfilter/nf_tables_netdev.c

    Resolve two conflicts before pull request for David's net-next tree:

    1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant
    ip_hdr assignment") from the net tree and commit ddc8b6027ad0
    ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").

    2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and
    Aaron Conole's patches to replace list_head with single linked list.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • nf_log is used by both nftables and iptables, so using the XT_LOG_XXX
    macros here is not appropriate. Replace them with NF_LOG_XXX.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • The NFTA_LOG_FLAGS attribute is already supported, but the related
    NF_LOG_XXX flags are not exposed to userspace. So we cannot
    explicitly enable log flags to log uid, tcp sequence, ip options
    and so on; for example, a rule such as "nft add rule filter output log uid"
    is not supported yet.

    So move the NF_LOG_XXX macro definitions to uapi/../nf_log.h. To
    stay consistent with other modules, change NF_LOG_MASK to
    refer to all supported log flags, and add a new
    NF_LOG_DEFAULT_MASK to refer to the original default log flags.

    Finally, if the user specifies unsupported log flags, or if NFTA_LOG_GROUP
    and NFTA_LOG_FLAGS are set at the same time, report EINVAL to
    userspace.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
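
The validation rules in the commit above (reject unsupported flags, reject NFTA_LOG_GROUP combined with NFTA_LOG_FLAGS) can be sketched as follows. This is a minimal Python illustration, not the kernel's C code: the flag names mirror the NF_LOG_XXX macros, but the bit values and the function name `check_log_attrs` are assumptions made for the example.

```python
import errno

# Illustrative bit values only; the authoritative ones are in the uapi nf_log.h header.
NF_LOG_TCPSEQ = 0x01
NF_LOG_TCPOPT = 0x02
NF_LOG_IPOPT = 0x04
NF_LOG_UID = 0x08

# In this sketch NF_LOG_MASK covers every flag the example supports.
NF_LOG_MASK = NF_LOG_TCPSEQ | NF_LOG_TCPOPT | NF_LOG_IPOPT | NF_LOG_UID


def check_log_attrs(flags=None, group=None):
    """Return 0 if the attribute combination is acceptable, else -EINVAL.

    `flags` stands in for NFTA_LOG_FLAGS and `group` for NFTA_LOG_GROUP;
    None means the attribute was not supplied.
    """
    if flags is not None and group is not None:
        # The two attributes are mutually exclusive.
        return -errno.EINVAL
    if flags is not None and flags & ~NF_LOG_MASK:
        # At least one bit is outside the supported set.
        return -errno.EINVAL
    return 0
```

With these assumed values, `check_log_attrs(flags=NF_LOG_UID)` is accepted, while passing an unknown bit or both attributes together is rejected.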
     
  • Inverse ranges != [a,b] are not currently possible because rules are
    composites of && operations, and we need to express this:

    data < a || data > b

    This patch adds a new range expression. Positive ranges can already be
    expressed through two cmp expressions:

    cmp(sreg, data, >=)
    cmp(sreg, data, <=)

    Pablo Neira Ayuso
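
To see why an inverse range needs its own expression, here is a small Python sketch of a rule as a conjunction of expressions. The helper names (`rule_matches`, `cmp_ge`, `range_expr`, and so on) and the bounds 10 and 20 are hypothetical and chosen only for illustration; they do not reflect the nf_tables API.

```python
def rule_matches(data, expressions):
    """A rule matches only when every expression matches (logical AND)."""
    return all(expr(data) for expr in expressions)


def cmp_ge(bound):
    return lambda data: data >= bound


def cmp_le(bound):
    return lambda data: data <= bound


def cmp_lt(bound):
    return lambda data: data < bound


def cmp_gt(bound):
    return lambda data: data > bound


def range_expr(low, high, negate=False):
    """Single expression that tests membership in [low, high], or its inverse."""
    def match(data):
        inside = low <= data <= high
        return not inside if negate else inside
    return match


# Positive range [10, 20]: two comparisons joined by AND are enough.
positive = [cmp_ge(10), cmp_le(20)]

# Naive inverse: "data < 10 AND data > 20" can never be true.
naive_inverse = [cmp_lt(10), cmp_gt(20)]

# Inverse range as one expression: data < 10 OR data > 20.
inverse = [range_expr(10, 20, negate=True)]
```

Because the OR is evaluated inside a single expression, the rule itself can remain a plain conjunction.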
     
  • The introduction of the TCP_NEW_SYN_RECV state and the addition of request
    sockets to the ehash table seem to have broken the --transparent option
    of the socket match for IPv6 (around commit a9407000).

    Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
    listener, the --transparent option tries to match on the no_srccheck flag
    of the request socket.

    Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
    by copying the transparent flag of the listener socket. As a result,
    '-m socket --transparent' does not match the ACK packet sent by the
    client in a TCP handshake.

    Based on the suggestion from Eric Dumazet, this change moves the code
    initializing no_srccheck to tcp_conn_request(), which makes the above
    scenario work again.

    Fixes: a940700003 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support")
    Signed-off-by: Alex Badics
    Signed-off-by: KOVACS Krisztian
    Signed-off-by: Pablo Neira Ayuso

    KOVACS Krisztian
     

25 Sep, 2016

22 commits

  • Fabian reports a possible conntrack memory leak (could not reproduce so
    far), however, one minor issue can be easily resolved:

    > cat /proc/net/nf_conntrack | wc -l = 5
    > 4 minutes required to clean up the table.

    We should not report those timed-out entries to the user in the first place.
    And instead of just skipping those timed-out entries while iterating over
    the table we can also zap them (we already do this during ctnetlink
    walks, but I forgot about the /proc interface).

    Fixes: f330a7fdbe16 ("netfilter: conntrack: get rid of conntrack timer")
    Reported-by: Fabian Frederick
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Create a new revision for the hashlimit iptables extension module. Rev 2
    will support higher rates of up to 1 million pps; rev 1 supports only 10k.

    To support this we have to increase the size of the variables avg and
    burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
    and xt_hashlimit_mtinfo2 and also create newer versions of all the
    functions for match, checkentry and destroy.

    Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
    similar in both rev1 and rev2 with only minor changes, so I have split
    those functions and moved all the common code to a *_common function.

    Signed-off-by: Vishwanath Pai
    Signed-off-by: Joshua Hunt
    Signed-off-by: Pablo Neira Ayuso

    Vishwanath Pai
     
  • I am planning to add a revision 2 for the hashlimit xtables module to
    support higher packets per second rates. This patch renames all the
    functions and variables related to revision 1 by adding _v1 at the
    end of the names.

    Signed-off-by: Vishwanath Pai
    Signed-off-by: Joshua Hunt
    Signed-off-by: Pablo Neira Ayuso

    Vishwanath Pai
     
  • NFT_CT_MARK is unrelated to direction, so if NFTA_CT_DIRECTION attr is
    specified, report EINVAL to the userspace. This validation check was
    already done at nft_ct_get_init, but we missed it in nft_ct_set_init.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • Currently, if the user wants to match ct l3proto, the direction must be
    specified, for example:
    # nft add rule filter input ct original l3proto ipv4
                                   ^^^^^^^^
    Otherwise, an error message is reported:
    # nft add rule filter input ct l3proto ipv4
    nft add rule filter input ct l3proto ipv4
    :1:1-38: Error: Could not process rule: Invalid argument
    add rule filter input ct l3proto ipv4
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Actually, there is no need to require the NFTA_CT_DIRECTION attr, because
    ct l3proto and protocol are unrelated to direction.

    For compatibility, even if the user specifies the NFTA_CT_DIRECTION
    attr, do not report an error; just skip it.

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • A TCP RST packet may validly have the ACK flag cleared and an ack number
    of zero. The current seqadj code, however, adjusts that zero ack to an
    invalid ack number. seqadj needs to check the ACK flag before adjusting
    the ack number of such RST packets.

    The following is my test case.

    The client is 10.26.98.245, with one iptables rule added:
    iptables -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
    --connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
    tcp-reset
    This iptables rule generates one TCP RST without the ACK flag.

    The server is 10.172.135.55.
    Enable the synproxy with seqadjust using the following iptables rules:
    iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
    -m tcp --syn -j CT --notrack

    iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
    --ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
    --mss 1460
    iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
    --ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT

    The following is my test result.

    1. packet trace on client
    root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
    win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
    length 0
    IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
    ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
    nop,wscale 7], length 0
    IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
    options [nop,nop,TS val 452367885 ecr 15643479], length 0
    IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
    options [nop,nop,TS val 15643479 ecr 452367885], length 0
    IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
    win 0, length 0

    2. seqadj log on server
    [62873.867319] Adjusting sequence number from 602341895->546723267,
    ack from 3695959830->3695959830
    [62873.867644] Adjusting sequence number from 602341895->546723267,
    ack from 3695959830->3695959830
    [62873.869040] Adjusting sequence number from 3695959830->3695959830,
    ack from 0->55618628

    To summarize, it is clear that the seqadj code adjusts the zero ack when it
    receives a TCP RST packet without the ACK flag.

    Signed-off-by: Gao Feng
    Signed-off-by: Pablo Neira Ayuso

    Gao Feng
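
The fix described above amounts to guarding the ack rewrite with a check of the ACK flag. The Python sketch below is an illustration under assumptions, not the conntrack seqadj code: the function name `adjust_ack` and its parameters are hypothetical, and the offset 55618628 is simply the difference between the two sequence numbers in the server log (602341895 - 546723267).

```python
MASK32 = 0xFFFFFFFF

# Offset implied by the log lines: 602341895 - 546723267.
SYNPROXY_OFFSET = 602341895 - 546723267


def adjust_ack(ack_seq, offset, ack_flag_set, check_flag=True):
    """Return the rewritten 32-bit ack number.

    With check_flag=True (the fixed behaviour), a packet whose ACK flag is
    clear keeps its ack number untouched. With check_flag=False the sketch
    reproduces the reported bug and always applies the offset.
    """
    if check_flag and not ack_flag_set:
        return ack_seq
    return (ack_seq + offset) & MASK32
```

Without the flag check, the zero ack of the RST is rewritten to 55618628, which matches the last log line; with the check it stays zero.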
     
  • The netfilter hook list never uses the prev pointer, and so can be trimmed to
    be a simple singly-linked list.

    In addition to having a more lightweight structure for hook traversal,
    struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
    2176 bytes (down from 2240).

    Signed-off-by: Aaron Conole
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Aaron Conole
     
  • …git/dhowells/linux-fs

    David Howells says:

    ====================
    rxrpc: Implement slow-start and other bits

    This set of patches implements the RxRPC slow-start feature for AF_RXRPC to
    improve performance and handling of occasional packet loss. This is more or
    less the same as TCP slow start [RFC 5681]. Firstly, there are some ACK
    generation improvements:

    (1) Send ACKs regularly to apprise the peer of our state so that they can do
    congestion management of their own.

    (2) Send an ACK when we fill in a hole in the buffer so that the peer can
    find out that we did this thus forestalling retransmission.

    (3) Note the final DATA packet's serial number in the final ACK for
    correlation purposes.

    and a couple of bug fixes:

    (4) Reinitialise the ACK state and clear the ACK and resend timers upon
    entering the client reply reception phase to kill off any pending probe
    ACKs.

    (5) Delay the resend timer to allow for nsec->jiffies conversion errors.

    and then there's the slow-start pieces:

    (6) Summarise an ACK.

    (7) Schedule a PING or IDLE ACK if the reply to a client call is overdue to
    try and find out what happened to it.

    (8) Implement the slow start feature.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Eliminate a sparse endianness mismatch warning, use nla_get_be32() to
    extract a __be32 value instead of nla_get_u32().

    Signed-off-by: Lance Richardson
    Signed-off-by: David S. Miller

    Lance Richardson
     
  • Implement RxRPC slow-start, which is similar to RFC 5681 for TCP. A
    tracepoint is added to log the state of the congestion management algorithm
    and the decisions it makes.

    Notes:

    (1) Since we send fixed-size DATA packets (apart from the final packet in
    each phase), counters and calculations are in terms of packets rather
    than bytes.

    (2) The ACK packet carries the equivalent of TCP SACK.

    (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
    suited to SACK of a small number of packets. It seems that, almost
    inevitably, by the time three 'duplicate' ACKs have been seen, we have
    narrowed the loss down to one or two missing packets, and the
    FLIGHT_SIZE calculation ends up as 2.

    (4) In rxrpc_resend(), if there was no data that apparently needed
    retransmission, we transmit a PING ACK to ask the peer to tell us what
    its Rx window state is.

    Signed-off-by: David Howells

    David Howells
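
As a rough illustration of congestion control counted in packets rather than bytes (note 1 above), here is a simplified, RFC 5681-style window in Python. It is a sketch under assumptions, not the rxrpc implementation: the class name, the initial values and the choice to set cwnd to ssthresh on loss are all simplifications made for the example.

```python
class PacketCongestionWindow:
    """Slow start and congestion avoidance with all quantities in packets."""

    def __init__(self, initial_cwnd=1, ssthresh=16):
        self.cwnd = initial_cwnd
        self.ssthresh = ssthresh
        self._acked_in_avoidance = 0

    def on_packet_acked(self):
        if self.cwnd < self.ssthresh:
            # Slow start: grow by one packet for each newly ACKed packet.
            self.cwnd += 1
        else:
            # Congestion avoidance: grow by one packet per window's worth of ACKs.
            self._acked_in_avoidance += 1
            if self._acked_in_avoidance >= self.cwnd:
                self._acked_in_avoidance = 0
                self.cwnd += 1

    def on_loss(self, packets_in_flight):
        # Packet-based analogue of ssthresh = max(FlightSize / 2, 2 * SMSS).
        self.ssthresh = max(packets_in_flight // 2, 2)
        # Simplification: resume directly from the new threshold.
        self.cwnd = self.ssthresh
        self._acked_in_avoidance = 0
```

When only a couple of packets are in flight at the time of loss, the threshold in this sketch bottoms out at 2.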
     
  • If we've sent all the request data in a client call but haven't seen any
    sign of the reply data yet, schedule an ACK to be sent to the server to
    find out if the reply data got lost.

    If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
    demand a response to find out whether we need to retransmit.

    If the server says it has received all of the data, we send an IDLE ACK to
    tell the server that we haven't received anything in the receive phase as
    yet.

    To make this work, a non-immediate PING ACK must carry a delay. I've chosen
    the same as the IDLE ACK for the moment.

    Signed-off-by: David Howells

    David Howells
     
  • Generate a summary of the Tx buffer packet state when an ACK is received
    for use in a later patch that does congestion management.

    Signed-off-by: David Howells

    David Howells
     
  • When determining the resend timer value, we have a value in nsec but the
    timer is in jiffies, which may be a million or more times coarser.
    nsecs_to_jiffies() rounds down - which means that the resend timeout
    expressed as jiffies is very likely earlier than the one expressed as
    nanoseconds from which it was derived.

    The problem is that rxrpc_resend() gets triggered by the timer, but can't
    then find anything to resend yet. It sets the timer again - but gets
    kicked off immediately again and again until the nanosecond-based expiry
    time is reached and we actually retransmit.

    Fix this by adding 1 to the jiffies-based resend_at value to counteract the
    rounding and make sure that the timer happens after the nanosecond-based
    expiry is passed.

    Alternatives would be to adjust the timestamp on the packets to align
    with the jiffy scale or to switch back to using jiffy timestamps.

    Signed-off-by: David Howells

    David Howells
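
The rounding problem in the resend-timer commit above can be shown with a little arithmetic. The Python sketch below assumes a tick rate of HZ = 250 purely for the example; `resend_delay_jiffies` is a hypothetical helper, and `nsecs_to_jiffies` here is a simplified stand-in that only models the round-down behaviour described in the commit message.

```python
NSEC_PER_SEC = 1_000_000_000
HZ = 250  # assumed tick rate for this example (one jiffy = 4 ms)


def nsecs_to_jiffies(nsecs):
    """Convert nanoseconds to jiffies, rounding down."""
    return nsecs * HZ // NSEC_PER_SEC


def jiffies_to_nsecs(jiffies):
    """Convert jiffies back to nanoseconds (exact for HZ values dividing 1e9)."""
    return jiffies * NSEC_PER_SEC // HZ


def resend_delay_jiffies(delay_ns, add_one=True):
    """Timer delay in jiffies; add_one=True applies the +1 from the fix."""
    jiffies = nsecs_to_jiffies(delay_ns)
    return jiffies + 1 if add_one else jiffies


delay_ns = 5_000_000  # the packet should be resent 5 ms from now

# Rounded down, the timer fires after 1 jiffy (4 ms), before the 5 ms expiry.
early = jiffies_to_nsecs(resend_delay_jiffies(delay_ns, add_one=False))

# With +1 the timer fires after 2 jiffies (8 ms), safely past the expiry.
late_enough = jiffies_to_nsecs(resend_delay_jiffies(delay_ns))
```

The cost of the fix is a resend that can be up to one jiffy late, which is cheaper than a timer that refires repeatedly with nothing to do.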
     
  • Clear the ACK reason, ACK timer and resend timer when entering the client
    reply phase when the first DATA packet is received. New ACKs will be
    proposed once the data is queued.

    The resend timer is no longer relevant and we need to cancel ACKs scheduled
    to probe for a lost reply.

    Signed-off-by: David Howells

    David Howells
     
  • In a client call, include the serial number of the last DATA packet of the
    reply in the final ACK.

    Signed-off-by: David Howells

    David Howells
     
  • Send an immediate ACK if we fill in a hole in the buffer left by an
    out-of-sequence packet. This may allow the congestion management in the peer
    to avoid a retransmission if packets got reordered on the wire.

    Signed-off-by: David Howells

    David Howells
     
  • This commit adds an upfront check for sane values to be passed when
    registering a netfilter hook. This will be used in a future patch for a
    simplified hook list traversal.

    Signed-off-by: Aaron Conole
    Signed-off-by: Pablo Neira Ayuso

    Aaron Conole
     
  • All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
    cleanup removes the recursive call. This is just a cleanup, as the locking
    code gracefully handles this situation.

    Signed-off-by: Aaron Conole
    Signed-off-by: Pablo Neira Ayuso

    Aaron Conole
     
  • This commit ensures that the rcu read-side lock is held while the
    ingress hook is called. This ensures that a call to nf_hook_slow (and
    ultimately nf_ingress) will be read protected.

    Signed-off-by: Aaron Conole
    Signed-off-by: Pablo Neira Ayuso

    Aaron Conole
     
  • This replaces the last uses of NF_HOOK_THRESH().
    Followup patch will remove it and rename nf_hook_thresh.

    The reason is that inet (non-bridge) netfilter no longer invokes the
    hooks from hooks, so we no longer need the thresh value to skip hooks
    with a lower priority.

    The bridge netfilter however may need to do this. br_nf_hook_thresh is a
    wrapper that is supposed to do this, i.e. only call hooks with a
    priority that exceeds NF_BR_PRI_BRNF.

    It's used only in the recursion cases of br_netfilter. It invokes
    nf_hook_slow while holding an rcu read-side critical section to make a
    future cleanup simpler.

    Signed-off-by: Florian Westphal
    Signed-off-by: Aaron Conole
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The original code performs two condition checks, one on
    dst_mtu(skb_dst(skb)) and one on in_mtu, and its last statement is
    "min(dst_mtu(skb_dst(skb)), in_mtu) - minlen". This can leave the reader
    wondering whether the result could be negative.

    Now assign the result of min(dst_mtu(skb_dst(skb)), in_mtu) to a new
    variable and perform only one condition check on it, which is more readable.

    Signed-off-by: Gao Feng
    Signed-off-by: Pablo Neira Ayuso

    Gao Feng
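
The shape of the refactoring above can be sketched in a few lines of Python. This is an illustration only: the function name `clamped_mss`, the use of `None` for the error path and the exact comparison (`<=`) are assumptions, since the commit message does not show the original conditions.

```python
def clamped_mss(dst_mtu, in_mtu, minlen):
    """Return min(dst_mtu, in_mtu) - minlen, or None if the MTU is unusable.

    Computing the minimum once and checking it once makes it obvious that
    the subtraction below cannot produce a negative result.
    """
    min_mtu = min(dst_mtu, in_mtu)
    if min_mtu <= minlen:
        return None
    return min_mtu - minlen
```

For example, `clamped_mss(1500, 1400, 40)` yields 1360, and any case where the smaller MTU does not exceed `minlen` is rejected before the subtraction.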
     
  • Send an ACK if we haven't sent one for the last two packets we've received.
    This keeps the other end apprised of where we've got to - which is
    important if they're doing slow-start.

    We do this in recvmsg so that we can dispatch a packet directly without the
    need to wake up the background thread.

    This should possibly be made configurable in future.

    Signed-off-by: David Howells

    David Howells
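
The "ACK at least every two packets" policy above can be modelled as a simple counter. The Python sketch below is hypothetical (the class `AckPacer` and its method names are not rxrpc identifiers); it takes the threshold as a parameter because the commit notes that it might become configurable.

```python
class AckPacer:
    """Decide when an ACK is due based on packets received since the last ACK."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.unacked = 0

    def on_packet_received(self):
        """Count one received packet; return True when an ACK should be sent now."""
        self.unacked += 1
        if self.unacked >= self.threshold:
            self.unacked = 0
            return True
        return False

    def on_ack_sent_elsewhere(self):
        """Reset the count when some other path has already sent an ACK."""
        self.unacked = 0
```

With the default threshold, every second received packet triggers an ACK unless another ACK was sent in between.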
     

24 Sep, 2016

3 commits

  • …git/dhowells/linux-fs

    David Howells says:

    ====================
    rxrpc: Bug fixes and tracepoints

    Here are a bunch of bug fixes:

    (1) Need to set the timestamp on a Tx packet before queueing it to avoid
    trouble with the retransmission function.

    (2) Don't send an ACK at the end of the service reply transmission; it's
    the responsibility of the client to send an ACK to close the call.
    The service can resend the last DATA packet or send a PING ACK.

    (3) Wake sendmsg() on abnormal call termination.

    (4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets.

    (5) Use before_eq() & co. to compare serial numbers (which may wrap).

    (6) Start the resend timer on DATA packet transmission.

    (7) Don't accidentally cancel a retransmission upon receiving a NACK.

    (8) Fix the call timer setting function to deal with timeouts that are now
    or past.

    (9) Don't use a flag to communicate the presence of the last packet in the
    Tx buffer from sendmsg to the input routines where ACK and DATA
    reception is handled. The problem is that there's a window between
    queueing the last packet for transmission and setting the flag in
    which ACKs or reply DATA packets can arrive, causing apparent state
    machine violation issues.

    Instead use the annotation buffer to mark the last packet and pick up
    and set the flag in the input routines.

    (10) Don't call the tx_ack tracepoint and don't allocate a serial number if
    someone else nicked the ACK we were about to transmit.

    There are also new tracepoints and one altered tracepoint used to track
    down the above bugs:

    (11) Call timer tracepoint.

    (12) Data Tx tracepoint (and adjustments to ACK tracepoint).

    (13) Injected Rx packet loss tracepoint.

    (14) Ack proposal tracepoint.

    (15) Retransmission selection tracepoint.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
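
Item (5) above concerns comparing serial numbers that may wrap. A common way to do this is to interpret the 32-bit difference as a signed value. The Python sketch below follows that approach; `before_eq` is named in the pull request, while `before` and the helper `_as_s32` are assumed here for illustration and are not taken from the rxrpc source.

```python
def _as_s32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def before(a, b):
    """True if serial number a comes strictly before b, allowing for wrap."""
    return _as_s32(a - b) < 0


def before_eq(a, b):
    """True if serial number a comes before or equals b, allowing for wrap."""
    return _as_s32(a - b) <= 0
```

A plain `a < b` would say that 0xFFFFFFFF comes after 0, whereas `before(0xFFFFFFFF, 0)` correctly treats 0 as the serial number that follows the wrap.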
     
  • Steffen Klassert says:

    ====================
    pull request (net-next): ipsec-next 2016-09-23

    Only two patches this time:

    1) Fix a comment reference to struct xfrm_replay_state_esn.
    From Richard Guy Briggs.

    2) Convert xfrm_state_lookup to rcu, we don't need the
    xfrm_state_lock anymore in the input path.
    From Florian Westphal.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Introduce a new rtnl UAPI that exposes a list of vlans per VF, giving
    user-space applications the ability to specify it for the VF, as an
    option to support 802.1ad.
    We adjusted the IP Link tool to support this option.

    For future use cases, the new UAPI supports multiple vlans. For now we
    limit the list size to a single vlan in kernel.
    Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
    compatibility with older versions of IP Link tool.

    Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
    We kept 802.1Q as the drivers' default vlan protocol.
    Suitable ip link tool command examples:
    Set vf vlan protocol 802.1ad:
    ip link set eth0 vf 1 vlan 100 proto 802.1ad
    Set vf to VST (802.1Q) mode:
    ip link set eth0 vf 1 vlan 100 proto 802.1Q
    Or by omitting the new parameter:
    ip link set eth0 vf 1 vlan 100

    Signed-off-by: Moshe Shemesh
    Signed-off-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Moshe Shemesh
     

23 Sep, 2016

5 commits