09 Sep, 2016

5 commits

  • Over the years, TCP BDP has increased by several orders of magnitude,
    and some people are considering reaching the 2 Gbyte limit.

    Even with the current window scale limit of 14, ~1 Gbyte maps to ~740,000
    MSS.

    In the presence of packet losses (or reorders), TCP stores incoming
    packets in an out-of-order queue, and the number of skbs sitting there
    waiting for the missing packets to be received can be in the 10^5 range.

    Most packets are appended to the tail of this queue, and when
    packets can finally be transferred to receive queue, we scan the queue
    from its head.

    However, in the presence of heavy losses, we might have to find an
    arbitrary point in this queue, involving a linear scan for every incoming
    packet and thrashing CPU caches.

    This patch converts it to an RB tree to get bounded latencies.
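    As a rough user-space illustration of the data-structure change (not the
    kernel code itself: POSIX tsearch()/tfind() stand in for the kernel
    rb_node API, and ofo_seg/ofo_insert/ofo_find are invented names), ordered
    inserts and lookups stay O(log n) instead of a linear list walk:

```c
#include <search.h>
#include <stdlib.h>

/* Hypothetical stand-in for an out-of-order segment, keyed by sequence number. */
struct ofo_seg {
        unsigned int seq;
};

static int seg_cmp(const void *a, const void *b)
{
        const struct ofo_seg *x = a, *y = b;
        return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Insert in O(log n) instead of walking a list head-to-tail. */
static struct ofo_seg *ofo_insert(void **root, unsigned int seq)
{
        struct ofo_seg *seg = malloc(sizeof(*seg));
        if (!seg)
                return NULL;
        seg->seq = seq;
        struct ofo_seg **slot = tsearch(seg, root, seg_cmp);
        if (*slot != seg) {     /* duplicate segment already queued */
                free(seg);
                return *slot;
        }
        return seg;
}

static struct ofo_seg *ofo_find(void **root, unsigned int seq)
{
        struct ofo_seg key = { .seq = seq };
        void *slot = tfind(&key, root, seg_cmp);
        return slot ? *(struct ofo_seg **)slot : NULL;
}
```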

    Yaogong wrote a preliminary patch about 2 years ago.
    Eric did the rebase, added ofo_last_skb cache, polishing and tests.

    Tested with the network dropping between 1 and 10 % of packets, with good
    success (about a 30 % throughput increase in stress tests).

    The next step would be to also use an RB tree for the write queue on the
    sender side ;)

    Signed-off-by: Yaogong Wang
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Ilpo Järvinen
    Acked-By: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Yaogong Wang
     
  • This simplifies the use of double-tagged vlans. The new function allows
    all valid vlan ethertypes to be checked in a single function call.
    Also replace some instances that check for both ETH_P_8021Q and
    ETH_P_8021AD.
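    A minimal user-space sketch of such a check (the function body is a
    paraphrase, not the kernel code; for simplicity it takes the ethertype in
    host byte order, whereas the kernel helper takes a __be16):

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100  /* 802.1Q VLAN tag */
#define ETH_P_8021AD 0x88A8  /* 802.1ad service VLAN (QinQ outer tag) */

/* One place to ask "is this ethertype any valid VLAN tag?" instead of
 * open-coding the ETH_P_8021Q || ETH_P_8021AD pair at every call site. */
static bool eth_type_vlan(uint16_t ethertype)
{
        switch (ethertype) {
        case ETH_P_8021Q:
        case ETH_P_8021AD:
                return true;
        default:
                return false;
        }
}
```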

    Patch based on one originally by Thomas F Herbert.

    Signed-off-by: Thomas F Herbert
    Signed-off-by: Eric Garver
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Eric Garver
     
  • openvswitch: Add support for 802.1AD

    Change the description of the VLAN tpid field.

    Signed-off-by: Thomas F Herbert
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thomas F Herbert
     
  • This adds the capability for a process that has CAP_NET_ADMIN on
    a socket to see the socket mark in socket dumps.

    Commit a52e95abf772 ("net: diag: allow socket bytecode filters to
    match socket marks") recently gave privileged processes the
    ability to filter socket dumps based on mark. This patch is
    complementary: it ensures that the mark is also passed to
    userspace in the socket's netlink attributes. It is useful for
    tools like ss which display information about sockets.

    Tested: https://android-review.googlesource.com/270210
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     
  • Steffen Klassert says:

    ====================
    ipsec-next 2016-09-08

    1) Constify the xfrm_replay structures. From Julia Lawall

    2) Protect xfrm state hash tables with rcu, lookups
    can be done now without acquiring xfrm_state_lock.
    From Florian Westphal.

    3) Protect xfrm policy hash tables with rcu, lookups
    can be done now without acquiring xfrm_policy_lock.
    From Florian Westphal.

    4) We don't need to have a garbage collector list per
    namespace anymore, so use a global one instead.
    From Florian Westphal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

08 Sep, 2016

2 commits


07 Sep, 2016

4 commits

  • Add a tracepoint for working out where local aborts happen. Each
    tracepoint call is labelled with a 3-letter code so that the call sites
    can be distinguished, and the DATA sequence number is added too where
    available.

    rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
    indicate the circumstances when it aborts a call.

    Signed-off-by: David Howells

    David Howells
     
  • Improve the call tracking tracepoint by showing more differentiation
    between some of the put and get events, including:

    (1) Getting and putting refs for the socket call user ID tree.

    (2) Getting and putting refs for queueing and failing to queue the call
    processor work item.

    Note that these aren't necessarily used in this patch, but will be taken
    advantage of in future patches.

    An enum is added for the event subtype numbers rather than coding them
    directly as decimal numbers and a table of 3-letter strings is provided
    rather than a sequence of ?: operators.

    Signed-off-by: David Howells

    David Howells
     
  • …l/git/dhowells/linux-fs

    David Howells says:

    ====================
    rxrpc: Small fixes

    Here's a set of small fix patches:

    (1) Fix some uninitialised variables.

    (2) Set the client call state before making it live by attaching it to the
    conn struct.

    (3) Randomise the epoch and starting client conn ID values, and don't
    change the epoch when the client conn ID rolls round.

    (4) Replace deprecated create_singlethread_workqueue() calls.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. Most relevant updates are the removal of per-conntrack timers to
    use a workqueue/garbage collection approach instead from Florian
    Westphal, the hash and numgen expression for nf_tables from Laura
    Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
    removal of ip_conntrack sysctl and many other incremental updates on our
    Netfilter codebase.

    More specifically, they are:

    1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
    transport area in dccp, sctp, tcp, udp and udplite protocol
    conntrackers, from Gao Feng.

    2) Missing whitespace on error message in physdev match, from Hangbin Liu.

    3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.

    4) Add nf_ct_expires() helper function and use it, from Florian Westphal.

    5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
    from Florian.

    6) Rename nf_tables set implementation to nft_set_{name}.c

    7) Introduce the hash expression to allow arbitrary hashing of selector
    concatenations, from Laura Garcia Liebana.

    8) Remove the ip_conntrack sysctl backward compatibility code; this code
    has been around for a long time already, and we have two interfaces to do
    this already: nf_conntrack sysctl and ctnetlink.

    9) Use nf_conntrack_get_ht() helper function whenever possible, instead
    of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.

    10) Add quota expression for nf_tables.

    11) Add number generator expression for nf_tables, this supports
    incremental and random generators that can be combined with maps,
    very useful for load balancing purpose, again from Laura Garcia Liebana.

    12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.

    13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
    configuration, this is used by a follow up patch to perform better chain
    update validation.

    14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
    nft_set_hash implementation to honor the NLM_F_EXCL flag.

    15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
    patch from Florian Westphal.

    16) Don't use the DYING bit to know if the conntrack event has been already
    delivered, instead a state variable to track event re-delivery
    states, also from Florian.

    17) Remove the per-conntrack timer, use the workqueue approach that was
    discussed during the NFWS, from Florian Westphal.

    18) Use the netlink conntrack table dump path to kill stale entries,
    again from Florian.

    19) Add a garbage collector to get rid of stale conntracks, from
    Florian.

    20) Reschedule garbage collector if eviction rate is high.

    21) Get rid of the __nf_ct_kill_acct() helper.

    22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.

    23) Make nf_log_set() interface assertive on unsupported families.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Sep, 2016

1 commit


03 Sep, 2016

3 commits

  • This patch fixes the return value of switchdev_port_fdb_dump() when
    CONFIG_NET_SWITCHDEV is not set. This avoids the "warning: return makes
    integer from pointer without a cast [-Wint-conversion]" emitted by
    several compiler versions when building with CONFIG_NET_SWITCHDEV unset.
    This warning is due to commit d297653dd6f07afbe7e6c702a4bcd7615680002e
    ("rtnetlink: fdb dump: optimize by saving last interface markers").

    Signed-off-by: Rami Rosen
    Acked-by: Roopa Prabhu
    Reported-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Rosen, Rami
     
  • Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to SW and HW perf
    events via the overflow_handler mechanism.
    When a program is attached, the overflow handlers become stacked.
    The program acts as a filter.
    Returning zero from the program means that the normal perf_event_output
    handler will not be called and the sampling event won't be stored in the
    ring buffer.

    The overflow_handler_context == NULL check is an additional safety
    measure to make sure programs are not attached to hw breakpoints or the
    watchdog, in case other checks (which prevent that now anyway) get
    accidentally relaxed in the future.

    The program refcnt is incremented in case perf_events are inherited
    when the target task is forked.
    As with kprobe and tracepoint programs, there is no ioctl to
    detach the program or swap an already attached program. User space is
    expected to close(perf_event_fd) as it does right now for kprobe+bpf.
    That restriction simplifies the code quite a bit.

    The invocation of overflow_handler in __perf_event_overflow() is now
    done via READ_ONCE, since that pointer can be replaced when the program
    is attached while perf_event itself could have been active already.
    There is no need to do similar treatment for event->prog, since it's
    assigned only once before it's accessed.
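    A user-space C11 sketch of the idea (invented names; atomic_load_explicit
    plays the role of READ_ONCE, loading the handler pointer exactly once
    even though it may be swapped while the event is live):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef void (*overflow_handler_t)(void *data);

struct event {
        _Atomic(overflow_handler_t) overflow_handler;
};

static int calls;  /* side effect so the demo handler is observable */

static void sample_handler(void *data)
{
        (void)data;
        calls++;
}

static void event_overflow(struct event *ev, void *data)
{
        /* single atomic load, analogous to READ_ONCE(event->overflow_handler) */
        overflow_handler_t h =
                atomic_load_explicit(&ev->overflow_handler, memory_order_relaxed);
        if (h)
                h(data);
}
```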

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
    HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
    correspondingly in uapi/linux/perf_event.h)

    The program-visible context meta structure is:

        struct bpf_perf_event_data {
                struct pt_regs regs;
                __u64 sample_period;
        };

    which is accessible directly from the program:

        int bpf_prog(struct bpf_perf_event_data *ctx)
        {
                ... ctx->sample_period ...
                ... ctx->regs.ip ...
        }

    The bpf verifier rewrites the accesses into kernel internal
    struct bpf_perf_event_data_kern which allows changing
    struct perf_sample_data without affecting bpf programs.
    New fields can be added to the end of struct bpf_perf_event_data
    in the future.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

02 Sep, 2016

5 commits

  • Access the priv member of the dsa_switch structure directly, instead of
    having an unnecessary helper.

    Signed-off-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Vivien Didelot
     
  • Add a per-port flag to control the unknown multicast flood, similar to the
    unknown unicast flood flag and break a few long lines in the netlink flag
    exports.

    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • fdb dumps spanning multiple skbs currently restart from the first
    interface again for every skb. This results in unnecessary
    iterations over the already visited interfaces and their fdb
    entries. In large scale setups, we have seen this slow
    down fdb dumps considerably. On a system with 30k macs we
    see fdb dumps spanning across more than 300 skbs.

    To fix the problem, this patch replaces the existing single fdb
    marker with three markers: netdev hash entries, netdevs and fdb
    index to continue where we left off instead of restarting from the
    first netdev. This is consistent with link dumps.
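    A toy sketch of the resume-from-marker idea (invented names; only the
    netdev and fdb markers of the three are modelled, and "emitting" an entry
    just counts it in place of copying it into an skb):

```c
#include <stddef.h>

/* Miniature of the resume scheme: instead of restarting at device 0 for
 * every message, pick up at the (dev, fdb) markers left by the last call. */
struct dump_state {
        size_t dev_marker;  /* which netdev we stopped at         */
        size_t fdb_marker;  /* which fdb entry within that netdev */
};

/* Emit up to 'budget' entries (one skb's worth), updating the markers so
 * the next call continues where this one left off. Returns entries emitted. */
static size_t fdb_dump(size_t ndevs, const size_t *nfdb_per_dev,
                       size_t budget, struct dump_state *st)
{
        size_t emitted = 0;

        for (size_t d = st->dev_marker; d < ndevs; d++) {
                size_t f = (d == st->dev_marker) ? st->fdb_marker : 0;
                for (; f < nfdb_per_dev[d]; f++) {
                        if (emitted == budget) {
                                st->dev_marker = d;  /* resume here next time */
                                st->fdb_marker = f;
                                return emitted;
                        }
                        emitted++;          /* "copy entry into the skb" */
                }
        }
        st->dev_marker = ndevs;             /* dump complete */
        st->fdb_marker = 0;
        return emitted;
}
```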

    In the process of fixing the performance issue, this patch also
    re-implements the fix done by
    commit 472681d57a5d ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
    (with an internal fix from Wilson Kok) in the following ways:
    - change ndo_fdb_dump handlers to return error code instead
    of the last fdb index
    - use cb->args strictly for dump frag markers and not error codes.
    This is consistent with other dump functions.

    Below results were taken on a system with 1000 netdevs
    and 35085 fdb entries:
    before patch:
    $time bridge fdb show | wc -l
    15065

    real 1m11.791s
    user 0m0.070s
    sys 1m8.395s

    (existing code does not return all macs)

    after patch:
    $time bridge fdb show | wc -l
    35085

    real 0m2.017s
    user 0m0.113s
    sys 0m1.942s

    Signed-off-by: Roopa Prabhu
    Signed-off-by: Wilson Kok
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • Constify the parameter of flow_keys_have_l4() for readability.

    Signed-off-by: Gao Feng
    Signed-off-by: David S. Miller

    Gao Feng
     
  • Don't expose skbs to in-kernel users, such as the AFS filesystem, but
    instead provide a notification hook that indicates that a call needs
    attention and another that indicates that there's a new call to be
    collected.

    This makes the following possibilities more achievable:

    (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

    (2) skbs referring to non-data events will be able to be freed much sooner
    rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
    will be able to consult the call state.

    (3) We can shortcut the receive phase when a call is remotely aborted
    because we don't have to go through all the packets to get to the one
    cancelling the operation.

    (4) It makes it easier to do encryption/decryption directly between AFS's
    buffers and sk_buffs.

    (5) Encryption/decryption can more easily be done in AFS's thread
    contexts - usually that of the userspace process that issued a syscall
    - rather than in one of rxrpc's background threads on a workqueue.

    (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

    To make this work, the following interface function has been added:

        int rxrpc_kernel_recv_data(struct socket *sock, struct rxrpc_call *call,
                                   void *buffer, size_t bufsize, size_t *_offset,
                                   bool want_more, u32 *_abort_code);

    This is the recvmsg equivalent. It allows the caller to find out about the
    state of a specific call and to transfer received data into a buffer
    piecemeal.

    afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
    logic between them. They don't wait synchronously yet because the socket
    lock needs to be dealt with.

    Five interface functions have been removed:

    rxrpc_kernel_is_data_last()
    rxrpc_kernel_get_abort_code()
    rxrpc_kernel_get_error_number()
    rxrpc_kernel_free_skb()
    rxrpc_kernel_data_consumed()

    As a temporary hack, sk_buffs going to an in-kernel call are queued on the
    rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
    in-kernel user. To process the queue internally, a temporary function,
    temp_deliver_data() has been added. This will be replaced with common code
    between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
    future patch.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

01 Sep, 2016

1 commit


31 Aug, 2016

1 commit

  • Today the mpls iptunnel lwtunnel_output redirect expects the tunnel
    output function to handle fragmentation. This is ok but can be
    avoided if we do not do the mpls output redirect too early,
    i.e. we could wait until ip fragmentation is done and then call
    mpls output for each ip fragment.

    To make this work we will need:
    1) the lwtunnel state to carry the encap headroom
    2) to do the redirect to the encap output handler on the ip fragment
    (essentially do the output redirect after fragmentation)

    This patch adds the tunnel headroom in lwtstate to make sure we
    account for tunnel data in mtu calculations during fragmentation,
    and adds a new xmit redirect handler to redirect to the lwtunnel
    xmit func after ip fragmentation.

    This includes IPV6 and some mtu fixes and testing from David Ahern.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    Roopa Prabhu
     

30 Aug, 2016

9 commits

  • Pass struct socket * to more rxrpc kernel interface functions. They
    should start from this rather than from the socket pointer in the
    rxrpc_call struct if they need to access the socket.

    I have left:

    rxrpc_kernel_is_data_last()
    rxrpc_kernel_get_abort_code()
    rxrpc_kernel_get_error_number()
    rxrpc_kernel_free_skb()
    rxrpc_kernel_data_consumed()

    unmodified as they're all about to be removed (and, in any case, don't
    touch the socket).

    Signed-off-by: David Howells

    David Howells
     
  • Provide a function so that kernel users, such as AFS, can ask for the peer
    address of a call:

        void rxrpc_kernel_get_peer(struct rxrpc_call *call,
                                   struct sockaddr_rxrpc *_srx);

    In the future the kernel service won't get sk_buffs to look inside.
    Further, this allows us to hide any canonicalisation inside AF_RXRPC for
    when IPv6 support is added.

    Also propagate this through to afs_find_server() and issue a warning if we
    can't handle the address family yet.

    Signed-off-by: David Howells

    David Howells
     
  • Add a trace event for debugging rxrpc_call struct usage.

    Signed-off-by: David Howells

    David Howells
     
  • nf_log_set() is an interface function, so it should do strict sanity
    checking of its parameters. Convert the return value of nf_log_set()
    from void to int; when the pf is invalid, return -EOPNOTSUPP.
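    A sketch of the assertive-interface idea (nf_log_set_sketch and the
    NFPROTO_NUMPROTO bound are stand-ins for illustration, not the kernel
    code):

```c
#include <errno.h>

#define NFPROTO_NUMPROTO 13   /* assumed upper bound on protocol families */

/* An interface function validates its arguments: reject an unsupported
 * family with an error code instead of silently accepting it. */
static int nf_log_set_sketch(int pf)
{
        if (pf < 0 || pf >= NFPROTO_NUMPROTO)
                return -EOPNOTSUPP;
        return 0;               /* would go on to install the logger */
}
```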

    Signed-off-by: Gao Feng
    Signed-off-by: Pablo Neira Ayuso

    Gao Feng
     
  • After the timer removal this just calls nf_ct_delete, so remove the
    __-prefixed version and make nf_ct_kill a shorthand for nf_ct_delete.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
    Eric Dumazet pointed out during netfilter workshop 2016.

    Eric also says: "Another reason was the fact that Thomas was about to
    change max timer range [..]" (500462a9de657f8, 'timers: Switch to
    a non-cascading wheel').

    Remove the timer and use a 32bit jiffies value containing the timestamp
    until which the entry is valid.

    During conntrack lookup, even before doing the tuple comparison, check
    the timeout value and evict the entry in case it is too old.
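    The wrap-safe timeout test can be sketched in user space as follows
    (entry_is_expired is an invented name; the signed subtraction is the
    same trick the kernel's time_after() family uses to survive jiffies
    wraparound):

```c
#include <stdbool.h>
#include <stdint.h>

/* "Has this 32-bit deadline passed?" - comparing via signed subtraction
 * gives the right answer even when the jiffies counter has wrapped. */
static bool entry_is_expired(uint32_t timeout, uint32_t now)
{
        return (int32_t)(timeout - now) <= 0;
}
```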

    The dying bit is used as a synchronization point to avoid races where
    multiple cpus try to evict the same entry.

    Because lookup is always lockless, we need to bump the refcnt once
    when we evict, else we could try to evict an already-dead entry that
    is being recycled.

    This is the standard/expected way when conntrack entries are destroyed.

    Followup patches will introduce garbage collection via a work queue
    and further places where we can reap obsolete entries (e.g. during
    netlink dumps); this is needed to keep expired conntracks from hanging
    around for too long when the lookup rate is low after a busy period.

    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The reliable event delivery mode currently (ab)uses the DYING bit to
    detect which entries on the dying list have to be skipped when the
    ecache worker re-delivers events.

    Currently when we delete the conntrack from main table we only set this
    bit if we could also deliver the netlink destroy event to userspace.

    If we fail we move it to the dying list, the ecache worker will
    reattempt event delivery for all confirmed conntracks on the dying list
    that do not have the DYING bit set.

    Once the timer is gone, we can no longer use if (del_timer()) to detect
    when we 'stole' the reference count owned by the timer/hash entry, so
    we need some other way to avoid racing with other cpus.

    Pablo suggested to add a marker in the ecache extension that skips
    entries that have been unhashed from main table but are still waiting
    for the last reference count to be dropped (e.g. because one skb waiting
    on nfqueue verdict still holds a reference).

    We do this by adding a tristate.
    If we fail to deliver the destroy event, make a note of this in the
    ecache extension. The worker can then skip all entries that are in
    a different state: either they never delivered a destroy event,
    e.g. because the netlink backend was not loaded, or redelivery took
    place already.

    Once the conntrack timer is removed we will be able to replace the
    del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
    racing with other cpus trying to evict the same conntrack.

    Because DYING will then be set right before we report the destroy event,
    we can no longer skip event reporting when the dying bit is set.

    Suggested-by: Pablo Neira Ayuso
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • All three conflicts were cases of simple overlapping
    changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:

    1) Segregate namespaces properly in conntrack dumps, from Liping Zhang.

    2) tcp listener refcount fix in netfilter tproxy, from Eric Dumazet.

    3) Fix timeouts in qed driver due to xmit_more, from Yuval Mintz.

    4) Fix use-after-free in tcp_xmit_retransmit_queue().

    5) Userspace header fixups (use of __u32, missing includes, etc.) from
    Mikko Rapeli.

    6) Further refinements to fragmentation wrt gso and tunnels, from
    Shmulik Ladkani.

    7) Trigger poll correctly for zero length UDP packets, from Eric
    Dumazet.

    8) TCP window scaling fix, also from Eric Dumazet.

    9) SLAB_DESTROY_BY_RCU is not relevant any more for UDP sockets.

    10) Module refcount leak in qdisc_create_dflt(), from Eric Dumazet.

    11) Fix deadlock in cp_rx_poll() of 8139cp driver, from Gao Feng.

    12) Memory leak in rhashtable's alloc_bucket_locks(), from Eric Dumazet.

    13) Add new device ID to alx driver, from Owen Lin.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
    Add Killer E2500 device ID in alx driver.
    net: smc91x: fix SMC accesses
    Documentation: networking: dsa: Remove platform device TODO
    net/mlx5: Increase number of ethtool steering priorities
    net/mlx5: Add error prints when validate ETS failed
    net/mlx5e: Fix memory leak if refreshing TIRs fails
    net/mlx5e: Add ethtool counter for TX xmit_more
    net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ
    net/mlx5e: Don't wait for SQ completions on close
    net/mlx5e: Don't post fragmented MPWQE when RQ is disabled
    net/mlx5e: Don't wait for RQ completions on close
    net/mlx5e: Limit UMR length to the device's limitation
    rhashtable: fix a memory leak in alloc_bucket_locks()
    sfc: fix potential stack corruption from running past stat bitmask
    team: loadbalance: push lacpdus to exact delivery
    net: hns: dereference ppe_cb->ppe_common_cb if it is non-null
    8139cp: Fix one possible deadloop in cp_rx_poll
    i40e: Change some init flow for the client
    Revert "phy: IRQ cannot be shared"
    net: dsa: bcm_sf2: Fix race condition while unmasking interrupts
    ...

    Linus Torvalds
     

29 Aug, 2016

7 commits

  • This patch enhances the ethtool link mode bitmap to include
    missing interface modes for 1G/10G speeds.

    Changes:
    1000baseX is the mode introduced to cover all 1G fiber cases.
    The modes under 1000baseX, i.e. 1000BASE-SX, 1000BASE-LX, 1000BASE-LX10
    and 1000BASE-BX10, are not explicitly defined at this moment.
    10G CR, SR, LR and ER link modes are included for 10G speeds.

    Issue:
    ethtool on a 1G/10G SFP port reports Base-T
    even though this port supports 1000baseX, 10G CR, SR and LR modes.

    root@tor-02$ ethtool swp1
    Settings for swp1:
    Supported ports: [ FIBRE ]
    Supported link modes: 1000baseT/Full
    10000baseT/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Advertised link modes: 1000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Speed: 10000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: external
    Auto-negotiation: off
    Current message level: 0x00000000 (0)

    Link detected: yes

    After fix:
    root@tor-02$ ethtool swp1
    Settings for swp1:
    Supported ports: [ FIBRE ]
    Supported link modes: 1000baseX/Full
    10000baseCR/Full
    10000baseSR/Full
    10000baseLR/Full
    10000baseER/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Advertised link modes: 1000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Speed: 10000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: external
    Auto-negotiation: off
    Current message level: 0x00000000 (0)
    Link detected: yes

    Signed-off-by: Vidya Sagar Ravipati
    Signed-off-by: David S. Miller

    Vidya Sagar Ravipati
     
  • When TCP operates in lossy environments (between 1 and 10 % packet
    loss), many SACK blocks can be exchanged, and I noticed we could
    drop them on busy senders if these SACK blocks have to be queued
    into the socket backlog.

    While the main cause is the poor performance of RACK/SACK processing,
    we can try to avoid these drops of valuable information that can lead to
    spurious timeouts and retransmits.

    The drops are caused by skb->truesize overestimation, due to:

    - drivers allocating ~2048 (or more) bytes as a fragment to hold an
    Ethernet frame.

    - various pskb_may_pull() calls bringing the headers into skb->head
    might have pulled all the frame content, but skb->truesize could
    not be lowered, as the stack has no idea of each fragment's truesize.

    The backlog drops are also more visible on bidirectional flows, since
    their sk_rmem_alloc can be quite big.

    Let's add some room for the backlog, as only the socket owner
    can selectively take action to lower memory needs, like collapsing
    receive queues or partial ofo pruning.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Commit b70661c70830 ("net: smc91x: use run-time configuration on all ARM
    machines") broke some ARM platforms through several mistakes. Firstly,
    the access size must correspond to the following rule:

    (a) at least one of 16-bit or 8-bit access size must be supported
    (b) 32-bit accesses are optional, and may be enabled in addition to
    the above.

    Secondly, it provides no emulation of 16-bit accesses, instead blindly
    making 16-bit accesses even when the platform specifies that only 8-bit
    is supported.

    Reorganise smc91x.h so we can make use of the existing 16-bit access
    emulation already provided - if 16-bit accesses are supported, use
    16-bit accesses directly, otherwise if 8-bit accesses are supported,
    use the provided 16-bit access emulation. If neither, BUG(). This
    exactly reflects the driver behaviour prior to the commit being fixed.
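    The 16-bit-via-8-bit emulation direction can be sketched as follows
    (invented name; a little-endian register pair is assumed), mirroring
    what the driver does when a platform offers only 8-bit I/O:

```c
#include <stdint.h>

/* Emulate one 16-bit register read with two 8-bit accesses: read the low
 * and high bytes separately and combine them (little-endian layout). */
static uint16_t read16_via_8(const volatile uint8_t *reg)
{
        uint16_t lo = reg[0];
        uint16_t hi = reg[1];
        return (uint16_t)(lo | (hi << 8));
}
```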

    Since the conversion incorrectly cut down the available access sizes on
    several platforms, we also need to go through every platform and fix up
    the overly-restrictive access size: Arnd assumed that if a platform can
    perform 32-bit, 16-bit and 8-bit accesses, then only a 32-bit access
    size needed to be specified - not so, all available access sizes must
    be specified.

    This also likely fixes some performance regressions: if a platform
    does not support 8-bit accesses, 8-bit accesses have been emulated
    by performing a 16-bit read-modify-write access.

    Tested on the Intel Assabet/Neponset platform, which supports only 8-bit
    accesses, which was broken by the original commit.

    Fixes: b70661c70830 ("net: smc91x: use run-time configuration on all ARM machines")
    Signed-off-by: Russell King
    Tested-by: Robert Jarzmik
    Signed-off-by: David S. Miller

    Russell King
     
  • kcm and strparser need to work with any type of stream socket, not just
    TCP. Eliminate references to TCP and call the generic proto_ops
    functions read_sock and peek_len. Also, in strp_init, check that the
    socket supports the read_sock and peek_len proto_ops.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
    tcp_peek_len (which is just a stub function that calls tcp_inq).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add a new function to the proto_ops structure. This includes moving the
    typedef for sk_read_actor into net.h and removing the definition from
    tcp.h.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Pull drm fixes from Dave Airlie:
    "A bunch of fixes covering i915, amdgpu, one tegra and some core DRM
    ones. Nothing too strange at this point"

    * tag 'drm-fixes-for-4.8-rc4' of git://people.freedesktop.org/~airlied/linux: (21 commits)
    drm/atomic: Don't potentially reset color_mgmt_changed on successive property updates.
    drm: Protect fb_defio in drivers with CONFIG_KMS_FBDEV_EMULATION
    drm/amdgpu: skip TV/CV in display parsing
    drm/amdgpu: avoid a possible array overflow
    drm/amdgpu: fix lru size grouping v2
    drm/tegra: dsi: Enhance runtime power management
    drm/i915: Fix botched merge that downgrades CSR versions.
    drm/i915/skl: Ensure pipes with changed wms get added to the state
    drm/i915/gen9: Only copy WM results for changed pipes to skl_hw
    drm/i915/skl: Add support for the SAGV, fix underrun hangs
    drm/i915/gen6+: Interpret mailbox error flags
    drm/i915: Reattach comment, complete type specification
    drm/i915: Unconditionally flush any chipset buffers before execbuf
    drm/i915/gen9: Drop invalid WARN() during data rate calculation
    drm/i915/gen9: Initialize intel_state->active_crtcs during WM sanitization (v2)
    drm: Reject page_flip for !DRIVER_MODESET
    drm/amdgpu: fix timeout value check in amd_sched_job_recovery
    drm/amdgpu: fix sdma_v2_4_ring_test_ib
    drm/amdgpu: fix amdgpu_move_blit on 32bit systems
    drm/radeon: fix radeon_move_blit on 32bit systems
    ...

    Linus Torvalds
     

28 Aug, 2016

1 commit

  • Pull KVM fixes from Paolo Bonzini:
    "ARM:
    - fixes for ITS init issues, error handling, IRQ leakage, race
    conditions
    - an erratum workaround for timers
    - some removal of misleading use of errors and comments
    - a fix for GICv3 on 32-bit guests

    MIPS:
    - fix for where the guest could wrongly map the first page of
    physical memory

    x86:
    - nested virtualization fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    MIPS: KVM: Check for pfn noslot case
    kvm: nVMX: fix nested tsc scaling
    KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
    KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC
    arm64: KVM: report configured SRE value to 32-bit world
    arm64: KVM: remove misleading comment on pmu status
    KVM: arm/arm64: timer: Workaround misconfigured timer interrupt
    arm64: Document workaround for Cortex-A72 erratum #853709
    KVM: arm/arm64: Change misleading use of is_error_pfn
    KVM: arm64: ITS: avoid re-mapping LPIs
    KVM: arm64: check for ITS device on MSI injection
    KVM: arm64: ITS: move ITS registration into first VCPU run
    KVM: arm64: vgic-its: Make updates to propbaser/pendbaser atomic
    KVM: arm64: vgic-its: Plug race in vgic_put_irq
    KVM: arm64: vgic-its: Handle errors from vgic_add_lpi
    KVM: arm64: ITS: return 1 on successful MSI injection

    Linus Torvalds
     

27 Aug, 2016

1 commit

  • Merge fixes from Andrew Morton:
    "11 fixes"

    * emailed patches from Andrew Morton :
    mm: silently skip readahead for DAX inodes
    dax: fix device-dax region base
    fs/seq_file: fix out-of-bounds read
    mm: memcontrol: avoid unused function warning
    mm: clarify COMPACTION Kconfig text
    treewide: replace config_enabled() with IS_ENABLED() (2nd round)
    printk: fix parsing of "brl=" option
    soft_dirty: fix soft_dirty during THP split
    sysctl: handle error writing UINT_MAX to u32 fields
    get_maintainer: quiet noisy implicit -f vcs_file_exists checking
    byteswap: don't use __builtin_bswap*() with sparse

    Linus Torvalds