19 Jan, 2019

1 commit


18 Jan, 2019

1 commit


16 Jan, 2019

2 commits

  • Add documentation for the counters below:
    TcpEstabResets
    TcpAttemptFails
    TcpOutRsts
    TcpExtTCPSACKDiscard
    TcpExtTCPDSACKIgnoredOld
    TcpExtTCPDSACKIgnoredNoUndo
    TcpExtTCPSackShifted
    TcpExtTCPSackMerged
    TcpExtTCPSackShiftFallback
    TcpExtTCPWantZeroWindowAdv
    TcpExtTCPToZeroWindowAdv
    TcpExtTCPFromZeroWindowAdv
    TcpExtDelayedACKs
    TcpExtDelayedACKLocked
    TcpExtDelayedACKLost
    TcpExtTCPLossProbes
    TcpExtTCPLossProbeRecovery
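
    The counters above are exported as text in /proc/net/snmp and
    /proc/net/netstat, as pairs of rows: field names first, then values.
    A minimal sketch of a parser for that layout follows. The function name
    and the sample values are made up for illustration and are not taken
    from the patch.

```python
def parse_snmp_text(text):
    """Parse /proc/net/snmp or /proc/net/netstat style text into {name: int}.

    Names are built as prefix + field, e.g. "TcpExt" + "DelayedACKs".
    """
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Rows come in pairs: a header row of field names, then a row of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, nums = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError(f"mismatched rows: {prefix!r} vs {value_prefix!r}")
        for name, num in zip(names.split(), nums.split()):
            counters[prefix + name] = int(num)
    return counters


# Illustrative sample only; the values are invented.
SAMPLE = """\
TcpExt: DelayedACKs DelayedACKLocked DelayedACKLost TCPLossProbes
TcpExt: 12 0 3 7
"""

if __name__ == "__main__":
    for name, value in parse_snmp_text(SAMPLE).items():
        print(name, value)
```

    On a live system the same function could be fed the contents of
    /proc/net/netstat instead of SAMPLE.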

    Signed-off-by: yupeng
    Signed-off-by: David S. Miller

    yupeng
     
  • The changes introduced to allow rxrpc calls to be retried create an issue
    when it comes to refcounting afs_call structs. The problem is that when
    rxrpc_send_data() queues the last packet for an asynchronous call, the
    following sequence can occur:

    (1) The notify_end_tx callback is invoked which causes the state in the
    afs_call to be changed from AFS_CALL_CL_REQUESTING or
    AFS_CALL_SV_REPLYING.

    (2) afs_deliver_to_call() can then process event notifications from rxrpc
    on the async_work queue.

    (3) Delivery of events, such as an abort from the server, can cause the
    afs_call state to be changed to AFS_CALL_COMPLETE on async_work.

    (4) For an asynchronous call, afs_process_async_call() notes that the call
    is complete and tries to clean up all the refs on async_work.

    (5) rxrpc_send_data() might return the amount of data transferred
    (success) or an error - which could in turn reflect a local error or a
    received error.

    Synchronising the clean up after rxrpc_kernel_send_data() returns an error
    with the asynchronous cleanup is then tricky to get right.

    Mostly revert commit c038a58ccfd6704d4d7d60ed3d6a0fca13cf13a4. The two API
    functions the original commit added aren't currently used. This makes
    rxrpc_kernel_send_data() always return successfully if it queued the data
    it was given.

    Note that this doesn't affect synchronous calls since their Rx notification
    function merely pokes a wait queue and does no refcounting. The
    asynchronous call notification function *has* to do refcounting and pass a
    ref over the work item to avoid the need to sync the workqueue in call
    cleanup.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

04 Jan, 2019

1 commit

  • Pull networking fixes from David Miller:
    "Several fixes here. Basically split down the line between newly
    introduced regressions and long existing problems:

    1) Double free in tipc_enable_bearer(), from Cong Wang.

    2) Many fixes to nf_conncount, from Florian Westphal.

    3) op->get_regs_len() can throw an error, check it, from Yunsheng
    Lin.

    4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman
    driver, from Scott Wood.

    5) Infinite loop in fib_empty_table(), from Yue Haibing.

    6) Use after free in ax25_fillin_cb(), from Cong Wang.

    7) Fix socket locking in nr_find_socket(), also from Cong Wang.

    8) Fix WoL wakeup enable in r8169, from Heiner Kallweit.

    9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani.

    10) Fix ptr_ring wrap during queue swap, from Cong Wang.

    11) Missing shutdown callback in hinic driver, from Xue Chaojing.

    12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano
    Brivio.

    13) BPF out of bounds speculation fixes from Daniel Borkmann"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
    ipv6: Consider sk_bound_dev_if when binding a socket to an address
    ipv6: Fix dump of specific table with strict checking
    bpf: add various test cases to selftests
    bpf: prevent out of bounds speculation on pointer arithmetic
    bpf: fix check_map_access smin_value test when pointer contains offset
    bpf: restrict unknown scalars of mixed signed bounds for unprivileged
    bpf: restrict stack pointer arithmetic for unprivileged
    bpf: restrict map value pointer arithmetic for unprivileged
    bpf: enable access to ax register also from verifier rewrite
    bpf: move tmp variable into ax register in interpreter
    bpf: move {prev_,}insn_idx into verifier env
    isdn: fix kernel-infoleak in capi_unlocked_ioctl
    ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
    net/hamradio/6pack: use mod_timer() to rearm timers
    net-next/hinic:add shutdown callback
    net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT
    ip: validate header length on virtual device xmit
    tap: call skb_probe_transport_header after setting skb->dev
    ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()
    net: rds: remove unnecessary NULL check
    ...

    Linus Torvalds
     

02 Jan, 2019

1 commit

  • Add documentation and examples for the counters below:
    TcpExtTCPOFOQueue
    TcpExtTCPOFODrop
    TcpExtTCPOFOMerge
    TcpExtPAWSActive
    TcpExtPAWSEstab
    TcpExtTCPACKSkippedSynRecv
    TcpExtTCPACKSkippedPAWS
    TcpExtTCPACKSkippedSeq
    TcpExtTCPACKSkippedFinWait2
    TcpExtTCPACKSkippedTimeWait
    TcpExtTCPACKSkippedChallenge

    Signed-off-by: yupeng
    Signed-off-by: David S. Miller

    yupeng
     

30 Dec, 2018

1 commit

  • Pull documentation update from Jonathan Corbet:
    "A fairly normal cycle for documentation stuff. We have a new document
    on perf security, more Italian translations, more improvements to the
    memory-management docs, improvements to the pathname lookup
    documentation, and the usual array of smaller fixes.

    As is often the case, there are a few reaches outside of
    Documentation/ to adjust kerneldoc comments"

    * tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
    docs: improve pathname-lookup document structure
    configfs: fix wrong name of struct in documentation
    docs/mm-api: link slab_common.c to "The Slab Cache" section
    slab: make kmem_cache_create{_usercopy} description proper kernel-doc
    doc:process: add links where missing
    docs/core-api: make mm-api.rst more structured
    x86, boot: documentation whitespace fixup
    Documentation: devres: note checking needs when converting
    doc:it: add some process/* translations
    doc:it: fixes in process/1.Intro
    Documentation: convert path-lookup from markdown to resturctured text
    Documentation/admin-guide: update admin-guide index.rst
    Documentation/admin-guide: introduce perf-security.rst file
    scripts/kernel-doc: Fix struct and struct field attribute processing
    Documentation: dev-tools: Fix typos in index.rst
    Correct gen_init_cpio tool's documentation
    Document /proc/pid PID reuse behavior
    Documentation: update path-lookup.md for parallel lookups
    Documentation: Use "while" instead of "whilst"
    dmaengine: Add mailing list address to the documentation
    ...

    Linus Torvalds
     

21 Dec, 2018

4 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for net-next:

    1) Support for destination MAC in ipset, from Stefano Brivio.

    2) Disallow all-zeroes MAC address in ipset, also from Stefano.

    3) Add IPSET_CMD_GET_BYNAME and IPSET_CMD_GET_BYINDEX commands,
    introduce protocol version number 7, from Jozsef Kadlecsik.
    A follow up patch to fix ip_set_byindex() is also included
    in this batch.

    4) Honor CTA_MARK_MASK from ctnetlink, from Andreas Jaggi.

    5) Statify nf_flow_table_iterate(), from Taehee Yoo.

    6) Use nf_flow_table_iterate() to simplify garbage collection in
    nf_flow_table logic, also from Taehee Yoo.

    7) Don't use _bh variants of call_rcu(), rcu_barrier() and
    synchronize_rcu_bh() in Netfilter, from Paul E. McKenney.

    8) Remove NFC_* cache definition from the old caching
    infrastructure.

    9) Remove layer 4 port rover in NAT helpers, use random port
    instead, from Florian Westphal.

    10) Use strscpy() in ipset, from Qian Cai.

    11) Remove NF_NAT_RANGE_PROTO_RANDOM_FULLY branch now that
    random port is allocated by default, from Xiaozhou Liu.

    12) Ignore NF_NAT_RANGE_PROTO_RANDOM too, from Florian Westphal.

    13) Limit port allocation selection routine in NAT to avoid
    softlockup splats when most ports are in use, from Florian.

    14) Remove unused parameters in nf_ct_l4proto_unregister_sysctl()
    from Yafang Shao.

    15) Direct call to nf_nat_l4proto_unique_tuple() instead of
    indirection, from Florian Westphal.

    16) Several patches to remove all layer 4 NAT indirections,
    remove nf_nat_l4proto struct, from Florian Westphal.

    17) Fix RTP/RTCP source port translation when SNAT is in place,
    from Alin Nastac.

    18) Selective rule dump per chain, from Phil Sutter.

    19) Revisit CLUSTERIP target, this includes a deadlock fix from
    netns path, sleep in atomic, remove bogus WARN_ON_ONCE()
    and disallow mismatching IP address and MAC address.
    Patchset from Taehee Yoo.

    20) Update UDP timeout to stream after 2 seconds, from Florian.

    21) Shrink UDP established timeout to 120 seconds like TCP timewait.

    22) Sysctl knobs to set GRE timeouts, from Yafang Shao.

    23) Move seq_print_acct() to conntrack core file, from Florian.

    24) Add enum for conntrack sysctl knobs, also from Florian.

    25) Place nf_conntrack_acct, nf_conntrack_helper, nf_conntrack_events
    and nf_conntrack_timestamp knobs in the core, from Florian Westphal.
    As a side effect, shrink netns_ct structure by removing obsolete
    sysctl anchors, also from Florian.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch adds two sysctl knobs for GRE:

    net.netfilter.nf_conntrack_gre_timeout = 30
    net.netfilter.nf_conntrack_gre_timeout_stream = 180
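
    A sysctl name maps to a file under /proc/sys by replacing dots with
    slashes, so the two knobs can be read like any other file. A minimal
    sketch, with hypothetical helper names, which returns None when the knob
    is absent (for example when the conntrack module is not loaded):

```python
def sysctl_path(name, root="/proc/sys"):
    """Translate a dotted sysctl name into its path under /proc/sys."""
    return root + "/" + name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the integer value of a sysctl, or None if it cannot be read."""
    try:
        with open(sysctl_path(name, root)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for knob in ("net.netfilter.nf_conntrack_gre_timeout",
                 "net.netfilter.nf_conntrack_gre_timeout_stream"):
        print(knob, "=", read_sysctl(knob))
```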

    Update the Documentation as well.

    Signed-off-by: Yafang Shao
    Signed-off-by: Pablo Neira Ayuso

    Yafang Shao
     
  • We have no explicit signal when a UDP stream has terminated, peers just
    stop sending.

    For suspected stream connections a timeout of two minutes is sane to keep
    NAT mapping alive a while longer.

    It matches tcp conntracks 'timewait' default timeout value.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Add some pointers to the definition of the CBS algorithm, and some
    notes about the limits of its implementation in the i210 family of
    controllers.

    Signed-off-by: Vinicius Costa Gomes
    Tested-by: Aaron Brown
    Signed-off-by: Jeff Kirsher

    Vinicius Costa Gomes
     

20 Dec, 2018

1 commit

  • Remove skb->sp and allocate secpath storage via extension
    infrastructure. This also reduces sk_buff by 8 bytes on x86_64.

    Total size of allyesconfig kernel is reduced slightly, as there is
    less inlined code (one conditional atomic op instead of two on
    skb_clone).

    No differences in throughput in following ipsec performance tests:
    - transport mode with aes on 10GB link
    - tunnel mode between two network namespaces with aes and null cipher

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

15 Dec, 2018

1 commit


08 Dec, 2018

1 commit

  • The existing garbage collection algorithm has a number of problems:

    1. The gc algorithm will not evict PERMANENT entries as those entries
    are managed by userspace, yet the existing algorithm walks the entire
    hash table which means it always considers PERMANENT entries when
    looking for entries to evict. In some use cases (e.g., EVPN) there
    can be tens of thousands of PERMANENT entries leading to wasted
    CPU cycles when gc kicks in. As an example, with 32k permanent
    entries, neigh_alloc has been observed taking more than 4 msec per
    invocation.

    2. Currently, when the number of neighbor entries hits gc_thresh2 and
    the last flush for the table was more than 5 seconds ago, gc kicks in and
    walks the entire hash table evicting *all* entries not in PERMANENT
    or REACHABLE state and not marked as externally learned. There is no
    discriminator on when the neigh entry was created or if it just moved
    from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

    It is possible for entries to be created or for established neighbor
    entries to be moved to STALE (e.g., an external node sends an ARP
    request) right before the 5 second window lapses:

    -----|---------x|----------|-----
        t-5         t         t+5

    If that happens those entries are evicted during gc causing unnecessary
    thrashing on neighbor entries and userspace caches trying to track them.

    Further, this contradicts the description of gc_thresh2 which says
    "Entries older than 5 seconds will be cleared".

    One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
    whole point of having separate thresholds.

    3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
    when gc_thresh2 is exceeded is overkill and contributes to thrashing,
    especially during startup.

    This patch addresses these problems as follows:

    1. Use of a separate list_head to track entries that can be garbage
    collected along with a separate counter. PERMANENT entries are not
    added to this list.

    The gc_thresh parameters are only compared to the new counter, not the
    total entries in the table. The forced_gc function is updated to only
    walk this new gc_list looking for entries to evict.

    2. Entries are added to the list head at the tail and removed from the
    front.

    3. Entries are only evicted if they were last updated more than 5 seconds
    ago, adhering to the original intent of gc_thresh2.

    4. Forced gc is stopped once the number of gc_entries drops below
    gc_thresh2.

    5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
    when allocating a new neighbor for a PERMANENT entry. By extension this
    means there are no explicit limits on the number of PERMANENT entries
    that can be created, but this is no different than FIB entries or FDB
    entries.
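
    The five points above can be sketched as a toy model. This is not the
    kernel implementation: the class and method names are hypothetical, and
    neighbor states other than PERMANENT, reference counts and locking are
    all left out.

```python
from collections import OrderedDict

GC_STALE_SECONDS = 5  # only entries last updated longer ago than this are evicted


class NeighTable:
    """Toy neighbor table with a separate gc list (names are hypothetical)."""

    def __init__(self, gc_thresh2):
        self.gc_thresh2 = gc_thresh2
        self.permanent = {}           # PERMANENT entries, never on the gc list
        self.gc_list = OrderedDict()  # key -> last update time, oldest at front

    def update(self, key, now, permanent=False):
        # Remove any old placement, then file the entry where it belongs.
        self.permanent.pop(key, None)
        self.gc_list.pop(key, None)
        if permanent:
            self.permanent[key] = now
        else:
            self.gc_list[key] = now   # added at the tail

    def forced_gc(self, now):
        """Walk the gc list from the front and return the evicted keys."""
        evicted = []
        for key, updated in list(self.gc_list.items()):
            if len(self.gc_list) < self.gc_thresh2:
                break                 # enough room again, stop early
            if now - updated > GC_STALE_SECONDS:
                del self.gc_list[key]
                evicted.append(key)
        return evicted
```

    Because only gc_list is walked and counted, the cost of forced_gc() in
    this model does not depend on how many PERMANENT entries exist.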

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

06 Dec, 2018

1 commit

  • Documentation/networking/ is full of cryptically named files with
    driver documentation. This makes finding interesting information
    at a glance really hard. Move all those files into a directory
    called device_drivers (since not all drivers are for device) and
    fix up references.

    RFC v0.1 -> RFC v1:
    - also add .txt suffix to the files which are missing it (Quentin)

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Acked-by: David Ahern
    Acked-by: Henrik Austad
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

04 Dec, 2018

1 commit

  • Many drivers load the device's firmware image during the initialization
    flow either from the flash or from the disk. Currently this option is not
    controlled by the user and the driver decides from where to load the
    firmware image.

    'fw_load_policy' gives the ability to control this option which allows the
    user to choose between different loading policies supported by the driver.

    This parameter can be useful while testing and/or debugging the device. For
    example, testing a firmware bug fix.

    Signed-off-by: Shalom Toledo
    Reviewed-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Shalom Toledo
     

28 Nov, 2018

1 commit

  • Add explanation of the counters below:
    TcpExtTCPRcvCoalesce
    TcpExtTCPAutoCorking
    TcpExtTCPOrigDataSent
    TCPSynRetrans
    TCPFastOpenActiveFail
    TcpExtListenOverflows
    TcpExtListenDrops
    TcpExtTCPHystartTrainDetect
    TcpExtTCPHystartTrainCwnd
    TcpExtTCPHystartDelayDetect
    TcpExtTCPHystartDelayCwnd

    Signed-off-by: yupeng
    Signed-off-by: David S. Miller

    yupeng
     

22 Nov, 2018

2 commits


21 Nov, 2018

1 commit

  • Whilst making an unrelated change to some Documentation, Linus sayeth:

    | Afaik, even in Britain, "whilst" is unusual and considered more
    | formal, and "while" is the common word.
    |
    | [...]
    |
    | Can we just admit that we work with computers, and we don't need to
    | use þe eald Englisc spelling of words that most of the world never
    | uses?

    dictionary.com refers to the word as "Chiefly British", which is
    probably an undesirable attribute for technical documentation.

    Replace all occurrences under Documentation/ with "while".

    Cc: David Howells
    Cc: Liam Girdwood
    Cc: Chris Wilson
    Cc: Michael Halcrow
    Cc: Jonathan Corbet
    Reported-by: Linus Torvalds
    Signed-off-by: Will Deacon
    Signed-off-by: Jonathan Corbet

    Will Deacon
     

20 Nov, 2018

1 commit


19 Nov, 2018

1 commit


16 Nov, 2018

1 commit

  • The life-checking function, which is used by kAFS to make sure that a call
    is still live in the event of a pending signal, only samples the received
    packet serial number counter; it doesn't actually provoke a change in the
    counter, rather relying on the server to happen to give us a packet in the
    time window.

    Fix this by adding a function to force a ping to be transmitted.

    kAFS then keeps track of whether there's been a stall, and if so, uses the
    new function to ping the server, resetting the timeout to allow the reply
    to come back.

    If there's a stall, a ping and the call is *still* stalled in the same
    place after another period, then the call will be aborted.
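
    The stall handling described above amounts to a small state machine run
    once per check period. A minimal sketch under that reading follows; the
    function name, the state dictionary and the returned strings are all
    hypothetical and do not correspond to the actual kAFS or rxrpc symbols.

```python
def check_call_life(state, rx_serial):
    """Run one periodic liveness check and return 'ok', 'ping' or 'abort'.

    state holds 'last_serial' (last sampled receive serial number) and
    'stalled' (whether a ping has already been sent for the current stall).
    """
    if rx_serial != state["last_serial"]:
        # The server sent something since the last check: the call is live.
        state["last_serial"] = rx_serial
        state["stalled"] = False
        return "ok"
    if not state["stalled"]:
        # First period without progress: provoke a reply and wait again.
        state["stalled"] = True
        return "ping"
    # Still stuck in the same place after a ping and another period.
    return "abort"
```

    A reply to the ping advances the serial number, so the next check
    returns 'ok' and clears the stall flag.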

    Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
    Fixes: f4d15fb6f99a ("rxrpc: Provide functions for allowing cleaner handling of signals")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

12 Nov, 2018

2 commits

  • FQ pacing guarantees that paced packets queued by one flow do not
    add head-of-line blocking for other flows.

    After TCP GSO conversion, increasing limit_output_bytes to 1 MB is safe,
    since this maps to 16 skbs at most in qdisc or device queues.
    (or slightly more if some drivers lower {gso_max_segs|size})
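
    The "16 skbs" figure is simple division. The sketch below assumes a
    maximum GSO skb size of 64 KB, which the text above does not state
    explicitly, and shows how the count grows if a driver lowers that size.

```python
LIMIT_OUTPUT_BYTES = 1 << 20  # the 1 MB limit named in the text
GSO_MAX_SIZE = 64 * 1024      # assumption: 64 KB maximum GSO skb


def max_queued_skbs(limit_bytes, skb_bytes):
    """Number of full-size skbs needed to reach the byte limit (rounded up)."""
    return -(-limit_bytes // skb_bytes)


if __name__ == "__main__":
    print(max_queued_skbs(LIMIT_OUTPUT_BYTES, GSO_MAX_SIZE))       # 64 KB skbs
    print(max_queued_skbs(LIMIT_OUTPUT_BYTES, GSO_MAX_SIZE // 2))  # 32 KB skbs
```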

    We still can queue at most 1 ms worth of traffic (this can be scaled
    by wifi drivers if they need to)

    Tested:

    # ethtool -c eth0 | egrep "tx-usecs:|tx-frames:" # 40 Gbit mlx4 NIC
    tx-usecs: 16
    tx-frames: 16
    # tc qdisc replace dev eth0 root fq
    # for f in {1..10};do netperf -P0 -H lpaa24,6 -o THROUGHPUT;done

    Before patch:
    27711
    26118
    27107
    27377
    27712
    27388
    27340
    27117
    27278
    27509

    After patch:
    37434
    36949
    36658
    36998
    37711
    37291
    37605
    36659
    36544
    37349
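
    To put one number on the two result lists, the runs can be averaged.
    The values below are copied from the lists above; the units are
    whatever netperf reported and are not restated here.

```python
from statistics import mean

BEFORE = [27711, 26118, 27107, 27377, 27712, 27388, 27340, 27117, 27278, 27509]
AFTER = [37434, 36949, 36658, 36998, 37711, 37291, 37605, 36659, 36544, 37349]


def relative_gain(before, after):
    """Fractional change of the mean throughput, e.g. 0.25 means +25%."""
    return mean(after) / mean(before) - 1.0


if __name__ == "__main__":
    print(f"mean before: {mean(BEFORE):.1f}")
    print(f"mean after:  {mean(AFTER):.1f}")
    print(f"gain:        {relative_gain(BEFORE, AFTER):.1%}")
```

    For these ten runs the mean rises from about 27266 to about 37120, a
    gain of roughly 36%.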

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The snmp_counter.rst explains the meanings of snmp counters. It also
    provides a set of experiments (only 1 for this initial patch) and
    combines the experiments' results with the snmp counters'
    meanings. This is an initial patch; it only explains a part of the IP/ICMP
    counters and provides a simple ping test.

    Signed-off-by: yupeng
    Signed-off-by: David S. Miller

    yupeng
     

08 Nov, 2018

2 commits

  • Add a sysctl raw_l3mdev_accept to control raw socket lookup in a manner
    similar to use of tcp_l3mdev_accept for stream and of udp_l3mdev_accept
    for datagram sockets. Have this default to enabled for reasons of
    backwards compatibility. This is so as to specify the output device
    with cmsg and IP_PKTINFO, but using a socket not bound to the
    corresponding VRF. This allows e.g. older ping implementations to be
    run with specifying the device but without executing it in the VRF.
    If the option is disabled, packets received in a VRF context are only
    handled by a raw socket bound to the VRF, and correspondingly packets
    in the default VRF are only handled by a socket not bound to any VRF.
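
    The matching rule described above can be written as a small predicate.
    This is a simplified model of the behaviour in the text, not the kernel
    lookup code; the function and argument names are hypothetical, and None
    stands for "socket not bound to a VRF" or "packet in the default VRF".

```python
def raw_socket_accepts(raw_l3mdev_accept, sk_vrf, pkt_vrf):
    """Return True if a raw socket may handle a packet under this model."""
    if sk_vrf is not None:
        # A socket bound to a VRF only sees packets from that VRF.
        return sk_vrf == pkt_vrf
    if pkt_vrf is None:
        # Unbound socket, default VRF: always a match.
        return True
    # Unbound socket, packet received in a VRF: depends on the sysctl.
    return bool(raw_l3mdev_accept)
```

    With the default (enabled) setting an unbound socket still receives
    packets that arrived in a VRF, which is what keeps older ping
    implementations working.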

    Signed-off-by: Mike Manning
    Reviewed-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Mike Manning
     
  • Change the inet socket lookup to avoid packets arriving on a device
    enslaved to an l3mdev from matching unbound sockets by removing the
    wildcard for non sk_bound_dev_if and instead relying on check against
    the secondary device index, which will be 0 when the input device is
    not enslaved to an l3mdev and so match against an unbound socket and
    not match when the input device is enslaved.

    Change the socket binding to take the l3mdev into account, so that an
    unbound socket does not conflict with sockets bound to an l3mdev, given
    the datapath isolation now guaranteed.

    Signed-off-by: Robert Shearman
    Signed-off-by: Mike Manning
    Reviewed-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Robert Shearman
     

04 Nov, 2018

1 commit

  • Pull Kbuild updates from Masahiro Yamada:

    - clean-up leftovers in Kconfig files

    - remove stale oldnoconfig and silentoldconfig targets

    - remove unneeded cc-fullversion and cc-name variables

    - improve merge_config script to allow overriding option prefix

    * tag 'kbuild-v4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kbuild: remove cc-name variable
    kbuild: replace cc-name test with CONFIG_CC_IS_CLANG
    merge_config.sh: Allow to define config prefix
    kbuild: remove unused cc-fullversion variable
    kconfig: remove silentoldconfig target
    kconfig: remove oldnoconfig target
    powerpc: PCI_MSI needs PCI
    powerpc: remove CONFIG_MCA leftovers
    powerpc: remove CONFIG_PCI_QSPAN
    scsi: aha152x: rename the PCMCIA define

    Linus Torvalds
     

01 Nov, 2018

1 commit


30 Oct, 2018

1 commit


25 Oct, 2018

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "This is a fairly typical cycle for documentation. There's some welcome
    readability improvements for the formatted output, some LICENSES
    updates including the addition of the ISC license, the removal of the
    unloved and unmaintained 00-INDEX files, the deprecated APIs document
    from Kees, more MM docs from Mike Rapoport, and the usual pile of typo
    fixes and corrections"

    * tag 'docs-4.20' of git://git.lwn.net/linux: (41 commits)
    docs: Fix typos in histogram.rst
    docs: Introduce deprecated APIs list
    kernel-doc: fix declaration type determination
    doc: fix a typo in adding-syscalls.rst
    docs/admin-guide: memory-hotplug: remove table of contents
    doc: printk-formats: Remove bogus kobject references for device nodes
    Documentation: preempt-locking: Use better example
    dm flakey: Document "error_writes" feature
    docs/completion.txt: Fix a couple of punctuation nits
    LICENSES: Add ISC license text
    LICENSES: Add note to CDDL-1.0 license that it should not be used
    docs/core-api: memory-hotplug: add some details about locking internals
    docs/core-api: rename memory-hotplug-notifier to memory-hotplug
    docs: improve readability for people with poorer eyesight
    yama: clarify ptrace_scope=2 in Yama documentation
    docs/vm: split memory hotplug notifier description to Documentation/core-api
    docs: move memory hotplug description into admin-guide/mm
    doc: Fix acronym "FEKEK" in ecryptfs
    docs: fix some broken documentation references
    iommu: Fix passthrough option documentation
    ...

    Linus Torvalds
     

19 Oct, 2018

9 commits