20 Sep, 2018

2 commits

  • Some users are willing to provision huge amounts of memory to be able
    to perform reassembly reasonably well under pressure.

    Current memory tracking is using one atomic_t and integers.

    Switch to atomic_long_t so that 64bit arches can use more than 2GB,
    without any cost for 32bit arches.

    Note that this patch avoids an overflow error, if high_thresh was set
    to ~2GB, since this test in inet_frag_alloc() was never true :

    if (... || frag_mem_limit(nf) > nf->high_thresh)
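
    The overflow can be illustrated with a small userspace model. This is
    only a sketch, not the kernel code: the helper names are hypothetical,
    and int32_t/int64_t stand in for the old integer counters and for
    atomic_long_t on a 64bit arch. With 32-bit values, no usage count can
    ever exceed a threshold sitting at the top of the range, so the test
    stays false.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pre-patch model (assumption): usage and threshold are 32-bit signed. */
bool over_limit_32(int32_t mem, int32_t high_thresh)
{
	/* If high_thresh is INT32_MAX, no int32_t value of mem is larger. */
	return mem > high_thresh;
}

/* Post-patch model (assumption): 64-bit values, as long is on 64bit arches. */
bool over_limit_64(int64_t mem, int64_t high_thresh)
{
	return mem > high_thresh;
}
```

    With the figures from the test run in this message,
    over_limit_64(16000002880, 16000000000) is true, while
    over_limit_32(INT32_MAX, INT32_MAX) is false for the largest usage a
    32-bit counter can hold.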

    Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds 3317150 0.0
    IpReasmFails 3317112 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Some applications still rely on IP fragmentation, and to be fair linux
    reassembly unit is not working under any serious load.

    It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

    A work queue is supposed to garbage collect items when host is under memory
    pressure, and doing a hash rebuild, changing seed used in hash computations.

    This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
    occurring every 5 seconds if host is under fire.

    Then there is the problem of sharing this hash table for all netns.

    It is time to switch to rhashtables, and allocate one of them per netns
    to speedup netns dismantle, since this is a critical metric these days.

    Lookup is now using RCU. A followup patch will even remove
    the refcount hold/release left from prior implementation and save
    a couple of atomic operations.

    Before this patch, 16 cpus (16 RX queue NIC) could not handle more
    than 1 Mpps frags DDOS.

    After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
    of storage for the fragments (exact number depends on frags being evicted
    after timeout)

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

    A followup patch will change the limits for 64bit arches.

    Signed-off-by: Eric Dumazet
    Cc: Kirill Tkhai
    Cc: Herbert Xu
    Cc: Florian Westphal
    Cc: Jesper Dangaard Brouer
    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Signed-off-by: David S. Miller
    (cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

12 Jun, 2018

1 commit

  • [ Upstream commit 75d4e704fa8d2cf33ff295e5b441317603d7f9fd ]

    Per discussion with David at netconf 2018, let's clarify
    DaveM's position of handling stable backports in netdev-FAQ.

    This is important for people relying on upstream -stable
    releases.

    Cc: Greg Kroah-Hartman
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     

09 Mar, 2018

1 commit

  • [ Upstream commit a61a86f8db92923a2a4c857c49a795bcae754497 ]

    The SK_MEM_QUANTUM was changed from PAGE_SIZE to 4096. And the
    tcp_wmem/tcp_rmem min default values are 4096.

    Fixes: bd68a2a854ad ("net: set SK_MEM_QUANTUM to 4096")
    Cc: Eric Dumazet
    Signed-off-by: Tonghao Zhang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Tonghao Zhang
     

08 Oct, 2017

1 commit


23 Sep, 2017

2 commits

  • Pull networking fixes from David Miller:

    1) Fix NAPI poll list corruption in enic driver, from Christian
    Lamparter.

    2) Fix route use after free, from Eric Dumazet.

    3) Fix regression in reuseaddr handling, from Josef Bacik.

    4) Assert the size of control messages in compat handling since we copy
    it in from userspace twice. From Meng Xu.

    5) SMC layer bug fixes (missing RCU locking, bad refcounting, etc.)
    from Ursula Braun.

    6) Fix races in AF_PACKET fanout handling, from Willem de Bruijn.

    7) Don't use ARRAY_SIZE on spinlock array which might have zero
    entries, from Geert Uytterhoeven.

    8) Fix miscomputation of checksum in ipv6 udp code, from Subash Abhinov
    Kasiviswanathan.

    9) Push the ipv6 header properly in ipv6 GRE tunnel driver, from Xin
    Long.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (75 commits)
    inet: fix improper empty comparison
    net: use inet6_rcv_saddr to compare sockets
    net: set tb->fast_sk_family
    net: orphan frags on stand-alone ptype in dev_queue_xmit_nit
    MAINTAINERS: update git tree locations for ieee802154 subsystem
    net: prevent dst uses after free
    net: phy: Fix truncation of large IRQ numbers in phy_attached_print()
    net/smc: no close wait in case of process shut down
    net/smc: introduce a delay
    net/smc: terminate link group if out-of-sync is received
    net/smc: longer delay for client link group removal
    net/smc: adapt send request completion notification
    net/smc: adjust net_device refcount
    net/smc: take RCU read lock for routing cache lookup
    net/smc: add receive timeout check
    net/smc: add missing dev_put
    net: stmmac: Cocci spatch "of_table"
    lan78xx: Use default values loaded from EEPROM/OTP after reset
    lan78xx: Allow EEPROM write for less than MAX_EEPROM_SIZE
    lan78xx: Fix for eeprom read/write when device auto suspend
    ...

    Linus Torvalds
     
  • Pull seccomp updates from Kees Cook:
    "Major additions:

    - sysctl and seccomp operation to discover available actions
    (tyhicks)

    - new per-filter configurable logging infrastructure and sysctl
    (tyhicks)

    - SECCOMP_RET_LOG to log allowed syscalls (tyhicks)

    - SECCOMP_RET_KILL_PROCESS as the new strictest possible action

    - self-tests for new behaviors"

    [ This is the seccomp part of the security pull request during the merge
    window that was nixed due to unrelated problems - Linus ]

    * tag 'seccomp-v4.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    samples: Unrename SECCOMP_RET_KILL
    selftests/seccomp: Test thread vs process killing
    seccomp: Implement SECCOMP_RET_KILL_PROCESS action
    seccomp: Introduce SECCOMP_RET_KILL_PROCESS
    seccomp: Rename SECCOMP_RET_KILL to SECCOMP_RET_KILL_THREAD
    seccomp: Action to log before allowing
    seccomp: Filter flag to log all actions except SECCOMP_RET_ALLOW
    seccomp: Selftest for detection of filter flag support
    seccomp: Sysctl to configure actions that are allowed to be logged
    seccomp: Operation for checking if an action is available
    seccomp: Sysctl to display available actions
    seccomp: Provide matching filter for introspection
    selftests/seccomp: Refactor RET_ERRNO tests
    selftests/seccomp: Add simple seccomp overhead benchmark
    selftests/seccomp: Add tests for basic ptrace actions

    Linus Torvalds
     

20 Sep, 2017

1 commit

  • Currently, writing into
    net.ipv6.conf.all.{accept_dad,use_optimistic,optimistic_dad} has no effect.
    Fix handling of these flags by:

    - using the maximum of global and per-interface values for the
    accept_dad flag. That is, if at least one of the two values is
    non-zero, enable DAD on the interface. If at least one value is
    set to 2, enable DAD and disable IPv6 operation on the interface if
    MAC-based link-local address was found

    - using the logical OR of global and per-interface values for the
    optimistic_dad flag. If at least one of them is set to one, optimistic
    duplicate address detection (RFC 4429) is enabled on the interface

    - using the logical OR of global and per-interface values for the
    use_optimistic flag. If at least one of them is set to one,
    optimistic addresses won't be marked as deprecated during source address
    selection on the interface.
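
    The combination rules listed above can be sketched in plain C. This is
    a userspace illustration only, not the kernel implementation; the
    function names are hypothetical, and all_val/if_val stand for the
    net.ipv6.conf.all.* value and the per-interface value.

```c
#include <stdbool.h>

/* accept_dad: the larger of the global and per-interface values wins. */
int effective_accept_dad(int all_val, int if_val)
{
	return all_val > if_val ? all_val : if_val;
}

/* optimistic_dad and use_optimistic: logical OR of the two values. */
bool effective_flag_or(int all_val, int if_val)
{
	return all_val || if_val;
}
```

    For example, effective_accept_dad(2, 1) yields 2, and
    effective_flag_or(0, 1) is true, so setting either side is enough to
    turn the behaviour on.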

    While at it, as we're modifying the prototype for ipv6_use_optimistic_addr(),
    drop inline, and let the compiler decide.

    Fixes: 7fd2561e4ebd ("net: ipv6: Add a sysctl to make optimistic addresses useful candidates")
    Signed-off-by: Matteo Croce
    Signed-off-by: David S. Miller

    Matteo Croce
     

19 Sep, 2017

1 commit

  • Fix ASCII art in Documentation/networking/switchdev.txt:

    Change non-ASCII "spaces" to ASCII spaces.

    Change 2 erroneous '+' characters in ASCII art to '-' (at the '*'
    characters below):

    line 32:
    +--+----+----+----+-*--+----+---+ +-----+-----+
    line 41:
    +--------------+---*------------+

    Signed-off-by: Randy Dunlap
    Acked-by: Pavel Machek
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Randy Dunlap
     

17 Sep, 2017

1 commit


07 Sep, 2017

1 commit

  • Pull networking updates from David Miller:

    1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

    2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

    3) Allow generic XDP to work on virtual devices, from John Fastabend.

    4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

    5) Remove UFO offloads from the tree, gave us little other than bugs.

    6) Remove the IPSEC flow cache, from Florian Westphal.

    7) Support ipv6 route offload in mlxsw driver.

    8) Support VF representors in bnxt_en, from Sathya Perla.

    9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

    10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

    11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

    12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

    13) Add new jump instructions to eBPF, from Daniel Borkmann.

    14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

    15) Support XDP in tap driver, from Jason Wang.

    16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

    17) Add Huawei hinic ethernet driver.

    18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
    i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
    i40e: avoid NVM acquire deadlock during NVM update
    drivers: net: xgene: Remove return statement from void function
    drivers: net: xgene: Configure tx/rx delay for ACPI
    drivers: net: xgene: Read tx/rx delay for ACPI
    rocker: fix kcalloc parameter order
    rds: Fix non-atomic operation on shared flag variable
    net: sched: don't use GFP_KERNEL under spin lock
    vhost_net: correctly check tx avail during rx busy polling
    net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
    rxrpc: Make service connection lookup always check for retry
    net: stmmac: Delete dead code for MDIO registration
    gianfar: Fix Tx flow control deactivation
    cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
    cxgb4: Fix pause frame count in t4_get_port_stats
    cxgb4: fix memory leak
    tun: rename generic_xdp to skb_xdp
    tun: reserve extra headroom only when XDP is set
    net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
    net: dsa: bcm_sf2: Advertise number of egress queues
    ...

    Linus Torvalds
     

04 Sep, 2017

2 commits

  • Pull documentation updates from Jonathan Corbet:
    "After a fair amount of churn in the last couple of cycles, docs are
    taking it easier this time around. Lots of fixes and some new
    documentation, but nothing all that radical. Perhaps the most
    interesting change for many is the scripts/sphinx-pre-install tool
    from Mauro; it will tell you exactly which packages you need to
    install to get a working docs toolchain on your system.

    There are two little patches reaching outside of Documentation/; both
    just tweak kerneldoc comments to eliminate warnings and fix some
    dangling doc pointers"

    * 'docs-next' of git://git.lwn.net/linux: (52 commits)
    Documentation/sphinx: fix kernel-doc decode for non-utf-8 locale
    genalloc: Fix an incorrect kerneldoc comment
    doc: Add documentation for the genalloc subsystem
    assoc_array: fix path to assoc_array documentation
    kernel-doc parser mishandles declarations split into lines
    docs: ReSTify table of contents in core.rst
    docs: process: drop git snapshots from applying-patches.rst
    Documentation:input: fix typo
    swap: Remove obsolete sentence
    sphinx.rst: Allow Sphinx version 1.6 at the docs
    docs-rst: fix verbatim font size on tables
    Documentation: stable-kernel-rules: fix broken git urls
    rtmutex: update rt-mutex
    rtmutex: update rt-mutex-design
    docs: fix minimal sphinx version in conf.py
    docs: fix nested numbering in the TOC
    NVMEM documentation fix: A minor typo
    docs-rst: pdf: use same vertical margin on all Sphinx versions
    doc: Makefile: if sphinx is not found, run a check script
    docs: Fix paths in security/keys
    ...

    Linus Torvalds
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. Basically, updates to the conntrack core, enhancements for
    nf_tables, conversion of netfilter hooks from linked list to array to
    improve memory locality and assorted improvements for the Netfilter
    codebase. More specifically, they are:

    1) Add expectation to hashes after timer initialization to prevent
    access from another CPU that walks on the hashes and calls
    del_timer(), from Florian Westphal.

    2) Don't update nf_tables chain counters from hot path, this is only
    used by the x_tables compatibility layer.

    3) Get rid of nested rcu_read_lock() calls from netfilter hook path.
    Hooks are always guaranteed to run from rcu read side, so remove
    nested rcu_read_lock() where possible. Patch from Taehee Yoo.

    4) nf_tables new ruleset generation notifications include PID and name
    of the process that has updated the ruleset, from Phil Sutter.

    5) Use skb_header_pointer() from nft_fib, so we can reuse this code from
    the nf_family netdev family. Patch from Pablo M. Bermudo.

    6) Add support for nft_fib in nf_tables netdev family, also from Pablo.

    7) Use deferrable workqueue for conntrack garbage collection, to reduce
    power consumption, patch from Subash Abhinov Kasiviswanathan.

    8) Add nf_ct_expect_iterate_net() helper and use it. From Florian
    Westphal.

    9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian.

    10) Drop references on conntrack removal path when skbuffs have escaped via
    nfqueue, from Florian.

    11) Don't queue packets to nfqueue with dying conntrack, from Florian.

    12) Constify nf_hook_ops structure, from Florian.

    13) Remove needless branch in nf_tables trace code, from Phil Sutter.

    14) Add nla_strdup(), from Phil Sutter.

    15) Raise nf_tables objects name size up to 255 chars, people want to use
    DNS names, so increase this according to what RFC 1035 specifies.
    Patch series from Phil Sutter.

    16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook
    registration on demand, suggested by Eric Dumazet, patch from Florian.

    17) Remove unused variables in compat_copy_entry_from_user both in
    ip_tables and arp_tables code. Patch from Taehee Yoo.

    18) Constify struct nf_conntrack_l4proto, from Julia Lawall.

    19) Constify nf_loginfo structure, also from Julia.

    20) Use a single rb root in connlimit, from Taehee Yoo.

    21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo.

    22) Use audit_log() instead of open-coding it, from Geliang Tang.

    23) Allow to mangle tcp options via nft_exthdr, from Florian.

    24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes
    a fix for a miscalculation of the minimal length.

    25) Simplify branch logic in h323 helper, from Nick Desaulniers.

    26) Calculate netlink attribute size for conntrack tuple at compile
    time, from Florian.

    27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure.
    From Florian.

    28) Remove holes in nf_conntrack_l4proto structure, so it becomes
    smaller. From Florian.

    29) Get rid of print_tuple() indirection for /proc conntrack listing.
    Place all the code in net/netfilter/nf_conntrack_standalone.c.
    Patch from Florian.

    30) Do not build in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is
    off. From Florian.

    31) Constify most nf_conntrack_{l3,l4}proto helper functions, from
    Florian.

    32) Fix broken indentation in ebtables extensions, from Colin Ian King.

    33) Fix several harmless sparse warnings, from Florian.

    34) Convert netfilter hook infrastructure to use array for better memory
    locality, joint work done by Florian and Aaron Conole. Moreover, add
    some instrumentation to debug this.

    35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once
    per batch, from Florian.

    36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian.

    37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao.

    38) Remove unused code in the generic protocol tracker, from Davide
    Caratti.

    I think I will have material for a second Netfilter batch in my queue if
    time allows to make it fit in this merge window.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Sep, 2017

1 commit

  • Documentation for this feature was missing from the patchset.
    Copied a lot from the netdev 2.1 paper, addressing some small
    interface changes since then.

    Changes
    v1 -> v2
    - change email discussion URL format
    - clarify that u32 counter is per-syscall, unsigned and
    wraps after UINT_MAX calls
    - describe errno on send failure specific to MSG_ZEROCOPY
    - a few very minor rewordings
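
    The wrap-around of the per-syscall counter mentioned in the changelog
    can be modelled with ordinary unsigned arithmetic. The helpers below
    are a hypothetical sketch and not part of the MSG_ZEROCOPY interface;
    they only show that a u32 counter returns to 0 after UINT32_MAX and
    that differences remain correct across the wrap.

```c
#include <stdint.h>

/* Next value of a u32 counter; UINT32_MAX is followed by 0. */
uint32_t next_counter(uint32_t id)
{
	return id + 1u;
}

/* Number of increments needed to get from 'from' to 'to', modulo 2^32. */
uint32_t calls_between(uint32_t from, uint32_t to)
{
	return to - from;
}
```

    A receiver that tracks these values should therefore compare them with
    modular arithmetic rather than with a plain "greater than".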

    Signed-off-by: Willem de Bruijn
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

31 Aug, 2017

2 commits


30 Aug, 2017

2 commits

  • Florian reported UDP xmit drops that could be traced to a neigh limit
    that is too small.

    Current limit is 64 KB, meaning that even a single UDP socket would hit
    it, since its default sk_sndbuf comes from net.core.wmem_default
    (~212992 bytes on 64bit arches).

    Once ARP/ND resolution is in progress, we should allow a few more
    packets to be queued, at least for one producer.

    Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
    limit and either block in sendmsg() or return -EAGAIN.

    Signed-off-by: Eric Dumazet
    Reported-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Explain that the patch queue in patchwork should not be touched by patch
    submitters.

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     

29 Aug, 2017

3 commits

  • Allow a client call that failed on network error to be retried, provided
    that the Tx queue still holds DATA packet 1. This allows an operation to
    be submitted to another server or another address for the same server
    without having to repackage and re-encrypt the data so far processed.

    Two new functions are provided:

    (1) rxrpc_kernel_check_call() - This is used to find out the completion
    state of a call to guess whether it can be retried and whether it
    should be retried.

    (2) rxrpc_kernel_retry_call() - Disconnect the call from its current
    connection, reset the state and submit it as a new client call to a
    new address. The new address need not match the previous address.

    A call may be retried even if all the data hasn't been loaded into it yet;
    a partially constructed call will be retained at the same point it was at when
    an error condition was detected. msg_data_left() can be used to find out
    how much data was packaged before the error occurred.

    Signed-off-by: David Howells

    David Howells
     
  • Add a callback to rxrpc_kernel_send_data() so that a kernel service can get
    a notification that the AF_RXRPC call has transitioned out of the Tx phase and
    is now waiting for a reply or a final ACK.

    This is called from AF_RXRPC with the call state lock held so the
    notification is guaranteed to come before any reply is passed back.

    Further, modify the AFS filesystem to make use of this so that we don't have
    to change the afs_call state before sending the last bit of data.

    Signed-off-by: David Howells

    David Howells
     
  • Signed-off-by: Madalin Bucur
    Signed-off-by: David S. Miller

    Madalin Bucur
     

25 Aug, 2017

2 commits

  • commit bbb03029a899 ("strparser: Generalize strparser") added more
    function pointers to 'struct strp_callbacks'; however, kcm_attach() was
    not updated to initialize them. This could cause the ->lock() and/or
    ->unlock() function pointers to be set to garbage values, causing a
    crash in strp_work().

    Fix the bug by moving the callback structs into static memory, so
    unspecified members are zeroed. Also constify them while we're at it.
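
    The effect of moving a callback struct into static memory can be shown
    with a small standalone sketch. The struct and function names here are
    hypothetical stand-ins, not the real 'struct strp_callbacks' layout;
    the point is only that members left out of a static initializer are
    zeroed, so optional hooks read as NULL instead of garbage.

```c
#include <stddef.h>

/* Hypothetical callback table with one mandatory and two optional hooks. */
struct demo_callbacks {
	int (*parse_msg)(void);
	void (*lock)(void);
	void (*unlock)(void);
};

static int demo_parse(void)
{
	return 0;
}

/* Static, const, and only partially specified: the rest is zero-filled. */
static const struct demo_callbacks demo_cb = {
	.parse_msg = demo_parse,
};

/* Returns 1 when both optional hooks are NULL, as C guarantees here. */
int demo_optional_hooks_unset(void)
{
	return demo_cb.lock == NULL && demo_cb.unlock == NULL;
}
```

    A consumer can then test each optional pointer against NULL before
    calling it, which is not safe when the struct lives on the stack and
    some members were never assigned.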

    This bug was found by syzkaller, which encountered the following splat:

    IP: 0x55
    PGD 3b1ca067
    P4D 3b1ca067
    PUD 3b12f067
    PMD 0

    Oops: 0010 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Workqueue: kstrp strp_work
    task: ffff88006bb0e480 task.stack: ffff88006bb10000
    RIP: 0010:0x55
    RSP: 0018:ffff88006bb17540 EFLAGS: 00010246
    RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000
    RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48
    RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000
    R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48
    R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000
    FS: 0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098
    worker_thread+0x223/0x1860 kernel/workqueue.c:2233
    kthread+0x35e/0x430 kernel/kthread.c:231
    ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
    Code: Bad RIP value.
    RIP: 0x55 RSP: ffff88006bb17540
    CR2: 0000000000000055
    ---[ end trace f0e4920047069cee ]---

    Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and
    CONFIG_AF_KCM=y):

    #include <linux/bpf.h>
    #include <linux/kcm.h>
    #include <linux/types.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static const struct bpf_insn bpf_insns[3] = {
            { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */
            { .code = 0x95 }, /* BPF_EXIT_INSN() */
    };

    static const union bpf_attr bpf_attr = {
            .prog_type = 1,
            .insn_cnt = 2,
            .insns = (uintptr_t)&bpf_insns,
            .license = (uintptr_t)"",
    };

    int main(void)
    {
            /* Load a trivial BPF program to use as the KCM parser. */
            int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD,
                                 &bpf_attr, sizeof(bpf_attr));
            int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
            int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0);

            /* Attaching the TCP socket to the KCM socket triggers the bug. */
            ioctl(kcm_fd, SIOCKCMATTACH,
                  &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd });
    }

    Fixes: bbb03029a899 ("strparser: Generalize strparser")
    Cc: Dmitry Vyukov
    Cc: Tom Herbert
    Signed-off-by: Eric Biggers
    Signed-off-by: David S. Miller

    Eric Biggers
     
  • Reflecting IPv6 Flow Label at server nodes is useful in environments
    that employ multipath routing to load balance the requests. As "IPv6
    Flow Label Reflection" standard draft [1] points out - ICMPv6 PTB error
    messages generated in response to downstream packets from the server
    can be routed by a load balancer back to the original server without
    looking at transport headers, if the server applies the flow label
    reflection. This enables the Path MTU Discovery past the ECMP router in
    load-balance or anycast environments where each server node is reachable
    by only one path.

    Introduce a sysctl to enable flow label reflection per net namespace for
    all newly created sockets. Previously this could be achieved only per
    socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR
    socket option.

    [1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: David S. Miller

    Jakub Sitnicki
     

24 Aug, 2017

1 commit

  • As eBPF JIT support for arm32 was added recently with
    commit 39c13c204bb1150d401e27d41a9d8b332be47c49, it seems appropriate to
    add arm32 as arch with support for eBPF JIT in bpf and sysctl docs as well.

    Signed-off-by: Shubham Bansal
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Shubham Bansal
     

23 Aug, 2017

2 commits


22 Aug, 2017

1 commit


21 Aug, 2017

1 commit


15 Aug, 2017

1 commit


11 Aug, 2017

1 commit

  • The function declarations in the latest include/net/mac802154.h have
    changed since v3.19.

    ieee802154_alloc_device => ieee802154_alloc_hw
    ieee802154_free_device => ieee802154_free_hw
    ieee802154_register_device => ieee802154_register_hw
    ieee802154_unregister_device => ieee802154_unregister_hw

    However, the description in the Device drivers API section of
    Documentation/networking/ieee802154.txt is still in the state of
    v3.18.63.

    Signed-off-by: Jian-Hong Pan
    Acked-by: Stefan Schmidt
    Signed-off-by: Jonathan Corbet

    Jian-Hong Pan
     

10 Aug, 2017

1 commit

  • Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
    BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
    particularly *JLT/*JLE counterparts involving immediates need
    to be rewritten from e.g. X < [IMM] by swapping arguments into
    [IMM] > X, meaning the immediate first is required to be loaded
    into a register Y := [IMM], such that then we can compare with
    Y > X. Note that the destination operand is always required to
    be a register.

    This has the downside of having unnecessarily increased register
    pressure, meaning complex programs would need to spill other
    registers temporarily to stack in order to obtain an unused
    register for the [IMM]. Loading to registers will thus also
    affect state pruning since we need to account for that register
    use and potentially those registers that had to be spilled/filled
    again. As a consequence slightly more stack space might have
    been used due to spilling, and BPF programs are a bit longer
    due to extra code involving the register load and potentially
    required spill/fills.
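
    The rewrite described above can be modelled in plain C. This is a
    hypothetical illustration of the comparison logic only, not BPF
    bytecode or verifier code: the first helper mimics the old lowering,
    where the immediate is first loaded into a scratch register so that a
    "greater than" jump can be used, and the second shows the direct
    "less than" form.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old lowering: X < [IMM] expressed as Y := [IMM]; Y > X. */
bool lt_via_jgt(uint64_t x, uint64_t imm)
{
	uint64_t y = imm;	/* extra register load for the immediate */

	return y > x;		/* "greater than" with swapped operands */
}

/* Direct form: the comparison a "less than" jump would express. */
bool lt_direct(uint64_t x, uint64_t imm)
{
	return x < imm;
}
```

    Both helpers give the same answer for every input; the difference is
    the extra register that the first form needs to hold the immediate.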

    Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
    counterparts to the eBPF instruction set.

    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Aug, 2017

1 commit


04 Aug, 2017

1 commit

  • Simon Wunderlich says:

    ====================
    This feature/cleanup patchset includes the following patches:

    - bump version strings, by Simon Wunderlich

    - Remove unnecessary length qualifier, by Joe Perches

    - Remove too short %pM field width, by Sven Eckelmann

    - Remove return value handling from skb_put_data, by Sven Eckelmann

    - Spelling fixes, by Colin Ian King

    - Convert batman-adv.txt to reStructuredText, by Sven Eckelmann
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Aug, 2017

1 commit


02 Aug, 2017

1 commit

  • Generalize strparser for more than just use in conjunction
    with read_sock. strparser will also be used in the send path with
    zero proxy. The primary change is to create strp_process function
    that performs the critical processing on skbs. The documentation
    is also updated to reflect the new uses.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

01 Aug, 2017

2 commits

  • Was only checked by the removed prequeue code.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Discussion during NFWS 2017 in Faro has shown that the current
    conntrack behaviour is unreasonable.

    Even if conntrack module is loaded on behalf of a single net namespace,
    it's turned on for all namespaces, which is expensive. Commit
    481fa373476 ("netfilter: conntrack: add nf_conntrack_default_on sysctl")
    attempted to provide an alternative to the 'default on' behaviour by
    adding a sysctl to change it.

    However, as Eric points out, the sysctl only becomes available
    once the module is loaded, and then it's too late.

    So we either have to move the sysctl to the core, or, alternatively,
    change conntrack to become active only once the rule set requires this.

    This does the latter, conntrack is only enabled when a rule needs it.

    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

29 Jul, 2017

1 commit


19 Jul, 2017

1 commit

  • revert c386578f1cdb4dac230395 ("xfrm: Let the flowcache handle its size by default.").

    Once we remove flow cache, we don't have a flow cache limit anymore.
    We must not allow (virtually) unlimited allocations of xfrm dst entries.
    Revert back to the old xfrm dst gc limits.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

12 Jul, 2017

1 commit

  • Passing (void*)val instead of &val would make a pointer out of an integer
    and cause sock_setsockopt to -EFAULT.

    See tools/testing/selftests/networking/timestamping/timestamping.c
    for a working example.

    Cc: David S. Miller
    Cc: netdev@vger.kernel.org
    Signed-off-by: Ahmad Fatoum
    Signed-off-by: David S. Miller

    Ahmad Fatoum