10 Jan, 2019

1 commit

  • [ Upstream commit 3a0ed3e9619738067214871e9cb826fa23b2ddb9 ]

    Al Viro mentioned (Message-ID ) that there is probably a race
    condition lurking in accesses of sk_stamp on 32-bit machines.

    sock->sk_stamp is of type ktime_t, which is always an s64.
    On a 32-bit architecture, we might run into situations of
    unsafe access, as access to the field becomes non-atomic.

    Use a seqlock for synchronization.
    This allows us to avoid using spinlocks for readers, as
    readers do not need mutual exclusion (see the sketch at the
    end of this entry).

    Another approach to solve this is to require sk_lock for all
    modifications of the timestamps. The current approach allows
    for timestamps to have their own lock: sk_stamp_lock.
    This allows for the patch to not compete with already
    existing critical sections, and side effects are limited
    to the paths in the patch.

    The addition of the new field maintains the data locality
    optimizations from
    commit 9115e8cd2a0c ("net: reorganize struct sock for better data
    locality")

    Note that all the instances of the sk_stamp accesses
    are either through the ioctl or the syscall recvmsg.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Deepa Dinamani
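
    A minimal sketch of the seqlock pattern described above; the structure and
    helper names are illustrative, not the exact upstream code (which only
    needs the seqlock on 32-bit builds):

    #include <linux/ktime.h>
    #include <linux/seqlock.h>

    struct demo_sock {
            seqlock_t stamp_lock;   /* guards the 64-bit stamp on 32-bit CPUs */
            ktime_t   stamp;        /* s64 under the hood */
    };

    static ktime_t demo_read_stamp(const struct demo_sock *dsk)
    {
            unsigned int seq;
            ktime_t kt;

            do {
                    seq = read_seqbegin(&dsk->stamp_lock);
                    kt = dsk->stamp;
            } while (read_seqretry(&dsk->stamp_lock, seq));

            return kt;
    }

    static void demo_write_stamp(struct demo_sock *dsk, ktime_t kt)
    {
            write_seqlock(&dsk->stamp_lock);
            dsk->stamp = kt;
            write_sequnlock(&dsk->stamp_lock);
    }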
     

01 Dec, 2018

1 commit

  • commit 8873c064d1de579ea23412a6d3eee972593f142b upstream.

    syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
    in tcp_close()

    While a socket is being closed, it is quite possible that other
    threads find it in an rtnetlink dump.

    tcp_get_info() will acquire the socket lock for a short amount
    of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
    which is enough to trigger the warning (see the sketch below).

    Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
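
    For reference, a sketch of the lock_sock_fast() pattern mentioned above,
    as used by tcp_get_info() to briefly own the socket (illustrative only):

    #include <net/sock.h>

    static void demo_read_sock_state(struct sock *sk)
    {
            bool slow;

            slow = lock_sock_fast(sk);      /* may or may not take the slow path */
            /* ... copy out socket state while owning the socket ... */
            unlock_sock_fast(sk, slow);     /* release accordingly */
    }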
     

22 Feb, 2018

1 commit

  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

17 Dec, 2017

1 commit

  • [ Upstream commit d7efc6c11b277d9d80b99b1334a78bfe7d7edf10 ]

    Alexander Potapenko reported use of uninitialized memory [1]

    This happens when inserting a request socket into TCP ehash,
    in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized.

    The bug was added by commit d894ba18d4e4 ("soreuseport: fix ordering for
    mixed v4/v6 sockets").

    Note that d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6
    ordering fix") missed the opportunity to get rid of
    hlist_nulls_add_tail_rcu():

    Both UDP sockets and TCP/DCCP listeners no longer use
    __sk_nulls_add_node_rcu() for their hash insertion.

    Since all other sockets have a unique 4-tuple, the reuseport status
    has no special meaning there, so we can always use hlist_nulls_add_head_rcu()
    for them and save a few cycles/instructions (see the sketch below).

    [1]

    ==================================================================
    BUG: KMSAN: use of uninitialized memory in inet_ehash_insert+0xd40/0x1050
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3288
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     
     __dump_stack lib/dump_stack.c:16
     dump_stack+0x185/0x1d0 lib/dump_stack.c:52
     kmsan_report+0x13f/0x1c0 mm/kmsan/kmsan.c:1016
     __msan_warning_32+0x69/0xb0 mm/kmsan/kmsan_instr.c:766
     __sk_nulls_add_node_rcu ./include/net/sock.h:684
     inet_ehash_insert+0xd40/0x1050 net/ipv4/inet_hashtables.c:413
     reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:754
     inet_csk_reqsk_queue_hash_add+0x1cc/0x300 net/ipv4/inet_connection_sock.c:765
     tcp_conn_request+0x31e7/0x36f0 net/ipv4/tcp_input.c:6414
     tcp_v4_conn_request+0x16d/0x220 net/ipv4/tcp_ipv4.c:1314
     tcp_rcv_state_process+0x42a/0x7210 net/ipv4/tcp_input.c:5917
     tcp_v4_do_rcv+0xa6a/0xcd0 net/ipv4/tcp_ipv4.c:1483
     tcp_v4_rcv+0x3de0/0x4ab0 net/ipv4/tcp_ipv4.c:1763
     ip_local_deliver_finish+0x6bb/0xcb0 net/ipv4/ip_input.c:216
     NF_HOOK ./include/linux/netfilter.h:248
     ip_local_deliver+0x3fa/0x480 net/ipv4/ip_input.c:257
     dst_input ./include/net/dst.h:477
     ip_rcv_finish+0x6fb/0x1540 net/ipv4/ip_input.c:397
     NF_HOOK ./include/linux/netfilter.h:248
     ip_rcv+0x10f6/0x15c0 net/ipv4/ip_input.c:488
     __netif_receive_skb_core+0x36f6/0x3f60 net/core/dev.c:4298
     __netif_receive_skb net/core/dev.c:4336
     netif_receive_skb_internal+0x63c/0x19c0 net/core/dev.c:4497
     napi_skb_finish net/core/dev.c:4858
     napi_gro_receive+0x629/0xa50 net/core/dev.c:4889
     e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4018
     e1000_clean_rx_irq+0x1492/0x1d30
    drivers/net/ethernet/intel/e1000/e1000_main.c:4474
     e1000_clean+0x43aa/0x5970 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
     napi_poll net/core/dev.c:5500
     net_rx_action+0x73c/0x1820 net/core/dev.c:5566
     __do_softirq+0x4b4/0x8dd kernel/softirq.c:284
     invoke_softirq kernel/softirq.c:364
     irq_exit+0x203/0x240 kernel/softirq.c:405
     exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:638
     do_IRQ+0x15e/0x1a0 arch/x86/kernel/irq.c:263
     common_interrupt+0x86/0x86

    Fixes: d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
    Fixes: d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix")
    Signed-off-by: Eric Dumazet
    Reported-by: Alexander Potapenko
    Acked-by: Craig Gallek
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
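
    A simplified sketch of the resulting insertion helper described above
    (names other than hlist_nulls_add_head_rcu() are illustrative):

    #include <linux/rculist_nulls.h>
    #include <net/sock.h>

    static inline void demo_sk_nulls_add_node_rcu(struct sock *sk,
                                                  struct hlist_nulls_head *list)
    {
            /* Previously the insert position depended on sk->sk_reuseport,
             * which is uninitialized for request sockets.  For TCP ehash the
             * 4-tuple is unique, so head insertion is always correct. */
            hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
    }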
     

22 Sep, 2017

1 commit

  • In linux-4.13, Wei worked hard to convert dst to a traditional
    refcounted model, removing GC.

    We now want to make sure a dst refcount can not transition from 0 back
    to 1.

    The problem here is that the input path attached a non-refcounted dst to
    an skb. Later, because the packet is forwarded and hits skb_dst_force()
    before exiting the RCU section, we might try to take a refcount on a dst
    that is about to be freed, if another cpu saw the 1 -> 0 transition in
    dst_release() and queued the dst for freeing after one RCU grace period.

    Let's unify skb_dst_force() and skb_dst_force_safe(), since we should
    always perform the complete check against the dst refcount, and not assume
    it is not zero (see the sketch below).

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=197005

    [ 989.919496] skb_dst_force+0x32/0x34
    [ 989.919498] __dev_queue_xmit+0x1ad/0x482
    [ 989.919501] ? eth_header+0x28/0xc6
    [ 989.919502] dev_queue_xmit+0xb/0xd
    [ 989.919504] neigh_connected_output+0x9b/0xb4
    [ 989.919507] ip_finish_output2+0x234/0x294
    [ 989.919509] ? ipt_do_table+0x369/0x388
    [ 989.919510] ip_finish_output+0x12c/0x13f
    [ 989.919512] ip_output+0x53/0x87
    [ 989.919513] ip_forward_finish+0x53/0x5a
    [ 989.919515] ip_forward+0x2cb/0x3e6
    [ 989.919516] ? pskb_trim_rcsum.part.9+0x4b/0x4b
    [ 989.919518] ip_rcv_finish+0x2e2/0x321
    [ 989.919519] ip_rcv+0x26f/0x2eb
    [ 989.919522] ? vlan_do_receive+0x4f/0x289
    [ 989.919523] __netif_receive_skb_core+0x467/0x50b
    [ 989.919526] ? tcp_gro_receive+0x239/0x239
    [ 989.919529] ? inet_gro_receive+0x226/0x238
    [ 989.919530] __netif_receive_skb+0x4d/0x5f
    [ 989.919532] netif_receive_skb_internal+0x5c/0xaf
    [ 989.919533] napi_gro_receive+0x45/0x81
    [ 989.919536] ixgbe_poll+0xc8a/0xf09
    [ 989.919539] ? kmem_cache_free_bulk+0x1b6/0x1f7
    [ 989.919540] net_rx_action+0xf4/0x266
    [ 989.919543] __do_softirq+0xa8/0x19d
    [ 989.919545] irq_exit+0x5d/0x6b
    [ 989.919546] do_IRQ+0x9c/0xb5
    [ 989.919548] common_interrupt+0x93/0x93
    [ 989.919548]

    Similarly dst_clone() can use dst_hold() helper to have additional
    debugging, as a follow up to commit 44ebe79149ff ("net: add debug
    atomic_inc_not_zero() in dst_hold()")

    In net-next we will convert dst atomic_t to refcount_t for peace of
    mind.

    Fixes: a4c2fd7f7891 ("net: remove DST_NOCACHE flag")
    Signed-off-by: Eric Dumazet
    Cc: Wei Wang
    Reported-by: Paweł Staszewski
    Bisected-by: Paweł Staszewski
    Acked-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Eric Dumazet
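
    A rough sketch of the unified skb_dst_force() behaviour described above:
    keep the reference only if dst_hold_safe() (an atomic_inc_not_zero) wins
    the race, otherwise drop the dst. This is a sketch, not the exact patch:

    #include <linux/skbuff.h>
    #include <net/dst.h>

    static inline void demo_skb_dst_force(struct sk_buff *skb)
    {
            if (skb_dst_is_noref(skb)) {
                    struct dst_entry *dst = skb_dst(skb);

                    if (!dst_hold_safe(dst))
                            dst = NULL;     /* refcount already hit zero */

                    skb->_skb_refdst = (unsigned long)dst;
            }
    }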
     

30 Aug, 2017

1 commit

  • Florian reported UDP xmit drops that could be root caused to the
    too small neigh limit.

    The current limit is 64 KB, meaning that even a single UDP socket would hit
    it, since its default sk_sndbuf comes from net.core.wmem_default
    (~212992 bytes on 64-bit arches).

    Once ARP/ND resolution is in progress, we should allow a few more
    packets to be queued, at least for one producer.

    Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
    limit and either block in sendmsg() or return -EAGAIN.

    Signed-off-by: Eric Dumazet
    Reported-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Aug, 2017

1 commit


19 Aug, 2017

1 commit

  • Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
    headers from UDP packets before queueing"), when udp packets are being
    peeked the requested extra offset is always 0 as there is no need to skip
    the udp header. However, when the offset is 0 and the next skb is
    of length 0, it is only returned once. The behaviour can be seen with
    the following python script:

    from socket import *;
    f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    f.bind(('::', 0));
    addr=('::1', f.getsockname()[1]);
    g.sendto(b'', addr)
    g.sendto(b'b', addr)
    print(f.recvfrom(10, MSG_PEEK));
    print(f.recvfrom(10, MSG_PEEK));

    Where the expected output should be the empty string twice.

    Instead, make sk_peek_offset return negative values, and pass those values
    to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the offset passed
    to __skb_try_recv_from_queue is negative, the checked skb is never skipped
    (see the conceptual sketch at the end of this entry).
    __skb_try_recv_from_queue will then ensure the offset is reset back to 0
    if a peek is requested without an offset, unless no packets are found.

    Also simplify the if condition in __skb_try_recv_from_queue. If _off is
    greater than 0, and off is greater than or equal to skb->len, then
    (_off || skb->len) must always be true, assuming skb->len >= 0 is always
    true.

    Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
    as it double checked if MSG_PEEK was set in the flags.

    V2:
    - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
    redundant checks
    - Fix peeking in udp{,v6}_recvmsg to report the right value when the
    offset is 0

    V3:
    - Marked new branch in __skb_try_recv_from_queue as unlikely.

    Signed-off-by: Matthew Dawson
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Matthew Dawson
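
    A conceptual sketch of the negative-offset rule described above (not the
    exact kernel code): a negative peek offset means "never skip", so a
    zero-length datagram at the head of the queue is still returned:

    #include <linux/skbuff.h>

    static struct sk_buff *demo_peek(struct sk_buff_head *queue, int off)
    {
            struct sk_buff *skb;

            skb_queue_walk(queue, skb) {
                    if (off > 0 && off >= skb->len) {
                            off -= skb->len;        /* already peeked past */
                            continue;
                    }
                    return skb;                     /* off < 0: take the first skb */
            }
            return NULL;
    }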
     

04 Aug, 2017

2 commits

  • The kernel supports zerocopy sendmsg in virtio and tap. Expand the
    infrastructure to support other socket types. Introduce a completion
    notification channel over the socket error queue. Notifications are
    returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
    blocking the send/recv path on receiving notifications.

    Add reference counting, to support the skb split, merge, resize and
    clone operations possible with SOCK_STREAM and other socket types.

    The patch does not yet modify any datapaths. (A hedged userspace sketch of
    consuming these notifications appears after this date's entries.)

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Add sock_omalloc and sock_ofree to be able to allocate control skbs,
    for instance for looping errors onto sk_error_queue.

    The transmit budget (sk_wmem_alloc) is involved in transmit skb
    shaping, most notably in TCP Small Queues. Using this budget for
    control packets would impact transmission.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
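
    Referring back to the zerocopy notification entry above, a hedged
    userspace sketch of draining such completion notifications from the error
    queue; the constants follow the description above, details may differ:

    #include <linux/errqueue.h>
    #include <string.h>
    #include <sys/socket.h>

    static void demo_drain_zerocopy_notifications(int fd)
    {
            char control[128];
            struct msghdr msg;
            struct cmsghdr *cm;
            struct sock_extended_err *serr;

            memset(&msg, 0, sizeof(msg));
            msg.msg_control = control;
            msg.msg_controllen = sizeof(control);

            if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
                    return;                         /* nothing queued */

            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                    serr = (struct sock_extended_err *)CMSG_DATA(cm);
                    if (serr->ee_errno == 0 &&
                        serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                            /* serr->ee_info .. serr->ee_data identify the
                             * completed sends */
                    }
            }
    }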
     

02 Aug, 2017

1 commit

  • Add new proto_ops sendmsg_locked and sendpage_locked that can be
    called when the socket lock is already held. Correspondingly, add
    kernel_sendmsg_locked and kernel_sendpage_locked as front end
    functions.

    These functions will be used in zero proxy so that we can take
    the socket lock in a ULP sendmsg/sendpage and then directly call the
    backend transport proto_ops functions (see the sketch below).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
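
    A sketch of the calling convention described above; the surrounding ULP
    code is hypothetical:

    #include <linux/net.h>
    #include <net/sock.h>

    static int demo_ulp_send(struct sock *sk, struct kvec *vec, size_t len)
    {
            struct msghdr msg = { };
            int ret;

            lock_sock(sk);
            /* ... ULP bookkeeping that must stay under the socket lock ... */
            ret = kernel_sendmsg_locked(sk, &msg, vec, 1, len);
            release_sock(sk);

            return ret;
    }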
     

19 Jul, 2017

1 commit

  • Pull structure randomization updates from Kees Cook:
    "Now that IPC and other changes have landed, enable manual markings for
    randstruct plugin, including the task_struct.

    This is the rest of what was staged in -next for the gcc-plugins, and
    comes in three patches, largest first:

    - mark "easy" structs with __randomize_layout

    - mark task_struct with an optional anonymous struct to isolate the
    __randomize_layout section

    - mark structs to opt _out_ of automated marking (which will come
    later)

    And, FWIW, this continues to pass allmodconfig (normal and patched to
    enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
    s390 for me"

    * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    randstruct: opt-out externally exposed function pointer structs
    task_struct: Allow randomized layout
    randstruct: Mark various structs for randomization

    Linus Torvalds
     

13 Jul, 2017

1 commit


08 Jul, 2017

1 commit

    sock_graft() unilaterally sets up parent->sk based on the
    assumption that the existing parent->sk is NULL. If this
    condition is not true, the existing parent->sk would
    be leaked, so add a WARN_ON() to alert callers who may fall
    into this category (see the sketch below).

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
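
    A simplified sketch of the check being added; the real sock_graft() also
    transfers the wait queue and a few other fields:

    #include <net/sock.h>

    static inline void demo_sock_graft(struct sock *sk, struct socket *parent)
    {
            WARN_ON(parent->sk);    /* grafting over a live sk would leak it */

            write_lock_bh(&sk->sk_callback_lock);
            parent->sk = sk;
            sk_set_socket(sk, parent);
            write_unlock_bh(&sk->sk_callback_lock);
    }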
     

06 Jul, 2017

1 commit

  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP for cases where using the sch_fq
    packet scheduler for this is not practical. From Eric Dumazet.

    3) Support multiple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brakmo. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

01 Jul, 2017

3 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the hint() version must
    be used, we might need to revisit the API. (A generic sketch of the
    conversion pattern follows after this date's entries.)

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • This marks many critical kernel structures for randomization. These are
    structures that have been targeted in the past in security exploits, or
    contain function pointers, pointers to function pointer tables, lists,
    workqueues, ref-counters, credentials, permissions, or are otherwise
    sensitive. This initial list was extracted from Brad Spengler/PaX Team's
    code in the last public patch of grsecurity/PaX based on my understanding
    of the code. Changes or omissions from the original code are mine and
    don't reflect the original grsecurity/PaX code.

    Left out of this list is task_struct, which requires special handling
    and will be covered in a subsequent patch.

    Signed-off-by: Kees Cook

    Kees Cook
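
    Referring back to the refcount_t conversion entries above, a generic
    sketch of the pattern; the object and helpers here are illustrative, not
    any specific converted field:

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct demo_obj {
            refcount_t refcnt;
    };

    static bool demo_obj_get(struct demo_obj *obj)
    {
            /* replaces atomic_inc_not_zero()/atomic_inc_not_zero_hint() */
            return refcount_inc_not_zero(&obj->refcnt);
    }

    static void demo_obj_put(struct demo_obj *obj)
    {
            /* saturates and warns on over/underflow instead of wrapping */
            if (refcount_dec_and_test(&obj->refcnt))
                    kfree(obj);
    }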
     

21 Jun, 2017

1 commit

    For a connected socket, the incoming_cpu field in the sock struct
    is not going to change frequently, but we are setting it
    unconditionally for each packet.

    Since sk_incoming_cpu and sk_flags share the same cacheline,
    and the latter is accessed by udp_recvmsg(), this causes a cache
    miss for each packet on a connected UDP socket.

    With this patch, we set the incoming cpu field only when the
    ingress cpu really changes (see the sketch below).

    This gives a small but measurable performance improvement for
    connected UDP sockets.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
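
    A sketch of the change described above: only dirty the cacheline when the
    ingress CPU actually changed (helper name illustrative):

    #include <linux/smp.h>
    #include <net/sock.h>

    static inline void demo_sk_incoming_cpu_update(struct sock *sk)
    {
            int cpu = raw_smp_processor_id();

            if (unlikely(sk->sk_incoming_cpu != cpu))
                    sk->sk_incoming_cpu = cpu;
    }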
     

08 Jun, 2017

1 commit

  • DRAM supply shortage and poor memory pressure tracking in TCP
    stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
    limits) and tcp_mem[] quite hazardous.

    TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
    limits being hit, but only tracking number of transitions.

    If TCP stack behavior under stress were perfect:
    1) It would maintain memory usage close to the limit.
    2) Memory pressure state would be entered for short times.

    We certainly prefer 100 events lasting 10ms compared to one event
    lasting 200 seconds.

    This patch adds a new SNMP counter tracking cumulative duration of
    memory pressure events, given in ms units.

    $ cat /proc/sys/net/ipv4/tcp_mem
    3088 4117 6176
    $ grep TCP /proc/net/sockstat
    TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
    $ nstat -n ; sleep 10 ; nstat |grep Pressure
    TcpExtTCPMemoryPressures 1700
    TcpExtTCPMemoryPressuresChrono 5209

    v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
    instructed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 May, 2017

1 commit

  • Mauro says:

    This patch series convert the remaining DocBooks to ReST.

    The first version was originally
    sent as 3 patch series:

    [PATCH 00/36] Convert DocBook documents to ReST
    [PATCH 0/5] Convert more books to ReST
    [PATCH 00/13] Get rid of DocBook

    The lsm book was added as if it were a text file under
    Documentation. The plan is to merge it with another file
    under Documentation/security, after both this series and
    a security Documentation patch series gets merged.

    It also adjusts some Sphinx-pedantic errors/warnings on
    some kernel-doc markups.

    I also added some patches here to add PDF output for all
    existing ReST books.

    Jonathan Corbet
     

17 May, 2017

2 commits

    BBR congestion control depends on pacing, and pacing is
    currently handled by the sch_fq packet scheduler for performance reasons,
    and also because implementing pacing with FQ was convenient to truly
    avoid bursts.

    However there are many cases where this packet scheduler constraint
    is not practical.
    - Many linux hosts are not focusing on handling thousands of TCP
    flows in the most efficient way.
    - Some routers use fq_codel or other AQM, but still would like
    to use BBR for the few TCP flows they initiate/terminate.

    This patch implements an automatic fallback to internal pacing.

    Pacing is requested either by BBR or by use of the SO_MAX_PACING_RATE option.

    If sch_fq happens to be in the egress path, pacing is delegated to
    the qdisc, otherwise pacing is done by TCP itself (see the sketch after
    this date's entries).

    One advantage of pacing from the TCP stack is to get more precise rtt
    estimations, and less work done from TX completion, since TCP Small
    Queues limits are not generally hit. Setups with a single TX queue but
    many cpus might even benefit from this.

    Note that unlike sch_fq, we do not take into account header sizes.
    Taking care of these headers would add additional complexity for
    no practical differences in behavior.

    Some performance numbers using 800 TCP_STREAM flows rate limited to
    ~48 Mbit per second on 40Gbit NIC.

    If MQ+pfifo_fast is used on the NIC :

    $ sar -n DEV 1 5 | grep eth
    14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
    14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
    14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
    14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
    14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
    Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
    4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
    0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
    1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
    1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0

    Now use MQ+FQ :

    lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
    lpaa23:~# tc qdisc replace dev eth0 root mq

    $ sar -n DEV 1 5 | grep eth
    14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
    14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
    14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
    14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
    14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
    Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
    1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
    0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
    4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
    3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0

    As expected, number of interrupts per second is very different.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • And update __sk_queue_drop_skb() to work on the specified queue.
    This will help the udp protocol to use an additional private
    rx queue in a later patch.

    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
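
    Referring back to the internal-pacing entry above, a sketch of the
    fallback decision; the sk_pacing_status values follow the upstream design,
    but treat the exact names here as an assumption:

    #include <net/sock.h>

    static bool demo_needs_internal_pacing(const struct sock *sk)
    {
            /* sch_fq marks the socket as SK_PACING_FQ when it takes over;
             * otherwise TCP paces by itself once pacing was requested. */
            return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
    }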
     

16 May, 2017

1 commit

  • Sphinx is very pedantic with regards to identation and
    escape sequences:

    ./include/net/sock.h:1967: ERROR: Unexpected indentation.
    ./include/net/sock.h:1969: ERROR: Unexpected indentation.
    ./include/net/sock.h:1970: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./include/net/sock.h:1971: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./include/net/sock.h:2268: WARNING: Inline emphasis start-string without end-string.
    ./net/core/sock.c:2686: ERROR: Unexpected indentation.
    ./net/core/sock.c:2687: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./net/core/datagram.c:182: WARNING: Inline emphasis start-string without end-string.
    ./include/linux/netdevice.h:1444: ERROR: Unexpected indentation.
    ./drivers/net/phy/phy.c:381: ERROR: Unexpected indentation.
    ./drivers/net/phy/phy.c:382: WARNING: Block quote ends without a blank line; unexpected unindent.

    - Fix spacing where needed;
    - Properly escape constants;
    - Use a literal block for a race description.

    No functional changes.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

23 Apr, 2017

1 commit


19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion (see the revalidation
    sketch below).

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
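
    A hedged sketch of what the renamed flag does and does not guarantee: the
    memory stays valid memory of the same type across an RCU read-side
    critical section, but the object may have been freed and reused, so it
    must be revalidated after a reference is taken. All demo_* names are
    hypothetical:

    #include <linux/rcupdate.h>
    #include <linux/refcount.h>
    #include <linux/types.h>

    struct demo_obj {
            u32 key;
            refcount_t refcnt;
    };

    struct demo_obj *demo_hash_find(u32 key);       /* hypothetical RCU lookup */
    void demo_obj_put(struct demo_obj *obj);        /* hypothetical ref drop */

    static struct demo_obj *demo_lookup(u32 key)
    {
            struct demo_obj *obj;

            rcu_read_lock();
            obj = demo_hash_find(key);
            if (obj && !refcount_inc_not_zero(&obj->refcnt))
                    obj = NULL;                     /* object is being freed */
            if (obj && obj->key != key) {           /* slab block was reused */
                    demo_obj_put(obj);
                    obj = NULL;
            }
            rcu_read_unlock();

            return obj;
    }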
     

03 Apr, 2017

1 commit


31 Mar, 2017

1 commit

    sock_recv_ts_and_drops() unconditionally sets sk->sk_stamp for
    every packet, even if the SOCK_TIMESTAMP flag is not set on the
    related socket.
    If SELinux is enabled, this causes a cache miss for every packet,
    since sk->sk_stamp and sk->sk_security share the same cacheline.
    With this change sk_stamp is set only if the SOCK_TIMESTAMP
    flag is set, and is cleared for the first packet, so that the
    user-perceived behavior is unchanged (see the sketch below).

    This gives up to 5% speed-up under udp-flood with small packets.

    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
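
    A sketch of the behaviour described above (simplified; the real helper
    also handles the first-packet reset mentioned in the log):

    #include <net/sock.h>

    static inline void demo_sock_recv_ts(struct sock *sk, struct sk_buff *skb)
    {
            if (sock_flag(sk, SOCK_TIMESTAMP))
                    sk->sk_stamp = skb->tstamp;     /* only then touch the field */
    }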
     

23 Mar, 2017

1 commit

    Allows reading of SK_MEMINFO_VARS via a socket option. This way an
    application can get all meminfo-related information in a single socket
    option call instead of multiple calls (see the userspace sketch below).

    Adds a helper function, sk_get_meminfo(), and uses it for both
    getsockopt and sock_diag_put_meminfo().

    Suggested by Eric Dumazet.

    Signed-off-by: Josh Hunt
    Reviewed-by: Jason Baron
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Josh Hunt
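
    A hedged userspace sketch of the single-call read described above; the
    option name SO_MEMINFO is what eventually landed upstream, but treat it
    (and header availability) as an assumption:

    #include <linux/sock_diag.h>    /* SK_MEMINFO_* indices */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>

    static void demo_dump_meminfo(int fd)
    {
            uint32_t mem[SK_MEMINFO_VARS];
            socklen_t len = sizeof(mem);

            if (getsockopt(fd, SOL_SOCKET, SO_MEMINFO, mem, &len) == 0)
                    printf("rmem_alloc=%u rcvbuf=%u wmem_alloc=%u\n",
                           mem[SK_MEMINFO_RMEM_ALLOC],
                           mem[SK_MEMINFO_RCVBUF],
                           mem[SK_MEMINFO_WMEM_ALLOC]);
    }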
     

16 Mar, 2017

1 commit


10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set are
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_rcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

09 Mar, 2017

1 commit


03 Mar, 2017

1 commit

  • When handling problems in cloning a socket with the sk_clone_locked()
    function we need to perform several steps that were open coded in it and
    its callers, so introduce a routine to avoid this duplication:
    sk_free_unlock_clone().

    Cc: Cong Wang
    Cc: Dmitry Vyukov
    Cc: Eric Dumazet
    Cc: Gerrit Renker
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/n/net-ui6laqkotycunhtmqryl9bfx@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

08 Feb, 2017

4 commits

  • The conflict was an interaction between a bug fix in the
    netvsc driver in 'net' and an optimization of the RX path
    in 'net-next'.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Add new skbuff flag to allow protocols to confirm neighbour.
    When the same struct dst_entry can be used for many different
    neighbours, we cannot use it for pending confirmations.

    Add sock_confirm_neigh() helper to confirm the neighbour and
    use it for IPv4, IPv6 and VRF before dst_neigh_output.

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Add new sock flag to allow sockets to confirm neighbour.
    When the same struct dst_entry can be used for many different
    neighbours, we cannot use it for pending confirmations.
    As not all call paths lock the socket, use a full word for
    the flag.

    Add sk_dst_confirm as replacement for dst_confirm when
    called for received packets.

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Dmitry reported that UDP sockets being destroyed would trigger the
    WARN_ON(atomic_read(&sk->sk_rmem_alloc)); in inet_sock_destruct()

    It turns out we do not properly destroy skb(s) that have a wrong UDP
    checksum.

    Thanks again to syzkaller team.

    Fixes: 7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Cc: Hannes Frederic Sowa
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jan, 2017

1 commit

  • Slava Shwartsman reported a warning in skb_try_coalesce(), when we
    detect skb->truesize is completely wrong.

    In his case, the issue came from IPv6 reassembly coping with malicious
    datagrams that forced various pskb_may_pull() calls to reallocate a bigger
    skb->head than the one allocated by the NIC driver before entering the GRO
    layer.

    Current code does not change skb->truesize, leaving this burden to
    callers if they care enough.

    Blindly changing skb->truesize in pskb_expand_head() is not
    easy, as some producers might track skb->truesize, for example
    in xmit path for back pressure feedback (sk->sk_wmem_alloc)

    We can detect the cases where it should be safe to change
    skb->truesize (see the sketch below):

    1) skb is not attached to a socket.
    2) If it is attached to a socket, the destructor is sock_edemux().

    My audit gave only two callers doing their own skb->truesize
    manipulation.

    I had to remove the skb parameter from the sock_edemux macro when
    CONFIG_INET is not set, to avoid a compile error.

    Signed-off-by: Eric Dumazet
    Reported-by: Slava Shwartsman
    Signed-off-by: David S. Miller

    Eric Dumazet
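
    A sketch of the safety test described above: truesize may be fixed up only
    when no socket is accounting for the skb, or when the destructor is
    sock_edemux() (helper name illustrative):

    #include <linux/skbuff.h>
    #include <net/sock.h>

    static inline bool demo_can_adjust_truesize(const struct sk_buff *skb)
    {
            return !skb->sk || skb->destructor == sock_edemux;
    }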
     

21 Jan, 2017

1 commit