01 Aug, 2018

1 commit

  • We never use RCU protection for it, just a lot of cargo-cult
    rcu_dereference_protected calls.

    Note that we do keep the kfree_rcu call for it, as the references through
    struct sock are RCU protected and thus might require a grace period before
    freeing.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Dumazet
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Jun, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Add Maglev hashing scheduler to IPVS, from Inju Song.

    2) Lots of new TC subsystem tests from Roman Mashak.

    3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

    4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

    5) Add ttl inherit support to vxlan, from Hangbin Liu.

    6) Properly separate ipv6 routes into their logically independent
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

    7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

    8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

    9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

    10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

    11) Plumb extack down into fib_rules, from Roopa Prabhu.

    12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

    13) Add UDP GSO support, from Willem de Bruijn.

    14) Add documentation for eBPF helpers, from Quentin Monnet.

    15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

    16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

    17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

    18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

    19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

    20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

    21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

    22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

    23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

    24) Add TCP SACK compression, from Eric Dumazet.

    25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

    26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

    27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

    28) Add lots of forwarding selftests, from Petr Machata.

    29) Add generic network device failover driver, from Sridhar Samudrala.

    * ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
    strparser: Add __strp_unpause and use it in ktls.
    rxrpc: Fix terminal retransmission connection ID to include the channel
    net: hns3: Optimize PF CMDQ interrupt switching process
    net: hns3: Fix for VF mailbox receiving unknown message
    net: hns3: Fix for VF mailbox cannot receiving PF response
    bnx2x: use the right constant
    Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
    net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
    enic: fix UDP rss bits
    netdev-FAQ: clarify DaveM's position for stable backports
    rtnetlink: validate attributes in do_setlink()
    mlxsw: Add extack messages for port_{un, }split failures
    netdevsim: Add extack error message for devlink reload
    devlink: Add extack to reload and port_{un, }split operations
    net: metrics: add proper netlink validation
    ipmr: fix error path when ipmr_new_table fails
    ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
    net: hns3: remove unused hclgevf_cfg_func_mta_filter
    netfilter: provide udp*_lib_lookup for nf_tproxy
    qed*: Utilize FW 8.37.2.0
    ...

    Linus Torvalds
     

26 May, 2018

1 commit


17 Apr, 2018

1 commit

  • Applications might use SO_RCVLOWAT on TCP socket hoping to receive
    one [E]POLLIN event only when a given amount of bytes are ready in socket
    receive queue.

    Problem is that receive autotuning is not aware of this constraint,
    meaning sk_rcvbuf might be too small to allow all bytes to be stored.

    Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
    can override the default setsockopt(SO_RCVLOWAT) behavior.
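
The override pattern described above can be sketched in userspace C. This is a minimal illustration, not the kernel code: every `sketch_` name, the clamp value and the 2x buffer factor are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LOWAT (1 << 28)   /* hypothetical clamp, avoids overflow below */

/* Hypothetical, simplified stand-in for the socket state. */
struct sketch_sock {
    int rcvlowat;   /* minimum bytes queued before a read is signalled */
    int rcvbuf;     /* receive buffer size in bytes */
};

struct sketch_proto_ops {
    /* Optional hook; NULL means "use the generic behaviour". */
    int (*set_rcvlowat)(struct sketch_sock *sk, int val);
};

/* Protocol hook: store the threshold and grow the receive buffer so the
 * requested amount can actually be queued. The 2x factor is an assumption. */
static int sketch_tcp_set_rcvlowat(struct sketch_sock *sk, int val)
{
    if (val < 1)
        val = 1;
    if (val > SKETCH_MAX_LOWAT)
        val = SKETCH_MAX_LOWAT;
    sk->rcvlowat = val;
    if (sk->rcvbuf < 2 * val)
        sk->rcvbuf = 2 * val;
    return 0;
}

/* Generic setsockopt path: defer to the protocol if it provides a hook. */
static int sketch_set_rcvlowat(const struct sketch_proto_ops *ops,
                               struct sketch_sock *sk, int val)
{
    assert(ops != NULL && sk != NULL);
    if (ops->set_rcvlowat)
        return ops->set_rcvlowat(sk, val);
    sk->rcvlowat = val > 0 ? val : 1;   /* default: only record the value */
    return 0;
}
```

With a NULL hook only the threshold changes; with the hook installed the buffer is resized as well, which is the constraint the message says autotuning was missing.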

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Mar, 2018

1 commit

  • Fun set of conflict resolutions here...

    For the mac80211 stuff, these were fortunately just parallel
    adds. Trivially resolved.

    In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
    function phy_disable_interrupts() earlier in the file, whilst in
    'net-next' the phy_error() call from this function was removed.

    In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
    'rt_table_id' member of rtable collided with a bug fix in 'net' that
    added a new struct member "rt_mtu_locked" which needs to be copied
    over here.

    The mlxsw driver conflict consisted of net-next separating
    the span code and definitions into separate files, whilst
    a 'net' bug fix made some changes to that moved code.

    The mlx5 infiniband conflict resolution was quite non-trivial,
    the RDMA tree's merge commit was used as a guide here, and
    here are their notes:

    ====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch. This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
    drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f9524
    (IB/mlx5: Fix cleanup order on unload) added to for-rc and
    commit b5ca15ad7e61 (IB/mlx5: Add proper representors support)
    added as part of the devel cycle both needed to modify the
    init/de-init functions used by mlx5. To support the new
    representors, the new functions added by the cleanup patch
    needed to be made non-static, and the init/de-init list
    added by the representors patch needed to be modified to
    match the init/de-init list changes made by the cleanup
    patch.
    Updates:
    drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
    prototypes added by representors patch to reflect new function
    names as changed by cleanup patch
    drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
    stage list to match new order from cleanup patch
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

12 Mar, 2018

1 commit

  • Now when using 'ss' in iproute, kernel would try to load all _diag
    modules, which also causes corresponding family and proto modules
    to be loaded as well due to module dependencies.

    After running 'ss', for instance, sctp, dccp and af_packet (if it is built
    as a module) would be loaded.

    For example:

    $ lsmod|grep sctp
    $ ss
    $ lsmod|grep sctp
    sctp_diag 16384 0
    sctp 323584 5 sctp_diag
    inet_diag 24576 4 raw_diag,tcp_diag,sctp_diag,udp_diag
    libcrc32c 16384 3 nf_conntrack,nf_nat,sctp

    As these family and proto modules are loaded unintentionally, it
    could cause some problems, like:

    - Some debug tools use 'ss' to collect the socket info, which loads all
    those diag and family and protocol modules. It's noisy for identifying
    issues.

    - Users who are unaware of the sctp protocol usually expect sctp init
    packets to be dropped silently instead of an abort being sent back.

    - It wastes resources (especially with multiple netns), and SCTP module
    can't be unloaded once it's loaded.

    ...

    In short, it's really inappropriate to have these family and proto
    modules loaded unexpectedly when just doing debugging with inet_diag.

    This patch introduces sock_load_diag_module(), which loads
    the _diag module only when its corresponding family or proto has
    already been registered.

    Note that we can't just load _diag module without the family or
    proto loaded, as some symbols used in _diag module are from the
    family or proto module.
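
The "load only if already registered" rule can be sketched as follows. This is a userspace approximation: the registration table, the counter and all `sketch_` names are hypothetical, and the counter stands in for an actual module load request.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_NPROTO 8   /* hypothetical size of the family table */

static bool sketch_family_registered[SKETCH_NPROTO];
static int sketch_module_requests;   /* counts simulated module loads */

/* Stand-in for a protocol family registering itself. */
static void sketch_register_family(int family)
{
    assert(family >= 0 && family < SKETCH_NPROTO);
    sketch_family_registered[family] = true;
}

/* Request the diag module only when its family is already present.
 * Returns 0 if a load was requested, -ENOENT if it was skipped. */
static int sketch_load_diag_module(int family)
{
    if (family < 0 || family >= SKETCH_NPROTO)
        return -ENOENT;
    if (!sketch_family_registered[family])
        return -ENOENT;   /* family absent: do not pull it in via the diag module */
    sketch_module_requests++;   /* a real implementation would request the module here */
    return 0;
}
```

Because the check happens before any load request, a diagnostic query alone never causes a family module to be loaded as a dependency.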

    v1->v2:
    - move inet proto check to inet_diag to avoid a compiling err.
    v2->v3:
    - define sock_load_diag_module in sock.c and export one symbol
    only.
    - improve the changelog.

    Reported-by: Sabrina Dubroca
    Acked-by: Marcelo Ricardo Leitner
    Acked-by: Phil Sutter
    Acked-by: Sabrina Dubroca
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

13 Feb, 2018

1 commit

  • Changes since v1:
    Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

    Before:
    All these functions either return a negative error indicator,
    or store length of sockaddr into "int *socklen" parameter
    and return zero on success.

    "int *socklen" parameter is awkward. For example, if caller does not
    care, it still needs to provide on-stack storage for the value
    it does not need.

    None of the many FOO_getname() functions of various protocols
    ever used old value of *socklen. They always just overwrite it.

    This change drops this parameter, and makes all these functions, on success,
    return length of sockaddr. It's always >= 0 and can be differentiated
    from an error.

    Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
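
The two calling conventions can be compared in a small sketch. The `sketch_getname_old` and `sketch_getname_new` functions and the 16-byte address type are made up for illustration; only the return-value convention follows the description above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct sketch_addr { char data[16]; };   /* hypothetical sockaddr stand-in */

/* Old style: 0 on success, with the length stored through *len. */
static int sketch_getname_old(int connected, struct sketch_addr *out, int *len)
{
    assert(out != NULL && len != NULL);
    if (!connected)
        return -ENOTCONN;
    memset(out, 0, sizeof(*out));
    *len = (int)sizeof(*out);   /* the old value of *len is never read */
    return 0;
}

/* New style: length on success (always >= 0), negative errno on failure. */
static int sketch_getname_new(int connected, struct sketch_addr *out)
{
    assert(out != NULL);
    if (!connected)
        return -ENOTCONN;
    memset(out, 0, sizeof(*out));
    return (int)sizeof(*out);
}
```

A caller of the new form that previously tested `if (err)` has to test `if (err < 0)`, since a successful call now returns a positive length.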

    rpc_sockname() lost "int buflen" parameter, since its only use was
    to be passed to kernel_getsockname() as &buflen and subsequently
    not used in any way.

    Userspace API is not changed.

    text data bss dec hex filename
    30108430 2633624 873672 33615726 200ef6e vmlinux.before.o
    30108109 2633612 873672 33615393 200ee21 vmlinux.o

    Signed-off-by: Denys Vlasenko
    CC: David S. Miller
    CC: linux-kernel@vger.kernel.org
    CC: netdev@vger.kernel.org
    CC: linux-bluetooth@vger.kernel.org
    CC: linux-decnet-user@lists.sourceforge.net
    CC: linux-wireless@vger.kernel.org
    CC: linux-rdma@vger.kernel.org
    CC: linux-sctp@vger.kernel.org
    CC: linux-nfs@vger.kernel.org
    CC: linux-x25@vger.kernel.org
    Signed-off-by: David S. Miller

    Denys Vlasenko
     

01 Feb, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

    2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

    3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

    4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

    5) Add eBPF based queue selection to tun, from Jason Wang.

    6) Lockless qdisc support, from John Fastabend.

    7) SCTP stream interleave support, from Xin Long.

    8) Smoother TCP receive autotuning, from Eric Dumazet.

    9) Lots of erspan tunneling enhancements, from William Tu.

    10) Add true function call support to BPF, from Alexei Starovoitov.

    11) Add explicit support for GRO HW offloading, from Michael Chan.

    12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

    13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

    14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

    15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

    16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

    17) Add resource abstraction to devlink, from Arkadi Sharshevsky.

    18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

    19) Avoid locking in act_csum, from Davide Caratti.

    20) devinet_ioctl() simplifications from Al Viro.

    21) More TCP bpf improvements from Lawrence Brakmo.

    22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
    tls: Add support for encryption using async offload accelerator
    ip6mr: fix stale iterator
    net/sched: kconfig: Remove blank help texts
    openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
    tcp_nv: fix potential integer overflow in tcpnv_acked
    r8169: fix RTL8168EP take too long to complete driver initialization.
    qmi_wwan: Add support for Quectel EP06
    rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
    ipmr: Fix ptrdiff_t print formatting
    ibmvnic: Wait for device response when changing MAC
    qlcnic: fix deadlock bug
    tcp: release sk_frag.page in tcp_disconnect
    ipv4: Get the address of interface correctly.
    net_sched: gen_estimator: fix lockdep splat
    net: macb: Handle HRESP error
    net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
    ipv6: addrconf: break critical section in addrconf_verify_rtnl()
    ipv6: change route cache aging logic
    i40e/i40evf: Update DESC_NEEDED value to reflect larger value
    bnxt_en: cleanup DIM work on device shutdown
    ...

    Linus Torvalds
     

25 Jan, 2018

1 commit


28 Nov, 2017

1 commit


16 Nov, 2017

1 commit

  • Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     

16 Aug, 2017

2 commits


02 Aug, 2017

1 commit

  • Add new proto_ops sendmsg_locked and sendpage_locked that can be
    called when the socket lock is already held. Correspondingly, add
    kernel_sendmsg_locked and kernel_sendpage_locked as front end
    functions.

    These functions will be used in zero proxy so that we can take
    the socket lock in a ULP sendmsg/sendpage and then directly call the
    backend transport proto_ops functions.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

20 Jun, 2017

1 commit


18 Apr, 2017

1 commit

  • The MTU overhead calculation in L2TP device set-up
    merged via commit b784e7ebfce8cfb16c6f95e14e8532d0768ab7ff
    needs to be adjusted to lock the tunnel socket while
    referencing the sub-data structures to derive the
    socket's IP overhead.

    Reported-by: Guillaume Nault
    Tested-by: Guillaume Nault
    Signed-off-by: R. Parameswaran
    Signed-off-by: David S. Miller

    R. Parameswaran
     

07 Apr, 2017

1 commit

  • A new function, kernel_sock_ip_overhead(), is provided
    to calculate the cumulative overhead imposed by the IP
    Header and IP options, if any, on a socket's payload.
    The new function returns an overhead of zero for sockets
    that do not belong to the IPv4 or IPv6 address families.
    This is used in the L2TP code path to compute the
    total outer IP overhead on the L2TP tunnel socket when
    calculating the default MTU for Ethernet pseudowires.
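
A rough model of the overhead calculation is sketched below. The struct and enum are hypothetical simplifications; the fixed sizes are the standard 20-byte IPv4 and 40-byte IPv6 base headers, and `opt_len` stands in for whatever options or extension headers the socket carries.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_family { SKETCH_AF_INET, SKETCH_AF_INET6, SKETCH_AF_OTHER };

/* Hypothetical, simplified socket description. */
struct sketch_sock {
    enum sketch_family family;
    size_t opt_len;   /* IPv4 options or IPv6 extension headers, in bytes */
};

/* Bytes of IP-level overhead added in front of the socket's payload.
 * Returns 0 for a NULL socket or a non-IP address family. */
static size_t sketch_sock_ip_overhead(const struct sketch_sock *sk)
{
    if (sk == NULL)
        return 0;
    switch (sk->family) {
    case SKETCH_AF_INET:
        assert(sk->opt_len <= 40);   /* IPv4 options cannot exceed 40 bytes */
        return 20 + sk->opt_len;     /* base IPv4 header is 20 bytes */
    case SKETCH_AF_INET6:
        return 40 + sk->opt_len;     /* base IPv6 header is 40 bytes */
    default:
        return 0;
    }
}
```

A tunnel driver would subtract this value, along with its own encapsulation headers, from the underlying MTU to arrive at a default MTU.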

    Signed-off-by: R. Parameswaran
    Signed-off-by: David S. Miller

    R. Parameswaran
     

10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set are
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_tcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.
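
The key-doubling idea in steps (1) and (2) can be sketched as two parallel key tables selected by a per-socket flag. All names here are hypothetical; as with lockdep class keys, a key is identified purely by its address.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_AF_MAX 4   /* hypothetical number of address families */

/* A lock class key carries no data; its address is its identity. */
struct sketch_lock_key { char dummy; };

static struct sketch_lock_key sketch_user_keys[SKETCH_AF_MAX];
static struct sketch_lock_key sketch_kern_keys[SKETCH_AF_MAX];

struct sketch_sock {
    int family;
    bool kern_sock;   /* remembered from the 'kern' argument at allocation */
    const struct sketch_lock_key *key;
};

/* Pick the key set according to who created the socket. */
static void sketch_sock_lock_init(struct sketch_sock *sk)
{
    assert(sk->family >= 0 && sk->family < SKETCH_AF_MAX);
    sk->key = sk->kern_sock ? &sketch_kern_keys[sk->family]
                            : &sketch_user_keys[sk->family];
}

/* A cloned child inherits the parent's kern setting, hence its key set. */
static void sketch_sock_clone(const struct sketch_sock *parent,
                              struct sketch_sock *child)
{
    *child = *parent;
    sketch_sock_lock_init(child);
}
```

Two sockets of the same family then fall into different lock classes when one is kernel-internal, which is what removes the false dependency chain.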

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

29 Aug, 2016

1 commit

  • Add new function in proto_ops structure. This includes moving the
    typedef for sk_read_actor into net.h and removing the definition from
    tcp.h.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

01 Jul, 2016

1 commit

  • We used to queue tx packets in sk_receive_queue, this is less
    efficient since it requires spinlocks to synchronize between producer
    and consumer.

    This patch tries to address this by:

    - switch from sk_receive_queue to a skb_array, and resize it when
    tx_queue_len is changed.
    - introduce a new proto_ops peek_len which is used for peeking the
    skb length.
    - implement a tun version of peek_len for vhost_net to use and convert
    vhost_net to use peek_len if possible.
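
The queue-plus-peek idea can be sketched with a small pointer ring in which an empty slot is marked by NULL. This is a single-threaded illustration with hypothetical names; it omits the memory barriers, locking and resizing a real producer/consumer array needs.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RING_SIZE 4

struct sketch_pkt { int len; };   /* hypothetical packet stand-in */

struct sketch_ring {
    struct sketch_pkt *slot[SKETCH_RING_SIZE];   /* NULL means empty */
    unsigned int head;   /* next slot to produce into */
    unsigned int tail;   /* next slot to consume from */
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int sketch_produce(struct sketch_ring *r, struct sketch_pkt *p)
{
    assert(p != NULL);
    if (r->slot[r->head] != NULL)
        return -1;
    r->slot[r->head] = p;
    r->head = (r->head + 1) % SKETCH_RING_SIZE;
    return 0;
}

/* Consumer side: returns the oldest packet, or NULL if the ring is empty. */
static struct sketch_pkt *sketch_consume(struct sketch_ring *r)
{
    struct sketch_pkt *p = r->slot[r->tail];

    if (p != NULL) {
        r->slot[r->tail] = NULL;
        r->tail = (r->tail + 1) % SKETCH_RING_SIZE;
    }
    return p;
}

/* Length of the next packet without dequeuing it; 0 if the ring is empty. */
static int sketch_peek_len(const struct sketch_ring *r)
{
    const struct sketch_pkt *p = r->slot[r->tail];

    return p != NULL ? p->len : 0;
}
```

The consumer can size its buffer from `sketch_peek_len()` before dequeuing, which is the role the message gives to the new peek_len operation.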

    Pktgen test shows about 15.3% improvement on guest receiving pps for small
    buffers:

    Before: ~1300000pps
    After : ~1500000pps

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

16 Jun, 2016

1 commit

  • The implementation of net_dbg_ratelimited in the CONFIG_DYNAMIC_DEBUG
    case was added with 2c94b5373 ("net: Implement net_dbg_ratelimited() for
    CONFIG_DYNAMIC_DEBUG case"). The implementation strategy was to take the
    usual definition of the dynamic_pr_debug macro, but alter it by adding a
    call to "net_ratelimit()" in the if statement. This is, in fact, the
    correct approach.

    However, while doing this, the author of the commit forgot to surround
    fmt by pr_fmt, resulting in unprefixed log messages appearing in the
    console. So, this commit adds back the pr_fmt(fmt) invocation, making
    net_dbg_ratelimited properly consistent across DEBUG, no DEBUG, and
    DYNAMIC_DEBUG cases, and bringing parity with the behavior of
    dynamic_pr_debug as well.

    Fixes: 2c94b5373 ("net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case")
    Signed-off-by: Jason A. Donenfeld
    Cc: Tim Bingham
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

04 May, 2016

1 commit


02 May, 2016

1 commit

  • Prior to commit d92cff89a0c8 ("net_dbg_ratelimited: turn into no-op
    when !DEBUG") the implementation of net_dbg_ratelimited() was buggy
    for both the DEBUG and CONFIG_DYNAMIC_DEBUG cases.

    The bug was that net_ratelimit() was being called and, despite
    returning true, nothing was being printed to the console. This
    resulted in messages like the following -

    "net_ratelimit: %d callbacks suppressed"

    with no other output nearby.

    After commit d92cff89a0c8 ("net_dbg_ratelimited: turn into no-op when
    !DEBUG") the bug is fixed for the DEBUG case. However, there's no
    output at all for CONFIG_DYNAMIC_DEBUG case.

    This patch restores debug output (if enabled) for the
    CONFIG_DYNAMIC_DEBUG case.

    Add a definition of net_dbg_ratelimited() for the CONFIG_DYNAMIC_DEBUG
    case. The implementation takes care to check that dynamic debugging is
    enabled before calling net_ratelimit().

    Fixes: d92cff89a0c8 ("net_dbg_ratelimited: turn into no-op when !DEBUG")
    Signed-off-by: Tim Bingham
    Signed-off-by: David S. Miller

    Tim Bingham
     

29 Mar, 2016

1 commit


10 Mar, 2016

1 commit


02 Dec, 2015

2 commits

  • Dmitry provided a syzkaller (http://github.com/google/syzkaller)
    triggering a fault in sock_wake_async() when async IO is requested.

    Said program stressed af_unix sockets, but the issue is generic
    and should be addressed in core networking stack.

    The problem is that by the time sock_wake_async() is called,
    we should not access the @flags field of 'struct socket',
    as the inode containing this socket might be freed without
    further notice, and without RCU grace period.

    We already maintain an RCU protected structure, "struct socket_wq"
    so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
    is the safe route.

    It also reduces number of cache lines needing dirtying, so might
    provide a performance improvement anyway.

    In followup patches, we might move the remaining flags (SOCK_NOSPACE,
    SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
    be mostly read and shared between cpus.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch is a cleanup to make following patch easier to
    review.

    Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to a (struct socket_wq)->flags
    to benefit from RCU protection in sock_wake_async()

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int nr, struct sock *sk) are added so that
    following patch can change their implementation.
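
The value of the helpers is that callers never name the structure holding the bits. The sketch below shows the end state where the bits live in the wait-queue structure. All `sketch_` names are hypothetical, and the plain bit operations here are not atomic, unlike a real implementation.

```c
#include <assert.h>

enum { SKETCH_ASYNC_NOSPACE = 0, SKETCH_ASYNC_WAITDATA = 1 };

/* Hypothetical simplified structures. */
struct sketch_socket_wq { unsigned long flags; };
struct sketch_sock { struct sketch_socket_wq *wq; };

/* Callers only see these helpers, so the storage location can move
 * without touching any call site. */
static void sketch_sk_set_bit(int nr, struct sketch_sock *sk)
{
    assert(nr >= 0 && nr < (int)(8 * sizeof(unsigned long)));
    sk->wq->flags |= 1UL << nr;
}

static void sketch_sk_clear_bit(int nr, struct sketch_sock *sk)
{
    assert(nr >= 0 && nr < (int)(8 * sizeof(unsigned long)));
    sk->wq->flags &= ~(1UL << nr);
}

static int sketch_sk_test_bit(int nr, const struct sketch_sock *sk)
{
    assert(nr >= 0 && nr < (int)(8 * sizeof(unsigned long)));
    return (sk->wq->flags >> nr) & 1UL;
}
```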

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Oct, 2015

1 commit

  • There's no good reason why users outside of networking should not
    be using this facility, f.e. for initializing their seeds.

    Therefore, make it accessible from there as get_random_once().
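
The "initialise once per call site" behaviour can be approximated in plain C with a per-expansion static flag. This sketch is not thread-safe and does not use static keys. Its `sketch_fill_bytes()` is a placeholder that writes a fixed byte rather than real random data, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static int sketch_fill_calls;   /* how often the slow path ran */

/* Placeholder for the entropy source: fills with a fixed byte pattern. */
static void sketch_fill_bytes(void *buf, size_t nbytes)
{
    assert(buf != NULL);
    memset(buf, 0xA5, nbytes);
    sketch_fill_calls++;
}

/* Each expansion gets its own 'sketch_done' flag, so every call site
 * fills its buffer exactly once. Not thread-safe as written. */
#define sketch_get_random_once(buf, nbytes)          \
    do {                                             \
        static bool sketch_done;                     \
        if (!sketch_done) {                          \
            sketch_fill_bytes((buf), (nbytes));      \
            sketch_done = true;                      \
        }                                            \
    } while (0)

/* Example user outside networking: a lazily seeded hash. */
static unsigned int sketch_hash(unsigned int x)
{
    static unsigned int seed;

    sketch_get_random_once(&seed, sizeof(seed));
    return x ^ seed;
}
```

Repeated calls to `sketch_hash()` reuse the same seed, and the fill routine runs only on the first call.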

    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

07 Aug, 2015

1 commit

  • The pr_debug family of functions turns into a no-op when -DDEBUG is not
    specified, opting instead to call "no_printk", which gets compiled to a
    no-op (but retains gcc's nice warnings about printf-style arguments).

    The problem with net_dbg_ratelimited is that it is defined to be a
    variant of net_ratelimited_function, which expands to essentially:

    if (net_ratelimit())
    pr_debug(fmt, ...);

    When DEBUG is not defined, then this becomes,

    if (net_ratelimit())
    ;

    This seems benign, except it isn't. Firstly, there's the obvious
    overhead of calling net_ratelimit needlessly, which does quite some book
    keeping for the rate limiting. Given that the pr_debug and
    net_dbg_ratelimited family of functions are sprinkled liberally through
    performance critical code, with developers assuming they'll be compiled
    out to a no-op most of the time, we certainly do not want this needless
    book keeping. Secondly, and most visibly, even though no debug message
    is printed when DEBUG is not defined, if there is a flood of
    invocations, dmesg winds up peppered with messages such as
    "net_ratelimit: 320 callbacks suppressed". This is because our
    aforementioned net_ratelimit() function actually prints this text in
    some circumstances. It's especially odd to see this when there isn't any
    other accompanying debug message.

    So, in sum, it doesn't make sense to have this function's current
    behavior, and instead it should match what every other debug family of
    functions in the kernel does with !DEBUG -- nothing.

    This patch replaces calls to net_dbg_ratelimited when !DEBUG with
    no_printk, keeping with the idiom of all the other debug print helpers.

    Also, though not strictly necessary, it guards the call with an if (0)
    so that all evaluation of any arguments is sure to be compiled out.
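
The effect of the if (0) guard can be demonstrated in userspace. The macro and helper names below are hypothetical, `fprintf` stands in for the kernel's print helpers, and `##__VA_ARGS__` is a GNU extension accepted by gcc and clang.

```c
#include <assert.h>
#include <stdio.h>

static int sketch_ratelimit_calls;   /* bookkeeping done by the rate limiter */
static int sketch_side_effects;      /* times an argument was evaluated */

/* Stand-in rate limiter: always allows, but records that it was called. */
static inline int sketch_ratelimit(void)
{
    sketch_ratelimit_calls++;
    return 1;
}

#ifdef SKETCH_DEBUG
#define sketch_dbg_ratelimited(fmt, ...)                        \
    do {                                                        \
        if (sketch_ratelimit())                                 \
            fprintf(stderr, fmt, ##__VA_ARGS__);                \
    } while (0)
#else
/* if (0) keeps format checking but evaluates nothing at run time. */
#define sketch_dbg_ratelimited(fmt, ...)                        \
    do {                                                        \
        if (0)                                                  \
            fprintf(stderr, fmt, ##__VA_ARGS__);                \
    } while (0)
#endif

static int sketch_expensive(void)
{
    sketch_side_effects++;
    return 42;
}

static void sketch_hot_path(void)
{
    sketch_dbg_ratelimited("value=%d\n", sketch_expensive());
}
```

Built without `SKETCH_DEBUG`, calling `sketch_hot_path()` touches neither counter: the rate limiter is never consulted and the argument expression is never evaluated, while the compiler still checks the format string against its arguments.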

    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

11 May, 2015

2 commits

  • This is long overdue, and is part of cleaning up how we allocate kernel
    sockets that don't reference count struct net.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There is no need for tun to do the weird network namespace refcounting.
    The existing network namespace refcounting in tfile has almost exactly
    the same lifetime. So rewrite the code to use the struct sock network
    namespace refcounting and remove the unnecessary hand rolled network
    namespace refcounting and the unnecessary tfile->net.

    This change allows the tun code to directly call sock_put bypassing
    sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.

    Remove the now unnecessary tun_release so that if anything tries to use
    the sock_release code path the kernel will oops, and let us know about
    the bug.

    The macvtap code already uses its internal socket this way.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

12 Apr, 2015

1 commit


03 Mar, 2015

1 commit

  • Now that TIPC no longer depends on the iocb argument in its internal
    implementations of the sendmsg() and recvmsg() hooks defined in the proto
    structure, no user of the iocb argument remains. We can therefore drop
    the redundant iocb argument completely from all implementations of
    sendmsg() and recvmsg() in the entire networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

14 May, 2014

1 commit

  • net_get_random_once depends on the static keys infrastructure to patch up
    the branch to the slow path during boot. This was realized by abusing the
    static keys api and defining a new initializer to not enable the call
    site while still indicating that the branch point should get patched
    up. This was needed to have the fast path considered likely by gcc.

    The static key initialization during boot up normally walks through all
    the registered keys and either patches in ideal nops or enables the jump
    site but omitted that step on x86 if ideal nops were already placed at
    static_key branch points. Thus net_get_random_once branches did not
    always become active.

    This patch switches net_get_random_once to the ordinary static_key
    api and thus places the kernel fast path in the - by gcc considered -
    unlikely path. Microbenchmarks on Intel and AMD x86-64 showed that
    the unlikely path actually beats the likely path in terms of cycle cost
    and that different nop patterns did not make much difference, thus this
    switch should not be noticeable.

    Fixes: a48e42920ff38b ("net: introduce new macro net_get_random_once")
    Reported-by: Tuomas Räsänen
    Cc: Linus Torvalds
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

15 Jan, 2014

1 commit


11 Dec, 2013

1 commit

  • unix_dgram_recvmsg() will hold the readlock of the socket until recv
    is complete.

    At the same time, we may try to setsockopt(SO_PEEK_OFF), which will hang until
    unix_dgram_recvmsg() completes (which can take a while) without allowing
    us to break out of it, triggering a hung task spew.

    Instead, allow set_peek_off to fail, this way userspace will not hang.

    Signed-off-by: Sasha Levin
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Sasha Levin
     

21 Nov, 2013

1 commit


26 Oct, 2013

1 commit

  • I initially built a non-irq-safe version of net_get_random_once because I
    would have liked to have the freedom to defer even the extraction process of
    get_random_bytes until the nonblocking pool is fully seeded.

    I don't think this is a good idea anymore and thus this patch makes
    net_get_random_once irq safe. Now someone using net_get_random_once does
    not need to care from where it is called.

    Cc: David S. Miller
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

22 Oct, 2013

1 commit

  • This patch fixes the following warning:

    In file included from include/linux/skbuff.h:27:0,
    from include/linux/netfilter.h:5,
    from include/net/netns/netfilter.h:5,
    from include/net/net_namespace.h:20,
    from include/linux/init_task.h:14,
    from init/init_task.c:1:
    include/linux/net.h:243:14: warning: 'struct static_key' declared inside parameter list [enabled by default]
    struct static_key *done_key);

    on x86_64 allnoconfig, um defconfig and ia64 allmodconfig and maybe others as well.

    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa