17 Nov, 2016

1 commit

  • The IOP_XATTR flag is set on sockfs because sockfs supports getting the
    "system.sockprotoname" xattr. Since commit 6c6ef9f2, this flag is checked for
    setxattr support as well. This is wrong on sockfs because security xattr
    support there is supposed to be provided by security_inode_setsecurity. The
    smack security module relies on socket labels (xattrs).

    Fix this by adding a security xattr handler on sockfs that returns
    -EAGAIN, and by checking for -EAGAIN in setxattr.

    We cannot simply check for -EOPNOTSUPP in setxattr because there are
    filesystems that neither have direct security xattr support nor support
    via security_inode_setsecurity. A more proper fix might be to move the
    call to security_inode_setsecurity into sockfs, but it's not clear to me
    if that is safe: we would end up calling security_inode_post_setxattr after
    that as well.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

10 Nov, 2016

1 commit

  • Do not send the next message in sendmmsg for partial sendmsg
    invocations.

    sendmmsg assumes that it can continue sending the next message
    when the return value of the individual sendmsg invocations
    is positive. It results in corrupting the data for TCP,
    SCTP, and UNIX streams.

    For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
    of "aefgh" if the first sendmsg invocation sends only the first
    byte while the second sendmsg goes through.

    Datagram sockets either send the entire datagram or fail, so
    this patch affects only sockets of type SOCK_STREAM and
    SOCK_SEQPACKET.

    Fixes: 228e548e6020 ("net: Add sendmmsg socket system call")
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Neal Cardwell
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

08 Oct, 2016

1 commit


07 Oct, 2016

2 commits


20 May, 2016

1 commit

  • struct timespec is not y2038 safe. Even though timespec might be
    sufficient to represent timeouts, use struct timespec64 here as the plan
    is to get rid of all timespec reference in the kernel.

    The patch transitions the common functions: poll_select_set_timeout()
    and select_estimate_accuracy() to use timespec64. And, all the syscalls
    that use these functions are transitioned in the same patch.

    The restart block parameters for poll uses monotonic time. Use
    timespec64 here as well to assign timeout value. This parameter in the
    restart block need not change because this only holds the monotonic
    timestamp at which timeout should occur. And, unsigned long data type
    should be big enough for this timestamp.

    The system call interfaces will be handled in a separate series.

    Compat interfaces need not change as timespec64 is an alias to struct
    timespec on a 64 bit system.

    Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Acked-by: John Stultz
    Acked-by: David S. Miller
    Cc: Alexander Viro
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

29 Apr, 2016

1 commit

  • The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
    the timestamp of the TCP acknowledgement should be reported on
    error queue. Since accessing skb_shinfo is likely to incur a
    cache-line miss at the time of receiving the ack, the
    txstamp_ack bit was added in tcp_skb_cb, which is set iff
    the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
    SKBTX_ACK_TSTAMP flag redundant.

    Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
    everywhere.

    Note that this frees one bit in shinfo->tx_flags.

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Martin KaFai Lau
    Suggested-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

14 Apr, 2016

1 commit


11 Apr, 2016

1 commit


08 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
    This is very costly when users want to sample writes to gather
    tx timestamps.

    Add support for enabling SO_TIMESTAMPING via control messages by
    using tsflags added in `struct sockcm_cookie` (added in the previous
    patches in this series) to set the tx_flags of the last skb created in
    a sendmsg. With this patch, the timestamp recording bits in tx_flags
    of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

    Please note that this is only effective for overriding the recording
    timestamps flags. Users should enable timestamp reporting (e.g.,
    SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
    socket options and then should ask for SOF_TIMESTAMPING_TX_*
    using control messages per sendmsg to sample timestamps for each
    write.

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

29 Mar, 2016

1 commit


15 Mar, 2016

1 commit

  • The syzkaller fuzzer hit the following use-after-free:

    Call Trace:
    [] __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:295
    [] __sys_recvmmsg+0x6fa/0x7f0 net/socket.c:2261
    [< inline >] SYSC_recvmmsg net/socket.c:2281
    [] SyS_recvmmsg+0x16f/0x180 net/socket.c:2270
    [] entry_SYSCALL_64_fastpath+0x16/0x7a
    arch/x86/entry/entry_64.S:185

    And, as Dmitry rightly assessed, that is because we can drop the
    reference and then touch it when the underlying recvmsg calls return
    some packets and then hit an error, which will make recvmmsg to set
    sock->sk->sk_err, oops, fix it.

    Reported-and-Tested-by: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Eric Dumazet
    Cc: Kostya Serebryany
    Cc: Sasha Levin
    Fixes: a2e2725541fa ("net: Introduce recvmmsg socket syscall")
    http://lkml.kernel.org/r/20160122211644.GC2470@redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

14 Mar, 2016

1 commit


10 Mar, 2016

3 commits

  • Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
    indicate that more messages will follow (i.e. a batch of messages is
    being sent). This is similar to MSG_MORE except that the following
    messages are not merged into one packet, they are sent individually.
    sendmmsg is updated so that each contained message except for the
    last one is marked as MSG_BATCH.

    MSG_BATCH is a performance optimization in cases where a socket
    implementation can benefit by transmitting packets in a batch.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • This patch allows setting MSG_EOR in each individual msghdr passed
    in sendmmsg. This allows a sendmmsg to send multiple messages when
    using SOCK_SEQPACKET.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Export it for cases where we want to create sockets by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

11 Jan, 2016

1 commit

  • Applications often have to reduce number of datagrams
    they receive or send per system call to avoid starvation problems.

    Really the kernel should take care of this by using cond_resched(),
    so that applications can experiment bigger batch sizes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Dec, 2015

1 commit

  • Commit ceb5d58b2170 ("net: fix sock_wake_async() rcu protection") from
    the current 4.4 release cycle introduced a new flags member in
    struct socket_wq and moved SOCKWQ_ASYNC_NOSPACE and SOCKWQ_ASYNC_WAITDATA
    from struct socket's flags member into that new place.

    Unfortunately, the new flags field is never initialized properly, at least
    not for the struct socket_wq instance created in sock_alloc_inode().

    One particular issue I encountered because of this is that my GNU Emacs
    failed to draw anything on my desktop -- i.e. what I got is a transparent
    window, including the title bar. Bisection lead to the commit mentioned
    above and further investigation by means of strace told me that Emacs
    is indeed speaking to my Xorg through an O_ASYNC AF_UNIX socket. This is
    reproducible 100% of times and the fact that properly initializing the
    struct socket_wq ->flags fixes the issue leads me to the conclusion that
    somehow SOCKWQ_ASYNC_WAITDATA got set in the uninitialized ->flags,
    preventing my Emacs from receiving any SIGIO's due to data becoming
    available and it got stuck.

    Make sock_alloc_inode() set the newly created struct socket_wq's ->flags
    member to zero.

    Fixes: ceb5d58b2170 ("net: fix sock_wake_async() rcu protection")
    Signed-off-by: Nicolai Stange
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Nicolai Stange
     

16 Dec, 2015

1 commit


02 Dec, 2015

2 commits

  • Dmitry provided a syzkaller (http://github.com/google/syzkaller)
    triggering a fault in sock_wake_async() when async IO is requested.

    Said program stressed af_unix sockets, but the issue is generic
    and should be addressed in core networking stack.

    The problem is that by the time sock_wake_async() is called,
    we should not access the @flags field of 'struct socket',
    as the inode containing this socket might be freed without
    further notice, and without RCU grace period.

    We already maintain an RCU protected structure, "struct socket_wq"
    so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
    is the safe route.

    It also reduces number of cache lines needing dirtying, so might
    provide a performance improvement anyway.

    In followup patches, we might move remaining flags (SOCK_NOSPACE,
    SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
    being mostly read and let it being shared between cpus.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch is a cleanup to make following patch easier to
    review.

    Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to a (struct socket_wq)->flags
    to benefit from RCU protection in sock_wake_async()

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int net, struct sock *sk) are added so that
    following patch can change their implementation.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Sep, 2015

1 commit


11 May, 2015

2 commits

  • This is long overdue, and is part of cleaning up how we allocate kernel
    sockets that don't reference count struct net.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There is no need for tun to do the weird network namespace refcounting.
    The existing network namespace refcounting in tfile has almost exactly
    the same lifetime. So rewrite the code to use the struct sock network
    namespace refcounting and remove the unnecessary hand rolled network
    namespace refcounting and the unncesary tfile->net.

    This change allows the tun code to directly call sock_put bypassing
    sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.

    Remove the now unncessary tun_release so that if anything tries to use
    the sock_release code path the kernel will oops, and let us know about
    the bug.

    The macvtap code already uses it's internal socket this way.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

16 Apr, 2015

1 commit


12 Apr, 2015

3 commits


09 Apr, 2015

4 commits


24 Mar, 2015

1 commit


21 Mar, 2015

2 commits

  • Conflicts:
    drivers/net/ethernet/emulex/benet/be_main.c
    net/core/sysctl_net_core.c
    net/ipv4/inet_diag.c

    The be_main.c conflict resolution was really tricky. The conflict
    hunks generated by GIT were very unhelpful, to say the least. It
    split functions in half and moved them around, when the real actual
    conflict only existed solely inside of one function, that being
    be_map_pci_bars().

    So instead, to resolve this, I checked out be_main.c from the top
    of net-next, then I applied the be_main.c changes from 'net' since
    the last time I merged. And this worked beautifully.

    The inet_diag.c and sysctl_net_core.c conflicts were simple
    overlapping changes, and were easily to resolve.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Cc: stable@vger.kernel.org # v3.19
    Signed-off-by: Al Viro
    Signed-off-by: David S. Miller

    Al Viro
     

14 Mar, 2015

1 commit

  • The AIO interface is fairly complex because it tries to allow
    filesystems to always work async and then wakeup a synchronous
    caller through aio_complete. It turns out that basically no one
    was doing this to avoid the complexity and context switches,
    and we've already fixed up the remaining users and can now
    get rid of this case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

13 Mar, 2015

1 commit