22 Feb, 2017

1 commit

  • Commit 34b88a68f26a ("net: Fix use after free in the recvmmsg exit path"),
    changed the exit path of recvmmsg to always return the datagrams
    variable and modified the error paths to set the variable to the error
    code returned by recvmsg if necessary.

    However, in the case where sock_error returned an error, the error code
    was then ignored, and recvmmsg returned 0.

    Change the error path of recvmmsg to correctly return the error code
    of sock_error.

    The bug was triggered by using recvmmsg on a CAN interface which was
    not up. Linux 4.6 and later return 0 in this case while earlier
    releases returned -ENETDOWN.

    Fixes: 34b88a68f26a ("net: Fix use after free in the recvmmsg exit path")
    Signed-off-by: Maxime Jayat
    Signed-off-by: David S. Miller

    Maxime Jayat
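
The control flow described in this commit can be sketched as a small userspace model. This is an illustration only, not the kernel source: `recvmmsg_exit_model` and its `fixed` parameter are hypothetical names, and the function only mimics how a pending socket error reaches the return value.

```c
#include <errno.h>

/*
 * Hypothetical model of the recvmmsg() exit path (not kernel code).
 * pending_sock_error: what sock_error() would report (0 or a negative errno).
 * fixed: nonzero to model the behaviour after this commit.
 */
int recvmmsg_exit_model(int pending_sock_error, int fixed)
{
    int datagrams = 0;              /* value returned on every exit path */
    int err = pending_sock_error;

    if (err) {
        if (fixed)
            datagrams = err;        /* the fix: propagate the error code */
        goto out_put;               /* before the fix, datagrams stayed 0 */
    }

    datagrams = 1;                  /* pretend one datagram was received */

out_put:
    return datagrams;
}
```

With a pending -ENETDOWN, the model returns 0 when `fixed` is 0 and -ENETDOWN otherwise, which matches the difference the commit message reports between the broken and the corrected behaviour.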
     

12 Jan, 2017

1 commit


11 Jan, 2017

1 commit

  • Make sockfs_setattr() static as it is not used outside of net/socket.c

    This fixes the following GCC warning:
    net/socket.c:534:5: warning: no previous prototype for ‘sockfs_setattr’ [-Wmissing-prototypes]

    Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
    Cc: Lorenzo Colitti
    Signed-off-by: Tobias Klauser
    Acked-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Tobias Klauser
     

10 Jan, 2017

1 commit


06 Jan, 2017

1 commit


05 Jan, 2017

1 commit


02 Jan, 2017

1 commit

  • ->setattr() was recently implemented for socket files to sync the socket
    inode's uid to the new 'sk_uid' member of struct sock. It does this by
    copying over the ia_uid member of struct iattr. However, ia_uid is
    actually only valid when ATTR_UID is set in ia_valid, indicating that
    the uid is being changed, e.g. by chown. Other metadata operations such
    as chmod or utimes leave ia_uid uninitialized. Therefore, sk_uid could
    be set to a "garbage" value from the stack.

    Fix this by only copying the uid over when ATTR_UID is set.

    Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
    Signed-off-by: Eric Biggers
    Tested-by: Lorenzo Colitti
    Acked-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Eric Biggers
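
The ATTR_UID check can be illustrated with a simplified userspace model. The types and the `MODEL_ATTR_UID` constant below are hypothetical stand-ins for struct iattr, struct sock and ATTR_UID; the real kernel structures carry many more fields.

```c
#include <stdint.h>

#define MODEL_ATTR_UID (1u << 1)   /* hypothetical stand-in for ATTR_UID */

struct model_iattr {
    unsigned int ia_valid;  /* bitmask saying which fields are valid */
    uint32_t ia_uid;        /* only meaningful if MODEL_ATTR_UID is set */
};

struct model_sock {
    uint32_t sk_uid;
};

/* Copy the uid only when the caller marked it as valid. */
void model_sockfs_setattr(struct model_sock *sk, const struct model_iattr *attr)
{
    if (attr->ia_valid & MODEL_ATTR_UID)
        sk->sk_uid = attr->ia_uid;
}
```

A chmod-like call (no uid bit in `ia_valid`) then leaves `sk_uid` untouched, whatever happens to be stored in `ia_uid`.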
     

26 Dec, 2016

1 commit

  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64-bit machines and in an endianness-optimized timespec
    variant for 32-bit machines. The Y2038 cleanup removed the timespec variant
    and switched everything to scalar nanoseconds. The union remained, but
    became completely pointless.

    Get rid of the union and just keep ktime_t as a simple typedef of type s64.

    The conversion was done with coccinelle and some manual mopping up.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
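
To show what the conversion means for callers, here is a rough userspace sketch. The names `old_ktime`, `new_ktime_t` and both helper functions are invented for this illustration and only approximate the kernel types.

```c
#include <stdint.h>

/* Old shape: a union with a single scalar member (model only). */
union old_ktime {
    int64_t tv64;
};

/* New shape: a plain signed 64-bit nanosecond count. */
typedef int64_t new_ktime_t;

union old_ktime old_ktime_add_ns(union old_ktime kt, int64_t ns)
{
    kt.tv64 += ns;          /* every access has to go through .tv64 */
    return kt;
}

new_ktime_t new_ktime_add_ns(new_ktime_t kt, int64_t ns)
{
    return kt + ns;         /* ordinary integer arithmetic */
}
```

Both helpers compute the same value; the typedef version simply drops the member access that the single-member union forced on every user.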
     

25 Dec, 2016

1 commit


11 Dec, 2016

1 commit


09 Dec, 2016

1 commit


30 Nov, 2016

1 commit

  • This patch exports the sender chronograph stats via the socket
    SO_TIMESTAMPING channel. Currently we can instrument how long a
    particular application unit of data was queued in TCP by tracking
    SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
    these sender chronograph stats exported simultaneously along with
    these timestamps allows further breaking down the various sender
    limitations. For example, a video server can tell if a particular
    chunk of video on a connection takes a long time to deliver because
    TCP was experiencing a small receive window. It is not possible to
    tell before this patch without packet traces.

    To prepare these stats, the user needs to set
    SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
    while requesting other SOF_TIMESTAMPING TX timestamps. When the
    timestamps are available in the error queue, the stats are returned
    in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
    in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
    TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.

    Signed-off-by: Francis Yan
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Francis Yan
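
A minimal userspace sketch of requesting these stats might look as follows. It assumes kernel headers new enough to define SOF_TIMESTAMPING_OPT_STATS; the helper names `tx_stats_flags` and `enable_tx_stats` are made up for this example, and parsing the SCM_TIMESTAMPING_OPT_STATS control message from the error queue is left out.

```c
#include <linux/net_tstamp.h>
#include <sys/socket.h>

/* Flag combination described in the commit message (helper name is ours). */
unsigned int tx_stats_flags(void)
{
    return SOF_TIMESTAMPING_SOFTWARE      /* report software timestamps */
         | SOF_TIMESTAMPING_TX_SOFTWARE   /* timestamp when leaving the stack */
         | SOF_TIMESTAMPING_TX_SCHED      /* timestamp when entering the scheduler */
         | SOF_TIMESTAMPING_OPT_TSONLY    /* required together with OPT_STATS */
         | SOF_TIMESTAMPING_OPT_STATS;    /* ask for the chronograph stats */
}

/* Returns the setsockopt() result: 0 on success, -1 with errno set on error. */
int enable_tx_stats(int fd)
{
    unsigned int flags = tx_stats_flags();

    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}
```

An application would call `enable_tx_stats()` on its TCP socket once and then read timestamps and stats from the socket error queue with recvmsg() and MSG_ERRQUEUE.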
     

23 Nov, 2016

1 commit

  • All conflicts were simple overlapping changes except perhaps
    for the Thunder driver.

    That driver has a change_mtu method explicitly for sending
    a message to the hardware. If that fails it returns an
    error.

    Normally a driver doesn't need an ndo_change_mtu method because those
    are usually just range changes, which are now handled generically.
    But since this extra operation is needed in the Thunder driver, it has
    to stay.

    However, if the message send fails we have to restore the original
    MTU before the change because the entire call chain expects that if
    an error is thrown by ndo_change_mtu then the MTU did not change.
    Therefore code is added to nicvf_change_mtu to remember the original
    MTU, and to restore it upon nicvf_update_hw_max_frs() failure.

    Signed-off-by: David S. Miller

    David S. Miller
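
The remember-and-restore pattern described for nicvf_change_mtu can be sketched in plain C. Everything below is a hypothetical model: `model_netdev`, the two hardware stubs and the choice of -EINVAL as the error code are assumptions made for the example, not the driver's code.

```c
#include <errno.h>

struct model_netdev {
    int mtu;
};

/* Hypothetical hardware hooks: return 0 on success, nonzero on failure. */
int model_hw_update_ok(int new_mtu)   { (void)new_mtu; return 0; }
int model_hw_update_fail(int new_mtu) { (void)new_mtu; return 1; }

int model_change_mtu(struct model_netdev *dev, int new_mtu,
                     int (*update_hw)(int))
{
    int orig_mtu = dev->mtu;    /* remember the MTU before the change */

    dev->mtu = new_mtu;
    if (update_hw(new_mtu)) {
        dev->mtu = orig_mtu;    /* callers expect no change on error */
        return -EINVAL;
    }
    return 0;
}
```

The point of the pattern is the invariant the call chain relies on: if the function reports an error, the MTU the caller observes is the one from before the call.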
     

17 Nov, 2016

1 commit

  • The IOP_XATTR flag is set on sockfs because sockfs supports getting the
    "system.sockprotoname" xattr. Since commit 6c6ef9f2, this flag is checked for
    setxattr support as well. This is wrong on sockfs because security xattr
    support there is supposed to be provided by security_inode_setsecurity. The
    smack security module relies on socket labels (xattrs).

    Fix this by adding a security xattr handler on sockfs that returns
    -EAGAIN, and by checking for -EAGAIN in setxattr.

    We cannot simply check for -EOPNOTSUPP in setxattr because there are
    filesystems that neither have direct security xattr support nor support
    via security_inode_setsecurity. A more proper fix might be to move the
    call to security_inode_setsecurity into sockfs, but it's not clear to me
    if that is safe: we would end up calling security_inode_post_setxattr after
    that as well.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

15 Nov, 2016

1 commit


10 Nov, 2016

1 commit

  • Do not send the next message in sendmmsg for partial sendmsg
    invocations.

    sendmmsg assumes that it can continue sending the next message
    when the return value of the individual sendmsg invocations
    is positive. It results in corrupting the data for TCP,
    SCTP, and UNIX streams.

    For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
    of "aefgh" if the first sendmsg invocation sends only the first
    byte while the second sendmsg goes through.

    Datagram sockets either send the entire datagram or fail, so
    this patch affects only sockets of type SOCK_STREAM and
    SOCK_SEQPACKET.

    Fixes: 228e548e6020 ("net: Add sendmmsg socket system call")
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Neal Cardwell
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
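
The stop-on-partial-send rule can be illustrated with a small simulation. This is a sketch under stated assumptions, not the kernel loop: `sendmmsg_model` is a hypothetical function, `want[i]` is the size of message i, and `accepted[i]` is how many bytes the simulated transport takes for it.

```c
#include <stddef.h>

/*
 * Returns the number of messages processed. Stops right after the first
 * message that was only partially sent, so that later messages cannot be
 * interleaved into the stream.
 */
int sendmmsg_model(const size_t *want, const size_t *accepted,
                   size_t *msg_len, int vlen)
{
    int datagrams = 0;

    for (int i = 0; i < vlen; i++) {
        size_t sent = accepted[i] < want[i] ? accepted[i] : want[i];

        msg_len[i] = sent;      /* bytes actually sent for this message */
        ++datagrams;
        if (sent < want[i])     /* partial send: do not start the next one */
            break;
    }
    return datagrams;
}
```

For the "abcd"/"efgh" example above, a first message that only gets one byte out makes the model stop after one entry with `msg_len[0] == 1`, leaving it to the caller to retry the remaining "bcd" before sending "efgh".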
     

05 Nov, 2016

1 commit

  • Protocol sockets (struct sock) don't have UIDs, but most of the
    time, they map 1:1 to userspace sockets (struct socket) which do.

    Various operations such as the iptables xt_owner match need
    access to the "UID of a socket", and do so by following the
    backpointer to the struct socket. This involves taking
    sk_callback_lock and doesn't work when there is no socket
    because userspace has already called close().

    Simplify this by adding a sk_uid field to struct sock whose value
    matches the UID of the corresponding struct socket. The semantics
    are as follows:

    1. Whenever sk_socket is non-null: sk_uid is the same as the UID
       in sk_socket, i.e., matches the return value of sock_i_uid.
       Specifically, the UID is set when userspace calls socket(),
       fchown(), or accept().
    2. When sk_socket is NULL, sk_uid is defined as follows:
       - For a socket that no longer has a sk_socket because
         userspace has called close(): the previous UID.
       - For a cloned socket (e.g., an incoming connection that is
         established but on which userspace has not yet called
         accept): the UID of the socket it was cloned from.
       - For a socket that has never had an sk_socket: UID 0 inside
         the user namespace corresponding to the network namespace
         the socket belongs to.

    Kernel sockets created by sock_create_kern are a special case
    of #1 and sk_uid is the user that created them. For kernel
    sockets created at network namespace creation time, such as the
    per-processor ICMP and TCP sockets, this is the user that created
    the network namespace.

    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

31 Oct, 2016

1 commit

  • Each socket operates in a network namespace where it has been created,
    so if we want to dump and restore a socket, we have to know its network
    namespace.

    We have socket_diag to get information about sockets, but it doesn't
    report sockets which are not bound or connected.

    This patch introduces a new socket ioctl, which is called SIOCGSKNS
    and used to get a file descriptor for a socket network namespace.

    A task must have CAP_NET_ADMIN in a target network namespace to
    use this ioctl.

    Cc: "David S. Miller"
    Cc: Eric W. Biederman
    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
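
From userspace, the new ioctl could be used roughly like this. The wrapper name `socket_netns_fd` is hypothetical, and the sketch assumes kernel headers that define SIOCGSKNS in <linux/sockios.h>.

```c
#include <linux/sockios.h>
#include <sys/ioctl.h>

/*
 * Returns a new file descriptor referring to the network namespace of the
 * given socket, or -1 with errno set. As noted above, the caller needs
 * CAP_NET_ADMIN in the target network namespace.
 */
int socket_netns_fd(int sock_fd)
{
    return ioctl(sock_fd, SIOCGSKNS);
}
```

The returned descriptor can be passed to setns(), for example, and should be closed with close() when it is no longer needed.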
     

08 Oct, 2016

1 commit


07 Oct, 2016

2 commits


20 May, 2016

1 commit

  • struct timespec is not y2038 safe. Even though timespec might be
    sufficient to represent timeouts, use struct timespec64 here as the plan
    is to get rid of all timespec references in the kernel.

    The patch transitions the common functions: poll_select_set_timeout()
    and select_estimate_accuracy() to use timespec64. And, all the syscalls
    that use these functions are transitioned in the same patch.

    The restart block parameters for poll use monotonic time. Use
    timespec64 here as well to assign the timeout value. This parameter in
    the restart block need not change because it only holds the monotonic
    timestamp at which the timeout should occur, and the unsigned long data
    type should be big enough for this timestamp.

    The system call interfaces will be handled in a separate series.

    Compat interfaces need not change as timespec64 is an alias to struct
    timespec on a 64 bit system.

    Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Acked-by: John Stultz
    Acked-by: David S. Miller
    Cc: Alexander Viro
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

29 Apr, 2016

1 commit

  • The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
    the timestamp of the TCP acknowledgement should be reported on
    error queue. Since accessing skb_shinfo is likely to incur a
    cache-line miss at the time of receiving the ack, the
    txstamp_ack bit was added in tcp_skb_cb, which is set iff
    the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
    SKBTX_ACK_TSTAMP flag redundant.

    Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
    everywhere.

    Note that this frees one bit in shinfo->tx_flags.

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Martin KaFai Lau
    Suggested-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

14 Apr, 2016

1 commit


11 Apr, 2016

1 commit


08 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
    This is very costly when users want to sample writes to gather
    tx timestamps.

    Add support for enabling SO_TIMESTAMPING via control messages by
    using tsflags added in `struct sockcm_cookie` (added in the previous
    patches in this series) to set the tx_flags of the last skb created in
    a sendmsg. With this patch, the timestamp recording bits in tx_flags
    of the skbuff are overridden if SO_TIMESTAMPING is passed in a cmsg.

    Please note that this is only effective for overriding the timestamp
    recording flags. Users should enable timestamp reporting (e.g.,
    SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
    socket options and then should ask for SOF_TIMESTAMPING_TX_*
    using control messages per sendmsg to sample timestamps for each
    write.

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
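
Passing SO_TIMESTAMPING as a control message could look like the sketch below. The helper `attach_tx_timestamp_cmsg` is a hypothetical name; it only fills in the ancillary data of a struct msghdr, and the caller is assumed to set up the iovec, call sendmsg(), and have enabled timestamp reporting via setsockopt() beforehand.

```c
#include <linux/net_tstamp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Attach a SOL_SOCKET/SO_TIMESTAMPING control message carrying tsflags
 * (for example SOF_TIMESTAMPING_TX_SOFTWARE) to msg.
 * buf must be suitably aligned for struct cmsghdr and at least
 * CMSG_SPACE(sizeof(uint32_t)) bytes long.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int attach_tx_timestamp_cmsg(struct msghdr *msg, void *buf, size_t buflen,
                             uint32_t tsflags)
{
    struct cmsghdr *cmsg;

    if (buflen < CMSG_SPACE(sizeof(tsflags)))
        return -1;

    memset(buf, 0, buflen);
    msg->msg_control = buf;
    msg->msg_controllen = CMSG_SPACE(sizeof(tsflags));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_TIMESTAMPING;
    cmsg->cmsg_len = CMSG_LEN(sizeof(tsflags));
    memcpy(CMSG_DATA(cmsg), &tsflags, sizeof(tsflags));
    return 0;
}
```

Because the flags travel with each sendmsg() call, an application can sample only selected writes instead of toggling the socket option around every send.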
     

29 Mar, 2016

1 commit


15 Mar, 2016

1 commit

  • The syzkaller fuzzer hit the following use-after-free:

    Call Trace:
    [] __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:295
    [] __sys_recvmmsg+0x6fa/0x7f0 net/socket.c:2261
    [< inline >] SYSC_recvmmsg net/socket.c:2281
    [] SyS_recvmmsg+0x16f/0x180 net/socket.c:2270
    [] entry_SYSCALL_64_fastpath+0x16/0x7a
    arch/x86/entry/entry_64.S:185

    And, as Dmitry rightly assessed, that is because we can drop the
    reference and then touch it when the underlying recvmsg calls return
    some packets and then hit an error, which will make recvmmsg set
    sock->sk->sk_err. Oops, fix it.

    Reported-and-Tested-by: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Eric Dumazet
    Cc: Kostya Serebryany
    Cc: Sasha Levin
    Fixes: a2e2725541fa ("net: Introduce recvmmsg socket syscall")
    http://lkml.kernel.org/r/20160122211644.GC2470@redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

14 Mar, 2016

1 commit


10 Mar, 2016

3 commits

  • Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
    indicate that more messages will follow (i.e. a batch of messages is
    being sent). This is similar to MSG_MORE except that the following
    messages are not merged into one packet, they are sent individually.
    sendmmsg is updated so that each contained message except for the
    last one is marked as MSG_BATCH.

    MSG_BATCH is a performance optimization in cases where a socket
    implementation can benefit by transmitting packets in a batch.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
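
The per-message flag selection can be modelled in a few lines. This is only a sketch: `model_sendmmsg_flags` is a hypothetical helper and `MODEL_MSG_BATCH` is a local constant whose value is assumed for the illustration.

```c
#define MODEL_MSG_BATCH 0x40000  /* assumed value, for illustration only */

/*
 * Flags to use for message number `index` (0-based) out of `vlen`
 * messages: every message except the last one is marked as part of
 * a batch, the last one is sent with the caller's flags unchanged.
 */
int model_sendmmsg_flags(int user_flags, int index, int vlen)
{
    if (index == vlen - 1)
        return user_flags;
    return user_flags | MODEL_MSG_BATCH;
}
```

A socket implementation that sees the batch flag can defer work such as flushing until the final message arrives without it.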
     
  • This patch allows setting MSG_EOR in each individual msghdr passed
    in sendmmsg. This allows a sendmmsg to send multiple messages when
    using SOCK_SEQPACKET.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Export sock_alloc() for cases where we want to create sockets by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

11 Jan, 2016

1 commit

  • Applications often have to reduce the number of datagrams
    they receive or send per system call to avoid starvation problems.

    Really the kernel should take care of this by using cond_resched(),
    so that applications can experiment with bigger batch sizes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Dec, 2015

1 commit

  • Commit ceb5d58b2170 ("net: fix sock_wake_async() rcu protection") from
    the current 4.4 release cycle introduced a new flags member in
    struct socket_wq and moved SOCKWQ_ASYNC_NOSPACE and SOCKWQ_ASYNC_WAITDATA
    from struct socket's flags member into that new place.

    Unfortunately, the new flags field is never initialized properly, at least
    not for the struct socket_wq instance created in sock_alloc_inode().

    One particular issue I encountered because of this is that my GNU Emacs
    failed to draw anything on my desktop -- i.e. what I got is a transparent
    window, including the title bar. Bisection led to the commit mentioned
    above and further investigation by means of strace told me that Emacs
    is indeed speaking to my Xorg through an O_ASYNC AF_UNIX socket. This is
    reproducible 100% of the time, and the fact that properly initializing the
    struct socket_wq ->flags fixes the issue leads me to the conclusion that
    somehow SOCKWQ_ASYNC_WAITDATA got set in the uninitialized ->flags,
    preventing my Emacs from receiving any SIGIO's due to data becoming
    available and it got stuck.

    Make sock_alloc_inode() set the newly created struct socket_wq's ->flags
    member to zero.

    Fixes: ceb5d58b2170 ("net: fix sock_wake_async() rcu protection")
    Signed-off-by: Nicolai Stange
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Nicolai Stange
     

16 Dec, 2015

1 commit


02 Dec, 2015

2 commits

  • Dmitry provided a syzkaller (http://github.com/google/syzkaller) program
    triggering a fault in sock_wake_async() when async IO is requested.

    Said program stressed af_unix sockets, but the issue is generic
    and should be addressed in core networking stack.

    The problem is that by the time sock_wake_async() is called,
    we should not access the @flags field of 'struct socket',
    as the inode containing this socket might be freed without
    further notice, and without RCU grace period.

    We already maintain an RCU protected structure, "struct socket_wq"
    so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
    is the safe route.

    It also reduces the number of cache lines needing dirtying, so might
    provide a performance improvement anyway.

    In followup patches, we might move the remaining flags (SOCK_NOSPACE,
    SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
    be mostly read, so that it can be shared between CPUs.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch is a cleanup to make the following patch easier to
    review.

    The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to (struct socket_wq)->flags
    to benefit from RCU protection in sock_wake_async().

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int nr, struct sock *sk), are added so that
    the following patch can change their implementation.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet