11 Aug, 2020

1 commit

  • This reverts commits 6d04fe15f78acdf8e32329e208552e226f7a8ae6 and
    a31edb2059ed4e498f9aa8230c734b59d0ad797a.

    It turns out that sharing a single pointer for both kernel and user
    space addresses causes various kinds of problems. So use the slightly less
    optimal version that uses an extra bit, but which is guaranteed to be safe
    everywhere.

    Fixes: 6d04fe15f78a ("net: optimize the sockptr_t for unified kernel/user address spaces")
    Reported-by: Eric Dumazet
    Reported-by: John Stultz
    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
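
The "extra bit" variant described above can be pictured with a small userspace model. This is only a sketch under stated assumptions: `sockptr_model_t`, `kernel_sockptr_model()`, `user_sockptr_model()`, `model_copy_from_user()` and `copy_from_sockptr_model()` are hypothetical names invented for illustration, both "address spaces" are ordinary process memory here, and nothing below is the kernel's actual header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model: one pointer slot plus an explicit bit
 * recording which address space the pointer belongs to. */
typedef struct {
	union {
		void *kernel;
		void *user;
	};
	bool is_kernel;
} sockptr_model_t;

static sockptr_model_t kernel_sockptr_model(void *p)
{
	return (sockptr_model_t){ .kernel = p, .is_kernel = true };
}

static sockptr_model_t user_sockptr_model(void *p)
{
	return (sockptr_model_t){ .user = p, .is_kernel = false };
}

/* Stand-in for a checked user copy; in this model it is a plain memcpy. */
static int model_copy_from_user(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

/* The tag, not the pointer value, decides which copy routine runs. */
static int copy_from_sockptr_model(void *dst, sockptr_model_t src, size_t len)
{
	assert(dst != NULL);
	if (src.is_kernel) {
		memcpy(dst, src.kernel, len);
		return 0;
	}
	return model_copy_from_user(dst, src.user, len);
}
```

Because the tag travels with the pointer, the copy path never has to guess from the address value which space it is dealing with, which is the property the commit message calls "guaranteed to be safe everywhere".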
     

09 Aug, 2020

4 commits


29 Jul, 2020

1 commit

  • Make sure not just the pointer itself but the whole range lies in
    the user address space. For that pass the length and then use
    the access_ok helper to do the check.

    Fixes: 6d04fe15f78a ("net: optimize the sockptr_t for unified kernel/user address spaces")
    Reported-by: David Laight
    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
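
A rough illustration of why the length matters: a start address can lie below the user/kernel boundary while the end of the range does not. The sketch below assumes a made-up boundary `MODEL_USER_LIMIT` and a hypothetical helper `model_range_in_user_space()`; it models the idea of a whole-range check and is not the kernel's `access_ok` implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up boundary for illustration: addresses below this are "user". */
#define MODEL_USER_LIMIT UINT64_C(0x0000800000000000)

/* Checking only the start address would miss ranges that cross the limit. */
static bool model_pointer_in_user_space(uint64_t addr)
{
	return addr < MODEL_USER_LIMIT;
}

/* Whole-range check, written so that addr + len cannot overflow. */
static bool model_range_in_user_space(uint64_t addr, uint64_t len)
{
	return len <= MODEL_USER_LIMIT && addr <= MODEL_USER_LIMIT - len;
}
```

With these definitions, a 16-byte range starting 8 bytes below the limit passes the pointer-only check but fails the range check.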
     

25 Jul, 2020

3 commits

  • For architectures like x86 and arm64 we don't need the separate bit to
    indicate that a pointer is a kernel pointer as the address spaces are
    unified. That way the sockptr_t can be reduced to a union of two
    pointers, which leads to nicer calling conventions.

    The only caveat is that we need to check that users don't pass in a kernel
    address and thus gain access to kernel memory. Thus the USER_SOCKPTR
    helper is replaced with an init_user_sockptr function that does this check
    and returns an error if it fails.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     
  • Rework the remaining setsockopt code to pass a sockptr_t instead of a
    plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
    outside of architecture specific code.

    Signed-off-by: Christoph Hellwig
    Acked-by: Stefan Schmidt [ieee802154]
    Acked-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Christoph Hellwig
     
  • Pass a sockptr_t to prepare for set_fs-less handling of the kernel
    pointer from bpf-cgroup.

    Signed-off-by: Christoph Hellwig
    Acked-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

20 Jul, 2020

4 commits


14 Jul, 2020

1 commit


05 Jul, 2020

1 commit

  • setsockopt(mptcp_fd, SOL_SOCKET, ...)... appears to work (returns 0),
    but it has no effect -- this is because the MPTCP layer never has a
    chance to copy the settings to the subflow socket.

    Skip the generic handling for the mptcp case and call the
    mptcp specific handler for SOL_SOCKET too.

    Next patch adds more specific handling for SOL_SOCKET to mptcp.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 May, 2020

1 commit


28 May, 2020

1 commit


19 May, 2020

2 commits


12 May, 2020

1 commit

  • The msg_control field in struct msghdr can either contain a user
    pointer when used with the recvmsg system call, or a kernel pointer
    when used with sendmsg. To complicate things further kernel_recvmsg
    can stuff a kernel pointer in and then use set_fs to make the uaccess
    helpers accept it.

    Replace it with a union of a kernel pointer msg_control field, and
    a user pointer msg_control_user one, and allow kernel_recvmsg to operate
    on a proper kernel pointer using a bitfield to override the normal
    choice of a user pointer for recvmsg.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
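
The union-plus-bitfield layout described above might look roughly like the following userspace model. The struct name `msghdr_model`, the flag name `msg_control_is_user`, and the helper `model_put_control()` are assumptions made for this sketch; the real struct msghdr has more fields and the kernel would use a checked user copy where this model uses memcpy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down model of the control-buffer part of msghdr. */
struct msghdr_model {
	union {
		void *msg_control;      /* kernel pointer */
		void *msg_control_user; /* user pointer */
	};
	bool msg_control_is_user : 1;   /* selects which union member is valid */
	size_t msg_controllen;
};

enum model_copy_path { MODEL_COPY_KERNEL, MODEL_COPY_USER };

/* Copy control data into the buffer and report which path was taken. */
static enum model_copy_path model_put_control(struct msghdr_model *msg,
					      const void *data, size_t len)
{
	size_t n;

	assert(msg != NULL && data != NULL);
	n = len < msg->msg_controllen ? len : msg->msg_controllen;
	if (msg->msg_control_is_user) {
		/* A kernel would use a checked user copy here. */
		memcpy(msg->msg_control_user, data, n);
		return MODEL_COPY_USER;
	}
	memcpy(msg->msg_control, data, n);
	return MODEL_COPY_KERNEL;
}
```

The point of the flag is that the choice of copy routine is made explicitly by the caller that filled in the pointer, rather than by temporarily changing what the uaccess helpers accept.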
     

31 Mar, 2020

1 commit

  • Pull io_uring updates from Jens Axboe:
    "Here are the io_uring changes for this merge window. Light on new
    features this time around (just splice + buffer selection), lots of
    cleanups, fixes, and improvements to existing support. In particular,
    this contains:

    - Cleanup fixed file update handling for stack fallback (Hillf)

    - Re-work of how pollable async IO is handled, we no longer require
    thread offload to handle that. Instead we rely on using poll to drive
    this, with task_work execution.

    - In conjunction with the above, allow expendable buffer selection,
    so that poll+recv (for example) no longer has to be a split
    operation.

    - Make sure we honor RLIMIT_FSIZE for buffered writes

    - Add support for splice (Pavel)

    - Linked work inheritance fixes and optimizations (Pavel)

    - Async work fixes and cleanups (Pavel)

    - Improve io-wq locking (Pavel)

    - Hashed link write improvements (Pavel)

    - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)"

    * tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits)
    io_uring: cleanup io_alloc_async_ctx()
    io_uring: fix missing 'return' in comment
    io-wq: handle hashed writes in chains
    io-uring: drop 'free_pfile' in struct io_file_put
    io-uring: drop completion when removing file
    io_uring: Fix ->data corruption on re-enqueue
    io-wq: close cancel gap for hashed linked work
    io_uring: make spdxcheck.py happy
    io_uring: honor original task RLIMIT_FSIZE
    io-wq: hash dependent work
    io-wq: split hashing and enqueueing
    io-wq: don't resched if there is no work
    io-wq: remove duplicated cancel code
    io_uring: fix truncated async read/readv and write/writev retry
    io_uring: dual license io_uring.h uapi header
    io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
    io_uring: Fix unused function warnings
    io_uring: add end-of-bits marker and build time verify it
    io_uring: provide means of removing buffers
    io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
    ...

    Linus Torvalds
     

20 Mar, 2020

1 commit

  • Just like commit 4022e7af86be, this fixes the fact that
    IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks
    current->signal->rlim[] for limits.

    Add an extra argument to __sys_accept4_file() that allows us to pass
    in the proper nofile limit, and grab it at request prep time.

    Acked-by: David S. Miller
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Mar, 2020

1 commit


09 Jan, 2020

1 commit

  • When procfs is disabled, the fdinfo code causes a harmless
    warning:

    net/socket.c:1000:13: error: 'sock_show_fdinfo' defined but not used [-Werror=unused-function]
    static void sock_show_fdinfo(struct seq_file *m, struct file *f)

    Move the function definition up so we can use a single #ifdef
    around it.

    Fixes: b4653342b151 ("net: Allow to show socket-specific information in /proc/[pid]/fdinfo/[fd]")
    Suggested-by: Al Viro
    Acked-by: Kirill Tkhai
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

23 Dec, 2019

1 commit


14 Dec, 2019

1 commit

  • Pull io_uring fixes from Jens Axboe:

    - A tweak to IOSQE_IO_LINK (also marked for stable) to allow links that
    don't sever if the result is < 0.

    This is mostly for linked timeouts, where if we ask for a pure
    timeout we always get -ETIME. This makes links useless for that case,
    hence allow a case where it works.

    - Five minor optimizations to fix and improve cases that regressed
    since v5.4.

    - An SQTHREAD locking fix.

    - A sendmsg/recvmsg iov assignment fix.

    - Net fix where read_iter/write_iter don't honor IOCB_NOWAIT, and
    subsequently ensuring that works for io_uring.

    - Fix a case where for an invalid opcode we might return -EBADF instead
    of -EINVAL, if the ->fd of that sqe was set to an invalid fd value.

    * tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block:
    io_uring: ensure we return -EINVAL on unknown opcode
    io_uring: add sockets to list of files that support non-blocking issue
    net: make socket read/write_iter() honor IOCB_NOWAIT
    io_uring: only hash regular files for async work execution
    io_uring: run next sqe inline if possible
    io_uring: don't dynamically allocate poll data
    io_uring: deferred send/recvmsg should assign iov
    io_uring: sqthread should grab ctx->uring_lock for submissions
    io-wq: briefly spin for new work after finishing work
    io-wq: remove worker->wait waitqueue
    io_uring: allow unbreakable links

    Linus Torvalds
     

13 Dec, 2019

1 commit


11 Dec, 2019

1 commit

  • The socket read/write helpers only look at the file O_NONBLOCK, not
    the iocb IOCB_NOWAIT flag. This breaks users like preadv2/pwritev2
    and io_uring that rely on not having the file itself marked nonblocking,
    but rather the iocb itself.

    Cc: netdev@vger.kernel.org
    Acked-by: David Miller
    Signed-off-by: Jens Axboe

    Jens Axboe
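
One way to picture the fix is a helper that derives the per-message "don't wait" flag from either source. The sketch below uses made-up flag values (`MODEL_O_NONBLOCK`, `MODEL_IOCB_NOWAIT`, `MODEL_MSG_DONTWAIT`) and a hypothetical function `model_sock_msg_flags()`; it only illustrates the decision, not the actual socket code.

```c
#include <assert.h>

/* Made-up flag values; the real constants live in kernel headers. */
#define MODEL_O_NONBLOCK   0x1u /* set on the file */
#define MODEL_IOCB_NOWAIT  0x2u /* set on one individual request */
#define MODEL_MSG_DONTWAIT 0x4u /* what the socket layer acts on */

/* Request non-blocking behaviour if either the file or the iocb asks. */
static unsigned int model_sock_msg_flags(unsigned int file_flags,
					 unsigned int iocb_flags)
{
	unsigned int msg_flags = 0;

	if ((file_flags & MODEL_O_NONBLOCK) || (iocb_flags & MODEL_IOCB_NOWAIT))
		msg_flags |= MODEL_MSG_DONTWAIT;
	return msg_flags;
}
```

Checking only the first condition is the behaviour the commit describes as broken: a caller that sets the no-wait flag per request, without marking the file nonblocking, would still block.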
     

09 Dec, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) More jumbo frame fixes in r8169, from Heiner Kallweit.

    2) Fix bpf build in minimal configuration, from Alexei Starovoitov.

    3) Use after free in slcan driver, from Jouni Hogander.

    4) Flower classifier port ranges don't work properly in the HW offload
    case, from Yoshiki Komachi.

    5) Use after free in hns3_nic_maybe_stop_tx(), from Yunsheng Lin.

    6) Out of bounds access in mqprio_dump(), from Vladyslav Tarasiuk.

    7) Fix flow dissection in dsa TX path, from Alexander Lobakin.

    8) Stale syncookie timestamp fixes, from Guillaume Nault.

    [ Did an evil merge to silence a warning introduced by this pull - Linus ]

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
    r8169: fix rtl_hw_jumbo_disable for RTL8168evl
    net_sched: validate TCA_KIND attribute in tc_chain_tmplt_add()
    r8169: add missing RX enabling for WoL on RTL8125
    vhost/vsock: accept only packets with the right dst_cid
    net: phy: dp83867: fix hfs boot in rgmii mode
    net: ethernet: ti: cpsw: fix extra rx interrupt
    inet: protect against too small mtu values.
    gre: refetch erspan header from skb->data after pskb_may_pull()
    pppoe: remove redundant BUG_ON() check in pppoe_pernet
    tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
    tcp: tighten acceptance of ACKs not matching a child socket
    tcp: fix rejected syncookies due to stale timestamps
    lpc_eth: kernel BUG on remove
    tcp: md5: fix potential overestimation of TCP option space
    net: sched: allow indirect blocks to bind to clsact in TC
    net: core: rename indirect block ingress cb function
    net-sysfs: Call dev_hold always in netdev_queue_add_kobject
    net: dsa: fix flow dissection on Tx path
    net/tls: Fix return values to avoid ENOTSUPP
    net: avoid an indirect call in ____sys_recvmsg()
    ...

    Linus Torvalds
     

07 Dec, 2019

1 commit

  • CONFIG_RETPOLINE=y made indirect calls expensive.

    gcc seems to add an indirect call in ____sys_recvmsg().

    Rewriting the code slightly makes sure to avoid this indirection.

    Alternative would be to not call sock_recvmsg() and instead
    use security_socket_recvmsg() and sock_recvmsg_nosec(),
    but this is less readable IMO.

    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Cc: David Laight
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Eric Dumazet
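
The commit text does not show the code, so the following is only a guess at the general shape of such a rewrite: selecting a callee through a function-pointer expression versus branching and calling each function directly. All names (`model_recvmsg()`, `model_recvmsg_nosec()`, `recv_via_pointer()`, `recv_via_branch()`) are hypothetical, and whether a given compiler emits an indirect call for the first form depends on its version and options.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a receive path with and without a hook. */
static int model_recvmsg(int flags)       { return flags + 100; }
static int model_recvmsg_nosec(int flags) { return flags; }

/* Shape 1: pick a function pointer, then call through it. A compiler may
 * turn this into an indirect call, which is costly with retpolines. */
static int recv_via_pointer(bool nosec, int flags)
{
	int (*fn)(int) = nosec ? model_recvmsg_nosec : model_recvmsg;

	return fn(flags);
}

/* Shape 2: branch first, then make a direct call on each path. */
static int recv_via_branch(bool nosec, int flags)
{
	if (nosec)
		return model_recvmsg_nosec(flags);
	return model_recvmsg(flags);
}
```

Both shapes compute the same result; the second simply leaves the compiler no reason to go through a pointer.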
     

03 Dec, 2019

2 commits


02 Dec, 2019

2 commits

  • Pull y2038 cleanups from Arnd Bergmann:
    "y2038 syscall implementation cleanups

    This is a series of cleanups for the y2038 work, mostly intended for
    namespace cleaning: the kernel defines the traditional time_t, timeval
    and timespec types that often lead to y2038-unsafe code. Even though
    the unsafe usage is mostly gone from the kernel, having the types and
    associated functions around means that we can still grow new users,
    and that we may be missing conversions to safe types that actually
    matter.

    There are still a number of driver specific patches needed to get the
    last users of these types removed, those have been submitted to the
    respective maintainers"

    Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/

    * tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (26 commits)
    y2038: alarm: fix half-second cut-off
    y2038: ipc: fix x32 ABI breakage
    y2038: fix typo in powerpc vdso "LOPART"
    y2038: allow disabling time32 system calls
    y2038: itimer: change implementation to timespec64
    y2038: move itimer reset into itimer.c
    y2038: use compat_{get,set}_itimer on alpha
    y2038: itimer: compat handling to itimer.c
    y2038: time: avoid timespec usage in settimeofday()
    y2038: timerfd: Use timespec64 internally
    y2038: elfcore: Use __kernel_old_timeval for process times
    y2038: make ns_to_compat_timeval use __kernel_old_timeval
    y2038: socket: use __kernel_old_timespec instead of timespec
    y2038: socket: remove timespec reference in timestamping
    y2038: syscalls: change remaining timeval to __kernel_old_timeval
    y2038: rusage: use __kernel_old_timeval
    y2038: uapi: change __kernel_time_t to __kernel_old_time_t
    y2038: stat: avoid 'time_t' in 'struct stat'
    y2038: ipc: remove __kernel_time_t reference from headers
    y2038: vdso: powerpc: avoid timespec references
    ...

    Linus Torvalds
     
  • Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
    "As part of the cleanup of some remaining y2038 issues, I came to
    fs/compat_ioctl.c, which still has a couple of commands that need
    support for time64_t.

    In completely unrelated work, I spent time on cleaning up parts of
    this file in the past, moving things out into drivers instead.

    After Al Viro reviewed an earlier version of this series and did a lot
    more of that cleanup, I decided to try to completely eliminate the
    rest of it and move it all into drivers.

    This series incorporates some of Al's work and many patches of my own,
    but in the end stops short of actually removing the last part, which
    is the scsi ioctl handlers. I have patches for those as well, but they
    need more testing or possibly a rewrite"

    * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
    scsi: sd: enable compat ioctls for sed-opal
    pktcdvd: add compat_ioctl handler
    compat_ioctl: move SG_GET_REQUEST_TABLE handling
    compat_ioctl: ppp: move simple commands into ppp_generic.c
    compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
    compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
    compat_ioctl: unify copy-in of ppp filters
    tty: handle compat PPP ioctls
    compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
    compat_ioctl: handle SIOCOUTQNSD
    af_unix: add compat_ioctl support
    compat_ioctl: reimplement SG_IO handling
    compat_ioctl: move WDIOC handling into wdt drivers
    fs: compat_ioctl: move FITRIM emulation into file systems
    gfs2: add compat_ioctl support
    compat_ioctl: remove unused convert_in_user macro
    compat_ioctl: remove last RAID handling code
    compat_ioctl: remove /dev/raw ioctl translation
    compat_ioctl: remove PCI ioctl translation
    compat_ioctl: remove joystick ioctl translation
    ...

    Linus Torvalds
     

27 Nov, 2019

2 commits


26 Nov, 2019

3 commits

  • This is identical to __sys_connect(), except it takes a struct file
    instead of an fd, and it also allows passing in extra file->f_flags
    flags. The latter is done to support masking in O_NONBLOCK without
    manipulating the original file flags.

    No functional changes in this patch.

    Cc: netdev@vger.kernel.org
    Acked-by: David S. Miller
    Signed-off-by: Jens Axboe

    Jens Axboe
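
As a small sketch of "masking in O_NONBLOCK without manipulating the original file flags": the extra flags can be OR-ed into a local value at call time while the file object stays untouched. `struct model_file`, `MODEL_O_NONBLOCK` and `model_connect_flags()` are hypothetical names and values used only for this illustration.

```c
#include <assert.h>

#define MODEL_O_NONBLOCK 0x800u /* made-up value for illustration */

/* Hypothetical stand-in for the part of struct file that matters here. */
struct model_file {
	unsigned int f_flags;
};

/* Combine the file's own flags with per-call extras; the file is const,
 * so the caller's view of f_flags cannot change as a side effect. */
static unsigned int model_connect_flags(const struct model_file *file,
					unsigned int extra_flags)
{
	return file->f_flags | extra_flags;
}
```

A caller that wants a non-blocking connect on an otherwise blocking file would pass `MODEL_O_NONBLOCK` as the extra flags and leave the file as it was.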
     
  • Pull io_uring updates from Jens Axboe:
    "A lot of stuff has been going on this cycle, with improving the
    support for networked IO (and hence unbounded request completion
    times) being one of the major themes. There's been a set of fixes done
    this week, I'll send those out as well once we're certain we're fully
    happy with them.

    This contains:

    - Unification of the "normal" submit path and the SQPOLL path (Pavel)

    - Support for sparse (and bigger) file sets, and updating of those
    file sets without needing to unregister/register again.

    - Independently sized CQ ring, instead of just making it always 2x
    the SQ ring size. This makes it more flexible for networked
    applications.

    - Support for overflowed CQ ring, never dropping events but providing
    backpressure on submits.

    - Add support for absolute timeouts, not just relative ones.

    - Support for generic cancellations. This divorces io_uring from
    workqueues as well, which additionally gets us one step closer to
    generic async system call support.

    - With cancellations, we can support grabbing the process file table
    as well, just like we do mm context. This allows support for system
    calls that create file descriptors, like accept4() support that's
    built on top of that.

    - Support for io_uring tracing (Dmitrii)

    - Support for linked timeouts. These abort an operation if it isn't
    completed by the time noted in the linked timeout.

    - Speedup tracking of poll requests

    - Various cleanups making the code easier to follow (Jackie, Pavel,
    Bob, YueHaibing, me)

    - Update MAINTAINERS with new io_uring list"

    * tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
    io_uring: make POLL_ADD/POLL_REMOVE scale better
    io-wq: remove now redundant struct io_wq_nulls_list
    io_uring: Fix getting file for non-fd opcodes
    io_uring: introduce req_need_defer()
    io_uring: clean up io_uring_cancel_files()
    io-wq: ensure free/busy list browsing see all items
    io-wq: ensure we have a stable view of ->cur_work for cancellations
    io_wq: add get/put_work handlers to io_wq_create()
    io_uring: check for validity of ->rings in teardown
    io_uring: fix potential deadlock in io_poll_wake()
    io_uring: use correct "is IO worker" helper
    io_uring: fix -ENOENT issue with linked timer with short timeout
    io_uring: don't do flush cancel under inflight_lock
    io_uring: flag SQPOLL busy condition to userspace
    io_uring: make ASYNC_CANCEL work with poll and timeout
    io_uring: provide fallback request for OOM situations
    io_uring: convert accept4() -ERESTARTSYS into -EINTR
    io_uring: fix error clear of ->file_table in io_sqe_files_register()
    io_uring: separate the io_free_req and io_free_req_find_next interface
    io_uring: keep io_put_req only responsible for release and put req
    ...

    Linus Torvalds
     
  • In commit 3975b097e577 ("convert stream-like files -> stream_open, even
    if they use noop_llseek") Kirill used a coccinelle script to change
    "nonseekable_open()" to "stream_open()", which changed the trivial cases
    of stream-like file descriptors to the new model with FMODE_STREAM.

    However, the two big cases - sockets and pipes - don't actually have
    that trivial pattern at all, and were thus never converted to
    FMODE_STREAM even though it makes lots of sense to do so.

    That's particularly true when looking forward to the next change:
    getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
    decide whether f_pos updates are needed or not. And if they are, we'll
    always do them atomically.

    This came up because KCSAN (correctly) noted that the non-locked f_pos
    updates are data races: they are clearly benign for the case where we
    don't care, but it would be good to just not have that issue exist at
    all.

    Note that the reason we used FMODE_ATOMIC_POS originally is that only
    doing it for the minimal required case is "safer" in that it's possible
    that the f_pos locking can cause unnecessary serialization across the
    whole write() call. And in the worst case, that kind of serialization
    can cause deadlock issues: think writers that need readers to empty the
    state using the same file descriptor.

    [ Note that the locking is per-file descriptor - because it protects
    "f_pos", which is obviously per-file descriptor - so it only affects
    cases where you literally use the same file descriptor to both read
    and write.

    So a regular pipe that has separate reading and writing file
    descriptors doesn't really have this situation even though it's the
    obvious case of "reader empties what a big writer concurrently fills"

    But we want to mark pipes as being stream-like anyway, because we
    don't want the unnecessary overhead of locking, and because a named
    pipe can be (ab-)used by reading and writing to the same file
    descriptor. ]

    There are likely a lot of other cases that might want FMODE_STREAM, and
    looking for ".llseek = no_llseek" users and other cases that don't have
    an lseek file operation at all and making them use "stream_open()" might
    be a good idea. But pipes and sockets are likely to be the two main
    cases.

    Cc: Kirill Smelkov
    Cc: Eric Dumazet
    Cc: Al Viro
    Cc: Alan Stern
    Cc: Marco Elver
    Cc: Andrea Parri
    Cc: Paul McKenney
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Nov, 2019

1 commit

  • The 'timespec' type definition and helpers like ktime_to_timespec()
    or timespec64_to_timespec() should no longer be used in the kernel so
    we can remove them and avoid introducing y2038 issues in new code.

    Change the socket code that needs to pass a timespec to user space for
    backward compatibility to use __kernel_old_timespec instead. This type
    has the same layout but a more clearly defined name.

    Slightly reformat tcp_recv_timestamp() for consistency after the removal
    of timespec64_to_timespec().

    Acked-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann