11 Nov, 2020

1 commit


13 Oct, 2020

1 commit

  • Pull compat iovec cleanups from Al Viro:
    "Christoph's series around import_iovec() and compat variant thereof"

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    security/keys: remove compat_keyctl_instantiate_key_iov
    mm: remove compat_process_vm_{readv,writev}
    fs: remove compat_sys_vmsplice
    fs: remove the compat readv/writev syscalls
    fs: remove various compat readv/writev helpers
    iov_iter: transparently handle compat iovecs in import_iovec
    iov_iter: refactor rw_copy_check_uvector and import_iovec
    iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
    compat.h: fix a spelling error in
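
    As a rough userspace illustration of what handling compat iovecs
    transparently involves: a 32-bit task lays out each iovec as two 32-bit
    fields, and the kernel has to widen them into the native layout. The
    names compat_iovec32 and widen_iovec below are made up for this sketch;
    the real code in lib/iov_iter.c also validates and copies from user
    memory, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical mirror of the iovec layout used by 32-bit userland. */
struct compat_iovec32 {
	uint32_t iov_base;	/* 32-bit user pointer, carried as an integer */
	uint32_t iov_len;	/* 32-bit length */
};

/*
 * Widen n compat entries into native struct iovec entries.
 * Returns the total number of bytes described by the vector.
 */
int64_t widen_iovec(const struct compat_iovec32 *src, size_t n,
		    struct iovec *dst)
{
	int64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		dst[i].iov_base = (void *)(uintptr_t)src[i].iov_base;
		dst[i].iov_len = src[i].iov_len;
		total += src[i].iov_len;
	}
	return total;
}
```

    Callers then work only with struct iovec, regardless of which ABI the
    submitting task used.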

    Linus Torvalds
     

03 Oct, 2020

1 commit


24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where applicable.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
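
    A minimal userspace sketch of the pattern, assuming a compiler that may
    or may not support the fallthrough statement attribute. The fallback
    definition and the steps_from() function are illustrative and not taken
    from the patch.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's pseudo-keyword (an assumption here). */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)	/* fallthrough */
#endif

/* Hypothetical example: count how many steps run when entering at 'start'. */
int steps_from(int start)
{
	int steps = 0;

	switch (start) {
	case 0:
		steps++;
		fallthrough;	/* intentional: continue into case 1 */
	case 1:
		steps++;
		fallthrough;	/* intentional: continue into case 2 */
	case 2:
		steps++;
		break;
	default:
		break;
	}
	return steps;
}
```

    Unlike a comment, the macro is visible to the compiler, so a missing
    marker can be diagnosed with -Wimplicit-fallthrough.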

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

08 Aug, 2020

1 commit

  • The current split between do_mmap() and do_mmap_pgoff() was introduced in
    commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
    do_mmap_pgoff()") to support MPX.

    The wrapper function do_mmap_pgoff() always passed 0 as the value of the
    vm_flags argument to do_mmap(). However, MPX support has subsequently
    been removed from the kernel and there were no more direct callers of
    do_mmap(); all calls were going via do_mmap_pgoff().

    Simplify the code by removing do_mmap_pgoff() and changing all callers to
    directly call do_mmap(), which now no longer takes a vm_flags argument.

    Signed-off-by: Peter Collingbourne
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
    Signed-off-by: Linus Torvalds

    Peter Collingbourne
     

16 Jun, 2020

1 commit

  • There is a regular need in the kernel to provide a way to declare having a
    dynamically sized set of trailing elements in a structure. Kernel code should
    always use “flexible array members”[1] for these cases. The older style of
    one-element or zero-length arrays should no longer be used[2].

    [1] https://en.wikipedia.org/wiki/Flexible_array_member
    [2] https://github.com/KSPP/linux/issues/21
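
    For illustration, a small userspace sketch of a flexible array member.
    The struct and helper names are hypothetical; kernel code would
    typically size the allocation with the struct_size() helper instead of
    the open-coded arithmetic shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record with a dynamically sized trailing array. */
struct sample_buf {
	size_t count;
	int items[];	/* flexible array member: must be the last member */
};

/* Allocate a sample_buf with room for 'count' zeroed items; NULL on failure. */
struct sample_buf *sample_buf_alloc(size_t count)
{
	struct sample_buf *b;

	/* Reject counts whose byte size would overflow size_t. */
	if (count > (SIZE_MAX - sizeof(*b)) / sizeof(b->items[0]))
		return NULL;

	b = calloc(1, sizeof(*b) + count * sizeof(b->items[0]));
	if (b)
		b->count = count;
	return b;
}
```

    With the older one-element style (int items[1]), sizeof() silently
    includes one element, which is an easy source of off-by-one sizing
    errors.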

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

11 Jun, 2020

1 commit

  • Patch series "improve use_mm / unuse_mm", v2.

    This series improves the use_mm / unuse_mm interface by better documenting
    the assumptions, and by taking the set_fs manipulations spread over the
    callers into the core API.

    This patch (of 3):

    Use the proper API instead.

    Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

    These helpers are only for use with kernel threads, and I will tie them
    more into the kthread infrastructure going forward. Also move the
    prototypes to kthread.h - mmu_context.h was a little weird to start with
    as it otherwise contains very low-level MM bits.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Felix Kuehling
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Jason Wang
    Cc: "Michael S. Tsirkin"
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Cc: Greg Kroah-Hartman
    Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

10 Jun, 2020

1 commit

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
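
    The effect of the rule can be pictured with a toy, single-threaded
    model: callers pass the mm itself and no longer reach into the lock
    field. The wrapper names match the rule above, but struct toy_mm,
    struct toy_rwsem and the function bodies are assumptions made for this
    sketch, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a reader/writer semaphore (no blocking, no threads). */
struct toy_rwsem {
	int readers;
	bool writer;
};

/* Hypothetical miniature of the mm structure; only the lock is modelled. */
struct toy_mm {
	struct toy_rwsem lock;
};

void mmap_init_lock(struct toy_mm *mm)
{
	mm->lock.readers = 0;
	mm->lock.writer = false;
}

/* Readers may share the lock unless a writer holds it. */
bool mmap_read_trylock(struct toy_mm *mm)
{
	if (mm->lock.writer)
		return false;
	mm->lock.readers++;
	return true;
}

void mmap_read_unlock(struct toy_mm *mm)
{
	mm->lock.readers--;
}

/* A writer needs the lock exclusively. */
bool mmap_write_trylock(struct toy_mm *mm)
{
	if (mm->lock.writer || mm->lock.readers > 0)
		return false;
	mm->lock.writer = true;
	return true;
}

void mmap_write_unlock(struct toy_mm *mm)
{
	mm->lock.writer = false;
}
```

    Because no caller names the field, the lock's type or name can change
    later without touching every call site.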

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

14 May, 2020

1 commit

  • Avi Kivity reports that on fuse filesystems running in a user namespace
    asynchronous fsync fails with EOVERFLOW.

    The reason is that f_ops->fsync() is called with the creds of the kthread
    performing aio work instead of the creds of the process originally
    submitting IOCB_CMD_FSYNC.

    Fuse sends the creds of the caller in the request header and it needs to
    translate the uid and gid into the server's user namespace. Since the
    kthread is running in init_user_ns, the translation will fail and the
    operation returns an error.

    It can be argued that fsync doesn't actually need any creds, but just
    zeroing out those fields in the header (as with requests that currently
    don't take creds) is a backward compatibility risk.

    Instead of working around this issue in fuse, solve the core of the problem
    by calling the filesystem with the proper creds.
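
    The idea can be sketched in userspace as "snapshot the credentials at
    submission, borrow them in the worker". Everything prefixed toy_ below
    is hypothetical; the kernel helpers this mimics are override_creds()
    and revert_creds(), and the reference counting a real implementation
    needs is left out.

```c
#include <assert.h>

/* Toy credential: just a uid. The kernel's struct cred is far richer. */
struct toy_cred {
	unsigned int uid;
};

static const struct toy_cred kthread_cred = { 0 };
/* Credentials of whichever context is "running" in this toy model. */
static const struct toy_cred *current_cred_ptr = &kthread_cred;

/* Install 'new_cred' as the active creds and return the previous ones. */
const struct toy_cred *toy_override_creds(const struct toy_cred *new_cred)
{
	const struct toy_cred *old = current_cred_ptr;

	current_cred_ptr = new_cred;
	return old;
}

void toy_revert_creds(const struct toy_cred *old)
{
	current_cred_ptr = old;
}

struct toy_fsync_req {
	const struct toy_cred *creds;	/* snapshot taken at submission */
	unsigned int seen_uid;		/* uid the "filesystem" observed */
};

/* Submission path: runs in the submitter's context. */
void toy_submit_fsync(struct toy_fsync_req *req)
{
	req->creds = current_cred_ptr;
}

/* Worker path: runs in a kthread, so borrow the submitter's creds. */
void toy_fsync_work(struct toy_fsync_req *req)
{
	const struct toy_cred *old = toy_override_creds(req->creds);

	req->seen_uid = current_cred_ptr->uid;	/* stands in for ->fsync() */
	toy_revert_creds(old);
}
```

    In this model the filesystem sees the submitter's uid even though the
    work runs later in a different context.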

    Reported-by: Avi Kivity
    Tested-by: Giuseppe Scrivano
    Fixes: c9582eb0ff7d ("fuse: Fail all requests with invalid uids or gids")
    Cc: stable@vger.kernel.org # 4.18+
    Signed-off-by: Miklos Szeredi
    Reviewed-by: Christoph Hellwig

    Miklos Szeredi
     

04 Feb, 2020

1 commit

  • If we have nested or circular eventfd wakeups, then we can deadlock if
    we run them inline from our poll waitqueue wakeup handler. It's also
    possible to have very long chains of notifications, to the extent where
    we could risk blowing the stack.

    Check the eventfd recursion count before calling eventfd_signal(). If
    it's non-zero, then punt the signaling to async context. This is always
    safe, as it takes us out-of-line in terms of stack and locking context.
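
    A toy model of the check, assuming a single context with a plain
    counter. The names and counters are illustrative only; a real
    implementation would queue work where this sketch merely counts a
    deferral.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: recursion depth of the signalling path in this context. */
static int signal_depth;
static int delivered;	/* signals delivered inline */
static int deferred;	/* signals punted to "async context" */

/* Inline delivery; the depth is raised while wakeup handlers would run. */
void toy_signal(void)
{
	signal_depth++;
	delivered++;
	signal_depth--;
}

/*
 * Completion path: signal inline only when we are not already inside a
 * signal. Returns true if the signal was delivered inline.
 */
bool toy_complete(void)
{
	if (signal_depth != 0) {
		deferred++;	/* stand-in for scheduling async work */
		return false;
	}
	toy_signal();
	return true;
}
```

    Deferring costs some latency, but it bounds stack depth and avoids
    re-entering the wakeup path while its locks are held.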

    Cc: stable@vger.kernel.org # 4.19+
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Dec, 2019

1 commit

  • Pull y2038 cleanups from Arnd Bergmann:
    "y2038 syscall implementation cleanups

    This is a series of cleanups for the y2038 work, mostly intended for
    namespace cleaning: the kernel defines the traditional time_t, timeval
    and timespec types that often lead to y2038-unsafe code. Even though
    the unsafe usage is mostly gone from the kernel, having the types and
    associated functions around means that we can still grow new users,
    and that we may be missing conversions to safe types that actually
    matter.

    There are still a number of driver specific patches needed to get the
    last users of these types removed, those have been submitted to the
    respective maintainers"

    Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/

    * tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (26 commits)
    y2038: alarm: fix half-second cut-off
    y2038: ipc: fix x32 ABI breakage
    y2038: fix typo in powerpc vdso "LOPART"
    y2038: allow disabling time32 system calls
    y2038: itimer: change implementation to timespec64
    y2038: move itimer reset into itimer.c
    y2038: use compat_{get,set}_itimer on alpha
    y2038: itimer: compat handling to itimer.c
    y2038: time: avoid timespec usage in settimeofday()
    y2038: timerfd: Use timespec64 internally
    y2038: elfcore: Use __kernel_old_timeval for process times
    y2038: make ns_to_compat_timeval use __kernel_old_timeval
    y2038: socket: use __kernel_old_timespec instead of timespec
    y2038: socket: remove timespec reference in timestamping
    y2038: syscalls: change remaining timeval to __kernel_old_timeval
    y2038: rusage: use __kernel_old_timeval
    y2038: uapi: change __kernel_time_t to __kernel_old_time_t
    y2038: stat: avoid 'time_t' in 'struct stat'
    y2038: ipc: remove __kernel_time_t reference from headers
    y2038: vdso: powerpc: avoid timespec references
    ...

    Linus Torvalds
     

15 Nov, 2019

1 commit


22 Oct, 2019

1 commit

  • This type is used to pass the sigset_t from userland to the kernel,
    but it was using the kernel native pointer type for the member
    representing the compat userland pointer to the userland sigset_t.

    This messes up the layout, and makes the kernel eat up both the
    userland pointer and the size members into the kernel pointer, and
    then reads garbage into the kernel sigsetsize. Which makes the sigset_t
    size consistency check fail, and consequently the syscall always
    returns -EINVAL.

    This breaks both libaio and strace on 32-bit userland running on 64-bit
    kernels. And there are apparently no users in the wild of the current
    broken layout (at least according to codesearch.debian.org and a brief
    check over github.com search). So it looks safe to fix this directly
    in the kernel, instead of either letting userland deal with this
    permanently with the additional overhead or trying to make the syscall
    infer what layout userland used, even though this is also being worked
    around in libaio to temporarily cope with kernels that have not yet
    been fixed.

    We use a proper compat_uptr_t instead of a compat_sigset_t pointer.
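
    The layout problem can be reproduced in userspace. The struct names
    below are hypothetical, and the typedefs assume the usual 32-bit
    definitions of the compat types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t compat_uptr_t;	/* 32-bit user pointer, carried as an integer */
typedef uint32_t compat_size_t;

/* What 32-bit userland actually passes: two 32-bit fields, 8 bytes. */
struct fixed_compat_sigset {
	compat_uptr_t sigmask;
	compat_size_t sigsetsize;
};

/* The broken shape: a native pointer member widens the layout on 64-bit. */
struct broken_compat_sigset {
	void *sigmask;
	compat_size_t sigsetsize;
};

/* Offset at which each layout expects to find sigsetsize. */
size_t fixed_size_offset(void)
{
	return offsetof(struct fixed_compat_sigset, sigsetsize);
}

size_t broken_size_offset(void)
{
	return offsetof(struct broken_compat_sigset, sigsetsize);
}
```

    On a 64-bit build the broken layout looks for sigsetsize at offset 8,
    past the end of the 8 bytes that 32-bit userland supplied, which is why
    the size check read garbage.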

    Fixes: 7a074e96dee6 ("aio: implement io_pgetevents")
    Signed-off-by: Guillem Jover
    Signed-off-by: Al Viro

    Guillem Jover
     

20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

19 Jul, 2019

1 commit

  • migrate_page_move_mapping() doesn't use the mode argument. Remove it
    and update callers accordingly.

    Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
    Signed-off-by: Keith Busch
    Reviewed-by: Zi Yan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keith Busch
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
    syscall paths. This means that set_user_sigmask() can save ->blocked in
    ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
    was modified.

    This way the callers do not need 2 sigset_t's passed to set/restore, and
    restore_user_sigmask(), renamed to restore_saved_sigmask_unless(), turns
    into the trivial helper which just calls restore_saved_sigmask().
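
    A toy model of the resulting flow, with a plain struct in place of the
    task and a bitmask in place of sigset_t. All toy_ names are
    hypothetical, and the behaviour in the interrupted case is this
    sketch's reading of the description above.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long toy_sigset;	/* one bit per signal */

struct toy_task {
	toy_sigset blocked;
	toy_sigset saved_sigmask;
	bool restore_sigmask;
};

/* Save the current mask in the task, then install the caller's mask. */
void toy_set_user_sigmask(struct toy_task *t, toy_sigset user_mask)
{
	t->saved_sigmask = t->blocked;
	t->restore_sigmask = true;
	t->blocked = user_mask;
}

void toy_restore_saved_sigmask(struct toy_task *t)
{
	if (t->restore_sigmask) {
		t->restore_sigmask = false;
		t->blocked = t->saved_sigmask;
	}
}

/*
 * Restore right away unless the call was interrupted; in that case the
 * flag stays set so the mask is restored on the return-to-user path,
 * after the signal has been handled.
 */
void toy_restore_saved_sigmask_unless(struct toy_task *t, bool interrupted)
{
	if (!interrupted)
		toy_restore_saved_sigmask(t);
}
```

    No second sigset_t is threaded through the callers: the saved mask
    lives in the task itself.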

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

14 Jul, 2019

1 commit

  • Pull io_uring updates from Jens Axboe:
    "This contains:

    - Support for recvmsg/sendmsg as first class opcodes.

    I don't envision going much further down this path, as there are
    plans in progress to support potentially any system call in an
    async fashion through io_uring. But I think it does make sense to
    have certain core ops available directly, especially those that can
    support a "try this non-blocking" flag/mode. (me)

    - Handle generic short reads automatically.

    This can happen fairly easily if parts of the buffered read are
    cached. Since the application needs to issue another request for
    the remainder, just do this internally and save kernel/user
    roundtrip while providing a nicer more robust API. (me)

    - Support for linked SQEs.

    This allows SQEs to depend on each other, enabling an application
    to eg queue a read-from-this-file,write-to-that-file pair. (me)

    - Fix race in stopping SQ thread (Jackie)"

    * tag 'for-5.3/io_uring-20190711' of git://git.kernel.dk/linux-block:
    io_uring: fix io_sq_thread_stop running in front of io_sq_thread
    io_uring: add support for recvmsg()
    io_uring: add support for sendmsg()
    io_uring: add support for sqe links
    io_uring: punt short reads to async context
    uio: make import_iovec()/compat_import_iovec() return bytes on success

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    the visible change which breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be an advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Jun, 2019

1 commit


26 May, 2019

2 commits

  • Convert the aio filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Signed-off-by: David Howells
    cc: Benjamin LaHaise
    cc: linux-aio@kvack.org
    Signed-off-by: Al Viro

    David Howells
     
  • Once upon a time we used to set ->d_name of e.g. pipefs root
    so that d_path() on pipes would work. These days it's
    completely pointless - dentries of pipes are not even connected
    to pipefs root. However, mount_pseudo() had set the root
    dentry name (passed as the second argument) and callers
    kept inventing names to pass to it. Including those that
    didn't *have* any non-root dentries to start with...

    All of that had been pointless for about 8 years now; it's
    time to get rid of that cargo-culting...

    Signed-off-by: Al Viro

    Al Viro
     

05 Apr, 2019

1 commit


04 Apr, 2019

1 commit


18 Mar, 2019

9 commits

  • makes for somewhat cleaner control flow in __io_submit_one()

    Signed-off-by: Al Viro

    Al Viro
     
  • simplifies the caller

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Al Viro
     
  • no reason to duplicate that...

    Signed-off-by: Al Viro

    Al Viro
     
  • that ssize_t is a rudiment of earlier calling conventions; it's been
    used only to pass 0 and -E... since last autumn.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Al Viro
     
  • aio_poll() has to cope with several unpleasant problems:
    * requests that might stay around indefinitely need to
    be made visible for io_cancel(2); that must not be done to
    a request already completed, though.
    * in cases when ->poll() has placed us on a waitqueue,
    wakeup might have happened (and request completed) before ->poll()
    returns.
    * worse, in some early wakeup cases request might end
    up re-added into the queue later - we can't treat "woken up and
    currently not in the queue" as "it's not going to stick around
    indefinitely"
    * ... moreover, ->poll() might have decided not to
    put it on any queues to start with, and that needs to be distinguished
    from the previous case
    * ->poll() might have tried to put us on more than one queue.
    Only the first will succeed for aio poll, so we might end up missing
    wakeups. OTOH, we might very well notice that only after the
    wakeup hits and request gets completed (all before ->poll() gets
    around to the second poll_wait()). In that case it's too late to
    decide that we have an error.

    req->woken was an attempt to deal with that. Unfortunately, it was
    broken. What we need to keep track of is not that wakeup has happened -
    the thing might come back after that. It's that async reference is
    already gone and won't come back, so we can't (and needn't) put the
    request on the list of cancellables.

    The easiest case is "request hadn't been put on any waitqueues"; we
    can tell by seeing NULL apt.head, and in that case there won't be
    anything async. We should either complete the request ourselves
    (if vfs_poll() reports anything of interest) or return an error.

    In all other cases we get exclusion with wakeups by grabbing the
    queue lock.

    If request is currently on queue and we have something interesting
    from vfs_poll(), we can steal it and complete the request ourselves.

    If it's on queue and vfs_poll() has not reported anything interesting,
    we either put it on the cancellable list, or, if we know that it
    hadn't been put on all queues ->poll() wanted it on, we steal it and
    return an error.

    If it's _not_ on queue, it's either been already dealt with (in which
    case we do nothing), or there's aio_poll_complete_work() about to be
    executed. In that case we either put it on the cancellable list,
    or, if we know it hadn't been put on all queues ->poll() wanted it on,
    simulate what cancel would've done.

    It's a lot more convoluted than I'd like it to be. Single-consumer APIs
    suck, and unfortunately aio is not an exception...

    Signed-off-by: Al Viro

    Al Viro
     
  • Instead of having aio_complete() set ->ki_res.{res,res2}, do that
    explicitly in its callers, drop the reference (as aio_complete()
    used to do) and delay the rest until the final iocb_put().

    Signed-off-by: Al Viro

    Al Viro
     
  • We want to separate forming the resulting io_event from putting it
    into the ring buffer.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • aio_poll() is not the only case that needs file pinned; worse, while
    aio_read()/aio_write() can live without pinning iocb itself, the
    proof is rather brittle and can easily break on later changes.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Pull year 2038 updates from Thomas Gleixner:
    "Another round of changes to make the kernel ready for 2038. After lots
    of preparatory work this is the first set of syscalls which are 2038
    safe:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceive_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

    The syscall numbers are identical all over the architectures"
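
    To see why the 32-bit interfaces needed 64-bit replacements, this small
    sketch computes where a signed 32-bit second counter runs out. The
    helper names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if 'secs' since the epoch is representable in a signed 32-bit time_t. */
bool fits_time32(int64_t secs)
{
	return secs >= INT32_MIN && secs <= INT32_MAX;
}

/* Calendar year (UTC) containing 'secs' seconds since 1970; secs >= 0. */
int utc_year(int64_t secs)
{
	int64_t days = secs / 86400;
	int year = 1970;

	for (;;) {
		bool leap = (year % 4 == 0 && year % 100 != 0) ||
			    year % 400 == 0;
		int len = leap ? 366 : 365;

		if (days < len)
			return year;
		days -= len;
		year++;
	}
}
```

    The last representable second falls in January 2038; one second later
    the value no longer fits in 32 bits, while a 64-bit counter carries on.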

    * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    riscv: Use latest system call ABI
    checksyscalls: fix up mq_timedreceive and stat exceptions
    unicore32: Fix __ARCH_WANT_STAT64 definition
    asm-generic: Make time32 syscall numbers optional
    asm-generic: Drop getrlimit and setrlimit syscalls from default list
    32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
    compat ABI: use non-compat openat and open_by_handle_at variants
    y2038: add 64-bit time_t syscalls to all 32-bit architectures
    y2038: rename old time and utime syscalls
    y2038: remove struct definition redirects
    y2038: use time32 syscall names on 32-bit
    syscalls: remove obsolete __IGNORE_ macros
    y2038: syscalls: rename y2038 compat syscalls
    x86/x32: use time64 versions of sigtimedwait and recvmmsg
    timex: change syscalls to use struct __kernel_timex
    timex: use __kernel_timex internally
    sparc64: add custom adjtimex/clock_adjtime functions
    time: fix sys_timer_settime prototype
    time: Add struct __kernel_timex
    time: make adjtime compat handling available for 32 bit
    ...

    Linus Torvalds
     

05 Mar, 2019

2 commits

  • Pull vfs fixes from Al Viro:
    "Assorted fixes that sat in -next for a while, all over the place"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    aio: Fix locking in aio_poll()
    exec: Fix mem leak in kernel_read_file
    copy_mount_string: Limit string length to PATH_MAX
    cgroup: saner refcounting for cgroup_root
    fix cgroup_do_mount() handling of failure exits

    Linus Torvalds
     
  • Al Viro root-caused a race where the IOCB_CMD_POLL handling of
    fget/fput() could cause us to access the file pointer after it had
    already been freed:

    "In more details - normally IOCB_CMD_POLL handling looks so:

    1) io_submit(2) allocates aio_kiocb instance and passes it to
    aio_poll()

    2) aio_poll() resolves the descriptor to struct file by req->file =
    fget(iocb->aio_fildes)

    3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
    aio_kiocb to 2 (bumps by 1, that is).

    4) aio_poll() calls vfs_poll(). After sanity checks (basically,
    "poll_wait() had been called and only once") it locks the queue.
    That's what the extra reference to iocb had been for - we know we
    can safely access it.

    5) With queue locked, we check if ->woken has already been set to
    true (by aio_poll_wake()) and, if it had been, we unlock the
    queue, drop a reference to aio_kiocb and bugger off - at that
    point it's a responsibility to aio_poll_wake() and the stuff
    called/scheduled by it. That code will drop the reference to file
    in req->file, along with the other reference to our aio_kiocb.

    6) otherwise, we see whether we need to wait. If we do, we unlock the
    queue, drop one reference to aio_kiocb and go away - eventual
    wakeup (or cancel) will deal with the reference to file and with
    the other reference to aio_kiocb

    7) otherwise we remove ourselves from waitqueue (still under the
    queue lock), so that wakeup won't get us. No async activity will
    be happening, so we can safely drop req->file and iocb ourselves.

    If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
    won't get freed under us, so we can do all the checks and locking
    safely. And we don't touch ->file if we detect that case.

    However, vfs_poll() most certainly *does* touch the file it had been
    given. So wakeup coming while we are still in ->poll() might end up
    doing fput() on that file. That case is not too rare, and usually we
    are saved by the still present reference from descriptor table - that
    fput() is not the final one.

    But if another thread closes that descriptor right after our fget()
    and wakeup does happen before ->poll() returns, we are in trouble -
    final fput() done while we are in the middle of a method."

    Al also wrote a patch to take an extra reference to the file descriptor
    to fix this, but I instead suggested we just streamline the whole file
    pointer handling by submit_io() so that the generic aio submission code
    simply keeps the file pointer around until the aio has completed.
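
    The difference between the two schemes can be pictured with a toy
    reference-count model. This is a deliberately simplified sketch with
    hypothetical names: it models only the counts, not the actual aio
    structures or locking.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_file {
	int refs;
	bool freed;
};

struct toy_iocb {
	int refs;
	struct toy_file *file;
};

void toy_fget(struct toy_file *f)
{
	f->refs++;
}

void toy_fput(struct toy_file *f)
{
	if (--f->refs == 0)
		f->freed = true;
}

/* The final put of the request is what releases its file reference. */
void toy_iocb_put(struct toy_iocb *req)
{
	if (--req->refs == 0)
		toy_fput(req->file);
}

/*
 * Replay the race: close() plus an early wakeup while ->poll() is still
 * running. Returns true if the file was freed inside ->poll().
 */
bool file_freed_inside_poll(bool tie_file_to_iocb)
{
	struct toy_file f = { 1, false };	/* descriptor table reference */
	struct toy_iocb req = { 2, &f };	/* submitter + async side */
	bool freed_in_poll;

	toy_fget(&f);			/* request resolves the descriptor */
	/* "inside ->poll()" from here on */
	toy_fput(&f);			/* another thread closes the descriptor */
	if (tie_file_to_iocb)
		toy_iocb_put(&req);	/* wakeup drops only its request ref */
	else
		toy_fput(&f);		/* wakeup drops the file directly */
	freed_in_poll = f.freed;
	/* ->poll() has returned */
	if (tie_file_to_iocb)
		toy_iocb_put(&req);	/* final put releases the file safely */
	return freed_in_poll;
}
```

    When the file reference lives and dies with the request, an early
    wakeup can no longer drop the last file reference while the submitter
    is still inside a method on that file.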

    Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
    Acked-by: Al Viro
    Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Feb, 2019

1 commit

  • wake_up_locked() may but does not have to be called with interrupts
    disabled. Since the fuse filesystem calls wake_up_locked() without
    disabling interrupts aio_poll_wake() may be called with interrupts
    enabled. Since the kioctx.ctx_lock may be acquired from IRQ context,
    all code that acquires that lock from thread context must disable
    interrupts. Hence change the spin_trylock() call in aio_poll_wake()
    into a spin_trylock_irqsave() call. This patch fixes the following
    lockdep complaint:

    =====================================================
    WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
    5.0.0-rc4-next-20190131 #23 Not tainted
    -----------------------------------------------------
    syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908

    and this task is already holding:
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
    which would create a new lock dependency:
    (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (&(&ctx->ctx_lock)->rlock){..-.}

    ... which became SOFTIRQ-irq-safe at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:654 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
    kthread+0x357/0x430 kernel/kthread.c:247
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

    to a SOFTIRQ-irq-unsafe lock:
    (&fiq->waitq){+.+.}

    ... which became SOFTIRQ-irq-unsafe at:
    ...
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&fiq->waitq);
                                   local_irq_disable();
                                   lock(&(&ctx->ctx_lock)->rlock);
                                   lock(&fiq->waitq);
      <Interrupt>
        lock(&(&ctx->ctx_lock)->rlock);

    *** DEADLOCK ***

    1 lock held by syz-executor2/13779:
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908

    the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
    -> (&(&ctx->ctx_lock)->rlock){..-.} {
    IN-SOFTIRQ-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:654 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
    kthread+0x357/0x430 kernel/kthread.c:247
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
    INITIAL USE at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    __do_sys_io_cancel fs/aio.c:2052 [inline]
    __se_sys_io_cancel fs/aio.c:2035 [inline]
    __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    }
    ... key at: [] __key.52370+0x0/0x40
    ... acquired at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    the dependencies between the lock to be acquired
    and SOFTIRQ-irq-unsafe lock:
    -> (&fiq->waitq){+.+.} {
    HARDIRQ-ON-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    SOFTIRQ-ON-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    INITIAL USE at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    }
    ... key at: [] __key.43450+0x0/0x40
    ... acquired at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    stack backtrace:
    CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline]
    check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605
    check_irq_usage kernel/locking/lockdep.c:1650 [inline]
    check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline]
    check_prev_add kernel/locking/lockdep.c:1860 [inline]
    check_prevs_add kernel/locking/lockdep.c:1968 [inline]
    validate_chain kernel/locking/lockdep.c:2339 [inline]
    __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reported-by: syzbot
    Cc: Christoph Hellwig
    Cc: Avi Kivity
    Cc: Miklos Szeredi
    Cc:
    Fixes: e8693bcfa0b4 ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19
    Signed-off-by: Miklos Szeredi
    [ bvanassche: added a comment ]
    Reluctantly-Acked-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Al Viro

    Bart Van Assche
     

07 Feb, 2019

1 commit

  • A lot of system calls that pass a time_t somewhere have an implementation
    using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
    been reworked so that this implementation can now be used on 32-bit
    architectures as well.

    The missing step is to redefine them using the regular SYSCALL_DEFINEx()
    to get them out of the compat namespace and make it possible to build them
    on 32-bit architectures.

    Any system call that ends in 'time' gets a '32' suffix on its name for
    that version, while the others get a '_time32' suffix, to distinguish
    them from the normal version, which takes a 64-bit time argument in the
    future.
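
The naming rule above can be sketched as a small user-space helper. This is only an illustration, not kernel code: the function name `time32_name` and its buffer-based interface are hypothetical, and the helper simply applies the suffix rule as stated in the commit message.

```c
#include <stdio.h>
#include <string.h>

/*
 * Illustrative helper (hypothetical, not part of the kernel): derive the
 * name of the 32-bit time_t variant of a system call from its base name.
 * Names ending in "time" get a "32" suffix; all others get "_time32".
 * Returns 0 on success, -1 if the result does not fit in 'out'.
 */
static int time32_name(const char *name, char *out, size_t outlen)
{
	size_t len = strlen(name);
	int ends_in_time = len >= 4 && strcmp(name + len - 4, "time") == 0;
	int n = snprintf(out, outlen, "%s%s", name,
			 ends_in_time ? "32" : "_time32");

	/* snprintf reports the length it wanted; treat truncation as an error. */
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Under this rule, `clock_gettime` would map to `clock_gettime32`, while `nanosleep` would map to `nanosleep_time32`.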

    In this step, only 64-bit architectures are changed; doing this rename
    first lets us avoid touching the 32-bit architectures twice.

    Acked-by: Catalin Marinas
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

06 Feb, 2019

1 commit


29 Dec, 2018

2 commits

  • Merge misc updates from Andrew Morton:

    - large KASAN update to use arm's "software tag-based mode"

    - a few misc things

    - sh updates

    - ocfs2 updates

    - just about all of MM

    * emailed patches from Andrew Morton: (167 commits)
    kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
    memcg, oom: notify on oom killer invocation from the charge path
    mm, swap: fix swapoff with KSM pages
    include/linux/gfp.h: fix typo
    mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
    hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
    hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
    memory_hotplug: add missing newlines to debugging output
    mm: remove __hugepage_set_anon_rmap()
    include/linux/vmstat.h: remove unused page state adjustment macro
    mm/page_alloc.c: allow error injection
    mm: migrate: drop unused argument of migrate_page_move_mapping()
    blkdev: avoid migration stalls for blkdev pages
    mm: migrate: provide buffer_migrate_page_norefs()
    mm: migrate: move migrate_page_lock_buffers()
    mm: migrate: lock buffers before migrate_page_move_mapping()
    mm: migration: factor out code to compute expected number of page references
    mm, page_alloc: enable pcpu_drain with zone capability
    kmemleak: add config to select auto scan
    mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
    ...

    Linus Torvalds
     
  • Pull aio updates from Jens Axboe:
    "Flushing out pre-patches for the buffered/polled aio series. Some
    fixes in here, but also optimizations"

    * tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-block:
    aio: abstract out io_event filler helper
    aio: split out iocb copy from io_submit_one()
    aio: use iocb_put() instead of open coding it
    aio: only use blk plugs for > 2 depth submissions
    aio: don't zero entire aio_kiocb aio_get_req()
    aio: separate out ring reservation from req allocation
    aio: use assigned completion handler

    Linus Torvalds