17 Jun, 2020

1 commit

  • commit 530f32fc370fd1431ea9802dbc53ab5601dfccdb upstream.

    Avi Kivity reports that on fuse filesystems running in a user namespace
    asynchronous fsync fails with EOVERFLOW.

    The reason is that f_ops->fsync() is called with the creds of the kthread
    performing aio work instead of the creds of the process originally
    submitting IOCB_CMD_FSYNC.

    Fuse sends the creds of the caller in the request header and it needs to
    translate the uid and gid into the server's user namespace. Since the
    kthread is running in init_user_ns, the translation will fail and the
    operation returns an error.

    It can be argued that fsync doesn't actually need any creds, but just
    zeroing out those fields in the header (as with requests that currently
    don't take creds) is a backward compatibility risk.

    Instead of working around this issue in fuse, solve the core of the problem
    by calling the filesystem with the proper creds.
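    A hedged sketch of that approach (field names follow fs/aio.c's
    fsync_iocb; this is not the verbatim upstream diff): capture the
    submitter's creds when the IOCB is queued, and temporarily assume them
    in the aio worker around the ->fsync() call.

        /* at io_submit() time, in the IOCB_CMD_FSYNC path */
        req->creds = prepare_creds();
        if (!req->creds)
                return -ENOMEM;

        /* in the worker that eventually performs the fsync */
        static void aio_fsync_work(struct work_struct *work)
        {
                struct fsync_iocb *req = container_of(work, struct fsync_iocb, work);
                const struct cred *old_cred = override_creds(req->creds);
                int ret;

                ret = vfs_fsync(req->file, req->datasync);
                revert_creds(old_cred);
                put_cred(req->creds);
                /* ... complete the iocb with 'ret' as before ... */
        }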

    Reported-by: Avi Kivity
    Tested-by: Giuseppe Scrivano
    Fixes: c9582eb0ff7d ("fuse: Fail all requests with invalid uids or gids")
    Cc: stable@vger.kernel.org # 4.18+
    Signed-off-by: Miklos Szeredi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     

11 Feb, 2020

1 commit

  • commit 01d7a356872eec22ef34a33a5f9cfa917d145468 upstream.

    If we have nested or circular eventfd wakeups, then we can deadlock if
    we run them inline from our poll waitqueue wakeup handler. It's also
    possible to have very long chains of notifications, to the extent where
    we could risk blowing the stack.

    Check the eventfd recursion count before calling eventfd_signal(). If
    it's non-zero, then punt the signaling to async context. This is always
    safe, as it takes us out-of-line in terms of stack and locking context.
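    The idea, as a hedged sketch (not the literal io_uring hunk; the
    deferral target ctx->evfd_work is a hypothetical work item on the ring
    context, and eventfd_signal_count() is the helper added in the same
    series):

        /*
         * eventfd_signal_count() is non-zero when this CPU is already inside
         * an eventfd_signal() chain, so signalling inline could recurse or
         * deadlock on nested waitqueue locks.
         */
        if (eventfd_signal_count())
                queue_work(system_wq, &ctx->evfd_work);  /* punt to async */
        else
                eventfd_signal(ctx->cq_ev_fd, 1);        /* safe to do inline */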

    Cc: stable@vger.kernel.org # 4.19+
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

22 Oct, 2019

1 commit

  • This type is used to pass the sigset_t from userland to the kernel,
    but it was using the kernel native pointer type for the member
    representing the compat userland pointer to the userland sigset_t.

    This messes up the layout: the kernel-side pointer swallows both the
    userland pointer and the size member, and garbage is then read into the
    kernel sigsetsize. That makes the sigset_t size consistency check fail,
    so the syscall always returns -EINVAL.

    This breaks both libaio and strace on 32-bit userland running on 64-bit
    kernels, and there are apparently no users of the current broken layout
    in the wild (at least according to codesearch.debian.org and a brief
    search on github.com). So it looks safe to fix this directly in the
    kernel, rather than let userland carry the workaround permanently or
    make the syscall infer which layout userland used; libaio is also
    working around it temporarily to cope with kernels that have not yet
    been fixed.

    We use a proper compat_uptr_t instead of a compat_sigset_t pointer.
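    A hedged sketch of the layout fix as it appears in fs/aio.c (struct and
    field names as in the existing code; the exact diff may differ):

        struct __compat_aio_sigset {
                compat_uptr_t           sigmask;     /* was: compat_sigset_t __user * */
                compat_size_t           sigsetsize;
        };

        /* the userland pointer is then recovered with compat_ptr(usig.sigmask) */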

    Fixes: 7a074e96dee6 ("aio: implement io_pgetevents")
    Signed-off-by: Guillem Jover
    Signed-off-by: Al Viro

    Guillem Jover
     

20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

19 Jul, 2019

1 commit

  • migrate_page_move_mapping() doesn't use the mode argument. Remove it
    and update callers accordingly.
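    Roughly, the signature changes from

        int migrate_page_move_mapping(struct address_space *mapping,
                        struct page *newpage, struct page *page,
                        enum migrate_mode mode, int extra_count);

    to

        int migrate_page_move_mapping(struct address_space *mapping,
                        struct page *newpage, struct page *page,
                        int extra_count);

    (argument order as in the mm code of that era; shown here only to
    illustrate the cleanup).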

    Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
    Signed-off-by: Keith Busch
    Reviewed-by: Zi Yan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keith Busch
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
    syscall paths. This means that set_user_sigmask() can save ->blocked in
    ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
    was modified.

    This way the callers do not need 2 sigset_t's passed to set/restore and
    restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
    into the trivial helper which just calls restore_saved_sigmask().
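    A hedged sketch of the resulting helpers (simplified; the real
    restore_saved_sigmask_unless() may also warn if no signal is actually
    pending):

        int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize)
        {
                sigset_t kmask;

                if (!umask)
                        return 0;
                if (sigsetsize != sizeof(sigset_t))
                        return -EINVAL;
                if (copy_from_user(&kmask, umask, sizeof(sigset_t)))
                        return -EFAULT;

                set_restore_sigmask();
                current->saved_sigmask = current->blocked;
                set_current_blocked(&kmask);
                return 0;
        }

        static inline void restore_saved_sigmask_unless(bool interrupted)
        {
                if (!interrupted)
                        restore_saved_sigmask();
                /* if interrupted, signal delivery restores ->saved_sigmask */
        }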

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

14 Jul, 2019

1 commit

  • Pull io_uring updates from Jens Axboe:
    "This contains:

    - Support for recvmsg/sendmsg as first class opcodes.

    I don't envision going much further down this path, as there are
    plans in progress to support potentially any system call in an
    async fashion through io_uring. But I think it does make sense to
    have certain core ops available directly, especially those that can
    support a "try this non-blocking" flag/mode. (me)

    - Handle generic short reads automatically.

    This can happen fairly easily if part of the buffered read is
    cached. Since the application would otherwise need to issue another
    request for the remainder, just do this internally and save a
    kernel/user roundtrip while providing a nicer, more robust API. (me)

    - Support for linked SQEs.

    This allows SQEs to depend on each other, enabling an application
    to, e.g., queue a read-from-this-file, write-to-that-file pair. (me)

    - Fix race in stopping SQ thread (Jackie)"
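    As a rough userspace illustration of the linked-SQE support mentioned
    above (a hedged sketch against liburing's public API; error handling
    omitted, and the descriptors/buffer are assumed to be set up by the
    caller):

        #include <sys/uio.h>
        #include <liburing.h>

        void copy_chunk(int infd, int outfd, void *buf, unsigned len)
        {
                struct io_uring ring;
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                struct iovec iov = { .iov_base = buf, .iov_len = len };
                int i;

                io_uring_queue_init(4, &ring, 0);

                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_readv(sqe, infd, &iov, 1, 0);
                sqe->flags |= IOSQE_IO_LINK;    /* the write waits for the read */

                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_writev(sqe, outfd, &iov, 1, 0);

                io_uring_submit(&ring);
                for (i = 0; i < 2; i++) {
                        io_uring_wait_cqe(&ring, &cqe);
                        io_uring_cqe_seen(&ring, cqe);
                }
                io_uring_queue_exit(&ring);
        }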

    * tag 'for-5.3/io_uring-20190711' of git://git.kernel.dk/linux-block:
    io_uring: fix io_sq_thread_stop running in front of io_sq_thread
    io_uring: add support for recvmsg()
    io_uring: add support for sendmsg()
    io_uring: add support for sqe links
    io_uring: punt short reads to async context
    uio: make import_iovec()/compat_import_iovec() return bytes on success

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    a visible change that breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.
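    A hedged sketch of the reworked helper (close to, but not necessarily
    identical to, the actual patch):

        void restore_user_sigmask(const void __user *usigmask,
                                  sigset_t *sigsaved, bool interrupted)
        {
                if (!usigmask)
                        return;
                /*
                 * If a signal interrupted the call, keep the temporary mask
                 * until the handler has run on the way back to userspace;
                 * otherwise restore the original mask right away.
                 */
                if (interrupted) {
                        current->saved_sigmask = *sigsaved;
                        set_restore_sigmask();
                } else {
                        __set_current_blocked(sigsaved);
                }
        }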

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Jun, 2019

1 commit


26 May, 2019

2 commits

  • Convert the aio filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.
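    For a pseudo filesystem like aio the conversion is small; a hedged
    sketch of the resulting fs/aio.c shape (using the init_pseudo() helper
    available in current kernels):

        static int aio_init_fs_context(struct fs_context *fc)
        {
                if (!init_pseudo(fc, AIO_RING_MAGIC))
                        return -ENOMEM;
                fc->s_iflags |= SB_I_NOEXEC;
                return 0;
        }

        static struct file_system_type aio_fs = {
                .name            = "aio",
                .init_fs_context = aio_init_fs_context,
                .kill_sb         = kill_anon_super,
        };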

    Signed-off-by: David Howells
    cc: Benjamin LaHaise
    cc: linux-aio@kvack.org
    Signed-off-by: Al Viro

    David Howells
     
  • Once upon a time we used to set ->d_name of e.g. pipefs root
    so that d_path() on pipes would work. These days it's
    completely pointless - dentries of pipes are not even connected
    to pipefs root. However, mount_pseudo() had set the root
    dentry name (passed as the second argument) and callers
    kept inventing names to pass to it. Including those that
    didn't *have* any non-root dentries to start with...

    All of that had been pointless for about 8 years now; it's
    time to get rid of that cargo-culting...

    Signed-off-by: Al Viro

    Al Viro
     

05 Apr, 2019

1 commit


04 Apr, 2019

1 commit


18 Mar, 2019

9 commits

  • makes for somewhat cleaner control flow in __io_submit_one()

    Signed-off-by: Al Viro

    Al Viro
     
  • simplifies the caller

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Al Viro
     
  • no reason to duplicate that...

    Signed-off-by: Al Viro

    Al Viro
     
  • that ssize_t is a rudiment of earlier calling conventions; it's been
    used only to pass 0 and -E... since last autumn.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Al Viro
     
  • aio_poll() has to cope with several unpleasant problems:
    * requests that might stay around indefinitely need to
    be made visible for io_cancel(2); that must not be done to
    a request already completed, though.
    * in cases when ->poll() has placed us on a waitqueue,
    wakeup might have happened (and request completed) before ->poll()
    returns.
    * worse, in some early wakeup cases request might end
    up re-added into the queue later - we can't treat "woken up and
    currently not in the queue" as "it's not going to stick around
    indefinitely"
    * ... moreover, ->poll() might have decided not to
    put it on any queues to start with, and that needs to be distinguished
    from the previous case
    * ->poll() might have tried to put us on more than one queue.
    Only the first will succeed for aio poll, so we might end up missing
    wakeups. OTOH, we might very well notice that only after the
    wakeup hits and request gets completed (all before ->poll() gets
    around to the second poll_wait()). In that case it's too late to
    decide that we have an error.

    req->woken was an attempt to deal with that. Unfortunately, it was
    broken. What we need to keep track of is not that wakeup has happened -
    the thing might come back after that. It's that async reference is
    already gone and won't come back, so we can't (and needn't) put the
    request on the list of cancellables.

    The easiest case is "request hadn't been put on any waitqueues"; we
    can tell by seeing NULL apt.head, and in that case there won't be
    anything async. We should either complete the request ourselves
    (if vfs_poll() reports anything of interest) or return an error.

    In all other cases we get exclusion with wakeups by grabbing the
    queue lock.

    If request is currently on queue and we have something interesting
    from vfs_poll(), we can steal it and complete the request ourselves.

    If it's on queue and vfs_poll() has not reported anything interesting,
    we either put it on the cancellable list, or, if we know that it
    hadn't been put on all queues ->poll() wanted it on, we steal it and
    return an error.

    If it's _not_ on queue, it's either been already dealt with (in which
    case we do nothing), or there's aio_poll_complete_work() about to be
    executed. In that case we either put it on the cancellable list,
    or, if we know it hadn't been put on all queues ->poll() wanted it on,
    simulate what cancel would've done.

    It's a lot more convoluted than I'd like it to be. Single-consumer APIs
    suck, and unfortunately aio is not an exception...

    Signed-off-by: Al Viro

    Al Viro
     
  • Instead of having aio_complete() set ->ki_res.{res,res2}, do that
    explicitly in its callers, drop the reference (as aio_complete()
    used to do) and delay the rest until the final iocb_put().

    Signed-off-by: Al Viro

    Al Viro
     
  • We want to separate forming the resulting io_event from putting it
    into the ring buffer.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • aio_poll() is not the only case that needs file pinned; worse, while
    aio_read()/aio_write() can live without pinning iocb itself, the
    proof is rather brittle and can easily break on later changes.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Pull year 2038 updates from Thomas Gleixner:
    "Another round of changes to make the kernel ready for 2038. After lots
    of preparatory work this is the first set of syscalls which are 2038
    safe:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceive_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

    The syscall numbers are identical all over the architectures"
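    As a hedged userspace illustration of the new entry points: on a 32-bit
    architecture the time64 syscalls take a 64-bit __kernel_timespec, and
    the number (403 for clock_gettime64, per the table above) can be used
    directly until libc wires it up:

        #include <stdio.h>
        #include <stdint.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        #ifndef __NR_clock_gettime64
        #define __NR_clock_gettime64 403        /* 32-bit architectures only */
        #endif

        struct kernel_timespec64 {              /* layout of __kernel_timespec */
                int64_t tv_sec;
                int64_t tv_nsec;
        };

        int main(void)
        {
                struct kernel_timespec64 ts;

                if (syscall(__NR_clock_gettime64, 0 /* CLOCK_REALTIME */, &ts) == 0)
                        printf("%lld.%09lld\n",
                               (long long)ts.tv_sec, (long long)ts.tv_nsec);
                return 0;
        }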

    * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    riscv: Use latest system call ABI
    checksyscalls: fix up mq_timedreceive and stat exceptions
    unicore32: Fix __ARCH_WANT_STAT64 definition
    asm-generic: Make time32 syscall numbers optional
    asm-generic: Drop getrlimit and setrlimit syscalls from default list
    32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
    compat ABI: use non-compat openat and open_by_handle_at variants
    y2038: add 64-bit time_t syscalls to all 32-bit architectures
    y2038: rename old time and utime syscalls
    y2038: remove struct definition redirects
    y2038: use time32 syscall names on 32-bit
    syscalls: remove obsolete __IGNORE_ macros
    y2038: syscalls: rename y2038 compat syscalls
    x86/x32: use time64 versions of sigtimedwait and recvmmsg
    timex: change syscalls to use struct __kernel_timex
    timex: use __kernel_timex internally
    sparc64: add custom adjtimex/clock_adjtime functions
    time: fix sys_timer_settime prototype
    time: Add struct __kernel_timex
    time: make adjtime compat handling available for 32 bit
    ...

    Linus Torvalds
     

05 Mar, 2019

2 commits

  • Pull vfs fixes from Al Viro:
    "Assorted fixes that sat in -next for a while, all over the place"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    aio: Fix locking in aio_poll()
    exec: Fix mem leak in kernel_read_file
    copy_mount_string: Limit string length to PATH_MAX
    cgroup: saner refcounting for cgroup_root
    fix cgroup_do_mount() handling of failure exits

    Linus Torvalds
     
  • Al Viro root-caused a race where the IOCB_CMD_POLL handling of
    fget/fput() could cause us to access the file pointer after it had
    already been freed:

    "In more details - normally IOCB_CMD_POLL handling looks so:

    1) io_submit(2) allocates aio_kiocb instance and passes it to
    aio_poll()

    2) aio_poll() resolves the descriptor to struct file by req->file =
    fget(iocb->aio_fildes)

    3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
    aio_kiocb to 2 (bumps by 1, that is).

    4) aio_poll() calls vfs_poll(). After sanity checks (basically,
    "poll_wait() had been called and only once") it locks the queue.
    That's what the extra reference to iocb had been for - we know we
    can safely access it.

    5) With queue locked, we check if ->woken has already been set to
    true (by aio_poll_wake()) and, if it had been, we unlock the
    queue, drop a reference to aio_kiocb and bugger off - at that
    point it's the responsibility of aio_poll_wake() and the stuff
    called/scheduled by it. That code will drop the reference to file
    in req->file, along with the other reference to our aio_kiocb.

    6) otherwise, we see whether we need to wait. If we do, we unlock the
    queue, drop one reference to aio_kiocb and go away - eventual
    wakeup (or cancel) will deal with the reference to file and with
    the other reference to aio_kiocb

    7) otherwise we remove ourselves from waitqueue (still under the
    queue lock), so that wakeup won't get us. No async activity will
    be happening, so we can safely drop req->file and iocb ourselves.

    If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
    won't get freed under us, so we can do all the checks and locking
    safely. And we don't touch ->file if we detect that case.

    However, vfs_poll() most certainly *does* touch the file it had been
    given. So wakeup coming while we are still in ->poll() might end up
    doing fput() on that file. That case is not too rare, and usually we
    are saved by the still present reference from descriptor table - that
    fput() is not the final one.

    But if another thread closes that descriptor right after our fget()
    and wakeup does happen before ->poll() returns, we are in trouble -
    final fput() done while we are in the middle of a method ..."

    Al also wrote a patch to take an extra reference to the file descriptor
    to fix this, but I instead suggested we just streamline the whole file
    pointer handling by submit_io() so that the generic aio submission code
    simply keeps the file pointer around until the aio has completed.
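    A hedged sketch of the resulting lifetime handling (names as in fs/aio.c
    after the change; surrounding context trimmed):

        /* submission takes the only file reference the request will hold ... */
        req->ki_filp = fget(iocb->aio_fildes);
        if (unlikely(!req->ki_filp))
                return -EBADF;

        /* ... and it is dropped only when the iocb itself is torn down */
        static void iocb_destroy(struct aio_kiocb *iocb)
        {
                if (iocb->ki_filp)
                        fput(iocb->ki_filp);
                percpu_ref_put(&iocb->ki_ctx->reqs);
                kmem_cache_free(kiocb_cachep, iocb);
        }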

    Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
    Acked-by: Al Viro
    Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Feb, 2019

1 commit

    wake_up_locked() may be called with interrupts either enabled or
    disabled. Since the fuse filesystem calls wake_up_locked() without
    disabling interrupts, aio_poll_wake() may be called with interrupts
    enabled. Since the kioctx.ctx_lock may be acquired from IRQ context,
    all code that acquires that lock from thread context must disable
    interrupts. Hence change the spin_trylock() call in aio_poll_wake()
    into a spin_trylock_irqsave() call (a sketch of that change follows
    the lockdep report below). This patch fixes the following lockdep
    complaint:

    =====================================================
    WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
    5.0.0-rc4-next-20190131 #23 Not tainted
    -----------------------------------------------------
    syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908

    and this task is already holding:
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
    which would create a new lock dependency:
    (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (&(&ctx->ctx_lock)->rlock){..-.}

    ... which became SOFTIRQ-irq-safe at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:654 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
    kthread+0x357/0x430 kernel/kthread.c:247
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

    to a SOFTIRQ-irq-unsafe lock:
    (&fiq->waitq){+.+.}

    ... which became SOFTIRQ-irq-unsafe at:
    ...
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Possible interrupt unsafe locking scenario:

         CPU0                                  CPU1
         ----                                  ----
    lock(&fiq->waitq);
                                          local_irq_disable();
                                          lock(&(&ctx->ctx_lock)->rlock);
                                          lock(&fiq->waitq);

    lock(&(&ctx->ctx_lock)->rlock);

    *** DEADLOCK ***

    1 lock held by syz-executor2/13779:
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908

    the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
    -> (&(&ctx->ctx_lock)->rlock){..-.} {
    IN-SOFTIRQ-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:654 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
    kthread+0x357/0x430 kernel/kthread.c:247
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
    INITIAL USE at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    __do_sys_io_cancel fs/aio.c:2052 [inline]
    __se_sys_io_cancel fs/aio.c:2035 [inline]
    __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    }
    ... key at: [] __key.52370+0x0/0x40
    ... acquired at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    the dependencies between the lock to be acquired
    and SOFTIRQ-irq-unsafe lock:
    -> (&fiq->waitq){+.+.} {
    HARDIRQ-ON-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    SOFTIRQ-ON-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    INITIAL USE at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    }
    ... key at: [] __key.43450+0x0/0x40
    ... acquired at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    stack backtrace:
    CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline]
    check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605
    check_irq_usage kernel/locking/lockdep.c:1650 [inline]
    check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline]
    check_prev_add kernel/locking/lockdep.c:1860 [inline]
    check_prevs_add kernel/locking/lockdep.c:1968 [inline]
    validate_chain kernel/locking/lockdep.c:2339 [inline]
    __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
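    In code terms, a minimal sketch of the change described above (hedged;
    the exact context in fs/aio.c may differ slightly):

        /* before: reachable with interrupts enabled via fuse's
         * wake_up_locked(), creating the inverted lock dependency */
        if (mask && spin_trylock(&iocb->ki_ctx->ctx_lock)) {
                list_del(&iocb->ki_list);
                ...
                spin_unlock(&iocb->ki_ctx->ctx_lock);
        }

        /* after: always disable interrupts while taking ctx_lock here */
        unsigned long flags;

        if (mask && spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
                list_del(&iocb->ki_list);
                ...
                spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags);
        }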

    Reported-by: syzbot
    Cc: Christoph Hellwig
    Cc: Avi Kivity
    Cc: Miklos Szeredi
    Cc:
    Fixes: e8693bcfa0b4 ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19
    Signed-off-by: Miklos Szeredi
    [ bvanassche: added a comment ]
    Reluctantly-Acked-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Al Viro

    Bart Van Assche
     

07 Feb, 2019

1 commit

  • A lot of system calls that pass a time_t somewhere have an implementation
    using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
    been reworked so that this implementation can now be used on 32-bit
    architectures as well.

    The missing step is to redefine them using the regular SYSCALL_DEFINEx()
    to get them out of the compat namespace and make it possible to build them
    on 32-bit architectures.

    Any system call that ends in 'time' gets a '32' suffix on its name for
    that version, while the others get a '_time32' suffix, to distinguish
    them from the normal version, which takes a 64-bit time argument in the
    future.

    In this step, only 64-bit architectures are changed; doing the rename
    first lets us avoid touching the 32-bit architectures twice.
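    A hedged illustration of the rename for one call that "ends in 'time'"
    and therefore gets the plain '32' suffix:

        /* before: compat-only, built only on 64-bit kernels */
        COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock,
                               struct old_timespec32 __user *, tp)

        /* after: a regular syscall, so 32-bit architectures can use it too */
        SYSCALL_DEFINE2(clock_gettime32, clockid_t, which_clock,
                        struct old_timespec32 __user *, tp)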

    Acked-by: Catalin Marinas
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

06 Feb, 2019

1 commit


29 Dec, 2018

5 commits

  • Merge misc updates from Andrew Morton:

    - large KASAN update to use arm's "software tag-based mode"

    - a few misc things

    - sh updates

    - ocfs2 updates

    - just about all of MM

    * emailed patches from Andrew Morton : (167 commits)
    kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
    memcg, oom: notify on oom killer invocation from the charge path
    mm, swap: fix swapoff with KSM pages
    include/linux/gfp.h: fix typo
    mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
    hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
    hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
    memory_hotplug: add missing newlines to debugging output
    mm: remove __hugepage_set_anon_rmap()
    include/linux/vmstat.h: remove unused page state adjustment macro
    mm/page_alloc.c: allow error injection
    mm: migrate: drop unused argument of migrate_page_move_mapping()
    blkdev: avoid migration stalls for blkdev pages
    mm: migrate: provide buffer_migrate_page_norefs()
    mm: migrate: move migrate_page_lock_buffers()
    mm: migrate: lock buffers before migrate_page_move_mapping()
    mm: migration: factor out code to compute expected number of page references
    mm, page_alloc: enable pcpu_drain with zone capability
    kmemleak: add config to select auto scan
    mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
    ...

    Linus Torvalds
     
  • Pull aio updates from Jens Axboe:
    "Flushing out pre-patches for the buffered/polled aio series. Some
    fixes in here, but also optimizations"

    * tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-block:
    aio: abstract out io_event filler helper
    aio: split out iocb copy from io_submit_one()
    aio: use iocb_put() instead of open coding it
    aio: only use blk plugs for > 2 depth submissions
    aio: don't zero entire aio_kiocb aio_get_req()
    aio: separate out ring reservation from req allocation
    aio: use assigned completion handler

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "This is the main pull request for block/storage for 4.21.

    Larger than usual, it was a busy round with lots of goodies queued up.
    Most notable is the removal of the old IO stack, which has been a long
    time coming. No new features for a while, everything coming in this
    week has all been fixes for things that were previously merged.

    This contains:

    - Use atomic counters instead of semaphores for mtip32xx (Arnd)

    - Cleanup of the mtip32xx request setup (Christoph)

    - Fix for circular locking dependency in loop (Jan, Tetsuo)

    - bcache (Coly, Guoju, Shenghui)
    * Optimizations for writeback caching
    * Various fixes and improvements

    - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
    * host and target support for NVMe over TCP
    * Error log page support
    * Support for separate read/write/poll queues
    * Much improved polling
    * discard OOM fallback
    * Tracepoint improvements

    - lightnvm (Hans, Hua, Igor, Matias, Javier)
    * Igor added packed metadata to pblk. Now drives without metadata
    per LBA can be used as well.
    * Fix from Geert on uninitialized value on chunk metadata reads.
    * Fixes from Hans and Javier to pblk recovery and write path.
    * Fix from Hua Su to fix a race condition in the pblk recovery
    code.
    * Scan optimization added to pblk recovery from Zhoujie.
    * Small geometry cleanup from me.

    - Conversion of the last few drivers that used the legacy path to
    blk-mq (me)

    - Removal of legacy IO path in SCSI (me, Christoph)

    - Removal of legacy IO stack and schedulers (me)

    - Support for much better polling, now without interrupts at all.
    blk-mq adds support for multiple queue maps, which enables us to
    have a map per type. This in turn enables nvme to have separate
    completion queues for polling, which can then be interrupt-less.
    Also means we're ready for async polled IO, which is hopefully
    coming in the next release.

    - Killing of (now) unused block exports (Christoph)

    - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

    - Support for zoned testing with null_blk (Masato)

    - sx8 conversion to per-host tag sets (Christoph)

    - IO priority improvements (Damien)

    - mq-deadline zoned fix (Damien)

    - Ref count blkcg series (Dennis)

    - Lots of blk-mq improvements and speedups (me)

    - sbitmap scalability improvements (me)

    - Make core inflight IO accounting per-cpu (Mikulas)

    - Export timeout setting in sysfs (Weiping)

    - Cleanup the direct issue path (Jianchao)

    - Export blk-wbt internals in block debugfs for easier debugging
    (Ming)

    - Lots of other fixes and improvements"

    * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
    kyber: use sbitmap add_wait_queue/list_del wait helpers
    sbitmap: add helpers for add/del wait queue handling
    block: save irq state in blkg_lookup_create()
    dm: don't reuse bio for flushes
    nvme-pci: trace SQ status on completions
    nvme-rdma: implement polling queue map
    nvme-fabrics: allow user to pass in nr_poll_queues
    nvme-fabrics: allow nvmf_connect_io_queue to poll
    nvme-core: optionally poll sync commands
    block: make request_to_qc_t public
    nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
    nvme-tcp: fix endianess annotations
    nvmet-tcp: fix endianess annotations
    nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
    nvme-pci: only set nr_maps to 2 if poll queues are supported
    nvmet: use a macro for default error location
    nvmet: fix comparison of a u16 with -1
    blk-mq: enable IO poll if .nr_queues of type poll > 0
    blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
    blk-mq: skip zero-queue maps in blk_mq_map_swqueue
    ...

    Linus Torvalds
     
  • Pull y2038 updates from Arnd Bergmann:
    "More syscalls and cleanups

    This concludes the main part of the system call rework for 64-bit
    time_t, which has spread over most of year 2018, the last six system
    calls being

    - ppoll
    - pselect6
    - io_pgetevents
    - recvmmsg
    - futex
    - rt_sigtimedwait

    As before, nothing changes for 64-bit architectures, while 32-bit
    architectures gain another entry point that differs only in the layout
    of the timespec structure. Hopefully in the next release we can wire
    up all 22 of those system calls on all 32-bit architectures, which
    gives us a baseline version for glibc to start using them.

    This does not include the clock_adjtime, getrusage/waitid, and
    getitimer/setitimer system calls. I still plan to have new versions of
    those as well, but they are not required for correct operation of the
    C library since they can be emulated using the old 32-bit time_t based
    system calls.

    Aside from the system calls, there are also a few cleanups here,
    removing old kernel internal interfaces that have become unused after
    all references got removed. The arch/sh cleanups are part of this,
    there were posted several times over the past year without a reaction
    from the maintainers, while the corresponding changes made it into all
    other architectures"

    * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
    timekeeping: remove obsolete time accessors
    vfs: replace current_kernel_time64 with ktime equivalent
    timekeeping: remove timespec_add/timespec_del
    timekeeping: remove unused {read,update}_persistent_clock
    sh: remove board_time_init() callback
    sh: remove unused rtc_sh_get/set_time infrastructure
    sh: sh03: rtc: push down rtc class ops into driver
    sh: dreamcast: rtc: push down rtc class ops into driver
    y2038: signal: Add compat_sys_rt_sigtimedwait_time64
    y2038: signal: Add sys_rt_sigtimedwait_time32
    y2038: socket: Add compat_sys_recvmmsg_time64
    y2038: futex: Add support for __kernel_timespec
    y2038: futex: Move compat implementation into futex.c
    io_pgetevents: use __kernel_timespec
    pselect6: use __kernel_timespec
    ppoll: use __kernel_timespec
    signal: Add restore_user_sigmask()
    signal: Add set_user_sigmask()

    Linus Torvalds
     
  • All callers of migrate_page_move_mapping() now pass NULL for 'head'
    argument. Drop it.

    Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

18 Dec, 2018

7 commits