09 Jun, 2022

1 commit

  • [ Upstream commit d60c4d01a98bc1942dba6e3adc02031f5519f94b ]

    When running the stress-ng clone benchmark with multiple testing threads,
    it was found that there was significant spinlock contention in sget_fc().
    The contended spinlock was the sb_lock. It is under heavy contention
    because of the following code in the critical section of sget_fc():

    hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
            if (test(old, fc))
                    goto share_extant_sb;
    }

    After testing with added instrumentation code, it was found that the
    benchmark could generate thousands of ipc namespaces with the
    corresponding number of entries in the mqueue's fs_supers list where the
    namespaces are the key for the search. This leads to excessive time in
    scanning the list for a match.

    Looking back at the mqueue calling sequence leading to sget_fc():

    mq_init_ns()
    => mq_create_mount()
       => fc_mount()
          => vfs_get_tree()
             => mqueue_get_tree()
                => get_tree_keyed()
                   => vfs_get_super()
                      => sget_fc()

    Currently, mq_init_ns() is the only mqueue function that will indirectly
    call mqueue_get_tree() with a newly allocated ipc namespace as the key for
    searching. As a result, there will never be a match with the existing ipc
    namespaces stored in the mqueue's fs_supers list.

    So using get_tree_keyed() to do an existing ipc namespace search is just a
    waste of time. Instead, we could use get_tree_nodev() to eliminate the
    useless search. By doing so, we can greatly reduce the sb_lock hold time
    and avoid the spinlock contention problem in case a large number of ipc
    namespaces are present.

    Of course, if the code is modified in the future to allow
    mqueue_get_tree() to be called with an existing ipc namespace instead of a
    new one, we will have to use get_tree_keyed() in this case.
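
    As a rough illustration of why the keyed search is wasted work here,
    the following user-space sketch models the list scan. It is not kernel
    code: `demo_super`, `demo_find_keyed()` and `demo_add_nodev()` are
    hypothetical names, and the singly linked list only stands in for the
    fs_supers hlist. A lookup with a freshly allocated key has to compare
    against every existing entry before it can conclude there is no match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock on a filesystem type's list. */
struct demo_super {
	const void *key;		/* e.g. the ipc namespace pointer */
	struct demo_super *next;
};

/*
 * Keyed lookup: walk the list looking for a matching key.
 * *steps receives the number of entries that were compared.
 */
struct demo_super *demo_find_keyed(struct demo_super *head, const void *key,
				   size_t *steps)
{
	struct demo_super *sb;

	assert(steps);
	*steps = 0;
	for (sb = head; sb; sb = sb->next) {
		(*steps)++;
		if (sb->key == key)
			return sb;
	}
	return NULL;
}

/* "nodev"-style creation: no search, just link the new entry in. */
void demo_add_nodev(struct demo_super **head, struct demo_super *sb,
		    const void *key)
{
	sb->key = key;
	sb->next = *head;
	*head = sb;
}
```

    With N entries already on the list, a lookup for a new key performs N
    comparisons and finds nothing, whereas the nodev-style insert is
    constant time.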

    The following stress-ng clone benchmark command was run on a 2-socket
    48-core Intel system:

    ./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20

    The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after
    patch. This is an increase of 54% in performance.
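
    The quoted percentage follows directly from the two throughput
    figures:

```latex
\[
\frac{9137.06 - 5948.45}{5948.45} = \frac{3188.61}{5948.45} \approx 0.536
\]
```

    That is about 54% when rounded to a whole percent.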

    Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com
    Fixes: 935c6912b198 ("ipc: Convert mqueue fs to fs_context")
    Signed-off-by: Waiman Long
    Cc: Al Viro
    Cc: David Howells
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Waiman Long
     

29 Jun, 2021

1 commit

  • Pull user namespace rlimit handling update from Eric Biederman:
    "This is the work mainly by Alexey Gladkov to limit rlimits to the
    rlimits of the user that created a user namespace, and to allow users
    to have stricter limits on the resources created within a user
    namespace."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    cred: add missing return error code when set_cred_ucounts() failed
    ucounts: Silence warning in dec_rlimit_ucounts
    ucounts: Set ucount_max to the largest positive value the type can hold
    kselftests: Add test to check for rlimit changes in different user namespaces
    Reimplement RLIMIT_MEMLOCK on top of ucounts
    Reimplement RLIMIT_SIGPENDING on top of ucounts
    Reimplement RLIMIT_MSGQUEUE on top of ucounts
    Reimplement RLIMIT_NPROC on top of ucounts
    Use atomic_t for ucounts reference counting
    Add a reference to ucounts for each cred
    Increase size of ucounts to atomic_long_t

    Linus Torvalds
     

23 May, 2021

1 commit

  • do_mq_timedreceive calls wq_sleep with a stack local address. The
    sender (do_mq_timedsend) uses this address to later call pipelined_send.

    This leads to a very hard to trigger race where a do_mq_timedreceive
    call might return and leave do_mq_timedsend to rely on an invalid
    address, causing the following crash:

    RIP: 0010:wake_q_add_safe+0x13/0x60
    Call Trace:
    __x64_sys_mq_timedsend+0x2a9/0x490
    do_syscall_64+0x80/0x680
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f5928e40343

    The race occurs as:

    1. do_mq_timedreceive calls wq_sleep with the address of `struct
    ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
    holds a valid `struct ext_wait_queue *` as long as the stack has not
    been overwritten.

    2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
    do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
    __pipelined_op.

    3. Sender calls __pipelined_op::smp_store_release(&this->state,
    STATE_READY). Here is where the race window begins. (`this` is
    `ewq_addr`.)

    4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
    will see `state == STATE_READY` and break.

    5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
    to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
    stack. (Although the address may not get overwritten until another
    function happens to touch it, which means it can persist around for an
    indefinite time.)

    6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
    `struct ext_wait_queue *`, and uses it to find a task_struct to pass to
    the wake_q_add_safe call. In the lucky case where nothing has
    overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
    In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
    bogus address as the receiver's task_struct causing the crash.

    do_mq_timedsend::__pipelined_op() should not dereference `this` after
    setting STATE_READY, as the receiver counterpart is now free to return.
    Change __pipelined_op to call wake_q_add_safe on the receiver's
    task_struct returned by get_task_struct, instead of dereferencing `this`
    which sits on the receiver's stack.

    As Manfred pointed out, the race potentially also exists in
    ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
    those in the same way.
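
    The ordering that the fix relies on can be sketched in user space with
    C11 atomics. This is only an illustration under stated assumptions:
    `demo_waiter`, `demo_task` and `demo_pipelined_op()` are hypothetical
    names, and a release store stands in for smp_store_release(). The point
    is that everything needed from the waiter is copied out before the
    READY state is published, and the waiter is not touched afterwards.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { DEMO_STATE_NONE, DEMO_STATE_READY };

struct demo_task { int id; };

/* Lives on the sleeping receiver's stack in the scenario above. */
struct demo_waiter {
	struct demo_task *task;
	atomic_int state;
};

/*
 * Hand-off in the fixed order: copy what is needed out of *w first,
 * then publish DEMO_STATE_READY. After the store the receiver may
 * return and its stack slot may be reused, so *w is not read again.
 */
struct demo_task *demo_pipelined_op(struct demo_waiter *w)
{
	struct demo_task *task;

	assert(w && w->task);
	task = w->task;			/* read before publishing */
	atomic_store_explicit(&w->state, DEMO_STATE_READY,
			      memory_order_release);
	return task;			/* the caller wakes this task */
}
```

    The caller would pass the returned pointer to its wake-up routine
    instead of reading `w->task` again.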

    Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
    Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
    Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
    Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
    Signed-off-by: Varad Gautam
    Reported-by: Matthias von Faber
    Acked-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Cc: Christian Brauner
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Varad Gautam
     

01 May, 2021

1 commit

  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.
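
    A simplified user-space model of this kind of hierarchical accounting
    is sketched below. It is an assumption-laden illustration, not the
    kernel's ucounts code: `demo_ucount` and `demo_charge()` are
    hypothetical names. Each namespace level has its own limit, and a
    charge must fit within the limit of every ancestor as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user counter in one user namespace. */
struct demo_ucount {
	struct demo_ucount *parent;	/* same user in the parent namespace */
	long count;
	long max;			/* limit set in this namespace */
};

/*
 * Charge n units at every level from uc up to the root. If any level
 * would exceed its own max, undo the partial charge and fail.
 */
bool demo_charge(struct demo_ucount *uc, long n)
{
	struct demo_ucount *it, *undo;

	assert(n >= 0);
	for (it = uc; it; it = it->parent) {
		if (it->count + n > it->max) {
			for (undo = uc; undo != it; undo = undo->parent)
				undo->count -= n;
			return false;
		}
		it->count += n;
	}
	return true;
}
```

    In this model a child namespace may carry a larger limit than its
    parent, but a charge still fails once the parent's limit would be
    exceeded.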

    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

24 Jan, 2021

3 commits

  • Extend some inode methods with an additional user namespace argument. A
    filesystem that is aware of idmapped mounts will receive the user
    namespace the mount has been marked with. This can be used for
    additional permission checking and also to enable filesystems to
    translate between uids and gids if they need to. We have implemented all
    relevant helpers in earlier patches.

    As requested we simply extend the existing inode method instead of
    introducing new ones. This is a little more code churn but it's mostly
    mechanical and doesn't leave us with additional inode methods.

    Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The various vfs_*() helpers are called by filesystems or by the vfs
    itself to perform core operations such as create, link, mkdir, mknod, rename,
    rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
    inode is accessed through an idmapped mount, map it into the
    mount's user namespace and pass it down. Afterwards the checks and
    operations are identical to non-idmapped mounts. If the initial user
    namespace is passed nothing changes so non-idmapped mounts will see
    identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The two helpers inode_permission() and generic_permission() are used by
    the vfs to perform basic permission checking by verifying that the
    caller is privileged over an inode. In order to handle idmapped mounts
    we extend the two helpers with an additional user namespace argument.
    On idmapped mounts the two helpers will make sure to map the inode
    according to the mount's user namespace and then perform identical
    permission checks to inode_permission() and generic_permission(). If the
    initial user namespace is passed nothing changes so non-idmapped mounts
    will see identical behavior as before.
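
    To make the idea of mapping an id "according to the mount's user
    namespace" concrete, here is a minimal user-space sketch. It assumes a
    single contiguous mapping range and uses hypothetical names
    (`demo_idmap`, `demo_map_to_mount()`, `DEMO_INVALID_ID`); the real
    helpers and data structures differ.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INVALID_ID UINT32_MAX

/* Hypothetical single-extent id mapping attached to a mount. */
struct demo_idmap {
	uint32_t first;		/* first id as seen through the mount */
	uint32_t lower_first;	/* first id as stored on the filesystem */
	uint32_t count;		/* length of the mapped range */
};

/* Map an on-disk id to the id seen through the mount. */
uint32_t demo_map_to_mount(const struct demo_idmap *m, uint32_t fs_id)
{
	assert(m);
	if (fs_id < m->lower_first || fs_id - m->lower_first >= m->count)
		return DEMO_INVALID_ID;
	return m->first + (fs_id - m->lower_first);
}
```

    An identity mapping (`first == lower_first == 0`) returns every id
    unchanged, which mirrors the statement that nothing changes when the
    initial user namespace is passed.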

    Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Christian Brauner

    Christian Brauner
     

08 May, 2020

1 commit

  • Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
    changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
    longer works if the sender doesn't have rights to send a signal.

    Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
    to avoid check_kill_permission().

    This needs the additional notify.sigev_signo != 0 check, shouldn't we
    change do_mq_notify() to deny sigev_signo == 0 ?

    Test-case:

    #include <signal.h>
    #include <mqueue.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <assert.h>

    static int notified;

    static void sigh(int sig)
    {
        notified = 1;
    }

    int main(void)
    {
        signal(SIGIO, sigh);

        int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
        assert(fd >= 0);

        struct sigevent se = {
            .sigev_notify = SIGEV_SIGNAL,
            .sigev_signo = SIGIO,
        };
        assert(mq_notify(fd, &se) == 0);

        if (!fork()) {
            assert(setuid(1) == 0);
            mq_send(fd, "", 1, 0);
            return 0;
        }

        wait(NULL);
        mq_unlink("/mq");
        assert(notified);
        return 0;
    }

    [manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
    Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
    Reported-by: Yoji
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Acked-by: "Eric W. Biederman"
    Cc: Davidlohr Bueso
    Cc: Markus Elfring
    Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Apr, 2020

1 commit


04 Feb, 2020

2 commits

  • Update and document memory barriers for mqueue.c:

    - ewp->state is read without any locks, thus READ_ONCE is required.

    - add smp_acquire__after_ctrl_dep() after the READ_ONCE, we need
    acquire semantics if the value is STATE_READY.

    - use wake_q_add_safe()

    - document why __set_current_state() may be used:
    Reading task->state cannot happen before the wake_q_add() call,
    which happens while holding info->lock. Thus the spin_unlock()
    is the RELEASE, and the spin_lock() is the ACQUIRE.

    For completeness: there is also a 3 CPU scenario, if the to be woken
    up task is already on another wake_q.
    Then:
    - CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
    - CPU2: the spin_lock() of the waker is the ACQUIRE
    - CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
    - CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE
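
    The lockless read followed by a conditional acquire can be
    approximated in user space with C11 atomics. This is a sketch under
    assumptions: a relaxed atomic load stands in for READ_ONCE, an acquire
    fence stands in for smp_acquire__after_ctrl_dep(), and `demo_slot`,
    `demo_publish()` and `demo_try_consume()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { DEMO_NONE, DEMO_READY };

struct demo_slot {
	int payload;
	atomic_int state;
};

/* Waker: fill in the payload, then publish with release semantics. */
void demo_publish(struct demo_slot *s, int value)
{
	s->payload = value;
	atomic_store_explicit(&s->state, DEMO_READY, memory_order_release);
}

/*
 * Sleeper: relaxed lockless read of the state; upgrade to acquire only
 * on the branch that observed DEMO_READY, then read the payload.
 */
bool demo_try_consume(struct demo_slot *s, int *out)
{
	assert(out);
	if (atomic_load_explicit(&s->state, memory_order_relaxed) != DEMO_READY)
		return false;
	atomic_thread_fence(memory_order_acquire);
	*out = s->payload;
	return true;
}
```

    The fence pairs with the release store in `demo_publish()`, so a
    consumer that sees DEMO_READY also sees the payload written before it.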

    Link: http://lkml.kernel.org/r/20191020123305.14715-4-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Cc: Waiman Long
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • pipelined_send() and pipelined_receive() are identical, so merge them.

    [manfred@colorfullife.com: add changelog]
    Link: http://lkml.kernel.org/r/20191020123305.14715-3-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Waiman Long
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

26 Sep, 2019

2 commits

  • Null pointers were assigned to local variables in a few cases as exception
    handling. The jump target “out” was used where no meaningful data
    processing actions should eventually be performed by branches of an if
    statement then. Use an additional jump target for calling dev_kfree_skb()
    directly.

    Return also directly after error conditions were detected when no extra
    clean-up is needed by this function implementation.

    Link: http://lkml.kernel.org/r/592ef10e-0b69-72d0-9789-fc48f638fdfd@web.de
    Signed-off-by: Markus Elfring
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     
  • dev_kfree_skb() performs input parameter validation, thus the test around
    the call is not needed.

    This issue was detected by using the Coccinelle software.

    Link: http://lkml.kernel.org/r/07477187-63e5-cc80-34c1-32dd16b38e12@web.de
    Signed-off-by: Markus Elfring
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     

06 Sep, 2019

1 commit


20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

17 Jul, 2019

1 commit

  • Andreas Christoforou reported:

    UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
    9 * 2305843009213693951 cannot be represented in type 'long int'
    ...
    Call Trace:
    mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
    evict+0x472/0x8c0 fs/inode.c:558
    iput_final fs/inode.c:1547 [inline]
    iput+0x51d/0x8c0 fs/inode.c:1573
    mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
    mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
    vfs_mkobj+0x39e/0x580 fs/namei.c:2892
    prepare_open ipc/mqueue.c:731 [inline]
    do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771

    Which could be triggered by:

    struct mq_attr attr = {
            .mq_flags = 0,
            .mq_maxmsg = 9,
            .mq_msgsize = 0x1fffffffffffffff,
            .mq_curmsgs = 0,
    };

    if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
            perror("mq_open");

    mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
    preparing to return -EINVAL. During the cleanup, it calls
    mqueue_evict_inode() which performed resource usage tracking math for
    updating "user", before checking if there was a valid "user" at all
    (which would indicate that the calculations would be sane). Instead,
    delay this check to after seeing a valid "user".

    The overflow was real, but the results went unused, so while the flaw is
    harmless, it's noisy for kernel fuzzers, so just fix it by moving the
    calculation under the non-NULL "user" where it actually gets used.
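
    The overflow and the shape of the fix can be illustrated with a small
    user-space sketch. Assumptions: `demo_user`, `demo_mq` and the two
    functions are hypothetical names, 64-bit integers stand in for the
    kernel's types, and `__builtin_mul_overflow()` is the GCC/Clang
    builtin rather than anything the patch itself uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_user { int64_t mq_bytes; };

struct demo_mq {
	struct demo_user *user;		/* NULL if setup failed early */
	int64_t maxmsg;
	int64_t msgsize;
};

/* Returns true if maxmsg * msgsize cannot be represented in int64_t. */
bool demo_bytes_overflow(int64_t maxmsg, int64_t msgsize)
{
	int64_t out;

	return __builtin_mul_overflow(maxmsg, msgsize, &out);
}

/* Teardown: only do the accounting math when there is a user to update. */
void demo_evict(struct demo_mq *q)
{
	assert(q);
	if (q->user)
		q->user->mq_bytes -= q->maxmsg * q->msgsize;
}
```

    With the values from the report (9 messages of 0x1fffffffffffffff
    bytes) the product does not fit in a signed 64-bit integer, and in the
    guarded version it is never computed when `user` is NULL.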

    Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
    Signed-off-by: Kees Cook
    Reported-by: Andreas Christoforou
    Acked-by: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

26 May, 2019

1 commit


15 May, 2019

3 commits

  • Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue:
    improve performance of send/recv"). However, consuming a msg in
    msg_get() remains logarithmic (still being better than the case before
    of course). By applying well known techniques to cache pointers we can
    have the node with the highest priority in O(1), which is specially nice
    for the rt cases. Furthermore, some callers can call msg_get() in a
    loop.

    A new msg_tree_erase() helper is also added to encapsulate the tree
    removal and node_cache game. Passes ltp mq testcases.
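
    The pointer-caching technique can be sketched with a plain binary
    search tree in user space. This is only an illustration: the kernel
    uses an rbtree, and `demo_tree`, `demo_node` and the functions below
    are hypothetical names. The tree keeps a cached pointer to its
    rightmost (highest priority) node, so looking it up needs no tree
    walk; the cache is refreshed on insert and on removal of the maximum.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
	int priority;
	struct demo_node *parent, *left, *right;
};

struct demo_tree {
	struct demo_node *root;
	struct demo_node *rightmost;	/* cached highest-priority node */
};

void demo_insert(struct demo_tree *t, struct demo_node *n)
{
	struct demo_node **link = &t->root, *parent = NULL;
	int is_rightmost = 1;

	while (*link) {
		parent = *link;
		if (n->priority > parent->priority) {
			link = &parent->right;
		} else {
			link = &parent->left;
			is_rightmost = 0;	/* went left at least once */
		}
	}
	n->parent = parent;
	n->left = n->right = NULL;
	*link = n;
	if (is_rightmost)
		t->rightmost = n;
}

/* O(1): return the highest-priority node without walking the tree. */
struct demo_node *demo_peek_max(const struct demo_tree *t)
{
	return t->rightmost;
}

/* Remove the highest-priority node and refresh the cache. */
struct demo_node *demo_pop_max(struct demo_tree *t)
{
	struct demo_node *n = t->rightmost, *prev;

	if (!n)
		return NULL;
	assert(n->right == NULL);	/* rightmost has no right child */
	if (n->left)
		n->left->parent = n->parent;
	if (n->parent)
		n->parent->right = n->left;
	else
		t->root = n->left;
	/* New maximum: rightmost of the left subtree, else the parent. */
	prev = n->left;
	if (prev) {
		while (prev->right)
			prev = prev->right;
	} else {
		prev = n->parent;
	}
	t->rightmost = prev;
	return n;
}
```

    Only the lookup is constant time here; refreshing the cache after a
    removal may still walk part of the tree.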

    Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • We already store the current task for the new waiter before calling
    wq_sleep() in both send and recv paths. Trivially remove the redundant
    assignment.

    Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • msgctl10 of ltp triggers the following lockup. When CONFIG_KASAN is
    enabled on large memory SMP systems, page initialization can take a
    long time; if msgctl10 requests a huge block of memory, it will block
    the rcu scheduler, so release the cpu actively.

    After adding schedule() in free_msg(), free_msg() can no longer be called
    while holding a spinlock, so add the messages to a tmp list and free them
    outside of the spinlock.

    rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
    rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
    rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
    msgctl10 R running task 21608 32505 2794 0x00000082
    Call Trace:
    preempt_schedule_irq+0x4c/0xb0
    retint_kernel+0x1b/0x2d
    RIP: 0010:__is_insn_slot_addr+0xfb/0x250
    Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
    RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
    RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
    RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
    R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
    R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
    kernel_text_address+0xc1/0x100
    __kernel_text_address+0xe/0x30
    unwind_get_return_address+0x2f/0x50
    __save_stack_trace+0x92/0x100
    create_object+0x380/0x650
    __kmalloc+0x14c/0x2b0
    load_msg+0x38/0x1a0
    do_msgsnd+0x19e/0xcf0
    do_syscall_64+0x117/0x400
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
    rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
    msgctl10 R running task 21608 32170 32155 0x00000082
    Call Trace:
    preempt_schedule_irq+0x4c/0xb0
    retint_kernel+0x1b/0x2d
    RIP: 0010:lock_acquire+0x4d/0x340
    Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
    RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
    RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
    RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
    R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
    is_bpf_text_address+0x32/0xe0
    kernel_text_address+0xec/0x100
    __kernel_text_address+0xe/0x30
    unwind_get_return_address+0x2f/0x50
    __save_stack_trace+0x92/0x100
    save_stack+0x32/0xb0
    __kasan_slab_free+0x130/0x180
    kfree+0xfa/0x2d0
    free_msg+0x24/0x50
    do_msgrcv+0x508/0xe60
    do_syscall_64+0x117/0x400
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Davidlohr said:
    "So after releasing the lock, the msg rbtree/list is empty and new
    calls will not see those in the newly populated tmp_msg list, and
    therefore they cannot access the delayed msg freeing pointers, which
    is good. Also the fact that the node_cache is now freed before the
    actual messages seems to be harmless as this is wanted for
    msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
    info->lock the thing is freed anyway so it should not change things"
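
    The "move to a tmp list, free outside the lock" pattern described in
    this commit can be sketched in user space as follows. Assumptions: a
    C11 `atomic_flag` serves as a toy spinlock, and `demo_msgq`,
    `demo_push()` and `demo_drain()` are hypothetical names; the real code
    operates on the mqueue's own structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_msg { struct demo_msg *next; };

struct demo_msgq {
	atomic_flag lock;		/* toy spinlock */
	struct demo_msg *head;
};

void demo_lock(struct demo_msgq *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;			/* spin until the flag was clear */
}

void demo_unlock(struct demo_msgq *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Allocate one message and link it in under the lock. */
int demo_push(struct demo_msgq *q)
{
	struct demo_msg *m = malloc(sizeof(*m));

	if (!m)
		return -1;
	demo_lock(q);
	m->next = q->head;
	q->head = m;
	demo_unlock(q);
	return 0;
}

/* Detach every message under the lock, then free them after unlocking. */
size_t demo_drain(struct demo_msgq *q)
{
	struct demo_msg *tmp, *next;
	size_t freed = 0;

	assert(q);
	demo_lock(q);
	tmp = q->head;			/* move the whole list to a private one */
	q->head = NULL;
	demo_unlock(q);

	for (; tmp; tmp = next) {	/* slow part runs without the lock */
		next = tmp->next;
		free(tmp);
		freed++;
	}
	return freed;
}
```

    Because the queue is empty by the time the lock is dropped, other
    callers cannot reach the messages that are still waiting to be freed.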

    Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
    Signed-off-by: Li RongQing
    Signed-off-by: Zhang Yu
    Reviewed-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Rongqing
     

02 May, 2019

1 commit


13 Mar, 2019

1 commit

  • Pull vfs mount infrastructure updates from Al Viro:
    "The rest of core infrastructure; no new syscalls in that pile, but the
    old parts are switched to new infrastructure. At that point
    conversions of individual filesystems can happen independently; some
    are done here (afs, cgroup, procfs, etc.), there's also a large series
    outside of that pile dealing with NFS (quite a bit of option-parsing
    stuff is getting used there - it's one of the most convoluted
    filesystems in terms of mount-related logics), but NFS bits are the
    next cycle fodder.

    It got seriously simplified since the last cycle; documentation is
    probably the weakest bit at the moment - I considered dropping the
    commit introducing Documentation/filesystems/mount_api.txt (cutting
    the size increase by quarter ;-), but decided that it would be better
    to fix it up after -rc1 instead.

    That pile allows to do followup work in independent branches, which
    should make life much easier for the next cycle. fs/super.c size
    increase is unpleasant; there's a followup series that allows to
    shrink it considerably, but I decided to leave that until the next
    cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
    afs: Use fs_context to pass parameters over automount
    afs: Add fs_context support
    vfs: Add some logging to the core users of the fs_context log
    vfs: Implement logging through fs_context
    vfs: Provide documentation for new mount API
    vfs: Remove kern_mount_data()
    hugetlbfs: Convert to fs_context
    cpuset: Use fs_context
    kernfs, sysfs, cgroup, intel_rdt: Support fs_context
    cgroup: store a reference to cgroup_ns into cgroup_fs_context
    cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
    cgroup_do_mount(): massage calling conventions
    cgroup: stash cgroup_root reference into cgroup_fs_context
    cgroup2: switch to option-by-option parsing
    cgroup1: switch to option-by-option parsing
    cgroup: take options parsing into ->parse_monolithic()
    cgroup: fold cgroup1_mount() into cgroup1_get_tree()
    cgroup: start switching to fs_context
    ipc: Convert mqueue fs to fs_context
    proc: Add fs_context support to procfs
    ...

    Linus Torvalds
     

28 Feb, 2019

1 commit

  • Convert the mqueue filesystem to use the filesystem context stuff.

    Notes:

    (1) The relevant ipc namespace is selected when the context is
    initialised (and it defaults to the current task's ipc namespace).
    The caller can override this before calling vfs_get_tree().

    (2) Rather than simply calling kern_mount_data(), mq_init_ns() and
    mq_internal_mount() create a context, adjust it and then do the rest
    of the mount procedure.

    (3) The lazy mqueue mounting on creation of a new namespace is retained
    from a previous patch, but the avoidance of sget() if no superblock
    yet exists is reverted and the superblock is again keyed on the
    namespace pointer.

    Yes, there was a performance gain in not searching the superblock
    hash, but it's only paid once per ipc namespace - and only if someone
    uses mqueue within that namespace, so I'm not sure it's worth it,
    especially as calling sget() allows avoidance of recursion.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

07 Feb, 2019

1 commit

  • A lot of system calls that pass a time_t somewhere have an implementation
    using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
    been reworked so that this implementation can now be used on 32-bit
    architectures as well.

    The missing step is to redefine them using the regular SYSCALL_DEFINEx()
    to get them out of the compat namespace and make it possible to build them
    on 32-bit architectures.

    Any system call that ends in 'time' gets a '32' suffix on its name for
    that version, while the others get a '_time32' suffix, to distinguish
    them from the normal version, which takes a 64-bit time argument in the
    future.
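
    The naming rule above is mechanical enough to express as a small
    helper. The function below is a hypothetical user-space illustration
    (`demo_time32_name()` is not a kernel function); it only applies the
    suffix rule as stated in the text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the 32-bit-time variant of a syscall name into out.
 * Names ending in "time" get "32", all others get "_time32".
 * Returns 0 on success, -1 if out is too small.
 */
int demo_time32_name(const char *name, char *out, size_t outlen)
{
	const char *suffix = "_time32";
	size_t len;
	int n;

	assert(name && out);
	len = strlen(name);
	if (len >= 4 && strcmp(name + len - 4, "time") == 0)
		suffix = "32";
	n = snprintf(out, outlen, "%s%s", name, suffix);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

    For example, `utime` becomes `utime32`, while `mq_timedsend` becomes
    `mq_timedsend_time32`.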

    In this step, only 64-bit architectures are changed, doing this rename
    first lets us avoid touching the 32-bit architectures twice.

    Acked-by: Catalin Marinas
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

26 Oct, 2018

1 commit

  • Pull timekeeping updates from Thomas Gleixner:
    "The timers and timekeeping department provides:

    - Another large y2038 update with further preparations for providing
    the y2038 safe timespecs closer to the syscalls.

    - An overhaul of the SHCMT clocksource driver

    - SPDX license identifier updates

    - Small cleanups and fixes all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
    tick/sched : Remove redundant cpu_online() check
    clocksource/drivers/dw_apb: Add reset control
    clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
    clocksource/drivers: Unify the names to timer-* format
    clocksource/drivers/sh_cmt: Add R-Car gen3 support
    dt-bindings: timer: renesas: cmt: document R-Car gen3 support
    clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
    clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
    clocksource/drivers/sh_cmt: Fixup for 64-bit machines
    clocksource/drivers/sh_tmu: Convert to SPDX identifiers
    clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
    clocksource/drivers/sh_cmt: Convert to SPDX identifiers
    clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
    clocksource: Convert to using %pOFn instead of device_node.name
    tick/broadcast: Remove redundant check
    RISC-V: Request newstat syscalls
    y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
    y2038: socket: Change recvmmsg to use __kernel_timespec
    y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
    y2038: utimes: Rework #ifdef guards for compat syscalls
    ...

    Linus Torvalds
     

03 Oct, 2018

1 commit

  • Linus recently observed that if we did not worry about the padding
    member in struct siginfo it is only about 48 bytes, and 48 bytes is
    much nicer than 128 bytes for allocating on the stack and copying
    around in the kernel.

    The obvious thing of only adding the padding when userspace is
    including siginfo.h won't work as there are sigframe definitions in
    the kernel that embed struct siginfo.

    So split siginfo in two; kernel_siginfo and siginfo. Keeping the
    traditional name for the userspace definition. While the version that
    is used internally to the kernel and ultimately will not be padded to
    128 bytes is called kernel_siginfo.

    The definition of struct kernel_siginfo I have put in include/signal_types.h

    A set of buildtime checks has been added to verify the two structures have
    the same field offsets.

    To make it easy to verify the change kernel_siginfo retains the same
    size as siginfo. The reduction in size comes in a following change.
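
    The build-time offset checks mentioned above can be sketched with C11
    static assertions. Everything here is hypothetical: the field list and
    the `demo_` structures are invented for illustration, and the sketch
    shows the eventual state in which only the userspace structure carries
    the 128-byte padding.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SIGINFO_FIELDS	\
	int si_signo;		\
	int si_errno;		\
	int si_code;		\
	int si_pid;

/* Kernel-internal layout: just the fields. */
struct demo_kernel_siginfo {
	DEMO_SIGINFO_FIELDS
};

/* Userspace layout: same fields, padded out to 128 bytes. */
struct demo_siginfo {
	union {
		struct { DEMO_SIGINFO_FIELDS };
		char _si_pad[128];
	};
};

/* Fail the build if a shared field sits at different offsets. */
#define DEMO_CHECK_OFFSET(field)					\
	static_assert(offsetof(struct demo_siginfo, field) ==		\
		      offsetof(struct demo_kernel_siginfo, field),	\
		      "offset mismatch: " #field)

DEMO_CHECK_OFFSET(si_signo);
DEMO_CHECK_OFFSET(si_errno);
DEMO_CHECK_OFFSET(si_code);
DEMO_CHECK_OFFSET(si_pid);
static_assert(sizeof(struct demo_siginfo) == 128, "userspace size changed");
```

    Because both structures are generated from the same field list and
    checked at compile time, a layout drift between them would break the
    build rather than surface at run time.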

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 Aug, 2018

1 commit

  • Christoph Hellwig suggested a slightly different path for handling
    backwards compatibility with the 32-bit time_t based system calls:

    Rather than simply reusing the compat_sys_* entry points on 32-bit
    architectures unchanged, we get rid of those entry points and the
    compat_time types by renaming them to something that makes more sense
    on 32-bit architectures (which don't have a compat mode otherwise),
    and then share the entry points under the new name with the 64-bit
    architectures that use them for implementing the compatibility.

    The following types and interfaces are renamed here, and moved
    from linux/compat_time.h to linux/time32.h:

    old                          new
    ---                          ---
    compat_time_t                old_time32_t
    struct compat_timeval        struct old_timeval32
    struct compat_timespec       struct old_timespec32
    struct compat_itimerspec     struct old_itimerspec32
    ns_to_compat_timeval()       ns_to_old_timeval32()
    get_compat_itimerspec64()    get_old_itimerspec32()
    put_compat_itimerspec64()    put_old_itimerspec32()
    compat_get_timespec64()      get_old_timespec32()
    compat_put_timespec64()      put_old_timespec32()

    As we already have aliases in place, this patch addresses only the
    instances that are relevant to the system call interface in particular,
    not those that occur in device drivers and other modules. Those
    will get handled separately, while providing the 64-bit version
    of the respective interfaces.

    I'm not renaming the timex, rusage and itimerval structures, as we are
    still debating what the new interface will look like, and whether we
    will need a replacement at all.

    This also doesn't change the names of the syscall entry points, which can
    be done more easily when we actually switch over the 32-bit architectures
    to use them; at that point we need to change COMPAT_SYSCALL_DEFINEx to
    SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
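
The renamed helpers convert between a 32-bit time layout and the 64-bit one used internally. A minimal userspace sketch of that idea follows. The `demo_*` types and functions are hypothetical stand-ins, not the kernel's definitions, and the overflow check in `demo_narrow()` is a choice made for this sketch that the real helpers may not share.

```c
#include <stdint.h>

/* Hypothetical mirror of a 32-bit time_t based timespec (illustrative only). */
struct demo_old_timespec32 {
    int32_t tv_sec;
    int32_t tv_nsec;
};

/* Hypothetical 64-bit counterpart (illustrative only). */
struct demo_timespec64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/* Widen a 32-bit value; the cast sign-extends, so pre-1970 times survive. */
static struct demo_timespec64 demo_widen(struct demo_old_timespec32 in)
{
    struct demo_timespec64 out;

    out.tv_sec = (int64_t)in.tv_sec;
    out.tv_nsec = in.tv_nsec;
    return out;
}

/*
 * Narrow a 64-bit value. Returns 0 on success, or -1 when the seconds do
 * not fit in 32 bits (an assumption of this sketch, e.g. times after 2038).
 */
static int demo_narrow(struct demo_timespec64 in,
                       struct demo_old_timespec32 *out)
{
    if (in.tv_sec > INT32_MAX || in.tv_sec < INT32_MIN)
        return -1;
    out->tv_sec = (int32_t)in.tv_sec;
    out->tv_nsec = (int32_t)in.tv_nsec;
    return 0;
}
```

Widening is always lossless, while narrowing is the direction where a 32-bit interface runs out of range.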

    Suggested-by: Christoph Hellwig
    Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

20 Apr, 2018

2 commits

  • Three ipc syscalls (mq_timedsend, mq_timedreceive and semtimedop)
    take a timespec argument. After we move 32-bit architectures over to
    using 64-bit time_t based syscalls, we need separate entry points for
    the old 32-bit based interfaces.

    This changes the #ifdef guards for the existing 32-bit compat syscalls
    to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
    enabled on all existing 32-bit architectures.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • This is a preparation for changing over __kernel_timespec to 64-bit
    times, which involves assigning new system call numbers for mq_timedsend(),
    mq_timedreceive() and semtimedop() for compatibility with future
    y2038-proof user space.

    The existing ABIs will remain available through compat code.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

25 Mar, 2018

1 commit

  • This reverts commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21.

    Aleksa Sarai writes:
    > [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
    >
    > Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
    > which was introduced by 36735a6a2b5e ("mqueue: switch to on-demand
    > creation of internal mount").
    >
    > Basically, the reproducer boils down to being able to mount mqueue if
    > you create a new user namespace, even if you don't unshare the IPC
    > namespace.
    >
    > Previously this was not possible, and you would get an -EPERM. The mount
    > is the *host* mqueue mount, which is being cached and just returned from
    > mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
    > it was intentional -- since I'm not familiar with mqueue).
    >
    > To me it looks like there is a missing permission check. I've included a
    > patch below that I've compile-tested, and should block the above case.
    > Can someone please tell me if I'm missing something? Is this actually
    > safe?
    >
    > [1]: https://github.com/docker/docker/issues/36674

    The issue is a lot deeper than a missing permission check. sb->s_user_ns
    is improperly set as well. So in addition to the filesystem being
    mounted when it should not be, some things are not allowed that
    should be.

    We are practically at the release of 4.16 and there is no agreement between
    Al Viro and myself on what the code should look like to fix things properly.
    So revert the code to what it was before so that we can take our time
    and discuss this properly.

    Fixes: 36735a6a2b5e ("mqueue: switch to on-demand creation of internal mount")
    Reported-by: Felix Abecassis
    Reported-by: Aleksa Sarai
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Feb, 2018

1 commit

  • Previous behavior added tasks to the work queue using the static_prio
    value instead of the dynamic priority value in prio. This caused RT tasks
    to be added to the work queue in a FIFO manner rather than by priority.
    Normal tasks were handled by priority.

    This fix utilizes the dynamic priority of the task to ensure that both RT
    and normal tasks are added to the work queue in priority order. Utilizing
    the dynamic priority (prio) rather than the base priority (normal_prio)
    was chosen to ensure that if a task had a boosted priority when it was
    added to the work queue, it would be woken sooner to ensure that it
    releases any other locks it may be holding in a more timely manner. It is
    understood that the task could have a lower priority when it wakes than
    when it was added to the queue in this (unlikely) case.
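
The ordering described above can be sketched in userspace as a sorted insert. This is not the mqueue code: `demo_waiter` and `demo_wq_add()` are hypothetical, and the sketch assumes the kernel convention that a numerically lower `prio` means a higher priority. If every waiter carried the same key, as RT tasks effectively did when `static_prio` was used, the same loop would degrade to plain FIFO.

```c
#include <stddef.h>

/* Hypothetical waiter record; lower prio value means higher priority. */
struct demo_waiter {
    int prio;
    int id;
    struct demo_waiter *next;
};

/*
 * Insert w so the list stays sorted by ascending prio value.
 * Waiters with equal prio keep arrival (FIFO) order because the walk
 * skips past every entry whose prio is <= the new one.
 * Returns the (possibly new) head of the list.
 */
static struct demo_waiter *demo_wq_add(struct demo_waiter *head,
                                       struct demo_waiter *w)
{
    struct demo_waiter **pp = &head;

    while (*pp != NULL && (*pp)->prio <= w->prio)
        pp = &(*pp)->next;
    w->next = *pp;
    *pp = w;
    return head;
}
```

Keying the insert on the current (possibly boosted) priority rather than a fixed base value is what lets a boosted waiter move ahead in the list.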

    Link: http://lkml.kernel.org/r/1513006652-7014-1-git-send-email-jhaws@sdl.usu.edu
    Signed-off-by: Jonathan Haws
    Reviewed-by: Steven Rostedt (VMware)
    Reviewed-by: Davidlohr Bueso
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Deepa Dinamani
    Cc: Thomas Gleixner
    Cc: Sebastian Andrzej Siewior
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Haws
     

31 Jan, 2018

2 commits

  • Pull mqueue/bpf vfs cleanups from Al Viro:
    "mqueue and bpf go through rather painful and similar contortions to
    create objects in their dentry trees. Provide a primitive for doing
    that without abusing ->mknod(), switch bpf and mqueue to it.

    Another mqueue-related thing that has ended up in that branch is
    on-demand creation of internal mount (based upon the work of Giuseppe
    Scrivano)"

    * 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mqueue: switch to on-demand creation of internal mount
    tidy do_mq_open() up a bit
    mqueue: clean prepare_open() up
    do_mq_open(): move all work prior to dentry_open() into a helper
    mqueue: fold mq_attr_ok() into mqueue_get_inode()
    move dentry_open() calls up into do_mq_open()
    mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata
    bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()
    new primitive: vfs_mkobj()

    Linus Torvalds
     
  • Pull poll annotations from Al Viro:
    "This introduces a __bitwise type for POLL### bitmap, and propagates
    the annotations through the tree. Most of that stuff is as simple as
    'make ->poll() instances return __poll_t and do the same to local
    variables used to hold the future return value'.

    Some of the obvious brainos found in the process are fixed (e.g. POLLIN
    misspelled as POLL_IN). At that point the amount of sparse warnings is
    low and most of them are for genuine bugs - e.g. ->poll() instance
    deciding to return -EINVAL instead of a bitmap. I hadn't touched those
    in this series - it's large enough as it is.

    Another problem it has caught was eventpoll() ABI mess; select.c and
    eventpoll.c assumed that corresponding POLL### and EPOLL### were
    equal. That's true for some, but not all of them - EPOLL### are
    arch-independent, but POLL### are not.

    The last commit in this series separates userland POLL### values from
    the (now arch-independent) kernel-side ones, converting between them
    in the few places where they are copied to/from userland. AFAICS, this
    is the least disruptive fix preserving poll(2) ABI and making epoll()
    work on all architectures.

    As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
    it will trigger only on what would've triggered EPOLLWRBAND on other
    architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
    at all on sparc. With this patch they should work consistently on all
    architectures"
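
The conversion between userland and kernel-side bitmaps mentioned above can be sketched as a table-driven translation. Everything here is an assumption made for illustration: the `DEMO_*` constants are invented values for a hypothetical architecture whose userland write-band bit collides with the arch-independent write-normal bit, and the function names are not the kernel's.

```c
#include <stddef.h>
#include <stdint.h>

/* Invented arch-independent (kernel-side) bit values. */
enum { DEMO_K_IN = 0x001, DEMO_K_OUT = 0x004,
       DEMO_K_WRNORM = 0x100, DEMO_K_WRBAND = 0x200 };

/* Invented userland values for a hypothetical architecture. */
enum { DEMO_U_IN = 0x001, DEMO_U_OUT = 0x004,
       DEMO_U_WRNORM = 0x200, DEMO_U_WRBAND = 0x100 };

struct demo_map {
    uint16_t user;
    uint16_t kernel;
};

static const struct demo_map demo_table[] = {
    { DEMO_U_IN,     DEMO_K_IN     },
    { DEMO_U_OUT,    DEMO_K_OUT    },
    { DEMO_U_WRNORM, DEMO_K_WRNORM },
    { DEMO_U_WRBAND, DEMO_K_WRBAND },
};

#define DEMO_NMAP (sizeof(demo_table) / sizeof(demo_table[0]))

/* Translate a userland bitmap into the kernel-side representation. */
static uint16_t demo_user_to_kernel(uint16_t u)
{
    uint16_t k = 0;

    for (size_t i = 0; i < DEMO_NMAP; i++)
        if (u & demo_table[i].user)
            k |= demo_table[i].kernel;
    return k;
}

/* Translate a kernel-side bitmap back into the userland representation. */
static uint16_t demo_kernel_to_user(uint16_t k)
{
    uint16_t u = 0;

    for (size_t i = 0; i < DEMO_NMAP; i++)
        if (k & demo_table[i].kernel)
            u |= demo_table[i].user;
    return u;
}
```

Code that treats the two numberings as interchangeable would read the userland write-band bit as write-normal in this made-up layout, which is the kind of mismatch the series describes.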

    * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    make kernel-side POLL... arch-independent
    eventpoll: no need to mask the result of epi_item_poll() again
    eventpoll: constify struct epoll_event pointers
    debugging printk in sg_poll() uses %x to print POLL... bitmap
    annotate poll(2) guts
    9p: untangle ->poll() mess
    ->si_band gets POLL... bitmap stored into a user-visible long field
    ring_buffer_poll_wait() return value used as return value of ->poll()
    the rest of drivers/*: annotate ->poll() instances
    media: annotate ->poll() instances
    fs: annotate ->poll() instances
    ipc, kernel, mm: annotate ->poll() instances
    net: annotate ->poll() instances
    apparmor: annotate ->poll() instances
    tomoyo: annotate ->poll() instances
    sound: annotate ->poll() instances
    acpi: annotate ->poll() instances
    crypto: annotate ->poll() instances
    block: annotate ->poll() instances
    x86: annotate ->poll() instances
    ...

    Linus Torvalds
     

13 Jan, 2018

1 commit

  • Call clear_siginfo to ensure stack allocated siginfos are fully
    initialized before being passed to the signal sending functions.

    This ensures that if there is the kind of confusion documented by
    TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send uninitialized
    data to userspace when the kernel generates a signal with SI_USER but
    the copy to userspace assumes it is a different kind of signal, and
    different fields are initialized.

    This also prepares the way for turning copy_siginfo_to_user
    into a copy_to_user, by removing the need in many cases to perform
    a field by field copy simply to skip the uninitialized fields.
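
The pattern of clearing before filling can be illustrated with a small userspace sketch. `struct demo_info` and both helpers are hypothetical stand-ins, not the kernel's siginfo or clear_siginfo; the point is only that zeroing the whole object first means union members that are never written cannot carry stale stack contents.

```c
#include <string.h>

/* Hypothetical siginfo-like record with a union of per-signal payloads. */
struct demo_info {
    int signo;
    int code;
    union {
        struct { int pid; int uid; } kill;
        struct { void *addr; } fault;
        unsigned char pad[32];
    } u;
};

/* Zero the entire object, including union bytes no caller will set. */
static void demo_clear_info(struct demo_info *info)
{
    memset(info, 0, sizeof(*info));
}

/* Fill only the "kill" view; every other byte stays zero from the clear. */
static void demo_fill_kill(struct demo_info *info, int signo, int pid, int uid)
{
    demo_clear_info(info);
    info->signo = signo;
    info->code = 0;            /* placeholder code value for this sketch */
    info->u.kill.pid = pid;
    info->u.kill.uid = uid;
}
```

Once every byte is known to be initialized, a whole-object copy becomes safe regardless of which union view the reader later assumes.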

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

06 Jan, 2018

5 commits