29 Feb, 2020

1 commit

  • commit edf28f4061afe4c2d9eb1c3323d90e882c1d6800 upstream.

    This reverts commit a97955844807e327df11aa33869009d14d6b7de0.

    Commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage
    in exit_sem()") removes a lock that is needed. This leads to a process
    looping infinitely in exit_sem() and can also lead to a crash. There is
    a reproducer available in [1] and with the commit reverted the issue
    does not reproduce anymore.

    Using the reproducer found in [1], it is fairly easy to reach a point
    where one of the child processes is looping infinitely in exit_sem()
    between the for(;;) and the if (semid == -1) block, while it's trying to
    free its last sem_undo structure, which has already been freed by
    freeary().

    Each sem_undo struct is on two lists: one per semaphore set (list_id)
    and one per process (list_proc). The list_id list tracks undos by
    semaphore set, and the list_proc by process.

    Undo structures are removed either by freeary() or by exit_sem(). The
    freeary function is invoked when the user invokes a syscall to remove a
    semaphore set. During this operation freeary() traverses the list_id
    associated with the semaphore set and removes the undo structures from
    both the list_id and list_proc lists.

    For this case, exit_sem() is called at process exit. Each process
    contains a struct sem_undo_list (referred to as "ulp") which contains
    the head for the list_proc list. When the process exits, exit_sem()
    traverses this list to remove each sem_undo struct. As in freeary(),
    whenever a sem_undo struct is removed from list_proc, it is also removed
    from the list_id list.

    Removing elements from list_id is safe for both exit_sem() and freeary()
    due to sem_lock(). Removing elements from list_proc is not safe;
    freeary() locks &un->ulp->lock when it performs
    list_del_rcu(&un->list_proc) but exit_sem() does not (locking was
    removed by commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list
    lock usage in exit_sem()")).
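    The locking rule described above can be sketched in user space. This is
    only an illustration, not the kernel code: the list helpers, the
    atomic_flag-based spin lock, and the names undo_attach() and
    undo_unlink() are hypothetical stand-ins. The point it shows is that
    every path that unlinks an entry from the per-process list takes the
    same per-process lock first.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modeled on list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *h) { h->prev = h->next = h; }
static int list_empty(const struct list_node *h) { return h->next == h; }

static void list_add(struct list_node *n, struct list_node *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void list_del(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

/* Hypothetical per-process undo list with its own lock (the "ulp"). */
struct undo_list {
    atomic_flag lock;
    struct list_node list_proc;
};

/* Hypothetical undo entry that sits on two lists at once. */
struct undo {
    struct undo_list *ulp;
    struct list_node list_proc; /* per-process list */
    struct list_node list_id;   /* per-semaphore-set list */
    int semid;
};

static void undo_list_init(struct undo_list *ulp)
{
    atomic_flag_clear(&ulp->lock);
    list_init(&ulp->list_proc);
}

static void ulp_lock(struct undo_list *ulp)
{
    while (atomic_flag_test_and_set_explicit(&ulp->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void ulp_unlock(struct undo_list *ulp)
{
    atomic_flag_clear_explicit(&ulp->lock, memory_order_release);
}

static void undo_attach(struct undo *un, struct undo_list *ulp,
                        struct list_node *set_list, int semid)
{
    un->ulp = ulp;
    un->semid = semid;
    ulp_lock(ulp);
    list_add(&un->list_proc, &ulp->list_proc);
    ulp_unlock(ulp);
    list_add(&un->list_id, set_list);
}

/* Used by both the "set removed" and the "process exits" path. */
static void undo_unlink(struct undo *un)
{
    un->semid = -1;
    ulp_lock(un->ulp);          /* the lock the reverted commit had dropped */
    list_del(&un->list_proc);
    ulp_unlock(un->ulp);
    list_del(&un->list_id);     /* assumed to be covered by the set's own lock */
}
```

    In this sketch two callers that unlink different entries of the same
    per-process list serialize on the lock, so neither can observe a
    half-updated next pointer.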

    This can result in the following situation while executing the
    reproducer [1] : Consider a child process in exit_sem() and the parent
    in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)).

    - The list_proc for the child contains the last two undo structs A and
    B (the rest have been removed either by exit_sem() or freeary()).

    - The semid for A is 1 and semid for B is 2.

    - exit_sem() removes A and at the same time freeary() removes B.

    - Since A and B have different semids, sem_lock() will acquire different
    locks for each process and both can proceed.

    The bug is that they remove A and B from the same list_proc at the same
    time because only freeary() acquires the ulp lock. When exit_sem()
    removes A it makes ulp->list_proc.next point at B, and at the same
    time freeary() removes B, setting B->semid = -1.

    At the next iteration of the for(;;) loop exit_sem() will try to remove B.

    The only way to break from for(;;) is for (&un->list_proc ==
    &ulp->list_proc) to be true, which it is not. Then exit_sem() will check
    whether B->semid == -1, which it is, and will continue looping in for(;;)
    until the memory for B is reallocated and the value at B->semid is
    changed.

    At that point, exit_sem() will crash attempting to unlink B from the
    lists (this can be easily triggered by running the reproducer [1] a
    second time).

    To prove this scenario instrumentation was added to keep information
    about each sem_undo (un) struct that is removed per process and per
    semaphore set (sma).

    CPU0                                  CPU1
    [caller holds sem_lock(sma for A)]    ...
    freeary()                             exit_sem()
    ...                                   ...
    ...                                   sem_lock(sma for B)
    spin_lock(A->ulp->lock)               ...
    list_del_rcu(un_A->list_proc)         list_del_rcu(un_B->list_proc)

    Undo structures A and B have different semids, so both sem_lock()
    operations proceed. However, they belong to the same list_proc list and
    they are removed at the same time. This results in ulp->list_proc.next
    pointing to the address of B, which is already removed.

    After reverting commit a97955844807 ("ipc,sem: remove uneeded
    sem_undo_list lock usage in exit_sem()") the issue was no longer
    reproducible.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1694779

    Link: http://lkml.kernel.org/r/20191211191318.11860-1-ioanna-maria.alifieraki@canonical.com
    Fixes: a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()")
    Signed-off-by: Ioanna Alifieraki
    Acked-by: Manfred Spraul
    Acked-by: Herton R. Krzesinski
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Joel Fernandes (Google)
    Cc: Davidlohr Bueso
    Cc: Jay Vosburgh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ioanna Alifieraki
     

11 Feb, 2020

1 commit

  • commit 889b331724c82c11e15ba0a60979cf7bded0a26c upstream.

    There is a use of uninitialized memory in msgctl_down() because msqid64
    in ksys_msgctl() hasn't been initialized. The local msqid64 is created in
    ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64
    is never initialized before msgctl_down() checks msqid64->msg_qbytes.
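    The general shape of such a fix can be sketched as follows. This is a
    user-space illustration with hypothetical names (queue_params,
    ctl_down(), ctl_entry(), CMD_RMID, CMD_SET), not the actual patch: the
    caller zeroes its local structure before handing it down, so the callee
    never reads indeterminate bytes even for commands that do not fill it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parameter block, standing in for the msqid64 local. */
struct queue_params {
    unsigned long qbytes;
    unsigned int uid;
    unsigned int mode;
};

enum { CMD_RMID = 0, CMD_SET = 1 }; /* hypothetical command codes */

/* Callee: inspects p->qbytes regardless of the command. */
static unsigned long ctl_down(int cmd, const struct queue_params *p)
{
    (void)cmd;
    return p->qbytes;
}

/* Caller: only CMD_SET copies user data into the local. */
static unsigned long ctl_entry(int cmd, const struct queue_params *user)
{
    struct queue_params local;

    /* Zero the local first, so every field has a defined value. */
    memset(&local, 0, sizeof(local));

    if (cmd == CMD_SET && user != NULL)
        local = *user;

    return ctl_down(cmd, &local);
}
```

    With the memset() in place, ctl_entry(CMD_RMID, NULL) hands the callee
    a fully defined structure; without it, the read of qbytes would touch
    uninitialized stack memory.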

    KUMSAN (KernelUninitializedMemorySanitizer, a new error detection tool)
    reports:

    ==================================================================
    BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
    Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022

    CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    dump_stack+0x75/0xae
    __kumsan_report+0x17c/0x3e6
    kumsan_report+0xe/0x20
    msgctl_down+0x94/0x300
    ksys_msgctl.constprop.14+0xef/0x260
    do_syscall_64+0x7e/0x1f0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x4400e9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
    R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000

    The buggy address belongs to the page:
    page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
    flags: 0x100000000000000()
    raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
    raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
    page dumped because: kumsan: bad access detected
    ==================================================================

    Syzkaller reproducer:
    msgctl$IPC_RMID(0x0, 0x0)

    C reproducer:
    // autogenerated by syzkaller (https://github.com/google/syzkaller)

    int main(void)
    {
            syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
            syscall(__NR_msgctl, 0, 0, 0);
            return 0;
    }

    [natechancellor@gmail.com: adjust indentation in ksys_msgctl]
    Link: https://github.com/ClangBuiltLinux/linux/issues/829
    Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
    Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
    Signed-off-by: Lu Shuaibing
    Signed-off-by: Nathan Chancellor
    Suggested-by: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: NeilBrown
    From: Andrew Morton
    Subject: ipc/msg.c: consolidate all xxxctl_down() functions

    Each line here overflows 80 cols by exactly one character. Delete one tab
    per line to fix.

    Cc: Shaohua Li
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Lu Shuaibing
     

26 Sep, 2019

3 commits

  • CONFIG_PROVE_RCU_LIST requires list_for_each_entry_rcu() to pass a lockdep
    expression if using srcu or locking for protection. It can only check
    regular RCU protection, all other protection needs to be passed as lockdep
    expression.

    Link: http://lkml.kernel.org/r/20190830231817.76862-2-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Cc: Arnd Bergmann
    Cc: Bjorn Helgaas
    Cc: Catalin Marinas
    Cc: "Gustavo A. R. Silva"
    Cc: Jonathan Derrick
    Cc: Keith Busch
    Cc: Lorenzo Pieralisi
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     
  • Null pointers were assigned to local variables in a few cases as exception
    handling. The jump target “out” was then used even where the branches of
    an if statement had no meaningful data processing left to perform. Use an
    additional jump target for calling dev_kfree_skb() directly.

    Also return directly after error conditions are detected when this
    function implementation needs no extra clean-up.
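    The control-flow pattern described here can be sketched in plain C. This
    is a hypothetical user-space example (build_message() and the free_buf
    label are made-up names, and free() stands in for dev_kfree_skb()):
    return directly while nothing needs cleaning up, and jump to a dedicated
    label once a resource has been allocated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of src, or NULL if src is missing or too long. */
static char *build_message(const char *src, size_t max_len)
{
    char *buf;
    size_t len;

    if (src == NULL)
        return NULL;            /* nothing allocated yet: return directly */

    len = strlen(src);
    buf = malloc(max_len + 1);
    if (buf == NULL)
        return NULL;            /* still nothing to clean up */

    if (len > max_len)
        goto free_buf;          /* buffer exists: use the cleanup label */

    memcpy(buf, src, len + 1);
    return buf;

free_buf:
    free(buf);
    return NULL;
}
```

    Compared with a single shared "out" label, no pointer has to be set to
    NULL just to make a common cleanup path safe.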

    Link: http://lkml.kernel.org/r/592ef10e-0b69-72d0-9789-fc48f638fdfd@web.de
    Signed-off-by: Markus Elfring
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     
  • dev_kfree_skb() performs input parameter validation, thus the test around
    the call is not needed.

    This issue was detected by using the Coccinelle software.

    Link: http://lkml.kernel.org/r/07477187-63e5-cc80-34c1-32dd16b38e12@web.de
    Signed-off-by: Markus Elfring
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     

19 Sep, 2019

1 commit

  • Pull vfs mount API infrastructure updates from Al Viro:
    "Infrastructure bits of mount API conversions.

    The rest is more of per-filesystem updates and that will happen
    in separate pull requests"

    * 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mtd: Provide fs_context-aware mount_mtd() replacement
    vfs: Create fs_context-aware mount_bdev() replacement
    new helper: get_tree_keyed()
    vfs: set fs_context::user_ns for reconfigure

    Linus Torvalds
     

08 Sep, 2019

1 commit

  • Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
    to a commit from my y2038 series in linux-5.1, as I missed the custom
    sys_ipc() wrapper that sparc64 uses in place of the generic version that
    I patched.

    The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
    now do not allow being called with the IPC_64 flag any more, resulting
    in a -EINVAL error when they don't recognize the command.

    Instead, the correct way to do this now is to call the internal
    ksys_old_{sem,shm,msg}ctl() functions to select the API version.

    As we generally move towards these functions anyway, change all of
    sparc_ipc() to consistently use those in place of the sys_*() versions,
    and move the required ksys_*() declarations into linux/syscalls.h

    The IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
    errors when ipc is disabled.
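    The difference between the two entry points can be illustrated with a
    small user-space model. Everything here is a sketch with hypothetical
    names (SK_IPC_64, strict_ctl(), old_style_ctl()); it only assumes that
    the version flag is a bit OR-ed into the command, which the old-style
    entry point strips before dispatching while the strict one does not.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the command codes and the version flag. */
#define SK_IPC_RMID 0
#define SK_IPC_SET  1
#define SK_IPC_STAT 2
#define SK_IPC_OLD  0
#define SK_IPC_64   0x0100

/* Strip the version flag from *cmd and report which API version was asked for. */
static int parse_version(int *cmd)
{
    if (*cmd & SK_IPC_64) {
        *cmd &= ~SK_IPC_64;
        return SK_IPC_64;
    }
    return SK_IPC_OLD;
}

/* Strict entry point: the command must match exactly, flags are not accepted. */
static int strict_ctl(int cmd)
{
    switch (cmd) {
    case SK_IPC_RMID:
    case SK_IPC_SET:
    case SK_IPC_STAT:
        return 0;
    default:
        return -EINVAL;
    }
}

/* Old-style entry point: selects the version first, then dispatches. */
static int old_style_ctl(int cmd, int *version)
{
    *version = parse_version(&cmd);
    return strict_ctl(cmd);
}
```

    In this model a caller that still passes the flag gets -EINVAL from
    strict_ctl() but succeeds through old_style_ctl(), which mirrors the
    failure mode the commit message describes.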

    Reported-by: Matt Turner
    Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
    Cc: stable@vger.kernel.org
    Tested-by: Matt Turner
    Tested-by: Anatoly Pugachev
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

06 Sep, 2019

1 commit


20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate the user supplied value between an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as
    minimum and maximum allowed value.

    On sysctl handler declaration, in every source file there are some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
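    A minimal user-space sketch of the idea follows. The macro names
    (SK_ZERO, SK_ONE, SK_INT_MAX), the ctl_entry struct, and set_minmax()
    are assumptions made for illustration, not the kernel's definitions:
    one shared constant array supplies the range boundaries, and each table
    entry points into it instead of at its own private copy.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One shared array of the most common boundary values. */
static const int shared_vals[] = { 0, 1, INT_MAX };

/* Hypothetical accessor macros pointing into the shared array. */
#define SK_ZERO    ((const void *)&shared_vals[0])
#define SK_ONE     ((const void *)&shared_vals[1])
#define SK_INT_MAX ((const void *)&shared_vals[2])

/* Simplified table entry: extra1 is the minimum, extra2 the maximum. */
struct ctl_entry {
    int *data;
    const void *extra1;
    const void *extra2;
};

/* Store val only if it lies within [*extra1, *extra2]; NULL means unbounded. */
static int set_minmax(const struct ctl_entry *e, int val)
{
    const int *min = e->extra1;
    const int *max = e->extra2;

    if ((min != NULL && val < *min) || (max != NULL && val > *max))
        return -1;

    *e->data = val;
    return 0;
}
```

    An entry declared as { &flag, SK_ZERO, SK_ONE } then behaves like a
    boolean knob without any per-file zero or one variable.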

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                          old     new   delta
    sysctl_vals                     -      12     +12
    __kstrtab_sysctl_vals           -      12     +12
    max                            14      10      -4
    int_max                        16       -     -16
    one                            68       -     -68
    zero                          128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

17 Jul, 2019

1 commit

  • Andreas Christoforou reported:

    UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
    9 * 2305843009213693951 cannot be represented in type 'long int'
    ...
    Call Trace:
    mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
    evict+0x472/0x8c0 fs/inode.c:558
    iput_final fs/inode.c:1547 [inline]
    iput+0x51d/0x8c0 fs/inode.c:1573
    mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
    mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
    vfs_mkobj+0x39e/0x580 fs/namei.c:2892
    prepare_open ipc/mqueue.c:731 [inline]
    do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771

    Which could be triggered by:

    struct mq_attr attr = {
            .mq_flags = 0,
            .mq_maxmsg = 9,
            .mq_msgsize = 0x1fffffffffffffff,
            .mq_curmsgs = 0,
    };

    if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
            perror("mq_open");

    mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
    preparing to return -EINVAL. During the cleanup, it calls
    mqueue_evict_inode() which performed resource usage tracking math for
    updating "user", before checking if there was a valid "user" at all
    (which would indicate that the calculations would be sane). Instead,
    delay this check to after seeing a valid "user".

    The overflow was real, but the results went unused, so while the flaw is
    harmless, it's noisy for kernel fuzzers, so just fix it by moving the
    calculation under the non-NULL "user" where it actually gets used.
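    The reordering can be sketched like this. It is a hypothetical
    user-space model (user_acct and evict_accounting() are made-up names,
    and unsigned arithmetic is used so the sketch itself has no undefined
    behaviour): the size calculation only happens once a valid accounting
    object is known to exist.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-user accounting record. */
struct user_acct {
    unsigned long mq_bytes;
};

/*
 * Undo the accounting for a queue being torn down.
 * Returns 1 if accounting was adjusted, 0 if there was nothing to undo.
 */
static int evict_accounting(struct user_acct *user,
                            unsigned long maxmsg, unsigned long msgsize)
{
    unsigned long bytes;

    if (user == NULL)
        return 0;       /* creation failed early: sizes were never validated */

    /* Only reached for queues whose sizes passed validation at creation. */
    bytes = maxmsg * msgsize;
    user->mq_bytes -= bytes;
    return 1;
}
```

    With the check first, the rejected giant message size from the
    reproducer never enters the multiplication at all.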

    Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
    Signed-off-by: Kees Cook
    Reported-by: Andreas Christoforou
    Acked-by: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

26 May, 2019

1 commit


24 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file is released under gnu general public licence version 2 or
    at your option any later version see the file copying for more
    details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kate Stewart
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190520071857.941092988@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

6 commits

  • For ipcmni_extend mode, the sequence number space is only 7 bits. So
    the chance of id reuse is relatively high compared with the non-extended
    mode.

    To alleviate this id reuse problem, this patch enables cyclic allocation
    for the index to the radix tree (idx). The disadvantage is that this
    can cause a slight slow-down of the fast path, as the radix tree could
    be higher than necessary.

    To limit the radix tree height, I have chosen the following limits:
    1) The cycling is done over in_use*1.5.
    2) At least, the cycling is done over
       "normal" ipcmni mode: RADIX_TREE_MAP_SIZE elements
       "ipcmni_extended": 4096 elements

    Result:
    - for normal mode:
    No change for 4095 active objects until the 3rd level
    is added without cyclic allocation.

    For a 2-level radix tree compared to a 1-level radix tree, I have
    observed < 1% performance impact.
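    The two limits listed above amount to a small calculation, sketched
    here under stated assumptions: cyclic_limit() is a hypothetical helper,
    the final clamp to max_ids is an added assumption, and the value 64
    used in the usage note is only an illustrative minimum, not a claim
    about RADIX_TREE_MAP_SIZE on any particular configuration.

```c
#include <assert.h>

/*
 * Upper bound of the index range that cyclic allocation walks through.
 *   in_use    - number of currently active objects
 *   min_cycle - minimum range to cycle over (mode dependent)
 *   max_ids   - hard limit on the number of indices (assumed clamp)
 */
static int cyclic_limit(int in_use, int min_cycle, int max_ids)
{
    int limit = in_use + in_use / 2;    /* in_use * 1.5, rounded down */

    if (limit < min_cycle)
        limit = min_cycle;
    if (limit > max_ids)
        limit = max_ids;
    return limit;
}
```

    For example, with a minimum of 64 and a hard limit of 32768, ten active
    objects cycle over 64 indices, while 100 active objects cycle over 150.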

    Notes:
    1) Normal "x=semget();y=semget();" is unaffected: Then the idx
    is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
    is used.

    2) The -1% happens in a microbenchmark after this situation:
    x=semget();
    for(i=0;i<
    Acked-by: Waiman Long
    Cc: "Luis R. Rodriguez"
    Cc: Kees Cook
    Cc: Jonathan Corbet
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: "Eric W . Biederman"
    Cc: Takashi Iwai
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Rewrite, based on the patch from Waiman Long:

    The mixing in of a sequence number into the IPC IDs is probably to avoid
    ID reuse in userspace as much as possible. With ipcmni_extend mode, the
    number of usable sequence numbers is greatly reduced leading to higher
    chance of ID reuse.

    To address this issue, we need to conserve the sequence number space as
    much as possible. Right now, the sequence number is incremented for
    every new ID created. In reality, we only need to increment the
    sequence number when new allocated ID is not greater than the last one
    allocated. It is in such case that the new ID may collide with an
    existing one. This is being done irrespective of the ipcmni mode.
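    The rule in this paragraph can be written down as a few lines of C.
    This is a sketch with hypothetical names (id_state, assign_seq()) and an
    assumed wrap-around policy, not the kernel implementation: the sequence
    number only advances when the newly allocated index does not exceed the
    previous one.

```c
#include <assert.h>

/* Hypothetical allocator state. */
struct id_state {
    int last_idx;           /* index handed out most recently, -1 if none */
    unsigned int seq;       /* current sequence number */
    unsigned int seq_limit; /* assumed: seq wraps to 0 on reaching this */
};

/* Return the sequence number to pair with a freshly allocated index. */
static unsigned int assign_seq(struct id_state *s, int idx)
{
    /*
     * An index that is not above the last one may have been used
     * before, so only then is a new sequence number needed.
     */
    if (idx <= s->last_idx) {
        s->seq++;
        if (s->seq >= s->seq_limit)
            s->seq = 0;
    }
    s->last_idx = idx;
    return s->seq;
}
```

    Allocating indices 0, 1, 2 in a row therefore consumes no sequence
    numbers; only falling back to an index at or below the last one does.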

    In order to avoid any races, the index is first allocated and then the
    pointer is replaced.

    Changes compared to the initial patch:
    - Handle failures from idr_alloc().
    - Avoid that concurrent operations can see the wrong sequence number.
    (This is achieved by using idr_replace()).
    - IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
    ipcmni_seq_shift().
    - IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().

    Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
    Signed-off-by: Manfred Spraul
    Signed-off-by: Waiman Long
    Suggested-by: Matthew Wilcox
    Acked-by: Waiman Long
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: "Eric W . Biederman"
    Cc: Jonathan Corbet
    Cc: Kees Cook
    Cc: "Luis R. Rodriguez"
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The maximum number of unique System V IPC identifiers was limited to
    32k. That limit should be big enough for most use cases.

    However, there are some users out there requesting more, especially
    those that are migrating from Solaris which uses 24 bits for unique
    identifiers. To satisfy the need of those users, a new boot time kernel
    option "ipcmni_extend" is added to extend the IPCMNI value to 16M. This
    is a 512X increase which should be big enough for users out there that
    need a large number of unique IPC identifiers.

    The use of this new option will change the pattern of the IPC
    identifiers returned by functions like shmget(2). An application that
    depends on such pattern may not work properly. So it should only be
    used if the users really need more than 32k of unique IPC numbers.

    This new option does have the side effect of reducing the maximum number
    of unique sequence numbers from 64k down to 128. So it is a trade-off.
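    The trade-off follows from simple bit arithmetic, sketched below. The
    helper names are hypothetical, and the sketch assumes an identifier is
    a non-negative 31-bit value made of a sequence number shifted above the
    index, with 15 index bits for 32k entries and 24 index bits for 16M
    entries.

```c
#include <assert.h>
#include <limits.h>

/* Combine a sequence number and an index into one identifier. */
static int build_id(unsigned int seq, unsigned int idx, unsigned int shift)
{
    return (int)((seq << shift) + idx);
}

/* Recover the index part of an identifier. */
static unsigned int id_to_idx(int id, unsigned int shift)
{
    return (unsigned int)id & ((1U << shift) - 1U);
}

/* Recover the sequence part of an identifier. */
static unsigned int id_to_seq(int id, unsigned int shift)
{
    return (unsigned int)id >> shift;
}

/* Largest sequence value that still fits in a non-negative int. */
static unsigned int seq_max(unsigned int shift)
{
    return (unsigned int)INT_MAX >> shift;
}
```

    Under these assumptions seq_max(15) is 65535 and seq_max(24) is 127,
    which matches the "64k down to 128" figure quoted above.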

    The computation of a new IPC id is not done in the performance critical
    path. So a little bit of additional overhead shouldn't have any real
    performance impact.

    Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Manfred Spraul
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: "Eric W . Biederman"
    Cc: Jonathan Corbet
    Cc: Kees Cook
    Cc: "Luis R. Rodriguez"
    Cc: Matthew Wilcox
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue:
    improve performance of send/recv"). However, consuming a msg in
    msg_get() remains logarithmic (still being better than the case before
    of course). By applying well-known techniques to cache pointers we can
    have the node with the highest priority in O(1), which is especially nice
    for the rt cases. Furthermore, some callers can call msg_get() in a
    loop.

    A new msg_tree_erase() helper is also added to encapsulate the tree
    removal and node_cache game. Passes ltp mq testcases.
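    The caching technique can be shown on a much simpler structure. The
    sketch below uses a plain unbalanced binary search tree rather than an
    rbtree, and all names (prio_tree, tree_insert(), tree_max(),
    tree_erase_max()) are hypothetical. It keeps a pointer to the rightmost
    node up to date on insert and erase, so the highest priority is
    available without walking the tree.

```c
#include <assert.h>
#include <stddef.h>

struct prio_node {
    int prio;
    struct prio_node *left, *right, *parent;
};

struct prio_tree {
    struct prio_node *root;
    struct prio_node *rightmost;    /* cached highest-priority node */
};

static void tree_insert(struct prio_tree *t, struct prio_node *n)
{
    struct prio_node **link = &t->root, *parent = NULL;
    int is_rightmost = 1;

    n->left = n->right = NULL;
    while (*link != NULL) {
        parent = *link;
        if (n->prio > parent->prio) {
            link = &parent->right;
        } else {
            link = &parent->left;
            is_rightmost = 0;       /* took a left turn: not the maximum */
        }
    }
    n->parent = parent;
    *link = n;
    if (is_rightmost)
        t->rightmost = n;
}

/* O(1): no descent needed. */
static struct prio_node *tree_max(const struct prio_tree *t)
{
    return t->rightmost;
}

/* Remove the cached maximum and repair the cache. */
static void tree_erase_max(struct prio_tree *t)
{
    struct prio_node *n = t->rightmost, *child, *next;

    if (n == NULL)
        return;
    child = n->left;                /* the rightmost node has no right child */
    if (child != NULL) {
        next = child;               /* new maximum: rightmost of left subtree */
        while (next->right != NULL)
            next = next->right;
    } else {
        next = n->parent;           /* otherwise the parent becomes the maximum */
    }
    if (n->parent != NULL)
        n->parent->right = child;
    else
        t->root = child;
    if (child != NULL)
        child->parent = n->parent;
    t->rightmost = next;
}
```

    Only the erase path still has to walk part of the tree to find the next
    maximum; reading the current maximum is a single pointer load.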

    Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • We already store the current task for the new waiter before calling
    wq_sleep() in both send and recv paths. Trivially remove the redundant
    assignment.

    Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • msgctl10 of ltp triggers the following lockup when CONFIG_KASAN is
    enabled on large memory SMP systems: page initialization can take a
    long time if msgctl10 requests a huge block of memory, and this blocks
    the RCU scheduler, so release the CPU actively.

    After adding schedule() in free_msg(), free_msg() can no longer be called
    while holding a spinlock, so add each msg to a tmp list and free it
    outside the spinlock.
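    The "detach under the lock, free outside it" pattern can be sketched in
    user space as follows. The queue type, the atomic_flag-based lock, and
    the function names are hypothetical stand-ins; the point is only that
    the critical section shrinks to a pointer swap, and the potentially
    slow freeing runs with the lock released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct msg {
    struct msg *next;
};

struct queue {
    atomic_flag lock;       /* stand-in for a spinlock */
    struct msg *head;
};

static void queue_init(struct queue *q)
{
    atomic_flag_clear(&q->lock);
    q->head = NULL;
}

static void queue_lock(struct queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;
}

static void queue_unlock(struct queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Allocate one message and link it in; returns 0 on success. */
static int queue_push(struct queue *q)
{
    struct msg *m = malloc(sizeof(*m));

    if (m == NULL)
        return -1;
    queue_lock(q);
    m->next = q->head;
    q->head = m;
    queue_unlock(q);
    return 0;
}

/* Free all messages; returns how many were freed. */
static int queue_drain(struct queue *q)
{
    struct msg *tmp, *m;
    int freed = 0;

    queue_lock(q);
    tmp = q->head;          /* move the whole list to a private tmp list */
    q->head = NULL;
    queue_unlock(q);

    while (tmp != NULL) {   /* slow part runs without the lock held */
        m = tmp;
        tmp = tmp->next;
        free(m);
        freed++;
    }
    return freed;
}
```

    Once the lock is dropped, other callers see an empty queue and cannot
    reach the messages that are still waiting to be freed.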

    rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
    rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
    rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
    msgctl10 R running task 21608 32505 2794 0x00000082
    Call Trace:
    preempt_schedule_irq+0x4c/0xb0
    retint_kernel+0x1b/0x2d
    RIP: 0010:__is_insn_slot_addr+0xfb/0x250
    Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
    RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
    RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
    RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
    R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
    R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
    kernel_text_address+0xc1/0x100
    __kernel_text_address+0xe/0x30
    unwind_get_return_address+0x2f/0x50
    __save_stack_trace+0x92/0x100
    create_object+0x380/0x650
    __kmalloc+0x14c/0x2b0
    load_msg+0x38/0x1a0
    do_msgsnd+0x19e/0xcf0
    do_syscall_64+0x117/0x400
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
    rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
    msgctl10 R running task 21608 32170 32155 0x00000082
    Call Trace:
    preempt_schedule_irq+0x4c/0xb0
    retint_kernel+0x1b/0x2d
    RIP: 0010:lock_acquire+0x4d/0x340
    Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
    RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
    RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
    RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
    R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
    is_bpf_text_address+0x32/0xe0
    kernel_text_address+0xec/0x100
    __kernel_text_address+0xe/0x30
    unwind_get_return_address+0x2f/0x50
    __save_stack_trace+0x92/0x100
    save_stack+0x32/0xb0
    __kasan_slab_free+0x130/0x180
    kfree+0xfa/0x2d0
    free_msg+0x24/0x50
    do_msgrcv+0x508/0xe60
    do_syscall_64+0x117/0x400
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Davidlohr said:
    "So after releasing the lock, the msg rbtree/list is empty and new
    calls will not see those in the newly populated tmp_msg list, and
    therefore they cannot access the delayed msg freeing pointers, which
    is good. Also the fact that the node_cache is now freed before the
    actual messages seems to be harmless as this is wanted for
    msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
    info->lock the thing is freed anyway so it should not change things"

    Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
    Signed-off-by: Li RongQing
    Signed-off-by: Zhang Yu
    Reviewed-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Rongqing
     

08 May, 2019

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

    2) Add fib_sync_mem to control the amount of dirty memory we allow to
    queue up between synchronize RCU calls, from David Ahern.

    3) Make flow classifier more lockless, from Vlad Buslov.

    4) Add PHY downshift support to aquantia driver, from Heiner
    Kallweit.

    5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
    contention on SLAB spinlocks in heavy RPC workloads.

    6) Partial GSO offload support in XFRM, from Boris Pismenny.

    7) Add fast link down support to ethtool, from Heiner Kallweit.

    8) Use siphash for IP ID generator, from Eric Dumazet.

    9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
    entries, from David Ahern.

    10) Move skb->xmit_more into a per-cpu variable, from Florian
    Westphal.

    11) Improve eBPF verifier speed and increase maximum program size,
    from Alexei Starovoitov.

    12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
    spinlocks. From Neil Brown.

    13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

    14) Improve link partner cap detection in generic PHY code, from
    Heiner Kallweit.

    15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
    Maguire.

    16) Remove SKB list implementation assumptions in SCTP, yours truly.

    17) Various cleanups, optimizations, and simplifications in r8169
    driver. From Heiner Kallweit.

    18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

    19) Switch PHY drivers over to use dynamic feature detection, from
    Heiner Kallweit.

    20) Support flow steering without masking in dpaa2-eth, from Ioana
    Ciocoi.

    21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
    Pirko.

    22) Increase the strict parsing of current and future netlink
    attributes, also export such policies to userspace. From Johannes
    Berg.

    23) Allow DSA tag drivers to be modular, from Andrew Lunn.

    24) Remove legacy DSA probing support, also from Andrew Lunn.

    25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
    Haabendal.

    26) Add a generic tracepoint for TX queue timeouts to ease debugging,
    from Cong Wang.

    27) More indirect call optimizations, from Paolo Abeni"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
    cxgb4: Fix error path in cxgb4_init_module
    net: phy: improve pause mode reporting in phy_print_status
    dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
    net: macb: Change interrupt and napi enable order in open
    net: ll_temac: Improve error message on error IRQ
    net/sched: remove block pointer from common offload structure
    net: ethernet: support of_get_mac_address new ERR_PTR error
    net: usb: smsc: fix warning reported by kbuild test robot
    staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
    net: dsa: support of_get_mac_address new ERR_PTR error
    net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
    vrf: sit mtu should not be updated when vrf netdev is the link
    net: dsa: Fix error cleanup path in dsa_init_module
    l2tp: Fix possible NULL pointer dereference
    taprio: add null check on sched_nest to avoid potential null pointer dereference
    net: mvpp2: cls: fix less than zero check on a u32 variable
    net_sched: sch_fq: handle non connected flows
    net_sched: sch_fq: do not assume EDT packets are ordered
    net: hns3: use devm_kcalloc when allocating desc_cb
    net: hns3: some cleanup for struct hns3_enet_ring
    ...

    Linus Torvalds
     

02 May, 2019

1 commit


08 Apr, 2019

1 commit

  • This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
    bucket pointer to lock the hash chain for that bucket.

    The benefits of a bit spin_lock are:
    - no need to allocate a separate array of locks.
    - no need to have a configuration option to guide the
    choice of the size of this array
    - locking cost is often a single test-and-set in a cache line
    that will have to be loaded anyway. When inserting at, or removing
    from, the head of the chain, the unlock is free - writing the new
    address in the bucket head implicitly clears the lock bit.
    For __rhashtable_insert_fast() we ensure this always happens
    when adding a new key.
    - even when locking costs 2 updates (lock and unlock), they are
    in a cacheline that needs to be read anyway.

    The cost of using a bit spin_lock is a little bit of code complexity,
    which I think is quite manageable.

    Bit spin_locks are sometimes inappropriate because they are not fair -
    if multiple CPUs repeatedly contend for the same lock, one CPU can
    easily be starved. This is not a credible situation with rhashtable.
    Multiple CPUs may want to repeatedly add or remove objects, but they
    will typically do so at different buckets, so they will attempt to
    acquire different locks.

    As we have more bit-locks than we previously had spinlocks (by at
    least a factor of two) we can expect slightly less contention to
    go with the slightly better cache behavior and reduced memory
    consumption.

    To enhance type checking, a new struct is introduced to represent the
    pointer plus lock-bit that is stored in the bucket-table. This is
    "struct rhash_lock_head" and is empty. A pointer to this needs to be
    cast to either an unsigned long, or a "struct rhash_head *" to be
    useful. Variables of this type are most often called "bkt".

    Previously "pprev" would sometimes point to a bucket, and sometimes a
    ->next pointer in an rhash_head. As these are now different types,
    pprev is NULL when it would have pointed to the bucket. In that case,
    'bkt' is used, together with the correct locking protocol.
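
    The locking scheme described above can be modelled in user space. The
    following is a minimal sketch using C11 atomics, not the rhashtable
    code itself: the names bucket_t, bkt_lock(), bkt_unlock(), bkt_head()
    and bkt_insert_head() are hypothetical, and it assumes nodes are at
    least 4-byte aligned so that bit 1 of a node address is always zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model: one word holds the chain head pointer plus a lock bit. */
#define BKT_LOCK_BIT ((uintptr_t)1 << 1)   /* BIT(1), as in the description */

struct node {
    struct node *next;
    int key;
};

typedef _Atomic uintptr_t bucket_t;

/* Spin until this caller is the one that sets the lock bit. */
static void bkt_lock(bucket_t *bkt)
{
    while (atomic_fetch_or_explicit(bkt, BKT_LOCK_BIT,
                                    memory_order_acquire) & BKT_LOCK_BIT)
        ; /* busy-wait: another thread holds the bit */
}

/* Ordinary unlock: clear the lock bit and keep the pointer bits. */
static void bkt_unlock(bucket_t *bkt)
{
    atomic_fetch_and_explicit(bkt, ~BKT_LOCK_BIT, memory_order_release);
}

/* Return the chain head with the lock bit masked off. */
static struct node *bkt_head(bucket_t *bkt)
{
    uintptr_t v = atomic_load_explicit(bkt, memory_order_acquire);

    return (struct node *)(v & ~BKT_LOCK_BIT);
}

/*
 * Insert at the head of the chain. Storing the new, aligned address
 * overwrites the lock bit with zero, so no separate unlock is needed.
 */
static void bkt_insert_head(bucket_t *bkt, struct node *n)
{
    bkt_lock(bkt);
    n->next = (struct node *)(atomic_load_explicit(bkt, memory_order_relaxed)
                              & ~BKT_LOCK_BIT);
    atomic_store_explicit(bkt, (uintptr_t)n, memory_order_release);
}
```

    A removal from the middle of a chain would instead call bkt_lock(),
    edit the ->next pointers, and finish with bkt_unlock(), which is the
    "2 updates" case mentioned above.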

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     

13 Mar, 2019

1 commit

  • Pull vfs mount infrastructure updates from Al Viro:
    "The rest of core infrastructure; no new syscalls in that pile, but the
    old parts are switched to new infrastructure. At that point
    conversions of individual filesystems can happen independently; some
    are done here (afs, cgroup, procfs, etc.), there's also a large series
    outside of that pile dealing with NFS (quite a bit of option-parsing
    stuff is getting used there - it's one of the most convoluted
    filesystems in terms of mount-related logics), but NFS bits are the
    next cycle fodder.

    It got seriously simplified since the last cycle; documentation is
    probably the weakest bit at the moment - I considered dropping the
    commit introducing Documentation/filesystems/mount_api.txt (cutting
    the size increase by quarter ;-), but decided that it would be better
    to fix it up after -rc1 instead.

    That pile allows to do followup work in independent branches, which
    should make life much easier for the next cycle. fs/super.c size
    increase is unpleasant; there's a followup series that allows to
    shrink it considerably, but I decided to leave that until the next
    cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
    afs: Use fs_context to pass parameters over automount
    afs: Add fs_context support
    vfs: Add some logging to the core users of the fs_context log
    vfs: Implement logging through fs_context
    vfs: Provide documentation for new mount API
    vfs: Remove kern_mount_data()
    hugetlbfs: Convert to fs_context
    cpuset: Use fs_context
    kernfs, sysfs, cgroup, intel_rdt: Support fs_context
    cgroup: store a reference to cgroup_ns into cgroup_fs_context
    cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
    cgroup_do_mount(): massage calling conventions
    cgroup: stash cgroup_root reference into cgroup_fs_context
    cgroup2: switch to option-by-option parsing
    cgroup1: switch to option-by-option parsing
    cgroup: take options parsing into ->parse_monolithic()
    cgroup: fold cgroup1_mount() into cgroup1_get_tree()
    cgroup: start switching to fs_context
    ipc: Convert mqueue fs to fs_context
    proc: Add fs_context support to procfs
    ...

    Linus Torvalds
     

08 Mar, 2019

2 commits

  • Use kvzalloc() instead of kvmalloc() and memset().

    Also, make use of the struct_size() helper instead of the open-coded
    version in order to avoid any potential type mistakes.

    This code was detected with the help of Coccinelle.
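
    As an illustration of the pattern, here is a user-space sketch, not
    the patched kernel code. calloc() stands in for kvzalloc(), and the
    hypothetical helper flex_size() mimics what struct_size() provides:
    an overflow-checked "header plus N trailing elements" size that
    saturates to SIZE_MAX so the allocation fails instead of being too
    small. struct undo_like and undo_like_alloc() are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structure ending in a flexible array member. */
struct undo_like {
    int nsems;
    short adj[];
};

/* header + elem * count, saturating to SIZE_MAX on overflow. */
static size_t flex_size(size_t header, size_t elem, size_t count)
{
    if (elem != 0 && count > (SIZE_MAX - header) / elem)
        return SIZE_MAX;
    return header + elem * count;
}

/* Allocate a zeroed object in one step instead of malloc() + memset(). */
static struct undo_like *undo_like_alloc(int nsems)
{
    struct undo_like *u;
    size_t size;

    if (nsems < 0)
        return NULL;
    size = flex_size(sizeof(*u), sizeof(u->adj[0]), (size_t)nsems);
    u = calloc(1, size);
    if (u)
        u->nsems = nsems;
    return u;
}
```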

    Link: http://lkml.kernel.org/r/20190131214221.GA28930@embeddedor
    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • There is a plan to build the kernel with -Wimplicit-fallthrough and this
    place in the code produced a warning (W=1).

    This commit removes the following warning:

    ipc/sem.c:1683:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
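
    For readers unfamiliar with the warning, the sketch below shows the
    general shape of such a fix; it is not the code from ipc/sem.c. The
    enum and the function classify() are hypothetical. At its default
    level, GCC's -Wimplicit-fallthrough accepts a "fall through" comment
    placed directly before the next case label as a marker that the
    fall-through is intended.

```c
#include <assert.h>

/* Hypothetical commands: the first one is a superset of the second. */
enum cmd { CMD_INFO_ANY, CMD_INFO, CMD_OTHER };

static int classify(enum cmd c)
{
    int flags = 0;

    switch (c) {
    case CMD_INFO_ANY:
        flags |= 1;
        /* fall through */
    case CMD_INFO:
        flags |= 2;
        break;
    default:
        flags = -1;
        break;
    }
    return flags;
}
```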

    Link: http://lkml.kernel.org/r/20190114203608.18218-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Malaterre
     

28 Feb, 2019

1 commit

  • Convert the mqueue filesystem to use the filesystem context stuff.

    Notes:

    (1) The relevant ipc namespace is selected when the context is
    initialised (and it defaults to the current task's ipc namespace).
    The caller can override this before calling vfs_get_tree().

    (2) Rather than simply calling kern_mount_data(), mq_init_ns() and
    mq_internal_mount() create a context, adjust it and then do the rest
    of the mount procedure.

    (3) The lazy mqueue mounting on creation of a new namespace is retained
    from a previous patch, but the avoidance of sget() if no superblock
    yet exists is reverted and the superblock is again keyed on the
    namespace pointer.

    Yes, there was a performance gain in not searching the superblock
    hash, but it's only paid once per ipc namespace - and only if someone
    uses mqueue within that namespace, so I'm not sure it's worth it,
    especially as calling sget() allows avoidance of recursion.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

07 Feb, 2019

1 commit

  • A lot of system calls that pass a time_t somewhere have an implementation
    using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
    been reworked so that this implementation can now be used on 32-bit
    architectures as well.

    The missing step is to redefine them using the regular SYSCALL_DEFINEx()
    to get them out of the compat namespace and make it possible to build them
    on 32-bit architectures.

    Any system call that ends in 'time' gets a '32' suffix on its name for
    that version, while the others get a '_time32' suffix, to distinguish
    them from the normal version, which takes a 64-bit time argument in the
    future.
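
    The naming rule can be written down as a small helper. This is only
    an illustration of the rule as stated here; time32_name() is a
    hypothetical function and the names passed to it are examples.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the name of the 32-bit time_t variant of a system call:
 * names ending in "time" get "32", all others get "_time32".
 * Returns 0 on success, -1 if the result does not fit in 'out'.
 */
static int time32_name(const char *name, char *out, size_t outlen)
{
    size_t len = strlen(name);
    const char *suffix;
    int n;

    if (len >= 4 && strcmp(name + len - 4, "time") == 0)
        suffix = "32";
    else
        suffix = "_time32";

    n = snprintf(out, outlen, "%s%s", name, suffix);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```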

    In this step, only 64-bit architectures are changed; doing this rename
    first lets us avoid touching the 32-bit architectures twice.

    Acked-by: Catalin Marinas
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

26 Jan, 2019

1 commit

  • The behavior of these system calls is slightly different between
    architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
    symbol. Most architectures that implement the split IPC syscalls don't set
    that symbol and only get the modern version, but alpha, arm, microblaze,
    mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.

    For the architectures that so far only implement sys_ipc(), i.e. m68k,
    mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
    when adding the split syscalls, so we need to distinguish between the
    two groups of architectures.

    The method I picked for this distinction is to have a separate system call
    entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
    does not. The system call tables of the five architectures are changed
    accordingly.

    As an additional benefit, we no longer need the configuration specific
    definition for ipc_parse_version(), it always does the same thing now,
    but simply won't get called on architectures with the modern interface.

    A small downside is that on architectures that do set
    ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
    that are never called. They only add a few bytes of bloat, so it seems
    better to keep them compared to adding yet another Kconfig symbol.
    I considered adding new syscall numbers for the IPC_64 variants for
    consistency, but decided against that for now.
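
    The split between the two entry points can be sketched as follows.
    This is a simplified user-space model, not the kernel source:
    parse_version() mirrors what ipc_parse_version() is described as
    doing, while old_ctl_version() and new_ctl_version() are hypothetical
    stand-ins for sys_old_*ctl() and sys_*ctl().

```c
#include <assert.h>

#define IPC_OLD 0        /* old structure layout */
#define IPC_64  0x0100   /* flag ORed into cmd to request the new layout */

/* Strip the IPC_64 flag from *cmd and report which layout was requested. */
static int parse_version(int *cmd)
{
    if (*cmd & IPC_64) {
        *cmd ^= IPC_64;
        return IPC_64;
    }
    return IPC_OLD;
}

/* Model of sys_old_*ctl(): the caller chooses the layout via the flag. */
static int old_ctl_version(int *cmd)
{
    return parse_version(cmd);
}

/* Model of sys_*ctl(): always the modern layout, cmd is left untouched. */
static int new_ctl_version(int *cmd)
{
    (void)cmd;
    return IPC_64;
}
```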

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

18 Jan, 2019

1 commit

  • The sys_ipc() and compat_ksys_ipc() functions are meant to only
    be used from the system call table, not called by another function.

    Introduce ksys_*() interfaces for this purpose, as we have done
    for many other system calls.

    Link: https://lore.kernel.org/lkml/20190116131527.2071570-3-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Heiko Carstens
    [heiko.carstens@de.ibm.com: compile fix for !CONFIG_COMPAT]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Arnd Bergmann
     

31 Oct, 2018

2 commits

  • For SysV semaphores, the semmni value is the last part of the 4-element
    sem number array. To make semmni behave in a similar way to msgmni and
    shmmni, we can't directly use the _minmax handler. Instead, a special sem
    specific handler is added to check the last argument to make sure that it
    is limited to the [0, IPCMNI] range. An error will be returned if this is
    not the case.
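
    A minimal sketch of such a check, under stated assumptions: the
    function sem_ctls_write() is hypothetical, the choice of -ERANGE as
    the error code is an assumption, and restoring the previous semmni
    on failure is one possible policy rather than a description of the
    actual handler.

```c
#include <assert.h>
#include <errno.h>

#define IPCMNI   32768   /* the 32k limit discussed in this series */
#define SEM_CTLS 4       /* the 4-element sem array; semmni is the last */

/*
 * Store four new values, then validate only the last one (semmni).
 * On a range error the old semmni is put back and an error returned.
 */
static int sem_ctls_write(int ctls[SEM_CTLS], const int newvals[SEM_CTLS])
{
    int old_semmni = ctls[SEM_CTLS - 1];
    int i;

    for (i = 0; i < SEM_CTLS; i++)
        ctls[i] = newvals[i];

    if (ctls[SEM_CTLS - 1] < 0 || ctls[SEM_CTLS - 1] > IPCMNI) {
        ctls[SEM_CTLS - 1] = old_semmni;
        return -ERANGE;   /* assumed error code */
    }
    return 0;
}
```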

    Link: http://lkml.kernel.org/r/1536352137-12003-3-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Davidlohr Bueso
    Cc: "Eric W. Biederman"
    Cc: Jonathan Corbet
    Cc: Kees Cook
    Cc: Luis R. Rodriguez
    Cc: Matthew Wilcox
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Patch series "ipc: IPCMNI limit check for *mni & increase that limit", v9.

    The sysctl parameters msgmni, shmmni and semmni have an inherent limit of
    IPCMNI (32k). However, users may not be aware of that because they can
    write a value much higher than that without getting any error or
    notification. Reading the parameters back will show the newly written
    values which are not real.

    The real IPCMNI limit is now enforced to make sure that users won't put in
    an unrealistic value. The first 2 patches enforce the limits.

    There are also users out there requesting an increase in the IPCMNI value.
    The last 2 patches attempt to do that by using a boot kernel parameter
    "ipcmni_extend" to increase the IPCMNI limit from 32k to 8M if the users
    really want the extended value.

    This patch (of 4):

    A user can write arbitrary integer values to msgmni and shmmni sysctl
    parameters without getting an error, but the actual limit is really IPCMNI
    (32k). This can mislead users as they think they can get a value that is
    not real.

    The right limits are now set for msgmni and shmmni so that the users will
    become aware if they set a value outside of the acceptable range.

    Link: http://lkml.kernel.org/r/1536352137-12003-2-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Luis R. Rodriguez
    Reviewed-by: Davidlohr Bueso
    Cc: Kees Cook
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: "Eric W. Biederman"
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

26 Oct, 2018

1 commit

  • Pull timekeeping updates from Thomas Gleixner:
    "The timers and timekeeping department provides:

    - Another large y2038 update with further preparations for providing
    the y2038 safe timespecs closer to the syscalls.

    - An overhaul of the SHCMT clocksource driver

    - SPDX license identifier updates

    - Small cleanups and fixes all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
    tick/sched : Remove redundant cpu_online() check
    clocksource/drivers/dw_apb: Add reset control
    clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
    clocksource/drivers: Unify the names to timer-* format
    clocksource/drivers/sh_cmt: Add R-Car gen3 support
    dt-bindings: timer: renesas: cmt: document R-Car gen3 support
    clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
    clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
    clocksource/drivers/sh_cmt: Fixup for 64-bit machines
    clocksource/drivers/sh_tmu: Convert to SPDX identifiers
    clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
    clocksource/drivers/sh_cmt: Convert to SPDX identifiers
    clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
    clocksource: Convert to using %pOFn instead of device_node.name
    tick/broadcast: Remove redundant check
    RISC-V: Request newstat syscalls
    y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
    y2038: socket: Change recvmmsg to use __kernel_timespec
    y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
    y2038: utimes: Rework #ifdef guards for compat syscalls
    ...

    Linus Torvalds
     

24 Oct, 2018

1 commit

  • …iederm/user-namespace

    Pull siginfo updates from Eric Biederman:
    "I have been slowly sorting out siginfo and this is the culmination of
    that work.

    The primary result is in several ways the signal infrastructure has
    been made less error prone. The code has been updated so that manually
    specifying SEND_SIG_FORCED is never necessary. The conversion to the
    new siginfo sending functions is now complete, which makes it
    difficult to send a signal without filling in the proper siginfo
    fields.

    At the tail end of the patchset comes the optimization of decreasing
    the size of struct siginfo in the kernel from 128 bytes to about 48
    bytes on 64bit. The fundamental observation that enables this is by
    definition none of the known ways to use struct siginfo uses the extra
    bytes.

    This comes at the cost of a small user space observable difference.
    For the rare case of siginfo being injected into the kernel only what
    can be copied into kernel_siginfo is delivered to the destination, the
    rest of the bytes are set to 0. For cases where the signal and the
    si_code are known this is safe, because we know those bytes are not
    used. For cases where the signal and si_code combination is unknown
    the bits that won't fit into struct kernel_siginfo are tested to
    verify they are zero, and the send fails if they are not.

    I made an extensive search through userspace code and I could not find
    anything that would break because of the above change. If it turns out
    I did break something it will take just the revert of a single change
    to restore kernel_siginfo to the same size as userspace siginfo.

    Testing did reveal dependencies on preferring the signo passed to
    sigqueueinfo over si->signo, so bit the bullet and added the
    complexity necessary to handle that case.

    Testing also revealed bad things can happen if a negative signal
    number is passed into the system calls. Something no sane application
    will do but something a malicious program or a fuzzer might do. So I
    have fixed the code that performs the bounds checks to ensure negative
    signal numbers are handled"
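
    The "unknown combination" case described above can be sketched with
    plain byte buffers. This is a simplified model, not the kernel's
    implementation: copy_unknown_siginfo() is a hypothetical name, the
    48-byte figure is taken from the "about 48 bytes" in the text, and
    -E2BIG as the failure code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define USER_SIGINFO_SIZE   128   /* userspace struct siginfo */
#define KERNEL_SIGINFO_SIZE 48    /* assumed size of the in-kernel form */

/*
 * Accept a userspace siginfo only if every byte that does not fit into
 * the smaller in-kernel form is zero; otherwise refuse to send.
 */
static int copy_unknown_siginfo(unsigned char *kinfo,
                                const unsigned char *uinfo)
{
    size_t i;

    for (i = KERNEL_SIGINFO_SIZE; i < USER_SIGINFO_SIZE; i++) {
        if (uinfo[i] != 0)
            return -E2BIG;   /* assumed error code */
    }
    memcpy(kinfo, uinfo, KERNEL_SIGINFO_SIZE);
    return 0;
}
```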

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
    signal: Guard against negative signal numbers in copy_siginfo_from_user32
    signal: Guard against negative signal numbers in copy_siginfo_from_user
    signal: In sigqueueinfo prefer sig not si_signo
    signal: Use a smaller struct siginfo in the kernel
    signal: Distinguish between kernel_siginfo and siginfo
    signal: Introduce copy_siginfo_from_user and use it's return value
    signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
    signal: Fail sigqueueinfo if si_signo != sig
    signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
    signal/unicore32: Use force_sig_fault where appropriate
    signal/unicore32: Generate siginfo in ucs32_notify_die
    signal/unicore32: Use send_sig_fault where appropriate
    signal/arc: Use force_sig_fault where appropriate
    signal/arc: Push siginfo generation into unhandled_exception
    signal/ia64: Use force_sig_fault where appropriate
    signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
    signal/ia64: Use the generic force_sigsegv in setup_frame
    signal/arm/kvm: Use send_sig_mceerr
    signal/arm: Use send_sig_fault where appropriate
    signal/arm: Use force_sig_fault where appropriate
    ...

    Linus Torvalds
     

06 Oct, 2018

1 commit

  • This uses ERR_CAST() instead of an open-coded cast, as it is casting
    across structure pointers, which upsets __randomize_layout:

    ipc/shm.c: In function `shm_lock':
    ipc/shm.c:209:9: note: randstruct: casting between randomized structure pointer types (ssa): `struct shmid_kernel' and `struct kern_ipc_perm'

    return (void *)ipcp;
    ^~~~~~~~~~~~
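
    For context, the error-pointer idiom involved looks roughly like
    this in user-space form. The four helpers below are simplified
    re-implementations modelled on the kernel's ERR_PTR(), PTR_ERR(),
    IS_ERR() and ERR_CAST(); the structures and the functions
    lookup_perm() and shm_like_lock() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Pass an error pointer through as void *, making the intent explicit. */
static inline void *ERR_CAST(const void *ptr)
{
    return (void *)(uintptr_t)ptr;
}

struct perm_like { int id; };
struct shm_like { struct perm_like perm; int segsz; };

/* Hypothetical lookup: only id 7 exists. */
static struct perm_like *lookup_perm(int id)
{
    static struct shm_like seg = { { 7 }, 4096 };

    if (id != 7)
        return ERR_PTR(-EINVAL);
    return &seg.perm;
}

static struct shm_like *shm_like_lock(int id)
{
    struct perm_like *p = lookup_perm(id);

    if (IS_ERR(p))
        return ERR_CAST(p);   /* instead of an open-coded (void *)p */
    /* perm is the first member here; the kernel would use container_of() */
    return (struct shm_like *)p;
}
```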

    Link: http://lkml.kernel.org/r/20180919180722.GA15073@beast
    Fixes: 82061c57ce93 ("ipc: drop ipc_lock()")
    Signed-off-by: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

03 Oct, 2018

1 commit

  • Linus recently observed that if we did not worry about the padding
    member in struct siginfo it is only about 48 bytes, and 48 bytes is
    much nicer than 128 bytes for allocating on the stack and copying
    around in the kernel.

    The obvious thing of only adding the padding when userspace is
    including siginfo.h won't work as there are sigframe definitions in
    the kernel that embed struct siginfo.

    So split siginfo in two: kernel_siginfo and siginfo. The traditional
    name is kept for the userspace definition, while the version that
    is used internally to the kernel and ultimately will not be padded to
    128 bytes is called kernel_siginfo.

    The definition of struct kernel_siginfo I have put in include/linux/signal_types.h

    A set of buildtime checks has been added to verify the two structures have
    the same field offsets.

    To make it easy to verify the change kernel_siginfo retains the same
    size as siginfo. The reduction in size comes in a following change.
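
    Build-time checks of this kind can be expressed with C11
    _Static_assert and offsetof(). The sketch below uses two small,
    hypothetical structures (user_info_like and kernel_info_like) rather
    than the real siginfo definitions, and the CHECK_OFFSET() macro is
    made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unpadded layout, as used inside the kernel. */
struct kernel_info_like {
    int signo;
    int err;
    int code;
    int pid;
};

/* Hypothetical userspace layout: same fields, padded to 128 bytes. */
struct user_info_like {
    union {
        struct {
            int signo;
            int err;
            int code;
            int pid;
        };
        char pad[128];
    };
};

/* Fail the build if a field sits at different offsets in the two types. */
#define CHECK_OFFSET(field)                                          \
    _Static_assert(offsetof(struct user_info_like, field) ==         \
                   offsetof(struct kernel_info_like, field),         \
                   "offset mismatch: " #field)

CHECK_OFFSET(signo);
CHECK_OFFSET(err);
CHECK_OFFSET(code);
CHECK_OFFSET(pid);
```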

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Sep, 2018

1 commit

    When the general ipc_lock() was removed, this case was missed as well,
    which makes the comment around the ipc object validity check bogus. Under
    EIDRM conditions, callers will in turn not see the error and continue
    with the operation.

    Link: http://lkml.kernel.org/r/20180824030920.GD3677@linux-r8p5
    Link: http://lkml.kernel.org/r/20180823024051.GC13343@shao2-debian
    Fixes: 82061c57ce9 ("ipc: drop ipc_lock()")
    Signed-off-by: Davidlohr Bueso
    Reported-by: kernel test robot
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

27 Aug, 2018

1 commit

  • Christoph Hellwig suggested a slightly different path for handling
    backwards compatibility with the 32-bit time_t based system calls:

    Rather than simply reusing the compat_sys_* entry points on 32-bit
    architectures unchanged, we get rid of those entry points and the
    compat_time types by renaming them to something that makes more sense
    on 32-bit architectures (which don't have a compat mode otherwise),
    and then share the entry points under the new name with the 64-bit
    architectures that use them for implementing the compatibility.

    The following types and interfaces are renamed here, and moved
    from linux/compat_time.h to linux/time32.h:

    old                         new
    ---                         ---
    compat_time_t               old_time32_t
    struct compat_timeval       struct old_timeval32
    struct compat_timespec      struct old_timespec32
    struct compat_itimerspec    struct old_itimerspec32
    ns_to_compat_timeval()      ns_to_old_timeval32()
    get_compat_itimerspec64()   get_old_itimerspec32()
    put_compat_itimerspec64()   put_old_itimerspec32()
    compat_get_timespec64()     get_old_timespec32()
    compat_put_timespec64()     put_old_timespec32()

    As we already have aliases in place, this patch addresses only the
    instances that are relevant to the system call interface in particular,
    not those that occur in device drivers and other modules. Those
    will get handled separately, while providing the 64-bit version
    of the respective interfaces.

    I'm not renaming the timex, rusage and itimerval structures, as we are
    still debating what the new interface will look like, and whether we
    will need a replacement at all.

    This also doesn't change the names of the syscall entry points, which can
    be done more easily when we actually switch over the 32-bit architectures
    to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
    SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

    Suggested-by: Christoph Hellwig
    Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

23 Aug, 2018

2 commits

    ipc_getref still has a return value of type "int", matching the atomic_t
    interface of atomic_inc_not_zero()/atomic_add_unless().

    ipc_getref now uses refcount_inc_not_zero, which has a return value of
    type "bool".

    Therefore, update the return code to avoid implicit conversions.
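
    The semantics in question can be modelled in user space with C11
    atomics. This is a sketch of an "increment unless zero" operation
    that returns bool, not the kernel's refcount_t implementation;
    struct obj_like and getref() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj_like {
    atomic_int refcount;
};

/*
 * Take a reference only if the object is still alive (count != 0).
 * Returns true if the reference was taken, false otherwise.
 */
static bool getref(struct obj_like *o)
{
    int old = atomic_load_explicit(&o->refcount, memory_order_relaxed);

    while (old != 0) {
        /* On failure 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak_explicit(&o->refcount, &old,
                                                  old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}
```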

    Link: http://lkml.kernel.org/r/20180712185241.4017-13-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The variable names became a mess, thus standardize them again:

    id:  user space id. Called semid, shmid, msgid if the type is known.
         Most functions use "id" already.
    idx: "index" for the idr lookup
         Right now, some functions use lid, ipc_addid() already uses idx as
         the variable name.
    seq: sequence number, to avoid quick collisions of the user space id
    key: user space key, used for the rhash tree
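
    How id, idx and seq relate can be shown with a short sketch. It
    assumes the layout used by kernels of this era, where the user space
    id is the sequence number multiplied by IPCMNI plus the idr index;
    the helper names build_id(), id_to_idx() and id_to_seq() are
    hypothetical.

```c
#include <assert.h>

#define IPCMNI         32768    /* maximum number of ids per namespace */
#define SEQ_MULTIPLIER IPCMNI   /* assumed: one "slot" of ids per seq value */

/* Combine idr index and sequence number into the user space id. */
static int build_id(int idx, int seq)
{
    return SEQ_MULTIPLIER * seq + idx;
}

/* Recover the idr index from a user space id. */
static int id_to_idx(int id)
{
    return id % SEQ_MULTIPLIER;
}

/* Recover the sequence number from a user space id. */
static int id_to_seq(int id)
{
    return id / SEQ_MULTIPLIER;
}
```

    With this layout, reusing the same idx after an object is removed
    yields a different id as long as seq has advanced, which is the
    collision avoidance the "seq" entry refers to.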

    Link: http://lkml.kernel.org/r/20180712185241.4017-12-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul