26 Sep, 2019

2 commits

  • In a few error-handling cases, null pointers were assigned to local
    variables, and the jump target “out” was then used even though the
    branches of the if statement there had no meaningful data processing
    left to perform. Use an additional jump target for calling
    dev_kfree_skb() directly.

    Also return directly after an error condition is detected when this
    function needs no extra clean-up.
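
    The control-flow idea can be sketched in user-space C. This is only an
    illustration under stated assumptions: struct buf, copy_in() and
    setup_notification() are hypothetical names, and malloc()/free() stand in
    for the skb allocation and dev_kfree_skb().

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the notification buffer; not a kernel type. */
struct buf {
    char *data;
    size_t len;
};

/* Hypothetical copy step that can fail, mimicking a failed user copy. */
static int copy_in(char *dst, const char *src, size_t len)
{
    if (src == NULL)
        return -EFAULT;
    memcpy(dst, src, len);
    return 0;
}

/* Returns 0 on success or a negative errno value. */
int setup_notification(const char *cookie, size_t len, struct buf *out)
{
    char *data;
    int ret;

    if (out == NULL || len == 0 || len > 32)
        return -EINVAL;        /* nothing allocated yet: return directly */

    data = malloc(len);
    if (data == NULL)
        return -ENOMEM;        /* still nothing to clean up */

    ret = copy_in(data, cookie, len);
    if (ret)
        goto free_data;        /* dedicated label: only the buffer is owned */

    out->data = data;
    out->len = len;
    return 0;

free_data:
    free(data);
    return ret;
}
```

    Early failures leave the function immediately, and the single label
    releases exactly the one resource that can be held at that point, so no
    pointer has to be reset to NULL just to steer a shared exit path.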

    Link: http://lkml.kernel.org/r/592ef10e-0b69-72d0-9789-fc48f638fdfd@web.de
    Signed-off-by: Markus Elfring
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     
  • dev_kfree_skb() performs input parameter validation itself, so the test
    around the call is not needed.

    This issue was detected by using the Coccinelle software.
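
    Standard C has the same convention: free() is defined to do nothing when
    passed NULL. A minimal user-space analogy, assuming a hypothetical
    struct notify_state (this is not the kernel code):

```c
#include <stdlib.h>

/* Hypothetical holder of an optional buffer. */
struct notify_state {
    char *cookie;
};

void release_cookie(struct notify_state *s)
{
    /* No "if (s->cookie)" guard: free(NULL) is a no-op by definition. */
    free(s->cookie);
    s->cookie = NULL;
}
```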

    Link: http://lkml.kernel.org/r/07477187-63e5-cc80-34c1-32dd16b38e12@web.de
    Signed-off-by: Markus Elfring
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     

06 Sep, 2019

1 commit


20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

17 Jul, 2019

1 commit

  • Andreas Christoforou reported:

    UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
    9 * 2305843009213693951 cannot be represented in type 'long int'
    ...
    Call Trace:
    mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
    evict+0x472/0x8c0 fs/inode.c:558
    iput_final fs/inode.c:1547 [inline]
    iput+0x51d/0x8c0 fs/inode.c:1573
    mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
    mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
    vfs_mkobj+0x39e/0x580 fs/namei.c:2892
    prepare_open ipc/mqueue.c:731 [inline]
    do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771

    Which could be triggered by:

    struct mq_attr attr = {
    .mq_flags = 0,
    .mq_maxmsg = 9,
    .mq_msgsize = 0x1fffffffffffffff,
    .mq_curmsgs = 0,
    };

    if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
    perror("mq_open");

    mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
    preparing to return -EINVAL. During the cleanup, it calls
    mqueue_evict_inode() which performed resource usage tracking math for
    updating "user", before checking if there was a valid "user" at all
    (which would indicate that the calculations would be sane). Instead,
    delay this check to after seeing a valid "user".

    The overflow was real, but the results went unused, so while the flaw is
    harmless, it's noisy for kernel fuzzers, so just fix it by moving the
    calculation under the non-NULL "user" where it actually gets used.
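
    A simplified sketch of the shape of the fix, assuming hypothetical
    struct user_acct and struct queue_info types (the real function tracks
    more state than this):

```c
#include <stddef.h>

/* Hypothetical per-user accounting record. */
struct user_acct {
    unsigned long mq_bytes;
};

/* Hypothetical queue state; user stays NULL if setup failed early. */
struct queue_info {
    struct user_acct *user;
    long maxmsg;
    long msgsize;
};

void evict_queue(struct queue_info *info)
{
    struct user_acct *user = info->user;

    if (user) {
        /*
         * A non-NULL user implies the attributes passed validation,
         * so the product is only computed for sane values.
         */
        unsigned long bytes = (unsigned long)info->maxmsg *
                              (unsigned long)info->msgsize;

        user->mq_bytes -= bytes;
    }
    /* Without a user, the rejected attributes are never multiplied. */
}
```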

    Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
    Signed-off-by: Kees Cook
    Reported-by: Andreas Christoforou
    Acked-by: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

26 May, 2019

1 commit


15 May, 2019

3 commits

  • Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue:
    improve performance of send/recv"). However, consuming a msg in
    msg_get() remains logarithmic (still being better than the case before
    of course). By applying well known techniques to cache pointers we can
    have the node with the highest priority in O(1), which is especially nice
    for the rt cases. Furthermore, some callers can call msg_get() in a
    loop.

    A new msg_tree_erase() helper is also added to encapsulate the tree
    removal and node_cache game. Passes ltp mq testcases.
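
    The caching technique can be sketched with a plain, unbalanced binary
    search tree. This is an assumption-laden illustration: the kernel uses
    its rbtree and keeps one node per priority, whereas struct node,
    tree_insert() and tree_pop_max() below are hypothetical.

```c
#include <stddef.h>

struct node {
    int prio;
    struct node *parent, *left, *right;
};

struct tree {
    struct node *root;
    struct node *rightmost;   /* cached node with the highest priority */
};

void tree_insert(struct tree *t, struct node *n)
{
    struct node **link = &t->root, *parent = NULL;
    int is_rightmost = 1;

    while (*link) {
        parent = *link;
        if (n->prio <= parent->prio) {
            link = &parent->left;
            is_rightmost = 0;     /* went left at least once */
        } else {
            link = &parent->right;
        }
    }
    n->parent = parent;
    n->left = n->right = NULL;
    *link = n;
    if (is_rightmost)
        t->rightmost = n;         /* new maximum: refresh the cache */
}

/* Removes and returns the highest-priority node without searching for it. */
struct node *tree_pop_max(struct tree *t)
{
    struct node *n = t->rightmost, *next;

    if (!n)
        return NULL;

    /* The rightmost node has no right child; splice in its left child. */
    if (n->left)
        n->left->parent = n->parent;
    if (n->parent)
        n->parent->right = n->left;
    else
        t->root = n->left;

    /* New maximum: rightmost node of the left subtree, else the parent. */
    next = n->left;
    if (next) {
        while (next->right)
            next = next->right;
    } else {
        next = n->parent;
    }
    t->rightmost = next;
    return n;
}
```

    Looking up the maximum is a single pointer read; only insertion and
    removal pay to keep the cached pointer current.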

    Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • We already store the current task for the new waiter before calling
    wq_sleep() in both send and recv paths. Trivially remove the redundant
    assignment.

    Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • msgctl10 of ltp triggers the following lockup. When CONFIG_KASAN is
    enabled on large memory SMP systems, page initialization can take a
    long time; if msgctl10 requests a huge block of memory, this blocks
    the rcu scheduler, so release the cpu actively.

    After adding schedule() in free_msg(), free_msg() can no longer be called
    while holding a spinlock, so add the msg to a tmp list and free it
    outside of the spinlock.
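
    The "detach under the lock, free outside it" pattern can be sketched in
    user-space C11. Everything here is a stand-in: struct msg, struct queue
    and queue_drain() are hypothetical, and an atomic_flag plays the role of
    the spinlock.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct msg {
    struct msg *next;
};

struct queue {
    atomic_flag lock;      /* minimal spinlock stand-in */
    struct msg *head;
};

static void q_lock(struct queue *q)
{
    while (atomic_flag_test_and_set(&q->lock))
        ;                  /* spin until the flag was clear */
}

static void q_unlock(struct queue *q)
{
    atomic_flag_clear(&q->lock);
}

/* Frees every queued message and returns how many were freed. */
size_t queue_drain(struct queue *q)
{
    struct msg *tmp, *m;
    size_t n = 0;

    q_lock(q);
    tmp = q->head;         /* detach the whole list while locked */
    q->head = NULL;
    q_unlock(q);

    /* The slow part runs unlocked, where sleeping would be allowed. */
    while ((m = tmp) != NULL) {
        tmp = m->next;
        free(m);
        n++;
    }
    return n;
}
```

    Once the lock is dropped the queue is already empty, so other callers
    cannot reach the messages that are still waiting to be freed.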

    rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
    rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
    rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
    msgctl10 R running task 21608 32505 2794 0x00000082
    Call Trace:
    preempt_schedule_irq+0x4c/0xb0
    retint_kernel+0x1b/0x2d
    RIP: 0010:__is_insn_slot_addr+0xfb/0x250
    Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
    RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
    RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
    RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
    R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
    R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
    kernel_text_address+0xc1/0x100
    __kernel_text_address+0xe/0x30
    unwind_get_return_address+0x2f/0x50
    __save_stack_trace+0x92/0x100
    create_object+0x380/0x650
    __kmalloc+0x14c/0x2b0
    load_msg+0x38/0x1a0
    do_msgsnd+0x19e/0xcf0
    do_syscall_64+0x117/0x400
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
    rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
    msgctl10 R running task 21608 32170 32155 0x00000082
    Call Trace:
    preempt_schedule_irq+0x4c/0xb0
    retint_kernel+0x1b/0x2d
    RIP: 0010:lock_acquire+0x4d/0x340
    Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
    RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
    RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
    RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
    R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
    is_bpf_text_address+0x32/0xe0
    kernel_text_address+0xec/0x100
    __kernel_text_address+0xe/0x30
    unwind_get_return_address+0x2f/0x50
    __save_stack_trace+0x92/0x100
    save_stack+0x32/0xb0
    __kasan_slab_free+0x130/0x180
    kfree+0xfa/0x2d0
    free_msg+0x24/0x50
    do_msgrcv+0x508/0xe60
    do_syscall_64+0x117/0x400
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Davidlohr said:
    "So after releasing the lock, the msg rbtree/list is empty and new
    calls will not see those in the newly populated tmp_msg list, and
    therefore they cannot access the delayed msg freeing pointers, which
    is good. Also the fact that the node_cache is now freed before the
    actual messages seems to be harmless as this is wanted for
    msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
    info->lock the thing is freed anyway so it should not change things"

    Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
    Signed-off-by: Li RongQing
    Signed-off-by: Zhang Yu
    Reviewed-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Rongqing
     

02 May, 2019

1 commit


13 Mar, 2019

1 commit

  • Pull vfs mount infrastructure updates from Al Viro:
    "The rest of core infrastructure; no new syscalls in that pile, but the
    old parts are switched to new infrastructure. At that point
    conversions of individual filesystems can happen independently; some
    are done here (afs, cgroup, procfs, etc.), there's also a large series
    outside of that pile dealing with NFS (quite a bit of option-parsing
    stuff is getting used there - it's one of the most convoluted
    filesystems in terms of mount-related logics), but NFS bits are the
    next cycle fodder.

    It got seriously simplified since the last cycle; documentation is
    probably the weakest bit at the moment - I considered dropping the
    commit introducing Documentation/filesystems/mount_api.txt (cutting
    the size increase by quarter ;-), but decided that it would be better
    to fix it up after -rc1 instead.

    That pile allows to do followup work in independent branches, which
    should make life much easier for the next cycle. fs/super.c size
    increase is unpleasant; there's a followup series that allows to
    shrink it considerably, but I decided to leave that until the next
    cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
    afs: Use fs_context to pass parameters over automount
    afs: Add fs_context support
    vfs: Add some logging to the core users of the fs_context log
    vfs: Implement logging through fs_context
    vfs: Provide documentation for new mount API
    vfs: Remove kern_mount_data()
    hugetlbfs: Convert to fs_context
    cpuset: Use fs_context
    kernfs, sysfs, cgroup, intel_rdt: Support fs_context
    cgroup: store a reference to cgroup_ns into cgroup_fs_context
    cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
    cgroup_do_mount(): massage calling conventions
    cgroup: stash cgroup_root reference into cgroup_fs_context
    cgroup2: switch to option-by-option parsing
    cgroup1: switch to option-by-option parsing
    cgroup: take options parsing into ->parse_monolithic()
    cgroup: fold cgroup1_mount() into cgroup1_get_tree()
    cgroup: start switching to fs_context
    ipc: Convert mqueue fs to fs_context
    proc: Add fs_context support to procfs
    ...

    Linus Torvalds
     

28 Feb, 2019

1 commit

  • Convert the mqueue filesystem to use the filesystem context stuff.

    Notes:

    (1) The relevant ipc namespace is selected when the context is
    initialised (and it defaults to the current task's ipc namespace).
    The caller can override this before calling vfs_get_tree().

    (2) Rather than simply calling kern_mount_data(), mq_init_ns() and
    mq_internal_mount() create a context, adjust it and then do the rest
    of the mount procedure.

    (3) The lazy mqueue mounting on creation of a new namespace is retained
    from a previous patch, but the avoidance of sget() if no superblock
    yet exists is reverted and the superblock is again keyed on the
    namespace pointer.

    Yes, there was a performance gain in not searching the superblock
    hash, but it's only paid once per ipc namespace - and only if someone
    uses mqueue within that namespace, so I'm not sure it's worth it,
    especially as calling sget() allows avoidance of recursion.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

07 Feb, 2019

1 commit

  • A lot of system calls that pass a time_t somewhere have an implementation
    using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
    been reworked so that this implementation can now be used on 32-bit
    architectures as well.

    The missing step is to redefine them using the regular SYSCALL_DEFINEx()
    to get them out of the compat namespace and make it possible to build them
    on 32-bit architectures.

    Any system call that ends in 'time' gets a '32' suffix on its name for
    that version, while the others get a '_time32' suffix, to distinguish
    them from the normal version, which takes a 64-bit time argument in the
    future.

    In this step, only 64-bit architectures are changed; doing this rename
    first lets us avoid touching the 32-bit architectures twice.
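
    The naming rule described above can be written down as a small helper.
    This is purely illustrative: time32_name() is a hypothetical function,
    not something the patch adds.

```c
#include <stdio.h>
#include <string.h>

/*
 * Writes the 32-bit-time variant of a syscall name into out.
 * Returns 0 on success, or -1 if out is too small.
 */
int time32_name(const char *name, char *out, size_t outlen)
{
    size_t len = strlen(name);
    int ends_in_time = len >= 4 && strcmp(name + len - 4, "time") == 0;
    int n = snprintf(out, outlen, "%s%s", name,
                     ends_in_time ? "32" : "_time32");

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

    For example, "clock_gettime" maps to "clock_gettime32", while
    "mq_timedsend" maps to "mq_timedsend_time32".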

    Acked-by: Catalin Marinas
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

26 Oct, 2018

1 commit

  • Pull timekeeping updates from Thomas Gleixner:
    "The timers and timekeeping department provides:

    - Another large y2038 update with further preparations for providing
    the y2038 safe timespecs closer to the syscalls.

    - An overhaul of the SHCMT clocksource driver

    - SPDX license identifier updates

    - Small cleanups and fixes all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
    tick/sched : Remove redundant cpu_online() check
    clocksource/drivers/dw_apb: Add reset control
    clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
    clocksource/drivers: Unify the names to timer-* format
    clocksource/drivers/sh_cmt: Add R-Car gen3 support
    dt-bindings: timer: renesas: cmt: document R-Car gen3 support
    clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
    clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
    clocksource/drivers/sh_cmt: Fixup for 64-bit machines
    clocksource/drivers/sh_tmu: Convert to SPDX identifiers
    clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
    clocksource/drivers/sh_cmt: Convert to SPDX identifiers
    clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
    clocksource: Convert to using %pOFn instead of device_node.name
    tick/broadcast: Remove redundant check
    RISC-V: Request newstat syscalls
    y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
    y2038: socket: Change recvmmsg to use __kernel_timespec
    y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
    y2038: utimes: Rework #ifdef guards for compat syscalls
    ...

    Linus Torvalds
     

03 Oct, 2018

1 commit

  • Linus recently observed that if we did not worry about the padding
    member in struct siginfo it is only about 48 bytes, and 48 bytes is
    much nicer than 128 bytes for allocating on the stack and copying
    around in the kernel.

    The obvious thing of only adding the padding when userspace is
    including siginfo.h won't work as there are sigframe definitions in
    the kernel that embed struct siginfo.

    So split siginfo in two: kernel_siginfo and siginfo. The userspace
    definition keeps the traditional name, while the version that is used
    internally by the kernel, and that ultimately will not be padded to
    128 bytes, is called kernel_siginfo.

    The definition of struct kernel_siginfo I have put in include/signal_types.h

    A set of buildtime checks has been added to verify the two structures have
    the same field offsets.

    To make it easy to verify the change kernel_siginfo retains the same
    size as siginfo. The reduction in size comes in a following change.
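
    Build-time layout checks of this kind can be sketched with C11
    _Static_assert and offsetof. The two structures below are hypothetical
    miniatures, not the real siginfo definitions, and the padding size
    assumes a 4-byte int.

```c
#include <stddef.h>

/* Hypothetical userspace-visible layout. */
struct user_info {
    int signo;
    int err;
    int code;
    char pad[116];
};

/* Hypothetical kernel-internal twin; padding is retained for now. */
struct kernel_info {
    int signo;
    int err;
    int code;
    char pad[116];
};

/* Compilation fails if any shared field moves in only one structure. */
#define CHECK_OFFSET(field)                                  \
    _Static_assert(offsetof(struct user_info, field) ==      \
                   offsetof(struct kernel_info, field),      \
                   "offset mismatch: " #field)

CHECK_OFFSET(signo);
CHECK_OFFSET(err);
CHECK_OFFSET(code);

_Static_assert(sizeof(struct user_info) == sizeof(struct kernel_info),
               "size mismatch");
```

    Because the checks run at compile time, a later size reduction of the
    internal structure would only need the final size assertion relaxed,
    while the offset checks keep guarding the shared fields.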

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 Aug, 2018

1 commit

  • Christoph Hellwig suggested a slightly different path for handling
    backwards compatibility with the 32-bit time_t based system calls:

    Rather than simply reusing the compat_sys_* entry points on 32-bit
    architectures unchanged, we get rid of those entry points and the
    compat_time types by renaming them to something that makes more sense
    on 32-bit architectures (which don't have a compat mode otherwise),
    and then share the entry points under the new name with the 64-bit
    architectures that use them for implementing the compatibility.

    The following types and interfaces are renamed here, and moved
    from linux/compat_time.h to linux/time32.h:

    old                          new
    ---                          ---
    compat_time_t                old_time32_t
    struct compat_timeval        struct old_timeval32
    struct compat_timespec       struct old_timespec32
    struct compat_itimerspec     struct old_itimerspec32
    ns_to_compat_timeval()       ns_to_old_timeval32()
    get_compat_itimerspec64()    get_old_itimerspec32()
    put_compat_itimerspec64()    put_old_itimerspec32()
    compat_get_timespec64()      get_old_timespec32()
    compat_put_timespec64()      put_old_timespec32()

    As we already have aliases in place, this patch addresses only the
    instances that are relevant to the system call interface in particular,
    not those that occur in device drivers and other modules. Those
    will get handled separately, while providing the 64-bit version
    of the respective interfaces.

    I'm not renaming the timex, rusage and itimerval structures, as we are
    still debating what the new interface will look like, and whether we
    will need a replacement at all.

    This also doesn't change the names of the syscall entry points, which can
    be done more easily when we actually switch over the 32-bit architectures
    to use them; at that point we need to change COMPAT_SYSCALL_DEFINEx to
    SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

    Suggested-by: Christoph Hellwig
    Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

20 Apr, 2018

2 commits

  • Three ipc syscalls (mq_timedsend, mq_timedreceive and semtimedop)
    take a timespec argument. After we move 32-bit architectures over to
    using 64-bit time_t based syscalls, we need separate entry points for
    the old 32-bit based interfaces.

    This changes the #ifdef guards for the existing 32-bit compat syscalls
    to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
    enabled on all existing 32-bit architectures.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • This is a preparation for changing over __kernel_timespec to 64-bit
    times, which involves assigning new system call numbers for mq_timedsend(),
    mq_timedreceive() and semtimedop() for compatibility with future y2038
    proof user space.

    The existing ABIs will remain available through compat code.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

25 Mar, 2018

1 commit

  • This reverts commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21.

    Aleksa Sarai writes:
    > [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
    >
    > Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
    > which was introduced by 36735a6a2b5e ("mqueue: switch to on-demand
    > creation of internal mount").
    >
    > Basically, the reproducer boils down to being able to mount mqueue if
    > you create a new user namespace, even if you don't unshare the IPC
    > namespace.
    >
    > Previously this was not possible, and you would get an -EPERM. The mount
    > is the *host* mqueue mount, which is being cached and just returned from
    > mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
    > it was intentional -- since I'm not familiar with mqueue).
    >
    > To me it looks like there is a missing permission check. I've included a
    > patch below that I've compile-tested, and should block the above case.
    > Can someone please tell me if I'm missing something? Is this actually
    > safe?
    >
    > [1]: https://github.com/docker/docker/issues/36674

    The issue is a lot deeper than a missing permission check. sb->s_user_ns
    was improperly set as well. So in addition to the filesystem being
    mounted when it should not be mounted, some things are not allowed that
    should be.

    We are practically at the release of 4.16 and there is no agreement between
    Al Viro and myself on what the code should look like to fix things properly.
    So revert the code to what it was before so that we can take our time
    and discuss this properly.

    Fixes: 36735a6a2b5e ("mqueue: switch to on-demand creation of internal mount")
    Reported-by: Felix Abecassis
    Reported-by: Aleksa Sarai
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Feb, 2018

1 commit

  • Previous behavior added tasks to the work queue using the static_prio
    value instead of the dynamic priority value in prio. This caused RT tasks
    to be added to the work queue in a FIFO manner rather than by priority.
    Normal tasks were handled by priority.

    This fix utilizes the dynamic priority of the task to ensure that both RT
    and normal tasks are added to the work queue in priority order. Utilizing
    the dynamic priority (prio) rather than the base priority (normal_prio)
    was chosen to ensure that if a task had a boosted priority when it was
    added to the work queue, it would be woken sooner to ensure that it
    releases any other locks it may be holding in a more timely manner. It is
    understood that the task could have a lower priority when it wakes than
    when it was added to the queue in this (unlikely) case.
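
    A simplified sketch of priority-ordered insertion, assuming a
    hypothetical singly linked struct waiter list whose head is the next
    task to wake (the kernel's wait queue uses its own list layout):

```c
#include <stddef.h>

struct waiter {
    int prio;               /* lower value = more urgent */
    struct waiter *next;
};

/* Inserts w in sorted position; equal priorities keep arrival order. */
void wq_add_sorted(struct waiter **head, struct waiter *w)
{
    struct waiter **link = head;

    /* Skip every waiter that is at least as urgent as the new one. */
    while (*link && (*link)->prio <= w->prio)
        link = &(*link)->next;

    w->next = *link;
    *link = w;
}
```

    The ordering is only as good as the key: if the value compared here
    does not differ between two waiters, they simply stay in arrival (FIFO)
    order, which matches the behaviour the commit describes for RT tasks
    when static_prio was used.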

    Link: http://lkml.kernel.org/r/1513006652-7014-1-git-send-email-jhaws@sdl.usu.edu
    Signed-off-by: Jonathan Haws
    Reviewed-by: Steven Rostedt (VMware)
    Reviewed-by: Davidlohr Bueso
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Deepa Dinamani
    Cc: Thomas Gleixner
    Cc: Sebastian Andrzej Siewior
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Haws
     

31 Jan, 2018

2 commits

  • Pull mqueue/bpf vfs cleanups from Al Viro:
    "mqueue and bpf go through rather painful and similar contortions to
    create objects in their dentry trees. Provide a primitive for doing
    that without abusing ->mknod(), switch bpf and mqueue to it.

    Another mqueue-related thing that has ended up in that branch is
    on-demand creation of internal mount (based upon the work of Giuseppe
    Scrivano)"

    * 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mqueue: switch to on-demand creation of internal mount
    tidy do_mq_open() up a bit
    mqueue: clean prepare_open() up
    do_mq_open(): move all work prior to dentry_open() into a helper
    mqueue: fold mq_attr_ok() into mqueue_get_inode()
    move dentry_open() calls up into do_mq_open()
    mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata
    bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()
    new primitive: vfs_mkobj()

    Linus Torvalds
     
  • Pull poll annotations from Al Viro:
    "This introduces a __bitwise type for POLL### bitmap, and propagates
    the annotations through the tree. Most of that stuff is as simple as
    'make ->poll() instances return __poll_t and do the same to local
    variables used to hold the future return value'.

    Some of the obvious brainos found in process are fixed (e.g. POLLIN
    misspelled as POLL_IN). At that point the amount of sparse warnings is
    low and most of them are for genuine bugs - e.g. ->poll() instance
    deciding to return -EINVAL instead of a bitmap. I hadn't touched those
    in this series - it's large enough as it is.

    Another problem it has caught was eventpoll() ABI mess; select.c and
    eventpoll.c assumed that corresponding POLL### and EPOLL### were
    equal. That's true for some, but not all of them - EPOLL### are
    arch-independent, but POLL### are not.

    The last commit in this series separates userland POLL### values from
    the (now arch-independent) kernel-side ones, converting between them
    in the few places where they are copied to/from userland. AFAICS, this
    is the least disruptive fix preserving poll(2) ABI and making epoll()
    work on all architectures.

    As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
    it will trigger only on what would've triggered EPOLLWRBAND on other
    architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
    at all on sparc. With this patch they should work consistently on all
    architectures"

    * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    make kernel-side POLL... arch-independent
    eventpoll: no need to mask the result of epi_item_poll() again
    eventpoll: constify struct epoll_event pointers
    debugging printk in sg_poll() uses %x to print POLL... bitmap
    annotate poll(2) guts
    9p: untangle ->poll() mess
    ->si_band gets POLL... bitmap stored into a user-visible long field
    ring_buffer_poll_wait() return value used as return value of ->poll()
    the rest of drivers/*: annotate ->poll() instances
    media: annotate ->poll() instances
    fs: annotate ->poll() instances
    ipc, kernel, mm: annotate ->poll() instances
    net: annotate ->poll() instances
    apparmor: annotate ->poll() instances
    tomoyo: annotate ->poll() instances
    sound: annotate ->poll() instances
    acpi: annotate ->poll() instances
    crypto: annotate ->poll() instances
    block: annotate ->poll() instances
    x86: annotate ->poll() instances
    ...

    Linus Torvalds
     

13 Jan, 2018

1 commit

  • Call clear_siginfo to ensure stack allocated siginfos are fully
    initialized before being passed to the signal sending functions.

    This ensures that if there is the kind of confusion documented by
    TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send uninitialized
    data to userspace when the kernel generates a signal with SI_USER but
    the copy to userspace assumes it is a different kind of signal, and
    different fields are initialized.

    This also prepares the way for turning copy_siginfo_to_user
    into a copy_to_user, by removing the need in many cases to perform
    a field by field copy simply to skip the uninitialized fields.
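
    The zero-first pattern can be sketched with a hypothetical union-bearing
    record. struct sig_record and fill_kill_record() are illustrative
    stand-ins; memset() plays the role of clear_siginfo().

```c
#include <string.h>

/* Hypothetical record whose union members overlap, like siginfo fields. */
struct sig_record {
    int signo;
    int code;
    union {
        struct { int pid; int uid; } kill;
        struct { void *addr; } fault;
        char pad[32];
    } u;
};

void fill_kill_record(struct sig_record *r, int signo, int pid, int uid)
{
    /* Zero everything first so unused union bytes never leak stack data. */
    memset(r, 0, sizeof(*r));

    r->signo = signo;
    r->code = 0;
    r->u.kill.pid = pid;
    r->u.kill.uid = uid;
}
```

    After the call, a reader that interprets the record as a different
    union member sees zeroes rather than leftover stack contents, and the
    whole record can be copied in one block.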

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

06 Jan, 2018

7 commits


28 Nov, 2017

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • This is a pure automated search-and-replace of the internal kernel
    superblock flags.

    The s_flags are now called SB_*, with the names and the values for the
    moment mirroring the MS_* flags that they're equivalent to.

    Note how the MS_xyz flags are the ones passed to the mount system call,
    while the SB_xyz flags are what we then use in sb->s_flags.

    The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
    include/linux/fs.h include/uapi/linux/bfs_fs.h \
    security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
    DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
    POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
    I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
    ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

    Requested-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Sep, 2017

1 commit

  • struct timespec is not y2038 safe. Replace
    all uses of timespec by y2038 safe struct timespec64.

    Even though timespec is used here only to represent timeouts,
    replace it with timespec64; this facilitates verification
    by allowing a y2038 safe kernel image to be built
    that is free of timespec.

    The syscall interfaces themselves are not changed as part
    of the patch. They will be part of a different series.

    Signed-off-by: Deepa Dinamani
    Cc: Paul Moore
    Cc: Richard Guy Briggs
    Reviewed-by: Richard Guy Briggs
    Reviewed-by: Arnd Bergmann
    Acked-by: Paul Moore
    Signed-off-by: Al Viro

    Deepa Dinamani
     

10 Jul, 2017

1 commit

  • The retry logic for netlink_attachskb() inside sys_mq_notify()
    is nasty and vulnerable:

    1) The sock refcnt is already released when retry is needed
    2) The fd is controllable by user-space because we already
    release the file refcnt

    so when we retry and the fd has just been closed by user-space
    during this small window, we end up calling netlink_detachskb()
    on the error path, which releases the sock again; later, when
    user-space closes this socket, a use-after-free could be
    triggered.

    Setting 'sock' to NULL here should be sufficient to fix it.
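
    A toy model of the retry path, under loudly stated assumptions:
    struct sock_ref and notify() are hypothetical, the reference count is a
    plain int, and the "retry" outcome is simulated by a counter rather than
    by a real attach helper.

```c
#include <stddef.h>

/* Hypothetical reference-counted socket. */
struct sock_ref {
    int refs;
};

/*
 * retries: how many times the simulated attach step asks for a retry.
 * fd_open_on_retry: whether the fd lookup still succeeds after a retry.
 * Returns 0 on success (the queue keeps one reference), -1 on failure.
 */
int notify(struct sock_ref *target, int retries, int fd_open_on_retry)
{
    struct sock_ref *sock = NULL;
    int fd_open = 1;
    int ret;

retry:
    if (!fd_open) {           /* fd lookup fails before sock is reassigned */
        ret = -1;
        goto out;
    }
    sock = target;
    sock->refs++;             /* reference taken for this attach attempt */

    if (retries > 0) {
        retries--;
        sock->refs--;         /* the attach step already dropped it */
        sock = NULL;          /* the fix: forget the stale pointer */
        fd_open = fd_open_on_retry;
        goto retry;
    }
    return 0;

out:
    if (sock)
        sock->refs--;         /* error path releases only what is held */
    return ret;
}
```

    Without the assignment to NULL, the error path would decrement the
    count a second time for a reference that was already released.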

    Reported-by: GeneBlue
    Signed-off-by: Cong Wang
    Cc: Andrew Morton
    Cc: Manfred Spraul
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Cong Wang
     

05 Jul, 2017

1 commit


02 Mar, 2017

3 commits


28 Feb, 2017

1 commit