01 Oct, 2020

2 commits

  • [ Upstream commit 76518d3798855242817e8a8ed76b2d72f4415624 ]

    This changes do_io_accounting to use the new exec_update_mutex
    instead of cred_guard_mutex.

    This fixes possible deadlocks when the tracer is accessing
    /proc/$pid/io for instance.

    This should be safe, as the credentials are only used for reading.

    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Bernd Edlinger
     
  • [ Upstream commit 2db9dbf71bf98d02a0bf33e798e5bfd2a9944696 ]

    This changes lock_trace to use the new exec_update_mutex
    instead of cred_guard_mutex.

    This fixes possible deadlocks when the tracer is accessing
    /proc/$pid/stack for instance.

    This should be safe, as the credentials are only used for reading,
    and task->mm is updated on execve under the new exec_update_mutex.

    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Bernd Edlinger
     

17 Jul, 2019

3 commits

  • This fixes two problems reported with the cmdline simplification and
    cleanup last year:

    - the setproctitle() special cases didn't quite match the original
    semantics, and it can be noticeable:

    https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/

    - it could leak an uninitialized byte from the temporary buffer under
    the right (wrong) circumstances:

    https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/

    It rewrites the logic entirely, splitting it into two separate commits
    (and two separate functions) for the two different cases ("unedited
    cmdline" vs "setproctitle() has been used to change the command line").

    * proc-cmdline:
    /proc/<pid>/cmdline: add back the setproctitle() special case
    /proc/<pid>/cmdline: remove all the special cases

    Linus Torvalds
     
  • This makes the setproctitle() special case very explicit indeed, and
    handles it with a separate helper function entirely. In the process, it
    re-instates the original semantics of simply stopping at the first NUL
    character when the original last NUL character is no longer there.

    [ The original semantics can still be seen in mm/util.c: get_cmdline()
    that is limited to a fixed-size buffer ]

    This makes the logic about when we use the string lengths etc much more
    obvious, and makes it easier to see what we do and what the two very
    different cases are.

    Note that even when we allow walking past the end of the argument array
    (because the setproctitle() might have overwritten and overflowed the
    original argv[] strings), we only allow it when it overflows into the
    environment region if it is immediately adjacent.

    [ Fixed for missing 'count' checks noted by Alexey Izbyshev ]

    Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
    Fixes: 5ab827189965 ("fs/proc: simplify and clarify get_mm_cmdline() function")
    Cc: Jakub Jankowski
    Cc: Alexey Dobriyan
    Cc: Alexey Izbyshev
    Signed-off-by: Linus Torvalds

    Linus Torvalds
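
The two cases described in this commit message can be illustrated with a small user-space sketch. This is not the kernel's code: the function name, the flat `region` buffer (argument bytes followed directly by environment bytes) and the `env_adjacent` flag are all hypothetical, chosen only to show the "stop at the first NUL" rule and the restriction on overflowing into the environment region.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model: 'region' holds arg_len argument bytes followed by
 * env_len environment bytes. Returns how many bytes a reader would see.
 */
size_t visible_cmdline_len(const char *region, size_t arg_len,
                           size_t env_len, int env_adjacent)
{
    const char *nul;
    size_t limit;

    if (arg_len == 0)
        return 0;

    /* Unedited cmdline: the final NUL is still in place, show it all. */
    if (region[arg_len - 1] == '\0')
        return arg_len;

    /*
     * setproctitle() case: the final NUL is gone, so stop at the first
     * NUL. Walking into the environment is only allowed if adjacent.
     */
    limit = env_adjacent ? arg_len + env_len : arg_len;
    nul = memchr(region, '\0', limit);
    return nul ? (size_t)(nul - region) : limit;
}
```

In this model an overwritten title that spills past the argument area is shown up to its first NUL only when the environment immediately follows; otherwise it is cut off at the end of the argument area.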
     
  • Start off with a clean slate that only reads exactly from arg_start to
    arg_end, without any oddities. This simplifies the code and in the
    process removes the case that caused us to potentially leak an
    uninitialized byte from the temporary kernel buffer.

    Note that in order to start from scratch with an understandable base,
    this simplifies things _too_ much, and removes all the legacy logic to
    handle setproctitle() having changed the argument strings.

    We'll add back those special cases very differently in the next commit.

    Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
    Fixes: f5b65348fd77 ("proc: fix missing final NUL in get_mm_cmdline() rewrite")
    Cc: Alexey Izbyshev
    Cc: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Linus Torvalds
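
From user space, the clean-slate behaviour described above means that a read of the cmdline file yields the bytes between arg_start and arg_end, that is, a sequence of NUL-terminated strings. A minimal sketch of a consumer follows; the helper name `count_cmdline_args` is hypothetical, and the buffer is assumed to have been read from the file already.

```c
#include <stddef.h>
#include <string.h>

/* Count the NUL-separated strings in a buffer of 'len' bytes. */
size_t count_cmdline_args(const char *buf, size_t len)
{
    size_t count = 0;
    size_t pos = 0;

    while (pos < len) {
        const char *nul = memchr(buf + pos, '\0', len - pos);

        count++;
        if (nul == NULL)
            break;              /* trailing fragment without a NUL */
        pos = (size_t)(nul - buf) + 1;
    }
    return count;
}
```

A trailing fragment without a terminating NUL is still counted as one argument, which keeps the helper usable on truncated reads.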
     

13 Jul, 2019

3 commits

  • Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to
    mem_exclusive cpuset") introduces a heuristic where a potential
    oom-killer victim is skipped if the intersection of the memory nodes
    allowed to the potential victim and to current (the process that
    triggered the OOM) is empty, on the grounds that killing such a victim
    most probably will not help the current allocating process.

    However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the
    heuristic to just decrease the oom_badness scores of such potential
    victim based on the reason that the cpuset of such processes might have
    changed and previously they may have allocated memory on mems where the
    current allocating process can allocate from.

    Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a
    side effect as the oom_badness is also exposed to the user space through
    /proc/[pid]/oom_score, so, readers with different cpusets can read
    different oom_score of the same process.

    Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same
    cpuset") fixed the side effect introduced by 7887a3da753e by moving the
    cpuset intersection back to only oom-killer context and out of
    oom_badness. However the combination of ab290adbaf8f ("oom: make
    oom_unkillable_task() helper function") and 26ebc984913b ("oom:
    /proc/<pid>/oom_score treat kernel thread honestly") unintentionally
    brought back the cpuset intersection check into the oom_badness
    calculation function.

    Besides the cpuset/mempolicy intersection done from oom_badness, the memcg
    oom context is also doing a cpuset/mempolicy intersection, which is quite
    wrong and was caught by syzkaller with the following report:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
    RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
    RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
    RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
    Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
    00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 3c 02 00 0f
    85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
    RSP: 0018:ffff888000127490 EFLAGS: 00010a03
    RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
    RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
    RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
    R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
    R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
    FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    Call Trace:
    oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
    mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
    select_bad_process mm/oom_kill.c:374 [inline]
    out_of_memory mm/oom_kill.c:1088 [inline]
    out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
    mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
    mem_cgroup_oom mm/memcontrol.c:1905 [inline]
    try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
    mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
    mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
    do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
    do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
    wp_huge_pmd mm/memory.c:3793 [inline]
    __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
    handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
    do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
    __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
    do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
    page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
    RIP: 0033:0x400590
    Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
    8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 06 e9 1e 01
    00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
    RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
    RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
    RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
    R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
    Modules linked in:
    ---[ end trace a65689219582ffff ]---
    RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
    RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
    RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
    RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
    Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
    00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 3c 02 00 0f
    85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
    RSP: 0018:ffff888000127490 EFLAGS: 00010a03
    RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
    RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
    RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
    R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
    R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
    FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600

    The fix is to decouple the cpuset/mempolicy intersection check from
    oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
    only done in the global oom context.

    [shakeelb@google.com: change function name and update comment]
    Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Jackson
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
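
The shape of the fix described in this commit can be sketched in a few lines. This is an illustration only, not the kernel's implementation: the enum, the function name and the use of plain 64-bit masks in place of the kernel's node masks are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three callers named in the log. */
enum oom_context { OOM_GLOBAL, OOM_MEMCG, OOM_SCORE_PROCFS };

/*
 * Decide whether a candidate is skipped because its allowed memory
 * nodes do not intersect those of the allocating task. The check is
 * applied only in the global OOM context.
 */
bool skip_for_no_intersection(enum oom_context ctx,
                              uint64_t current_nodes,
                              uint64_t candidate_nodes)
{
    if (ctx != OOM_GLOBAL)
        return false;
    return (current_nodes & candidate_nodes) == 0;
}
```

Keeping the check out of the procfs path also means that, in this model, the value reported for a task does not depend on who reads it.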
     
  • oom_unkillable_task() can be called from three different contexts i.e.
    global OOM, memcg OOM and oom_score procfs interface. At the moment
    oom_unkillable_task() does a task_in_mem_cgroup() check on the given
    process. Since there is no reason to perform task_in_mem_cgroup()
    check for global OOM and oom_score procfs interface, those contexts
    provide NULL memcg and skip the task_in_mem_cgroup() check. However
    for memcg OOM context, the oom_unkillable_task() is always called from
    mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
    redundant and effectively dead code. So, just remove the
    task_in_mem_cgroup() check altogether.

    Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Signed-off-by: Tetsuo Handa
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Jackson
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Do not remain stuck forever if something goes wrong. Using a killable
    lock permits cleanup of stuck tasks and simplifies investigation.

    It seems ->d_revalidate() could return any error (except ECHILD) to abort
    validation and pass error as result of lookup sequence.

    [akpm@linux-foundation.org: fix proc_map_files_lookup() return value, per Andrei]
    Link: http://lkml.kernel.org/r/156007493995.3335.9595044802115356911.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Roman Gushchin
    Reviewed-by: Cyrill Gorcunov
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Michal Koutný
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

09 Jul, 2019

1 commit

  • Pull x86 AVX512 status update from Ingo Molnar:
    "This adds a new ABI that the main scheduler probably doesn't want to
    deal with but HPC job schedulers might want to use: the
    AVX512_elapsed_ms field in the new /proc/<pid>/arch_status task status
    file, which allows the user-space job scheduler to cluster such tasks,
    to avoid turbo frequency drops"

    * 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Documentation/filesystems/proc.txt: Add arch_status file
    x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status
    proc: Add /proc/<pid>/arch_status

    Linus Torvalds
     

27 Jun, 2019

1 commit


12 Jun, 2019

1 commit

  • Exposing architecture specific per process information is useful for
    various reasons. An example is the AVX512 usage on x86 which is important
    for task placement for power/performance optimizations.

    Adding this information to the existing /proc/pid/status file would be the
    obvious choice, but it has been agreed that an explicit arch_status file
    is better at separating the generic and architecture specific information.

    [ tglx: Massage changelog ]

    Signed-off-by: Aubrey Li
    Signed-off-by: Thomas Gleixner
    Acked-by: Andrew Morton
    Cc: peterz@infradead.org
    Cc: hpa@zytor.com
    Cc: ak@linux.intel.com
    Cc: tim.c.chen@linux.intel.com
    Cc: dave.hansen@intel.com
    Cc: arjan@linux.intel.com
    Cc: adobriyan@gmail.com
    Cc: aubrey.li@intel.com
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Arjan van de Ven
    Cc: Alexey Dobriyan
    Cc: Linux API
    Link: https://lkml.kernel.org/r/20190606012236.9391-1-aubrey.li@linux.intel.com

    Aubrey Li
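
A user-space job scheduler would consume the new file by looking for the field it cares about. The sketch below assumes the file content has already been read into a string and that the field appears as a `AVX512_elapsed_ms:` label followed by a decimal number; that layout and the helper name `parse_avx512_elapsed_ms` are assumptions for illustration, not something stated in the log.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the AVX512_elapsed_ms value from arch_status text.
 * Returns 1 and stores the value on success, 0 if the field is absent
 * or malformed.
 */
int parse_avx512_elapsed_ms(const char *text, long *elapsed_ms)
{
    static const char key[] = "AVX512_elapsed_ms:";
    const char *p = strstr(text, key);

    if (p == NULL)
        return 0;
    /* %ld skips the whitespace between the label and the number. */
    return sscanf(p + sizeof(key) - 1, "%ld", elapsed_ms) == 1;
}
```

Parsing by label rather than by line position keeps the reader working if further architecture-specific fields are added to the file later.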
     

15 May, 2019

1 commit

  • The name clear_all_latency_tracing is misleading: the function only
    clears the per-task latency_record[], and there is another function,
    clear_global_latency_tracing, which clears the global latency_record[]
    buffer.

    Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
    Signed-off-by: Lin Feng
    Cc: Alexey Dobriyan
    Cc: Fabian Frederick
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lin Feng
     

08 May, 2019

1 commit

  • Pull selinux updates from Paul Moore:
    "We've got a few SELinux patches for the v5.2 merge window, the
    highlights are below:

    - Add LSM hooks, and the SELinux implementation, for proper labeling
    of kernfs. While we are only including the SELinux implementation
    here, the rest of the LSM folks have given the hooks a thumbs-up.

    - Update the SELinux mdp (Make Dummy Policy) script to actually work
    on a modern system.

    - Disallow userspace to change the LSM credentials via
    /proc/self/attr when the task's credentials are already overridden.

    The change was made in procfs because all the LSM folks agreed this
    was the Right Thing To Do and duplicating it across each LSM was
    going to be annoying"

    * tag 'selinux-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    proc: prevent changes to overridden credentials
    selinux: Check address length before reading address family
    kernfs: fix xattr name handling in LSM helpers
    MAINTAINERS: update SELinux file patterns
    selinux: avoid uninitialized variable warning
    selinux: remove useless assignments
    LSM: lsm_hooks.h - fix missing colon in docstring
    selinux: Make selinux_kernfs_init_security static
    kernfs: initialize security of newly created nodes
    selinux: implement the kernfs_init_security hook
    LSM: add new hook for kernfs node initialization
    kernfs: use simple_xattrs for security attributes
    selinux: try security xattr after genfs for kernfs filesystems
    kernfs: do not alloc iattrs in kernfs_xattr_get
    kernfs: clean up struct kernfs_iattrs
    scripts/selinux: fix build
    selinux: use kernel linux/socket.h for genheaders and mdp
    scripts/selinux: modernize mdp

    Linus Torvalds
     

29 Apr, 2019

2 commits

  • Prevent userspace from changing the /proc/PID/attr values if the
    task's credentials are currently overridden. This not only makes sense
    conceptually, it also prevents some really bizarre error cases caused
    when trying to commit credentials to a task with overridden
    credentials.

    Cc:
    Reported-by: "chengjian (D)"
    Signed-off-by: Paul Moore
    Acked-by: John Johansen
    Acked-by: James Morris
    Acked-by: Casey Schaufler

    Paul Moore
     
  • Replace the indirection through struct stack_trace with an invocation of
    the storage array based interface.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexey Dobriyan
    Reviewed-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Steven Rostedt
    Cc: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: David Rientjes
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: kasan-dev@googlegroups.com
    Cc: Mike Rapoport
    Cc: Akinobu Mita
    Cc: Christoph Hellwig
    Cc: iommu@lists.linux-foundation.org
    Cc: Robin Murphy
    Cc: Marek Szyprowski
    Cc: Johannes Thumshirn
    Cc: David Sterba
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: linux-btrfs@vger.kernel.org
    Cc: dm-devel@redhat.com
    Cc: Mike Snitzer
    Cc: Alasdair Kergon
    Cc: Daniel Vetter
    Cc: intel-gfx@lists.freedesktop.org
    Cc: Joonas Lahtinen
    Cc: Maarten Lankhorst
    Cc: dri-devel@lists.freedesktop.org
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Tom Zanussi
    Cc: Miroslav Benes
    Cc: linux-arch@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190425094801.589304463@linutronix.de

    Thomas Gleixner
     

15 Apr, 2019

1 commit

  • No architecture terminates the stack trace with ULONG_MAX anymore. The
    consumer terminates on the first zero entry or at the number of entries, so
    no functional change.

    Remove the cruft.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Alexander Potapenko
    Link: https://lkml.kernel.org/r/20190410103644.853527514@linutronix.de

    Thomas Gleixner
     

04 Apr, 2019

1 commit

  • task_current_syscall() has a single user that passes in 6 for maxargs, which
    is the maximum number of arguments that syscall_get_arguments() can
    retrieve for a system call. Instead of passing in a number of arguments to
    grab, just get 6 arguments. The args argument even specifies that it's an
    array of 6 items.

    This will also allow changing syscall_get_arguments() to not get a variable
    number of arguments, but always grab 6.

    Linus also suggested not passing in a bunch of arguments to
    task_current_syscall() but to instead pass in a pointer to a structure, and
    just fill the structure. struct seccomp_data has almost all the parameters
    that are needed except for the stack pointer (sp). As seccomp_data is part of
    uapi, and I'm afraid to change it, a new structure was created
    "syscall_info", which includes seccomp_data and adds the "sp" field.

    Link: http://lkml.kernel.org/r/20161107213233.466776454@goodmis.org

    Cc: Andy Lutomirski
    Cc: Alexey Dobriyan
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (Red Hat)
     

17 Mar, 2019

1 commit

  • Pull pidfd system call from Christian Brauner:
    "This introduces the ability to use file descriptors from /proc/<pid>/
    as stable handles on struct pid. Even if a pid is recycled the handle
    will not change. For a start these fds can be used to send signals to
    the processes they refer to.

    With the ability to use /proc/<pid> fds as stable handles on struct
    pid we can fix a long-standing issue where after a process has exited
    its pid can be reused by another process. If a caller sends a signal
    to a reused pid it will end up signaling the wrong process.

    With this patchset we enable a variety of use cases. One obvious
    example is that we can now safely delegate an important part of
    process management - sending signals - to processes other than the
    parent of a given process by sending file descriptors around via scm
    rights and not fearing that the given process will have been recycled
    in the meantime. It also allows for easy testing whether a given
    process is still alive or not by sending signal 0 to a pidfd which is
    quite handy.

    There has been some interest in this feature e.g. from systems
    management (systemd, glibc) and container managers. I have requested
    and gotten comments from glibc to make sure that this syscall is
    suitable for their needs as well. In the future I expect it to take on
    most other pid-based signal syscalls. But such features are left for
    the future once they are needed.

    This has been sitting in linux-next for quite a while and has not
    caused any issues. It comes with selftests which verify basic
    functionality and also test that a recycled pid cannot be signaled via
    a pidfd.

    Jon has written about a prior version of this patchset. It should
    cover the basic functionality since not a lot has changed since then:

    https://lwn.net/Articles/773459/

    The commit message for the syscall itself extensively documents
    the syscall, including its functionality and extensibility"

    * tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests: add tests for pidfd_send_signal()
    signal: add pidfd_send_signal() syscall

    Linus Torvalds
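
The liveness test mentioned above (sending signal 0 through a pidfd) can be sketched from user space as follows. This is a minimal example under stated assumptions: it uses a directory fd for /proc/self as the handle, calls the syscall through syscall(2), and falls back to the number 424 given in the syscall's commit message when the libc headers do not define SYS_pidfd_send_signal. The function name is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424   /* number quoted in the commit message */
#endif

/*
 * Send signal 0 to the calling process via a /proc/self directory fd.
 * Returns 0 if the process is reported alive, -1 with errno set otherwise.
 */
int probe_self_via_procfd(void)
{
    long ret;
    int saved_errno;
    int pidfd = open("/proc/self", O_DIRECTORY | O_CLOEXEC);

    if (pidfd < 0)
        return -1;

    /* NULL siginfo and flags == 0, as the interface currently requires. */
    ret = syscall(SYS_pidfd_send_signal, pidfd, 0, NULL, 0);
    saved_errno = errno;
    close(pidfd);
    errno = saved_errno;

    return ret == 0 ? 0 : -1;
}
```

On kernels that predate the syscall the call is expected to fail with ENOSYS, so callers should treat that as "unsupported" rather than "process gone".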
     

13 Mar, 2019

2 commits

  • The new generic radix trees have a simpler API and implementation, and
    no limitations on number of elements, so all flex_array users are being
    converted

    Link: http://lkml.kernel.org/r/20181217131929.11727-6-kent.overstreet@gmail.com
    Signed-off-by: Kent Overstreet
    Reviewed-by: Alexey Dobriyan
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Paris
    Cc: Marcelo Ricardo Leitner
    Cc: Matthew Wilcox
    Cc: Neil Horman
    Cc: Paul Moore
    Cc: Pravin B Shelar
    Cc: Shaohua Li
    Cc: Stephen Smalley
    Cc: Vlad Yasevich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • Compilers like to transform loops like

    for (i = 0; i < n; i++) {
        [use p[i]]
    }

    into

    for (p = p0; p < end; p++) {
        ...
    }

    Do it by hand, so that it results in overall simpler loop
    and smaller code.

    Space savings:

    $ ./scripts/bloat-o-meter ../vmlinux-001 ../obj/vmlinux
    add/remove: 0/0 grow/shrink: 2/1 up/down: 4/-9 (-5)
    Function old new delta
    proc_tid_base_lookup 17 19 +2
    proc_tgid_base_lookup 17 19 +2
    proc_pident_lookup 179 170 -9

    The same could be done to proc_pident_readdir(), but the code becomes
    bigger for some reason.

    [sfr@canb.auug.org.au: merge fix for proc_pident_lookup() API change]
    Link: http://lkml.kernel.org/r/20190131160135.4a8ae70b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20190114200422.GB9680@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Cc: James Morris
    Cc: Alexey Dobriyan
    Cc: Casey Schaufler
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
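
The hand transformation described in this commit can be shown on a small table lookup. The sketch below is not the kernel's proc_pident_lookup(); the `struct entry` type and both function names are hypothetical and exist only to put the indexed form and the pointer form side by side.

```c
#include <stddef.h>
#include <string.h>

struct entry {
    const char *name;
    int value;
};

/* Indexed form: the loop variable is an array index. */
const struct entry *lookup_indexed(const struct entry *ents, size_t n,
                                   const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(ents[i].name, name) == 0)
            return &ents[i];
    }
    return NULL;
}

/* Pointer form: the caller passes the start and one-past-the-end. */
const struct entry *lookup_pointer(const struct entry *p,
                                   const struct entry *end,
                                   const char *name)
{
    for (; p < end; p++) {
        if (strcmp(p->name, name) == 0)
            return p;
    }
    return NULL;
}
```

Both functions return the same element for the same input; whether the pointer form is actually smaller depends on the compiler and target, as the commit's own remark about proc_pident_readdir() suggests.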
     

08 Mar, 2019

2 commits

  • Pull audit updates from Paul Moore:
    "A lucky 13 audit patches for v5.1.

    Despite the rather large diffstat, most of the changes are from two
    bug fix patches that move code from one Kconfig option to another.

    Beyond that bit of churn, the remaining changes are largely cleanups
    and bug-fixes as we slowly march towards container auditing. It isn't
    all boring though, we do have a couple of new things: file
    capabilities v3 support, and expanded support for filtering on
    filesystems to solve problems with remote filesystems.

    All changes pass the audit-testsuite. Please merge for v5.1"

    * tag 'audit-pr-20190305' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: mark expected switch fall-through
    audit: hide auditsc_get_stamp and audit_serial prototypes
    audit: join tty records to their syscall
    audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL
    audit: remove unused actx param from audit_rule_match
    audit: ignore fcaps on umount
    audit: clean up AUDITSYSCALL prototypes and stubs
    audit: more filter PATH records keyed on filesystem magic
    audit: add support for fcaps v3
    audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDIT
    audit: add syscall information to CONFIG_CHANGE records
    audit: hand taken context to audit_kill_trees for syscall logging
    audit: give a clue what CONFIG_CHANGE op was involved

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:

    - Extend LSM stacking to allow sharing of cred, file, ipc, inode, and
    task blobs. This paves the way for more full-featured LSMs to be
    merged, and is specifically aimed at LandLock and SARA LSMs. This
    work is from Casey and Kees.

    - There's a new LSM from Micah Morton: "SafeSetID gates the setid
    family of syscalls to restrict UID/GID transitions from a given
    UID/GID to only those approved by a system-wide whitelist." This
    feature is currently shipping in ChromeOS.

    * 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (62 commits)
    keys: fix missing __user in KEYCTL_PKEY_QUERY
    LSM: Update list of SECURITYFS users in Kconfig
    LSM: Ignore "security=" when "lsm=" is specified
    LSM: Update function documentation for cap_capable
    security: mark expected switch fall-throughs and add a missing break
    tomoyo: Bump version.
    LSM: fix return value check in safesetid_init_securityfs()
    LSM: SafeSetID: add selftest
    LSM: SafeSetID: remove unused include
    LSM: SafeSetID: 'depend' on CONFIG_SECURITY
    LSM: Add 'name' field for SafeSetID in DEFINE_LSM
    LSM: add SafeSetID module that gates setid calls
    LSM: add SafeSetID module that gates setid calls
    tomoyo: Allow multiple use_group lines.
    tomoyo: Coding style fix.
    tomoyo: Swicth from cred->security to task_struct->security.
    security: keys: annotate implicit fall throughs
    security: keys: annotate implicit fall throughs
    security: keys: annotate implicit fall through
    capabilities:: annotate implicit fall through
    ...

    Linus Torvalds
     

06 Mar, 2019

3 commits

  • seq_printf() without format specifiers == faster seq_puts()

    Link: http://lkml.kernel.org/r/20190114200545.GC9680@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • [adobriyan@gmail.com: delete "extern" from prototype]
    Link: http://lkml.kernel.org/r/20190114195635.GA9372@avx2
    Signed-off-by: Zhikang Zhang
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhikang Zhang
     
  • The kill() syscall operates on process identifiers (pid). After a process
    has exited its pid can be reused by another process. If a caller sends a
    signal to a reused pid it will end up signaling the wrong process. This
    issue has often surfaced and there has been a push to address this problem [1].

    This patch uses file descriptors (fd) from proc/<pid> as stable handles on
    struct pid. Even if a pid is recycled the handle will not change. The fd
    can be used to send signals to the process it refers to.
    Thus, the new syscall pidfd_send_signal() is introduced to solve this
    problem. Instead of pids it operates on process fds (pidfd).

    /* prototype and argument */
    long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);

    /* syscall number 424 */
    The syscall number was chosen to be 424 to align with Arnd's rework in his
    y2038 to minimize merge conflicts (cf. [25]).

    In addition to the pidfd and signal argument it takes an additional
    siginfo_t and flags argument. If the siginfo_t argument is NULL then
    pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
    is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
    The flags argument is added to allow for future extensions of this syscall.
    It currently needs to be passed as 0. Failing to do so will cause EINVAL.

    /* pidfd_send_signal() replaces multiple pid-based syscalls */
    The pidfd_send_signal() syscall currently takes on the job of
    rt_sigqueueinfo(2) and parts of the functionality of kill(2), namely, when a
    positive pid is passed to kill(2). It will however be possible to also
    replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.

    /* sending signals to threads (tid) and process groups (pgid) */
    Specifically, the pidfd_send_signal() syscall does currently not operate on
    process groups or threads. This is left for future extensions.
    In order to extend the syscall to allow sending signal to threads and
    process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
    PIDFD_TYPE_TID) should be added. This implies that the flags argument will
    determine what is signaled and not the file descriptor itself. Put in other
    words, grouping in this api is a property of the flags argument not a
    property of the file descriptor (cf. [13]). Clarification for this has been
    requested by Eric (cf. [19]).
    When appropriate extensions through the flags argument are added then
    pidfd_send_signal() can additionally replace the part of kill(2) which
    operates on process groups as well as the tgkill(2) and
    rt_tgsigqueueinfo(2) syscalls.
    How such an extension could be implemented has been very roughly sketched
    in [14], [15], and [16]. However, this should not be taken as a commitment
    to a particular implementation. There might be better ways to do it.
    Right now this is intentionally left out to keep this patchset as simple as
    possible (cf. [4]).

    /* naming */
    The syscall had various names throughout iterations of this patchset:
    - procfd_signal()
    - procfd_send_signal()
    - taskfd_send_signal()
    In the last round of reviews it was pointed out that, given that the
    flags argument decides the scope of the signal instead of different types
    of fds, it might make sense to either settle for "procfd_" or "pidfd_" as
    prefix. The community was willing to accept either (cf. [17] and [18]).
    Given that one developer expressed strong preference for the "pidfd_"
    prefix (cf. [13]) and with other developers less opinionated about the name
    we should settle for "pidfd_" to avoid further bikeshedding.

    The "_send_signal" suffix was chosen to reflect the fact that the syscall
    takes on the job of multiple syscalls. It is therefore intentional that the
    name is reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
    former because it might imply that pidfd_send_signal() is a replacement for
    kill(2), and not the latter because it is a hassle to remember the correct
    spelling - especially for non-native speakers - and because it is not
    descriptive enough of what the syscall actually does. The name
    "pidfd_send_signal" makes it very clear that its job is to send signals.

    /* zombies */
    Zombies can be signaled just as any other process. No special error will be
    reported since a zombie state is an unreliable state (cf. [3]). However,
    this can be added as an extension through the @flags argument if the need
    ever arises.

    /* cross-namespace signals */
    The patch currently enforces that the signaler and signalee either are in
    the same pid namespace or that the signaler's pid namespace is an ancestor
    of the signalee's pid namespace. This is done for the sake of simplicity
    and because it is unclear what values certain members of struct
    siginfo_t would need to be set to (cf. [5], [6]).

    /* compat syscalls */
    It became clear that we would like to avoid adding compat syscalls
    (cf. [7]). The compat syscall handling is now done in kernel/signal.c
    itself by adding __copy_siginfo_from_user_generic() which lets us avoid
    compat syscalls (cf. [8]). It should be noted that the addition of
    __copy_siginfo_from_user_any() is caused by a bug in the original
    implementation of rt_sigqueueinfo(2) (cf. [12]).
    With upcoming rework for syscall handling things might improve
    significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
    any additional callers.

    /* testing */
    This patch was tested on x64 and x86.

    /* userspace usage */
    An asciinema recording for the basic functionality can be found under [9].
    With this patch a process can be killed via:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                           unsigned int flags)
    {
    #ifdef __NR_pidfd_send_signal
            return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
    #else
            /* Follow the libc convention (-1 with errno set) so that the
             * strerror(errno) in main() reports the real reason. */
            errno = ENOSYS;
            return -1;
    #endif
    }

    int main(int argc, char *argv[])
    {
            int fd, ret, saved_errno, sig;

            /* argv[1]: procfs directory of the target, argv[2]: signal number */
            if (argc < 3)
                    exit(EXIT_FAILURE);

            fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
            if (fd < 0) {
                    printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                    exit(EXIT_FAILURE);
            }

            sig = atoi(argv[2]);

            printf("Sending signal %d to process %s\n", sig, argv[1]);
            ret = do_pidfd_send_signal(fd, sig, NULL, 0);

            /* close() may clobber errno, so preserve it for the report below. */
            saved_errno = errno;
            close(fd);
            errno = saved_errno;

            if (ret < 0) {
                    printf("%s - Failed to send signal %d to process %s\n",
                           strerror(errno), sig, argv[1]);
                    exit(EXIT_FAILURE);
            }

            exit(EXIT_SUCCESS);
    }

    /* Q&A
    * Given that it seems the same questions get asked again by people who are
    * late to the party it makes sense to add a Q&A section to the commit
    * message so it's hopefully easier to avoid duplicate threads.
    *
    * For the sake of progress please consider these arguments settled unless
    * there is a new point that desperately needs to be addressed. Please make
    * sure to check the links to the threads in this commit message to see
    * whether a point has already been covered.
    */
    Q-01: (Florian Weimer [20], Andrew Morton [21])
    What happens when the target process has exited?
    A-01: Sending the signal will fail with ESRCH (cf. [22]).

    Q-02: (Andrew Morton [21])
    Is the task_struct pinned by the fd?
    A-02: No. A reference to struct pid is kept. struct pid - as far as I
    understand - was created precisely so that struct task_struct does not
    need to be pinned (cf. [22]).

    Q-03: (Andrew Morton [21])
    Does the entire procfs directory remain visible? Just one entry
    within it?
    A-03: The same thing that happens right now when you hold a file descriptor
    to /proc/ open (cf. [22]).

    Q-04: (Andrew Morton [21])
    Does the pid remain reserved?
    A-04: No. This patchset guarantees a stable handle, not that pids are not
    recycled (cf. [22]).

    Q-05: (Andrew Morton [21])
    Do attempts to signal that fd return errors?
    A-05: See {Q,A}-01.

    Q-06: (Andrew Morton [22])
    Is there a cleaner way of obtaining the fd? Another syscall perhaps.
    A-06: Userspace can already trivially retrieve file descriptors from procfs
    so this is something that we will need to support anyway. Hence,
    there's no immediate need to add another syscall just to make
    pidfd_send_signal() not dependent on the presence of procfs. However,
    adding a syscall to get such file descriptors is planned for a
    future patchset (cf. [22]).

    Q-07: (Andrew Morton [21] and others)
    This fd-for-a-process sounds like a handy thing and people may well
    think up other uses for it in the future, probably unrelated to
    signals. Are the code and the interface designed to permit such
    future applications?
    A-07: Yes (cf. [22]).

    Q-08: (Andrew Morton [21] and others)
    Now I think about it, why a new syscall? This thing is looking
    rather like an ioctl?
    A-08: This has been extensively discussed. It was agreed that a syscall is
    preferred for a variety of reasons. Here are just a few taken from
    prior threads. Syscalls are safer than ioctl()s especially when
    signaling to fds. Processes are a core kernel concept so a syscall
    seems more appropriate. The layout of the syscall with its four
    arguments would require the addition of a custom struct for the
    ioctl() thereby causing at least the same amount or even more
    complexity for userspace than a simple syscall. The new syscall will
    replace multiple other pid-based syscalls (see description above).
    The file-descriptors-for-processes concept introduced with this
    syscall will be extended with other syscalls in the future. See also
    [22], [23] and various other threads already linked in here.

    Q-09: (Florian Weimer [24])
    What happens if you use the new interface with an O_PATH descriptor?
    A-09:
    pidfds opened as O_PATH fds cannot be used to send signals to a
    process (cf. [2]). Signaling processes through pidfds is the
    equivalent of writing to a file. Thus, this is not an operation that
    operates "purely at the file descriptor level" as required by the
    open(2) manpage. See also [4].
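
    The behaviour described in A-01 can be observed from userspace with a
    small sketch. It is not part of the patch: signal_reaped_child() is a
    hypothetical helper, and it assumes that procfs is mounted at /proc and
    that the running kernel provides pidfd_send_signal(). It forks a child,
    opens the child's procfs directory, kills and reaps the child, and then
    reports the errno produced by signaling through the stale fd.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the errno reported when a signal is sent
 * through a pidfd whose process has already been reaped, 0 if the call
 * unexpectedly succeeds, or -1 if the setup itself fails.
 */
int signal_reaped_child(void)
{
	char path[64];
	int fd, ret, err;
	pid_t pid;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		pause();	/* stay alive until the parent kills us */
		_exit(0);
	}

	/* The child has not been reaped yet, so its procfs directory exists. */
	snprintf(path, sizeof(path), "/proc/%d", (int)pid);
	fd = open(path, O_DIRECTORY | O_CLOEXEC);

	/* Kill and reap the child; from here on its pid may be recycled. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);

	if (fd < 0)
		return -1;

#ifdef __NR_pidfd_send_signal
	/* Signal 0 delivers nothing; only the checks are performed. */
	ret = syscall(__NR_pidfd_send_signal, fd, 0, NULL, 0);
	err = errno;
#else
	ret = -1;
	err = ENOSYS;	/* headers too old to know the syscall number */
#endif
	close(fd);

	return ret < 0 ? err : 0;
}
```

    On a kernel with this patch the helper is expected to return ESRCH. On
    kernels or sandboxes without the syscall it returns ENOSYS or EPERM.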

    /* References */
    [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
    [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
    [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
    [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
    [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
    [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
    [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
    [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
    [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
    [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
    [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
    [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
    [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
    [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
    [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
    [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
    [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
    [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
    [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
    [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
    [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
    [23]: https://lwn.net/Articles/773459/
    [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
    [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/

    Cc: "Eric W. Biederman"
    Cc: Jann Horn
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Florian Weimer
    Signed-off-by: Christian Brauner
    Reviewed-by: Tycho Andersen
    Reviewed-by: Kees Cook
    Reviewed-by: David Howells
    Acked-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Acked-by: Serge Hallyn
    Acked-by: Aleksa Sarai

    Christian Brauner
     

22 Feb, 2019

1 commit

  • Tetsuo has reported that creating thousands of processes sharing MM
    without SIGHAND (aka alien threads) and setting
    /proc//oom_score_adj will swamp the kernel log and take ages [1]
    to finish. This is especially worrisome because all that printing is done
    under the RCU lock and can potentially trigger an RCU stall or the
    softlockup detector.

    The primary reason for the printk was to catch potential users who might
    depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
    processes sharing mm have same view of oom_score_adj") but after more
    than 2 years without a single report I guess it is safe to simply remove
    the printk altogether.

    The next step should be moving oom_score_adj over to the mm struct and
    removing all the task crawling, as suggested by [2].

    [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
    [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz

    Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Yong-Taek Lee
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

26 Jan, 2019

1 commit

  • loginuid and sessionid (and audit_log_session_info) should be part of
    CONFIG_AUDIT scope and not CONFIG_AUDITSYSCALL since they are used in
    CONFIG_CHANGE, ANOM_LINK, FEATURE_CHANGE (and INTEGRITY_RULE), none of
    which are otherwise dependent on AUDITSYSCALL.

    Please see github issue
    https://github.com/linux-audit/audit-kernel/issues/104

    Signed-off-by: Richard Guy Briggs
    [PM: tweaked subject line for better grep'ing]
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

09 Jan, 2019

1 commit

  • Back in 2007 I made what turned out to be a rather serious
    mistake in the implementation of the Smack security module.
    The SELinux module used an interface in /proc to manipulate
    the security context on processes. Rather than use a similar
    interface, I used the same interface. The AppArmor team did
    likewise. Now /proc/.../attr/current will tell you the
    security "context" of the process, but it will be different
    depending on the security module you're using.

    This patch provides a subdirectory in /proc/.../attr for
    Smack. Smack user space can use the "current" file in
    this subdirectory and never have to worry about getting
    SELinux attributes by mistake. Programs that use the
    old interface will continue to work (or fail, as the case
    may be) as before.
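
    A userspace consumer could prefer the new per-module file and fall back
    to the shared one roughly as sketched below. This is only an
    illustration: open_first_readable() and open_smack_current() are
    hypothetical helpers, and the sketch assumes the new subdirectory shows
    up as /proc/self/attr/smack/ for the calling process.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Preferred, module-specific file first; the shared legacy file second. */
static const char *const smack_attr_paths[] = {
	"/proc/self/attr/smack/current",
	"/proc/self/attr/current",
};

/*
 * Hypothetical helper: open the first path that can be opened read-only.
 * Returns the fd and stores the index of the path used in *which, or
 * returns -1 if none of the paths could be opened.
 */
int open_first_readable(const char *const *paths, size_t n, size_t *which)
{
	for (size_t i = 0; i < n; i++) {
		int fd = open(paths[i], O_RDONLY | O_CLOEXEC);

		if (fd >= 0) {
			*which = i;
			return fd;
		}
	}
	return -1;
}

/* *which == 0 means the Smack-only file was used, 1 the legacy file. */
int open_smack_current(size_t *which)
{
	return open_first_readable(smack_attr_paths,
				   sizeof(smack_attr_paths) /
				   sizeof(smack_attr_paths[0]), which);
}
```

    A caller that ends up on the legacy file still has to be prepared for a
    context written by a different security module.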

    The proposed S.A.R.A security module is dependent on
    the mechanism to create its own attr subdirectory.

    The original implementation is by Kees Cook.

    Signed-off-by: Casey Schaufler
    Reviewed-by: Kees Cook
    Signed-off-by: Kees Cook

    Casey Schaufler
     

05 Jan, 2019

2 commits

  • Header of /proc/*/limits is a fixed string, so print it directly without
    formatting specifiers.

    Link: http://lkml.kernel.org/r/20181203164242.GB6904@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Access to timerslack_ns is controlled by a process having CAP_SYS_NICE
    in its effective capability set, but the current check looks in the root
    namespace instead of the process' user namespace. Since a process is
    allowed to do other activities controlled by CAP_SYS_NICE inside a
    namespace, it should also be able to adjust timerslack_ns.

    Link: http://lkml.kernel.org/r/20181030180012.232896-1-bmgordon@google.com
    Signed-off-by: Benjamin Gordon
    Acked-by: "Eric W. Biederman"
    Cc: John Stultz
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: "Serge E. Hallyn"
    Cc: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Todd Kjos
    Cc: Colin Cross
    Cc: Nick Kralevich
    Cc: Dmitry Shmidt
    Cc: Elliott Hughes
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Gordon
     

29 Dec, 2018

1 commit

  • totalram_pages and totalhigh_pages are turned into static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here:
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
    better to remove the lock and convert the variables to atomic, with
    prevention of potential store-to-read tearing as a bonus.
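
    The pattern can be sketched in userspace with C11 atomics. This is an
    analogue under stated assumptions, not the kernel code: the counter name
    totalram_pages_counter is made up, and <stdatomic.h> stands in for the
    kernel's own atomic helpers.

```c
#include <stdatomic.h>

/* Stand-in for the kernel's counter; the name is made up for this sketch. */
static atomic_long totalram_pages_counter;

/* Readers call an accessor instead of reading a plain global variable. */
static inline unsigned long totalram_pages(void)
{
	return (unsigned long)atomic_load(&totalram_pages_counter);
}

/* Writers update the counter atomically, so no lock is required. */
static inline void totalram_pages_add(long count)
{
	atomic_fetch_add(&totalram_pages_counter, count);
}
```

    Because every access goes through an atomic operation, a reader cannot
    observe a half-written value.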

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

02 Nov, 2018

1 commit

  • Pull stackleak gcc plugin from Kees Cook:
    "Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
    was ported from grsecurity by Alexander Popov. It provides efficient
    stack content poisoning at syscall exit. This creates a defense
    against at least two classes of flaws:

    - Uninitialized stack usage. (We continue to work on improving the
    compiler to do this in other ways: e.g. unconditional zero init was
    proposed to GCC and Clang, and more plugin work has started too).

    - Stack content exposure. By greatly reducing the lifetime of valid
    stack contents, exposures via either direct read bugs or unknown
    cache side-channels become much more difficult to exploit. This
    complements the existing buddy and heap poisoning options, but
    provides the coverage for stacks.

    The x86 hooks are included in this series (which have been reviewed by
    Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
    been merged through the arm64 tree (written by Laura Abbott and
    reviewed by Mark Rutland and Will Deacon).

    With VLAs having been removed this release, there is no need for
    alloca() protection, so it has been removed from the plugin"

    * tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    arm64: Drop unneeded stackleak_check_alloca()
    stackleak: Allow runtime disabling of kernel stack erasing
    doc: self-protection: Add information about STACKLEAK feature
    fs/proc: Show STACKLEAK metrics in the /proc file system
    lkdtm: Add a test for STACKLEAK
    gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
    x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls

    Linus Torvalds
     

06 Oct, 2018

1 commit

  • Currently, you can use /proc/self/task/*/stack to cause a stack walk on
    a task you control while it is running on another CPU. That means that
    the stack can change under the stack walker. The stack walker does
    have guards against going completely off the rails and into random
    kernel memory, but it can interpret random data from your kernel stack
    as instruction pointers and stack pointers. This can cause exposure of
    kernel stack contents to userspace.

    Restrict the ability to inspect kernel stacks of arbitrary tasks to root
    in order to prevent a local attacker from exploiting racy stack unwinding
    to leak kernel task stack contents. See the added comment for a longer
    rationale.

    There don't seem to be any users of this userspace API that can't
    gracefully bail out if reading from the file fails. Therefore, I believe
    that this change is unlikely to break things. In the case that this patch
    does end up needing a revert, the next-best solution might be to fake a
    single-entry stack based on wchan.
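
    A consumer that bails out gracefully might look like the sketch below.
    describe_kernel_stack() is a hypothetical helper, not an existing API;
    it assumes procfs is mounted at /proc and that the caller passes a
    buffer of at least two bytes.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill buf with the caller's kernel stack trace, or
 * with a short note if the file cannot be opened or read (for example
 * EACCES for unprivileged users, or ENOENT if the file does not exist).
 * Always leaves a non-empty, NUL-terminated string in buf.
 */
void describe_kernel_stack(char *buf, size_t len)
{
	ssize_t n = -1;
	int fd, saved_errno;

	fd = open("/proc/self/stack", O_RDONLY | O_CLOEXEC);
	if (fd >= 0) {
		n = read(fd, buf, len - 1);
		saved_errno = errno;	/* close() may clobber errno */
		close(fd);
		errno = saved_errno;
	}

	if (n > 0) {
		buf[n] = '\0';
		return;
	}

	/* Bail out gracefully instead of treating the failure as fatal. */
	snprintf(buf, len, "<kernel stack unavailable: %s>",
		 n == 0 ? "empty file" : strerror(errno));
}
```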

    Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
    Fixes: 2ec220e27f50 ("proc: add /proc/*/stack")
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Ken Chen
    Cc: Will Deacon
    Cc: Laura Abbott
    Cc: Andy Lutomirski
    Cc: Catalin Marinas
    Cc: Josh Poimboeuf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H . Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

05 Sep, 2018

1 commit

  • Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
    tasks via the /proc file system. In particular, /proc//stack_depth
    shows the maximum kernel stack consumption for the current and previous
    syscalls. Although this information is not precise, it can be useful for
    estimating the STACKLEAK performance impact for your workloads.

    Suggested-by: Ingo Molnar
    Signed-off-by: Alexander Popov
    Tested-by: Laura Abbott
    Signed-off-by: Kees Cook

    Alexander Popov
     

23 Aug, 2018

4 commits

  • ->latency_record is defined as

    struct latency_record latency_record[LT_SAVECOUNT];

    so use the same macro while iterating.
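
    The idea can be shown with a minimal userspace sketch. The struct
    layouts here are simplified stand-ins rather than the kernel
    definitions, and total_latency_count() is a hypothetical function.

```c
#include <stddef.h>

#define LT_SAVECOUNT 32

/* Simplified stand-in for the kernel structure. */
struct latency_record {
	unsigned long count;
};

struct task {
	struct latency_record latency_record[LT_SAVECOUNT];
};

/* The loop bound is the macro that sizes the array, not a magic number. */
unsigned long total_latency_count(const struct task *t)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < LT_SAVECOUNT; i++)
		sum += t->latency_record[i].count;

	return sum;
}
```

    If the array is ever resized, the loop follows automatically.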

    Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The code checks whether the write is done by current to its own attributes.
    A get/put pair is unnecessary for that, as the check can be done under RCU.

    Note: rcu_read_unlock() can be done even earlier since the pointer to the
    task is not dereferenced. It depends on whether /proc code should look
    scary or not:

    rcu_read_lock();
    task = pid_task(...);
    rcu_read_unlock();
    if (!task)
            return -ESRCH;
    if (task != current)
            return -EACCES;

    P.S.: rename "length" variable. Code like this

    length = -EINVAL;

    should not exist.

    Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Link: http://lkml.kernel.org/r/20180627195427.GE18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Patch series "cleanups and refactor of /proc/pid/smaps*".

    The recent regression in /proc/pid/smaps made me look more into the code,
    especially the issues with smaps_rollup reported in [1]. These are
    explained in Patch 4, which fixes them by refactoring the code. Patches 2
    and 3 are preparations for that. Patch 1 is me realizing that there's a lot
    of boilerplate left from times where we tried (unsuccessfully) to mark
    thread stacks in the output.

    Originally I also had plans to rework the translation from
    /proc/pid/*maps* file offsets to the internal structures. Now the offset
    means "vma number", which is not really stable (vmas can come and go
    between read() calls) and there's extra caching of the last vma's address.
    My idea was that offsets would be interpreted directly as addresses, which
    would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
    tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
    long so that might be insufficient somewhere for the unsigned long
    addresses.
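
    The range concern can be made concrete with a small check. This is only
    a sketch: addr_fits_in_offset() is a hypothetical helper, and fixed-width
    types stand in for a 64-bit unsigned long and loff_t.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a 64-bit address can be stored in a signed 64-bit file offset
 * without turning negative.
 */
bool addr_fits_in_offset(uint64_t addr)
{
	return addr <= (uint64_t)INT64_MAX;
}
```

    A typical user stack address such as 0x00007fffffffe000 fits, while the
    x86-64 [vsyscall] mapping at 0xffffffffff600000 does not.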

    So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
    simpler smaps code, and a lot of unused code removed.

    [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

    This patch (of 4):

    Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc//maps") introduced differences between /proc/PID/maps and
    /proc/PID/task/TID/maps to mark thread stacks properly, and this was
    also done for smaps and numa_maps. However it didn't work properly and
    was ultimately removed by commit b18cb64ead40 ("fs/proc: Stop trying to
    report thread stacks").

    Now the is_pid parameter for the related show_*() functions is unused
    and we can remove it together with wrapper functions and ops structures
    that differ for PID and TID cases only in this parameter.

    Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

20 Jun, 2018

1 commit

  • The rewrite of the cmdline fetching missed the fact that we used to also
    return the final terminating NUL character of the last argument. I
    hadn't noticed, and none of the tools I tested cared, but something
    obviously must care, because Michal Kubecek noticed the change in
    behavior.

    Tweak the "find the end" logic to actually include the NUL character,
    and once past the end of argv, always start the strnlen() at the
    expected (original) argument end.

    This whole "allow people to rewrite their arguments in place" is a nasty
    hack and requires that odd slop handling at the end of the argv array,
    but it's our traditional model, so we continue to support it.
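
    The restored behavior can be checked from userspace with a sketch like
    the following. cmdline_ends_with_nul() is a hypothetical helper and
    assumes procfs is mounted at /proc.

```c
#include <stdio.h>

/*
 * Returns 1 if the last byte of /proc/self/cmdline is NUL, 0 if it is
 * not, and -1 if the file cannot be opened or is empty.
 */
int cmdline_ends_with_nul(void)
{
	FILE *f = fopen("/proc/self/cmdline", "r");
	int c, last = -1;

	if (!f)
		return -1;

	/* Walk to the end of the file, remembering the last byte seen. */
	while ((c = fgetc(f)) != EOF)
		last = c;
	fclose(f);

	if (last < 0)
		return -1;

	return last == '\0';
}
```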

    Reported-and-bisected-by: Michal Kubecek
    Reviewed-and-tested-by: Michal Kubecek
    Cc: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jun, 2018

1 commit

  • Code is structured like this:

    for ( ... p < last; p++) {
            if (memcmp == 0)
                    break;
    }
    if (p >= last)
            ERROR
    OK

    gcc doesn't see that, if the lookup succeeds, the post-loop branch will
    never be taken, and so does not skip it.
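
    The two shapes can be compared side by side in a userspace sketch. The
    struct and function names are hypothetical, and the second variant is
    just one plausible way to let the success path bypass the post-loop
    test, not necessarily what the patch does.

```c
#include <stddef.h>
#include <string.h>

struct entry {
	const char *name;
	int value;
};

/* Original shape: break out of the loop, then re-test the cursor. */
int lookup_check_after(const struct entry *p, const struct entry *last,
		       const char *name)
{
	for (; p < last; p++) {
		if (strcmp(p->name, name) == 0)
			break;
	}
	if (p >= last)
		return -1;	/* ERROR */
	return p->value;	/* OK */
}

/* Restructured: success is handled in the loop, only failure falls out. */
int lookup_in_loop(const struct entry *p, const struct entry *last,
		   const char *name)
{
	for (; p < last; p++) {
		if (strcmp(p->name, name) == 0)
			return p->value;	/* OK */
	}
	return -1;	/* ERROR */
}
```

    Both functions return the same results. The second simply never
    evaluates the bounds test again after a successful match.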

    [akpm@linux-foundation.org: proc_pident_instantiate() no longer takes an inode*]
    Link: http://lkml.kernel.org/r/20180423213954.GD9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan