Eric Lee / smarc-fsl-linux-kernel

01 Oct, 2020

2 commits

ba7eb0e48 proc: io_accounting: Use new infrastructure to fix deadlocks in execve

[ Upstream commit 76518d3798855242817e8a8ed76b2d72f4415624 ]

This changes do_io_accounting to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the tracer is accessing
/proc/$pid/io for instance.

This should be safe, as the credentials are only used for reading.

Signed-off-by: Bernd Edlinger
Signed-off-by: Eric W. Biederman
Signed-off-by: Sasha Levin

Bernd Edlinger
2020-10-01 19:17:48 +0800
4301db49e proc: Use new infrastructure to fix deadlocks in execve

[ Upstream commit 2db9dbf71bf98d02a0bf33e798e5bfd2a9944696 ]

This changes lock_trace to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the tracer is accessing
/proc/$pid/stack for instance.

This should be safe, as the credentials are only used for reading,
and task->mm is updated on execve under the new exec_update_mutex.

Signed-off-by: Bernd Edlinger
Signed-off-by: Eric W. Biederman
Signed-off-by: Sasha Levin

Bernd Edlinger
2020-10-01 19:17:48 +0800

17 Jul, 2019

3 commits

295415229 Merge branch 'proc-cmdline' (/proc/<pid>/cmdline fixes)

This fixes two problems reported with the cmdline simplification and
cleanup last year:

- the setproctitle() special cases didn't quite match the original
semantics, and it can be noticeable:

https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/

- it could leak an uninitialized byte from the temporary buffer under
the right (wrong) circumstances:

https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/

It rewrites the logic entirely, splitting it into two separate commits
(and two separate functions) for the two different cases ("unedited
cmdline" vs "setproctitle() has been used to change the command line").

* proc-cmdline:
/proc/<pid>/cmdline: add back the setproctitle() special case
/proc/<pid>/cmdline: remove all the special cases

Linus Torvalds
2019-07-17 01:37:27 +0800
d26d0cd97 /proc/<pid>/cmdline: add back the setproctitle() special case

This makes the setproctitle() special case very explicit indeed, and
handles it with a separate helper function entirely. In the process, it
re-instates the original semantics of simply stopping at the first NUL
character when the original last NUL character is no longer there.

[ The original semantics can still be seen in mm/util.c: get_cmdline()
that is limited to a fixed-size buffer ]

This makes the logic about when we use the string lengths etc much more
obvious, and makes it easier to see what we do and what the two very
different cases are.

Note that even when we allow walking past the end of the argument array
(because the setproctitle() might have overwritten and overflowed the
original argv[] strings), we only allow it when it overflows into the
environment region if it is immediately adjacent.
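
The two cases described above can be modeled in userspace roughly as follows. This is a minimal sketch, not the kernel code: `model_cmdline_len()` and its parameters are hypothetical names, `mem` stands in for the target process's memory, and counting the terminating NUL in the setproctitle() case is an assumption of the model.

```c
#include <stddef.h>

/*
 * Userspace model only. The offsets play the role of mm->arg_start,
 * mm->arg_end, mm->env_start and mm->env_end. Returns how many bytes,
 * starting at arg_start, would be reported as the command line.
 */
size_t model_cmdline_len(const char *mem,
                         size_t arg_start, size_t arg_end,
                         size_t env_start, size_t env_end)
{
    size_t limit, i;

    if (arg_end <= arg_start)
        return 0;

    /* Unedited case: the last NUL of argv[] is intact, report the whole area. */
    if (mem[arg_end - 1] == '\0')
        return arg_end - arg_start;

    /* setproctitle() case: walk into the environment only if it is adjacent. */
    limit = arg_end;
    if (env_start == arg_end && env_end > env_start)
        limit = env_end;

    /* Stop at the first NUL and count it if found (assumption of this model). */
    for (i = arg_start; i < limit; i++)
        if (mem[i] == '\0')
            return i - arg_start + 1;

    return limit - arg_start;
}
```

With a 6-byte argument area whose final NUL has been overwritten by a longer title, the model keeps scanning into an adjacent environment area until it finds a NUL, and stops at the end of the argument area when the environment is not adjacent.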

[ Fixed for missing 'count' checks noted by Alexey Izbyshev ]

Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
Fixes: 5ab827189965 ("fs/proc: simplify and clarify get_mm_cmdline() function")
Cc: Jakub Jankowski
Cc: Alexey Dobriyan
Cc: Alexey Izbyshev
Signed-off-by: Linus Torvalds

Linus Torvalds
2019-07-17 00:57:52 +0800
3d712546d /proc/<pid>/cmdline: remove all the special cases

Start off with a clean slate that only reads exactly from arg_start to
arg_end, without any oddities. This simplifies the code and in the
process removes the case that caused us to potentially leak an
uninitialized byte from the temporary kernel buffer.

Note that in order to start from scratch with an understandable base,
this simplifies things _too_ much, and removes all the legacy logic to
handle setproctitle() having changed the argument strings.

We'll add back those special cases very differently in the next commit.

Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
Fixes: f5b65348fd77 ("proc: fix missing final NUL in get_mm_cmdline() rewrite")
Cc: Alexey Izbyshev
Cc: Alexey Dobriyan
Signed-off-by: Linus Torvalds

Linus Torvalds
2019-07-17 00:57:52 +0800

13 Jul, 2019

3 commits

ac311a14c oom: decouple mems_allowed from oom_unkillable_task

Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to
mem_exclusive cpuset") introduces a heuristic where a potential
oom-killer victim is skipped if the intersection of the mems_allowed of
the potential victim and of current (the process that triggered the oom)
is empty, on the reasoning that killing such a victim most probably will
not help the current allocating process.

However the commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the
heuristic to just decrease the oom_badness scores of such potential
victim based on the reason that the cpuset of such processes might have
changed and previously they may have allocated memory on mems where the
current allocating process can allocate from.

Unintentionally 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a
side effect as the oom_badness is also exposed to the user space through
/proc/[pid]/oom_score, so, readers with different cpusets can read
different oom_score of the same process.

Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same
cpuset") fixed the side effect introduced by 7887a3da753e by moving the
cpuset intersection back to only oom-killer context and out of
oom_badness. However the combination of ab290adbaf8f ("oom: make
oom_unkillable_task() helper function") and 26ebc984913b ("oom:
/proc/<pid>/oom_score treat kernel thread honestly") unintentionally
brought back the cpuset intersection check into the oom_badness
calculation function.

Besides the cpuset/mempolicy intersection check in oom_badness, the memcg
oom context is also doing a cpuset/mempolicy intersection check, which is
quite wrong and was caught by syzkaller with the following report:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 3c 02 00 0f
85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
RSP: 0018:ffff888000127490 EFLAGS: 00010a03
RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Call Trace:
oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
select_bad_process mm/oom_kill.c:374 [inline]
out_of_memory mm/oom_kill.c:1088 [inline]
out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
mem_cgroup_oom mm/memcontrol.c:1905 [inline]
try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
wp_huge_pmd mm/memory.c:3793 [inline]
__handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
__do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
RIP: 0033:0x400590
Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 06 e9 1e 01
00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
Modules linked in:
---[ end trace a65689219582ffff ]---
RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 3c 02 00 0f
85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
RSP: 0018:ffff888000127490 EFLAGS: 00010a03
RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600

The fix is to decouple the cpuset/mempolicy intersection check from
oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
only done in the global oom context.

[shakeelb@google.com: change function name and update comment]
Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com
Signed-off-by: Shakeel Butt
Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
Acked-by: Roman Gushchin
Acked-by: Michal Hocko
Cc: David Rientjes
Cc: Johannes Weiner
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Cc: Paul Jackson
Cc: Tetsuo Handa
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shakeel Butt
2019-07-13 02:05:47 +0800
6ba749ee7 mm, oom: remove redundant task_in_mem_cgroup() check

oom_unkillable_task() can be called from three different contexts, i.e.
global OOM, memcg OOM and the oom_score procfs interface. At the moment
oom_unkillable_task() does a task_in_mem_cgroup() check on the given
process. Since there is no reason to perform the task_in_mem_cgroup()
check for global OOM and the oom_score procfs interface, those contexts
provide a NULL memcg and skip the task_in_mem_cgroup() check. However
for memcg OOM context, the oom_unkillable_task() is always called from
mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
redundant and effectively dead code. So, just remove the
task_in_mem_cgroup() check altogether.

Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
Signed-off-by: Shakeel Butt
Signed-off-by: Tetsuo Handa
Acked-by: Roman Gushchin
Acked-by: Michal Hocko
Cc: David Rientjes
Cc: Johannes Weiner
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Cc: Paul Jackson
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shakeel Butt
2019-07-13 02:05:47 +0800
cd9e2bb82 proc: use down_read_killable mmap_sem for /proc/pid/map_files

Do not remain stuck forever if something goes wrong. Using a killable
lock permits cleanup of stuck tasks and simplifies investigation.

It seems ->d_revalidate() could return any error (except ECHILD) to abort
validation and pass error as result of lookup sequence.

[akpm@linux-foundation.org: fix proc_map_files_lookup() return value, per Andrei]
Link: http://lkml.kernel.org/r/156007493995.3335.9595044802115356911.stgit@buzz
Signed-off-by: Konstantin Khlebnikov
Reviewed-by: Roman Gushchin
Reviewed-by: Cyrill Gorcunov
Reviewed-by: Kirill Tkhai
Acked-by: Michal Hocko
Cc: Alexey Dobriyan
Cc: Al Viro
Cc: Matthew Wilcox
Cc: Michal Koutný
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2019-07-13 02:05:47 +0800

09 Jul, 2019

1 commit

3431a940b Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 AVX512 status update from Ingo Molnar:
"This adds a new ABI that the main scheduler probably doesn't want to
deal with but HPC job schedulers might want to use: the
AVX512_elapsed_ms field in the new /proc/<pid>/arch_status task status
file, which allows the user-space job scheduler to cluster such tasks,
to avoid turbo frequency drops"

* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/filesystems/proc.txt: Add arch_status file
x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status
proc: Add /proc/<pid>/arch_status
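
A user-space job scheduler could pick the field out of the file contents along these lines. This is only a sketch: `parse_avx512_elapsed_ms()` is a hypothetical helper, and the line layout "AVX512_elapsed_ms:" followed by whitespace and an integer is an assumption about the file format, not something stated in this log.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse the AVX512_elapsed_ms value out of text read from
 * /proc/<pid>/arch_status. Returns 0 and stores the value on success,
 * -1 if the field is missing or malformed.
 */
int parse_avx512_elapsed_ms(const char *text, long *out_ms)
{
    static const char key[] = "AVX512_elapsed_ms:";
    const char *p = strstr(text, key);

    if (p == NULL)
        return -1;
    /* %ld skips the whitespace that separates the key from the number. */
    if (sscanf(p + sizeof(key) - 1, "%ld", out_ms) != 1)
        return -1;
    return 0;
}
```

The caller is expected to read the whole file into a NUL-terminated buffer first; the helper itself does no I/O.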

Linus Torvalds
2019-07-09 08:28:57 +0800

27 Jun, 2019

1 commit

30d158b14 proc: remove useless d_is_dir() check

Remove the d_is_dir() check from tgid_pidfd_to_pid().

It is pointless since you should never get &proc_tgid_base_operations
for f_op on a non-directory.

Suggested-by: Al Viro
Signed-off-by: Christian Brauner

Christian Brauner
2019-06-27 18:25:09 +0800

12 Jun, 2019

1 commit

68bc30bb9 proc: Add /proc/<pid>/arch_status

Exposing architecture specific per process information is useful for
various reasons. An example is the AVX512 usage on x86 which is important
for task placement for power/performance optimizations.

Adding this information to the existing /proc/pid/status file would be the
obvious choice, but it has been agreed that an explicit arch_status file
is better at separating the generic and architecture specific information.

[ tglx: Massage changelog ]

Signed-off-by: Aubrey Li
Signed-off-by: Thomas Gleixner
Acked-by: Andrew Morton
Cc: peterz@infradead.org
Cc: hpa@zytor.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: dave.hansen@intel.com
Cc: arjan@linux.intel.com
Cc: adobriyan@gmail.com
Cc: aubrey.li@intel.com
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Andi Kleen
Cc: Tim Chen
Cc: Dave Hansen
Cc: Arjan van de Ven
Cc: Alexey Dobriyan
Cc: Linux API
Link: https://lkml.kernel.org/r/20190606012236.9391-1-aubrey.li@linux.intel.com

Aubrey Li
2019-06-12 17:42:13 +0800

15 May, 2019

1 commit

e02c9b0d6 kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing

The name clear_all_latency_tracing is misleading: it only clears the
per-task latency_record[], and there is another function named
clear_global_latency_tracing which clears the global latency_record[]
buffer.

Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
Signed-off-by: Lin Feng
Cc: Alexey Dobriyan
Cc: Fabian Frederick
Cc: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lin Feng
2019-05-15 10:52:49 +0800

08 May, 2019

1 commit

f72dae208 Merge tag 'selinux-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:
"We've got a few SELinux patches for the v5.2 merge window, the
highlights are below:

- Add LSM hooks, and the SELinux implementation, for proper labeling
of kernfs. While we are only including the SELinux implementation
here, the rest of the LSM folks have given the hooks a thumbs-up.

- Update the SELinux mdp (Make Dummy Policy) script to actually work
on a modern system.

- Disallow userspace to change the LSM credentials via
/proc/self/attr when the task's credentials are already overridden.

The change was made in procfs because all the LSM folks agreed this
was the Right Thing To Do and duplicating it across each LSM was
going to be annoying"

* tag 'selinux-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
proc: prevent changes to overridden credentials
selinux: Check address length before reading address family
kernfs: fix xattr name handling in LSM helpers
MAINTAINERS: update SELinux file patterns
selinux: avoid uninitialized variable warning
selinux: remove useless assignments
LSM: lsm_hooks.h - fix missing colon in docstring
selinux: Make selinux_kernfs_init_security static
kernfs: initialize security of newly created nodes
selinux: implement the kernfs_init_security hook
LSM: add new hook for kernfs node initialization
kernfs: use simple_xattrs for security attributes
selinux: try security xattr after genfs for kernfs filesystems
kernfs: do not alloc iattrs in kernfs_xattr_get
kernfs: clean up struct kernfs_iattrs
scripts/selinux: fix build
selinux: use kernel linux/socket.h for genheaders and mdp
scripts/selinux: modernize mdp

Linus Torvalds
2019-05-08 09:48:09 +0800

29 Apr, 2019

2 commits

35a196bef proc: prevent changes to overridden credentials

Prevent userspace from changing the /proc/PID/attr values if the
task's credentials are currently overridden. This not only makes sense
conceptually, it also prevents some really bizarre error cases caused
when trying to commit credentials to a task with overridden
credentials.

Cc:
Reported-by: "chengjian (D)"
Signed-off-by: Paul Moore
Acked-by: John Johansen
Acked-by: James Morris
Acked-by: Casey Schaufler

Paul Moore
2019-04-29 21:51:21 +0800
e988e5ec1 proc: Simplify task stack retrieval

Replace the indirection through struct stack_trace with an invocation of
the storage array based interface.

Signed-off-by: Thomas Gleixner
Reviewed-by: Alexey Dobriyan
Reviewed-by: Josh Poimboeuf
Cc: Andy Lutomirski
Cc: Andrew Morton
Cc: Steven Rostedt
Cc: Alexander Potapenko
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: linux-mm@kvack.org
Cc: David Rientjes
Cc: Catalin Marinas
Cc: Dmitry Vyukov
Cc: Andrey Ryabinin
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport
Cc: Akinobu Mita
Cc: Christoph Hellwig
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy
Cc: Marek Szyprowski
Cc: Johannes Thumshirn
Cc: David Sterba
Cc: Chris Mason
Cc: Josef Bacik
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer
Cc: Alasdair Kergon
Cc: Daniel Vetter
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen
Cc: Maarten Lankhorst
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Tom Zanussi
Cc: Miroslav Benes
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094801.589304463@linutronix.de

Thomas Gleixner
2019-04-29 18:37:48 +0800

15 Apr, 2019

1 commit

accddc41b latency_top: Remove the ULONG_MAX stack trace hackery

No architecture terminates the stack trace with ULONG_MAX anymore. The
consumer terminates on the first zero entry or at the number of entries, so
no functional change.

Remove the cruft.
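
The consumer-side termination rule described above can be sketched as a small loop. `count_trace_entries()` is a hypothetical userspace helper written for illustration, not a function from the kernel tree.

```c
#include <stddef.h>

/*
 * Count the entries a consumer would process: stop at the first zero
 * entry or after nr_entries, whichever comes first. ULONG_MAX gets no
 * special treatment and counts as an ordinary entry.
 */
size_t count_trace_entries(const unsigned long *entries, size_t nr_entries)
{
    size_t i;

    for (i = 0; i < nr_entries; i++)
        if (entries[i] == 0)
            break;
    return i;
}
```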

Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra (Intel)
Cc: Josh Poimboeuf
Cc: Andy Lutomirski
Cc: Steven Rostedt
Cc: Alexander Potapenko
Link: https://lkml.kernel.org/r/20190410103644.853527514@linutronix.de

Thomas Gleixner
2019-04-15 01:58:31 +0800

04 Apr, 2019

1 commit

631b7abac ptrace: Remove maxargs from task_current_syscall()

task_current_syscall() has a single user that passes in 6 for maxargs, which
is the maximum number of system call arguments that can be fetched with
syscall_get_arguments(). Instead of passing in a number of arguments to
grab, just get 6 arguments. The args argument even specifies that it's an
array of 6 items.

This will also allow changing syscall_get_arguments() to not get a variable
number of arguments, but always grab 6.

Linus also suggested not passing in a bunch of arguments to
task_current_syscall() but to instead pass in a pointer to a structure, and
just fill the structure. struct seccomp_data has almost all the parameters
that are needed except for the stack pointer (sp). As seccomp_data is part of
uapi, and I'm afraid to change it, a new structure was created
"syscall_info", which includes seccomp_data and adds the "sp" field.

Link: http://lkml.kernel.org/r/20161107213233.466776454@goodmis.org
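
The shape of the structure described in this commit message can be sketched in userspace as below. The `_sketch` types are stand-ins defined here so the example is self-contained; they mirror the idea (a stack pointer plus seccomp_data with a fixed array of 6 arguments) and are not the kernel headers. `fill_syscall_info()` is a hypothetical helper.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the uapi struct seccomp_data (assumed field layout). */
struct seccomp_data_sketch {
    int      nr;                  /* system call number */
    uint32_t arch;                /* audit architecture token */
    uint64_t instruction_pointer;
    uint64_t args[6];             /* always exactly 6 arguments */
};

/* Stand-in for the new structure: seccomp_data plus the stack pointer. */
struct syscall_info_sketch {
    uint64_t sp;
    struct seccomp_data_sketch data;
};

/* Fill the structure in one go instead of passing many out-parameters. */
void fill_syscall_info(struct syscall_info_sketch *info, int nr,
                       uint64_t ip, uint64_t sp, const uint64_t args[6])
{
    memset(info, 0, sizeof(*info));
    info->sp = sp;
    info->data.nr = nr;
    info->data.instruction_pointer = ip;
    memcpy(info->data.args, args, sizeof(info->data.args));
}
```

Passing one pointer keeps the caller's signature stable if more fields are needed later, which is the design point the message attributes to Linus.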

Cc: Andy Lutomirski
Cc: Alexey Dobriyan
Cc: Oleg Nesterov
Cc: Kees Cook
Cc: Al Viro
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Thomas Gleixner
Signed-off-by: Steven Rostedt (VMware)

Steven Rostedt (Red Hat)
2019-04-04 21:17:15 +0800

17 Mar, 2019

1 commit

a9dce6679 Merge tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd system call from Christian Brauner:
"This introduces the ability to use file descriptors from /proc/<pid>/
as stable handles on struct pid. Even if a pid is recycled the handle
will not change. For a start these fds can be used to send signals to
the processes they refer to.

With the ability to use /proc/<pid> fds as stable handles on struct
pid we can fix a long-standing issue where after a process has exited
its pid can be reused by another process. If a caller sends a signal
to a reused pid it will end up signaling the wrong process.

With this patchset we enable a variety of use cases. One obvious
example is that we can now safely delegate an important part of
process management - sending signals - to processes other than the
parent of a given process by sending file descriptors around via scm
rights and not fearing that the given process will have been recycled
in the meantime. It also allows for easy testing whether a given
process is still alive or not by sending signal 0 to a pidfd which is
quite handy.

There has been some interest in this feature e.g. from systems
management (systemd, glibc) and container managers. I have requested
and gotten comments from glibc to make sure that this syscall is
suitable for their needs as well. In the future I expect it to take on
most other pid-based signal syscalls. But such features are left for
the future once they are needed.

This has been sitting in linux-next for quite a while and has not
caused any issues. It comes with selftests which verify basic
functionality and also test that a recycled pid cannot be signaled via
a pidfd.

Jon has written about a prior version of this patchset. It should
cover the basic functionality since not a lot has changed since then:

https://lwn.net/Articles/773459/

The commit message for the syscall itself is extensively documenting
the syscall, including its functionality and extensibility"

* tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests: add tests for pidfd_send_signal()
signal: add pidfd_send_signal() syscall

Linus Torvalds
2019-03-17 04:47:14 +0800

13 Mar, 2019

2 commits

94f8f3b02 proc: commit to genradix

The new generic radix trees have a simpler API and implementation, and
no limitations on number of elements, so all flex_array users are being
converted.

Link: http://lkml.kernel.org/r/20181217131929.11727-6-kent.overstreet@gmail.com
Signed-off-by: Kent Overstreet
Reviewed-by: Alexey Dobriyan
Cc: Al Viro
Cc: Dave Hansen
Cc: Eric Paris
Cc: Marcelo Ricardo Leitner
Cc: Matthew Wilcox
Cc: Neil Horman
Cc: Paul Moore
Cc: Pravin B Shelar
Cc: Shaohua Li
Cc: Stephen Smalley
Cc: Vlad Yasevich
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kent Overstreet
2019-03-13 01:04:02 +0800
d5a572a4c proc: calculate end pointer for /proc/*/* lookup at compile time

Compilers like to transform loops like

	for (i = 0; i < n; i++) {
		[use p[i]]
	}

into

	for (p = p0; p < end; p++) {
		...
	}

Do it by hand, so that it results in overall simpler loop
and smaller code.
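
For illustration, here are the two loop shapes side by side on a small lookup table. This is a userspace sketch with hypothetical names (`struct ent_sketch`, `lookup_by_index()`, `lookup_by_end()`); it shows the transformation only and is not the procfs code.

```c
#include <stddef.h>
#include <string.h>

struct ent_sketch {
    const char *name;
    int id;
};

/* Index form: the compiler has to derive an end pointer from n itself. */
int lookup_by_index(const struct ent_sketch *p, size_t n, const char *name)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(p[i].name, name) == 0)
            return p[i].id;
    return -1;
}

/* End-pointer form: the caller passes the end, which for a static table
 * is a constant known at compile time. */
int lookup_by_end(const struct ent_sketch *p, const struct ent_sketch *end,
                  const char *name)
{
    for (; p < end; p++)
        if (strcmp(p->name, name) == 0)
            return p->id;
    return -1;
}
```

For a static table `tbl`, the caller would pass `tbl + sizeof(tbl) / sizeof(tbl[0])` as `end`; both functions return the same result for the same input.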

Space savings:

	$ ./scripts/bloat-o-meter ../vmlinux-001 ../obj/vmlinux
	add/remove: 0/0 grow/shrink: 2/1 up/down: 4/-9 (-5)
	Function                                     old     new   delta
	proc_tid_base_lookup                          17      19      +2
	proc_tgid_base_lookup                         17      19      +2
	proc_pident_lookup                           179     170      -9

The same could be done to proc_pident_readdir(), but the code becomes
bigger for some reason.

[sfr@canb.auug.org.au: merge fix for proc_pident_lookup() API change]
Link: http://lkml.kernel.org/r/20190131160135.4a8ae70b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20190114200422.GB9680@avx2
Signed-off-by: Alexey Dobriyan
Signed-off-by: Stephen Rothwell
Cc: James Morris
Cc: Alexey Dobriyan
Cc: Casey Schaufler
Cc: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2019-03-13 01:04:01 +0800

08 Mar, 2019

2 commits

be37f21a0 Merge tag 'audit-pr-20190305' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
"A lucky 13 audit patches for v5.1.

Despite the rather large diffstat, most of the changes are from two
bug fix patches that move code from one Kconfig option to another.

Beyond that bit of churn, the remaining changes are largely cleanups
and bug-fixes as we slowly march towards container auditing. It isn't
all boring though, we do have a couple of new things: file
capabilities v3 support, and expanded support for filtering on
filesystems to solve problems with remote filesystems.

All changes pass the audit-testsuite. Please merge for v5.1"

* tag 'audit-pr-20190305' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: mark expected switch fall-through
audit: hide auditsc_get_stamp and audit_serial prototypes
audit: join tty records to their syscall
audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL
audit: remove unused actx param from audit_rule_match
audit: ignore fcaps on umount
audit: clean up AUDITSYSCALL prototypes and stubs
audit: more filter PATH records keyed on filesystem magic
audit: add support for fcaps v3
audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDIT
audit: add syscall information to CONFIG_CHANGE records
audit: hand taken context to audit_kill_trees for syscall logging
audit: give a clue what CONFIG_CHANGE op was involved

Linus Torvalds
2019-03-08 04:20:11 +0800
ae5906cee Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull security subsystem updates from James Morris:

- Extend LSM stacking to allow sharing of cred, file, ipc, inode, and
task blobs. This paves the way for more full-featured LSMs to be
merged, and is specifically aimed at LandLock and SARA LSMs. This
work is from Casey and Kees.

- There's a new LSM from Micah Morton: "SafeSetID gates the setid
family of syscalls to restrict UID/GID transitions from a given
UID/GID to only those approved by a system-wide whitelist." This
feature is currently shipping in ChromeOS.

* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (62 commits)
keys: fix missing __user in KEYCTL_PKEY_QUERY
LSM: Update list of SECURITYFS users in Kconfig
LSM: Ignore "security=" when "lsm=" is specified
LSM: Update function documentation for cap_capable
security: mark expected switch fall-throughs and add a missing break
tomoyo: Bump version.
LSM: fix return value check in safesetid_init_securityfs()
LSM: SafeSetID: add selftest
LSM: SafeSetID: remove unused include
LSM: SafeSetID: 'depend' on CONFIG_SECURITY
LSM: Add 'name' field for SafeSetID in DEFINE_LSM
LSM: add SafeSetID module that gates setid calls
LSM: add SafeSetID module that gates setid calls
tomoyo: Allow multiple use_group lines.
tomoyo: Coding style fix.
tomoyo: Swicth from cred->security to task_struct->security.
security: keys: annotate implicit fall throughs
security: keys: annotate implicit fall throughs
security: keys: annotate implicit fall through
capabilities:: annotate implicit fall through
...

Linus Torvalds
2019-03-08 03:44:01 +0800

06 Mar, 2019

3 commits

08b557751 proc: use seq_puts() everywhere

seq_printf() without format specifiers == faster seq_puts()

Link: http://lkml.kernel.org/r/20190114200545.GC9680@avx2
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2019-03-06 13:07:22 +0800
867aaccf1 proc: remove unused argument in proc_pid_lookup()

[adobriyan@gmail.com: delete "extern" from prototype]
Link: http://lkml.kernel.org/r/20190114195635.GA9372@avx2
Signed-off-by: Zhikang Zhang
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Zhikang Zhang
2019-03-06 13:07:21 +0800
3eb39f479 signal: add pidfd_send_signal() syscall

The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push to address this problem [1].

This patch uses file descriptors (fd) from /proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).

/* prototype and argument */
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);

/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 to minimize merge conflicts (cf. [25]).

In addition to the pidfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.
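
A minimal userspace sketch of the call, assuming a kernel that provides the syscall and a mounted /proc. `signal_via_proc_fd()` is a hypothetical wrapper; the fallback number 424 is the one quoted in this message and may differ on some architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424 /* number quoted in the commit message */
#endif

/*
 * Open /proc/<pid> as a directory fd and use it as the handle for
 * pidfd_send_signal(). info is NULL (kill()-like behaviour) and flags
 * is 0, as the message requires. Returns 0 on success, -1 with errno
 * set on failure.
 */
int signal_via_proc_fd(pid_t pid, int sig)
{
    char path[64];
    int fd, ret, saved;

    snprintf(path, sizeof(path), "/proc/%ld", (long)pid);
    fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    ret = (int)syscall(SYS_pidfd_send_signal, fd, sig, NULL, 0);
    saved = errno;   /* keep the syscall's errno across close() */
    close(fd);
    errno = saved;
    return ret;
}
```

Calling `signal_via_proc_fd(getpid(), 0)` is a harmless liveness probe, since signal 0 only performs the checks. On kernels without the syscall the wrapper is expected to fail with ENOSYS.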

/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), namely, when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.

/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signal to threads and
process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).

/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that, given that the
flags argument decides the scope of the signal instead of different types
of fds, it might make sense to settle for either "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed a strong preference for the "pidfd_"
prefix (cf. [13]) and other developers were less opinionated about the name,
we should settle for "pidfd_" to avoid further bikeshedding.

The "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
former because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.

/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.

/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siginfo_t would need to be set to (cf. [5], [6]).

/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]). The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siginfo_from_user_any() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. [12]).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.

/* testing */
This patch was tested on x64 and x86.

/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper; follows the syscall(2) convention of -1 plus errno. */
static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
				       unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
	/* Set errno so that the caller's strerror(errno) is meaningful. */
	errno = ENOSYS;
	return -1;
#endif
}

int main(int argc, char *argv[])
{
	int fd, ret, saved_errno, sig;

	if (argc < 3)
		exit(EXIT_FAILURE);

	/* argv[1] is a /proc/<pid> directory; the open fd acts as the pidfd. */
	fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
	if (fd < 0) {
		printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
		exit(EXIT_FAILURE);
	}

	sig = atoi(argv[2]);

	printf("Sending signal %d to process %s\n", sig, argv[1]);
	ret = do_pidfd_send_signal(fd, sig, NULL, 0);

	/* close() may clobber errno, so preserve it across the call. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	if (ret < 0) {
		printf("%s - Failed to send signal %d to process %s\n",
		       strerror(errno), sig, argv[1]);
		exit(EXIT_FAILURE);
	}

	exit(EXIT_SUCCESS);
}

/* Q&A
* Given that it seems the same questions get asked again by people who are
* late to the party it makes sense to add a Q&A section to the commit
* message so it's hopefully easier to avoid duplicate threads.
*
* For the sake of progress please consider these arguments settled unless
* there is a new point that desperately needs to be addressed. Please make
* sure to check the links to the threads in this commit message whether
* this has not already been covered.
*/
Q-01: (Florian Weimer [20], Andrew Morton [21])
What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).

Q-02: (Andrew Morton [21])
Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I
understand - was created precisely so that struct task_struct does
not need to be pinned (cf. [22]).

Q-03: (Andrew Morton [21])
Does the entire procfs directory remain visible? Just one entry
within it?
A-03: The same thing that happens right now when you hold a file descriptor
to /proc/<pid> open (cf. [22]).

Q-04: (Andrew Morton [21])
Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle, not that pids are not
recycled (cf. [22]).

Q-05: (Andrew Morton [21])
Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.

Q-06: (Andrew Morton [22])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscall just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscall to get such file descriptors is planned for a
future patchset (cf. [22]).

Q-07: (Andrew Morton [21] and others)
This fd-for-a-process sounds like a handy thing and people may well
think up other uses for it in the future, probably unrelated to
signals. Are the code and the interface designed to permit such
future applications?
A-07: Yes (cf. [22]).

Q-08: (Andrew Morton [21] and others)
Now I think about it, why a new syscall? This thing is looking
rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
preferred for a variety of reasons. Here are just a few taken from
prior threads. Syscalls are safer than ioctl()s especially when
signaling to fds. Processes are a core kernel concept so a syscall
seems more appropriate. The layout of the syscall with its four
arguments would require the addition of a custom struct for the
ioctl() thereby causing at least the same amount or even more
complexity for userspace than a simple syscall. The new syscall will
replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this
syscall will be extended with other syscalls in the future. See also
[22], [23] and various other threads already linked in here.

Q-09: (Florian Weimer [24])
What happens if you use the new interface with an O_PATH descriptor?
A-09:
pidfds opened as O_PATH fds cannot be used to send signals to a
process (cf. [2]). Signaling processes through pidfds is the
equivalent of writing to a file. Thus, this is not an operation that
operates "purely at the file descriptor level" as required by the
open(2) manpage. See also [4].

/* References */
[1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/

Cc: "Eric W. Biederman"
Cc: Jann Horn
Cc: Andy Lutomirski
Cc: Andrew Morton
Cc: Oleg Nesterov
Cc: Al Viro
Cc: Florian Weimer
Signed-off-by: Christian Brauner
Reviewed-by: Tycho Andersen
Reviewed-by: Kees Cook
Reviewed-by: David Howells
Acked-by: Arnd Bergmann
Acked-by: Thomas Gleixner
Acked-by: Serge Hallyn
Acked-by: Aleksa Sarai

Christian Brauner
2019-03-06 00:03:53 +0800

22 Feb, 2019

1 commit

b2b469939 proc, oom: do not report alien mms when setting oom_score_adj

Tetsuo has reported that creating thousands of processes sharing MM
without SIGHAND (aka alien threads) and setting
/proc/<pid>/oom_score_adj will swamp the kernel log and take ages [1]
to finish. This is especially worrisome because all that printing is done
under RCU lock and this can potentially trigger the RCU stall or softlockup
detector.

The primary reason for the printk was to catch potential users who might
depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj") but after more
than 2 years without a single report I guess it is safe to simply remove
the printk altogether.

The next step should be moving oom_score_adj over to the mm struct and
removing all the task crawling, as suggested by [2].

[1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz

Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Tetsuo Handa
Acked-by: Johannes Weiner
Cc: David Rientjes
Cc: Yong-Taek Lee
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2019-02-22 01:01:00 +0800

26 Jan, 2019

1 commit

4b7d248b3 audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDIT

loginuid and sessionid (and audit_log_session_info) should be part of
CONFIG_AUDIT scope and not CONFIG_AUDITSYSCALL since they are used in
CONFIG_CHANGE, ANOM_LINK, FEATURE_CHANGE (and INTEGRITY_RULE), none of
which are otherwise dependent on AUDITSYSCALL.

Please see github issue
https://github.com/linux-audit/audit-kernel/issues/104

Signed-off-by: Richard Guy Briggs
[PM: tweaked subject line for better grep'ing]
Signed-off-by: Paul Moore

Richard Guy Briggs
2019-01-26 02:03:23 +0800

09 Jan, 2019

1 commit

6d9c939db procfs: add smack subdir to attrs

Back in 2007 I made what turned out to be a rather serious
mistake in the implementation of the Smack security module.
The SELinux module used an interface in /proc to manipulate
the security context on processes. Rather than use a similar
interface, I used the same interface. The AppArmor team did
likewise. Now /proc/.../attr/current will tell you the
security "context" of the process, but it will be different
depending on the security module you're using.

This patch provides a subdirectory in /proc/.../attr for
Smack. Smack user space can use the "current" file in
this subdirectory and never have to worry about getting
SELinux attributes by mistake. Programs that use the
old interface will continue to work (or fail, as the case
may be) as before.

The proposed S.A.R.A security module is dependent on
the mechanism to create its own attr subdirectory.

The original implementation is by Kees Cook.

Signed-off-by: Casey Schaufler
Reviewed-by: Kees Cook
Signed-off-by: Kees Cook

Casey Schaufler
2019-01-09 05:18:44 +0800

05 Jan, 2019

2 commits

afe922c2d fs/proc/base.c: slightly faster /proc/*/limits

Header of /proc/*/limits is a fixed string, so print it directly without
formatting specifiers.

Link: http://lkml.kernel.org/r/20181203164242.GB6904@avx2
Signed-off-by: Alexey Dobriyan
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2019-01-05 05:13:45 +0800
8da0b4f69 fs/proc/base.c: use ns_capable instead of capable for timerslack_ns

Access to timerslack_ns is controlled by a process having CAP_SYS_NICE
in its effective capability set, but the current check looks in the root
namespace instead of the process' user namespace. Since a process is
allowed to do other activities controlled by CAP_SYS_NICE inside a
namespace, it should also be able to adjust timerslack_ns.

Link: http://lkml.kernel.org/r/20181030180012.232896-1-bmgordon@google.com
Signed-off-by: Benjamin Gordon
Acked-by: "Eric W. Biederman"
Cc: John Stultz
Cc: "Eric W. Biederman"
Cc: Kees Cook
Cc: "Serge E. Hallyn"
Cc: Thomas Gleixner
Cc: Arjan van de Ven
Cc: Oren Laadan
Cc: Ruchi Kandoi
Cc: Rom Lemarchand
Cc: Todd Kjos
Cc: Colin Cross
Cc: Nick Kralevich
Cc: Dmitry Shmidt
Cc: Elliott Hughes
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Benjamin Gordon
2019-01-05 05:13:45 +0800

29 Dec, 2018

1 commit

ca79b0c21 mm: convert totalram_pages and totalhigh_pages variables to atomic

totalram_pages and totalhigh_pages are made static inline functions.

The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here:
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
better to remove the lock and convert the variables to atomic, with
preventing potential store-to-read tearing as a bonus.
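
The shape of this conversion can be sketched in userspace with C11 atomics.
This is only an analogue under stated assumptions: the names
total_pages_counter, total_pages(), total_pages_add(), total_pages_inc() and
total_pages_dec() are hypothetical, and the kernel uses its own atomic types
and helpers rather than <stdatomic.h>.

```c
#include <stdatomic.h>

/* Userspace stand-in for the converted counter; the name is hypothetical. */
static atomic_ulong total_pages_counter;

/* Readers call an inline function instead of reading a plain global. */
static inline unsigned long total_pages(void)
{
	return atomic_load(&total_pages_counter);
}

/* Writers update atomically, so no lock is needed around the update.
 * A negative count wraps as unsigned arithmetic, i.e. it subtracts. */
static inline void total_pages_add(long count)
{
	atomic_fetch_add(&total_pages_counter, (unsigned long)count);
}

static inline void total_pages_inc(void)
{
	atomic_fetch_add(&total_pages_counter, 1);
}

static inline void total_pages_dec(void)
{
	atomic_fetch_sub(&total_pages_counter, 1);
}
```

Because every access goes through one atomic load or read-modify-write, a
reader cannot observe a half-written value.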

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS
Suggested-by: Michal Hocko
Suggested-by: Vlastimil Babka
Reviewed-by: Konstantin Khlebnikov
Reviewed-by: Pavel Tatashin
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: David Hildenbrand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arun KS
2018-12-29 04:11:47 +0800

02 Nov, 2018

1 commit

2d6bb6adb Merge tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull stackleak gcc plugin from Kees Cook:
"Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
was ported from grsecurity by Alexander Popov. It provides efficient
stack content poisoning at syscall exit. This creates a defense
against at least two classes of flaws:

- Uninitialized stack usage. (We continue to work on improving the
compiler to do this in other ways: e.g. unconditional zero init was
proposed to GCC and Clang, and more plugin work has started too).

- Stack content exposure. By greatly reducing the lifetime of valid
stack contents, exposures via either direct read bugs or unknown
cache side-channels become much more difficult to exploit. This
complements the existing buddy and heap poisoning options, but
provides the coverage for stacks.

The x86 hooks are included in this series (which have been reviewed by
Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
been merged through the arm64 tree (written by Laura Abbott and
reviewed by Mark Rutland and Will Deacon).

With VLAs having been removed this release, there is no need for
alloca() protection, so it has been removed from the plugin"

* tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
arm64: Drop unneeded stackleak_check_alloca()
stackleak: Allow runtime disabling of kernel stack erasing
doc: self-protection: Add information about STACKLEAK feature
fs/proc: Show STACKLEAK metrics in the /proc file system
lkdtm: Add a test for STACKLEAK
gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls

Linus Torvalds
2018-11-02 02:46:27 +0800

06 Oct, 2018

1 commit

f8a00cef1 proc: restrict kernel stack dumps to root

Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU. That means that
the stack can change under the stack walker. The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers. This can cause exposure of
kernel stack contents to userspace.

Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents. See the added comment for a longer
rationale.

There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails. Therefore, I believe
that this change is unlikely to break things. In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.

Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f50 ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn
Acked-by: Kees Cook
Cc: Alexey Dobriyan
Cc: Ken Chen
Cc: Will Deacon
Cc: Laura Abbott
Cc: Andy Lutomirski
Cc: Catalin Marinas
Cc: Josh Poimboeuf
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H . Peter Anvin"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman

Jann Horn
2018-10-06 07:32:05 +0800

05 Sep, 2018

1 commit

c8d126275 fs/proc: Show STACKLEAK metrics in the /proc file system

Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
shows the maximum kernel stack consumption for the current and previous
syscalls. Although this information is not precise, it can be useful for
estimating the STACKLEAK performance impact for your workloads.

Suggested-by: Ingo Molnar
Signed-off-by: Alexander Popov
Tested-by: Laura Abbott
Signed-off-by: Kees Cook

Alexander Popov
2018-09-05 01:35:48 +0800

23 Aug, 2018

4 commits

f6d2f584d proc: use macro in /proc/latency hook

->latency_record is defined as

struct latency_record[LT_SAVECOUNT];

so use the same macro while iterating.
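
A minimal sketch of the idea, not the kernel code: the loop bound is the
same macro that sizes the array, so the two cannot drift apart. The struct
layouts, the function name total_latency_hits() and the value 32 are
assumptions made for illustration.

```c
#include <stddef.h>

/* Assumed value for illustration; the kernel defines its own LT_SAVECOUNT. */
#define LT_SAVECOUNT 32

/* Simplified, hypothetical record; the kernel struct has more fields. */
struct latency_record_sketch {
	unsigned int count;
	unsigned long time;
};

struct task_sketch {
	struct latency_record_sketch latency_record[LT_SAVECOUNT];
};

/* Sum the hit counts, bounding the loop with the array's own macro. */
unsigned long total_latency_hits(const struct task_sketch *task)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < LT_SAVECOUNT; i++)
		total += task->latency_record[i].count;
	return total;
}
```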

Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
Signed-off-by: Alexey Dobriyan
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2018-08-23 01:52:46 +0800
41089b6d3 proc: save 2 atomic ops on write to "/proc/*/attr/*"

Code checks if write is done by current to its own attributes.
For that get/put pair is unnecessary as it can be done under RCU.

Note: rcu_read_unlock() can be done even earlier since pointer to a task
is not dereferenced. It depends if /proc code should look scary or not:

	rcu_read_lock();
	task = pid_task(...);
	rcu_read_unlock();
	if (!task)
		return -ESRCH;
	if (task != current)
		return -EACCES;

P.S.: rename "length" variable. Code like this

length = -EINVAL;

should not exist.

Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
Signed-off-by: Alexey Dobriyan
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2018-08-23 01:52:45 +0800
a44937fe4 proc: put task earlier in /proc/*/fail-nth

Link: http://lkml.kernel.org/r/20180627195427.GE18113@avx2
Signed-off-by: Alexey Dobriyan
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2018-08-23 01:52:45 +0800
871305bb2 mm: /proc/pid/*maps remove is_pid and related wrappers

Patch series "cleanups and refactor of /proc/pid/smaps*".

The recent regression in /proc/pid/smaps made me look more into the code.
Especially the issues with smaps_rollup reported in [1] as explained in
Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are
preparations for that. Patch 1 is me realizing that there's a lot of
boilerplate left from times where we tried (unsuccessfully) to mark thread
stacks in the output.

Originally I had also plans to rework the translation from
/proc/pid/*maps* file offsets to the internal structures. Now the offset
means "vma number", which is not really stable (vma's can come and go
between read() calls) and there's an extra caching of last vma's address.
My idea was that offsets would be interpreted directly as addresses, which
would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
long so that might be insufficient somewhere for the unsigned long
addresses.

So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
simpler smaps code, and a lot of unused code removed.

[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

This patch (of 4):

Commit b76437579d13 ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced differences between /proc/PID/maps and
/proc/PID/task/TID/maps to mark thread stacks properly, and this was
also done for smaps and numa_maps. However it didn't work properly and
was ultimately removed by commit b18cb64ead40 ("fs/proc: Stop trying to
report thread stacks").

Now the is_pid parameter for the related show_*() functions is unused
and we can remove it together with wrapper functions and ops structures
that differ for PID and TID cases only in this parameter.

Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Reviewed-by: Alexey Dobriyan
Cc: Daniel Colascione
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vlastimil Babka
2018-08-23 01:52:44 +0800

20 Jun, 2018

1 commit

f5b65348f proc: fix missing final NUL in get_mm_cmdline() rewrite

The rewrite of the cmdline fetching missed the fact that we used to also
return the final terminating NUL character of the last argument. I
hadn't noticed, and none of the tools I tested cared, but something
obviously must care, because Michal Kubecek noticed the change in
behavior.

Tweak the "find the end" logic to actually include the NUL character,
and once past the end of argv, always start the strnlen() at the
expected (original) argument end.

This whole "allow people to rewrite their arguments in place" is a nasty
hack and requires that odd slop handling at the end of the argv array,
but it's our traditional model, so we continue to support it.
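
One plausible way a consumer can depend on the final NUL, sketched under
the assumption that it counts arguments by counting terminators (the
function name count_terminated_args() is hypothetical and is not taken from
any tool mentioned here):

```c
#include <stddef.h>

/* Count NUL-terminated strings in a cmdline-style buffer. */
size_t count_terminated_args(const char *buf, size_t len)
{
	size_t i, count = 0;

	for (i = 0; i < len; i++) {
		if (buf[i] == '\0')
			count++;
	}
	return count;
}
```

With the buffer "ls\0-l\0" (6 bytes) this counts two arguments; if the
trailing NUL is dropped (5 bytes) it counts only one.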

Reported-and-bisected-by: Michal Kubecek
Reviewed-and-tested-by: Michal Kubecek
Cc: Alexey Dobriyan
Signed-off-by: Linus Torvalds

Linus Torvalds
2018-06-20 14:38:28 +0800

15 Jun, 2018

1 commit

26b95137d proc: skip branch in /proc/*/* lookup

Code is structured like this:

	for ( ... p < last; p++) {
		if (memcmp == 0)
			break;
	}
	if (p >= last)
		ERROR
	OK

gcc doesn't see that if the lookup succeeds then the post-loop branch will
never be taken, and so it does not skip that branch.
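
The two loop shapes can be compared in a small userspace sketch. The table,
the struct and both function names are hypothetical stand-ins, not the
procfs code; the point is only that handling the hit inside the loop removes
the second bounds test on the success path.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table standing in for the /proc/<pid>/ entries. */
struct entry_sketch {
	const char *name;
	size_t len;
	int value;
};

static const struct entry_sketch entries[] = {
	{ "stat", 4, 1 }, { "status", 6, 2 }, { "maps", 4, 3 },
};

#define N_ENTRIES (sizeof(entries) / sizeof(entries[0]))

/* Before: break out of the loop, then test p again to detect a miss. */
int lookup_with_post_check(const char *name, size_t len)
{
	const struct entry_sketch *p = entries;
	const struct entry_sketch *last = entries + N_ENTRIES;

	for (; p < last; p++) {
		if (p->len == len && memcmp(name, p->name, len) == 0)
			break;
	}
	if (p >= last)
		return -1;
	return p->value;
}

/* After: handle the hit inside the loop; falling out means a miss. */
int lookup_without_post_check(const char *name, size_t len)
{
	const struct entry_sketch *p = entries;
	const struct entry_sketch *last = entries + N_ENTRIES;

	for (; p < last; p++) {
		if (p->len == len && memcmp(name, p->name, len) == 0)
			return p->value;
	}
	return -1;
}
```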

[akpm@linux-foundation.org: proc_pident_instantiate() no longer takes an inode*]
Link: http://lkml.kernel.org/r/20180423213954.GD9043@avx2
Signed-off-by: Alexey Dobriyan
Reviewed-by: Andrew Morton
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2018-06-15 06:55:24 +0800