Eric Lee / smarc-fsl-linux-kernel

18 Sep, 2019

1 commit

7f2444d38 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull core timer updates from Thomas Gleixner:
"Timers and timekeeping updates:

- A large overhaul of the posix CPU timer code which is a preparation
for moving the CPU timer expiry out into task work so it can be
properly accounted on the task/process.

An update to the bogus permission checks will come later during the
merge window as feedback was not complete before heading of for
travel.

- Switch the timerqueue code to use cached rbtrees and get rid of the
homebrewn caching of the leftmost node.

- Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
single function

- Implement the separation of hrtimers to be forced to expire in hard
interrupt context even when PREEMPT_RT is enabled and mark the
affected timers accordingly.

- Implement a mechanism for hrtimers and the timer wheel to protect
RT against priority inversion and live lock issues when a (hr)timer
which should be canceled is currently executing the callback.
Instead of infinitely spinning, the task which tries to cancel the
timer blocks on a per cpu base expiry lock which is held and
released by the (hr)timer expiry code.

- Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
resulting in faster access to timekeeping functions.

- Updates to various clocksource/clockevent drivers and their device
tree bindings.

- The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
posix-cpu-timers: Fix permission check regression
posix-cpu-timers: Always clear head pointer on dequeue
hrtimer: Add a missing bracket and hide `migration_base' on !SMP
posix-cpu-timers: Make expiry_active check actually work correctly
posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
tick: Mark sched_timer to expire in hard interrupt context
hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
posix-cpu-timers: Utilize timerqueue for storage
posix-cpu-timers: Move state tracking to struct posix_cputimers
posix-cpu-timers: Deduplicate rlimit handling
posix-cpu-timers: Remove pointless comparisons
posix-cpu-timers: Get rid of 64bit divisions
posix-cpu-timers: Consolidate timer expiry further
posix-cpu-timers: Get rid of zero checks
rlimit: Rewrite non-sensical RLIMIT_CPU comment
posix-cpu-timers: Respect INFINITY for hard RTTIME limit
posix-cpu-timers: Switch thread group sampling to array
posix-cpu-timers: Restructure expiry array
posix-cpu-timers: Remove cputime_expires
...

Linus Torvalds
2019-09-18 03:35:15 +0800

17 Sep, 2019

1 commit

22331f895 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x86 cpu-feature updates from Ingo Molnar:

- Rework the Intel model names symbols/macros, which were decades of
ad-hoc extensions and added random noise. It's now a coherent, easy
to follow nomenclature.

- Add new Intel CPU model IDs:
- "Tiger Lake" desktop and mobile models
- "Elkhart Lake" model ID
- and the "Lightning Mountain" variant of Airmont, plus support code

- Add the new AVX512_VP2INTERSECT instruction to cpufeatures

- Remove Intel MPX user-visible APIs and the self-tests, because the
toolchain (gcc) is not supporting it going forward. This is the
first, lowest-risk phase of MPX removal.

- Remove X86_FEATURE_MFENCE_RDTSC

- Various smaller cleanups and fixes

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
x86/cpu: Update init data for new Airmont CPU model
x86/cpu: Add new Airmont variant to Intel family
x86/cpu: Add Elkhart Lake to Intel family
x86/cpu: Add Tiger Lake to Intel family
x86: Correct misc typos
x86/intel: Add common OPTDIFFs
x86/intel: Aggregate microserver naming
x86/intel: Aggregate big core graphics naming
x86/intel: Aggregate big core mobile naming
x86/intel: Aggregate big core client naming
x86/cpufeature: Explain the macro duplication
x86/ftrace: Remove mcount() declaration
x86/PCI: Remove superfluous returns from void functions
x86/msr-index: Move AMD MSRs where they belong
x86/cpu: Use constant definitions for CPU models
lib: Remove redundant ftrace flag removal
x86/crash: Remove unnecessary comparison
x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
x86: Remove X86_FEATURE_MFENCE_RDTSC
x86/mpx: Remove MPX APIs
...

Linus Torvalds
2019-09-17 09:47:53 +0800

28 Aug, 2019

2 commits

2bbdbdae0 posix-cpu-timers: Get rid of zero checks ... Browse Code »

Deactivation of the expiry cache is done by setting all clock caches to
0. That requires to have a check for zero in all places which update the
expiry cache:

if (cache == 0 || new < cache)
cache = new;

Use U64_MAX as the deactivated value, which allows to remove the zero
checks when updating the cache and reduces it to the obvious check:

if (new < cache)
cache = new;

This also removes the weird workaround in do_prlimit() which was required
to convert a RLIMIT_CPU value of 0 (immediate expiry) to 1 because handing
in 0 to the posix CPU timer code would have effectively disarmed it.

Signed-off-by: Thomas Gleixner
Reviewed-by: Frederic Weisbecker
Link: https://lkml.kernel.org/r/20190821192922.275086128@linutronix.de

Thomas Gleixner
2019-08-28 17:50:40 +0800
24db4dd90 rlimit: Rewrite non-sensical RLIMIT_CPU comment ... Browse Code »

The comment above the function which arms RLIMIT_CPU in the posix CPU timer
code makes no sense at all. It claims that the kernel does not return an
error code when it rejected the attempt to set RLIMIT_CPU. That's clearly
bogus as the code does an error check and the rlimit is only set and
activated when the permission checks are ok. In case of a rejection an
appropriate error code is returned.

This is a historical and outdated comment which got dragged along even when
the rlimit handling code was rewritten.

Replace it with an explanation why the setup function is not called when
the rlimit value is RLIM_INFINITY and how the 'disarming' is handled.

Signed-off-by: Thomas Gleixner
Reviewed-by: Frederic Weisbecker
Link: https://lkml.kernel.org/r/20190821192922.185511287@linutronix.de

Thomas Gleixner
2019-08-28 17:50:40 +0800

21 Aug, 2019

1 commit

3e91ec89f arm64: Tighten the PR_{SET, GET}_TAGGED_ADDR_CTRL prctl() unused arguments ... Browse Code »

Require that arg{3,4,5} of the PR_{SET,GET}_TAGGED_ADDR_CTRL prctl and
arg2 of the PR_GET_TAGGED_ADDR_CTRL prctl() are zero rather than ignored
for future extensions.

Acked-by: Andrey Konovalov
Signed-off-by: Catalin Marinas
Signed-off-by: Will Deacon

Catalin Marinas
2019-08-21 01:17:55 +0800

07 Aug, 2019

1 commit

63f0c6037 arm64: Introduce prctl() options to control the tagged user addresses ABI ... Browse Code »

It is not desirable to relax the ABI to allow tagged user addresses into
the kernel indiscriminately. This patch introduces a prctl() interface
for enabling or disabling the tagged ABI with a global sysctl control
for preventing applications from enabling the relaxed ABI (meant for
testing user-space prctl() return error checking without reconfiguring
the kernel). The ABI properties are inherited by threads of the same
application and fork()'ed children but cleared on execve(). A Kconfig
option allows the overall disabling of the relaxed ABI.

The PR_SET_TAGGED_ADDR_CTRL will be expanded in the future to handle
MTE-specific settings like imprecise vs precise exceptions.

Reviewed-by: Kees Cook
Signed-off-by: Catalin Marinas
Signed-off-by: Andrey Konovalov
Signed-off-by: Will Deacon

Catalin Marinas
2019-08-07 01:08:45 +0800

22 Jul, 2019

1 commit

f240652b6 x86/mpx: Remove MPX APIs ... Browse Code »

MPX is being removed from the kernel due to a lack of support in the
toolchain going forward (gcc).

The first step is to remove the userspace-visible ABIs so that applications
will stop using it. The most visible one are the enable/disable prctl()s.
Remove them first.

This is the most minimal and least invasive change needed to ensure that
apps stop using MPX with new kernels.

Signed-off-by: Dave Hansen
Signed-off-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190705175321.DB42F0AD@viggo.jf.intel.com

Dave Hansen
2019-07-22 17:54:57 +0800

02 Jun, 2019

2 commits

bc81426f5 prctl_set_mm: downgrade mmap_sem to read lock ... Browse Code »

The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
semaphore taken.") added synchronization of reading argument/environment
boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce
arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
the coarse use of mmap_sem in similar situations. But there still
remained two places that (mis)use mmap_sem.

get_cmdline should also use arg_lock instead of mmap_sem when it reads the
boundaries.

The second place that should use arg_lock is in prctl_set_mm. By
protecting the boundaries fields with the arg_lock, we can downgrade
mmap_sem to reader lock (analogous to what we already do in
prctl_set_mm_map).

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
Signed-off-by: Michal Koutný
Signed-off-by: Laurent Dufour
Co-developed-by: Laurent Dufour
Reviewed-by: Cyrill Gorcunov
Acked-by: Michal Hocko
Cc: Yang Shi
Cc: Mateusz Guzik
Cc: Kirill Tkhai
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Koutný
2019-06-02 06:51:31 +0800
11bbd8b41 prctl_set_mm: refactor checks from validate_prctl_map ... Browse Code »

Despite comment of validate_prctl_map claims there are no capability
checks, it is not completely true since commit 4d28df6152aa ("prctl: Allow
local CAP_SYS_ADMIN changing exe_file"). Extract the check out of the
function and make the function perform purely arithmetic checks.

This patch should not change any behavior, it is mere refactoring for
following patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20190502125203.24014-2-mkoutny@suse.com
Signed-off-by: Michal Koutný
Reviewed-by: Kirill Tkhai
Reviewed-by: Cyrill Gorcunov
Cc: Kirill Tkhai
Cc: Laurent Dufour
Cc: Mateusz Guzik
Cc: Michal Hocko
Cc: Yang Shi
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Koutný
2019-06-02 06:51:31 +0800

15 May, 2019

1 commit

a9e73998f kernel/sys.c: prctl: fix false positive in validate_prctl_map() ... Browse Code »

While validating new map we require the @start_data to be strictly less
than @end_data, which is fine for regular applications (this is why this
nit didn't trigger for that long). These members are set from executable
loaders such as elf handers, still it is pretty valid to have a loadable
data section with zero size in file, in such case the start_data is equal
to end_data once kernel loader finishes.

As a result when we're trying to restore such programs the procedure fails
and the kernel returns -EINVAL. From the image dump of a program:

| "mm_start_code": "0x400000",
| "mm_end_code": "0x8f5fb4",
| "mm_start_data": "0xf1bfb0",
| "mm_end_data": "0xf1bfb0",

Thus we need to change validate_prctl_map from strictly less to less or
equal operator use.

Link: http://lkml.kernel.org/r/20190408143554.GY1421@uranus.lan
Fixes: f606b77f1a9e3 ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
Signed-off-by: Cyrill Gorcunov
Cc: Andrey Vagin
Cc: Dmitry Safonov
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2019-05-15 00:47:44 +0800

08 Mar, 2019

2 commits

b5dd0c658 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge more updates from Andrew Morton:

- some of the rest of MM

- various misc things

- dynamic-debug updates

- checkpatch

- some epoll speedups

- autofs

- rapidio

- lib/, lib/lzo/ updates

* emailed patches from Andrew Morton : (83 commits)
samples/mic/mpssd/mpssd.h: remove duplicate header
kernel/fork.c: remove duplicated include
include/linux/relay.h: fix percpu annotation in struct rchan
arch/nios2/mm/fault.c: remove duplicate include
unicore32: stop printing the virtual memory layout
MAINTAINERS: fix GTA02 entry and mark as orphan
mm: create the new vm_fault_t type
arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
arch: simplify several early memory allocations
openrisc: simplify pte_alloc_one_kernel()
sh: prefer memblock APIs returning virtual address
microblaze: prefer memblock API returning virtual address
powerpc: prefer memblock APIs returning virtual address
lib/lzo: separate lzo-rle from lzo
lib/lzo: implement run-length encoding
lib/lzo: fast 8-byte copy on arm64
lib/lzo: 64-bit CTZ on arm64
lib/lzo: tidy-up ifdefs
ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
ipc: annotate implicit fall through
...

Linus Torvalds
2019-03-08 11:25:37 +0800
21f63a5da kernel/sys: annotate implicit fall through ... Browse Code »

There is a plan to build the kernel with -Wimplicit-fallthrough and this
place in the code produced a warning (W=1).

This commit remove the following warning:

kernel/sys.c:1748:6: warning: this statement may fall through [-Wimplicit-fallthrough=]

Link: http://lkml.kernel.org/r/20190114203347.17530-1-malat@debian.org
Signed-off-by: Mathieu Malaterre
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathieu Malaterre
2019-03-08 10:31:59 +0800

26 Jan, 2019

1 commit

40852275a LSM: add SafeSetID module that gates setid calls ... Browse Code »

This change ensures that the set*uid family of syscalls in kernel/sys.c
(setreuid, setuid, setresuid, setfsuid) all call ns_capable_common with
the CAP_OPT_INSETID flag, so capability checks in the security_capable
hook can know whether they are being called from within a set*uid
syscall. This change is a no-op by itself, but is needed for the
proposed SafeSetID LSM.

Signed-off-by: Micah Morton
Acked-by: Kees Cook
Signed-off-by: James Morris

Micah Morton
2019-01-26 03:22:43 +0800

14 Jan, 2019

1 commit

b7285b425 kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymore ... Browse Code »

UNAME26 is a mechanism to report Linux's version as 2.6.x, for
compatibility with old/broken software. Due to the way it is
implemented, it would have to be updated after 5.0, to keep the
resulting versions unique. Linus Torvalds argued:

"Do we actually need this?

I'd rather let it bitrot, and just let it return random versions. It
will just start again at 2.4.60, won't it?

Anybody who uses UNAME26 for a 5.x kernel might as well think it's
still 4.x. The user space is so old that it can't possibly care about
differences between 4.x and 5.x, can it?

The only thing that matters is that it shows "2.4.",
which it will do regardless"

Signed-off-by: Jonathan Neuschäfer
Signed-off-by: Linus Torvalds

Jonathan Neuschäfer
2019-01-14 06:38:03 +0800

04 Jan, 2019

1 commit

96d4f267e Remove 'type' argument from access_ok() function ... Browse Code »

Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

- csky still had the old "verify_area()" name as an alias.

- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)

- microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds

Linus Torvalds
2019-01-04 10:57:57 +0800

14 Dec, 2018

1 commit

ba8308856 arm64: add prctl control for resetting ptrauth keys ... Browse Code »

Add an arm64-specific prctl to allow a thread to reinitialize its
pointer authentication keys to random values. This can be useful when
exec() is not used for starting new processes, to ensure that different
processes still have different keys.

Signed-off-by: Kristina Martsenko
Signed-off-by: Will Deacon

Kristina Martsenko
2018-12-14 00:42:46 +0800

21 Sep, 2018

1 commit

3bf181bc5 kernel/sys.c: remove duplicated include ... Browse Code »

Link: http://lkml.kernel.org/r/20180821133424.18716-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman

YueHaibing
2018-09-21 04:01:11 +0800

25 Aug, 2018

1 commit

4def19636 Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace fixes from Eric Biederman:
"This is a set of four fairly obvious bug fixes:

- a switch from d_find_alias to d_find_any_alias because the xattr
code perversely takes a dentry

- two mutex vs copy_to_user fixes from Jann Horn

- a fix to use a sanitized size not the size userspace passed in from
Christian Brauner"

* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
getxattr: use correct xattr length
sys: don't hold uts_sem while accessing userspace memory
userns: move user access out of the mutex
cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()

Linus Torvalds
2018-08-25 00:25:39 +0800

11 Aug, 2018

1 commit

42a0cc347 sys: don't hold uts_sem while accessing userspace memory ... Browse Code »

Holding uts_sem as a writer while accessing userspace memory allows a
namespace admin to stall all processes that attempt to take uts_sem.
Instead, move data through stack buffers and don't access userspace memory
while uts_sem is held.

Cc: stable@vger.kernel.org
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jann Horn
Signed-off-by: Eric W. Biederman

Jann Horn
2018-08-11 15:05:53 +0800

19 Jun, 2018

1 commit

dc1b7b6ca sysinfo: Remove get_monotonic_boottime() ... Browse Code »

get_monotonic_boottime() is deprecated because it uses the old 'timespec'
structure. This replaces one of the last callers with a call to
ktime_get_boottime.

Signed-off-by: Arnd Bergmann
Signed-off-by: Thomas Gleixner
Reviewed-by: Cyrill Gorcunov
Cc: Andrew Morton
Cc: y2038@lists.linaro.org
Cc: Dominik Brodowski
Cc: Cyrill Gorcunov
Link: https://lkml.kernel.org/r/20180618150114.849216-1-arnd@arndb.de

Arnd Bergmann
2018-06-19 15:56:27 +0800

08 Jun, 2018

1 commit

88aa7cc68 mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct ... Browse Code »

mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issue like below:

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.

This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.

[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi
Reviewed-by: Cyrill Gorcunov
Acked-by: Michal Hocko
Cc: Alexey Dobriyan
Cc: Matthew Wilcox
Cc: Mateusz Guzik
Cc: Kirill Tkhai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yang Shi
2018-06-08 08:34:34 +0800

26 May, 2018

1 commit

23d6aef74 kernel/sys.c: fix potential Spectre v1 issue ... Browse Code »

`resource' can be controlled by user-space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)

Fix this by sanitizing *resource* before using it to index
current->signal->rlim

Notice that given that speculation windows are large, the policy is to
kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com
Signed-off-by: Gustavo A. R. Silva
Reviewed-by: Andrew Morton
Cc: Alexei Starovoitov
Cc: Dan Williams
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gustavo A. R. Silva
2018-05-26 09:12:11 +0800

03 May, 2018

2 commits

7bbf1373e nospec: Allow getting/setting on non-current task ... Browse Code »

Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
current.

This is needed both for /proc/$pid/status queries and for seccomp (since
thread-syncing can trigger seccomp in non-current threads).

Signed-off-by: Kees Cook
Signed-off-by: Thomas Gleixner

Kees Cook
2018-05-03 19:55:51 +0800
b617cfc85 prctl: Add speculation control prctls ... Browse Code »

Add two new prctls to control aspects of speculation related vulnerabilites
and their mitigations to provide finer grained control over performance
impacting mitigations.

PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
the following meaning:

Bit Define Description
0 PR_SPEC_PRCTL Mitigation can be controlled per task by
PR_SET_SPECULATION_CTRL
1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is
disabled
2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is
enabled

If all bits are 0 the CPU is not affected by the speculation misfeature.

If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
misfeature will fail.

PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which
is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.

The common return values are:

EINVAL prctl is not implemented by the architecture or the unused prctl()
arguments are not 0
ENODEV arg2 is selecting a not supported speculation misfeature

PR_SET_SPECULATION_CTRL has these additional return values:

ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
ENXIO prctl control of the selected speculation misfeature is disabled

The first supported controlable speculation misfeature is
PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
architectures.

Based on an initial patch from Tim Chen and mostly rewritten.

Signed-off-by: Thomas Gleixner
Reviewed-by: Ingo Molnar
Reviewed-by: Konrad Rzeszutek Wilk

Thomas Gleixner
2018-05-03 19:55:50 +0800

03 Apr, 2018

3 commits

e2aaa9f42 kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid() ... Browse Code »

Using this helper allows us to avoid the in-kernel call to the
sys_setsid() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_setsid().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro
Signed-off-by: Dominik Brodowski

Dominik Brodowski
2018-04-03 02:16:06 +0800
e530dca58 kernel: provide ksys_*() wrappers for syscalls called by kernel/uid16.c ... Browse Code »

Using these helpers allows us to avoid the in-kernel calls to these
syscalls: sys_setregid(), sys_setgid(), sys_setreuid(), sys_setuid(),
sys_setresuid(), sys_setresgid(), sys_setfsuid(), and sys_setfsgid().

The ksys_ prefix denotes that these function are meant as a drop-in
replacement for the syscall. In particular, they use the same calling
convention.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro
Cc: Eric W. Biederman
Cc: Andrew Morton
Signed-off-by: Dominik Brodowski

Dominik Brodowski
2018-04-03 02:15:30 +0800
192c58073 kernel: add do_getpgid() helper; remove internal call to sys_getpgid() ... Browse Code »

Using the do_getpgid() helper removes an in-kernel call to the
sys_getpgid() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro
Signed-off-by: Dominik Brodowski

Dominik Brodowski
2018-04-03 02:15:28 +0800

15 Dec, 2017

1 commit

8b2770a4e fix typo in assignment of fs default overflow gid ... Browse Code »

The patch remains without practical effect since both macros carry
identical values. Still, it might become a problem in the future if
(for whatever reason) the default overflow uid and gid differ. The
DEFAULT_FS_OVERFLOWGID macro was previously unused.

Signed-off-by: Wolffhardt Schwabe
Signed-off-by: Anatoliy Cherepantsev
Signed-off-by: Eric W. Biederman

Wolffhardt Schwabe
2017-12-15 06:01:45 +0800

16 Nov, 2017

1 commit

c9b012e5f Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux ... Browse Code »

Pull arm64 updates from Will Deacon:
"The big highlight is support for the Scalable Vector Extension (SVE)
which required extensive ABI work to ensure we don't break existing
applications by blowing away their signal stack with the rather large
new vector context ( of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (97 commits)
arm64: Make ARMV8_DEPRECATED depend on SYSCTL
arm64: Implement __lshrti3 library function
arm64: support __int128 on gcc 5+
arm64/sve: Add documentation
arm64/sve: Detect SVE and activate runtime support
arm64/sve: KVM: Hide SVE from CPU features exposed to guests
arm64/sve: KVM: Treat guest SVE use as undefined instruction execution
arm64/sve: KVM: Prevent guests from using SVE
arm64/sve: Add sysctl to set the default vector length for new processes
arm64/sve: Add prctl controls for userspace vector length management
arm64/sve: ptrace and ELF coredump support
arm64/sve: Preserve SVE registers around EFI runtime service calls
arm64/sve: Preserve SVE registers around kernel-mode NEON use
arm64/sve: Probe SVE capabilities and usable vector lengths
arm64: cpufeature: Move sys_caps_initialised declarations
arm64/sve: Backend logic for setting the vector length
arm64/sve: Signal handling support
arm64/sve: Support vector length resetting for new processes
arm64/sve: Core task context handling
arm64/sve: Low-level CPU setup
...

Linus Torvalds
2017-11-16 02:56:56 +0800

03 Nov, 2017

1 commit

2d2123bc7 arm64/sve: Add prctl controls for userspace vector length management ... Browse Code »

This patch adds two arm64-specific prctls, to permit userspace to
control its vector length:

* PR_SVE_SET_VL: set the thread's SVE vector length and vector
length inheritance mode.

* PR_SVE_GET_VL: get the same information.

Although these prctls resemble instruction set features in the SVE
architecture, they provide additional control: the vector length
inheritance mode is Linux-specific and nothing to do with the
architecture, and the architecture does not permit EL0 to set its
own vector length directly. Both can be used in portable tools
without requiring the use of SVE instructions.

Signed-off-by: Dave Martin
Reviewed-by: Catalin Marinas
Cc: Alex Bennée
[will: Fixed up prctl constants to avoid clash with PDEATHSIG]
Signed-off-by: Will Deacon

Dave Martin
2017-11-03 23:24:19 +0800

02 Nov, 2017

1 commit

b24413180 License cleanup: add SPDX GPL-2.0 license identifier to files with no license ... Browse Code »

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2017-11-02 18:10:55 +0800

20 Jul, 2017

1 commit

4d28df615 prctl: Allow local CAP_SYS_ADMIN changing exe_file ... Browse Code »

During checkpointing and restore of userspace tasks
we bumped into the situation, that it's not possible
to restore the tasks, which user namespace does not
have uid 0 or gid 0 mapped.

People create user namespace mappings like they want,
and there is no a limitation on obligatory uid and gid
"must be mapped". So, if there is no uid 0 or gid 0
in the mapping, it's impossible to restore mm->exe_file
of the processes belonging to this user namespace.

Also, there is no a workaround. It's impossible
to create a temporary uid/gid mapping, because
only one write to /proc/[pid]/uid_map and gid_map
is allowed during a namespace lifetime.
If there is an entry, then no more mapings can't be
written. If there isn't an entry, we can't write
there too, otherwise user task won't be able
to do that in the future.

The patch changes the check, and looks for CAP_SYS_ADMIN
instead of zero uid and gid. This allows to restore
a task independently of its user namespace mappings.

Signed-off-by: Kirill Tkhai
CC: Andrew Morton
CC: Serge Hallyn
CC: "Eric W. Biederman"
CC: Oleg Nesterov
CC: Michal Hocko
CC: Andrei Vagin
CC: Cyrill Gorcunov
CC: Stanislav Kinsburskiy
CC: Pavel Tikhomirov
Reviewed-by: Cyrill Gorcunov
Signed-off-by: Eric W. Biederman

Kirill Tkhai
2017-07-20 20:46:07 +0800

13 Jul, 2017

1 commit

58c7ffc07 fix a braino in compat_sys_getrlimit() ... Browse Code »

Reported-and-tested-by: Meelis Roos
Fixes: commit d9e968cb9f84 "getrlimit()/setrlimit(): move compat to native"
Signed-off-by: Al Viro
Acked-by: David S. Miller
Signed-off-by: Linus Torvalds

Al Viro
2017-07-13 00:15:00 +0800

11 Jul, 2017

1 commit

186003323 mm: make PR_SET_THP_DISABLE immediately active ... Browse Code »

PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
existing mapping because it only updated mm->def_flags which is a
template for new mappings.

The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
flag set. This can be quite surprising for all those applications which
do not do prctl(); fork() & exec() and want to control their own THP
behavior.

Another usecase when the immediate semantic of the prctl might be useful
is a combination of pre- and post-copy migration of containers with
CRIU. In this case CRIU populates a part of a memory region with data
that was saved during the pre-copy stage. Afterwards, the region is
registered with userfaultfd and CRIU expects to get page faults for the
parts of the region that were not yet populated. However, khugepaged
collapses the pages and the expected page faults do not occur.

In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.

Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
tested when decision whether to use huge pages is taken either during
page fault of at the time of THP collapse.

It should be noted, that the new implementation makes PR_SET_THP_DISABLE
master override to any per-VMA setting, which was not the case
previously.

Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Michal Hocko
Signed-off-by: Mike Rapoport
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Arnd Bergmann
Cc: "Kirill A. Shutemov"
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2017-07-11 07:32:31 +0800

07 Jul, 2017

1 commit

c85686398 Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc compat stuff updates from Al Viro:
"This part is basically untangling various compat stuff. Compat
syscalls moved to their native counterparts, getting rid of quite a
bit of double-copying and/or set_fs() uses. A lot of field-by-field
copyin/copyout killed off.

- kernel/compat.c is much closer to containing just the
copyin/copyout of compat structs. Not all compat syscalls are gone
from it yet, but it's getting there.

- ipc/compat_mq.c killed off completely.

- block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
drivers/block/floppy.c where they belong. Yes, there are several
drivers that implement some of the same ioctls. Some are m68k and
one is 32bit-only pmac. drivers/block/floppy.c is the only one in
that bunch that can be built on biarch"

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mqueue: move compat syscalls to native ones
usbdevfs: get rid of field-by-field copyin
compat_hdio_ioctl: get rid of set_fs()
take floppy compat ioctls to sodding floppy.c
ipmi: get rid of field-by-field __get_user()
ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
rt_sigtimedwait(): move compat to native
select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
put_compat_rusage(): switch to copy_to_user()
sigpending(): move compat to native
getrlimit()/setrlimit(): move compat to native
times(2): move compat to native
compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
do_sigaltstack(): lift copying to/from userland into callers
take compat_sys_old_getrlimit() to native syscall
trim __ARCH_WANT_SYS_OLD_GETRLIMIT

Linus Torvalds
2017-07-07 11:57:13 +0800

10 Jun, 2017

2 commits

d9e968cb9 getrlimit()/setrlimit(): move compat to native ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2017-06-10 11:51:33 +0800
ca2406ed5 times(2): move compat to native ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2017-06-10 11:51:28 +0800

28 May, 2017

1 commit

613763a1f take compat_sys_old_getrlimit() to native syscall ... Browse Code »

... and sanitize the ifdefs in there

Signed-off-by: Al Viro

Al Viro
2017-05-28 03:38:06 +0800

22 May, 2017

1 commit

ce72a16fa wait4(2)/waitid(2): separate copying rusage to userland ... Browse Code »

New helpers: kernel_waitid() and kernel_wait4(). sys_waitid(),
sys_wait4() and their compat variants switched to those. Copying
struct rusage to userland is left to syscall itself. For
compat_sys_wait4() that eliminates the use of set_fs() completely.
For compat_sys_waitid() it's still needed (for siginfo handling);
that will change shortly.

Signed-off-by: Al Viro

Al Viro
2017-05-22 01:11:00 +0800

06 May, 2017

1 commit

e579dde65 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace updates from Eric Biederman:
"This is a set of small fixes that were mostly stumbled over during
more significant development. This proc fix and the fix to
posix-timers are the most significant of the lot.

There is a lot of good development going on but unfortunately it
didn't quite make the merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Fix unbalanced hard link numbers
signal: Make kill_proc_info static
rlimit: Properly call security_task_setrlimit
signal: Remove unused definition of sig_user_definied
ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
ipc: Remove unused declaration of recompute_msgmni
posix-timers: Correct sanity check in posix_cpu_nsleep
sysctl: Remove dead register_sysctl_root

Linus Torvalds
2017-05-06 02:08:43 +0800