Eric Lee / smarc-fsl-linux-kernel

21 Oct, 2019

1 commit

e2ab4ef83 Merge tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git… ... Browse Code »

…/masahiroy/linux-kbuild

Pull more Kbuild fixes from Masahiro Yamada:

- fix a bashism of setlocalversion

- do not use the too new --sort option of tar

* tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kheaders: substituting --sort in archive creation
scripts: setlocalversion: fix a bashism
kbuild: update comment about KBUILD_ALLDIRS

Linus Torvalds
2019-10-21 00:36:57 +0800

20 Oct, 2019

2 commits

188768f3c Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull hrtimer fixlet from Thomas Gleixner:
"A single commit annotating the lockcless access to timer->base with
READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Annotate lockless access to timer->base

Linus Torvalds
2019-10-20 18:25:12 +0800
589f1222e Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull stop-machine fix from Thomas Gleixner:
"A single fix, amending stop machine with WRITE/READ_ONCE() to address
the fallout of KCSAN"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
stop_machine: Avoid potential race behaviour

Linus Torvalds
2019-10-20 18:22:25 +0800

19 Oct, 2019

1 commit

aa5de305c kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register ... Browse Code »

Attaching uprobe to text section in THP splits the PMD mapped page table
into PTE mapped entries. On uprobe detach, we would like to regroup PMD
mapped page table entry to regain performance benefit of THP.

However, the regroup is broken For perf_event based trace_uprobe. This
is because perf_event based trace_uprobe calls uprobe_unregister twice
on close: first in TRACE_REG_PERF_CLOSE, then in
TRACE_REG_PERF_UNREGISTER. The second call will split the PMD mapped
page table entry, which is not the desired behavior.

Fix this by only use FOLL_SPLIT_PMD for uprobe register case.

Add a WARN() to confirm uprobe unregister never work on huge pages, and
abort the operation when this WARN() triggers.

Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Signed-off-by: Song Liu
Reviewed-by: Srikar Dronamraju
Cc: Kirill A. Shutemov
Cc: Oleg Nesterov
Cc: Matthew Wilcox (Oracle)
Cc: William Kucharski
Cc: Yang Shi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Song Liu
2019-10-19 18:32:33 +0800

18 Oct, 2019

2 commits

e59b76ff6 Merge tag 'pm-5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull power management fixes from Rafael Wysocki:
"These include a fix for a recent regression in the ACPI CPU
performance scaling code, a PCI device power management fix,
a system shutdown fix related to cpufreq, a removal of an ACPI
suspend-to-idle blacklist entry and a build warning fix.

Specifics:

- Fix possible NULL pointer dereference in the ACPI processor scaling
initialization code introduced by a recent cpufreq update (Rafael
Wysocki).

- Fix possible deadlock due to suspending cpufreq too late during
system shutdown (Rafael Wysocki).

- Make the PCI device system resume code path be more consistent with
its PM-runtime counterpart to fix an issue with missing delay on
transitions from D3cold to D0 during system resume from
suspend-to-idle on some systems (Rafael Wysocki).

- Drop Dell XPS13 9360 from the LPS0 Idle _DSM blacklist to make it
use suspend-to-idle by default (Mario Limonciello).

- Fix build warning in the core system suspend support code (Ben
Dooks)"

* tag 'pm-5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: processor: Avoid NULL pointer dereferences at init time
PCI: PM: Fix pci_power_up()
PM: sleep: include for pm_wq
cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown
ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist

Linus Torvalds
2019-10-18 23:34:04 +0800
b23eb5c74 Merge branches 'pm-cpufreq' and 'pm-sleep' ... Browse Code »

* pm-cpufreq:
ACPI: processor: Avoid NULL pointer dereferences at init time
cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown

* pm-sleep:
PM: sleep: include for pm_wq
ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist

Rafael J. Wysocki
2019-10-18 16:27:55 +0800

17 Oct, 2019

3 commits

b1fc58333 stop_machine: Avoid potential race behaviour ... Browse Code »

Both multi_cpu_stop() and set_state() access multi_stop_data::state
racily using plain accesses. These are subject to compiler
transformations which could break the intended behaviour of the code,
and this situation is detected by KCSAN on both arm64 and x86 (splats
below).

Improve matters by using READ_ONCE() and WRITE_ONCE() to ensure that the
compiler cannot elide, replay, or tear loads and stores.

In multi_cpu_stop() the two loads of multi_stop_data::state are expected to
be a consistent value, so snapshot the value into a temporary variable to
ensure this.

The state transitions are serialized by atomic manipulation of
multi_stop_data::num_threads, and other fields in multi_stop_data are not
modified while subject to concurrent reads.

KCSAN splat on arm64:

| BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0
|
| write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3:
| set_state+0x80/0xb0
| multi_cpu_stop+0x16c/0x198
| cpu_stopper_thread+0x170/0x298
| smpboot_thread_fn+0x40c/0x560
| kthread+0x1a8/0x1b0
| ret_from_fork+0x10/0x18
|
| read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1:
| multi_cpu_stop+0xa8/0x198
| cpu_stopper_thread+0x170/0x298
| smpboot_thread_fn+0x40c/0x560
| kthread+0x1a8/0x1b0
| ret_from_fork+0x10/0x18
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3
| Hardware name: linux,dummy-virt (DT)

KCSAN splat on x86:

| write to 0xffffb0bac0013e18 of 4 bytes by task 19 on cpu 2:
| set_state kernel/stop_machine.c:170 [inline]
| ack_state kernel/stop_machine.c:177 [inline]
| multi_cpu_stop+0x1a4/0x220 kernel/stop_machine.c:227
| cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
| smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
| kthread+0x1b5/0x200 kernel/kthread.c:255
| ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
|
| read to 0xffffb0bac0013e18 of 4 bytes by task 44 on cpu 7:
| multi_cpu_stop+0xb4/0x220 kernel/stop_machine.c:213
| cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
| smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
| kthread+0x1b5/0x200 kernel/kthread.c:255
| ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 7 PID: 44 Comm: migration/7 Not tainted 5.3.0+ #1
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014

Signed-off-by: Mark Rutland
Signed-off-by: Thomas Gleixner
Acked-by: Marco Elver
Link: https://lkml.kernel.org/r/20191007104536.27276-1-mark.rutland@arm.com

Mark Rutland
2019-10-17 18:47:12 +0800
700dea5a0 kheaders: substituting --sort in archive creation ... Browse Code »

The option --sort=ORDER was only introduced in tar 1.28 (2014), which
is rather new and might not be available in some setups.

This patch tries to replicate the previous behaviour as closely as
possible to fix the kheaders build for older environments. It does
not produce identical archives compared to the previous version due
to minor sorting differences but produces reproducible results itself
in my tests.

Reported-by: Andreas Schwab
Signed-off-by: Dmitry Goldin
Tested-by: Andreas Schwab
Tested-by: Quentin Perret
Signed-off-by: Masahiro Yamada

Dmitry Goldin
2019-10-17 08:08:19 +0800
bc88f85c6 kthread: make __kthread_queue_delayed_work static ... Browse Code »

The __kthread_queue_delayed_work is not exported so
make it static, to avoid the following sparse warning:

kernel/kthread.c:869:6: warning: symbol '__kthread_queue_delayed_work' was not declared. Should it be static?

Signed-off-by: Ben Dooks
Signed-off-by: Linus Torvalds

Ben Dooks
2019-10-17 00:20:58 +0800

16 Oct, 2019

1 commit

02755af0f Merge branch 'parisc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux ... Browse Code »

Pull parisc fixes from Helge Deller:

- Fix a parisc-specific fallout of Christoph's
dma_set_mask_and_coherent() patches (Sven)

- Fix a vmap memory leak in ioremap()/ioremap() (Helge)

- Some minor cleanups and documentation updates (Nick, Helge)

* 'parisc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Remove 32-bit DMA enforcement from sba_iommu
parisc: Fix vmap memory leak in ioremap()/iounmap()
parisc: prefer __section from compiler_attributes.h
parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define
MAINTAINERS: Add hp_sdc drivers to parisc arch

Linus Torvalds
2019-10-16 00:37:01 +0800

15 Oct, 2019

1 commit

b67114db6 parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define ... Browse Code »

Signed-off-by: Helge Deller

Helge Deller
2019-10-15 03:43:54 +0800

14 Oct, 2019

2 commits

ff229eee3 hrtimer: Annotate lockless access to timer->base ... Browse Code »

Followup to commit dd2261ed45aa ("hrtimer: Protect lockless access
to timer->base")

lock_hrtimer_base() fetches timer->base without lock exclusion.

Compiler is allowed to read timer->base twice (even if considered dumb)
which could end up trying to lock migration_base and return
&migration_base.

base = timer->base;
if (likely(base != &migration_base)) {

/* compiler reads timer->base again, and now (base == &migration_base)

raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
if (likely(base == timer->base))
return base; /* == &migration_base ! */

Similarly the write sides must use WRITE_ONCE() to avoid store tearing.

Signed-off-by: Eric Dumazet
Signed-off-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/20191008173204.180879-1-edumazet@google.com

Eric Dumazet
2019-10-14 21:51:49 +0800
d4615e5a4 Merge tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing fixes from Steven Rostedt:
"A few tracing fixes:

- Remove lockdown from tracefs itself and moved it to the trace
directory. Have the open functions there do the lockdown checks.

- Fix a few races with opening an instance file and the instance
being deleted (Discovered during the lockdown updates). Kept
separate from the clean up code such that they can be backported to
stable easier.

- Clean up and consolidated the checks done when opening a trace
file, as there were multiple checks that need to be done, and it
did not make sense having them done in each open instance.

- Fix a regression in the record mcount code.

- Small hw_lat detector tracer fixes.

- A trace_pipe read fix due to not initializing trace_seq"

* tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
tracing/hwlat: Report total time spent in all NMIs during the sample
recordmcount: Fix nop_mcount() function
tracing: Do not create tracefs files if tracefs lockdown is in effect
tracing: Add locked_down checks to the open calls of files created for tracefs
tracing: Add tracing_check_open_get_tr()
tracing: Have trace events system open call tracing_open_generic_tr()
tracing: Get trace_array reference for available_tracers files
ftrace: Get a reference counter for the trace_array on filter files
tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")

Linus Torvalds
2019-10-14 05:47:10 +0800

13 Oct, 2019

10 commits

d303de1fc tracing: Initialize iter->seq after zeroing in tracing_read_pipe() ... Browse Code »

A customer reported the following softlockup:

[899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464]
[899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] Kernel panic - not syncing: softlockup: hung tasks
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12
[899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000
[899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00
[899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000
[899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8
[899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000
[899688.160002] tracing_read_pipe+0x336/0x3c0
[899688.160002] __vfs_read+0x26/0x140
[899688.160002] vfs_read+0x87/0x130
[899688.160002] SyS_read+0x42/0x90
[899688.160002] do_syscall_64+0x74/0x160

It caught the process in the middle of trace_access_unlock(). There is
no loop. So, it must be looping in the caller tracing_read_pipe()
via the "waitagain" label.

Crashdump analyze uncovered that iter->seq was completely zeroed
at this point, including iter->seq.seq.size. It means that
print_trace_line() was never able to print anything and
there was no forward progress.

The culprit seems to be in the code:

/* reset all but tr, trace, and overruns */
memset(&iter->seq, 0,
sizeof(struct trace_iterator) -
offsetof(struct trace_iterator, seq));

It was added by the commit 53d0aa773053ab182877 ("ftrace:
add logic to record overruns"). It was v2.6.27-rc1.
It was the time when iter->seq looked like:

struct trace_seq {
unsigned char buffer[PAGE_SIZE];
unsigned int len;
};

There was no "size" variable and zeroing was perfectly fine.

The solution is to reinitialize the structure after or without
zeroing.

Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com

Signed-off-by: Petr Mladek
Signed-off-by: Steven Rostedt (VMware)

Petr Mladek
2019-10-13 08:49:34 +0800
fc64e4ad8 tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency ... Browse Code »

max_latency is intended to record the maximum ever observed hardware
latency, which may occur in either part of the loop (inner/outer). So
we need to also consider the outer-loop sample when updating
max_latency.

Link: http://lkml.kernel.org/r/157073345463.17189.18124025522664682811.stgit@srivatsa-ubuntu

Fixes: e7c15cd8a113 ("tracing: Added hardware latency tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat (VMware)
Signed-off-by: Steven Rostedt (VMware)

Srivatsa S. Bhat (VMware)
2019-10-13 08:49:33 +0800
98dc19c11 tracing/hwlat: Report total time spent in all NMIs during the sample ... Browse Code »

nmi_total_ts is supposed to record the total time spent in *all* NMIs
that occur on the given CPU during the (active portion of the)
sampling window. However, the code seems to be overwriting this
variable for each NMI, thereby only recording the time spent in the
most recent NMI. Fix it by accumulating the duration instead.

Link: http://lkml.kernel.org/r/157073343544.17189.13911783866738671133.stgit@srivatsa-ubuntu

Fixes: 7b2c86250122 ("tracing: Add NMI tracing in hwlat detector")
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat (VMware)
Signed-off-by: Steven Rostedt (VMware)

Srivatsa S. Bhat (VMware)
2019-10-13 08:49:33 +0800
17911ff38 tracing: Add locked_down checks to the open calls of files created for tracefs ... Browse Code »

Added various checks on open tracefs calls to see if tracefs is in lockdown
mode, and if so, to return -EPERM.

Note, the event format files (which are basically standard on all machines)
as well as the enabled_functions file (which shows what is currently being
traced) are not lockde down. Perhaps they should be, but it seems counter
intuitive to lockdown information to help you know if the system has been
modified.

Link: http://lkml.kernel.org/r/CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com

Suggested-by: Linus Torvalds
Signed-off-by: Steven Rostedt (VMware)

Steven Rostedt (VMware)
2019-10-13 08:48:06 +0800
8530dec63 tracing: Add tracing_check_open_get_tr() ... Browse Code »

Currently, most files in the tracefs directory test if tracing_disabled is
set. If so, it should return -ENODEV. The tracing_disabled is called when
tracing is found to be broken. Originally it was done in case the ring
buffer was found to be corrupted, and we wanted to prevent reading it from
crashing the kernel. But it's also called if a tracing selftest fails on
boot. It's a one way switch. That is, once it is triggered, tracing is
disabled until reboot.

As most tracefs files can also be used by instances in the tracefs
directory, they need to be carefully done. Each instance has a trace_array
associated to it, and when the instance is removed, the trace_array is
freed. But if an instance is opened with a reference to the trace_array,
then it requires looking up the trace_array to get its ref counter (as there
could be a race with it being deleted and the open itself). Once it is
found, a reference is added to prevent the instance from being removed (and
the trace_array associated with it freed).

Combine the two checks (tracing_disabled and trace_array_get()) into a
single helper function. This will also make it easier to add lockdown to
tracefs later.

Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home

Signed-off-by: Steven Rostedt (VMware)

Steven Rostedt (VMware)
2019-10-13 08:44:07 +0800
aa07d71f1 tracing: Have trace events system open call tracing_open_generic_tr() ... Browse Code »

Instead of having the trace events system open call open code the taking of
the trace_array descriptor (with trace_array_get()) and then calling
trace_open_generic(), have it use the tracing_open_generic_tr() that does
the combination of the two. This requires making tracing_open_generic_tr()
global.

Signed-off-by: Steven Rostedt (VMware)

Steven Rostedt (VMware)
2019-10-13 08:43:00 +0800
194c2c74f tracing: Get trace_array reference for available_tracers files ... Browse Code »

As instances may have different tracers available, we need to look at the
trace_array descriptor that shows the list of the available tracers for the
instance. But there's a race between opening the file and an admin
deleting the instance. The trace_array_get() needs to be called before
accessing the trace_array.

Cc: stable@vger.kernel.org
Fixes: 607e2ea167e56 ("tracing: Set up infrastructure to allow tracers for instances")
Signed-off-by: Steven Rostedt (VMware)

Steven Rostedt (VMware)
2019-10-13 08:40:50 +0800
9ef16693a ftrace: Get a reference counter for the trace_array on filter files ... Browse Code »

The ftrace set_ftrace_filter and set_ftrace_notrace files are specific for
an instance now. They need to take a reference to the instance otherwise
there could be a race between accessing the files and deleting the instance.

It wasn't until the :mod: caching where these file operations started
referencing the trace_array directly.

Cc: stable@vger.kernel.org
Fixes: 673feb9d76ab3 ("ftrace: Add :mod: caching infrastructure to trace_array")
Signed-off-by: Steven Rostedt (VMware)

Steven Rostedt (VMware)
2019-10-13 08:40:21 +0800
328fefadd Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull scheduler fixes from Ingo Molnar:
"Two fixes: a guest-cputime accounting fix, and a cgroup bandwidth
quota precision fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/vtime: Fix guest/system mis-accounting on task switch
sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision

Linus Torvalds
2019-10-13 06:29:54 +0800
465a7e291 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also a couple of updates for new Intel
models (which are technically hw-enablement, but to users it's a fix
to perf behavior on those new CPUs - hope this is fine), an AUX
inheritance fix, event time-sharing fix, and a fix for lost non-perf
NMI events on AMD systems"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
perf/x86/cstate: Add Tiger Lake CPU support
perf/x86/msr: Add Tiger Lake CPU support
perf/x86/intel: Add Tiger Lake CPU support
perf/x86/cstate: Update C-state counters for Ice Lake
perf/x86/msr: Add new CPU model numbers for Ice Lake
perf/x86/cstate: Add Comet Lake CPU support
perf/x86/msr: Add Comet Lake CPU support
perf/x86/intel: Add Comet Lake CPU support
perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp
perf/core: Fix corner case in perf_rotate_context()
perf/core: Rework memory accounting in perf_mmap()
perf/core: Fix inheritance of aux_output groups
perf annotate: Don't return -1 for error when doing BPF disassembly
perf annotate: Return appropriate error code for allocation failures
perf annotate: Fix arch specific ->init() failure errors
perf annotate: Propagate the symbol__annotate() error return
perf annotate: Fix the signedness of failure returns
perf annotate: Propagate perf_env__arch() error
perf evsel: Fall back to global 'perf_env' in perf_evsel__env()
perf tools: Propagate get_cpuid() error
...

Linus Torvalds
2019-10-13 06:15:17 +0800

11 Oct, 2019

1 commit

297cbcccc Merge tag 'for-linus-20191010' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:

- Fix wbt performance regression introduced with the blk-rq-qos
refactoring (Harshad)

- Fix io_uring fileset removal inadvertently killing the workqueue (me)

- Fix io_uring typo in linked command nonblock submission (Pavel)

- Remove spurious io_uring wakeups on request free (Pavel)

- Fix null_blk zoned command error return (Keith)

- Don't use freezable workqueues for backing_dev, also means we can
revert a previous libata hack (Mika)

- Fix nbd sysfs mutex dropped too soon at removal time (Xiubo)

* tag 'for-linus-20191010' of git://git.kernel.dk/linux-block:
nbd: fix possible sysfs duplicate warning
null_blk: Fix zoned command return code
io_uring: only flush workqueues on fileset removal
io_uring: remove wait loop spurious wakeups
blk-wbt: fix performance regression in wbt scale_up/scale_down
Revert "libata, freezer: avoid block device removal while system is frozen"
bdi: Do not use freezable workqueue
io_uring: fix reversed nonblock flag for link submission

Linus Torvalds
2019-10-11 23:45:32 +0800

10 Oct, 2019

1 commit

f49249d58 PM: sleep: include <linux/pm_runtime.h> for pm_wq ... Browse Code »

Include the for the definition of
pm_wq to avoid the following warning:

kernel/power/main.c:890:25: warning: symbol 'pm_wq' was not declared. Should it be static?

Signed-off-by: Ben Dooks
Signed-off-by: Rafael J. Wysocki

Ben Dooks
2019-10-10 17:11:56 +0800

09 Oct, 2019

4 commits

7fa343b7f perf/core: Fix corner case in perf_rotate_context() ... Browse Code »

In perf_rotate_context(), when the first cpu flexible event fail to
schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is
NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus
cpuctx->ctx.is_active will have EVENT_FLEXIBLE set. Then, the next
perf_event_sched_in() will skip all cpu flexible events because of the
EVENT_FLEXIBLE bit.

In the next call of perf_rotate_context(), cpu_rotate stays 1, and
cpu_event stays NULL, so this process repeats. The end result is, flexible
events on this cpu will not be scheduled (until another event being added
to the cpuctx).

Here is an easy repro of this issue. On Intel CPUs, where ref-cycles
could only use one counter, run one pinned event for ref-cycles, one
flexible event for ref-cycles, and one flexible event for cycles. The
flexible ref-cycles is never scheduled, which is expected. However,
because of this issue, the cycles event is never scheduled either.

$ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000

time counts unit events
1.000152973 15,412,480 ref-cycles:D
1.000152973 ref-cycles (0.00%)
1.000152973 cycles (0.00%)
2.000486957 18,263,120 ref-cycles:D
2.000486957 ref-cycles (0.00%)
2.000486957 cycles (0.00%)

To fix this, when the flexible_active list is empty, try rotate the
first event in the flexible_groups. Also, rename ctx_first_active() to
ctx_event_to_rotate(), which is more accurate.

Signed-off-by: Song Liu
Signed-off-by: Peter Zijlstra (Intel)
Cc:
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Sasha Levin
Cc: Thomas Gleixner
Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling")
Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar

Song Liu
2019-10-09 18:44:13 +0800
d44248a41 perf/core: Rework memory accounting in perf_mmap() ... Browse Code »

perf_mmap() always increases user->locked_vm. As a result, "extra" could
grow bigger than "user_extra", which doesn't make sense. Here is an
example case:

(Note: Assume "user_lock_limit" is very small.)

| # of perf_mmap calls |vma->vm_mm->pinned_vm|user->locked_vm|
| 0 | 0 | 0 |
| 1 | user_extra | user_extra |
| 2 | 3 * user_extra | 2 * user_extra|
| 3 | 6 * user_extra | 3 * user_extra|
| 4 | 10 * user_extra | 4 * user_extra|

Fix this by maintaining proper user_extra and extra.

Reviewed-By: Hechao Li
Reported-by: Hechao Li
Signed-off-by: Song Liu
Signed-off-by: Peter Zijlstra (Intel)
Cc:
Cc: Jie Meng
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar

Song Liu
2019-10-09 18:44:12 +0800
68e7a4d66 sched/vtime: Fix guest/system mis-accounting on task switch ... Browse Code »

vtime_account_system() assumes that the target task to account cputime
to is always the current task. This is most often true indeed except on
task switch where we call:

vtime_common_task_switch(prev)
vtime_account_system(prev)

Here prev is the scheduling-out task where we account the cputime to. It
doesn't match current that is already the scheduling-in task at this
stage of the context switch.

So we end up checking the wrong task flags to determine if we are
accounting guest or system time to the previous task.

As a result the wrong task is used to check if the target is running in
guest mode. We may then spuriously account or leak either system or
guest time on task switch.

Fix this assumption and also turn vtime_guest_enter/exit() to use the
task passed in parameter as well to avoid future similar issues.

Signed-off-by: Frederic Weisbecker
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: Wanpeng Li
Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
Signed-off-by: Ingo Molnar

Frederic Weisbecker
2019-10-09 18:38:03 +0800
4929a4e6f sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision ... Browse Code »

The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

normalized_cfs_quota() = [(quota_us << 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
2) Create a few children cgroups.
3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

new_quota: 1148437ns, 1148us
new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld
Signed-off-by: Xuewei Zhang
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Phil Auld
Cc: Anton Blanchard
Cc: Ben Segall
Cc: Dietmar Eggemann
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Cc: Vincent Guittot
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar

Xuewei Zhang
2019-10-09 18:38:02 +0800

08 Oct, 2019

3 commits

eda57a0e4 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge misc fixes from Andrew Morton:
"The usual shower of hotfixes.

Chris's memcg patches aren't actually fixes - they're mature but a few
niggling review issues were late to arrive.

The ocfs2 fixes are quite old - those took some time to get reviewer
attention.

Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
mm/slab-generic"

* emailed patches from Andrew Morton :
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
mm, sl[ou]b: improve memory accounting
mm, memcg: make scan aggression always exclude protection
mm, memcg: make memory.emin the baseline for utilisation determination
mm, memcg: proportional memory.{low,min} reclaim
mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
mm/page_alloc.c: fix a crash in free_pages_prepare()
mm/z3fold.c: claim page in the beginning of free
kernel/sysctl.c: do not override max_threads provided by userspace
memcg: only record foreign writebacks with dirty pages when memcg is not disabled
mm: fix -Wmissing-prototypes warnings
writeback: fix use-after-free in finish_writeback_work()
mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
panic: ensure preemption is disabled during panic()
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
ocfs2: clear zero in unaligned direct IO

Linus Torvalds
2019-10-08 07:04:19 +0800
b0f53dbc4 kernel/sysctl.c: do not override max_threads provided by userspace ... Browse Code »

Partially revert 16db3d3f1170 ("kernel/sysctl.c: threads-max observe
limits") because the patch is causing a regression to any workload which
needs to override the auto-tuning of the limit provided by kernel.

set_max_threads is implementing a boot time guesstimate to provide a
sensible limit of the concurrently running threads so that runaways will
not deplete all the memory. This is a good thing in general but there
are workloads which might need to increase this limit for an application
to run (reportedly WebSpher MQ is affected) and that is simply not
possible after the mentioned change. It is also very dubious to
override an admin decision by an estimation that doesn't have any direct
relation to correctness of the kernel operation.

Fix this by dropping set_max_threads from sysctl_max_threads so any
value is accepted as long as it fits into MAX_THREADS which is important
to check because allowing more threads could break internal robust futex
restriction. While at it, do not use MIN_THREADS as the lower boundary
because it is also only a heuristic for automatic estimation and admin
might have a good reason to stop new threads to be created even when
below this limit.

This became more severe when we switched x86 from 4k to 8k kernel
stacks. Starting since 6538b8ea886e ("x86_64: expand kernel stack to
16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
value.

In the particular case

3.12
kernel.threads-max = 515561

4.4
kernel.threads-max = 200000

Neither of the two values is really insane on 32GB machine.

I am not sure we want/need to tune the max_thread value further. If
anything the tuning should be removed altogether if proven not useful in
general. But we definitely need a way to override this auto-tuning.

Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
Fixes: 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits")
Signed-off-by: Michal Hocko
Reviewed-by: "Eric W. Biederman"
Cc: Heinrich Schuchardt
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2019-10-08 06:47:19 +0800
20bb759a6 panic: ensure preemption is disabled during panic() ... Browse Code »

Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
calling CPU in an infinite loop, but with interrupts and preemption
enabled. From this state, userspace can continue to be scheduled,
despite the system being "dead" as far as the kernel is concerned.

This is easily reproducible on arm64 when booting with "nosmp" on the
command line; a couple of shell scripts print out a periodic "Ping"
message whilst another triggers a crash by writing to
/proc/sysrq-trigger:

| sysrq: Trigger a crash
| Kernel panic - not syncing: sysrq triggered crash
| CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
| Hardware name: linux,dummy-virt (DT)
| Call trace:
| dump_backtrace+0x0/0x148
| show_stack+0x14/0x20
| dump_stack+0xa0/0xc4
| panic+0x140/0x32c
| sysrq_handle_reboot+0x0/0x20
| __handle_sysrq+0x124/0x190
| write_sysrq_trigger+0x64/0x88
| proc_reg_write+0x60/0xa8
| __vfs_write+0x18/0x40
| vfs_write+0xa4/0x1b8
| ksys_write+0x64/0xf0
| __arm64_sys_write+0x14/0x20
| el0_svc_common.constprop.0+0xb0/0x168
| el0_svc_handler+0x28/0x78
| el0_svc+0x8/0xc
| Kernel Offset: disabled
| CPU features: 0x0002,24002004
| Memory Limit: none
| ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
| Ping 2!
| Ping 1!
| Ping 1!
| Ping 2!

The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
otherwise local interrupts are disabled in 'smp_send_stop()'.

Disable preemption in 'panic()' before re-enabling interrupts.

Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
Signed-off-by: Will Deacon
Reported-by: Xogium
Reviewed-by: Kees Cook
Cc: Russell King
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Petr Mladek
Cc: Feng Tang
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Will Deacon
2019-10-08 06:47:19 +0800

07 Oct, 2019

2 commits

f733c6b50 perf/core: Fix inheritance of aux_output groups ... Browse Code »

Commit:

ab43762ef010 ("perf: Allow normal events to output AUX data")

forgets to configure aux_output relation in the inherited groups, which
results in child PEBS events forever failing to schedule.

Fix this by setting up the AUX output link in the inheritance path.

Signed-off-by: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20191004125729.32397-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar

Alexander Shishkin
2019-10-07 22:50:42 +0800
7cdb85df6 Merge tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping ... Browse Code »

Pull dma-mapping regression fix from Christoph Hellwig:
"Revert an incorret hunk from a patch that caused problems on various
arm boards (Andrey Smirnov)"

* tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: fix false positive warnings in dma_common_free_remap()

Linus Torvalds
2019-10-07 02:10:15 +0800

06 Oct, 2019

2 commits

0e48f51cb Revert "libata, freezer: avoid block device removal while system is frozen" ... Browse Code »

This reverts commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557.

The commit was added as a quick band-aid for a hang that happened when a
block device was removed during system suspend. Now that bdi_wq is not
freezable anymore the hang should not be possible and we can get rid of
this hack by reverting it.

Acked-by: Rafael J. Wysocki
Signed-off-by: Mika Westerberg
Signed-off-by: Jens Axboe

Mika Westerberg
2019-10-06 23:11:37 +0800
2d00aee21 Merge tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/m… ... Browse Code »

…asahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- remove unneeded ar-option and KBUILD_ARFLAGS

- remove long-deprecated SUBDIRS

- fix modpost to suppress false-positive warnings for UML builds

- fix namespace.pl to handle relative paths to ${objtree}, ${srctree}

- make setlocalversion work for /bin/sh

- make header archive reproducible

- fix some Makefiles and documents

* tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kheaders: make headers archive reproducible
kbuild: update compile-test header list for v5.4-rc2
kbuild: two minor updates for Documentation/kbuild/modules.rst
scripts/setlocalversion: clear local variable to make it work for sh
namespace: fix namespace.pl script to support relative paths
video/logo: do not generate unneeded logo C files
video/logo: remove unneeded *.o pattern from clean-files
integrity: remove pointless subdir-$(CONFIG_...)
integrity: remove unneeded, broken attempt to add -fshort-wchar
modpost: fix static EXPORT_SYMBOL warnings for UML build
kbuild: correct formatting of header in kbuild module docs
kbuild: remove SUBDIRS support
kbuild: remove ar-option and KBUILD_ARFLAGS

Linus Torvalds
2019-10-06 03:56:59 +0800

05 Oct, 2019

4 commits

2cf2aa6a6 dma-mapping: fix false positivse warnings in dma_common_free_remap() ... Browse Code »

Commit 5cf4537975bb ("dma-mapping: introduce a dma_common_find_pages
helper") changed invalid input check in dma_common_free_remap() from:

if (!area || !area->flags != VM_DMA_COHERENT)

to

if (!area || !area->flags != VM_DMA_COHERENT || !area->pages)

which seem to produce false positives for memory obtained via
dma_common_contiguous_remap()

This triggers the following warning message when doing "reboot" on ZII
VF610 Dev Board Rev B:

WARNING: CPU: 0 PID: 1 at kernel/dma/remap.c:112 dma_common_free_remap+0x88/0x8c
trying to free invalid coherent area: 9ef82980
Modules linked in:
CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.3.0-rc6-next-20190820 #119
Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
Backtrace:
[] (dump_backtrace) from [] (show_stack+0x20/0x24)
r7:8015ed78 r6:00000009 r5:00000000 r4:9f4d9b14
[] (show_stack) from [] (dump_stack+0x24/0x28)
[] (dump_stack) from [] (__warn.part.3+0xcc/0xe4)
[] (__warn.part.3) from [] (warn_slowpath_fmt+0x78/0x94)
r6:00000070 r5:808e540c r4:81c03048
[] (warn_slowpath_fmt) from [] (dma_common_free_remap+0x88/0x8c)
r3:9ef82980 r2:808e53e0
r7:00001000 r6:a0b1e000 r5:a0b1e000 r4:00001000
[] (dma_common_free_remap) from [] (remap_allocator_free+0x60/0x68)
r5:81c03048 r4:9f4d9b78
[] (remap_allocator_free) from [] (__arm_dma_free.constprop.3+0xf8/0x148)
r5:81c03048 r4:9ef82900
[] (__arm_dma_free.constprop.3) from [] (arm_dma_free+0x24/0x2c)
r5:9f563410 r4:80110120
[] (arm_dma_free) from [] (dma_free_attrs+0xa0/0xdc)
[] (dma_free_attrs) from [] (dma_pool_destroy+0xc0/0x154)
r8:9efa8860 r7:808f02f0 r6:808f02d0 r5:9ef82880 r4:9ef82780
[] (dma_pool_destroy) from [] (ehci_mem_cleanup+0x6c/0x150)
r7:9f563410 r6:9efa8810 r5:00000000 r4:9efd0148
[] (ehci_mem_cleanup) from [] (ehci_stop+0xac/0xc0)
r5:9efd0148 r4:9efd0000
[] (ehci_stop) from [] (usb_remove_hcd+0xf4/0x1b0)
r7:9f563410 r6:9efd0074 r5:81c03048 r4:9efd0000
[] (usb_remove_hcd) from [] (host_stop+0x48/0xb8)
r7:9f563410 r6:9efd0000 r5:9f5f4040 r4:9f5f5040
[] (host_stop) from [] (ci_hdrc_host_destroy+0x34/0x38)
r7:9f563410 r6:9f5f5040 r5:9efa8800 r4:9f5f4040
[] (ci_hdrc_host_destroy) from [] (ci_hdrc_remove+0x50/0x10c)
[] (ci_hdrc_remove) from [] (platform_drv_remove+0x34/0x4c)
r7:9f563410 r6:81c4f99c r5:9efa8810 r4:9efa8810
[] (platform_drv_remove) from [] (device_release_driver_internal+0xec/0x19c)
r5:00000000 r4:9efa8810
[] (device_release_driver_internal) from [] (device_release_driver+0x20/0x24)
r7:9f563410 r6:81c41ed0 r5:9efa8810 r4:9f4a1dac
[] (device_release_driver) from [] (bus_remove_device+0xdc/0x108)
[] (bus_remove_device) from [] (device_del+0x150/0x36c)
r7:9f563410 r6:81c03048 r5:9efa8854 r4:9efa8810
[] (device_del) from [] (platform_device_del.part.2+0x20/0x84)
r10:9f563414 r9:809177e0 r8:81cb07dc r7:81c78320 r6:9f563454 r5:9efa8800
r4:9efa8800
[] (platform_device_del.part.2) from [] (platform_device_unregister+0x28/0x34)
r5:9f563400 r4:9efa8800
[] (platform_device_unregister) from [] (ci_hdrc_remove_device+0x1c/0x30)
r5:9f563400 r4:00000001
[] (ci_hdrc_remove_device) from [] (ci_hdrc_imx_remove+0x38/0x118)
r7:81c78320 r6:9f563454 r5:9f563410 r4:9f541010
[] (ci_hdrc_imx_shutdown) from [] (platform_drv_shutdown+0x2c/0x30)
[] (platform_drv_shutdown) from [] (device_shutdown+0x158/0x1f0)
[] (device_shutdown) from [] (kernel_restart_prepare+0x44/0x48)
r10:00000058 r9:9f4d8000 r8:fee1dead r7:379ce700 r6:81c0b280 r5:81c03048
r4:00000000
[] (kernel_restart_prepare) from [] (kernel_restart+0x1c/0x60)
[] (kernel_restart) from [] (__do_sys_reboot+0xe0/0x1d8)
r5:81c03048 r4:00000000
[] (__do_sys_reboot) from [] (sys_reboot+0x18/0x1c)
r8:80101204 r7:00000058 r6:00000000 r5:00000000 r4:00000000
[] (sys_reboot) from [] (ret_fast_syscall+0x0/0x54)
Exception stack(0x9f4d9fa8 to 0x9f4d9ff0)
9fa0: 00000000 00000000 fee1dead 28121969 01234567 379ce700
9fc0: 00000000 00000000 00000000 00000058 00000000 00000000 00000000 00016d04
9fe0: 00028e0c 7ec87c64 000135ec 76c1f410

Restore original invalid input check in dma_common_free_remap() to
avoid this problem.

Fixes: 5cf4537975bb ("dma-mapping: introduce a dma_common_find_pages helper")
Signed-off-by: Andrey Smirnov
[hch: just revert the offending hunk instead of creating a new helper]
Signed-off-by: Christoph Hellwig

Andrey Smirnov
2019-10-05 16:24:17 +0800
86cdd2fdc kheaders: make headers archive reproducible ... Browse Code »

In commit 43d8ce9d65a5 ("Provide in-kernel headers to make
extending kernel easier") a new mechanism was introduced, for kernels
>=5.2, which embeds the kernel headers in the kernel image or a module
and exposes them in procfs for use by userland tools.

The archive containing the header files has nondeterminism caused by
header files metadata. This patch normalizes the metadata and utilizes
KBUILD_BUILD_TIMESTAMP if provided and otherwise falls back to the
default behaviour.

In commit f7b101d33046 ("kheaders: Move from proc to sysfs") it was
modified to use sysfs and the script for generation of the archive was
renamed to what is being patched.

Signed-off-by: Dmitry Goldin
Reviewed-by: Greg Kroah-Hartman
Reviewed-by: Joel Fernandes (Google)
Signed-off-by: Masahiro Yamada

Dmitry Goldin
2019-10-05 14:29:49 +0800
e524d16e7 Merge tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux… ... Browse Code »

…/kernel/git/brauner/linux

Pull copy_struct_from_user() helper from Christian Brauner:
"This contains the copy_struct_from_user() helper which got split out
from the openat2() patchset. It is a generic interface designed to
copy a struct from userspace.

The helper will be especially useful for structs versioned by size of
which we have quite a few. This allows for backwards compatibility,
i.e. an extended struct can be passed to an older kernel, or a legacy
struct can be passed to a newer kernel. For the first case (extended
struct, older kernel) the new fields in an extended struct can be set
to zero and the struct safely passed to an older kernel.

The most obvious benefit is that this helper lets us get rid of
duplicate code present in at least sched_setattr(), perf_event_open(),
and clone3(). More importantly it will also help to ensure that users
implementing versioning-by-size end up with the same core semantics.

This point is especially crucial since we have at least one case where
versioning-by-size is used but with slighly different semantics:
sched_setattr(), perf_event_open(), and clone3() all do do similar
checks to copy_struct_from_user() while rt_sigprocmask(2) always
rejects differently-sized struct arguments.

With this pull request we also switch over sched_setattr(),
perf_event_open(), and clone3() to use the new helper"

* tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
usercopy: Add parentheses around assignment in test_copy_struct_from_user
perf_event_open: switch to copy_struct_from_user()
sched_setattr: switch to copy_struct_from_user()
clone3: switch to copy_struct_from_user()
lib: introduce copy_struct_from_user() helper

Linus Torvalds
2019-10-05 01:36:31 +0800
af0622f6a Merge tag 'for-linus-20191003' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux ... Browse Code »

Pull clone3/pidfd fixes from Christian Brauner:
"This contains a couple of fixes:

- Fix pidfd selftest compilation (Shuah Kahn)

Due to a false linking instruction in the Makefile compilation for
the pidfd selftests would fail on some systems.

- Fix compilation for glibc on RISC-V systems (Seth Forshee)

In some scenarios linux/uapi/linux/sched.h is included where
__ASSEMBLY__ is defined causing a build failure because struct
clone_args was not guarded by an #ifndef __ASSEMBLY__.

- Add missing clone3() and struct clone_args kernel-doc (Christian Brauner)

clone3() and struct clone_args were missing kernel-docs. (The goal
is to use kernel-doc for any function or type where it's worth it.)
For struct clone_args this also contains a comment about the fact
that it's versioned by size"

* tag 'for-linus-20191003' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
sched: add kernel-doc for struct clone_args
fork: add kernel-doc for clone3
selftests: pidfd: Fix undefined reference to pthread_create()
sched: Add __ASSEMBLY__ guards around struct clone_args

Linus Torvalds
2019-10-05 01:18:56 +0800