21 Jul, 2017

4 commits

  • David S. Miller
     
  • Pull networking fixes from David Miller:

    1) BPF verifier signed/unsigned value tracking fix, from Daniel
    Borkmann, Edward Cree, and Josef Bacik.

    2) Fix memory allocation length when setting up calls to
    ->ndo_set_mac_address, from Cong Wang.

    3) Add a new cxgb4 device ID, from Ganesh Goudar.

    4) Fix FIB refcount handling: we have to set its initial value before
    the configure callback (which can bump it). From David Ahern.

    5) Fix double-free in qcom/emac driver, from Timur Tabi.

    6) A bunch of gcc-7 string format overflow warning fixes from Arnd
    Bergmann.

    7) Fix link level headroom tests in ip_do_fragment(), from Vasily
    Averin.

    8) Fix chunk walking in SCTP when iterating over error and parameter
    headers. From Alexander Potapenko.

    9) TCP BBR congestion control fixes from Neal Cardwell.

    10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger.

    11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong
    Wang.

    12) xmit_recursion in ppp driver needs to be per-device not per-cpu,
    from Gao Feng.

    13) Cannot release skb->dst in UDP if IP options processing needs it.
    From Paolo Abeni.

    14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander
    Levin and myself.

    15) Revert some rtnetlink notification changes that are causing
    regressions, from David Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
    net: bonding: Fix transmit load balancing in balance-alb mode
    rds: Make sure updates to cp_send_gen can be observed
    net: ethernet: ti: cpsw: Push the request_irq function to the end of probe
    ipv4: initialize fib_trie prior to register_netdev_notifier call.
    rtnetlink: allocate more memory for dev_set_mac_address()
    net: dsa: b53: Add missing ARL entries for BCM53125
    bpf: more tests for mixed signed and unsigned bounds checks
    bpf: add test for mixed signed and unsigned bounds checks
    bpf: fix up test cases with mixed signed/unsigned bounds
    bpf: allow to specify log level and reduce it for test_verifier
    bpf: fix mixed signed/unsigned derived min/max value bounds
    ipv6: avoid overflow of offset in ip6_find_1stfragopt
    net: tehuti: don't process data if it has not been copied from userspace
    Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"
    net: dsa: mv88e6xxx: Enable CMODE config support for 6390X
    dt-binding: ptp: Add SoC compatibility strings for dte ptp clock
    NET: dwmac: Make dwmac reset unconditional
    net: Zero terminate ifr_name in dev_ifname().
    wireless: wext: terminate ifr name coming from userspace
    netfilter: fix netfilter_net_init() return
    ...

    Linus Torvalds
     
  • Edward reported that there's an issue in min/max value bounds
    tracking when signed and unsigned compares both provide hints
    on limits when having unknown variables. E.g. a program such
    as the following should have been rejected:

    0: (7a) *(u64 *)(r10 -8) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -8
    3: (18) r1 = 0xffff8a94cda93400
    5: (85) call bpf_map_lookup_elem#1
    6: (15) if r0 == 0x0 goto pc+7
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
    7: (7a) *(u64 *)(r10 -16) = -8
    8: (79) r1 = *(u64 *)(r10 -16)
    9: (b7) r2 = -1
    10: (2d) if r1 > r2 goto pc+3
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0
    R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
    11: (65) if r1 s> 0x1 goto pc+2
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1
    R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
    12: (0f) r0 += r1
    13: (72) *(u8 *)(r0 +0) = 0
    R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1
    R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
    14: (b7) r0 = 0
    15: (95) exit

    What happens is that in the first part ...

    8: (79) r1 = *(u64 *)(r10 -16)
    9: (b7) r2 = -1
    10: (2d) if r1 > r2 goto pc+3

    ... r1 carries an unsigned value, and is compared as unsigned
    against a register carrying an immediate. Verifier deduces in
    reg_set_min_max() that since the compare is unsigned and operation
    is greater than (>), that in the fall-through/false case, r1's
    minimum bound must be 0 and maximum bound must be r2. Latter is
    larger than the bound and thus max value is reset back to being
    'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now
    'R1=inv,min_value=0'. The subsequent test ...

    11: (65) if r1 s> 0x1 goto pc+2

    ... is a signed compare of r1 with immediate value 1. Here,
    verifier deduces in reg_set_min_max() that since the compare
    is signed this time and operation is greater than (>), that
    in the fall-through/false case, we can deduce that r1's maximum
    bound must be 1, meaning with prior test, we result in r1 having
    the following state: R1=inv,min_value=0,max_value=1. Given that
    the actual value this holds is -8, the bounds are wrongly deduced.
    When this is being added to r0 which holds the map_value(_adj)
    type, then subsequent store access in above case will go through
    check_mem_access() which invokes check_map_access_adj(), that
    will then probe whether the map memory is in bounds based
    on the min_value and max_value as well as the access size, since
    the actual unknown value lies somewhere in min_value <= x <= max_value.
    A further crafted test along the same lines produces the following
    verifier log excerpt:

    10: (3d) if r2 >= r1 goto pc+2
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
    R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
    11: (b7) r7 = 1
    12: (65) if r7 s> 0x0 goto pc+2
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
    R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp
    13: (b7) r0 = 0
    14: (95) exit

    from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
    R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp
    15: (0f) r7 += r1
    16: (65) if r7 s> 0x4 goto pc+2
    R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
    R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
    17: (0f) r0 += r7
    18: (72) *(u8 *)(r0 +0) = 0
    R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3
    R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
    19: (b7) r0 = 0
    20: (95) exit

    Meaning, in adjust_reg_min_max_vals() we must also reset range
    values on the dst when src/dst registers have mixed signed/
    unsigned derived min/max value bounds with one unbounded value
    as otherwise they can be added together deducing false boundaries.
    Once both boundaries are established from either ALU ops or
    compare operations w/o mixing signed/unsigned insns, then they
    can safely be added to other regs also having both boundaries
    established. Adding regs with one unbounded side to a map value
    where the bounded side has been learned w/o mixing ops is
    possible, but the resulting map value won't recover from that,
    meaning such an op is considered invalid at the time of actual
    access. Invalid bounds are set on the dst reg in case i) the src reg,
    or ii) the dst reg already had them. The only way to recover
    would be to perform i) ALU ops but only 'add' is allowed on map
    value types or ii) comparisons, but these are disallowed on
    pointers in case they span a range. This is fine as only BPF_JEQ
    and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers
    which potentially turn them into PTR_TO_MAP_VALUE type depending
    on the branch, so only here min/max value cannot be invalidated
    for them.
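
    To see why these two fall-through deductions must not simply be
    intersected, here is a small stand-alone C illustration (my own sketch,
    not verifier code) using the value -8 from the trace above:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t r1_u = (uint64_t)-8;  /* bit pattern stored at fp-16 */
        int64_t  r1_s = (int64_t)r1_u; /* same bits, viewed as signed */

        /* fall-through of insn 10, "if r1 > r2" with r2 = (u64)-1 */
        int fall_unsigned = !(r1_u > (uint64_t)-1);
        /* fall-through of insn 11, "if r1 s> 0x1" */
        int fall_signed = !(r1_s > 1);

        printf("unsigned fall-through taken: %d\n", fall_unsigned);
        printf("signed fall-through taken:   %d\n", fall_signed);
        printf("naively deduced bounds: [0, 1], actual value: %lld\n",
               (long long)r1_s);
        return 0;
    }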

    In terms of state pruning, value_from_signed is considered
    as well in states_equal() when dealing with adjusted map values.
    With regards to breaking existing programs, there is a small
    risk, but the use-cases where this could occur are rather narrow
    and mixing such compares is probably unlikely.

    Joint work with Josef and Edward.

    [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html

    Fixes: 484611357c19 ("bpf: allow access into map value arrays")
    Reported-by: Edward Cree
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Edward Cree
    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pull audit fix from Paul Moore:
    "A small audit fix, just a single line, to plug a memory leak in some
    audit error handling code"

    * 'stable-4.13' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix memleak in auditd_send_unicast_skb.

    Linus Torvalds
     

19 Jul, 2017

2 commits

  • Pull structure randomization updates from Kees Cook:
    "Now that IPC and other changes have landed, enable manual markings for
    randstruct plugin, including the task_struct.

    This is the rest of what was staged in -next for the gcc-plugins, and
    comes in three patches, largest first:

    - mark "easy" structs with __randomize_layout

    - mark task_struct with an optional anonymous struct to isolate the
    __randomize_layout section

    - mark structs to opt _out_ of automated marking (which will come
    later)

    And, FWIW, this continues to pass allmodconfig (normal and patched to
    enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
    s390 for me"

    * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    randstruct: opt-out externally exposed function pointer structs
    task_struct: Allow randomized layout
    randstruct: Mark various structs for randomization

    Linus Torvalds
     
  • Found this issue via a kmemleak report: auditd_send_unicast_skb
    did not free the skb if rcu_dereference(auditd_conn) returned NULL.

    unreferenced object 0xffff88082568ce00 (size 256):
    comm "auditd", pid 1119, jiffies 4294708499
    backtrace:
    [] kmemleak_alloc+0x4a/0xa0
    [] kmem_cache_alloc_node+0xcc/0x210
    [] __alloc_skb+0x5d/0x290
    [] audit_make_reply+0x54/0xd0
    [] audit_receive_msg+0x967/0xd70
    ----------------
    (gdb) list *audit_receive_msg+0x967
    0xffffffff8113dff7 is in audit_receive_msg (kernel/audit.c:1133).
    1132 skb = audit_make_reply(0, AUDIT_REPLACE, 0,
    0, &pvnr, sizeof(pvnr));
    ---------------
    [] audit_receive+0x52/0xa0
    [] netlink_unicast+0x181/0x240
    [] netlink_sendmsg+0x2c2/0x3b0
    [] sock_sendmsg+0x38/0x50
    [] SYSC_sendto+0x102/0x190
    [] SyS_sendto+0xe/0x10
    [] entry_SYSCALL_64_fastpath+0x1a/0xa5
    [] 0xffffffffffffffff

    Signed-off-by: Shu Wang
    Signed-off-by: Paul Moore

    Shu Wang
     

18 Jul, 2017

6 commits

  • Pull irq fix from Thomas Gleixner:
    "Fix the fallout from reworking the locking and resource management in
    request/free_irq()"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Keep chip buslock across irq_request/release_resources()

    Linus Torvalds
     
  • Pull SMP fix from Thomas Gleixner:
    "Replace the bogus BUG_ON in the cpu hotplug code"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    smp/hotplug: Replace BUG_ON and react useful

    Linus Torvalds
     
  • The BPF map devmap holds a refcnt on the net_device structure when
    it is in the map. We need to do this to ensure on driver unload we
    don't lose a dev reference.

    However, it's not very convenient to have to manually unload the map
    when destroying a net device so add notifier handlers to do the cleanup
    automatically. But this creates a race between update/destroy BPF
    syscall and programs and the unregister netdev hook.

    Unfortunately, the best I could come up with is either to live with
    requiring manual removal of net devices from the map before removing
    the net device OR to add a mutex in devmap to ensure the map is not
    modified while we are removing a device. The fallout also requires
    that BPF programs no longer update/delete the map from the BPF program
    side because the mutex may sleep and this can not be done from inside
    an rcu critical section. This is not a real problem though because I
    have not come up with any use cases where this is actually useful in
    practice. If/when we come up with a compelling user for this we may
    need to revisit this.
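
    A rough sketch of the notifier direction described above; my_devmap_lock,
    my_devmap_max_entries and my_devmap_netdev[] are illustrative stand-ins,
    not the actual kernel/bpf/devmap.c symbols:

    static int my_devmap_notification(struct notifier_block *nb,
                                      unsigned long event, void *ptr)
    {
        struct net_device *dev = netdev_notifier_info_to_dev(ptr);
        unsigned int i;

        if (event != NETDEV_UNREGISTER)
            return NOTIFY_DONE;

        /* serialize against map updates coming in via the bpf syscall */
        mutex_lock(&my_devmap_lock);
        for (i = 0; i < my_devmap_max_entries; i++) {
            if (my_devmap_netdev[i] == dev) {
                my_devmap_netdev[i] = NULL;
                dev_put(dev); /* drop the refcnt taken on insert */
            }
        }
        mutex_unlock(&my_devmap_lock);
        return NOTIFY_OK;
    }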

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    John Fastabend
     
  • For performance reasons we want to avoid updating the tail pointer in
    the driver tx ring as much as possible. To accomplish this we add
    batching support to the redirect path in XDP.

    This adds another ndo op "xdp_flush" that is used to inform the driver
    that it should bump the tail pointer on the TX ring.
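
    The resulting calling pattern, sketched from the description above (the
    dev, frames[], n and i variables are assumed, and the exact ndo
    signatures of that era may differ in detail):

    /* queue several frames first ... */
    for (i = 0; i < n; i++)
        dev->netdev_ops->ndo_xdp_xmit(dev, frames[i]);

    /* ... then tell the driver once to bump its TX tail pointer */
    if (dev->netdev_ops->ndo_xdp_flush)
        dev->netdev_ops->ndo_xdp_flush(dev);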

    Signed-off-by: John Fastabend
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     
  • BPF programs can use the devmap with a bpf_redirect_map() helper
    routine to forward packets to a netdevice in the map.
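
    For illustration, a minimal XDP program in the style of the kernel
    samples could use the helper roughly as follows (bpf_helpers.h provides
    SEC() and the helper declarations); the map name, the fixed slot and the
    section name are assumptions of this sketch:

    struct bpf_map_def SEC("maps") tx_port = {
        .type        = BPF_MAP_TYPE_DEVMAP,
        .key_size    = sizeof(int),
        .value_size  = sizeof(int),
        .max_entries = 64,
    };

    SEC("xdp_redirect_map")
    int xdp_redirect_map_prog(struct xdp_md *ctx)
    {
        int port = 0; /* devmap slot to forward to; fixed here for brevity */

        return bpf_redirect_map(&tx_port, port, 0);
    }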

    Signed-off-by: John Fastabend
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Device map (devmap) is a BPF map, primarily useful for networking
    applications, that uses a key to lookup a reference to a netdevice.

    The map provides a clean way for BPF programs to build virtual port
    to physical port maps. Additionally, it provides a scoping function
    for the redirect action itself allowing multiple optimizations. Future
    patches will leverage the map to provide batching at the XDP layer.

    Another optimization/feature, not yet implemented, would be to support
    multiple netdevices per key, for efficient multicast and broadcast.
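
    From user space, populating such a map could look roughly like the sketch
    below, using the tools/lib/bpf syscall wrappers of that era; the function
    name, map size and "eth1" egress device are assumptions:

    #include <net/if.h>
    #include <linux/bpf.h>
    #include "bpf/bpf.h" /* bpf_create_map(), bpf_map_update_elem() */

    int setup_tx_ports(void)
    {
        int map_fd, key = 0;
        int ifindex = if_nametoindex("eth1");

        map_fd = bpf_create_map(BPF_MAP_TYPE_DEVMAP, sizeof(int),
                                sizeof(int), 64, 0);
        if (map_fd < 0 || !ifindex)
            return -1;

        /* slot 0 now redirects to eth1's ifindex */
        return bpf_map_update_elem(map_fd, &key, &ifindex, BPF_ANY);
    }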

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    John Fastabend
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data
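
    The ->show_options() conversions listed above all take the same shape; a
    minimal hedged sketch, with foofs and its sb_info fields purely
    illustrative:

    static int foofs_show_options(struct seq_file *m, struct dentry *root)
    {
        struct foofs_sb_info *sbi = root->d_sb->s_fs_info; /* hypothetical */

        if (sbi->quiet)
            seq_puts(m, ",quiet");
        seq_printf(m, ",mode=%o", sbi->mode);
        return 0;
    }

    static const struct super_operations foofs_sops = {
        /* ... */
        .show_options = foofs_show_options,
    };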

    Linus Torvalds
     

15 Jul, 2017

3 commits

  • Pull power management fixes from Rafael Wysocki:
    "These fix a recently exposed issue in the PCI device wakeup code and
    one older problem related to PCI device wakeup that has been reported
    recently, modify one more piece of computations in intel_pstate to get
    rid of a rounding error, fix a possible race in the schedutil cpufreq
    governor, fix the device PM QoS sysfs interface to correctly handle
    invalid user input, fix return values of two probe routines in devfreq
    drivers and constify an attribute_group structure in devfreq.

    Specifics:

    - Avoid clearing the PCI PME Enable bit for devices as a result of
    config space restoration which confuses AML executed afterward and
    causes wakeup events to be lost on some systems (Rafael Wysocki).

    - Fix the native PCIe PME interrupts handling in the cases when the
    PME IRQ is set up as a system wakeup one so that runtime PM remote
    wakeup works as expected after system resume on systems where that
    happens (Rafael Wysocki).

    - Fix the device PM QoS sysfs interface to handle invalid user input
    correctly instead of using an uninitialized variable value as the
    latency tolerance for the device at hand (Dan Carpenter).

    - Get rid of one more rounding error from intel_pstate computations
    (Srinivas Pandruvada).

    - Fix the schedutil cpufreq governor to prevent it from possibly
    accessing uninitialized data structures from governor callbacks in
    some cases on systems when multiple CPUs share a single cpufreq
    policy object (Vikram Mulukutla).

    - Fix the return values of probe routines in two devfreq drivers
    (Gustavo Silva).

    - Constify an attribute_group structure in devfreq (Arvind Yadav)"

    * tag 'pm-fixes-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PCI / PM: Fix native PME handling during system suspend/resume
    PCI / PM: Restore PME Enable after config space restoration
    cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race
    PM / QoS: return -EINVAL for bogus strings
    cpufreq: intel_pstate: Fix ratio setting for min_perf_pct
    PM / devfreq: constify attribute_group structures.
    PM / devfreq: tegra: fix error return code in tegra_devfreq_probe()
    PM / devfreq: rk3399_dmc: fix error return code in rk3399_dmcfreq_probe()

    Linus Torvalds
     
  • If we reach the limit of modprobe_limit threads running, the next
    request_module() call will fail. The original reason for adding a kill
    was to do away with possible issues in old circumstances which could
    create a recursive series of request_module() calls.

    We can do better than just be super aggressive and reject calls once we've
    reached the limit by simply making pending callers wait until the
    threshold has been reduced, and then throttling them in, one by one.

    This throttling enables requests over the kmod concurrent limit to be
    processed once a pending request completes. Only the first item queued up
    to wait is woken up. The assumption here is that once a task is woken it
    will have no other option but to also kick the queue to check whether
    there are more pending tasks -- regardless of whether or not it was
    successful.

    By throttling and processing only max kmod concurrent tasks we ensure we
    avoid unexpected fatal request_module() calls, and we keep memory
    consumption on module loading to a minimum.
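
    The throttling can be pictured as a counter plus a waitqueue; a
    simplified sketch of the idea, not the exact kernel/kmod.c code, with
    MAX_KMOD_CONCURRENT standing in for the real limit:

    static atomic_t kmod_concurrent_max = ATOMIC_INIT(MAX_KMOD_CONCURRENT);
    static DECLARE_WAIT_QUEUE_HEAD(kmod_wq);

    int request_module_throttled(void)
    {
        /* wait for a free slot instead of failing outright */
        if (atomic_dec_if_positive(&kmod_concurrent_max) < 0)
            wait_event(kmod_wq,
                       atomic_dec_if_positive(&kmod_concurrent_max) >= 0);

        /* ... run the usermode helper here ... */

        atomic_inc(&kmod_concurrent_max);
        wake_up(&kmod_wq); /* let a waiter re-check the counter */
        return 0;
    }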

    With x86_64 qemu, with 4 cores, 4 GiB of RAM it takes the following run
    time to run both tests:

    time ./kmod.sh -t 0008
    real 0m16.366s
    user 0m0.883s
    sys 0m8.916s

    time ./kmod.sh -t 0009
    real 0m50.803s
    user 0m0.791s
    sys 0m9.852s

    Link: http://lkml.kernel.org/r/20170628223155.26472-4-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Reviewed-by: Petr Mladek
    Cc: Jessica Yu
    Cc: Shuah Khan
    Cc: Rusty Russell
    Cc: Michal Marek
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • After commit 73ce0511c436 ("kernel/watchdog.c: move hardlockup
    detector to separate file"), the term 'NMI watchdog' is no longer
    appropriate in kernel/watchdog.c, so use 'watchdog' only.

    Link: http://lkml.kernel.org/r/1499928642-48983-1-git-send-email-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Cc: Babu Moger
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     

14 Jul, 2017

3 commits

  • * pm-cpufreq-sched:
    cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race

    * intel_pstate:
    cpufreq: intel_pstate: Fix ratio setting for min_perf_pct

    Rafael J. Wysocki
     
  • Pull more tracing updates from Steven Rostedt:
    "A few more minor updates:

    - Show the tgid mappings for user space trace tools to use

    - Fix and optimize the comm and tgid cache recording

    - Sanitize derived kprobe names

    - Ftrace selftest updates

    - trace file header fix

    - Update of Documentation/trace/ftrace.txt

    - Compiler warning fixes

    - Fix possible uninitialized variable"

    * tag 'trace-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Fix uninitialized variable in match_records()
    ftrace: Remove an unneeded NULL check
    ftrace: Hide cached module code for !CONFIG_MODULES
    tracing: Do note expose stack_trace_filter without DYNAMIC_FTRACE
    tracing: Update Documentation/trace/ftrace.txt
    tracing: Fixup trace file header alignment
    selftests/ftrace: Add a testcase for kprobe event naming
    selftests/ftrace: Add a test to probe module functions
    selftests/ftrace: Update multiple kprobes test for powerpc
    trace/kprobes: Sanitize derived event names
    tracing: Attempt to record other information even if some fail
    tracing: Treat recording tgid for idle task as a success
    tracing: Treat recording comm for idle task as a success
    tracing: Add saved_tgids file to show cached pid to tgid mappings

    Linus Torvalds
     
  • Merge yet more updates from Andrew Morton:

    - various misc things

    - kexec updates

    - sysctl core updates

    - scripts/gdb updates

    - checkpoint-restart updates

    - ipc updates

    - kernel/watchdog updates

    - Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature"

    - "stackprotector: ascii armor the stack canary"

    - more MM bits

    - checkpatch updates

    * emailed patches from Andrew Morton: (96 commits)
    writeback: rework wb_[dec|inc]_stat family of functions
    ARM: samsung: usb-ohci: move inline before return type
    video: fbdev: omap: move inline before return type
    video: fbdev: intelfb: move inline before return type
    USB: serial: safe_serial: move __inline__ before return type
    drivers: tty: serial: move inline before return type
    drivers: s390: move static and inline before return type
    x86/efi: move asmlinkage before return type
    sh: move inline before return type
    MIPS: SMP: move asmlinkage before return type
    m68k: coldfire: move inline before return type
    ia64: sn: pci: move inline before type
    ia64: move inline before return type
    FRV: tlbflush: move asmlinkage before return type
    CRIS: gpio: move inline before return type
    ARM: HP Jornada 7XX: move inline before return type
    ARM: KVM: move asmlinkage before type
    checkpatch: improve the STORAGE_CLASS test
    mm, migration: do not trigger OOM killer when migrating memory
    drm/i915: use __GFP_RETRY_MAYFAIL
    ...

    Linus Torvalds
     

13 Jul, 2017

18 commits

  • Pull modules updates from Jessica Yu:
    "Summary of modules changes for the 4.13 merge window:

    - Minor code cleanups

    - Avoid accessing mod struct prior to checking module struct version,
    from Kees

    - Fix racy atomic inc/dec logic of kmod_concurrent_max in kmod, from
    Luis"

    * tag 'modules-for-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
    module: make the modinfo name const
    kmod: reduce atomic operations on kmod_concurrent and simplify
    module: use list_for_each_entry_rcu() on find_module_all()
    kernel/module.c: suppress warning about unused nowarn variable
    module: Add module name to modinfo
    module: Pass struct load_info into symbol checks

    Linus Torvalds
     
  • Use the ascii-armor canary to prevent unterminated C string overflows
    from being able to successfully overwrite the canary, even if they
    somehow obtain the canary value.

    Inspired by execshield ascii-armor and Daniel Micay's linux-hardened
    tree.

    Link: http://lkml.kernel.org/r/20170524155751.424-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Acked-by: Kees Cook
    Cc: Daniel Micay
    Cc: "Theodore Ts'o"
    Cc: H. Peter Anvin
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Catalin Marinas
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Defining kexec_purgatory as a zero-length char array upsets compile time
    size checking. Since this is built on a per-arch basis, define it as an
    unsized char array (like is done for other similar things, e.g. linker
    sections). This silences the warning generated by the future
    CONFIG_FORTIFY_SOURCE, which did not like the memcmp() of a "0 byte"
    array. This drops the __weak and uses an extern instead, since both
    users define kexec_purgatory.

    Link: http://lkml.kernel.org/r/1497903987-21002-4-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Acked-by: "Eric W. Biederman"
    Cc: Daniel Micay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • After reconfiguring watchdog sysctls etc., architecture specific
    watchdogs may not get all their parameters updated.

    watchdog_nmi_reconfigure() can be implemented to pull the new values in
    and set the arch NMI watchdog.

    [npiggin@gmail.com: add code comments]
    Link: http://lkml.kernel.org/r/20170617125933.774d3858@roar.ozlabs.ibm.com
    [arnd@arndb.de: hide unused function]
    Link: http://lkml.kernel.org/r/20170620204854.966601-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170616065715.18390-5-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Don Zickus
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
    HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.

    LOCKUP_DETECTOR implies the general boot, sysctl, and programming
    interfaces for the lockup detectors.

    An architecture that wants to use a hard lockup detector must define
    HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.

    Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
    minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
    does not implement the LOCKUP_DETECTOR interfaces.

    sparc is unusual in that it has started to implement some of the
    interfaces, but not fully yet. It should probably be converted to a full
    HAVE_HARDLOCKUP_DETECTOR_ARCH.

    [npiggin@gmail.com: fix]
    Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
    Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Don Zickus
    Reviewed-by: Babu Moger
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • For architectures that define HAVE_NMI_WATCHDOG, instead of having them
    provide the complete touch_nmi_watchdog() function, just have them
    provide arch_touch_nmi_watchdog().

    This gives the generic code more flexibility in implementing this
    function, and arch implementations don't miss out on touching the
    softlockup watchdog or other generic details.

    Link: http://lkml.kernel.org/r/20170616065715.18390-3-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Don Zickus
    Reviewed-by: Babu Moger
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Add a /proc/self/task/<tid>/fail-nth file that allows failing the
    0-th, 1-st, 2-nd and so on calls systematically.
    Excerpt from the added documentation:

    "Write to this file of integer N makes N-th call in the current task
    fail (N is 0-based). Read from this file returns a single char 'Y' or
    'N' that says if the fault setup with a previous write to this file
    was injected or not, and disables the fault if it wasn't yet injected.
    Note that this file enables all types of faults (slab, futex, etc).
    This setting takes precedence over all other generic settings like
    probability, interval, times, etc. But per-capability settings (e.g.
    fail_futex/ignore-private) take precedence over it. This feature is
    intended for systematic testing of faults in a single system call. See
    an example below"
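
    Following the documented interface quoted above, a user-space test could
    look roughly like this sketch (not the example shipped with the patch):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/socket.h>

    int main(void)
    {
        char path[64], res;
        int fds[2], fail_nth;

        snprintf(path, sizeof(path), "/proc/self/task/%ld/fail-nth",
                 (long)syscall(SYS_gettid));
        fail_nth = open(path, O_RDWR);
        if (fail_nth < 0)
            return 1;

        write(fail_nth, "0", 1); /* fail the 0-th following call */
        socketpair(AF_LOCAL, SOCK_STREAM, 0, fds); /* may now fail */
        read(fail_nth, &res, 1); /* 'Y' if the fault was injected */
        printf("fault injected: %c\n", res);
        close(fail_nth);
        return 0;
    }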

    Why add a new setting:
    1. Existing settings are global rather than per-task.
    So parallel testing is not possible.
    2. attr->interval is close but it depends on attr->count
    which is not reset to 0, so interval does not work as expected.
    3. Trying to model this with existing settings requires manipulations
    of all of probability, interval, times, space, task-filter and
    unexposed count and per-task make-it-fail files.
    4. Existing settings are per-failure-type, and the set of failure
    types is potentially expanding.
    5. make-it-fail can't be changed by an unprivileged user and aggressive
    stress testing better be done from an unprivileged user.
    Similarly, this would require opening the debugfs files to the
    unprivileged user, as he would need to reopen at least the times file
    (not possible to pre-open before dropping privs).

    The proposed interface solves all of the above (see the example).

    We want to integrate this into the syzkaller fuzzer. A prototype has
    found 10 bugs in the kernel in the first day of usage:

    https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance

    I've made the current interface work with all types of our sandboxes.
    For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
    make /proc entries non-root owned. So I am fine with the current
    version of the code.

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Cc: Akinobu Mita
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • With the current epoll architecture, target files are addressed by
    file_struct and file descriptor number, where the latter is not unique.
    Moreover, files can be transferred from another process via a unix socket,
    added into the queue and then closed, so we won't find this descriptor in
    the task fdinfo list.

    Thus to checkpoint and restore such processes CRIU needs to find out
    where exactly the target file is present to add it into epoll queue.
    For this sake one can use kcmp call where some particular target file
    from the queue is compared with arbitrary file passed as an argument.

    Because epoll target files can have same file descriptor number but
    different file_struct a caller should explicitly specify the offset
    within.

    To test whether some particular file matches an entry inside epoll, one
    has to (a minimal user-space sketch follows this list):

    - fill kcmp_epoll_slot structure with epoll file descriptor,
    target file number and target file offset (in case if only
    one target is present then it should be 0)

    - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
    - the kernel fetches the file pointer matching file descriptor @fd of pid1
    - it looks up the file struct in the epoll queue of pid2 and returns the
    traditional 0,1,2 result for sorting purposes
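
    Putting these steps together, a minimal user-space sketch (assuming uapi
    headers that already define KCMP_EPOLL_TFD and struct kcmp_epoll_slot):

    #include <sys/types.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/kcmp.h>

    /* does @fd in @pid1 match the epoll target (@epfd, @tfd, @toff) in @pid2? */
    static int epoll_target_matches(pid_t pid1, pid_t pid2, int fd,
                                    int epfd, int tfd, unsigned int toff)
    {
        struct kcmp_epoll_slot slot = {
            .efd  = epfd,
            .tfd  = tfd,
            .toff = toff, /* 0 when only one target is present */
        };

        /* 0 means "same file"; 1 and 2 give an ordering for sorting */
        return syscall(SYS_kcmp, pid1, pid2, KCMP_EPOLL_TFD, fd,
                       (unsigned long)&slot);
    }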

    Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Andrey Vagin
    Cc: Al Viro
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Jason Baron
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Prevent use of uninitialized memory (originating from the stack frame of
    do_sysctl()) by verifying that the name array is filled with sufficient
    input data before comparing its specific entries with integer constants.

    Through timing measurement or analyzing the kernel debug logs, a
    user-mode program could potentially infer the results of comparisons
    against the uninitialized memory, and acquire some (very limited)
    information about the state of the kernel stack. The change also
    eliminates possible future warnings by tools such as KMSAN and other
    code checkers / instrumentations.

    Link: http://lkml.kernel.org/r/20170524122139.21333-1-mjurczyk@google.com
    Signed-off-by: Mateusz Jurczyk
    Acked-by: Kees Cook
    Cc: "David S. Miller"
    Cc: Matthew Whitehead
    Cc: "Eric W. Biederman"
    Cc: Tetsuo Handa
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Jurczyk
     
  • To keep parity with the regular int interfaces, provide an unsigned int
    proc_douintvec_minmax() which allows you to specify a range of allowed
    valid numbers.

    Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
    actual user for that.
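
    Usage mirrors proc_dointvec_minmax(); a hedged example of registering an
    unsigned int knob with a [1, 100] range, with all names illustrative:

    static unsigned int demo_val = 10;
    static unsigned int demo_min = 1;
    static unsigned int demo_max = 100;

    static struct ctl_table demo_table[] = {
        {
            .procname     = "demo_uint",
            .data         = &demo_val,
            .maxlen       = sizeof(unsigned int),
            .mode         = 0644,
            .proc_handler = proc_douintvec_minmax,
            .extra1       = &demo_min,
            .extra2       = &demo_max,
        },
        { }
    };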

    Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") added proc_douintvec() to start help adding support for
    unsigned int, this however was only half the work needed. Two fixes
    have come in since then for the following issues:

    o Printing the values shows a negative value; this happens since
    do_proc_dointvec() is used and this uses proc_put_long()

    This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative
    flag for proc_douintvec").

    o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
    o We echo negative values in and they are accepted

    This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value
    is larger than UINT_MAX for proc_douintvec").

    It still also failed to be added to sysctl_check_table()... instead of
    adding it with the current implementation, just provide proper and
    simplified unsigned int support: without any unsigned int array support
    and with no negative support at all.

    Historically sysctl proc helpers have supported arrays; due to the
    complexity this adds, though, we've taken a step back to evaluate array
    users to determine whether it's worth keeping up for unsigned int. An
    evaluation using Coccinelle has been done to perform a grammatical
    search to ask ourselves:

    o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
    Answer: about 8
    - Of these how many are array users ?
    Answer: Probably only 1
    o How many sysctl array users exist ?
    Answer: about 12

    This last question gives us an idea of just how popular arrays are: they are not.
    Array support should probably just be kept for strings.

    The identified uint ports are:

    drivers/infiniband/core/ucma.c - max_backlog
    drivers/infiniband/core/iwcm.c - default_backlog
    net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
    net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
    net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
    net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
    net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
    net/phonet/sysctl.c proc_local_port_range()

    The only possible array user is proc_local_port_range(), but it does not
    seem worth it to add array support just for this, given that the range
    support works just as well. Unsigned int support should be more desirable
    for when you *need* more than INT_MAX, or when int min/max support does
    not suffice for your ranges.

    If you forget and by mistake happen to register an unsigned int proc
    entry with an array, the driver will fail and you will get something as
    follows:

    sysctl table check failed: debug/test_sysctl//uint_0002 array not allowed
    CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    Call Trace:
    dump_stack+0x63/0x81
    __register_sysctl_table+0x350/0x650
    ? kmem_cache_alloc_trace+0x107/0x240
    __register_sysctl_paths+0x1b3/0x1e0
    ? 0xffffffffc005f000
    register_sysctl_table+0x1f/0x30
    test_sysctl_init+0x10/0x1000 [test_sysctl]
    do_one_initcall+0x52/0x1a0
    ? kmem_cache_alloc_trace+0x107/0x240
    do_init_module+0x5f/0x200
    load_module+0x1867/0x1bd0
    ? __symbol_put+0x60/0x60
    SYSC_finit_module+0xdf/0x110
    SyS_finit_module+0xe/0x10
    entry_SYSCALL_64_fastpath+0x1e/0xad
    RIP: 0033:0x7f042b22d119

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Alexey Dobriyan
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Liping Zhang
    Cc: Alexey Dobriyan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • The mode sysctl_writes_strict positional checks keep being copied and pasted
    as we add new proc handlers. Just add a helper to avoid code duplication.

    Link: http://lkml.kernel.org/r/20170519033554.18592-4-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Document the different sysctl_writes_strict modes in code.

    Link: http://lkml.kernel.org/r/20170519033554.18592-3-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Currently vmcoreinfo data is updated at boot time via subsys_initcall();
    it runs the risk of being modified by some wrong code while the system
    is running.

    As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
    when using "crash", "makedumpfile", etc utility to parse this vmcore, we
    probably will get "Segmentation fault" or other unexpected errors.

    E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
    system; 3) trigger kdump, then we obviously will fail to recognize the
    crash context correctly due to the corrupted vmcoreinfo.

    Now, except for vmcoreinfo, all the crash data is well
    protected (including the cpu note, which is fully updated in the crash
    path, thus its correctness is guaranteed). Given that vmcoreinfo data
    is a large chunk prepared for kdump, we had better protect it as well.

    To solve this, we relocate and copy vmcoreinfo_data to the crash memory
    when kdump is loading via kexec syscalls. Because the whole crash
    memory will be protected by existing arch_kexec_protect_crashkres()
    mechanism, we naturally protect vmcoreinfo_data from write(even read)
    access under kernel direct mapping after kdump is loaded.

    Since kdump is usually loaded at the very early stage after boot, we can
    trust the correctness of the vmcoreinfo data copied.

    On the other hand, we still need to operate on the vmcoreinfo safe copy
    when a crash happens, to generate vmcoreinfo_note again; we rely on vmap()
    to map out a new kernel virtual address and update the code to use this
    new one instead in the following crash_save_vmcoreinfo().

    BTW, we do not touch vmcoreinfo_note, because it will be fully updated
    using the protected vmcoreinfo_data after crash which is surely correct
    just like the cpu crash note.

    Link: http://lkml.kernel.org/r/1493281021-20737-3-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Tested-by: Michael Holzheu
    Cc: Benjamin Herrenschmidt
    Cc: Dave Young
    Cc: Eric Biederman
    Cc: Hari Bathini
    Cc: Juergen Gross
    Cc: Mahesh Salgaonkar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • vmcoreinfo_max_size stands for vmcoreinfo_data; the correct one we
    should use is vmcoreinfo_note, whose total size is VMCOREINFO_NOTE_SIZE.

    Like explained in commit 77019967f06b ("kdump: fix exported size of
    vmcoreinfo note"), it should not affect the actual function, but we
    better fix it, also this change should be safe and backward compatible.

    After this, we can get rid of variable vmcoreinfo_max_size, let's use
    the corresponding macros directly, fewer variables means more safety for
    vmcoreinfo operation.

    [xlpang@redhat.com: fix build warning]
    Link: http://lkml.kernel.org/r/1494830606-27736-1-git-send-email-xlpang@redhat.com
    Link: http://lkml.kernel.org/r/1493281021-20737-2-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Reviewed-by: Mahesh Salgaonkar
    Reviewed-by: Dave Young
    Cc: Hari Bathini
    Cc: Benjamin Herrenschmidt
    Cc: Eric Biederman
    Cc: Juergen Gross
    Cc: Michael Holzheu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • As Eric said,
    "what we need to do is move the variable vmcoreinfo_note out of the
    kernel's .bss section. And modify the code to regenerate and keep this
    information in something like the control page.

    Definitely something like this needs a page all to itself, and ideally
    far away from any other kernel data structures. I clearly was not
    watching closely the day someone decided to keep this silly thing in
    the kernel's .bss section."

    This patch allocates extra pages for these vmcoreinfo_XXX variables; one
    advantage is that it enhances the safety of vmcoreinfo, because
    vmcoreinfo is now kept far away from other kernel data structures.

    Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Tested-by: Michael Holzheu
    Reviewed-by: Juergen Gross
    Suggested-by: Eric Biederman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Young
    Cc: Hari Bathini
    Cc: Mahesh Salgaonkar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • The reason to disable interrupts seems to be to avoid switching to a
    different processor while handling per cpu data using individual loads and
    stores. If we use per cpu RMW (read-modify-write) primitives we will not
    have to disable interrupts.
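
    As a hedged illustration of the difference (my_counter is a hypothetical
    per-CPU variable, not something from the patch):

    static DEFINE_PER_CPU(unsigned long, my_counter);

    /* old pattern: pin the update to one CPU by disabling interrupts */
    static void bump_old(void)
    {
        unsigned long flags;

        local_irq_save(flags);
        __this_cpu_inc(my_counter);
        local_irq_restore(flags);
    }

    /* with a per cpu RMW primitive the irq disable/enable goes away */
    static void bump_new(void)
    {
        this_cpu_inc(my_counter);
    }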

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705171055130.5898@east.gentwo.org
    Signed-off-by: Christoph Lameter
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Reported-and-tested-by: Meelis Roos
    Fixes: commit d9e968cb9f84 "getrlimit()/setrlimit(): move compat to native"
    Signed-off-by: Al Viro
    Acked-by: David S. Miller
    Signed-off-by: Linus Torvalds

    Al Viro
     

12 Jul, 2017

3 commits

  • My static checker complains that if "func" is NULL then "clear_filter"
    is uninitialized. This seems like it could be true, although it's
    possible something subtle is happening that I haven't seen.

    kernel/trace/ftrace.c:3844 match_records()
    error: uninitialized symbol 'clear_filter'.

    Link: http://lkml.kernel.org/r/20170712073556.h6tkpjcdzjaozozs@mwanda

    Cc: stable@vger.kernel.org
    Fixes: f0a3b154bd7 ("ftrace: Clarify code for mod command")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Steven Rostedt (VMware)

    Dan Carpenter
     
  • "func" can't be NULL and it doesn't make sense to check because we've
    already dereferenced it.

    Link: http://lkml.kernel.org/r/20170712073340.4enzeojeoupuds5a@mwanda

    Signed-off-by: Dan Carpenter
    Signed-off-by: Steven Rostedt (VMware)

    Dan Carpenter
     
  • With a shared policy in place, when one of the CPUs in the policy is
    hotplugged out and then brought back online, sugov_stop() and
    sugov_start() are called in order.

    sugov_stop() removes utilization hooks for each CPU in the policy and
    does nothing else in the for_each_cpu() loop. sugov_start() on the
    other hand iterates through the CPUs in the policy and re-initializes
    the per-cpu structure _and_ adds the utilization hook. This implies
    that the scheduler is allowed to invoke a CPU's utilization update
    hook when the rest of the per-cpu structures have yet to be
    re-inited.

    Apart from some strange values in tracepoints this doesn't cause a
    problem, but if we do end up accessing a pointer from the per-cpu
    sugov_cpu structure somewhere in the sugov_update_shared() path,
    we will likely see crashes since the memset for another CPU in the
    policy is free to race with sugov_update_shared from the CPU that is
    ready to go. So let's fix this now to first init all per-cpu
    structures, and then add the per-cpu utilization update hooks all at
    once.
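
    A sketch of the resulting ordering in sugov_start(), simplified from the
    description above; the field names are close to, but not guaranteed to
    match, the actual code:

    /* pass 1: fully initialize every per-CPU structure in the policy */
    for_each_cpu(cpu, policy->cpus) {
        struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

        memset(sg_cpu, 0, sizeof(*sg_cpu));
        sg_cpu->sg_policy = sg_policy;
    }

    /* pass 2: only now expose the utilization update hooks */
    for_each_cpu(cpu, policy->cpus) {
        struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

        cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
                                     policy_is_shared(policy) ?
                                         sugov_update_shared :
                                         sugov_update_single);
    }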

    Signed-off-by: Vikram Mulukutla
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Vikram Mulukutla