18 Aug, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Buffered powersave frame test is reversed in cfg80211, fix from Felix
    Fietkau.

    2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

    3) Fix some tg3 ethtool logic bugs, and one that would cause no
    interrupts to be generated when rx-coalescing is set to 0. From
    Satish Baddipadige and Siva Reddy Kallam.

    4) QLCNIC mailbox corruption and napi budget handling fix from Manish
    Chopra.

    5) Fix fib_trie logic when walking the trie during /proc/net/route
    output that can access a stale node pointer. From David Forster.

    6) Several sctp_diag fixes from Phil Sutter.

    7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

    8) Checksum fixup fixes in bpf from Daniel Borkmann.

    9) Memory leaks in nfnetlink, from Liping Zhang.

    10) Use after free in rxrpc, from David Howells.

    11) Use after free in new skb_array code of macvtap driver, from Jason
    Wang.

    12) Calipso resource leak, from Colin Ian King.

    13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

    14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

    15) Fix lockdep splats in macsec, from Sabrina Dubroca.

    16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
    handling.

    17) Various tc-action bug fixes, from CONG Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    net_sched: allow flushing tc police actions
    net_sched: unify the init logic for act_police
    net_sched: convert tcf_exts from list to pointer array
    net_sched: move tc offload macros to pkt_cls.h
    net_sched: fix a typo in tc_for_each_action()
    net_sched: remove an unnecessary list_del()
    net_sched: remove the leftover cleanup_a()
    mlxsw: spectrum: Allow packets to be trapped from any PG
    mlxsw: spectrum: Unmap 802.1Q FID before destroying it
    mlxsw: spectrum: Add missing rollbacks in error path
    mlxsw: reg: Fix missing op field fill-up
    mlxsw: spectrum: Trap loop-backed packets
    mlxsw: spectrum: Add missing packet traps
    mlxsw: spectrum: Mark port as active before registering it
    mlxsw: spectrum: Create PVID vPort before registering netdevice
    mlxsw: spectrum: Remove redundant errors from the code
    mlxsw: spectrum: Don't return upon error in removal path
    i40e: check for and deal with non-contiguous TCs
    ixgbe: Re-enable ability to toggle VLAN filtering
    ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
    ...

    Linus Torvalds
     

13 Aug, 2016

9 commits

  • While hashing out BPF's current_task_under_cgroup helper bits, it came
    to discussion that the skb_in_cgroup helper name was suboptimally chosen.

    Tejun says:

    So, I think in_cgroup should mean that the object is in that
    particular cgroup while under_cgroup in the subhierarchy of that
    cgroup. Let's rename the other subhierarchy test to under too. I
    think that'd be a lot less confusing going forward.

    [...]

    It's more intuitive and gives us the room to implement the real
    "in" test if ever necessary in the future.

    Since this touches uapi bits, we need to make this change before v4.8
    is officially released. Thus, change the helper enum and rename the
    related bits.

    Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
    Reference: http://patchwork.ozlabs.org/patch/658500/
    Suggested-by: Sargun Dhillon
    Suggested-by: Tejun Heo
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Pull power management fixes from Rafael Wysocki:
    "Two hibernation fixes allowing it to work with the recently added
    randomization of the kernel identity mapping base on x86-64 and one
    cpufreq driver regression fix.

    Specifics:

    - Fix the x86 identity mapping creation helpers to avoid the
    assumption that the base address of the mapping will always be
    aligned at the PGD level, as it may be aligned at the PUD level if
    address space randomization is enabled (Rafael Wysocki).

    - Fix the hibernation core to avoid executing tracing functions
    before restoring the processor state completely during resume
    (Thomas Garnier).

    - Fix a recently introduced regression in the powernv cpufreq driver
    that causes it to crash due to an out-of-bounds array access
    (Akshay Adiga)"

    * tag 'pm-4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / hibernate: Restore processor state before using per-CPU variables
    x86/power/64: Always create temporary identity mapping correctly
    cpufreq: powernv: Fix crash in gpstate_timer_handler()

    Linus Torvalds
     
  • Pull timer fixes from Ingo Molnar:
    "Misc fixes: a /dev/rtc regression fix, two APIC timer period
    calibration fixes, an ARM clocksource driver fix and a NOHZ
    power use regression fix"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/hpet: Fix /dev/rtc breakage caused by RTC cleanup
    x86/timers/apic: Inform TSC deadline clockevent device about recalibration
    x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC clockevents frequency roundoff error
    timers: Fix get_next_timer_interrupt() computation
    clocksource/arm_arch_timer: Force per-CPU interrupt to be level-triggered

    Linus Torvalds
     
  • * pm-sleep:
    PM / hibernate: Restore processor state before using per-CPU variables
    x86/power/64: Always create temporary identity mapping correctly

    * pm-cpufreq:
    cpufreq: powernv: Fix crash in gpstate_timer_handler()

    Rafael J. Wysocki
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: cputime fixes, two deadline scheduler fixes and a cgroups
    scheduling fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Fix omitted ticks passed in parameter
    sched/cputime: Fix steal time accounting
    sched/deadline: Fix lock pinning warning during CPU hotplug
    sched/cputime: Mitigate performance regression in times()/clock_gettime()
    sched/fair: Fix typo in sync_throttle()
    sched/deadline: Fix wrap-around in DL heap

    Linus Torvalds
     
  • Restore the processor state before calling any other functions to
    ensure per-CPU variables can be used with KASLR memory randomization.

    Tracing functions use per-CPU variables (GS based on x86) and one was
    called just before restoring the processor state fully. It resulted
    in a double fault when both the tracing & the exception handler
    functions tried to use a per-CPU variable.

    Fixes: bb3632c6101b (PM / sleep: trace events for suspend/resume)
    Reported-and-tested-by: Borislav Petkov
    Reported-by: Jiri Kosina
    Tested-by: Rafael J. Wysocki
    Tested-by: Jiri Kosina
    Signed-off-by: Thomas Garnier
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Thomas Garnier
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, plus two uncore-PMU fixes, an uprobes fix, a
    perf-cgroups fix and an AUX events fix"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/uncore: Add enable_box for client MSR uncore
    perf/x86/intel/uncore: Fix uncore num_counters
    uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructions
    perf/core: Set cgroup in CPU contexts for new cgroup events
    perf/core: Fix sideband list-iteration vs. event ordering NULL pointer deference crash
    perf probe ppc64le: Fix probe location when using DWARF
    perf probe: Add function to post process kernel trace events
    tools: Sync cpufeatures headers with the kernel
    toops: Sync tools/include/uapi/linux/bpf.h with the kernel
    tools: Sync cpufeatures.h and vmx.h with the kernel
    perf probe: Support signedness casting
    perf stat: Avoid skew when reading events
    perf probe: Fix module name matching
    perf probe: Adjust map->reloc offset when finding kernel symbol from map
    perf hists: Trim libtraceevent trace_seq buffers
    perf script: Add 'bpf-output' field to usage message

    Linus Torvalds
     
  • Pull locking fixes from Ingo Molnar:
    "Misc fixes: lockstat fix, futex fix on !MMU systems, big endian fix
    for qrwlocks and a race fix for pvqspinlocks"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/pvqspinlock: Fix a bug in qstat_read()
    locking/pvqspinlock: Fix double hash race
    locking/qrwlock: Fix write unlock bug on big endian systems
    futex: Assume all mappings are private on !MMU systems

    Linus Torvalds
     
  • Pull irq fix from Ingo Molnar:
    "A fix for an MSI regression"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq/msi: Make sure PCI MSIs are activated early

    Linus Torvalds
     

11 Aug, 2016

2 commits

  • Commit:

    f9bcf1e0e014 ("sched/cputime: Fix steal time accounting")

    ... fixes a leak in steal time accounting, but forgets to account
    for the ticks passed in as a parameter, assuming there is only one
    tick to take into account.

    Let's take that parameter into account again.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Wanpeng Li
    Cc: Linus Torvalds
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Cc: linux-tip-commits@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160811125822.GB4214@lerouge
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Commit:

    57430218317 ("sched/cputime: Count actually elapsed irq & softirq time")

    ... didn't take steal time into consideration when the noirqtime
    kernel parameter is used.

    As Paolo pointed out before:

    | Why not? If idle=poll, for example, any time the guest is suspended (and
    | thus cannot poll) does count as stolen time.

    This patch fixes it by subtracting steal time from idle time accounting
    when the noirqtime parameter is true. The average idle time drops from
    56.8% to 54.75% for a nohz idle KVM guest (noirqtime, idle=poll, four
    vCPUs running on one pCPU).

    Signed-off-by: Wanpeng Li
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra (Intel)
    Cc: Peter Zijlstra
    Cc: Radim
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1470893795-3527-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

10 Aug, 2016

11 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    It's obviously wrong to set stat to NULL, so let's remove it.
    Otherwise it is always zero when we check the latency of kick/wake.

    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1468405414-3700-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     
    When the lock holder vCPU is racing with the queue head:

    CPU 0 (lock holder)                   CPU1 (queue head)
    ===================                   =================
    spin_lock();                          spin_lock();
    pv_kick_node():                       pv_wait_head_or_lock():
                                            if (!lp) {
                                              lp = pv_hash(lock, pn);
                                              xchg(&l->locked, _Q_SLOW_VAL);
                                            }
                                            WRITE_ONCE(pn->state, vcpu_halted);
      cmpxchg(&pn->state,
              vcpu_halted, vcpu_hashed);
      WRITE_ONCE(l->locked, _Q_SLOW_VAL);
      (void)pv_hash(lock, pn);

    In this case, the lock holder inserts the pv_node of the queue head
    into the hash table and sets _Q_SLOW_VAL unnecessarily. This patch
    avoids that by restoring/setting the vcpu_hashed state after failing
    adaptive lock spinning.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Pan Xinhui
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    The following warning can be triggered by hot-unplugging the CPU
    on which an active SCHED_DEADLINE task is running:

    WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
    releasing a pinned lock
    Call Trace:
    dump_stack+0x99/0xd0
    __warn+0xd1/0xf0
    ? dl_task_timer+0x1a1/0x2b0
    warn_slowpath_fmt+0x4f/0x60
    ? sched_clock+0x13/0x20
    lock_release+0x690/0x6a0
    ? enqueue_pushable_dl_task+0x9b/0xa0
    ? enqueue_task_dl+0x1ca/0x480
    _raw_spin_unlock+0x1f/0x40
    dl_task_timer+0x1a1/0x2b0
    ? push_dl_task.part.31+0x190/0x190
    WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
    unpinning an unpinned lock
    Call Trace:
    dump_stack+0x99/0xd0
    __warn+0xd1/0xf0
    warn_slowpath_fmt+0x4f/0x60
    lock_unpin_lock+0x181/0x1a0
    dl_task_timer+0x127/0x2b0
    ? push_dl_task.part.31+0x190/0x190

    As per the comment before this code, it's safe to drop the RQ lock
    here, and since we (potentially) change rq, unpin and repin to avoid
    the splat.

    Signed-off-by: Wanpeng Li
    [ Rewrote changelog. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Commit:

    6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

    fixed a problem whereby clock_nanosleep() followed by clock_gettime()
    could allow a task to wake early. It addressed the problem by calling
    the scheduling class's update_curr() when the cputimer starts.

    That change induced a considerable performance regression in the
    times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) syscalls. Some
    debuggers and applications that monitor their own performance
    accidentally depend on the performance of these specific calls.

    This patch mitigates the performance loss by prefetching data into the
    CPU cache, as stalls due to cache misses appear to be where most of the
    time is spent in our benchmarks.

    Here are the performance gains of this patch over v4.7-rc7 on a Sandy
    Bridge box with 32 logical cores and 2 NUMA nodes. The test is repeated
    with a variable number of threads, from 2 to 4*num_cpus; the results are
    in seconds and correspond to the average of 10 runs; the percentage gain
    is computed as (before-after)/before, so a positive value is an
    improvement (it's faster). The improvement varies between a few percent
    for 5-20 threads and more than 10% for 2 or >20 threads.

    pound_clock_gettime:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
    2             3.48        3.06 ( 11.83%)
    5             3.33        3.25 (  2.40%)
    8             3.37        3.26 (  3.30%)
    12            3.32        3.37 ( -1.60%)
    21            4.01        3.90 (  2.74%)
    30            3.63        3.36 (  7.41%)
    48            3.71        3.11 ( 16.27%)
    79            3.75        3.16 ( 15.74%)
    110           3.81        3.25 ( 14.80%)
    128           3.88        3.31 ( 14.76%)

    pound_times:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
    2             3.65        3.25 ( 11.03%)
    5             3.45        3.17 (  7.92%)
    8             3.52        3.22 (  8.69%)
    12            3.29        3.36 ( -2.04%)
    21            4.07        3.92 (  3.78%)
    30            3.87        3.40 ( 12.17%)
    48            3.79        3.16 ( 16.61%)
    79            3.88        3.28 ( 15.42%)
    110           3.90        3.38 ( 13.35%)
    128           4.00        3.38 ( 15.45%)

    pound_times and pound_clock_gettime are two benchmarks included in
    the MMTests framework. They launch a given number of threads which
    repeatedly call times() or clock_gettime(). The results above can be
    reproduced by cloning MMTests from GitHub and running the "poundtime"
    workload:

    $ git clone https://github.com/gormanm/mmtests.git
    $ cd mmtests
    $ cp configs/config-global-dhp__workload_poundtime config
    $ ./run-mmtests.sh --run-monitor $(uname -r)

    The above will run "poundtime", measuring the kernel currently running
    on the machine; once a new kernel is installed and the machine rebooted,
    running again

    $ cd mmtests
    $ ./run-mmtests.sh --run-monitor $(uname -r)

    will produce results to compare with. A comparison table will be output
    with:

    $ cd mmtests/work/log
    $ ../../compare-kernels.sh

    the table will contain a lot of entries; grepping for "Amean" (as in
    "arithmetic mean") will give the tables presented above. The source code
    for the two benchmarks is reported at the end of this changelog for
    clarity.

    The cache misses addressed by this patch were found using a combination of
    `perf top`, `perf record` and `perf annotate`. The incriminated lines were
    found to be

    struct sched_entity *curr = cfs_rq->curr;

    and

    delta_exec = now - curr->exec_start;

    in the function update_curr() from kernel/sched/fair.c. This patch
    prefetches the data from memory just before update_curr() is called in
    the affected execution path.

    A comparison of the total number of cycles before and after the patch
    follows; the data is obtained using `perf stat -r 10 -ddd `
    running over the same sequence of number of threads used above (a positive
    gain is an improvement):

    threads  cycles before            cycles after             gain

    2         19,699,563,964 +-1.19%   17,358,917,517 +-1.85%  11.88%
    5         47,401,089,566 +-2.96%   45,103,730,829 +-0.97%   4.85%
    8         80,923,501,004 +-3.01%   71,419,385,977 +-0.77%  11.74%
    12       112,326,485,473 +-0.47%  110,371,524,403 +-0.47%   1.74%
    21       193,455,574,299 +-0.72%  180,120,667,904 +-0.36%   6.89%
    30       315,073,519,013 +-1.64%  271,222,225,950 +-1.29%  13.92%
    48       321,969,515,332 +-1.48%  273,353,977,321 +-1.16%  15.10%
    79       337,866,003,422 +-0.97%  289,462,481,538 +-1.05%  14.33%
    110      338,712,691,920 +-0.78%  290,574,233,170 +-0.77%  14.21%
    128      348,384,794,006 +-0.50%  292,691,648,206 +-0.66%  15.99%

    A comparison of cache miss vs total cache loads ratios, before and after
    the patch (again from the `perf stat -r 10 -ddd ` tables):

    threads  L1 misses/total*100  L1 misses/total*100  gain
             (before)             (after)
    2         7.43 +-4.90%         7.36 +-4.70%          0.94%
    5        13.09 +-4.74%        13.52 +-3.73%         -3.28%
    8        13.79 +-5.61%        12.90 +-3.27%          6.45%
    12       11.57 +-2.44%         8.71 +-1.40%         24.72%
    21       12.39 +-3.92%         9.97 +-1.84%         19.53%
    30       13.91 +-2.53%        11.73 +-2.28%         15.67%
    48       13.71 +-1.59%        12.32 +-1.97%         10.14%
    79       14.44 +-0.66%        13.40 +-1.06%          7.20%
    110      15.86 +-0.50%        14.46 +-0.59%          8.83%
    128      16.51 +-0.32%        15.06 +-0.78%          8.78%

    As a final note, the following shows the evolution of performance figures
    in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
    ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
    major source of degradation, mostly unaddressed to this day (figures
    expressed in seconds).

    pound_clock_gettime:

    threads  parent of      6e998916dfe3       4.7-rc7
             6e998916dfe3   itself
    2        2.23           3.68 ( -64.56%)    3.48 (-55.48%)
    5        2.83           3.78 ( -33.42%)    3.33 (-17.43%)
    8        2.84           4.31 ( -52.12%)    3.37 (-18.76%)
    12       3.09           3.61 ( -16.74%)    3.32 ( -7.17%)
    21       3.14           4.63 ( -47.36%)    4.01 (-27.71%)
    30       3.28           5.75 ( -75.37%)    3.63 (-10.80%)
    48       3.02           6.05 (-100.56%)    3.71 (-22.99%)
    79       2.88           6.30 (-118.90%)    3.75 (-30.26%)
    110      2.95           6.46 (-119.00%)    3.81 (-29.24%)
    128      3.05           6.42 (-110.08%)    3.88 (-27.04%)

    pound_times:

    threads  parent of      6e998916dfe3       4.7-rc7
             6e998916dfe3   itself
    2        2.27           3.73 ( -64.71%)    3.65 (-61.14%)
    5        2.78           3.77 ( -35.56%)    3.45 (-23.98%)
    8        2.79           4.41 ( -57.71%)    3.52 (-26.05%)
    12       3.02           3.56 ( -17.94%)    3.29 ( -9.08%)
    21       3.10           4.61 ( -48.74%)    4.07 (-31.34%)
    30       3.33           5.75 ( -72.53%)    3.87 (-16.01%)
    48       2.96           6.06 (-105.04%)    3.79 (-28.10%)
    79       2.88           6.24 (-116.83%)    3.88 (-34.81%)
    110      2.98           6.37 (-114.08%)    3.90 (-31.12%)
    128      3.10           6.35 (-104.61%)    4.00 (-28.87%)

    The source code of the two benchmarks follows. To compile them:

    NR_THREADS=42
    for FILE in pound_times pound_clock_gettime; do
        gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lrt -lpthread
    done

    ==== BEGIN pound_times.c ====

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/times.h>

    struct tms start;

    void *pound(void *threadid)
    {
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
            times(&end);
            utime = ((int)end.tms_utime - (int)start.tms_utime);
            if (oldutime > utime)
                printf("utime decreased, was %d, now %d!\n", oldutime, utime);
            oldutime = utime;
        }
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t th[NUM_THREADS];
        long i;

        times(&start);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&th[i], NULL, pound, (void *)i);
        pthread_exit(NULL);
        return 0;
    }
    ==== END pound_times.c ====

    ==== BEGIN pound_clock_gettime.c ====

    #include <stdio.h>
    #include <pthread.h>
    #include <time.h>

    void *pound(void *threadid)
    {
        struct timespec ts;
        int rc, i;
        unsigned long prev = 0, this = 0;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
            rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
            if (rc < 0)
                perror("clock_gettime");
            this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
            if (0 && this < prev)
                printf("%lu ns timewarp at iteration %d\n", prev - this, i);
            prev = this;
        }
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t th[NUM_THREADS];
        long rc, i;

        for (i = 0; i < NUM_THREADS; i++) {
            rc = pthread_create(&th[i], NULL, pound, (void *)i);
            if (rc < 0)
                perror("pthread_create");
        }
        pthread_exit(NULL);
        return 0;
    }
    ==== END pound_clock_gettime.c ====

    Suggested-by: Mike Galbraith
    Signed-off-by: Giovanni Gherdovich
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
    Signed-off-by: Ingo Molnar

    Giovanni Gherdovich
     
  • We should update cfs_rq->throttled_clock_task, not
    pcfs_rq->throttle_clock_task.

    The effect of this bug was probably occasionally erratic
    group scheduling, particularly in cgroup-intense workloads.

    Signed-off-by: Xunlei Pang
    [ Added changelog. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 55e16d30bd99 ("sched/fair: Rework throttle_count sync")
    Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.com
    Signed-off-by: Ingo Molnar

    Xunlei Pang
     
  • Current code in cpudeadline.c has a bug in re-heapifying when adding a
    new element at the end of the heap, because a deadline value of 0 is
    temporarily set in the new elem, then cpudl_change_key() is called
    with the actual elem deadline as param.

    However, the function compares the new deadline to set with the one
    previously in the elem, which is 0. So, if current absolute deadlines
    have grown large enough to be negative as s64, the comparison in
    cpudl_change_key() makes the wrong decision. Instead, as in
    dl_time_before(), the kernel should correctly handle absolute-deadline
    wrap-arounds.

    This patch fixes the problem with a minimally invasive change that
    forces cpudl_change_key() to heapify up in this case.

    Signed-off-by: Tommaso Cucinotta
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Luca Abeni
    Cc: Juri Lelli
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1468921493-10054-2-git-send-email-tommaso.cucinotta@sssup.it
    Signed-off-by: Ingo Molnar

    Tommaso Cucinotta
     
    There's a perf stat bug that is easy to observe on a machine with only
    one cgroup:

    $ perf stat -e cycles -I 1000 -C 0 -G /
    #          time             counts unit events
        1.000161699      <not counted>      cycles                    /
        2.000355591      <not counted>      cycles                    /
        3.000565154      <not counted>      cycles                    /
        4.000951350      <not counted>      cycles                    /

    We'd expect some output there.

    The underlying problem is that there is an optimization in
    perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
    if the old and new cgroups in a task switch are the same.

    This optimization interacts with the current code in two ways
    that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
    cgroup event matches the current task. These are:

    1. On creation of the first cgroup event in a CPU: In current code,
    cpuctx->cgrp is only set in perf_cgroup_sched_in, but due to the
    aforesaid optimization, perf_cgroup_sched_in will not run until the
    next cgroup switch in that CPU. This may happen late or never,
    depending on the system's number of cgroups, CPU load, etc.

    2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
    cpuctx->cgrp is set to NULL. Any new cgroup event will not be scheduled
    in because cpuctx->cgrp == NULL until a cgroup switch occurs and
    perf_cgroup_sched_in is executed (updating cpuctx->cgrp).

    This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
    mirroring what list_del_event does when removing a cgroup event from CPU
    context, as introduced in:

    commit 68cacd29167b ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")

    With this patch, cpuctx->cgrp is always set/cleared when installing/
    removing the first/last cgroup event in/from the CPU context. With
    cpuctx->cgrp correctly set, event_filter_match() works as intended when
    events are scheduled in/out.

    After the fix, the output is as expected:

    $ perf stat -e cycles -I 1000 -a -G /
    #          time             counts unit events
        1.004699159          627342882      cycles                    /
        2.007397156          615272690      cycles                    /
        3.010019057          616726074      cycles                    /

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • Vegard Nossum reported that perf fuzzing generates a NULL
    pointer dereference crash:

    > Digging a bit deeper into this, it seems the event itself is getting
    > created by perf_event_open() and it gets added to the pmu_event_list
    > through:
    >
    > perf_event_open()
    > - perf_event_alloc()
    > - account_event()
    > - account_pmu_sb_event()
    > - attach_sb_event()
    >
    > so at this point the event is being attached but its ->ctx is still
    > NULL. It seems like ->ctx is set just a bit later in
    > perf_event_open(), though.
    >
    > But before that, __schedule() comes along and creates a stack trace
    > similar to the one above:
    >
    > __schedule()
    > - __perf_event_task_sched_out()
    > - perf_iterate_sb()
    > - perf_iterate_sb_cpu()
    > - event_filter_match()
    > - perf_cgroup_match()
    > - __get_cpu_context()
    > - (dereference ctx which is NULL)
    >
    > So I guess the question is... should the event be attached (= put on
    > the list) before ->ctx gets set? Or should the cgroup code check for a
    > NULL ->ctx?

    The latter seems like the simplest solution. Moving the list-add later
    creates a bit of a mess.

    Reported-by: Vegard Nossum
    Tested-by: Vegard Nossum
    Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: David Carrillo-Cisneros
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Fixes: f2fb6bef9251 ("perf/core: Optimize side-band event delivery")
    Link: http://lkml.kernel.org/r/20160804123724.GN6862@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This reverts commit 874f9c7da9a4acbc1b9e12ca722579fb50e4d142.

    Geert Uytterhoeven reports:
    "This change seems to have an (unintended?) side-effect.

    Before, pr_*() calls without a trailing newline characters would be
    printed with a newline character appended, both on the console and in
    the output of the dmesg command.

    After this commit, no new line character is appended, and the output
    of the next pr_*() call of the same type may be appended, like in:

    - Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000
    - Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)
    + Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)"

    Joe Perches says:
    "No, that is not intentional.

    The newline handling code inside vprintk_emit is a bit involved and
    for now I suggest a revert until this has all the same behavior as
    earlier"

    Reported-by: Geert Uytterhoeven
    Requested-by: Joe Perches
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Aug, 2016

3 commits

  • The tick_nohz_stop_sched_tick() routine is not properly
    canceling the sched timer when nothing is pending, because
    get_next_timer_interrupt() is no longer returning KTIME_MAX in
    that case. This causes periodic interrupts when none are needed.

    When determining the next interrupt time, we first use
    __next_timer_interrupt() to get the first expiring timer in the
    timer wheel. If no timer is found, we return the base clock value
    plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the
    timer wheel.

    Back in get_next_timer_interrupt(), we set the "expires" value
    by converting the timer wheel expiry (in ticks) to a nsec value.
    But we don't want to do this if the timer wheel expiry value
    indicates no timer; we want to return KTIME_MAX.

    Prior to commit 500462a9de65 ("timers: Switch to a non-cascading
    wheel") we checked base->active_timers to see if any timers
    were active, and if not, we didn't touch the expiry value and so
    properly returned KTIME_MAX. Now we don't have active_timers.

    To fix this, we now just check the timer wheel expiry value to
    see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't
    try to compute a new value based on it, but instead simply let the
    KTIME_MAX value in expires remain.

    Fixes: 500462a9de65 "timers: Switch to a non-cascading wheel"
    Signed-off-by: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/1470688147-22287-1-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Thomas Gleixner

    Chris Metcalf
     
  • Bharat Kumar Gogada reported issues with the generic MSI code, where the
    end-point ended up with garbage in its MSI configuration (both for the vector
    and the message).

    It turns out that the two MSI paths in the kernel are doing slightly different
    things:

    generic MSI: disable MSI -> allocate MSI -> enable MSI -> setup EP
    PCI MSI:     disable MSI -> allocate MSI -> setup EP   -> enable MSI

    And it turns out that end-points are allowed to latch the content of the MSI
    configuration registers as soon as MSIs are enabled. In Bharat's case, the
    end-point ends up using whatever was there already, which is not what you
    want.

    In order to make things converge, we introduce a new MSI domain flag
    (MSI_FLAG_ACTIVATE_EARLY) that is unconditionally set for PCI/MSI. When set,
    this flag forces the programming of the end-point as soon as the MSIs are
    allocated.

    A consequence of this is an extra activate call in irq_startup, but
    that should be harmless.
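
    A toy model of why the ordering matters, assuming an end-point that
    latches its message the moment MSI is enabled, as the spec allows
    (names and the 0xdeadbeef placeholder are illustrative, not kernel
    code):

```c
/* Toy MSI end-point: the device snapshots msg_reg at enable time. */
struct msi_ep {
    unsigned int msg_reg;   /* what software last wrote */
    unsigned int latched;   /* what the device actually uses */
    int enabled;
};

static void ep_enable(struct msi_ep *ep)
{
    ep->enabled = 1;
    ep->latched = ep->msg_reg;      /* latch happens here */
}

static void ep_write_msg(struct msi_ep *ep, unsigned int msg)
{
    ep->msg_reg = msg;
}

/* Old generic-MSI order: enable before setup -> stale data latched. */
static unsigned int setup_after_enable(unsigned int msg)
{
    struct msi_ep ep = { .msg_reg = 0xdeadbeef };   /* leftover junk */
    ep_enable(&ep);
    ep_write_msg(&ep, msg);
    return ep.latched;
}

/* Order enforced by MSI_FLAG_ACTIVATE_EARLY: program the EP first. */
static unsigned int setup_before_enable(unsigned int msg)
{
    struct msi_ep ep = { .msg_reg = 0xdeadbeef };
    ep_write_msg(&ep, msg);
    ep_enable(&ep);
    return ep.latched;
}
```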

    tglx:

    - Several people reported a VMWare regression with PCI/MSI-X passthrough. It
    turns out that the patch also cures that issue.

    - We need to have a look at the MSI disable interrupt path, where we write
    the msg to all zeros without disabling MSI in the PCI device. Is that
    correct?

    Fixes: 52f518a3a7c2 ("x86/MSI: Use hierarchical irqdomains to manage MSI interrupts")
    Reported-and-tested-by: Bharat Kumar Gogada
    Reported-and-tested-by: Foster Snowhill
    Reported-by: Matthias Prager
    Reported-by: Jason Taylor
    Signed-off-by: Marc Zyngier
    Acked-by: Bjorn Helgaas
    Cc: linux-pci@vger.kernel.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1468426713-31431-1-git-send-email-marc.zyngier@arm.com
    Signed-off-by: Thomas Gleixner

    Marc Zyngier
     
  • In commit 874f9c7da9a4 ("printk: create pr_ functions"), new
    pr_level defines were added to printk.c.

    These new defines are guarded by an #ifdef CONFIG_PRINTK - however,
    there is already a surrounding #ifdef CONFIG_PRINTK starting much
    earlier, at line 249, which means the newly introduced #ifdef is
    unnecessary.

    Let's remove it to avoid confusion.

    Signed-off-by: Andreas Ziegler
    Cc: Joe Perches
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Ziegler
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Aug, 2016

1 commit

  • The introduction of pre-allocated hash elements inadvertently broke
    the behavior of bpf hash maps where users expected to call
    bpf_map_update_elem() without considering that the map can be full.
    Some programs do:
        old_value = bpf_map_lookup_elem(map, key);
        if (old_value) {
            ... prepare new_value on stack ...
            bpf_map_update_elem(map, key, new_value);
        }
    Before pre-alloc the update() for existing element would work even
    in 'map full' condition. Restore this behavior.

    The above program could have updated old_value in place instead of
    calling update(), which would be faster, and most programs use that
    approach; but sometimes the values are large and the programs use
    the update() helper to do an atomic replacement of the element.
    Note that we cannot simply update the element's value in place, as
    the percpu hash map does; we have to allocate num_possible_cpu
    extra elements and use this extra reserve when the map is full.
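
    A toy model of that reserve, assuming a tiny fixed-capacity map (the
    names, sizes, and linear-scan "hash table" are illustrative, not the
    kernel's htab code): the pool holds extra elements beyond the map
    capacity, so update() of an existing key can always allocate a
    replacement element even when the map is full.

```c
#define MAP_MAX   2                  /* illustrative map capacity */
#define NR_CPUS   1                  /* stands in for num_possible_cpu */
#define POOL_SIZE (MAP_MAX + NR_CPUS)

struct elem { int used, key, val; };

static struct elem pool[POOL_SIZE];
static int live;                     /* elements visible in the map */

static struct elem *map_lookup(int key)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (pool[i].used && pool[i].key == key)
            return &pool[i];
    return 0;
}

static struct elem *alloc_elem(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].used)
            return &pool[i];
    return 0;
}

/* Returns 0 on success, -1 when a *new* key hits a full map. */
static int map_update(int key, int val)
{
    struct elem *old = map_lookup(key);

    if (!old && live == MAP_MAX)
        return -1;                   /* full: only replacements allowed */

    /* The reserve guarantees this succeeds for a replacement. */
    struct elem *n = alloc_elem();
    n->used = 1;
    n->key = key;
    n->val = val;

    if (old)
        old->used = 0;               /* frees the replaced element */
    else
        live++;
    return 0;
}
```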

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

06 Aug, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling fixes and some late tooling updates, plus two perf
    related printk message fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf tests bpf: Use SyS_epoll_wait alias
    perf tests: objdump output can contain multi byte chunks
    perf record: Add --sample-cpu option
    perf hists: Introduce output_resort_cb method
    perf tools: Move config/Makefile into Makefile.config
    perf tests: Add test for bitmap_scnprintf function
    tools lib: Add bitmap_and function
    tools lib: Add bitmap_scnprintf function
    tools lib: Add bitmap_alloc function
    tools lib traceevent: Ignore generated library files
    perf tools: Fix build failure on perl script context
    perf/core: Change log level for duration warning to KERN_INFO
    perf annotate: Plug filename string leak
    perf annotate: Introduce strerror for handling symbol__disassemble() errors
    perf annotate: Rename symbol__annotate() to symbol__disassemble()
    perf/x86: Modify error message in virtualized environment
    perf target: str_error_r() always returns the buffer it receives
    perf annotate: Use pipe + fork instead of popen
    perf evsel: Introduce constructor for cycles event

    Linus Torvalds
     

05 Aug, 2016

1 commit

  • Pull more powerpc updates from Michael Ellerman:
    "These were delayed for various reasons, so I let them sit in next a
    bit longer, rather than including them in my first pull request.

    Fixes:
    - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
    - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
    - Move register_process_table() out of ppc_md from Michael Ellerman

    Use jump_label for [cpu|mmu]_has_feature():
    - Add mmu_early_init_devtree() from Michael Ellerman
    - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
    - Do hash device tree scanning earlier from Michael Ellerman
    - Do radix device tree scanning earlier from Michael Ellerman
    - Do feature patching before MMU init from Michael Ellerman
    - Check features don't change after patching from Michael Ellerman
    - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
    - Convert mmu_has_feature() to returning bool from Michael Ellerman
    - Convert cpu_has_feature() to returning bool from Michael Ellerman
    - Define radix_enabled() in one place & use static inline from Michael Ellerman
    - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
    - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
    - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
    - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
    - Remove mfvtb() from Kevin Hao
    - Move cpu_has_feature() to a separate file from Kevin Hao
    - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
    - Add option to use jump label for cpu_has_feature() from Kevin Hao
    - Add option to use jump label for mmu_has_feature() from Kevin Hao
    - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
    - Annotate jump label assembly from Michael Ellerman

    TLB flush enhancements from Aneesh Kumar K.V:
    - radix: Implement tlb mmu gather flush efficiently
    - Add helper for finding SLBE LLP encoding
    - Use hugetlb flush functions
    - Drop multiple definition of mm_is_core_local
    - radix: Add tlb flush of THP ptes
    - radix: Rename function and drop unused arg
    - radix/hugetlb: Add helper for finding page size
    - hugetlb: Add flush_hugetlb_tlb_range
    - remove flush_tlb_page_nohash

    Add new ptrace regsets from Anshuman Khandual and Simon Guo:
    - elf: Add powerpc specific core note sections
    - Add the function flush_tmregs_to_thread
    - Enable in transaction NT_PRFPREG ptrace requests
    - Enable in transaction NT_PPC_VMX ptrace requests
    - Enable in transaction NT_PPC_VSX ptrace requests
    - Adapt gpr32_get, gpr32_set functions for transaction
    - Enable support for NT_PPC_CGPR
    - Enable support for NT_PPC_CFPR
    - Enable support for NT_PPC_CVMX
    - Enable support for NT_PPC_CVSX
    - Enable support for TM SPR state
    - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    - Enable support for EBB registers
    - Enable support for Performance Monitor registers"

    * tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
    powerpc/mm: Move register_process_table() out of ppc_md
    powerpc/perf: Fix incorrect event codes in power9-event-list
    powerpc/32: Fix early access to cpu_spec relocation
    powerpc/ptrace: Enable support for Performance Monitor registers
    powerpc/ptrace: Enable support for EBB registers
    powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    powerpc/ptrace: Enable support for TM SPR state
    powerpc/ptrace: Enable support for NT_PPC_CVSX
    powerpc/ptrace: Enable support for NT_PPC_CVMX
    powerpc/ptrace: Enable support for NT_PPC_CFPR
    powerpc/ptrace: Enable support for NT_PPC_CGPR
    powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
    powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
    powerpc/process: Add the function flush_tmregs_to_thread
    elf: Add powerpc specific core note sections
    powerpc/mm: remove flush_tlb_page_nohash
    powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
    ...

    Linus Torvalds
     

04 Aug, 2016

9 commits

  • Pull module updates from Rusty Russell:
    "The only interesting thing here is Jessica's patch to add
    ro_after_init support to modules. The rest are all trivia"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    extable.h: add stddef.h so "NULL" definition is not implicit
    modules: add ro_after_init support
    jump_label: disable preemption around __module_text_address().
    exceptions: fork exception table content from module.h into extable.h
    modules: Add kernel parameter to blacklist modules
    module: Do a WARN_ON_ONCE() for assert module mutex not held
    Documentation/module-signing.txt: Note need for version info if reusing a key
    module: Invalidate signatures on force-loaded modules
    module: Issue warnings when tainting kernel
    module: fix redundant test.
    module: fix noreturn attribute for __module_put_and_exit()

    Linus Torvalds
     
  • The current jump_label.h includes bug.h for things such as WARN_ON().
    This makes the header problematic for inclusion by kernel.h or any
    headers that kernel.h includes, since bug.h includes kernel.h (circular
    dependency). The inclusion of atomic.h is similarly problematic. Thus,
    this should make jump_label.h 'includable' from most places.

    Link: http://lkml.kernel.org/r/7060ce35ddd0d20b33bf170685e6b0fab816bdf2.1467837322.git.jbaron@akamai.com
    Signed-off-by: Jason Baron
    Cc: "David S. Miller"
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Heiko Carstens
    Cc: Joe Perches
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • The use of config_enabled() against config options is ambiguous. In
    practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
    author might have used it for the meaning of IS_ENABLED(). Using
    IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc. makes the intention
    clearer.

    This commit replaces config_enabled() with IS_ENABLED() where possible.
    This commit is only touching bool config options.

    I noticed two cases where config_enabled() is used against a tristate
    option:

    - config_enabled(CONFIG_HWMON)
    [ drivers/net/wireless/ath/ath10k/thermal.c ]

    - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
    [ drivers/gpu/drm/gma500/opregion.c ]

    I did not touch them because they should be converted to IS_BUILTIN()
    in order to keep the logic, but I was not sure it was the authors'
    intention.
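
    The distinction can be seen with simplified forms of the kernel's
    include/linux/kconfig.h machinery (CONFIG_FOO and CONFIG_BAR below
    are made-up options; Kconfig defines a builtin option to 1 and a
    module option via a _MODULE suffix):

```c
/* What Kconfig generates: =y defines CONFIG_<X> to 1, =m defines
 * CONFIG_<X>_MODULE to 1. CONFIG_FOO and CONFIG_BAR are made up. */
#define CONFIG_FOO 1            /* built in (=y) */
#define CONFIG_BAR_MODULE 1     /* built as a module (=m) */

/* Simplified forms of the include/linux/kconfig.h macros: they expand
 * to 1 when the option token is defined to 1, and to 0 otherwise. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)

#define IS_BUILTIN(option) __is_defined(option)
#define IS_MODULE(option)  __is_defined(option##_MODULE)
#define IS_ENABLED(option) (IS_BUILTIN(option) || IS_MODULE(option))
```

    config_enabled() behaves like IS_BUILTIN(): for a tristate option
    built as a module it yields 0, which may or may not be what the
    author intended.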

    Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Acked-by: Kees Cook
    Cc: Stas Sergeev
    Cc: Matt Redfearn
    Cc: Joshua Kinard
    Cc: Jiri Slaby
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Markos Chandras
    Cc: "Dmitry V. Levin"
    Cc: yu-cheng yu
    Cc: James Hogan
    Cc: Brian Gerst
    Cc: Johannes Berg
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Will Drewry
    Cc: Nikolay Martynov
    Cc: Huacai Chen
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Daniel Borkmann
    Cc: Leonid Yegoshin
    Cc: Rafal Milecki
    Cc: James Cowgill
    Cc: Greg Kroah-Hartman
    Cc: Ralf Baechle
    Cc: Alex Smith
    Cc: Adam Buchbinder
    Cc: Qais Yousef
    Cc: Jiang Liu
    Cc: Mikko Rapeli
    Cc: Paul Gortmaker
    Cc: Denys Vlasenko
    Cc: Brian Norris
    Cc: Hidehiro Kawai
    Cc: "Luis R. Rodriguez"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Roland McGrath
    Cc: Paul Burton
    Cc: Kalle Valo
    Cc: Viresh Kumar
    Cc: Tony Wu
    Cc: Huaitong Han
    Cc: Sumit Semwal
    Cc: Alexei Starovoitov
    Cc: Juergen Gross
    Cc: Jason Cooper
    Cc: "David S. Miller"
    Cc: Oleg Nesterov
    Cc: Andrea Gelmini
    Cc: David Woodhouse
    Cc: Marc Zyngier
    Cc: Rabin Vincent
    Cc: "Maciej W. Rozycki"
    Cc: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Add ro_after_init support for modules by adding a new page-aligned section
    in the module layout (after rodata) for ro_after_init data and enabling RO
    protection for that section after module init runs.

    Signed-off-by: Jessica Yu
    Acked-by: Kees Cook
    Signed-off-by: Rusty Russell

    Jessica Yu
     
  • Steven reported a warning caused by not holding module_mutex or
    rcu_read_lock_sched: his backtrace was corrupted but a quick audit
    found this possible cause. It's wrong anyway...

    Reported-by: Steven Rostedt
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Blacklisting a module in Linux has long been a problem. The current
    procedure is to use rd.blacklist=module_name, however, that doesn't
    cover the case after the initramfs and before a boot prompt (where one
    is supposed to use /etc/modprobe.d/blacklist.conf to blacklist
    runtime loading). Using rd.shell to get an early prompt is hit-or-miss,
    and doesn't cover all situations AFAICT.

    This patch adds this functionality of permanently blacklisting a module
    by its name via the kernel parameter module_blacklist=module_name.

    [v2]: Rusty, use core_param() instead of __setup() which simplifies
    things.

    [v3]: Rusty, undo wreckage from strsep()

    [v4]: Rusty, simpler version of blacklisted()
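
    A sketch of the blacklisted() check, assuming module_blacklist holds
    the comma-separated list parsed from the command line (hard-coded
    here for illustration; the kernel's implementation may differ in
    detail):

```c
#include <string.h>

/* Would be filled by core_param() from the kernel command line;
 * hard-coded here for illustration. */
static const char *module_blacklist = "nouveau,pcspkr";

static int blacklisted(const char *module_name)
{
    const char *p;
    size_t len;

    if (!module_blacklist)
        return 0;

    for (p = module_blacklist; *p; p += len) {
        len = strcspn(p, ",");                  /* next list entry */
        if (strlen(module_name) == len && !memcmp(module_name, p, len))
            return 1;
        if (p[len] == ',')
            len++;                              /* skip the separator */
    }
    return 0;
}
```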

    Signed-off-by: Prarit Bhargava
    Cc: Jonathan Corbet
    Cc: Rusty Russell
    Cc: linux-doc@vger.kernel.org
    Signed-off-by: Rusty Russell

    Prarit Bhargava
     
  • When running with lockdep enabled, I triggered the WARN_ON() in the
    module code that asserts when module_mutex or rcu_read_lock_sched are
    not held. The issue I have is that this can also be called from the
    dump_stack() code, causing us to enter an infinite loop...

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
    Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375253c #14
    Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
    ffff880215e8fa70 ffff880215e8fa70 ffffffff812fc8e3 0000000000000000
    ffffffff81d3e55b ffff880215e8fac0 ffffffff8104fc88 ffffffff8104fcab
    0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
    Call Trace:
    [] dump_stack+0x67/0x90
    [] __warn+0xcb/0xe9
    [] ? warn_slowpath_null+0x5/0x1f
    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
    Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375253c #14
    Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
    ffff880215e8f7a0 ffff880215e8f7a0 ffffffff812fc8e3 0000000000000000
    ffffffff81d3e55b ffff880215e8f7f0 ffffffff8104fc88 ffffffff8104fcab
    0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
    Call Trace:
    [] dump_stack+0x67/0x90
    [] __warn+0xcb/0xe9
    [] ? warn_slowpath_null+0x5/0x1f
    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
    Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375253c #14
    Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
    ffff880215e8f4d0 ffff880215e8f4d0 ffffffff812fc8e3 0000000000000000
    ffffffff81d3e55b ffff880215e8f520 ffffffff8104fc88 ffffffff8104fcab
    0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
    Call Trace:
    [] dump_stack+0x67/0x90
    [] __warn+0xcb/0xe9
    [] ? warn_slowpath_null+0x5/0x1f
    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
    [...]

    Which gives us rather useless information. Worse yet, there's some race
    that causes this, and I seldom trigger it, so I have no idea what
    happened.

    This would not be an issue if that warning was a WARN_ON_ONCE().
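
    The recursion-breaking property of the ONCE variant is easy to model
    (a simplification; the real macro keeps a per-callsite static flag):

```c
static int warned;       /* per-callsite flag in the real macro */
static int dump_count;   /* stands in for the stack dumps emitted */

static int warn_on_once(int condition)
{
    if (condition && !warned) {
        warned = 1;      /* set before dumping: re-entry is a no-op */
        dump_count++;    /* stands in for dump_stack() */
    }
    return condition;
}
```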

    Signed-off-by: Steven Rostedt
    Signed-off-by: Rusty Russell

    Steven Rostedt
     
  • Using a per-register incrementing ID can lead to
    find_good_pkt_pointers() confusing registers that hold
    completely different values. Consider this example:

    0: (bf) r6 = r1
    1: (61) r8 = *(u32 *)(r6 +76)
    2: (61) r0 = *(u32 *)(r6 +80)
    3: (bf) r7 = r8
    4: (07) r8 += 32
    5: (2d) if r8 > r0 goto pc+9
    R0=pkt_end R1=ctx R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=0,off=32,r=32) R10=fp
    6: (bf) r8 = r7
    7: (bf) r9 = r7
    8: (71) r1 = *(u8 *)(r7 +0)
    9: (0f) r8 += r1
    10: (71) r1 = *(u8 *)(r7 +1)
    11: (0f) r9 += r1
    12: (07) r8 += 32
    13: (2d) if r8 > r0 goto pc+1
    R0=pkt_end R1=inv56 R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=1,off=32,r=32) R9=pkt(id=1,off=0,r=32) R10=fp
    14: (71) r1 = *(u8 *)(r9 +16)
    15: (b7) r7 = 0
    16: (bf) r0 = r7
    17: (95) exit

    We need to get an UNKNOWN_VALUE with imm to force id
    generation so lines 0-5 make r7 a valid packet pointer.
    We then read two different bytes from the packet and
    add them to copies of the constructed packet pointer.
    r8 (line 9) and r9 (line 11) will get the same id of 1,
    independently. When either of them is validated (line
    13) - find_good_pkt_pointers() will also mark the other
    as safe. This leads to access on line 14 being mistakenly
    considered safe.
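
    A toy model of the id matching in find_good_pkt_pointers() (register
    file and state fields heavily simplified): with a per-register
    counter, both r8 and r9 independently end up with id 1, so validating
    r8 also, and wrongly, marks r9 safe.

```c
#define NR_REGS 11

/* Simplified register state: packet pointers carry an id and the
 * byte range they are known to be safe for. */
struct reg_state { int id; int range; };

/* When 'checked' passes a bounds test, every register sharing its id
 * is granted the same safe range -- correct only if equal ids really
 * mean equal pointer values. */
static void find_good_pkt_pointers(struct reg_state *regs,
                                   int checked, int range)
{
    for (int i = 0; i < NR_REGS; i++)
        if (regs[i].id == regs[checked].id)
            regs[i].range = range;
}

/* Reproduces the bug from the listing above: two independently built
 * packet pointers that collided on id 1. */
static int r9_range_after_checking_r8(void)
{
    struct reg_state regs[NR_REGS] = {{0}};

    regs[8].id = 1;     /* r8 = r7 + byte0: per-register counter -> 1 */
    regs[9].id = 1;     /* r9 = r7 + byte1: independently also 1 */
    find_good_pkt_pointers(regs, 8, 32);   /* only r8 was checked */
    return regs[9].range;
}
```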

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Jakub Kicinski
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Pull tracing fixes from Steven Rostedt:
    "A few updates and fixes:

    - move the suppressing of the __builtin_return_address >0 warning to
    the tracing directory only.

    - metag recordmcount fix for newer glibc's

    - two tracing histogram fixes that were reported by KASAN"

    * tag 'trace-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix use-after-free in hist_register_trigger()
    tracing: Fix use-after-free in hist_unreg_all/hist_enable_unreg_all
    Makefile: Mute warning for __builtin_return_address(>0) for tracing only
    ftrace/recordmcount: Work around for addition of metag magic but not relocations

    Linus Torvalds
     

03 Aug, 2016

1 commit

  • Copy the config fragments from the AOSP common kernel android-4.4
    branch. It is becoming possible to run mainline kernels with Android,
    but the kernel defconfigs don't work as-is and debugging missing config
    options is a pain. Adding the config fragments into the kernel tree
    makes configuring a mainline kernel as simple as:

    make ARCH=arm multi_v7_defconfig android-base.config android-recommended.config

    The following non-upstream config options were removed:

    CONFIG_NETFILTER_XT_MATCH_QTAGUID
    CONFIG_NETFILTER_XT_MATCH_QUOTA2
    CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG
    CONFIG_PPPOLAC
    CONFIG_PPPOPNS
    CONFIG_SECURITY_PERF_EVENTS_RESTRICT
    CONFIG_USB_CONFIGFS_F_MTP
    CONFIG_USB_CONFIGFS_F_PTP
    CONFIG_USB_CONFIGFS_F_ACC
    CONFIG_USB_CONFIGFS_F_AUDIO_SRC
    CONFIG_USB_CONFIGFS_UEVENT
    CONFIG_INPUT_KEYCHORD
    CONFIG_INPUT_KEYRESET

    Link: http://lkml.kernel.org/r/1466708235-28593-1-git-send-email-robh@kernel.org
    Signed-off-by: Rob Herring
    Cc: Amit Pundir
    Cc: John Stultz
    Cc: Dmitry Shmidt
    Cc: Rom Lemarchand
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Herring