23 Jan, 2017

1 commit


19 Jan, 2017

2 commits

  • Pull SMP hotplug update from Thomas Gleixner:
    "This contains a trivial typo fix and an extension to the core code for
    dynamically allocating states in the prepare stage.

    The extension is necessary right now because we need a proper way to
    unbreak LTTNG, which is currently non-functional due to the removal of
    the notifiers. Sure, it's out of tree, but it's widely used by
    distros.

    The simple solution would have been to reserve a state for LTTNG, but
    I'm not fond of unused crap in the kernel, and the dynamic range,
    which we admittedly should have done right away, allows us to remove
    quite a few of the hardcoded states, i.e. those which have no ordering
    requirements. So doing the right thing now is better than having a
    smaller intermediate solution which needs to be reworked anyway"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpu/hotplug: Provide dynamic range for prepare stage
    perf/x86/amd/ibs: Fix typo after cleanup state names in cpu/hotplug

    Linus Torvalds
     
  • Pull RCU fixes from Ingo Molnar:
    "This fixes sporadic ACPI related hangs in synchronize_rcu() that were
    caused by the ACPI code mistakenly relying on an aspect of RCU that
    was neither promised to work nor reliable but which happened to work -
    until in v4.9 we changed the RCU implementation, which made the hangs
    more prominent.

    Since the mis-use of the RCU facility wasn't properly detected and
    prevented either, these fixes make the RCU side work reliably instead
    of working around the problem in the ACPI code.

    Hence the slightly larger diffstat that goes beyond the normal scope
    of RCU fixes in -rc kernels"

    * 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Narrow early boot window of illegal synchronous grace periods
    rcu: Remove cond_resched() from Tiny synchronize_sched()

    Linus Torvalds
     

18 Jan, 2017

4 commits

  • After the recent removal of the hotplug notifiers the variable 'hasdied' in
    _cpu_down() is set but no longer read, leading to the following GCC warning
    when building with 'make W=1':

    kernel/cpu.c:767:7: warning: variable ‘hasdied’ set but not used [-Wunused-but-set-variable]

    Fix it by removing the variable.

    Fixes: 530e9b76ae8f ("cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions")
    Signed-off-by: Tobias Klauser
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20170117143501.20893-1-tklauser@distanz.ch
    Signed-off-by: Thomas Gleixner

    Tobias Klauser
     
  • Pull modules fix from Jessica Yu:

    - fix out-of-tree module breakage when it supplies its own definitions
    of true and false

    * tag 'modules-for-v4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
    taint/module: Fix problems when out-of-kernel driver defines true or false

    Linus Torvalds
     
  • Commit 7fd8329ba502 ("taint/module: Clean up global and module taint
    flags handling") used the key words true and false as character members
    of a new struct. These names cause problems when out-of-kernel modules
    such as VirtualBox include their own definitions of true and false.

    Fixes: 7fd8329ba502 ("taint/module: Clean up global and module taint flags handling")
    Signed-off-by: Larry Finger
    Cc: Petr Mladek
    Cc: Jessica Yu
    Cc: Rusty Russell
    Reported-by: Valdis Kletnieks
    Reviewed-by: Petr Mladek
    Acked-by: Rusty Russell
    Signed-off-by: Jessica Yu

    Larry Finger
     
  • Pull networking fixes from David Miller:

    1) Handle multicast packets properly in fast-RX path of mac80211, from
    Johannes Berg.

    2) Because of a logic bug, the user can't actually force SW
    checksumming on r8152 devices. This makes diagnosis of hw
    checksumming bugs really annoying. Fix from Hayes Wang.

    3) VXLAN route lookup does not take the source and destination ports
    into account, which means IPSEC policies cannot be matched properly.
    Fix from Martynas Pumputis.

    4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.

    5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.

    6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
    construction properly in fib_dump_info(), from David Ahern.

    7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
    Schmidt.

    8) Openvswitch conntrack actions need to maintain a correct checksum,
    fix from Lance Richardson.

    9) ax25_disconnect() needs a check for ax25->sk being NULL; the check
    exists, but not in all of the necessary spots. Fix from Basil Gunn.

    10) Action GET operations in the packet scheduler can erroneously bump
    the reference count of the entry, making it unreleasable. Fix from
    Jamal Hadi Salim. Jamal gives a great set of example command lines
    that trigger this in the commit message.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
    net sched actions: fix refcnt when GETing of action after bind
    net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
    net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
    net/mlx4_core: Fix racy CQ (Completion Queue) free
    net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
    net/mlx5e: Fix a -Wmaybe-uninitialized warning
    ax25: Fix segfault after sock connection timeout
    bpf: rework prog_digest into prog_tag
    tipc: allocate user memory with GFP_KERNEL flag
    net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
    ip6_tunnel: Account for tunnel header in tunnel MTU
    mld: do not remove mld souce list info when set link down
    be2net: fix MAC addr setting on privileged BE3 VFs
    be2net: don't delete MAC on close on unprivileged BE3 VFs
    be2net: fix status check in be_cmd_pmac_add()
    cpmac: remove hopeless #warning
    ravb: do not use zero-length alignment DMA descriptor
    mlx4: do not call napi_schedule() without care
    openvswitch: maintain correct checksum state in conntrack actions
    tcp: fix tcp_fastopen unaligned access complaints on sparc
    ...

    Linus Torvalds
     

17 Jan, 2017

1 commit

  • Commit 7bd509e311f4 ("bpf: add prog_digest and expose it via
    fdinfo/netlink") was recently discussed, partially due to
    admittedly suboptimal name of "prog_digest" in combination
    with sha1 hash usage, thus inevitably and rightfully concerns
    about its security in terms of collision resistance were
    raised with regards to use-cases.

    The intended use cases are debugging and introspection only, providing
    a stable "tag" over the instruction sequence that both kernel and user
    space can calculate independently.
    It's not usable at all for making a security relevant decision.
    So collisions where two different instruction sequences generate
    the same tag can happen, but ideally at a rather low rate. The
    "tag" will be dumped in hex and is short enough to introspect
    in tracepoints or kallsyms output along with other data such
    as stack trace, etc. Thus, this patch performs a rename into
    prog_tag and truncates the tag to a short output (64 bits) to
    make it obvious it's not collision-free.

    Should in future a hash or facility be needed with a security
    relevant focus, then we can think about requirements, constraints,
    etc that would fit to that situation. For now, rework the exposed
    parts for the current use cases as long as nothing has been
    released yet. Tested on x86_64 and s390x.

    Fixes: 7bd509e311f4 ("bpf: add prog_digest and expose it via fdinfo/netlink")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Andy Lutomirski
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jan, 2017

5 commits

  • Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to
    the removal of the cpu hotplug notifiers.

    Usually I don't care much about out of tree modules, but LTTNG is widely
    used in distros. There are two ways to solve that:

    1) Reserve a hotplug state for LTTNG

    2) Add a dynamic range for the prepare states.

    While #1 is the simplest solution, #2 is the proper one as we can convert
    in tree users, which do not care about ordering, to the dynamic range as
    well.

    Add a dynamic range which allows LTTNG to request states in the prepare
    stage.

    Reported-and-tested-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Sebastian Sewior
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • …ck/linux-rcu into rcu/urgent

    Pull an urgent RCU fix from Paul E. McKenney:

    "This series contains a pair of commits that permit RCU synchronous grace
    periods (synchronize_rcu() and friends) to work correctly throughout boot.
    This eliminates the current "dead time" starting when the scheduler spawns
    its first task and ending when the last of RCU's kthreads is spawned
    (this last happens during early_initcall() time). Although RCU's
    synchronous grace periods have long been documented as not working
    during this time, prior to 4.9, the expedited grace periods worked by
    accident, and some ACPI code came to rely on this unintentional behavior.
    (Note that this unintentional behavior was -not- reliable. For example,
    failures from ACPI could occur on !SMP systems and on systems booting
    with the rcu_normal kernel boot parameter.)

    Either way, there is a bug that needs fixing, and the 4.9 switch of RCU's
    expedited grace periods to workqueues could be considered to have caused
    a regression. This series therefore makes RCU's expedited grace periods
    operate correctly throughout the boot process. This has been demonstrated
    to fix the problems ACPI was encountering, and has the added longer-term
    benefit of simplifying RCU's behavior."

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Pull namespace fixes from Eric Biederman:
    "This tree contains 4 fixes.

    The first is a fix for a race that can cause oopses under the right
    circumstances, and that someone just recently encountered.

    Past that are several small trivial fixes. A real issue that
    was blocking development of an out-of-tree driver, but does not appear
    to have caused any actual problems for in-tree code. A potential
    deadlock that was reported by lockdep. And a deadlock people have
    experienced and took the time to track down, caused by a cleanup that
    removed the code to drop a reference count"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    sysctl: Drop reference added by grab_header in proc_sys_readdir
    pid: fix lockdep deadlock warning due to ucount_lock
    libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
    mnt: Protect the mountpoint hashtable with mount_lock

    Linus Torvalds
     
  • Pull NOHZ fix from Ingo Molnar:
    "This fixes an old NOHZ race where we incorrectly calculate the next
    timer interrupt in certain circumstances where hrtimers are pending,
    that can cause hard to reproduce stalled-values artifacts in
    /proc/stat"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    nohz: Fix collision between tick and other hrtimers

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
    driver fixes, plus miscellaneous tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Reject non sampling events with precise_ip
    perf/x86/intel: Account interrupts for PEBS errors
    perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
    perf/core: Fix sys_perf_event_open() vs. hotplug
    perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
    perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
    perf/x86: Set pmu->module in Intel PMU modules
    perf probe: Fix to probe on gcc generated symbols for offline kernel
    perf probe: Fix --funcs to show correct symbols for offline module
    perf symbols: Robustify reading of build-id from sysfs
    perf tools: Install tools/lib/traceevent plugins with install-bin
    tools lib traceevent: Fix prev/next_prio for deadline tasks
    perf record: Fix --switch-output documentation and comment
    perf record: Make __record_options static
    tools lib subcmd: Add OPT_STRING_OPTARG_SET option
    perf probe: Fix to get correct modname from elf header
    samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
    samples/bpf sock_example: Avoid getting ethhdr from two includes
    perf sched timehist: Show total scheduling time

    Linus Torvalds
     

15 Jan, 2017

2 commits

    The current preemptible RCU implementation goes through three phases
    during bootup. In the first phase, there is only one CPU that is running
    with preemption disabled, so that a synchronous grace period is a no-op.
    In the second mid-boot phase, the scheduler is running, but RCU has
    not yet gotten its kthreads spawned (and, for expedited grace periods,
    workqueues are not yet running). During this time, any attempt to do
    a synchronous grace period will hang the system (or complain bitterly,
    depending). In the third and final phase, RCU is fully operational and
    everything works normally.

    This has been OK for some time, but recently some synchronous grace
    periods have been showing up during the second mid-boot phase. This
    code worked "by accident" for a while, but started failing as soon
    as expedited RCU grace periods switched over to workqueues in commit
    8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue").
    Note that the code was buggy even before this commit, as it was subject
    to failure on real-time systems that forced all expedited grace periods
    to run as normal grace periods (for example, using the rcu_normal ksysfs
    parameter). The callchain from the failure case is as follows:

    early_amd_iommu_init()
    |-> acpi_put_table(ivrs_base);
        |-> acpi_tb_put_table(table_desc);
            |-> acpi_tb_invalidate_table(table_desc);
                |-> acpi_tb_release_table(...)
                    |-> acpi_os_unmap_memory
                        |-> acpi_os_unmap_iomem
                            |-> acpi_os_map_cleanup
                                |-> synchronize_rcu_expedited

    The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
    which caused the code to try using workqueues before they were
    initialized, which did not go well.

    This commit therefore reworks RCU to permit synchronous grace periods
    to proceed during this mid-boot phase. It is a fix for a regression
    introduced in v4.9, and is therefore being put forward post-merge-window
    in v4.10.

    This commit sets a flag from the existing rcu_scheduler_starting()
    function which causes all synchronous grace periods to take the expedited
    path. The expedited path now checks this flag, using the requesting task
    to drive the expedited grace period forward during the mid-boot phase.
    Finally, this flag is updated by a core_initcall() function named
    rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

    Note that this arrangement assumes that tasks are not sent POSIX signals
    (or anything similar) from the time that the first task is spawned
    through core_initcall() time.

    Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
    Reported-by: "Zheng, Lv"
    Reported-by: Borislav Petkov
    Signed-off-by: Paul E. McKenney
    Tested-by: Stan Kain
    Tested-by: Ivan
    Tested-by: Emanuel Castelo
    Tested-by: Bruno Pesavento
    Tested-by: Borislav Petkov
    Tested-by: Frederic Bezies
    Cc: # 4.9.0-

    Paul E. McKenney
     
  • It is now legal to invoke synchronize_sched() at early boot, which causes
    Tiny RCU's synchronize_sched() to emit spurious splats. This commit
    therefore removes the cond_resched() from Tiny RCU's synchronize_sched().

    Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
    Signed-off-by: Paul E. McKenney
    Cc: # 4.9.0-

    Paul E. McKenney
     

14 Jan, 2017

5 commits

    It's possible to set up PEBS events so that they produce only errors
    and no data, as on SNB-X (model 45) and IVB-EP (model 62), via two
    perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

    This leads to a soft lockup, because the error path of
    intel_pmu_drain_pebs_nhm() does not account event->hw.interrupts
    for error PEBS interrupts, so in case you're getting ONLY
    errors you don't have a way to stop the event when it's over
    the max_samples_per_tick limit:

    NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
    ...
    RIP: 0010:[] [] smp_call_function_single+0xe2/0x140
    ...
    Call Trace:
    ? trace_hardirqs_on_caller+0xf5/0x1b0
    ? perf_cgroup_attach+0x70/0x70
    perf_install_in_context+0x199/0x1b0
    ? ctx_resched+0x90/0x90
    SYSC_perf_event_open+0x641/0xf90
    SyS_perf_event_open+0x9/0x10
    do_syscall_64+0x6c/0x1f0
    entry_SYSCALL64_slow_path+0x25/0x25

    Add perf_event_account_interrupt(), which does the interrupt
    and frequency checks, and call it from intel_pmu_drain_pebs_nhm()'s
    error path.

    We keep the pending_kill and pending_wakeup logic only in the
    __perf_event_overflow() path, because they make sense only if
    there's any data to deliver.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Di Shen reported a race between two concurrent sys_perf_event_open()
    calls where both try and move the same pre-existing software group
    into a hardware context.

    The problem is exactly that described in commit:

    f63a8daa5812 ("perf: Fix event->ctx locking")

    ... where, while we wait for a ctx->mutex acquisition, the event->ctx
    relation can have changed under us.

    That very same commit failed to recognise sys_perf_event_open() as an
    external access vector to the events and thereby didn't apply the
    established locking rules correctly.

    So while one sys_perf_event_open() call is stuck waiting on
    mutex_lock_double(), the other (which owns said locks) moves the group
    about. So by the time the former sys_perf_event_open() acquires the
    locks, the context we've acquired is stale (and possibly dead).

    Apply the established locking rules as per perf_event_ctx_lock_nested()
    to the mutex_lock_double() for the 'move_group' case. This obviously means
    we need to validate state after we acquire the locks.

    Reported-by: Di Shen (Keen Lab)
    Tested-by: John Dias
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Min Chong
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: f63a8daa5812 ("perf: Fix event->ctx locking")
    Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    There is a problem with installing an event in a task that is 'stuck' on
    an offline CPU.

    Blocked tasks are not disassociated from offlined CPUs; after all, a
    blocked task doesn't run and doesn't require a CPU etc. Only on
    wakeup do we amend the situation and place the task on an available
    CPU.

    If we hit such a task with perf_install_in_context() we'll loop until
    either that task wakes up or the CPU comes back online, if the task
    waking depends on the event being installed, we're stuck.

    While looking into this issue, I also spotted another problem, if we
    hit a task with perf_install_in_context() that is in the middle of
    being migrated, that is we observe the old CPU before sending the IPI,
    but run the IPI (on the old CPU) while the task is already running on
    the new CPU, things also go sideways.

    Rework things to rely on task_curr() -- outside of rq->lock -- which
    is rather tricky. Imagine the following scenario where we're trying to
    install the first event into our task 't':

    CPU0                           CPU1                           CPU2

                                   (current == t)

    t->perf_event_ctxp[] = ctx;
    smp_mb();
    cpu = task_cpu(t);

                                   switch(t, n);
                                                                  migrate(t, 2);
                                   switch(p, t);

                                                                  ctx = t->perf_event_ctxp[]; // must not be NULL

    smp_function_call(cpu, ..);

                                   generic_exec_single()
                                     func();
                                       spin_lock(ctx->lock);
                                       if (task_curr(t)) // false

                                       add_event_to_ctx();
                                       spin_unlock(ctx->lock);

                                                                  perf_event_context_sched_in();
                                                                    spin_lock(ctx->lock);
                                                                    // sees event
    So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
    Because if CPU2's load of that variable were to observe NULL, it would
    not try to schedule the ctx and we'd have a task running without its
    counter, which would be 'bad'.

    As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
    first and not see the event yet, then CPU0 must observe task_curr()
    and retry. If the install happens first, then we must see the event on
    sched-in and all is well.

    I think we can translate the first part (until the 'must not be NULL')
    of the scenario to a litmus test like:

    C C-peterz

    {}

    P0(int *x, int *y)
    {
        int r1;

        WRITE_ONCE(*x, 1);
        smp_mb();
        r1 = READ_ONCE(*y);
    }

    P1(int *y, int *z)
    {
        WRITE_ONCE(*y, 1);
        smp_store_release(z, 1);
    }

    P2(int *x, int *z)
    {
        int r1;
        int r2;

        r1 = smp_load_acquire(z);
        smp_mb();
        r2 = READ_ONCE(*x);
    }

    exists
    (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

    Where:
      x is perf_event_ctxp[],
      y is our task's CPU, and
      z is our task being placed on the rq of CPU2.

    The P0 smp_mb() is the one added by this patch, ordering the store to
    perf_event_ctxp[] from find_get_context() and the load of task_cpu()
    in task_function_call().

    The smp_store_release/smp_load_acquire model the RCpc locking of the
    rq->lock and the smp_mb() of P2 is the context switch switching from
    whatever CPU2 was running to our task 't'.

    This litmus test evaluates into:

    Test C-peterz Allowed
    States 7
    0:r1=0; 2:r1=0; 2:r2=0;
    0:r1=0; 2:r1=0; 2:r2=1;
    0:r1=0; 2:r1=1; 2:r2=1;
    0:r1=1; 2:r1=0; 2:r2=0;
    0:r1=1; 2:r1=0; 2:r2=1;
    0:r1=1; 2:r1=1; 2:r2=0;
    0:r1=1; 2:r1=1; 2:r2=1;
    No
    Witnesses
    Positive: 0 Negative: 7
    Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
    Observation C-peterz Never 0 7
    Hash=e427f41d9146b2a5445101d3e2fcaa34

    And the strong and weak model agree.

    Reported-by: Mark Rutland
    Tested-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Will Deacon
    Cc: jeremy.linton@arm.com
    Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull VFIO fixes from Alex Williamson:

    - Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)

    - Export and make use of has_capability() to fix incorrect use of
    ns_capable() for testing task capabilities (Jike Song)

    * tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio:
    vfio/type1: Remove pid_namespace.h include
    vfio iommu type1: fix the testing of capability for remote task
    capability: export has_capability
    vfio-mdev: remove some dead code
    vfio-mdev: buffer overflow in ioctl()
    vfio-mdev: return -EFAULT if copy_to_user() fails

    Linus Torvalds
     
  • Pull KVM fixes from Paolo Bonzini:

    - fix for module unload vs deferred jump labels (note: there might be
    other buggy modules!)

    - two NULL pointer dereferences from syzkaller

    - also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
    made worse during this merge window, "just" kernel memory leak on
    releases

    - fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: x86: fix emulation of "MOV SS, null selector"
    KVM: x86: fix NULL deref in vcpu_scan_ioapic
    KVM: eventfd: fix NULL deref irqbypass consumer
    KVM: x86: Introduce segmented_write_std
    KVM: x86: flush pending lapic jump label updates on module unload
    jump_labels: API for flushing deferred jump label updates

    Linus Torvalds
     

12 Jan, 2017

2 commits

    has_capability() is sometimes needed by modules to test the capabilities
    of a specified task other than current, so export it.

    Cc: Kirti Wankhede
    Signed-off-by: Jike Song
    Acked-by: Serge Hallyn
    Acked-by: James Morris
    Signed-off-by: Alex Williamson

    Jike Song
     
  • Modules that use static_key_deferred need a way to synchronize with
    any delayed work that is still pending when the module is unloaded.
    Introduce static_key_deferred_flush() which flushes any pending
    jump label updates.

    Signed-off-by: David Matlack
    Cc: stable@vger.kernel.org
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Paolo Bonzini

    David Matlack
     

11 Jan, 2017

4 commits

  • When the tick is stopped and an interrupt occurs afterward, we check on
    that interrupt exit if the next tick needs to be rescheduled. If it
    doesn't need any update, we don't want to do anything.

    In order to check if the tick needs an update, we compare it against the
    clockevent device deadline. Now that's a problem because the clockevent
    device is at a lower level than the tick itself if it is implemented
    on top of hrtimer.

    Every hrtimer shares this clockevent device. So comparing the next tick
    deadline against the clockevent device deadline is wrong, because the
    device may be programmed for another hrtimer whose deadline collides
    with the tick's. As a result we may accidentally fail to reprogram the
    tick.

    In a worst-case scenario under full dynticks mode, the tick stops
    firing at its expected 1 Hz rate, leaving /proc/stat stalled:

    Task in a full dynticks CPU
    ----------------------------

    * hrtimer A is queued 2 seconds ahead
    * the tick is stopped, scheduled 1 second ahead
    * tick fires 1 second later
    * on tick exit, nohz schedules the tick 1 second ahead but sees that
    the clockevent device is already programmed to that deadline; fooled
    by hrtimer A, it doesn't reschedule the tick.
    * hrtimer A is cancelled before its deadline
    * tick never fires again until an interrupt happens...

    In order to fix this, store the next tick deadline to the tick_sched
    local structure and reuse that value later to check whether we need to
    reprogram the clock after an interrupt.

    On the other hand, ts->sleep_length still wants to know about the next
    clock event and not just the tick, so we want to improve the related
    comment to avoid confusion.

    Reported-by: James Hartsock
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Wanpeng Li
    Acked-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Frederic Weisbecker
     
  • Since commit 00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
    can now trace init processes. init is initially protected with
    SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
    there are a number of paths during tracing where SIGNAL_UNKILLABLE can
    be implicitly cleared.

    This can result in init becoming stoppable/killable after tracing. For
    example, running:

    while true; do kill -STOP 1; done &
    strace -p 1

    and then stopping strace and the kill loop will result in init being
    left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
    init will now respond to future SIGSTOP signals rather than ignoring
    them.

    Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
    that we don't clear SIGNAL_UNKILLABLE.

    Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
    Signed-off-by: Jamie Iles
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jamie Iles
     
    Both arch_add_memory() and arch_remove_memory() expect a single-threaded
    context.

    For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
    not hold any locks over this check and branch:

    if (pgd_val(*pgd)) {
            pud = (pud_t *)pgd_page_vaddr(*pgd);
            paddr_last = phys_pud_init(pud, __pa(vaddr),
                                       __pa(vaddr_end),
                                       page_size_mask);
            continue;
    }

    pud = alloc_low_page();
    paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
                               page_size_mask);

    The result is that two threads calling devm_memremap_pages()
    simultaneously can end up colliding on pgd initialization. This leads
    to crash signatures like the following where the loser of the race
    initializes the wrong pgd entry:

    BUG: unable to handle kernel paging request at ffff888ebfff0000
    IP: memcpy_erms+0x6/0x10
    PGD 2f8e8fc067 PUD 0
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
    Commit 01b3f52157ff ("bpf: fix allocation warnings in bpf maps and
    integer overflow") added checks for the maximum allocatable size.
    It (ab)used KMALLOC_SHIFT_MAX for that purpose.

    While this is not incorrect, it is not very clean, because we already
    have KMALLOC_MAX_SIZE for this very purpose, so let's change both checks
    to use KMALLOC_MAX_SIZE instead.

    The original motivation for using KMALLOC_SHIFT_MAX was to work around
    an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
    but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
    will fit into MAX_ORDER".

    Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christoph Lameter
    Cc: Alexei Starovoitov
    Cc: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

10 Jan, 2017

1 commit

  • =========================================================
    [ INFO: possible irq lock inversion dependency detected ]
    4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G W
    ---------------------------------------------------------
    swapper/1/0 just changed the state of lock:
    (&(&sighand->siglock)->rlock){-.....}, at: [] __lock_task_sighand+0xb6/0x2c0
    but this lock took another, HARDIRQ-unsafe lock in the past:
    (ucounts_lock){+.+...}
    and interrupts could create inverse lock ordering between them.
    other info that might help us debug this:
    Chain exists of: &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
    Possible interrupt unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(ucounts_lock);
                                 local_irq_disable();
                                 lock(&(&sighand->siglock)->rlock);
                                 lock(&(&tty->ctrl_lock)->rlock);
    <Interrupt>
      lock(&(&sighand->siglock)->rlock);

    *** DEADLOCK ***

    This patch removes a dependency between rlock and ucounts_lock.
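    Stripped of the interrupt angle (which user space cannot model), the
    hazard lockdep reports is two locks acquirable in opposite orders on
    two CPUs. A hypothetical pthread sketch of the safe shape, where every
    path agrees on one acquisition order so the ABBA cycle cannot form:

    ```c
    #include <pthread.h>

    /* Stand-ins for the two locks in the report; names are illustrative,
     * and real siglock is irq-disabling, which pthreads cannot express. */
    static pthread_mutex_t demo_siglock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t demo_ucounts = PTHREAD_MUTEX_INITIALIZER;
    static int work_done;

    /* Both code paths take demo_siglock before demo_ucounts, so no
     * inverse ordering between them can ever arise. */
    static void *path(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&demo_siglock);
        pthread_mutex_lock(&demo_ucounts);
        work_done++;
        pthread_mutex_unlock(&demo_ucounts);
        pthread_mutex_unlock(&demo_siglock);
        return NULL;
    }

    int demo_run_paths(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, path, NULL);
        pthread_create(&b, NULL, path, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return work_done;
    }
    ```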

    Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrei Vagin
    Acked-by: Al Viro
    Signed-off-by: Eric W. Biederman

    Andrei Vagin
     

06 Jan, 2017

1 commit

  • Pull audit fixes from Paul Moore:
    "Two small fixes relating to audit's use of fsnotify.

    The first patch plugs a leak and the second fixes some lock
    shenanigans. The patches are small and I banged on this for an
    afternoon with our testsuite and didn't see anything odd"

    * 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
    audit: Fix sleep in atomic
    fsnotify: Remove fsnotify_duplicate_mark()

    Linus Torvalds
     

04 Jan, 2017

1 commit

  • Audit tree code was happily adding new notification marks while holding
    spinlocks. Since fsnotify_add_mark() acquires group->mark_mutex this can
    lead to sleeping while holding a spinlock, deadlocks due to lock
    inversion, and probably other fun. Fix the problem by acquiring
    group->mark_mutex earlier.
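    The shape of the fix, taking the sleeping lock before rather than
    inside the non-sleeping critical section, can be sketched in user
    space (hypothetical names; pthread mutexes stand in for both the
    kernel mutex and the spinlock):

    ```c
    #include <pthread.h>

    /* demo_mark_mutex stands in for group->mark_mutex (a sleeping lock);
     * demo_hash_lock stands in for the spinlock the audit tree code held. */
    static pthread_mutex_t demo_mark_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t demo_hash_lock  = PTHREAD_MUTEX_INITIALIZER;
    static int marks_added;

    int demo_add_mark(void)
    {
        /* Fixed ordering: take the mutex first, while sleeping is legal... */
        pthread_mutex_lock(&demo_mark_mutex);
        /* ...and only then enter the spinlock-protected section, so no
         * sleeping lock is ever acquired under the spinlock. */
        pthread_mutex_lock(&demo_hash_lock);
        marks_added++;
        pthread_mutex_unlock(&demo_hash_lock);
        pthread_mutex_unlock(&demo_mark_mutex);
        return marks_added;
    }
    ```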

    CC: Paul Moore
    Signed-off-by: Jan Kara
    Signed-off-by: Paul Moore

    Jan Kara
     

27 Dec, 2016

1 commit

  • The attempt to prevent overwriting an active state resulted in a
    disaster which effectively disables all dynamically allocated hotplug
    states.

    Cleanup the mess.

    Fixes: dc280d936239 ("cpu/hotplug: Prevent overwriting of callbacks")
    Reported-by: Markus Trippelsdorf
    Reported-by: Boris Ostrovsky
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

26 Dec, 2016

4 commits

  • Pull timer type cleanups from Thomas Gleixner:
    "This series does a tree-wide cleanup of types related to
    timers/timekeeping.

    - Get rid of cycles_t and use a plain u64. The type is not really
    helpful and caused more confusion than clarity

    - Get rid of the ktime union. The union has become useless as we use
    the scalar nanoseconds storage unconditionally now. The 32-bit
    timespec-like storage was removed due to the Y2038 limitations some
    time ago.

    That leaves the odd union access around for no reason. Clean it up.

    Both changes have been done with coccinelle and a small amount of
    manual mopping up"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ktime: Get rid of ktime_equal()
    ktime: Cleanup ktime_set() usage
    ktime: Get rid of the union
    clocksource: Use a plain u64 instead of cycle_t

    Linus Torvalds
     
  • Pull SMP hotplug notifier removal from Thomas Gleixner:
    "This is the final cleanup of the hotplug notifier infrastructure. The
    series has been reintegrated in the last two days because a new driver
    using the old infrastructure came in via the SCSI tree.

    Summary:

    - convert the last leftover drivers utilizing notifiers

    - fixup for a completely broken hotplug user

    - prevent setup of already used states

    - removal of the notifiers

    - treewide cleanup of hotplug state names

    - consolidation of state space

    There is a sphinx based documentation pending, but that needs review
    from the documentation folks"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/armada-xp: Consolidate hotplug state space
    irqchip/gic: Consolidate hotplug state space
    coresight/etm3/4x: Consolidate hotplug state space
    cpu/hotplug: Cleanup state names
    cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
    staging/lustre/libcfs: Convert to hotplug state machine
    scsi/bnx2i: Convert to hotplug state machine
    scsi/bnx2fc: Convert to hotplug state machine
    cpu/hotplug: Prevent overwriting of callbacks
    x86/msr: Remove bogus cleanup from the error path
    bus: arm-ccn: Prevent hotplug callback leak
    perf/x86/intel/cstate: Prevent hotplug callback leak
    ARM/imx/mmcd: Fix broken cpu hotplug handling
    scsi: qedi: Convert to hotplug state machine

    Linus Torvalds
     
    ktime_set(S,N) was required for the timespec storage type and is still
    useful in situations where a seconds and nanoseconds pair needs to be
    converted into a time value. For anything where the seconds argument is
    0, it is pointless and can be replaced with a simple assignment.
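    A hedged sketch of the simplification (demo types and names, not the
    kernel headers): once ktime_t is plain scalar nanoseconds, a zero
    seconds argument makes the helper a no-op wrapper around the
    nanoseconds value, so a direct assignment expresses the same thing.

    ```c
    /* Simplified stand-ins for the kernel's type and helper. */
    typedef long long demo_ktime_t;
    #define DEMO_NSEC_PER_SEC 1000000000LL

    /* Combine a seconds part and a nanoseconds part into one value. */
    demo_ktime_t demo_ktime_set(long long secs, unsigned long nsecs)
    {
        return secs * DEMO_NSEC_PER_SEC + nsecs;
    }

    /* With secs == 0 the helper adds nothing:
     * kt = demo_ktime_set(0, n) is just kt = n. */
    ```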

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     
    ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64-bit machines and in an endianness-optimized
    timespec variant for 32-bit machines. The Y2038 cleanup removed the
    timespec variant and switched everything to scalar nanoseconds. The
    union remained, but became completely pointless.

    Get rid of the union and just keep ktime_t as a simple typedef of type
    s64.

    The conversion was done with coccinelle and some manual mopping up.
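    In simplified form (demo types, not the actual header), the change
    replaces a single-member union with a bare typedef, and every access
    through the old member becomes a plain use of the value:

    ```c
    /* Before: a union whose only remaining member was scalar nanoseconds,
     * so every read and write had to go through .tv64. */
    union demo_ktime_old { long long tv64; };

    /* After: a bare scalar typedef; the value is used directly. */
    typedef long long demo_ktime_new;

    long long demo_old_read(union demo_ktime_old t) { return t.tv64; }
    long long demo_new_read(demo_ktime_new t)       { return t; }
    ```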

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

25 Dec, 2016

4 commits

  • There is no point in having an extra type for extra confusion. u64 is
    unambiguous.

    Conversion was done with the following coccinelle script:

    @rem@
    @@
    -typedef u64 cycle_t;

    @fix@
    typedef cycle_t;
    @@
    -cycle_t
    +u64

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: John Stultz

    Thomas Gleixner
     
  • hotcpu_notifier(), cpu_notifier(), __hotcpu_notifier(), __cpu_notifier(),
    register_hotcpu_notifier(), register_cpu_notifier(),
    __register_hotcpu_notifier(), __register_cpu_notifier(),
    unregister_hotcpu_notifier(), unregister_cpu_notifier(),
    __unregister_hotcpu_notifier(), __unregister_cpu_notifier()

    are unused now. Remove them and all related code.

    Remove also the now pointless cpu notifier error injection mechanism. The
    states can be executed step by step and error rollback is the same as cpu
    down, so any state transition can be tested w/o requiring the notifier
    error injection.

    Some CPU hotplug states are kept as they are (ab)used for hotplug state
    tracking.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161221192112.005642358@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Developers manage to overwrite states blindly without thought. That's fatal
    and hard to debug. Add sanity checks to make it fail.

    This requires restructuring the code so that the dynamic state allocation
    happens in the same lock-protected section as the actual store. Otherwise
    the previous assignment of 'Reserved' to the name field would trigger the
    overwrite check.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Link: http://lkml.kernel.org/r/20161221192111.675234535@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • This was entirely automated, using the script by Al:

    PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
    sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

    to do the replacement at the end of the merge window.

    Requested-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Dec, 2016

2 commits

  • Pull perf fixes from Ingo Molnar:
    "On the kernel side there's two x86 PMU driver fixes and a uprobes fix,
    plus on the tooling side there's a number of fixes and some late
    updates"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    perf sched timehist: Fix invalid period calculation
    perf sched timehist: Remove hardcoded 'comm_width' check at print_summary
    perf sched timehist: Enlarge default 'comm_width'
    perf sched timehist: Honour 'comm_width' when aligning the headers
    perf/x86: Fix overlap counter scheduling bug
    perf/x86/pebs: Fix handling of PEBS buffer overflows
    samples/bpf: Move open_raw_sock to separate header
    samples/bpf: Remove perf_event_open() declaration
    samples/bpf: Be consistent with bpf_load_program bpf_insn parameter
    tools lib bpf: Add bpf_prog_{attach,detach}
    samples/bpf: Switch over to libbpf
    perf diff: Do not overwrite valid build id
    perf annotate: Don't throw error for zero length symbols
    perf bench futex: Fix lock-pi help string
    perf trace: Check if MAP_32BIT is defined (again)
    samples/bpf: Make perf_event_read() static
    uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation
    samples/bpf: Make samples more libbpf-centric
    tools lib bpf: Add flags to bpf_create_map()
    tools lib bpf: use __u32 from linux/types.h
    ...

    Linus Torvalds
     
    There are only two call sites of fsnotify_duplicate_mark(). Those are
    in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
    for audit tree, inode pointer and group gets set in
    fsnotify_add_mark_locked() later anyway, mask and free_mark are already
    set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
    actively harmful because following fsnotify_add_mark_locked() will leak
    group reference by overwriting the group pointer. So just remove the two
    calls to fsnotify_duplicate_mark() and the function.

    Signed-off-by: Jan Kara
    [PM: line wrapping to fit in 80 chars]
    Signed-off-by: Paul Moore

    Jan Kara