Eric Lee / smarc-fsl-linux-kernel

18 Dec, 2015

1 commit

b4b29f948 locking/osq: Fix ordering of node initialisation in osq_lock ... Browse Code »

The Cavium guys reported a soft lockup on their arm64 machine, caused by
commit c55a6ffa6285 ("locking/osq: Relax atomic semantics"):

mutex_optimistic_spin+0x9c/0x1d0
__mutex_lock_slowpath+0x44/0x158
mutex_lock+0x54/0x58
kernfs_iop_permission+0x38/0x70
__inode_permission+0x88/0xd8
inode_permission+0x30/0x6c
link_path_walk+0x68/0x4d4
path_openat+0xb4/0x2bc
do_filp_open+0x74/0xd0
do_sys_open+0x14c/0x228
SyS_openat+0x3c/0x48
el0_svc_naked+0x24/0x28

This is because in osq_lock we initialise the node for the current CPU:

node->locked = 0;
node->next = NULL;
node->cpu = curr;

and then publish the current CPU in the lock tail:

old = atomic_xchg_acquire(&lock->tail, curr);

Once the update to lock->tail is visible to another CPU, the node is
then live and can be both read and updated by concurrent lockers.

Unfortunately, the ACQUIRE semantics of the xchg operation mean that
there is no guarantee the contents of the node will be visible before
lock tail is updated. This can lead to lock corruption when, for
example, a concurrent locker races to set the next field.

Fixes: c55a6ffa6285 ("locking/osq: Relax atomic semantics"):
Reported-by: David Daney
Reported-by: Andrew Pinski
Tested-by: Andrew Pinski
Acked-by: Davidlohr Bueso
Signed-off-by: Will Deacon
Signed-off-by: Peter Zijlstra (Intel)
Link: http://lkml.kernel.org/r/1449856001-21177-1-git-send-email-will.deacon@arm.com
Signed-off-by: Linus Torvalds

Will Deacon
2015-12-18 03:40:29 +0800

14 Dec, 2015

1 commit

dfd01f026 sched/wait: Fix the signal handling fix ... Browse Code »

Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/

His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().

We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed. We must
instead pass the initial state along and use that.

Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek
Reported-by: Chris Mason
Tested-by: Jan Stancek
Tested-by: Vladimir Murzin
Tested-by: Chris Mason
Reviewed-by: Paul Turner
Cc: Ingo Molnar
Cc: tglx@linutronix.de
Cc: Oleg Nesterov
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Linus Torvalds

Peter Zijlstra
2015-12-14 06:30:59 +0800

13 Dec, 2015

1 commit

86fffe4a6 kernel: remove stop_machine() Kconfig dependency ... Browse Code »

Currently the full stop_machine() routine is only enabled on SMP if
module unloading is enabled, or if the CPUs are hotpluggable. This
leads to configurations where stop_machine() is broken as it will then
only run the callback on the local CPU with irqs disabled, and not stop
the other CPUs or run the callback on them.

For example, this breaks MTRR setup on x86 in certain configs since
ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
text_poke_smp_batch() functions") as the MTRR is only established on the
boot CPU.

This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
and HOTPLUG_CPU config options to compile the correct stop_machine() for
the architecture, removing the false dependency on MODULE_UNLOAD in the
process.

Link: https://lkml.org/lkml/2014/10/8/124
References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
Signed-off-by: Chris Wilson
Acked-by: Ingo Molnar
Cc: "Paul E. McKenney"
Cc: Pranith Kumar
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Johannes Weiner
Cc: H. Peter Anvin
Cc: Tejun Heo
Cc: Iulia Manda
Cc: Andy Lutomirski
Cc: Rusty Russell
Cc: Peter Zijlstra
Cc: Chuck Ebbert
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chris Wilson
2015-12-13 02:15:34 +0800

09 Dec, 2015

2 commits

5406812e5 Merge branch 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup fixes from Tejun Heo:
"More change than I'd have liked at this stage. The pids controller
and the changes made to cgroup core to support it introduced and
revealed several important issues.

- Assigning membership to a newly created task and migrating it can
race leading to incorrect accounting. Oleg fixed it by widening
threadgroup synchronization. It looks like we'll be able to merge
it with a different percpu rwsem which is used in fork path making
things simpler and cheaper.

- The recent change to extend cgroup membership to zombies (so that
pid accounting can extend till the pid is actually released) missed
pinning the underlying data structures leading to use-after-free.
Fixed.

- v2 hierarchy was calling subsystem callbacks with the wrong target
cgroup_subsys_state based on the incorrect assumption that they
share the same target. pids is the first controller affected by
this. Subsys callbacks updated so that they can deal with
multi-target migrations"

* 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup_pids: don't account for the root cgroup
cgroup: fix handling of multi-destination migration from subtree_control enabling
cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach()
cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()
cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()
cgroup: make css_set pin its css's to avoid use-afer-free
cgroup: fix cftype->file_offset handling

Linus Torvalds
2015-12-09 05:35:52 +0800
51825c8a8 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf fixes from Ingo Molnar:
"This tree includes four core perf fixes for misc bugs, three fixes to
x86 PMU drivers, and two updates to old email addresses"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do not send exit event twice
perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro
perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell
perf: Fix PERF_EVENT_IOC_PERIOD deadlock
treewide: Remove old email address
perf/x86: Fix LBR call stack save/restore
perf: Update email address in MAINTAINERS
perf/core: Robustify the perf_cgroup_from_task() RCU checks
perf/core: Fix RCU problem with cgroup context switching code

Linus Torvalds
2015-12-09 05:01:23 +0800

07 Dec, 2015

2 commits

0b98f0c04 Merge branch 'master' into for-4.4-fixes ... Browse Code »

The following commit which went into mainline through networking tree

3b13758f51de ("cgroups: Allow dynamically changing net_classid")

conflicts in net/core/netclassid_cgroup.c with the following pending
fix in cgroup/for-4.4-fixes.

1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

The former separates out update_classid() from cgrp_attach() and
updates it to walk all fds of all tasks in the target css so that it
can be used from both migration and config change paths. The latter
drops @css from cgrp_attach().

Resolve the conflict by making cgrp_attach() call update_classid()
with the css from the first task. We can revive @tset walking in
cgrp_attach() but given that net_cls is v1 only where there always is
only one target css during migration, this is fine.

Signed-off-by: Tejun Heo
Reported-by: Stephen Rothwell
Cc: Nina Schiff

Tejun Heo
2015-12-07 23:09:03 +0800
fb7b26e47 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull scheduler fixes from Thomas Gleixner:
"This updates contains the following changes:

- Fix a signal handling regression in the bit wait functions.

- Avoid false positive warnings in the wakeup path.

- Initialize the scheduler root domain properly.

- Handle gtime calculations in proc/$PID/stat proper.

- Add more documentation for the barriers in try_to_wake_up().

- Fix a subtle race in try_to_wake_up() which might cause a task to
be scheduled on two cpus

- Compile static helper function only when it is used"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
sched/core: Better document the try_to_wake_up() barriers
sched/cputime: Fix invalid gtime in proc
sched/core: Clear the root_domain cpumasks in init_rootdomain()
sched/core: Remove false-positive warning from wake_up_process()
sched/wait: Fix signal handling in bit wait helpers
sched/rt: Hide the push_irq_work_func() declaration

Linus Torvalds
2015-12-07 00:35:05 +0800

06 Dec, 2015

1 commit

4e93ad601 perf: Do not send exit event twice ... Browse Code »

In case we monitor events system wide, we get EXIT event
(when configured) twice for each task that exited.

Note doubled lines with same pid/tid in following example:

$ sudo ./perf record -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.480 MB perf.data (2518 samples) ]
$ sudo ./perf report -D | grep EXIT

0 60290687567581 0x59910 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
0 60290687568354 0x59948 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
0 60290687988744 0x59ad8 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
0 60290687989198 0x59b10 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
1 60290692567895 0x62af0 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
1 60290692568322 0x62b28 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
2 60290692739276 0x69a18 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
2 60290692739910 0x69a50 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)

The reason is that the cpu contexts are processes each time
we call perf_event_task. I'm changing the perf_event_aux logic
to serve task_ctx and cpu contexts separately, which ensure we
don't get EXIT event generated twice on same cpu context.

This does not affect other auxiliary events, as they don't
use task_ctx at all.

Signed-off-by: Jiri Olsa
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: David Ahern
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1446649205-5822-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar

Jiri Olsa
2015-12-06 19:54:49 +0800

04 Dec, 2015

9 commits

ecf7d01c2 sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() ... Browse Code »

Oleg noticed that its possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited; the task is in the middle of
being scheduled out; it could still result in corruption of the
scheduler data structures.

CPU0 CPU1

set_current_state(...)

context_switch(X, Y)
prepare_lock_switch(Y)
Y->on_cpu = 1;
finish_lock_switch(X)
store_release(X->on_cpu, 0);

try_to_wake_up(X)
LOCK(p->pi_lock);

t = X->on_cpu; // 0

context_switch(Y, X)
prepare_lock_switch(X)
X->on_cpu = 1;
finish_lock_switch(Y)
store_release(Y->on_cpu, 0);

schedule();
deactivate_task(X);
X->on_rq = 0;

if (X->on_rq) // false

if (t) while (X->on_cpu)
cpu_relax();

context_switch(X, ..)
finish_lock_switch(X)
store_release(X->on_cpu, 0);

Avoid the load of X->on_cpu being hoisted over the X->on_rq load.

Reported-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar

Peter Zijlstra
2015-12-04 17:26:43 +0800
b75a22531 sched/core: Better document the try_to_wake_up() barriers ... Browse Code »

Explain how the control dependency and smp_rmb() end up providing
ACQUIRE semantics and pair with smp_store_release() in
finish_lock_switch().

Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar

Peter Zijlstra
2015-12-04 17:26:42 +0800
2541117b0 sched/cputime: Fix invalid gtime in proc ... Browse Code »

/proc/stats shows invalid gtime when the thread is running in guest.
When vtime accounting is not enabled, we cannot get a valid delta.
The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
is only updated when vtime accounting is runtime enabled.

This patch makes task_gtime() just return gtime without computing the
buggy non-existing tickless delta when vtime accounting is not enabled.

Use context_tracking_is_enabled() to check if vtime is accounting on
some cpu, in which case only we need to check the tickless delta. This
way we fix the gtime value regression on machines not running nohz full.

The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full.

I ran and stop a busy loop in VM and see the gtime in host.
Dump the 43rd field which shows the gtime in every second:

# while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
S 4348
R 7064566
R 7064766
R 7064967
R 7065168
S 4759
S 4759

During running busy loop, it returns large value.

After applying this patch, we can see right gtime.

# while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
S 5338
R 5365
R 5465
R 5566
R 5666
S 5726
S 5726

Signed-off-by: Hiroshi Shimamoto
Signed-off-by: Frederic Weisbecker
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Linus Torvalds
Cc: Luiz Capitulino
Cc: Mike Galbraith
Cc: Paul E . McKenney
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

Hiroshi Shimamoto
2015-12-04 17:18:49 +0800
8295c6992 sched/core: Clear the root_domain cpumasks in init_rootdomain() ... Browse Code »

root_domain::rto_mask allocated through alloc_cpumask_var()
contains garbage data, this may cause problems. For instance,
When doing pull_rt_task(), it may do useless iterations if
rto_mask retains some extra garbage bits. Worse still, this
violates the isolated domain rule for clustered scheduling
using cpuset, because the tasks(with all the cpus allowed)
belongs to one root domain can be pulled away into another
root domain.

The patch cleans the garbage by using zalloc_cpumask_var()
instead of alloc_cpumask_var() for root_domain::rto_mask
allocation, thereby addressing the issues.

Do the same thing for root_domain's other cpumask memembers:
dlo_mask, span, and online.

Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Cc:
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar

Xunlei Pang
2015-12-04 17:16:21 +0800
119d6f6a3 sched/core: Remove false-positive warning from wake_up_process() ... Browse Code »

Because wakeups can (fundamentally) be late, a task might not be in
the expected state. Therefore testing against a task's state is racy,
and can yield false positives.

Signed-off-by: Sasha Levin
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: oleg@redhat.com
Fixes: 9067ac85d533 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
Signed-off-by: Ingo Molnar

Sasha Levin
2015-12-04 17:10:16 +0800
68985633b sched/wait: Fix signal handling in bit wait helpers ... Browse Code »

Vladimir reported getting RCU stall warnings and bisected it back to
commit:

743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")

That commit inadvertently reversed the calls to schedule() and signal_pending(),
thereby not handling the case where the signal receives while we sleep.

Reported-by: Vladimir Murzin
Tested-by: Vladimir Murzin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: mark.rutland@arm.com
Cc: neilb@suse.de
Cc: oleg@redhat.com
Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")
Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

Peter Zijlstra
2015-12-04 17:10:15 +0800
642c2d671 perf: Fix PERF_EVENT_IOC_PERIOD deadlock ... Browse Code »

Dmitry reported a fairly silly recursive lock deadlock for
PERF_EVENT_IOC_PERIOD, fix this by explicitly doing the inactive part of
__perf_event_period() instead of calling that function.

Reported-by: Dmitry Vyukov
Signed-off-by: Peter Zijlstra (Intel)
Cc:
Cc: Alexander Potapenko
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Eric Dumazet
Cc: Jiri Olsa
Cc: Kostya Serebryany
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Sasha Levin
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Fixes: c7999c6f3fed ("perf: Fix PERF_EVENT_IOC_PERIOD migration race")
Link: http://lkml.kernel.org/r/20151130115615.GJ17308@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

Peter Zijlstra
2015-12-04 17:08:03 +0800
071f5d105 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:
"A lot of Thanksgiving turkey leftovers accumulated, here goes:

1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.

2) IDs for some new iwlwifi chips, from Oren Givon.

3) Fix rtlwifi lockups on boot, from Larry Finger.

4) Fix memory leak in fm10k, from Stephen Hemminger.

5) We have a route leak in the ipv6 tunnel infrastructure, fix from
Paolo Abeni.

6) Fix buffer pointer handling in arm64 bpf JIT,f rom Zi Shen Lim.

7) Wrong lockdep annotations in tcp md5 support, fix from Eric
Dumazet.

8) Work around some middle boxes which prevent proper handling of TCP
Fast Open, from Yuchung Cheng.

9) TCP repair can do huge kmalloc() requests, build paged SKBs
instead. From Eric Dumazet.

10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
Borkmann.

11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
Nikolay Aleksandrov.

12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
Weikusat.

13) Fix double free in VRF code, from Nikolay Aleksandrov.

14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.

15) Fix ifup/ifdown crach in xgene driver, from Iyappan Subramanian.

16) Fix clearing of persistent array maps in bpf, from Daniel
Borkmann.

17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
early enough. From Eric Dumazet.

18) Fix out of bounds accesses in bpf array implementation when
updating elements, from Daniel Borkmann.

19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
Dumazet.

20) When dumping proxy neigh entries, we have to accomodate NULL
device pointers properly, from Konstantin Khlebnikov.

21) SCTP doesn't release all ipv6 socket resources properly, fix from
Eric Dumazet.

22) Prevent underflows of sch->q.qlen for multiqueue packet
schedulers, also from Eric Dumazet.

23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
Huang and Michael Chan.

24) Don't actively scan radar channels, from Antonio Quartulli"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
net: phy: reset only targeted phy
bnxt_en: Setup uc_list mac filters after resetting the chip.
bnxt_en: enforce proper storing of MAC address
bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
net: lpc_eth: remove irq > NR_IRQS check from probe()
net_sched: fix qdisc_tree_decrease_qlen() races
openvswitch: fix hangup on vxlan/gre/geneve device deletion
ipv4: igmp: Allow removing groups from a removed interface
ipv6: sctp: implement sctp_v6_destroy_sock()
arm64: bpf: add 'store immediate' instruction
ipv6: kill sk_dst_lock
ipv6: sctp: add rcu protection around np->opt
net/neighbour: fix crash at dumping device-agnostic proxy entries
sctp: use GFP_USER for user-controlled kmalloc
sctp: convert sack_needed and sack_generation to bits
ipv6: add complete rcu protection around np->opt
bpf: fix allocation warnings in bpf maps and integer overflow
mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
net: mvneta: enable setting custom TX IP checksum limit
net: mvneta: fix error path for building skb
...

Linus Torvalds
2015-12-04 08:02:46 +0800
c041f0873 Merge tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing fix from Steven Rostedt:
"During the merge window I added a new file that is used to filter
trace events on pids. It filters all events where only tasks with
their pid in that file exists. It also handles the sched_switch and
sched_wakeup trace events where the current task does not have its pid
in the file, but the task either being switched to or awaken does.

Unfortunately, I forgot about sched_wakeup_new and sched_waking. Both
of these tracepoints use the same class as the sched_wakeup
tracepoint, and they too should be included in what gets filtered by
the set_event_pid file"

* tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter

Linus Torvalds
2015-12-04 07:23:17 +0800

03 Dec, 2015

4 commits

67cde9c49 cgroup_pids: don't account for the root cgroup ... Browse Code »

Because accounting resources for the root cgroup sometimes incurs
measureable overhead for workloads which don't care about cgroup and
often ends up calculating a number which is available elsewhere in a
slightly different form, cgroup is not in the business of providing
system-wide statistics. The pids controller which was introduced
recently was exposing "pids.current" at the root. This patch disable
accounting for root cgroup and removes the file from the root
directory.

While this is a userland visible behavior change, pids has been
available only in one version and was badly broken there, so I don't
think this will be noticeable. If it turns out to be a problem, we
can reinstate it for v1 hierarchies.

Signed-off-by: Tejun Heo
Cc: Aleksa Sarai

Tejun Heo
2015-12-03 23:18:21 +0800
1f7dd3e5a cgroup: fix handling of multi-destination migration from subtree_control enabling ... Browse Code »

Consider the following v2 hierarchy.

P0 (+memory) --- P1 (-memory) --- A
\- B

P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[] dump_stack+0x4e/0x82
[] warn_slowpath_common+0x82/0xc0
[] warn_slowpath_null+0x1a/0x20
[] pids_cancel.constprop.6+0x31/0x40
[] pids_can_attach+0x6d/0xf0
[] cgroup_taskset_migrate+0x6c/0x330
[] cgroup_migrate+0xf5/0x190
[] cgroup_attach_task+0x176/0x200
[] __cgroup_procs_write+0x2ad/0x460
[] cgroup_procs_write+0x14/0x20
[] cgroup_file_write+0x35/0x1c0
[] kernfs_fop_write+0x141/0x190
[] __vfs_write+0x28/0xe0
[] vfs_write+0xac/0x1a0
[] SyS_write+0x49/0xb0
[] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.

* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.

* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.

Signed-off-by: Tejun Heo
Reported-and-tested-by: Daniel Wagner
Cc: Aleksa Sarai

Tejun Heo
2015-12-03 23:18:21 +0800
599c963a0 cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach() ... Browse Code »

If one or more tasks get moved into a frozen css, the frozen state is
cleared up from the destination css so that it can be reasserted once
the migrated tasks are frozen. freezer_attach() implements this in
two separate steps - clearing CGROUP_FROZEN on the target css while
processing each task and propagating the clearing upwards after the
task loop is done if necessary.

This patch merges the two steps. Propagation now takes place inside
the task loop. This simplifies the code and prepares it for the fix
of multi-destination migration.

Signed-off-by: Tejun Heo

Tejun Heo
2015-12-03 23:18:21 +0800
01b3f5215 bpf: fix allocation warnings in bpf maps and integer overflow ... Browse Code »

For large map->value_size the user space can trigger memory allocation warnings like:
WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
__alloc_pages_nodemask+0x695/0x14e0()
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[] dump_stack+0x68/0x92 lib/dump_stack.c:50
[] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
[] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
[< inline >] __alloc_pages_slowpath mm/page_alloc.c:2989
[] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
[] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
[< inline >] alloc_pages include/linux/gfp.h:451
[] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
[] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
[] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
[< inline >] kmalloc_large include/linux/slab.h:390
[] __kmalloc+0x234/0x250 mm/slub.c:3525
[< inline >] kmalloc include/linux/slab.h:463
[< inline >] map_update_elem kernel/bpf/syscall.c:288
[< inline >] SYSC_bpf kernel/bpf/syscall.c:744

To avoid never succeeding kmalloc with order >= MAX_ORDER check that
elem->value_size and computed elem_size are within limits for both hash and
array type maps.
Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-12-03 12:36:00 +0800

02 Dec, 2015

2 commits

fbca9d2d3 bpf, array: fix heap out-of-bounds access when updating elements ... Browse Code »

During own review but also reported by Dmitry's syzkaller [1] it has been
noticed that we trigger a heap out-of-bounds access on eBPF array maps
when updating elements. This happens with each map whose map->value_size
(specified during map creation time) is not multiple of 8 bytes.

In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
used to align array map slots for faster access. However, in function
array_map_update_elem(), we update the element as ...

memcpy(array->value + array->elem_size * index, value, array->elem_size);

... where we access 'value' out-of-bounds, since it was allocated from
map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
and later on copied through copy_from_user(value, uvalue, map->value_size).
Thus, up to 7 bytes, we can access out-of-bounds.

Same could happen from within an eBPF program, where in worst case we
access beyond an eBPF program's designated stack.

Since 1be7f75d1668 ("bpf: enable non-root eBPF programs") didn't hit an
official release yet, it only affects priviledged users.

In case of array_map_lookup_elem(), the verifier prevents eBPF programs
from accessing beyond map->value_size through check_map_access(). Also
from syscall side map_lookup_elem() only copies map->value_size back to
user, so nothing could leak.

[1] http://github.com/google/syzkaller

Fixes: 28fbcfa08d8e ("bpf: add array type of eBPF maps")
Reported-by: Dmitry Vyukov
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-12-02 10:56:17 +0800
0f72e37e4 tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter ... Browse Code »

The set_event_pid filter relies on attaching to the sched_switch and
sched_wakeup tracepoints to see if it should filter the tracing on schedule
tracepoints. By adding the callbacks to sched_wakeup, pids in the
set_event_pid file will trace the wakeups of those tasks with those pids.

But sched_wakeup_new and sched_waking were missed. These two should also be
traced. Luckily, these tracepoints share the same class as sched_wakeup
which means they can use the same pre and post callbacks as sched_wakeup
does.

Signed-off-by: Steven Rostedt

Steven Rostedt (Red Hat)
2015-12-02 05:08:05 +0800

01 Dec, 2015

1 commit

9e5d25e82 Merge tag 'trace-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing fixes from Steven Rostedt:
"I found two minor bugs while doing development on the ring buffer
code.

The first is something that's been there since its creation. If a
reader reads a page out of the ring buffer before there's any events
on it, it can get an out of date timestamp for that event. It may be
off by a few microseconds, more if the first event gets discarded.
The fix was to only update the reader time stamp when it actually sees
an event on the page, instead of just reading the timestamp from the
page even if it has no events on it. That timestamp is still volatile
until an event is present.

The second bug is more recent. Instead of passing around parameters a
descriptor was made and the parameters are passed via a single
descriptor. This simplified the code a bit. But there was one place
that expected the parameter to be passed by value not reference (which
a descriptor now does). And it added to the length of the event,
which may be ignored later, but the length should not have been
increased. The only real problem with this bug is that it may
allocate more than was needed for the event"

* tag 'trace-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Put back the length if crossed page with add_timestamp
ring-buffer: Update read stamp with first real commit on page

Linus Torvalds
2015-12-01 07:38:23 +0800

30 Nov, 2015

3 commits

afbcb364b cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork() ... Browse Code »

Now that we know that the forking task can't migrate amd the child is always
moved to the same cgroup by cgroup_post_fork()->css_set_move_task() we can
change pids_can_fork() and pids_cancel_fork() to just use task_css(current).
And since we no longer need to pin this css, we can remove pid_fork().

Note: the patch uses task_css_check(true), perhaps it makes sense to add a
helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into
account.

Signed-off-by: Oleg Nesterov
Acked-by: Zefan Li
Signed-off-by: Tejun Heo

Oleg Nesterov
2015-11-30 22:48:18 +0800
c9e75f049 cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate() ... Browse Code »

If the new child migrates to another cgroup before cgroup_post_fork() calls
subsys->fork(), then both pids_can_attach() and pids_fork() will do the same
pids_uncharge(old_pids) + pids_charge(pids) sequence twice.

Change copy_process() to call threadgroup_change_begin/threadgroup_change_end
unconditionally. percpu_down_read() is cheap and this allows other cleanups,
see the next changes.

Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem.

Signed-off-by: Oleg Nesterov
Acked-by: Zefan Li
Signed-off-by: Tejun Heo

Oleg Nesterov
2015-11-30 22:48:18 +0800
53254f900 cgroup: make css_set pin its css's to avoid use-afer-free ... Browse Code »

A css_set represents the relationship between a set of tasks and
css's. css_set never pinned the associated css's. This was okay
because tasks used to always disassociate immediately (in RCU sense) -
either a task is moved to a different css_set or exits and never
accesses css_set again.

Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
and use it to fix pids controller") and patches leading up to it made
a zombie hold onto its css_set and deref the associated css's on its
release. Nothing pins the css's after exit and it might have already
been freed leading to use-after-free.

general protection fault: 0000 [#1] PREEMPT SMP
task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
RIP: 0010:[] [] pids_cancel.constprop.4+0x5/0x40
...
Call Trace:

[] ? pids_free+0x3d/0xa0
[] cgroup_free+0x53/0xe0
[] __put_task_struct+0x42/0x130
[] delayed_put_task_struct+0x77/0x130
[] rcu_process_callbacks+0x2f4/0x820
[] ? rcu_process_callbacks+0x2b3/0x820
[] __do_softirq+0xd4/0x460
[] irq_exit+0x89/0xa0
[] smp_apic_timer_interrupt+0x42/0x50
[] apic_timer_interrupt+0x84/0x90

...
Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
RIP [] pids_cancel.constprop.4+0x5/0x40
RSP
---[ end trace 89a4a4b916b90c49 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt

Fix it by making css_set pin the associate css's until its release.

Signed-off-by: Tejun Heo
Reported-by: Dave Jones
Reported-by: Daniel Wagner
Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")

Tejun Heo
2015-11-30 22:46:21 +0800

26 Nov, 2015

1 commit

c9da161c6 bpf: fix clearing on persistent program array maps ... Browse Code »

Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.

Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.

Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.

Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.

Joint work with Alexei.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-11-26 01:14:09 +0800

25 Nov, 2015

1 commit

81b1a832d pidns: fix NULL dereference in __task_pid_nr_ns() ... Browse Code »

I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :

pid_nr_ns() was inlined, but apparently compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :

if (pid && ns->level level) { // CRASH

Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(

get_task_pid() can benefit from same fix.

Signed-off-by: Eric Dumazet
Signed-off-by: Linus Torvalds

Eric Dumazet
2015-11-25 04:03:55 +0800

24 Nov, 2015

2 commits

bd1b7cd36 ring-buffer: Put back the length if crossed page with add_timestamp ... Browse Code »

Commit fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing
data" added a descriptor that holds various data instead of passing around
several variables through parameters. The problem was that one of the
parameters was modified in a function and the code was designed not to have
an effect on that modified parameter. Now that the parameter is a
descriptor and any modifications to it are non-volatile, the size of the
data could be unnecessarily expanded.

Remove the extra space added if a timestamp was added and the event went
across the page.

Cc: stable@vger.kernel.org # 4.3+
Fixes: fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data"
Signed-off-by: Steven Rostedt

Steven Rostedt (Red Hat)
2015-11-24 22:27:25 +0800
b81f472a2 ring-buffer: Update read stamp with first real commit on page ... Browse Code »

Do not update the read stamp after swapping out the reader page from the
write buffer. If the reader page is swapped out of the buffer before an
event is written to it, then the read_stamp may get an out of date
timestamp, as the page timestamp is updated on the first commit to that
page.

rb_get_reader_page() only returns a page if it has an event on it, otherwise
it will return NULL. At that point, check if the page being returned has
events and has not been read yet. Then at that point update the read_stamp
to match the time stamp of the reader page.

Cc: stable@vger.kernel.org # 2.6.30+
Signed-off-by: Steven Rostedt

Steven Rostedt (Red Hat)
2015-11-24 22:23:17 +0800

23 Nov, 2015

4 commits

90eec103b treewide: Remove old email address ... Browse Code »

There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Signed-off-by: Ingo Molnar

Peter Zijlstra
2015-11-23 16:44:58 +0800
89b411081 sched/rt: Hide the push_irq_work_func() declaration ... Browse Code »

The push_irq_work_func() function is conditionally defined only
when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the
forward declaration remains visibile without HAVE_RT_PUSH_IPI,
causing a gcc warning in ARM64 allnoconfig:

kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function]

This changes the code to use the same condition for both the
declaration and the function definition, which gets rid of the
warning.

As Peter Zijlstra, we can possibly get rid of the whole HAVE_RT_PUSH_IPI
thing after:

8053871d0f7f ("smp: Fix smp_call_function_single_async() locking")

Until that is done, this patch can be used to avoid the warning.

Signed-off-by: Arnd Bergmann
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Steven Rostedt
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfel
Signed-off-by: Ingo Molnar

Arnd Bergmann
2015-11-23 16:25:08 +0800
614e4c4eb perf/core: Robustify the perf_cgroup_from_task() RCU checks ... Browse Code »

This patch reinforces the lockdep checks performed by
perf_cgroup_from_tsk() by passing the perf_event_context
whenever possible. It is okay to not hold the RCU read lock
when we know we hold the ctx->lock. This patch makes sure this
property holds.

In some functions, such as perf_cgroup_sched_in(), we do not
pass the context because we are sure we are holding the RCU
read lock.

Signed-off-by: Stephane Eranian
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: edumazet@google.com
Link: http://lkml.kernel.org/r/1447322404-10920-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar

Stephane Eranian
2015-11-23 16:21:03 +0800
ddaaf4e29 perf/core: Fix RCU problem with cgroup context switching code ... Browse Code »

The RCU checker detected RCU violation in the cgroup switching routines
perf_cgroup_sched_in() and perf_cgroup_sched_out(). We were dereferencing
cgroup from task without holding the RCU lock.

Fix this by holding the RCU read lock. We move the locking from
perf_cgroup_switch() to avoid double locking.

Signed-off-by: Stephane Eranian
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: edumazet@google.com
Link: http://lkml.kernel.org/r/1447322404-10920-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar

Stephane Eranian
2015-11-23 16:21:03 +0800

21 Nov, 2015

2 commits

7625b3a00 kernel/panic.c: turn off locks debug before releasing console lock ... Browse Code »

Commit 08d78658f393 ("panic: release stale console lock to always get the
logbuf printed out") introduced an unwanted bad unlock balance report when
panic() is called directly and not from OOPS (e.g. from out_of_memory()).
The difference is that in case of OOPS we disable locks debug in
oops_enter() and on direct panic call nobody does that.

Fixes: 08d78658f393 ("panic: release stale console lock to always get the logbuf printed out")
Reported-by: kernel test robot
Signed-off-by: Vitaly Kuznetsov
Cc: HATAYAMA Daisuke
Cc: Masami Hiramatsu
Cc: Jiri Kosina
Cc: Baoquan He
Cc: Prarit Bhargava
Cc: Xie XiuQi
Cc: Seth Jennings
Cc: "K. Y. Srinivasan"
Cc: Jan Kara
Cc: Petr Mladek
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Kuznetsov
2015-11-21 08:17:32 +0800
9d8a76521 kernel/signal.c: unexport sigsuspend() ... Browse Code »

sigsuspend() is nowhere used except in signal.c itself, so we can mark it
static do not pollute the global namespace.

But this patch is more than a boring cleanup patch, it fixes a real issue
on UserModeLinux. UML has a special console driver to display ttys using
xterm, or other terminal emulators, on the host side. Vegard reported
that sometimes UML is unable to spawn a xterm and he's facing the
following warning:

WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()

It turned out that this warning makes absolutely no sense as the UML
xterm code calls sigsuspend() on the host side, at least it tries. But
as the kernel itself offers a sigsuspend() symbol the linker choose this
one instead of the glibc wrapper. Interestingly this code used to work
since ever but always blocked signals on the wrong side. Some recent
kernel change made the WARN_ON() trigger and uncovered the bug.

It is a wonderful example of how much works by chance on computers. :-)

Fixes: 68f3f16d9ad0f1 ("new helper: sigsuspend()")
Signed-off-by: Richard Weinberger
Reported-by: Vegard Nossum
Tested-by: Vegard Nossum
Acked-by: Oleg Nesterov
Cc: [3.5+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Richard Weinberger
2015-11-21 08:17:32 +0800

20 Nov, 2015

1 commit

a3d66b5a1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching ... Browse Code »

Pull livepatching fix from Jiri Kosina:
"A fix for module handling in case kASLR has been enabled, from Zhou
Chengming"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: x86: fix relocation computation with kASLR

Linus Torvalds
2015-11-20 04:16:12 +0800

16 Nov, 2015

2 commits

34c06254f cgroup: fix cftype->file_offset handling ... Browse Code »

6f60eade2433 ("cgroup: generalize obtaining the handles of and
notifying cgroup files") introduced cftype->file_offset so that the
handles for per-css file instances can be recorded. These handles
then can be used, for example, to generate file modified
notifications.

Unfortunately, it made the wrong assumption that files are created
once for a given css and removed on its destruction. Due to the
dependencies among subsystems, a css may be hidden from userland and
then later shown again. This is implemented by removing and
re-creating the affected files, so the associated kernfs_node for a
given cgroup file may change over time. This incorrect assumption led
to the corruption of css->files lists.

Reimplement cftype->file_offset handling so that cgroup_file->kn is
protected by a lock and updated as files are created and destroyed.
This also makes keeping them on per-cgroup list unnecessary.

Signed-off-by: Tejun Heo
Reported-by: James Sedgwick
Fixes: 6f60eade2433 ("cgroup: generalize obtaining the handles of and notifying cgroup files")
Acked-by: Johannes Weiner
Acked-by: Zefan Li

Tejun Heo
2015-11-16 23:58:26 +0800
0ca9b6760 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf updates from Thomas Gleixner:
"Mostly updates to the perf tool plus two fixes to the kernel core code:

- Handle tracepoint filters correctly for inherited events (Peter
Zijlstra)

- Prevent a deadlock in perf_lock_task_context (Paul McKenney)

- Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
Melo)

- Print full source file paths when using 'perf annotate --print-line
--full-paths' (Michael Petlan)

- Fix 'perf probe -d' when just one out of uprobes and kprobes is
enabled (Wang Nan)

- Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)

- Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
the 'perf test' LLVM entries, when running it in-tree, to
.gitignore (Yunlong Song)

- libbpf error reporting improvements, using a strerror interface to
more precisely tell the user about problems with the provided
scriptlet, be it in C or as a ready made object file (Wang Nan)

- Do not be case sensitive when searching for matching 'perf test'
entries (Arnaldo Carvalho de Melo)

- Inform the user about objdump failures in 'perf annotate' (Andi
Kleen)

- Improve the LLVM 'perf test' entry, introduce a new ones for BPF
and kbuild tests to check the environment used by clang to compile
.c scriptlets (Wang Nan)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
tools include: Add compiler.h to list.h
perf probe: Verify parameters in two functions
perf session: Add missing newlines to some pr_err() calls
perf annotate: Support full source file paths for srcline fix
perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
perf: Fix inherited events vs. tracepoint filters
perf: Disable IRQs across RCU RS CS that acquires scheduler lock
perf test: Do not be case sensitive when searching for matching tests
perf test: Add 'perf test BPF'
perf test: Enhance the LLVM tests: add kbuild test
perf test: Enhance the LLVM test: update basic BPF test program
perf bpf: Improve BPF related error messages
perf tools: Make fetch_kernel_version() publicly available
bpf tools: Add new API bpf_object__get_kversion()
bpf tools: Improve libbpf error reporting
perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
perf annotate: Inform the user about objdump failures in --stdio
perf stat: Make stat options global
perf sched latency: Fix thread pid reuse issue
...

Linus Torvalds
2015-11-16 01:36:24 +0800