01 Nov, 2014

8 commits

  • Pull ACPI and power management fixes from Rafael Wysocki:
    "These are fixes received after my previous pull request plus one that
    has been in the works for quite a while, but its previous version
    caused problems to happen, so it's been deferred till now.

    Fixed are two recent regressions (MFD enumeration and cpufreq-dt), an
    ACPI EC regression introduced in 3.17, a system suspend error code path
    regression introduced in 3.15, an older bug related to recovery from
    failing resume from hibernation, and a cpufreq-dt driver issue related
    to operation performance points.

    Specifics:

    - Fix a crash on r8a7791/koelsch during resume from system suspend
    caused by a recent cpufreq-dt commit (Geert Uytterhoeven).

    - Fix an MFD enumeration problem introduced by a recent commit adding
    ACPI support to the MFD subsystem that exposed a weakness in the
    ACPI core causing ACPI enumeration to be applied to all devices
    associated with one ACPI companion object, although it should be
    used for one of them only (Mika Westerberg).

    - Fix an ACPI EC regression introduced during the 3.17 cycle causing
    some Samsung laptops to misbehave as a result of a workaround
    targeted at some Acer machines. That includes a revert of a commit
    that went too far and a quirk for the Acer machines in question.
    From Lv Zheng.

    - Fix a regression in the system suspend error code path introduced
    during the 3.15 cycle that causes it to fail to take errors from
    asynchronous execution of "late" suspend callbacks into account
    (Imre Deak).

    - Fix a long-standing bug in the hibernation resume error code path
    that fails to roll back everything correctly on "freeze" callback
    errors and leaves some devices in a "suspended" state causing more
    breakage to happen subsequently (Imre Deak).

    - Make the cpufreq-dt driver disable operation performance points
    that are not supported by the voltage regulator (VR) connected to
    the CPU voltage plane with acceptable tolerance, instead of
    constantly failing voltage scaling later on (Lucas Stach)"

    * tag 'pm+acpi-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI / EC: Fix regression due to conflicting firmware behavior between Samsung and Acer.
    Revert "ACPI / EC: Add support to disallow QR_EC to be issued before completing previous QR_EC"
    cpufreq: cpufreq-dt: Restore default cpumask_setall(policy->cpus)
    PM / Sleep: fix recovery during resuming from hibernation
    PM / Sleep: fix async suspend_late/freeze_late error handling
    ACPI: Use ACPI companion to match only the first physical device
    cpufreq: cpufreq-dt: disable unsupported OPPs

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "A bit has accumulated, but it's been a week or so since my last batch
    of post-merge-window fixes, so...

    1) Missing module license in netfilter reject module, from Pablo.
    Lots of people ran into this.

    2) Off by one in mac80211 baserate calculation, from Karl Beldan.

    3) Fix incorrect return value from ax88179_178a driver's set_mac_addr
    op, which broke use of it with bonding. From Ian Morgan.

    4) Checking of skb_gso_segment()'s return value was not all
    encompassing; it can return an SKB pointer, an error pointer, or
    NULL. Fix from Florian Westphal.

    This is crummy, and longer term will be fixed to just return error
    pointers or a real SKB.

    6) Encapsulation offloads not being handled by
    skb_gso_transport_seglen(). From Florian Westphal.

    7) Fix deadlock in TIPC stack, from Ying Xue.

    8) Fix performance regression from using rhashtable for netlink
    sockets. The problem was the synchronize_net() invoked for every
    socket destroy. From Thomas Graf.

    9) Fix bug in eBPF verifier, and remove the strong dependency of BPF
    on NET. From Alexei Starovoitov.

    10) In qdisc_create(), use the correct interface to allocate
    ->cpu_bstats, otherwise the u64_stats_sync member isn't
    initialized properly. From Sabrina Dubroca.

    11) Off by one in ip_set_nfnl_get_byindex(), from Dan Carpenter.

    12) nf_tables_newchain() was erroneously expecting error pointers from
    netdev_alloc_pcpu_stats(). It only returns a valid pointer or
    NULL. From Sabrina Dubroca.

    13) Fix use-after-free in _decode_session6(), from Li RongQing.

    14) When we set the TX flow hash on a socket, we mistakenly do so
    before we've nailed down the final source port. Move the setting
    deeper to fix this. From Sathya Perla.

    15) NAPI budget accounting in amd-xgbe driver was counting descriptors
    instead of full packets, fix from Thomas Lendacky.

    16) Fix total_data_buflen calculation in hyperv driver, from Haiyang
    Zhang.

    17) Fix bcma driver build with OF_ADDRESS disabled, from Hauke
    Mehrtens.

    18) Fix mis-use of per-cpu memory in TCP md5 code. The problem is
    that something that ends up being vmalloc memory can't be passed
    to the crypto hash routines via scatter-gather lists. From Eric
    Dumazet.

    19) Fix regression in promiscuous mode enabling in cdc-ether, from
    Olivier Blin.

    20) Bucket eviction and frag entry killing can race with each other,
    causing an unlink of the object from the wrong list. Fix from
    Nikolay Aleksandrov.

    21) Missing initialization of spinlock in cxgb4 driver, from Anish
    Bhatt.

    22) Do not cache ipv4 routing failures, otherwise if the sysctl for
    forwarding is subsequently enabled this won't be seen. From
    Nicolas Cavallari"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (131 commits)
    drivers: net: cpsw: Support ALLMULTI and fix IFF_PROMISC in switch mode
    drivers: net: cpsw: Fix broken loop condition in switch mode
    net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with lengh 0
    stmmac: pci: set default of the filter bins
    net: smc91x: Fix gpios for device tree based booting
    mpls: Allow mpls_gso to be built as module
    mpls: Fix mpls_gso handler.
    r8152: stop submitting intr for -EPROTO
    netfilter: nft_reject_bridge: restrict reject to prerouting and input
    netfilter: nft_reject_bridge: don't use IP stack to reject traffic
    netfilter: nf_reject_ipv6: split nf_send_reset6() in smaller functions
    netfilter: nf_reject_ipv4: split nf_send_reset() in smaller functions
    netfilter: nf_tables_bridge: update hook_mask to allow {pre,post}routing
    drivers/net: macvtap and tun depend on INET
    drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
    drivers/net: Disable UFO through virtio
    net: skb_fclone_busy() needs to detect orphaned skb
    gre: Use inner mac length when computing tunnel length
    mlx4: Avoid leaking steering rules on flow creation error flow
    net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Various scheduler fixes all over the place: three SCHED_DL fixes,
    three sched/numa fixes, two generic race fixes and a comment fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/dl: Fix preemption checks
    sched: Update comments for CLONE_NEWNS
    sched: stop the unbound recursion in preempt_schedule_context()
    sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
    sched/fair: Care divide error in update_task_scan_period()
    sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
    sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
    sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
    sched: Fix race between task_group and sched_task_group

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, plus on the kernel side:

    - a revert for a newly introduced PMU driver which isn't complete yet
    and where we ran out of time with fixes (to be tried again in
    v3.19) - this makes up for a large chunk of the diffstat.

    - compilation warning fixes

    - a printk message fix

    - event_idx usage fixes/cleanups"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf probe: Trivial typo fix for --demangle
    perf tools: Fix report -F dso_from for data without branch info
    perf tools: Fix report -F dso_to for data without branch info
    perf tools: Fix report -F symbol_from for data without branch info
    perf tools: Fix report -F symbol_to for data without branch info
    perf tools: Fix report -F mispredict for data without branch info
    perf tools: Fix report -F in_tx for data without branch info
    perf tools: Fix report -F abort for data without branch info
    perf tools: Make CPUINFO_PROC an array to support different kernel versions
    perf callchain: Use global caching provided by libunwind
    perf/x86/intel: Revert incomplete and undocumented Broadwell client support
    perf/x86: Fix compile warnings for intel_uncore
    perf: Fix typos in sample code in the perf_event.h header
    perf: Fix and clean up initialization of pmu::event_idx
    perf: Fix bogus kernel printk
    perf diff: Add missing hists__init() call at tool start

    Linus Torvalds
     
  • Pull futex fixes from Ingo Molnar:
    "This contains two futex fixes: one fixes a race condition, the other
    clarifies shared/private futex comments"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Fix a race condition between REQUEUE_PI and task death
    futex: Mention key referencing differences between shared and private futexes

    Linus Torvalds
     
  • Pull core fixes from Ingo Molnar:
    "The tree contains two RCU fixes and a compiler quirk comment fix"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Make rcu_barrier() understand about missing rcuo kthreads
    compiler/gcc4+: Remove inaccurate comment about 'asm goto' miscompiles
    rcu: More on deadlock between CPU hotplug and expedited grace periods

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "As you requested in the rc2 release mail the timer department serves
    you a few real bug fixes:

    - Fix the probe logic of the architected arm/arm64 timer
    - Plug a stack info leak in posix-timers
    - Prevent a shift out of bounds issue in the clockevents core"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ARM/ARM64: arch-timer: fix arch_timer_probed logic
    clockevents: Prevent shift out of bounds
    posix-timers: Fix stack info leak in timer_create()

    Linus Torvalds
     
  • …/git/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "ARM has system calls outside the NR_syscalls range, and the generic
    tracing system does not support that; without checks, it can cause
    an oops to be reported.

    Rabin Vincent added checks in the return code on syscall events to
    make sure that the system call number is within the range that tracing
    knows about, and if not, simply ignores the system call.

    The system call tracing infrastructure needs to be rewritten to handle
    these cases better, but for now, to keep from oopsing, this patch will
    do"

    * tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/syscalls: Ignore numbers outside NR_syscalls' range

    Linus Torvalds
     

31 Oct, 2014

1 commit

  • ARM has some private syscalls (for example, set_tls(2)) which lie
    outside the range of NR_syscalls. If any of these are called while
    syscall tracing is being performed, out-of-bounds array access will
    occur in the ftrace and perf sys_{enter,exit} handlers.

    # trace-cmd record -e raw_syscalls:* true && trace-cmd report
    ...
    true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
    true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
    true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
    true-653 [000] 384.675988: sys_exit: NR 983045 = 0
    ...

    # trace-cmd record -e syscalls:* true
    [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
    [ 17.289590] pgd = 9e71c000
    [ 17.289696] [aaaaaace] *pgd=00000000
    [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [ 17.290169] Modules linked in:
    [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
    [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
    [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
    [ 17.290866] LR is at syscall_trace_enter+0x124/0x184

    Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.

    Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    added the check for less than zero, but it should have also checked
    for greater than NR_syscalls.
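    The added range check is simple; as a standalone C sketch (the
    NR_syscalls value here is a hypothetical stand-in for the arch's real
    syscall table size):

```c
#include <stdbool.h>

#define NR_syscalls 400  /* hypothetical stand-in for the arch's table size */

/* A syscall number may be traced only if it indexes the syscall table;
 * ARM private syscalls such as 983045 fall outside and are now ignored. */
static bool syscall_nr_in_range(long nr)
{
    return nr >= 0 && nr < NR_syscalls;
}
```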

    Link: http://lkml.kernel.org/r/1414620418-29472-1-git-send-email-rabin@rab.in

    Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    Cc: stable@vger.kernel.org # 2.6.33+
    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
     

30 Oct, 2014

3 commits

  • …/paulmck/linux-rcu into core/urgent

    Pull two RCU fixes from Paul E. McKenney:

    " - Complete the work of commit dd56af42bd82 (rcu: Eliminate deadlock
    between CPU hotplug and expedited grace periods), which was
    intended to allow synchronize_sched_expedited() to be safely
    used when holding locks acquired by CPU-hotplug notifiers.
    This commit makes the put_online_cpus() avoid the deadlock
    instead of just handling the get_online_cpus().

    - Complete the work of commit 35ce7f29a44a (rcu: Create rcuo
    kthreads only for onlined CPUs), which was intended to allow
    RCU to avoid allocating unneeded kthreads on systems where the
    firmware says that there are more CPUs than are really present.
    This commit makes rcu_barrier() aware of the mismatch, so that
    it doesn't hang waiting for non-existent CPUs. "

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Found this in the message log on a s390 system:

    BUG kmalloc-192 (Not tainted): Poison overwritten
    Disabling lock debugging due to kernel taint
    INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b
    INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
    __slab_alloc.isra.47.constprop.56+0x5f6/0x658
    kmem_cache_alloc_trace+0x106/0x408
    call_usermodehelper_setup+0x70/0x128
    call_usermodehelper+0x62/0x90
    cgroup_release_agent+0x178/0x1c0
    process_one_work+0x36e/0x680
    worker_thread+0x2f0/0x4f8
    kthread+0x10a/0x120
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc
    INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
    __slab_free+0x94/0x560
    kfree+0x364/0x3e0
    call_usermodehelper_exec+0x110/0x1b8
    cgroup_release_agent+0x178/0x1c0
    process_one_work+0x36e/0x680
    worker_thread+0x2f0/0x4f8
    kthread+0x10a/0x120
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc

    There is a use-after-free bug on the subprocess_info structure allocated
    by the user mode helper. In case do_execve() returns with an error
    ____call_usermodehelper() stores the error code to sub_info->retval, but
    sub_info can already have been freed.

    Regarding UMH_NO_WAIT, the sub_info structure can be freed by
    __call_usermodehelper() before the worker thread returns from
    do_execve(), allowing memory corruption when do_execve() failed after
    exec_mmap() is called.

    Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
    call_usermodehelper_exec() to continue which then frees sub_info.

    To fix this race the code needs to make sure that the call to
    call_usermodehelper_freeinfo() is always done after the last store to
    sub_info->retval.
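    The ordering invariant can be illustrated with a standalone userspace
    sketch; the struct and helper names mirror the kernel's but are
    stand-ins, not the real implementation:

```c
#include <stdlib.h>

struct subprocess_info { int retval; };

static int retval_seen_at_free;  /* what was visible when the struct was freed */

/* Stand-in for call_usermodehelper_freeinfo(): after this returns,
 * sub_info must not be touched again. */
static void freeinfo(struct subprocess_info *sub_info)
{
    retval_seen_at_free = sub_info->retval;
    free(sub_info);
}

/* Fixed ordering: the last store to retval happens strictly before the
 * structure can be freed, closing the use-after-free window. */
static int report_exec_error(int err)
{
    struct subprocess_info *sub_info = calloc(1, sizeof(*sub_info));

    sub_info->retval = err;  /* store the exec error first... */
    freeinfo(sub_info);      /* ...then release the structure */
    return retval_seen_at_free;
}
```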

    Signed-off-by: Martin Schwidefsky
    Reviewed-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Following up the ARM testing of gcov, it turns out gcov on ARM64 works
    fine as well. The only change needed is adding ARM64 to the Kconfig
    depends.

    Tested with qemu and mach-virt

    Signed-off-by: Riku Voipio
    Acked-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     

29 Oct, 2014

2 commits

  • …it/rostedt/linux-trace

    Pull ftrace trampoline accounting fixes from Steven Rostedt:
    "Adding the new code for 3.19, I discovered a couple of minor bugs with
    the accounting of the ftrace_ops trampoline logic.

    One was that the old hash was not updated before calling the modify
    code for an ftrace_ops. The second bug was what let the first bug go
    unnoticed, as the update would check the current hash for all
    ftrace_ops (where it should only check the old hash for modified
    ones). This let things work when only one ftrace_ops was registered
    to a function, but could break if more than one was registered
    depending on the order of the look ups.

    The worst thing that can happen if this bug triggers is that the
    ftrace self checks would find an anomaly and shut ftrace down"

    * tag 'trace-fixes-v3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Fix checking of trampoline ftrace_ops in finding trampoline
    ftrace: Set ops->old_hash on modifying what an ops hooks to

    Linus Torvalds
     
  • Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
    avoids creating rcuo kthreads for CPUs that never come online. This
    fixes a bug in many instances of firmware: Instead of lying about their
    age, these systems instead lie about the number of CPUs that they have.
    Before commit 35ce7f29a44a, this could result in huge numbers of useless
    rcuo kthreads being created.

    It appears that experience indicates that I should have told the
    people suffering from this problem to fix their broken firmware, but
    I instead produced what turned out to be a partial fix. The missing
    piece supplied by this commit makes sure that rcu_barrier() knows not to
    post callbacks for no-CBs CPUs that have not yet come online, because
    otherwise rcu_barrier() will hang on systems having firmware that lies
    about the number of CPUs.

    It is tempting to simply have rcu_barrier() refuse to post a callback on
    any no-CBs CPU that does not have an rcuo kthread. This unfortunately
    does not work because rcu_barrier() is required to wait for all pending
    callbacks. It is therefore required to wait even for those callbacks
    that cannot possibly be invoked, even if doing so hangs the system.

    Given that posting a callback to a no-CBs CPU that does not yet have an
    rcuo kthread can hang rcu_barrier(), it is tempting to report an error
    in this case. Unfortunately, this will result in false positives at
    boot time, when it is perfectly legal to post callbacks to the boot CPU
    before the scheduler has started, in other words, before it is legal
    to invoke rcu_barrier().

    So this commit instead has rcu_barrier() avoid posting callbacks to
    CPUs having neither rcuo kthread nor pending callbacks, and has it
    complain bitterly if it finds CPUs having no rcuo kthread but some
    pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
    kthread but pending callbacks, as noted earlier, it has no choice but
    to hang indefinitely.
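    The per-CPU decision described above can be sketched as a standalone
    predicate (the real code inspects per-CPU rcu_data state; these names
    are illustrative):

```c
#include <stdbool.h>

/* Should rcu_barrier() post a barrier callback for this no-CBs CPU? */
static bool nocb_cpu_needs_barrier(bool has_rcuo_kthread, bool has_pending_cbs)
{
    if (has_rcuo_kthread)
        return true;    /* normal case: the kthread will invoke it */
    if (has_pending_cbs)
        return true;    /* broken state: complain bitterly, must still wait */
    return false;       /* CPU never came online: safe to skip */
}
```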

    Reported-by: Yanko Kaneti
    Reported-by: Jay Vosburgh
    Reported-by: Meelis Roos
    Reported-by: Eric B Munson
    Signed-off-by: Paul E. McKenney
    Tested-by: Eric B Munson
    Tested-by: Jay Vosburgh
    Tested-by: Yanko Kaneti
    Tested-by: Kevin Fenzi
    Tested-by: Meelis Roos

    Paul E. McKenney
     

28 Oct, 2014

11 commits

  • Andy reported that the current state of event_idx is rather confused.
    So remove all but the x86_pmu implementation and change the default to
    return 0 (the safe option).

    Reported-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Cody P Schafer
    Cc: Cody P Schafer
    Cc: Heiko Carstens
    Cc: Hendrik Brueckner
    Cc: Himangi Saraogi
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: sukadev@linux.vnet.ibm.com
    Cc: Thomas Huth
    Cc: Vince Weaver
    Cc: linux390@de.ibm.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • 1) The switched_to_dl() check is wrong. We reschedule only
    if rq->curr is a deadline task, and we do not reschedule
    if it's a lower-priority task. But we must always
    preempt a task of other classes.

    2) dl_task_timer():
    Policy does not change in case of priority inheritance.
    rt_mutex_setprio() changes prio, while policy remains old.

    So we lose some balancing logic in dl_task_timer() and
    switched_to_dl() when we check policy instead of priority. Boosted
    task may be rq->curr.

    (I didn't change switched_from_dl() because no check is necessary
    there at all).

    I've looked at this place (switched_to_dl()) several times and even
    fixed this function, but only found this now... I suppose some
    performance tests may work better after this.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • preempt_schedule_context() does preempt_enable_notrace() at the end
    and this can call the same function again; exception_exit() is heavy
    and it is quite possible that need-resched is true again.

    1. Change this code to dec preempt_count() and check need_resched()
    by hand.

    2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
    the enable/disable dance around __schedule(). But in this case
    we need to move it into sched/core.c.

    3. Cosmetic, but x86 forgets to declare this function. This doesn't
    really matter because it is only called by asm helpers; still, it
    makes sense to add the declaration into asm/preempt.h to match
    preempt_schedule().
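    The do/while structure that replaces the recursion can be sketched in
    standalone C (all names here are illustrative stand-ins, not the
    kernel's):

```c
#include <stdbool.h>

static int resched_pending;   /* pretend need-resched fires this many times */
static int schedule_calls;

static bool need_resched(void) { return resched_pending > 0; }

static void fake_schedule(void) { resched_pending--; schedule_calls++; }

/* Re-check need_resched() by hand in a loop, so a need-resched set while
 * scheduling does not re-enter this function recursively. */
static int preempt_schedule_loop(int pending)
{
    resched_pending = pending;
    schedule_calls = 0;
    do {
        /* PREEMPT_ACTIVE would be set around the real __schedule() here */
        fake_schedule();
    } while (need_resched());
    return schedule_calls;
}
```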

    Reported-by: Sasha Levin
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Graf
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Steven Rostedt
    Cc: Peter Anvin
    Cc: Andy Lutomirski
    Cc: Denys Vlasenko
    Cc: Chuck Ebbert
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

    This bash command reproduces problem:

    $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
    echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

    divide error: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
    RIP: 0010:[] [] task_scan_min+0x21/0x50
    RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
    RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
    RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
    R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
    R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
    FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
    Stack:
    ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
    ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
    ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
    Call Trace:
    [] task_scan_max+0x11/0x40
    [] task_numa_fault+0x1f7/0xae0
    [] ? migrate_misplaced_page+0x276/0x300
    [] handle_mm_fault+0x62d/0xba0
    [] __do_page_fault+0x191/0x510
    [] ? native_smp_send_reschedule+0x42/0x60
    [] ? check_preempt_curr+0x80/0xa0
    [] ? wake_up_new_task+0x11c/0x1a0
    [] ? do_fork+0x14d/0x340
    [] ? get_unused_fd_flags+0x2b/0x30
    [] ? __fd_install+0x1f/0x60
    [] do_page_fault+0xc/0x10
    [] page_fault+0x22/0x30
    RIP [] task_scan_min+0x21/0x50
    RSP
    ---[ end trace 9a826d16936c04de ]---

    Also fix a race in task_scan_min() (it depends on compiler behaviour).
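    The fix wires a minimum of 1 through the sysctl table, so a zero scan
    size can never reach the divide. A standalone sketch of the resulting
    behaviour (the helper name is illustrative):

```c
/* Reject zero at the sysctl boundary so task_scan_min() can never see a
 * zero numa_balancing_scan_size_mb (the kernel enforces this via the
 * sysctl table's minimum; this helper is an illustrative stand-in). */
static int set_scan_size_mb(int requested, int *scan_size_mb)
{
    if (requested < 1)
        return -1;              /* the kernel returns -EINVAL here */
    *scan_size_mb = requested;
    return 0;
}
```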

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Tomlin
    Cc: Andrew Morton
    Cc: Dario Faggioli
    Cc: David Rientjes
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • While offling node by hot removing memory, the following divide error
    occurs:

    divide error: 0000 [#1] SMP
    [...]
    Call Trace:
    [...] handle_mm_fault
    [...] ? try_to_wake_up
    [...] ? wake_up_state
    [...] __do_page_fault
    [...] ? do_futex
    [...] ? put_prev_entity
    [...] ? __switch_to
    [...] do_page_fault
    [...] page_fault
    [...]
    RIP [] task_numa_fault
    RSP

    The issue occurs as follows:
    1. When page fault occurs and page is allocated from node 1,
    task_struct->numa_faults_buffer_memory[] of node 1 is
    incremented and p->numa_faults_locality[] is also incremented
    as follows:

    o numa_faults_buffer_memory[]           o numa_faults_locality[]
             NR_NUMA_HINT_FAULT_TYPES
            |      0     |     1     |
     ----------------------------------      ----------------------
     node 0 |      0     |     0     |       remote |     0      |
     node 1 |      0     |     1     |       local  |     1      |
     ----------------------------------      ----------------------

    2. node 1 is offlined by hot removing memory.

    3. When page fault occurs, fault_types[] is calculated by using
    p->numa_faults_buffer_memory[] of all online nodes in
    task_numa_placement(). But node 1 was offline by step 2. So
    the fault_types[] is calculated by using only
    p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
    are set to 0.

    4. The values (0) of fault_types[] are passed to update_task_scan_period().

    5. numa_faults_locality[1] is set to 1. So the following division is
    calculated.

    static void update_task_scan_period(struct task_struct *p,
                                        unsigned long shared,
                                        unsigned long private)
    {
            ...
            ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
    }

    6. But both private and shared are set to 0, so the divide error
    occurs here.

    The divide error is a rare case because the trigger is node offline.
    This patch always increments the denominator to avoid the divide error.
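    The fixed expression can be checked standalone, assuming the kernel's
    NUMA_PERIOD_SLOTS of 10 (the helper name is illustrative):

```c
#define NUMA_PERIOD_SLOTS 10  /* as in kernel/sched/fair.c */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Fixed form: the denominator is always at least 1, so private == 0 and
 * shared == 0 no longer divides by zero (illustrative sketch of the fix). */
static unsigned long slot_ratio(unsigned long private, unsigned long shared)
{
    return DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, private + shared + 1);
}
```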

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    Yasuaki Ishimatsu
     
  • Unlocked access to dst_rq->curr in task_numa_compare() is racy.
    If curr task is exiting this may be a reason of use-after-free:

    task_numa_compare()                      do_exit()
        ...                                      current->flags |= PF_EXITING;
        ...                                      release_task()
        ...                                          ~~delayed_put_task_struct()~~
        ...                                      schedule()
        rcu_read_lock()                          ...
        cur = ACCESS_ONCE(dst_rq->curr)          ...
        ...                                      rq->curr = next;
        ...                                      context_switch()
        ...                                      finish_task_switch()
        ...                                      put_task_struct()
        ...                                          __put_task_struct()
        ...                                          free_task_struct()
        task_numa_assign()                       ...
            get_task_struct()                    ...

    As noted by Oleg:

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dl_task_timer() is racy against several paths. Daniel noticed that
    the replenishment timer may experience a race condition against an
    enqueue_dl_entity() called from rt_mutex_setprio(). With his own
    words:

    rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
    start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
    sched_switch() -> enqueue_task(), dl_task_timer -> enqueue_task()
    while throttled is 0

    => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
    enqueued on the -deadline runqueue.

    As we do for the other races, we just bail out in the replenishment
    timer code.
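    The bail-out can be sketched as a standalone decision helper (the enum
    and names are illustrative; the real code checks dl_task(p) inside the
    timer handler):

```c
#include <stdbool.h>

enum dl_timer_action { DL_TIMER_BAIL, DL_TIMER_NOTHING, DL_TIMER_ENQUEUE };

/* What the replenishment timer should do when it finally fires. */
static enum dl_timer_action dl_timer_decide(bool is_dl_task, bool on_dl_rq)
{
    if (!is_dl_task)
        return DL_TIMER_BAIL;     /* raced with deboost: bail, no BUG_ON() */
    if (on_dl_rq)
        return DL_TIMER_NOTHING;  /* already enqueued: nothing to do */
    return DL_TIMER_ENQUEUE;      /* safe to replenish and enqueue */
}
```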

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deboost path, right after the dl_boosted flag has been
    reset, we can currently end up replenishing using -deadline
    parameters of a !SCHED_DEADLINE entity. This of course causes
    a bug, as those parameters are empty.

    In the case depicted above it is safe to simply bail out, as
    the deboosted task is going to be back to its original scheduling
    class anyway.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • The race may happen when somebody is changing the task_group of a forking
    task. The child's cgroup is the same as the parent's after
    dup_task_struct() (there is just memory copying). Also, cfs_rq and rt_rq
    are the same as the parent's.

    But if the parent changes its task_group before cgroup_post_fork() is
    called, we do not reflect this situation on the child. The child's cfs_rq
    and rt_rq remain the same, while the child's task_group changes in
    cgroup_post_fork().

    To fix this we introduce a fork() method, which calls sched_move_task()
    directly. This function changes sched_task_group appropriately (also its
    logic has no problem with freshly created tasks, so we don't need to
    introduce anything special; we can just use it).

    Possibly, this resolves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • introduce two configs:
    - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
    depend on
    - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use

    that solves several problems:
    - tracing and others that wish to use eBPF don't need to depend on NET.
    They can use BPF_SYSCALL to allow loading from userspace or select BPF
    to use it directly from kernel in NET-less configs.
    - in 3.18 programs cannot be attached to events yet, so don't force it on
    - when the rest of eBPF infra is there in 3.19+, it's still useful to
    switch it off to minimize kernel size

    bloat-o-meter on x64 shows:
    add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)

    tested with many different config combinations. Hopefully didn't miss anything.
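    The two-option split described above can be sketched as a Kconfig
    fragment (the symbol names CONFIG_BPF and CONFIG_BPF_SYSCALL are from
    the text; the prompt, help text, and exact dependencies here are an
    assumption, not the actual kernel Kconfig contents):

    ```kconfig
    config BPF
            bool
            # Hidden symbol: selected by classic socket filters and by
            # anything else that needs the eBPF interpreter in the kernel.

    config BPF_SYSCALL
            bool "Enable bpf() system call"
            select BPF
            default n
            help
              Enable the bpf() system call so that eBPF programs can be
              loaded from userspace.
    ```

    A hidden symbol has no prompt and is only set via select, which is what
    lets NET-less configs pull in the interpreter without exposing the
    syscall.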

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
    If a device's dev_pm_ops::freeze callback fails during the QUIESCE
    phase, we don't roll back things correctly by calling the thaw and
    complete callbacks. This could leave some devices in a suspended state
    when an error occurs during resume from hibernation.

    Signed-off-by: Imre Deak
    Cc: All applicable
    Signed-off-by: Rafael J. Wysocki

    Imre Deak
     

26 Oct, 2014

2 commits

  • free_pi_state and exit_pi_state_list both clean up futex_pi_state's.
    exit_pi_state_list takes the hb lock first, and most callers of
    free_pi_state do too. requeue_pi doesn't, which means free_pi_state
    can free the pi_state out from under exit_pi_state_list. For example:

    task A                          |  task B
    --------------------------------+-----------------------------------
    exit_pi_state_list              |
      pi_state =                    |
        curr->pi_state_list->next   |
                                    |  futex_requeue(requeue_pi=1)
                                    |    // pi_state is the same as
                                    |    // the one in task A
                                    |    free_pi_state(pi_state)
                                    |      list_del_init(&pi_state->list)
                                    |      kfree(pi_state)
    list_del_init(&pi_state->list)  |

    Move the free_pi_state calls in requeue_pi to before it drops the hb
    locks which it's already holding.

    [ tglx: Removed a pointless free_pi_state() call and the hb->lock held
    debugging. The latter comes via a separate patch ]
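    The ordering fix can be modeled in userspace. In this sketch a pthread
    mutex stands in for the hb lock, and the simplified pi_state and list
    are stand-ins for the kernel structures, not the real ones:

    ```c
    /* Model of the fix: unlink and free the shared state *before*
     * dropping the lock that list walkers take, so a concurrent
     * exit_pi_state_list-style walk can never reach a freed node. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct pi_state {
        struct pi_state *next;
    };

    static pthread_mutex_t hb_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct pi_state *pi_state_list;

    /* Buggy ordering (before the patch) would unlock and then free,
     * leaving a window where a walker grabbing hb_lock still sees the
     * node.  Fixed ordering: free while hb_lock is still held. */
    static void requeue_pi_drop(struct pi_state *ps)
    {
        pthread_mutex_lock(&hb_lock);
        pi_state_list = ps->next;   /* unlink under the lock ...      */
        free(ps);                   /* ... and free before unlocking  */
        pthread_mutex_unlock(&hb_lock);
    }

    int main(void)
    {
        struct pi_state *ps = calloc(1, sizeof(*ps));

        pi_state_list = ps;
        requeue_pi_drop(ps);

        /* Any walker serialized on hb_lock now sees an empty list,
         * never a dangling pointer. */
        pthread_mutex_lock(&hb_lock);
        assert(pi_state_list == NULL);
        pthread_mutex_unlock(&hb_lock);
        printf("ok\n");
        return 0;
    }
    ```

    The same principle applies generally: the lifetime of lock-protected
    heap objects must end inside the critical section that other paths use
    to find them.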

    Signed-off-by: Brian Silverman
    Cc: austin.linux@gmail.com
    Cc: darren@dvhart.com
    Cc: peterz@infradead.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1414282837-23092-1-git-send-email-bsilver16384@gmail.com
    Signed-off-by: Thomas Gleixner

    Brian Silverman
     
  • Update our documentation as of fix 76835b0ebf8 (futex: Ensure
    get_futex_key_refs() always implies a barrier). Explicitly
    state that we don't do key referencing for private futexes.

    Signed-off-by: Davidlohr Bueso
    Cc: Matteo Franchin
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/1414121220.817.0.camel@linux-t7sj.site
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

25 Oct, 2014

4 commits

  • Andrey reported that on a kernel with UBSan enabled he found:

    UBSan: Undefined behaviour in ../kernel/time/clockevents.c:75:34

    I guess it should be 1ULL here instead of 1U:
    (!ismax || evt->mult << evt->shift)))

    That's indeed the correct solution because shift might be 32.
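    A minimal userspace illustration of why the operand must be widened
    (this is not the kernel code; the helper name is invented for the
    example):

    ```c
    /* Shifting a 32-bit value by 32 is undefined behaviour in C: the
     * shift count must be smaller than the width of the (promoted)
     * left operand.  Widening to 64 bits first, as the 1ULL fix does,
     * makes shift counts up to 63 well defined. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t widen_shift(uint32_t mult, unsigned int shift)
    {
        /* Safe even when shift == 32, because the left operand is u64. */
        return (uint64_t)mult << shift;
    }

    int main(void)
    {
        uint32_t mult = 5;

        /* Writing "mult << 32" directly on the u32 would be undefined
         * behaviour (shift count >= width of the operand's type). */
        assert(widen_shift(mult, 32) == 0x500000000ULL);
        printf("0x%llx\n", (unsigned long long)widen_shift(mult, 32));
        return 0;
    }
    ```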

    Reported-by: Andrey Ryabinin
    Cc: Peter Zijlstra
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
    If userland creates a timer without specifying sigevent info, we'll
    create one ourselves, using a stack-local variable. In particular,
    we'll use the timer ID as sival_int. But as sigev_value is a union
    containing a pointer and an int, that assignment will only partially
    initialize sigev_value on systems where a pointer is bigger than an
    int. On such systems we'll copy the uninitialized stack bytes from the
    timer_create() call to userland when the timer actually fires and
    we're going to deliver the signal.

    Initialize sigev_value with 0 to plug the stack info leak.

    Found in the PaX patch, written by the PaX Team.
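    The leak and the fix can be demonstrated in userspace with the POSIX
    struct sigevent (the helper function is invented for this sketch; the
    kernel's own stack variable is analogous):

    ```c
    /* sigev_value is a union of an int and a pointer, so assigning only
     * sival_int leaves the bytes beyond the int unwritten on 64-bit.
     * Zeroing the whole structure first, as the fix does, guarantees no
     * stack garbage can leak through the wider union member. */
    #include <assert.h>
    #include <signal.h>
    #include <stddef.h>
    #include <string.h>

    static struct sigevent make_default_sigevent(int timer_id)
    {
        struct sigevent ev;

        memset(&ev, 0, sizeof(ev));   /* the fix: zero everything first */
        ev.sigev_notify = SIGEV_SIGNAL;
        ev.sigev_signo = SIGALRM;
        ev.sigev_value.sival_int = timer_id;
        return ev;
    }

    int main(void)
    {
        struct sigevent ev = make_default_sigevent(42);
        const unsigned char *raw = (const unsigned char *)&ev.sigev_value;

        assert(ev.sigev_value.sival_int == 42);

        /* The int member lives at offset 0 of the union; with the
         * memset, every byte past it is provably zero, regardless of
         * endianness, so nothing leaks to userland. */
        for (size_t i = sizeof(int); i < sizeof(ev.sigev_value); i++)
            assert(raw[i] == 0);
        return 0;
    }
    ```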

    Fixes: 5a9fa7307285 ("posix-timers: kill ->it_sigev_signo and...")
    Signed-off-by: Mathias Krause
    Cc: Oleg Nesterov
    Cc: Brad Spengler
    Cc: PaX Team
    Cc: # v2.6.28+
    Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
    Signed-off-by: Thomas Gleixner

    Mathias Krause
     
  • When modifying code, ftrace has several checks to make sure things
    are being done correctly. One of them is to make sure any code it
    modifies is exactly what it expects it to be before it modifies it.
    In order to do so with the new trampoline logic, it must be able
    to find out what trampoline a function is hooked to in order to
    see if the code that hooks to it is what's expected.

    The logic to find the trampoline from a record (the accounting
    descriptor for a hooked function) needs to look only at the "old_hash"
    of an ops that is being modified. The old_hash is the list of functions
    an ops was hooked to before its update; a record would only be pointing
    to an ops that is being modified if it was already hooked before.

    Currently, the code can pick a modified ops based on the new functions
    it will be hooked to, which picks the wrong trampoline and causes the
    check to fail, disabling ftrace.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
    The code that checks for trampolines when modifying function hooks
    tests against a modified ops's "old_hash". But the ops old_hash pointer
    is not updated before the changes are made, making it possible to miss
    the right hash for the callback, which can break ftrace's accounting
    and cause it to disable itself.

    Have the ops set its old_hash before the modifying takes place.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

24 Oct, 2014

2 commits


23 Oct, 2014

2 commits

  • Commit dd56af42bd82 (rcu: Eliminate deadlock between CPU hotplug and
    expedited grace periods) was incomplete. Although it did eliminate
    deadlocks involving synchronize_sched_expedited()'s acquisition of
    cpu_hotplug.lock via get_online_cpus(), it did nothing about the similar
    deadlock involving acquisition of this same lock via put_online_cpus().
    This deadlock became apparent with testing involving hibernation.

    This commit therefore makes put_online_cpus() acquire this lock
    conditionally and increment a new cpu_hotplug.puts_pending field when
    the acquisition fails. cpu_hotplug_begin() then checks for this new
    field being non-zero and applies any pending decrements to
    cpu_hotplug.refcount.
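    The deferral scheme can be modeled in userspace. The names mirror the
    commit, but the types and functions below are simplified stand-ins for
    the kernel's, not its actual code:

    ```c
    /* If the hotplug lock cannot be taken without risking deadlock,
     * record the put in puts_pending and let the begin() path fold it
     * into the refcount later, under the lock. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static struct {
        pthread_mutex_t lock;
        int refcount;
        atomic_int puts_pending;
    } cpu_hotplug = { .lock = PTHREAD_MUTEX_INITIALIZER };

    static void get_online_cpus_model(void)
    {
        pthread_mutex_lock(&cpu_hotplug.lock);
        cpu_hotplug.refcount++;
        pthread_mutex_unlock(&cpu_hotplug.lock);
    }

    static void put_online_cpus_model(void)
    {
        /* Conditional acquisition: on failure, defer the decrement
         * instead of blocking on the lock. */
        if (pthread_mutex_trylock(&cpu_hotplug.lock) != 0) {
            atomic_fetch_add(&cpu_hotplug.puts_pending, 1);
            return;
        }
        cpu_hotplug.refcount--;
        pthread_mutex_unlock(&cpu_hotplug.lock);
    }

    static void cpu_hotplug_begin_model(void)
    {
        pthread_mutex_lock(&cpu_hotplug.lock);
        /* Apply deferred puts before judging the refcount. */
        cpu_hotplug.refcount -=
            atomic_exchange(&cpu_hotplug.puts_pending, 0);
        /* ... would now wait for refcount to drop to zero ... */
        pthread_mutex_unlock(&cpu_hotplug.lock);
    }

    int main(void)
    {
        get_online_cpus_model();

        /* Simulate a put racing with a holder of the lock: trylock
         * fails, so the decrement is recorded in puts_pending. */
        pthread_mutex_lock(&cpu_hotplug.lock);
        put_online_cpus_model();
        assert(atomic_load(&cpu_hotplug.puts_pending) == 1);
        pthread_mutex_unlock(&cpu_hotplug.lock);

        cpu_hotplug_begin_model();
        assert(cpu_hotplug.refcount == 0);
        assert(atomic_load(&cpu_hotplug.puts_pending) == 0);
        return 0;
    }
    ```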

    Reported-by: Jiri Kosina
    Signed-off-by: Paul E. McKenney
    Tested-by: Jiri Kosina
    Tested-by: Borislav Petkov

    Paul E. McKenney
     
  • Clean up the code in process.c after recent changes to get rid of
    unnecessary labels and goto statements.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

22 Oct, 2014

5 commits

    While comparing verifier states for equivalency, the comparison was
    missing a check for uninitialized registers. Add the missing check
    along with a testcase.
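    A toy model of the pruning check: a register that was initialized in
    the cached ("old") state must also be initialized in the current state,
    otherwise the states are not equivalent and the branch cannot be
    pruned. The enum and logic below are simplified stand-ins for the
    kernel verifier's, not its actual code:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    enum reg_type { NOT_INIT, UNKNOWN_VALUE };

    struct reg_state { enum reg_type type; };

    #define MAX_REGS 11

    static bool states_equal(const struct reg_state *old_regs,
                             const struct reg_state *cur_regs)
    {
        for (int i = 0; i < MAX_REGS; i++) {
            if (old_regs[i].type == cur_regs[i].type)
                continue;
            /* A register the old path never wrote may differ freely... */
            if (old_regs[i].type == NOT_INIT)
                continue;
            /* ...but "initialized before, uninitialized now" is the
             * unsafe case the missing check must reject. */
            return false;
        }
        return true;
    }

    int main(void)
    {
        struct reg_state old_regs[MAX_REGS] = {0};
        struct reg_state cur_regs[MAX_REGS] = {0};

        old_regs[1].type = UNKNOWN_VALUE;  /* r1 initialized in old state */
        cur_regs[1].type = NOT_INIT;       /* current path never wrote r1 */
        assert(!states_equal(old_regs, cur_regs));  /* must not prune */

        cur_regs[1].type = UNKNOWN_VALUE;
        assert(states_equal(old_regs, cur_regs));   /* now prunable */
        return 0;
    }
    ```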

    Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier")
    Cc: Hannes Frederic Sowa
    Signed-off-by: Alexei Starovoitov
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
    As per 0c740d0afc3b (introduce for_each_thread() to replace the buggy
    while_each_thread()), get rid of the do_each_thread { } while_each_thread()
    construct and replace it with the less error-prone for_each_thread().

    This patch doesn't introduce any user-visible change.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Michal Hocko
     
    The PM freezer relies on having all tasks frozen by the time devices
    start getting frozen, so that no task will touch them while they are
    being frozen. But the OOM killer is allowed to kill an already frozen
    task in order to handle an OOM situation. To protect against late
    wakeups, the OOM killer is disabled after all tasks are frozen. This,
    however, still leaves a window open where a killed task didn't manage
    to die by the time freeze_processes() finishes.

    Reduce the race window by checking all tasks after the OOM killer has
    been disabled. This is unfortunately still not completely race-free,
    because oom_killer_disable cannot stop an already ongoing OOM kill, so
    a task might still wake up from the fridge and get killed without
    freeze_processes() noticing. Full synchronization of the OOM killer
    and the freezer is, however, too heavyweight for this highly unlikely
    case.

    Introduce and check an oom_kills counter, which gets incremented early
    when the allocator enters the __alloc_pages_may_oom path, and only
    check all the tasks if the counter changes during the freezing attempt.
    The counter is updated this early to reduce the race window, since the
    allocator has already checked oom_killer_disabled, which is set by the
    PM-freezing code. A false positive will push the PM freezer into a slow
    path, but that is not a big deal.
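    A single-threaded model of the counter check: snapshot oom_kills before
    freezing, and re-scan all tasks only if it changed meanwhile. The names
    follow the commit, but the functions below are simplified stand-ins
    with the actual task scan stubbed out:

    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int oom_kills;
    static int rescans;

    /* Incremented early on the __alloc_pages_may_oom path. */
    static void oom_kill_event(void)
    {
        atomic_fetch_add(&oom_kills, 1);
    }

    /* Stub for the full re-check of all tasks. */
    static bool check_frozen_processes(void)
    {
        rescans++;
        return true;
    }

    static bool freeze_processes_model(bool oom_fires)
    {
        int saved = atomic_load(&oom_kills);

        /* ... freeze all tasks, then disable the OOM killer ... */
        if (oom_fires)
            oom_kill_event();   /* a kill racing with the freeze */

        /* Only pay for the full re-check when a kill may have raced. */
        if (atomic_load(&oom_kills) != saved)
            return check_frozen_processes();
        return true;
    }

    int main(void)
    {
        freeze_processes_model(false);
        assert(rescans == 0);   /* common case: no extra scan */

        freeze_processes_model(true);
        assert(rescans == 1);   /* raced kill: tasks re-checked */
        return 0;
    }
    ```

    A false positive (the counter moved but every task died in time) only
    costs one extra scan, matching the "slow path but not a big deal"
    trade-off above.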

    Changes since v1
    - push the re-check loop out of freeze_processes into
    check_frozen_processes and invert the condition to make the code more
    readable as per Rafael

    Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
    Cc: 3.2+ # 3.2+
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Michal Hocko
     
  • __thaw_task() no longer clears frozen flag since commit a3201227f803
    (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE).

    Reviewed-by: Michal Hocko
    Signed-off-by: Cong Wang
    Signed-off-by: Rafael J. Wysocki

    Cong Wang
     
  • Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen
    before deferring) OOM killer relies on being able to thaw a frozen task
    to handle OOM situation but a3201227f803 (freezer: make freezing() test
    freeze conditions in effect instead of TIF_FREEZE) has reorganized the
    code and stopped clearing freeze flag in __thaw_task. This means that
    the target task only wakes up and goes into the fridge again because the
    freezing condition hasn't changed for it. This reintroduces the bug
    fixed by f660daac474c6f.

    Fix the issue by checking for the TIF_MEMDIE thread flag in
    freezing_slow_path and excluding the task from freezing completely. If
    a task was already frozen, it will be woken by __thaw_task from the OOM
    killer and get out of the freezer after rechecking freezing().

    Changes since v1
    - put the TIF_MEMDIE check into freezing_slow_path rather than into
    __refrigerator, as per Oleg
    - restore the __thaw_task call in oom_scan_process_thread, because
    oom_kill_process will not wake a task in the fridge, as it is sleeping
    uninterruptibly

    [mhocko@suse.cz: rewrote the changelog]
    Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE)
    Cc: 3.3+ # 3.3+
    Signed-off-by: Cong Wang
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Signed-off-by: Rafael J. Wysocki

    Cong Wang