09 Oct, 2013

1 commit

  • Pull perf fixes from Ingo Molnar:
    "Various fixlets:

    On the kernel side:

    - fix a race
    - fix a bug in the handling of the perf ring-buffer data page

    On the tooling side:

    - fix the handling of certain corrupted perf.data files
    - fix a bug in 'perf probe'
    - fix a bug in 'perf record + perf sched'
    - fix a bug in 'make install'
    - fix a bug in libaudit feature-detection on certain distros"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf session: Fix infinite loop on invalid perf.data file
    perf tools: Fix installation of libexec components
    perf probe: Fix to find line information for probe list
    perf tools: Fix libaudit test
    perf stat: Set child_pid after perf_evlist__prepare_workload()
    perf tools: Add default handler for mmap2 events
    perf/x86: Clean up cap_user_time* setting
    perf: Fix perf_pmu_migrate_context

    Linus Torvalds
     

05 Oct, 2013

1 commit

  • Pull ACPI and power management fixes from Rafael Wysocki:

    - The resume part of user space driven hibernation (s2disk) is now
    broken after the change that moved the creation of memory bitmaps to
    after the freezing of tasks, because I forgot that the resume utility
    loaded the image before freezing tasks and needed the bitmaps for
    that. The fix adds special handling for that case.

    - One of recent commits changed the export of acpi_bus_get_device() to
    EXPORT_SYMBOL_GPL(), which was technically correct but broke existing
    binary modules using that function including one in particularly
    widespread use. Change it back to EXPORT_SYMBOL().

    - The intel_pstate driver sometimes fails to disable turbo if its
    no_turbo sysfs attribute is set. Fix from Srinivas Pandruvada.

    - One of recent cpufreq fixes forgot to update a check in cpufreq-cpu0
    which still (incorrectly) treats non-NULL as non-error. Fix from
    Philipp Zabel.

    - The SPEAr cpufreq driver uses a wrong variable type in one place
    preventing it from catching errors returned by one of the functions
    called by it. Fix from Sachin Kamat.

    * tag 'pm+acpi-3.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI: Use EXPORT_SYMBOL() for acpi_bus_get_device()
    intel_pstate: fix no_turbo
    cpufreq: cpufreq-cpu0: NULL is a valid regulator, part 2
    cpufreq: SPEAr: Fix incorrect variable type
    PM / hibernate: Fix user space driven resume regression

    Linus Torvalds
     

04 Oct, 2013

1 commit

  • While auditing the list_entry usage due to a trinity bug I found that
    perf_pmu_migrate_context violates the rules for
    perf_event::event_entry.

    The problem is that perf_event::event_entry is an RCU list element, and
    hence we must wait for a full RCU grace period before re-using the
    element after deletion.

    Therefore the usage in perf_pmu_migrate_context() which re-uses the
    entry immediately is broken. For now introduce another list_head into
    perf_event for this specific usage.

    This doesn't actually fix the trinity report because that never goes
    through this code.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Oct, 2013

4 commits

  • The commit facd8b80c67a3cf64a467c4a2ac5fb31f2e6745b
    ("irq: Sanitize invoke_softirq") converted irq exit
    calls of do_softirq() to __do_softirq() on all architectures,
    assuming it was only used there for its irq disablement
    properties.

    But as a side effect, the softirqs processed at the end
    of the hardirq are always called on the current stack
    that is used by irq_exit() instead of the softirq
    stack provided by the archs that override do_softirq().

    The result is mostly safe if the architecture runs irq_exit()
    on a separate irq stack because then softirqs are processed
    on that same stack that is near empty at this stage (assuming
    hardirq aren't nesting).

    Otherwise irq_exit() runs in the task stack and so does the softirq
    too. The interrupted call stack can be randomly deep already and
    the softirq can dig through it even further. To add insult to the
    injury, this softirq can be interrupted by a new hardirq, maximizing
    the chances for a stack overrun as reported in powerpc for example:

    do_IRQ: stack overflow: 1920
    CPU: 0 PID: 1602 Comm: qemu-system-ppc Not tainted 3.10.4-300.1.fc19.ppc64p7 #1
    Call Trace:
    [c0000000050a8740] .show_stack+0x130/0x200 (unreliable)
    [c0000000050a8810] .dump_stack+0x28/0x3c
    [c0000000050a8880] .do_IRQ+0x2b8/0x2c0
    [c0000000050a8930] hardware_interrupt_common+0x154/0x180
    --- Exception: 501 at .cp_start_xmit+0x3a4/0x820 [8139cp]
    LR = .cp_start_xmit+0x390/0x820 [8139cp]
    [c0000000050a8d40] .dev_hard_start_xmit+0x394/0x640
    [c0000000050a8e00] .sch_direct_xmit+0x110/0x260
    [c0000000050a8ea0] .dev_queue_xmit+0x260/0x630
    [c0000000050a8f40] .br_dev_queue_push_xmit+0xc4/0x130 [bridge]
    [c0000000050a8fc0] .br_dev_xmit+0x198/0x270 [bridge]
    [c0000000050a9070] .dev_hard_start_xmit+0x394/0x640
    [c0000000050a9130] .dev_queue_xmit+0x428/0x630
    [c0000000050a91d0] .ip_finish_output+0x2a4/0x550
    [c0000000050a9290] .ip_local_out+0x50/0x70
    [c0000000050a9310] .ip_queue_xmit+0x148/0x420
    [c0000000050a93b0] .tcp_transmit_skb+0x4e4/0xaf0
    [c0000000050a94a0] .__tcp_ack_snd_check+0x7c/0xf0
    [c0000000050a9520] .tcp_rcv_established+0x1e8/0x930
    [c0000000050a95f0] .tcp_v4_do_rcv+0x21c/0x570
    [c0000000050a96c0] .tcp_v4_rcv+0x734/0x930
    [c0000000050a97a0] .ip_local_deliver_finish+0x184/0x360
    [c0000000050a9840] .ip_rcv_finish+0x148/0x400
    [c0000000050a98d0] .__netif_receive_skb_core+0x4f8/0xb00
    [c0000000050a99d0] .netif_receive_skb+0x44/0x110
    [c0000000050a9a70] .br_handle_frame_finish+0x2bc/0x3f0 [bridge]
    [c0000000050a9b20] .br_nf_pre_routing_finish+0x2ac/0x420 [bridge]
    [c0000000050a9bd0] .br_nf_pre_routing+0x4dc/0x7d0 [bridge]
    [c0000000050a9c70] .nf_iterate+0x114/0x130
    [c0000000050a9d30] .nf_hook_slow+0xb4/0x1e0
    [c0000000050a9e00] .br_handle_frame+0x290/0x330 [bridge]
    [c0000000050a9ea0] .__netif_receive_skb_core+0x34c/0xb00
    [c0000000050a9fa0] .netif_receive_skb+0x44/0x110
    [c0000000050aa040] .napi_gro_receive+0xe8/0x120
    [c0000000050aa0c0] .cp_rx_poll+0x31c/0x590 [8139cp]
    [c0000000050aa1d0] .net_rx_action+0x1dc/0x310
    [c0000000050aa2b0] .__do_softirq+0x158/0x330
    [c0000000050aa3b0] .irq_exit+0xc8/0x110
    [c0000000050aa430] .do_IRQ+0xdc/0x2c0
    [c0000000050aa4e0] hardware_interrupt_common+0x154/0x180
    --- Exception: 501 at .bad_range+0x1c/0x110
    LR = .get_page_from_freelist+0x908/0xbb0
    [c0000000050aa7d0] .list_del+0x18/0x50 (unreliable)
    [c0000000050aa850] .get_page_from_freelist+0x908/0xbb0
    [c0000000050aa9e0] .__alloc_pages_nodemask+0x21c/0xae0
    [c0000000050aaba0] .alloc_pages_vma+0xd0/0x210
    [c0000000050aac60] .handle_pte_fault+0x814/0xb70
    [c0000000050aad50] .__get_user_pages+0x1a4/0x640
    [c0000000050aae60] .get_user_pages_fast+0xec/0x160
    [c0000000050aaf10] .__gfn_to_pfn_memslot+0x3b0/0x430 [kvm]
    [c0000000050aafd0] .kvmppc_gfn_to_pfn+0x64/0x130 [kvm]
    [c0000000050ab070] .kvmppc_mmu_map_page+0x94/0x530 [kvm]
    [c0000000050ab190] .kvmppc_handle_pagefault+0x174/0x610 [kvm]
    [c0000000050ab270] .kvmppc_handle_exit_pr+0x464/0x9b0 [kvm]
    [c0000000050ab320] kvm_start_lightweight+0x1ec/0x1fc [kvm]
    [c0000000050ab4f0] .kvmppc_vcpu_run_pr+0x168/0x3b0 [kvm]
    [c0000000050ab9c0] .kvmppc_vcpu_run+0xc8/0xf0 [kvm]
    [c0000000050aba50] .kvm_arch_vcpu_ioctl_run+0x5c/0x1a0 [kvm]
    [c0000000050abae0] .kvm_vcpu_ioctl+0x478/0x730 [kvm]
    [c0000000050abc90] .do_vfs_ioctl+0x4ec/0x7c0
    [c0000000050abd80] .SyS_ioctl+0xd4/0xf0
    [c0000000050abe30] syscall_exit+0x0/0x98

    Since this is a regression, this patch proposes a minimalistic
    and low-risk solution: blindly force the hardirq exit processing of
    softirqs onto the softirq stack. This way we should significantly
    reduce the opportunities for task stack overflow dug by softirqs.

    Longer term solutions may involve extending the hardirq stack coverage to
    irq_exit(), etc...

    Reported-by: Benjamin Herrenschmidt
    Acked-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: #3.9..
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: James Hogan
    Cc: James E.J. Bottomley
    Cc: Helge Deller
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David S. Miller
    Cc: Andrew Morton

    Frederic Weisbecker
     
  • "case 0" in free_pid() assumes that disable_pid_allocation() should
    clear PIDNS_HASH_ADDING before the last pid goes away.

    However this doesn't happen if the first fork() fails to create the
    child reaper which should call disable_pid_allocation().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer
    dereference happens upon core dump because argv_split("") returns
    argv[0] == NULL.

    This bug was once fixed by commit 264b83c07a84 ("usermodehelper: check
    subprocess_info->path != NULL") but was by error reintroduced by commit
    7f57cfa4e2aa ("usermodehelper: kill the sub_info->path[0] check").

    This bug seems to exist since 2.6.19 (the version which core dump to
    pipe was added). Depending on kernel version and config, some side
    effect might happen immediately after this oops (e.g. kernel panic with
    2.6.32-358.18.1.el6).

    Signed-off-by: Tetsuo Handa
    Acked-by: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Recent commit 8fd37a4 (PM / hibernate: Create memory bitmaps after
    freezing user space) broke the resume part of the user space driven
    hibernation (s2disk), because I forgot that the resume utility
    loaded the image into memory without freezing user space (it still
    freezes tasks after loading the image). This means that during user
    space driven resume we need to create the memory bitmaps at the
    "device open" time rather than at the "freeze tasks" time, so make
    that happen (that's a special case anyway, so it needs to be treated
    in a special way).

    Reported-and-tested-by: Ronald
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

29 Sep, 2013

2 commits

  • …nt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull scheduler, timer and x86 fixes from Ingo Molnar:
    - A context tracking ARM build and functional fix
    - A handful of ARM clocksource/clockevent driver fixes
    - An AMD microcode patch level sysfs reporting fixlet

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arm: Fix build error with context tracking calls

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
    clocksource: of: Respect device tree node status
    clocksource: exynos_mct: Set IRQ affinity when the CPU goes online
    arm: clocksource: mvebu: Use the main timer as clock source from DT

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/microcode/AMD: Fix patch level reporting for family 15h

    Linus Torvalds
     
  • Commit 6072ddc8520b ("kernel: replace strict_strto*() with kstrto*()")
    broke the handling of signed integer types, fix it.

    Signed-off-by: Jean Delvare
    Reported-by: Christian Kujau
    Tested-by: Christian Kujau
    Cc: Jingoo Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     

27 Sep, 2013

1 commit

    ad65782fba50 (context_tracking: Optimize main APIs off case
    with static key) converted the context tracking main APIs to inline
    functions and left the ARM asm callers behind.

    This can easily be fixed by making ARM call the post-static-key
    context tracking functions. We just need to replicate the
    static key checks there. We'll remove these later, once ARM
    supports the context tracking static keys.

    Reported-by: Guenter Roeck
    Reported-by: Russell King
    Signed-off-by: Frederic Weisbecker
    Tested-by: Kevin Hilman
    Cc: Nicolas Pitre
    Cc: Anil Kumar
    Cc: Tony Lindgren
    Cc: Benoit Cousson
    Cc: Guenter Roeck
    Cc: Russell King
    Cc: Kevin Hilman

    Frederic Weisbecker
     

26 Sep, 2013

2 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Three small fixes"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/balancing: Fix cfs_rq->task_h_load calculation
    sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance()
    sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Assorted standalone fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel: Add model number for Avoton Silvermont
    perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'
    perf/x86/intel/uncore: Don't use smp_processor_id() in validate_group()
    perf: Update ABI comment
    tools lib lk: Uninclude linux/magic.h in debugfs.c
    perf tools: Fix old GCC build error in trace-event-parse.c:parse_proc_kallsyms()
    perf probe: Fix finder to find lines of given function
    perf session: Check for SIGINT in more loops
    perf tools: Fix compile with libelf without get_phdrnum
    perf tools: Fix buildid cache handling of kallsyms with kcore
    perf annotate: Fix objdump line parsing offset validation
    perf tools: Fill in new definitions for madvise()/mmap() flags
    perf tools: Sharpen the libaudit dependencies test

    Linus Torvalds
     

25 Sep, 2013

4 commits

    Commit 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic
    kernel") did some cleanup of the reboot= command line handling, but it
    made reboot_default inoperative.

    The default value of the variable reboot_default should be 1, and if the
    reboot= command line option is not set, the system will use the default
    reboot mode.

    [akpm@linux-foundation.org: fix comment layout]
    Signed-off-by: Li Fei
    Signed-off-by: liu chuansheng
    Acked-by: Robin Holt
    Cc: [3.11.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chuansheng Liu
     
    After commit 829199197a43 ("kernel/audit.c: avoid negative sleep
    durations") audit emitters will block forever if the userspace daemon
    cannot handle the backlog.

    After the timeout the waiting loop turns into a busy loop and runs until
    the daemon dies or returns to work. This is a minimal patch for that
    bug.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Luiz Capitulino
    Cc: Richard Guy Briggs
    Cc: Eric Paris
    Cc: Chuck Anderson
    Cc: Dan Duval
    Cc: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    watchdog_thresh controls how often the NMI perf event counter checks the
    per-cpu hrtimer_interrupts counter and blows up if the counter hasn't
    changed since the last check. The counter is updated by the per-cpu
    watchdog_hrtimer hrtimer, which is scheduled with a period of 2/5 of
    watchdog_thresh, guaranteeing that the hrtimer fires at least twice per
    main period. Both the hrtimer and the perf event are started together
    when the watchdog is enabled.

    So far so good. But...

    But what happens when watchdog_thresh is updated from sysctl handler?

    proc_dowatchdog will set a new sampling period and hrtimer callback
    (watchdog_timer_fn) will use the new value in the next round. The
    problem, however, is that nobody tells the perf event that the sampling
    period has changed so it is ticking with the period configured when it
    has been set up.

    This might result in an ear ripping dissonance between perf and hrtimer
    parts if the watchdog_thresh is increased. And even worse it might lead
    to KABOOM if the watchdog is configured to panic on such a spurious
    lockup.

    This patch fixes the issue by updating both the NMI perf event counter
    and the hrtimers if the threshold value has changed.

    The nmi one is disabled and then reinitialized from scratch. This has
    an unpleasant side effect that the allocation of the new event might
    fail theoretically so the hard lockup detector would be disabled for
    such cpus. On the other hand such a memory allocation failure is very
    unlikely because the original event is deallocated right before.

    It would be much nicer if we just changed perf event period but there
    doesn't seem to be any API to do that right now. It is also unfortunate
    that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we
    cannot use on_each_cpu() and do the same thing from the per-cpu context.
    The update from the current CPU should be safe because
    perf_event_disable removes the event atomically before it clears the
    per-cpu watchdog_ev so it cannot change anything under running handler
    feet.

    The hrtimer is simply restarted (thanks to Don Zickus who pointed
    this out) if it is queued, because we cannot rely on it firing and
    adapting to the new sampling period before a new NMI event triggers
    (when the threshold is decreased).

    [akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
    Signed-off-by: Michal Hocko
    Acked-by: Don Zickus
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Fabio Estevam
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    proc_dowatchdog doesn't synchronize multiple callers, which might lead
    to confusion when two parallel callers race in watchdog_enable_all_cpus
    resp. watchdog_disable_all_cpus (e.g. the watchdog gets enabled even
    though watchdog_thresh was already set to 0).

    This patch adds a local mutex which synchronizes callers to the sysctl
    handler.

    Signed-off-by: Michal Hocko
    Cc: Frederic Weisbecker
    Acked-by: Don Zickus
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 Sep, 2013

4 commits

    Patch a003a2 (sched: Consider runnable load average in move_tasks())
    sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
    always 0. This typo leads to all tasks having weight 0 when load
    balancing in a cpu-cgroup enabled setup. There should obviously be the
    sum of the weights of all runnable tasks there instead. Fix it.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1379173186-11944-1-git-send-email-vdavydov@parallels.com
    Signed-off-by: Ingo Molnar

    Vladimir Davydov
     
    In the busiest->group_imb case we can come to fix_small_imbalance() with
    local->avg_load > busiest->avg_load. This can result in a wrong
    imbalance fix-up, because of the following check, in which all the
    members are unsigned:

    if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >=
                (scaled_busy_load_per_task * imbn)) {
            env->imbalance = busiest->load_per_task;
            return;
    }

    As a result we can end up constantly bouncing tasks from one cpu to
    another if there are pinned tasks.

    Fix it by substituting the subtraction with an equivalent addition in
    the check.

    [ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
    belonging to different cores on an HT-enabled machine with N logical
    cpus: just look at se.nr_migrations growth. ]

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/ef167822e5c5b2d96cf5b0e3e4f4bdff3f0414a2.1379252740.git.vdavydov@parallels.com
    Signed-off-by: Ingo Molnar

    Vladimir Davydov
     
    In the busiest->group_imb case we can come to calculate_imbalance() with
    local->avg_load >= busiest->avg_load >= sds->avg_load. This can result
    in an imbalance overflow, because it is calculated as follows:

    env->imbalance = min(
            max_pull * busiest->group_power,
            (sds->avg_load - local->avg_load) * local->group_power
    ) / SCHED_POWER_SCALE;

    As a result we can end up constantly bouncing tasks from one cpu to
    another if there are pinned tasks.

    Fix this by skipping the assignment and assuming imbalance=0 in case
    local->avg_load > sds->avg_load.

    [ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
    belonging to different cores on an HT-enabled machine with N logical
    cpus: just look at se.nr_migrations growth. ]

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/8f596cc6bc0e5e655119dc892c9bfcad26e971f4.1379252740.git.vdavydov@parallels.com
    Signed-off-by: Ingo Molnar

    Vladimir Davydov
     
    Solve the problems around the broken definition of the
    perf_event_mmap_page::cap_usr_time and cap_usr_rdpmc fields, which
    used to overlap, partially fixed by:

    860f085b74e9 ("perf: Fix broken union in 'struct perf_event_mmap_page'")

    The problem with that fix (merged in v3.12-rc1 and not yet officially
    released), noticed by Vince Weaver, is that the new behavior is not
    detectable by new user-space, and that due to the reuse of the field
    names it's easy to mis-compile a binary if old headers are used on a
    new kernel or new headers are used on an old kernel.

    To solve all that make this change explicit, detectable and self-contained,
    by iterating the ABI the following way:

    - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
    confuse old user-space binaries. RDPMC will be marked as unavailable
    to old binaries but that's within the ABI, this is a capability bit.

    - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
    libraries can reliably detect that bit 0 is deprecated and perma-zero
    without having to check the kernel version.

    - Use bits 2, 3, 4 for the newly defined, correct functionality:

    cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
    cap_user_time : 1, /* The time_* fields are used */
    cap_user_time_zero : 1, /* The time_zero field is used */

    - Rename all the bitfield names in perf_event.h to be different from the
    old names, to make sure it's not possible to mis-compile it
    accidentally with old assumptions.

    The 'size' field can then be used in the future to add new fields and it
    will act as a natural ABI version indicator as well.

    Also adjust tools/perf/ userspace for the new definitions, noticed by
    Adrian Hunter.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Also-Fixed-by: Adrian Hunter
    Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

16 Sep, 2013

1 commit

  • sched_info_depart seems to be only called from
    sched_info_switch(), so only on involuntary task switch.

    Fix the comment to match.

    Signed-off-by: Michael S. Tsirkin
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Link: http://lkml.kernel.org/r/20130916083036.GA1113@redhat.com
    Signed-off-by: Ingo Molnar

    Michael S. Tsirkin
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

13 Sep, 2013

12 commits

  • After the last architecture switched to generic hard irqs the config
    options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
    for !CONFIG_GENERIC_HARDIRQS can be removed.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton : (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
    The usage of strict_strto*() is not preferred because it is obsolete;
    kstrto*() should be used instead.

    Signed-off-by: Jingoo Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jingoo Han
     
  • This function dereferences res far too often, so optimize it.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
    Since PAGE_ALIGN rounds up to the next page boundary, the value might
    overflow after PAGE_ALIGN, for example when writing the MAX value to
    *.limit_in_bytes.

    $ cat /cgroup/memory/memory.limit_in_bytes
    18446744073709551615

    # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
    bash: echo: write error: Invalid argument

    Some user programs might depend on such behaviour (like libcg: we read
    the value in a snapshot, then use the value to reset the cgroup later),
    and that will cause confusion. So we need to fix it.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
    RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     
  • Pull ACPI and power management fixes from Rafael Wysocki:
    "All of these commits are fixes that have emerged recently and some of
    them fix bugs introduced during this merge window.

    Specifics:

    1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events

    After the recent ACPIPHP changes we've seen some interesting
    breakage on a system that triggers device check notifications
    during boot for non-existing devices. Although those
    notifications are really spurious, we should be able to deal with
    them nevertheless and that shouldn't introduce too much overhead.
    Four commits to make that work properly.

    2) Memory hotplug and hibernation mutual exclusion rework

    This was meant to be a cleanup, but it happens to fix a classical
    ABBA deadlock between system suspend/hibernation and ACPI memory
    hotplug which is possible if they are started roughly at the same
    time. Three commits rework memory hotplug so that it doesn't
    acquire pm_mutex and make hibernation use device_hotplug_lock
    which prevents it from racing with memory hotplug.

    3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix

    The ACPI LPSS driver crashes during boot on Apple Macbook Air with
    Haswell that has slightly unusual BIOS configuration in which one
    of the LPSS device's _CRS method doesn't return all of the
    information expected by the driver. Fix from Mika Westerberg, for
    stable.

    4) ACPICA fix related to Store->ArgX operation

    AML interpreter fix for obscure breakage that causes AML to be
    executed incorrectly on some machines (observed in practice).
    From Bob Moore.

    5) ACPI core fix for PCI ACPI device objects lookup

    There still are cases in which there is more than one ACPI device
    object matching a given PCI device and we don't choose the one
    that the BIOS expects us to choose, so this makes the lookup take
    more criteria into account in those cases.

    6) Fix to prevent cpuidle from crashing in some rare cases

    If the result of cpuidle_get_driver() is NULL, which can happen on
    some systems, cpuidle_driver_ref() will crash trying to use that
    pointer, and Daniel Fu's fix prevents that from happening.

    7) cpufreq fixes related to CPU hotplug

    Stephen Boyd reported a number of concurrency problems with
    cpufreq related to CPU hotplug which are addressed by a series of
    fixes from Srivatsa S Bhat and Viresh Kumar.

    8) cpufreq fix for time conversion in time_in_state attribute

    Time conversion carried out by cpufreq when user space attempts to
    read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
    won't work correctly if cputime_t doesn't map directly to jiffies.
    Fix from Andreas Schwab.

    9) Revert of a troublesome cpufreq commit

    Commit 7c30ed5 (cpufreq: make sure frequency transitions are
    serialized) was intended to address some known concurrency
    problems in cpufreq related to the ordering of transitions, but
    unfortunately it introduced several problems of its own, so I
    decided to revert it now and address the original problems later
    in a more robust way.

    10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.

    11) cpufreq fixes related to system suspend/resume

    The recent cpufreq changes that made it preserve CPU sysfs
    attributes over suspend/resume cycles introduced a possible NULL
    pointer dereference that caused it to crash during the second
    attempt to suspend. Three commits from Srivatsa S Bhat fix that
    problem and a couple of related issues.

    12) cpufreq locking fix

    cpufreq_policy_restore() should acquire the lock for reading, but
    it acquires it for writing. Fix from Lan Tianyu"
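
    The ABBA deadlock rework described in item 2 comes down to making
    both paths agree on a single lock order. A minimal user-space
    sketch of that principle (this is not the kernel code; the lock
    names are only stand-ins for device_hotplug_lock and the
    hibernation path's locking):

    ```c
    /* Illustrative only: an ABBA deadlock needs two paths taking the
     * same pair of locks in opposite orders. The fix imposes one
     * order -- memory hotplug no longer takes pm_mutex, and
     * hibernation takes device_hotplug_lock -- so both paths below
     * use the same A-then-B order and cannot deadlock. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* stand-in: device_hotplug_lock */
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* stand-in: a subsystem lock */

    /* Both threads acquire A then B; no ABBA cycle is possible. */
    static void *worker(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&lock_a);
            pthread_mutex_lock(&lock_b);
            /* critical section */
            pthread_mutex_unlock(&lock_b);
            pthread_mutex_unlock(&lock_a);
            return NULL;
    }

    int main(void)
    {
            pthread_t t1, t2;

            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            puts("no deadlock: both paths use the same lock order");
            return 0;
    }
    ```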

    * tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
    cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
    cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
    cpufreq: Restructure if/else block to avoid unintended behavior
    cpufreq: Fix crash in cpufreq-stats during suspend/resume
    intel_pstate: Add Haswell CPU models
    Revert "cpufreq: make sure frequency transitions are serialized"
    cpufreq: Use signed type for 'ret' variable, to store negative error values
    cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
    cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
    cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
    cpufreq: Split __cpufreq_remove_dev() into two parts
    cpufreq: Fix wrong time unit conversion
    cpufreq: serialize calls to __cpufreq_governor()
    cpufreq: don't allow governor limits to be changed when it is disabled
    ACPI / bind: Prefer device objects with _STA to those without it
    ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
    ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
    ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
    ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
    ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Various fixes.

    The -g perf report lockup you reported is only partially addressed;
    patches that fix the excessive runtime are still being worked on"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Fix uncore PCI fixed counter handling
    uprobes: Fix utask->depth accounting in handle_trampoline()
    perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING
    perf: Fix up MMAP2 buffer space reservation
    perf tools: Add attr->mmap2 support
    perf kvm: Fix sample_type manipulation
    perf evlist: Fix id pos in perf_evlist__open()
    perf trace: Handle perf.data files with no tracepoints
    perf session: Separate progress bar update when processing events
    perf trace: Check if MAP_32BIT is defined
    perf hists: Fix formatting of long symbol names
    perf evlist: Fix parsing with no sample_id_all bit set
    perf tools: Add test for parsing with no sample_id_all bit
    perf trace: Check control+C more often

    Linus Torvalds
     
  • Pull scheduler fix from Ingo Molnar:
    "Performance regression fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix load balancing performance regression in should_we_balance()

    Linus Torvalds
     
  • Emmanuel reported that /proc/sched_debug didn't report the right PIDs
    when using namespaces, cure this.

    Reported-by: Emmanuel Deloget
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130909110141.GM31370@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is a small race between copy_process() and cgroup_attach_task()
    where child->se.parent,cfs_rq points to invalid (old) ones.

    parent doing fork()            | someone moving the parent to another cgroup
    -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

    In the worst case, this bug can lead to a use-after-free and cause a panic,
    because it is the new cgroup's refcount that is incremented at (*1),
    so the old cgroup (and its related data) can be freed before (*2).

    In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
    [] place_entity+0x75/0xa0
    [] task_fork_fair+0xaa/0x160
    [] sched_fork+0x6b/0x140
    [] copy_process+0x5b2/0x1450
    [] ? wake_up_new_task+0xd9/0x130
    [] do_fork+0x94/0x460
    [] ? sys_wait4+0xae/0x100
    [] sys_clone+0x28/0x30
    [] stub_clone+0x13/0x20
    [] ? system_call_fastpath+0x16/0x1b
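
    The refcount hazard at (*1)/(*2) can be reduced to a tiny user-space
    sketch (not the kernel code; struct and function names here are
    invented for illustration): the child pins the new group, but its
    scheduler entity still points into the old group's data, which the
    migration path is free to release.

    ```c
    /* Illustrative sketch of the (*1)/(*2) window: the reference the
     * child holds protects the NEW group, while the pointer it still
     * uses refers to the OLD group, so the old data can be "freed"
     * underneath it. */
    #include <stdio.h>
    #include <stdlib.h>

    struct group { int refs; int freed; };

    static struct group *group_get(struct group *g) { g->refs++; return g; }

    static void group_put(struct group *g)
    {
            if (--g->refs == 0)
                    g->freed = 1;   /* stand-in for the real free */
    }

    int main(void)
    {
            struct group *old_grp = calloc(1, sizeof(*old_grp));
            struct group *new_grp = calloc(1, sizeof(*new_grp));

            old_grp->refs = 1;                        /* parent's original ref */

            struct group *child_ref = group_get(new_grp); /* (*1): child pins NEW group */
            group_put(old_grp);                       /* migration drops the OLD group */

            struct group *stale = old_grp;            /* (*2): child still uses old data */
            printf("stale data freed: %d (use-after-free window)\n", stale->freed);

            (void)child_ref;
            free(old_grp);
            free(new_grp);
            return 0;
    }
    ```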

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     

12 Sep, 2013

2 commits

  • Currently utask->depth is simply the number of allocated/pending
    return_instance's in uprobe_task->return_instances list.

    handle_trampoline() should decrement this counter every time we
    handle/free an instance, but due to a typo it does this only if
    ->chained == T. This means that in the likely case this counter
    is never decremented and the probed task can't report more than
    MAX_URETPROBE_DEPTH events.
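
    The effect of the typo can be simulated in a few lines (this is a
    user-space model, not the kernel loop): tying the decrement to
    ->chained means ordinary, non-chained return instances leak depth.

    ```c
    /* Minimal model of the bug: depth must drop once per handled
     * instance, but the buggy variant only drops it for chained ones,
     * so with normal (non-chained) returns depth never decreases. */
    #include <stdio.h>

    struct ret_instance { int chained; };

    static void drain_buggy(const struct ret_instance *ri, int n, int *depth)
    {
            for (int i = 0; i < n; i++)
                    if (ri[i].chained)      /* bug: decrement tied to ->chained */
                            (*depth)--;
    }

    static void drain_fixed(const struct ret_instance *ri, int n, int *depth)
    {
            for (int i = 0; i < n; i++)
                    (*depth)--;             /* fix: decrement for every instance */
    }

    int main(void)
    {
            struct ret_instance stack[3] = { {0}, {0}, {0} }; /* non-chained returns */
            int d_buggy = 3, d_fixed = 3;

            drain_buggy(stack, 3, &d_buggy);
            drain_fixed(stack, 3, &d_fixed);
            printf("buggy depth=%d fixed depth=%d\n", d_buggy, d_fixed);
            return 0;
    }
    ```

    The buggy drain leaves depth stuck at its starting value, which is
    exactly why the probed task eventually hits MAX_URETPROBE_DEPTH.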

    Reported-by: Mikhail Kulemin
    Reported-by: Hemant Kumar Shaw
    Signed-off-by: Oleg Nesterov
    Acked-by: Anton Arapov
    Cc: masami.hiramatsu.pt@hitachi.com
    Cc: srikar@linux.vnet.ibm.com
    Cc: systemtap@sourceware.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Gerlando Falauto reported that when HRTICK is enabled, it is
    possible to trigger system deadlocks. These were hard to
    reproduce, as HRTICK has been broken in the past, but seemed
    to be connected to the timekeeping_seq lock.

    Since seqlocks/seqcounts aren't supported by lockdep, I added
    some extra spinlock-based locking and triggered the following
    lockdep output:

    [ 15.849182] ntpd/4062 is trying to acquire lock:
    [ 15.849765] (&(&pool->lock)->rlock){..-...}, at: [] __queue_work+0x145/0x480
    [ 15.850051]
    [ 15.850051] but task is already holding lock:
    [ 15.850051] (timekeeper_lock){-.-.-.}, at: [] do_adjtimex+0x7f/0x100

    [ 15.850051] Chain exists of: &(&pool->lock)->rlock --> &p->pi_lock --> timekeeper_lock
    [ 15.850051] Possible unsafe locking scenario:
    [ 15.850051]
    [ 15.850051]        CPU0                    CPU1
    [ 15.850051]        ----                    ----
    [ 15.850051]   lock(timekeeper_lock);
    [ 15.850051]                                lock(&p->pi_lock);
    [ 15.850051]                                lock(timekeeper_lock);
    [ 15.850051]   lock(&(&pool->lock)->rlock);
    [ 15.850051]
    [ 15.850051] *** DEADLOCK ***

    The deadlock was introduced by 06c017fdd4dc48451a ("timekeeping:
    Hold timekeepering locks in do_adjtimex and hardpps") in 3.10.

    This patch avoids this deadlock, by moving the call to
    schedule_delayed_work() outside of the timekeeper lock
    critical section.
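
    The pattern behind the fix -- record the side effect under the
    lock, but run it only after dropping the lock -- can be sketched
    in user space (illustrative only; the names below are stand-ins
    for the timekeeper lock and schedule_delayed_work()):

    ```c
    /* Sketch of "don't call into other subsystems while holding your
     * lock": update the protected state and remember what needs doing
     * inside the critical section, then do it lock-free afterwards,
     * so no other subsystem's locks are taken under state_lock. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in: timekeeper lock */
    static int clock_was_set;

    /* stand-in for schedule_delayed_work(), which may take other locks */
    static void notify(void)
    {
            puts("deferred work scheduled outside the lock");
    }

    static void adjtimex_like(void)
    {
            int need_notify;

            pthread_mutex_lock(&state_lock);
            clock_was_set = 1;              /* update protected state */
            need_notify = clock_was_set;    /* remember the side effect... */
            pthread_mutex_unlock(&state_lock);

            if (need_notify)                /* ...and run it with no locks held */
                    notify();
    }

    int main(void)
    {
            adjtimex_like();
            return 0;
    }
    ```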

    Reported-by: Gerlando Falauto
    Tested-by: Lin Ming
    Signed-off-by: John Stultz
    Cc: Mathieu Desnoyers
    Cc: stable #3.11, 3.10
    Link: http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz