02 Feb, 2015

1 commit

  • Commit 8eb23b9f35aa ("sched: Debug nested sleeps") added code to report
    on nested sleep conditions, which we generally want to avoid because the
    inner sleeping operation can re-set the thread state to TASK_RUNNING,
    but that will then cause the outer sleep loop to not actually sleep
    when it calls schedule().

    However, that's actually valid traditional behavior, with the inner
    sleep being some fairly rare case (like taking a sleeping lock that
    normally doesn't actually need to sleep).
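
    For reference, a minimal sketch of the traditional pattern in
    question (the wait condition, the lock and the helper here are
    hypothetical stand-ins, not code from any particular driver):

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (event_pending())
                        break;
                /* may sleep on contention, resetting the task state
                   to TASK_RUNNING ... */
                mutex_lock(&dev_lock);
                prepare_to_handle_event();
                mutex_unlock(&dev_lock);
                /* ... so this may return without blocking; the loop
                   simply spins once more, which is harmless */
                schedule();
        }
        __set_current_state(TASK_RUNNING);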

    And the debug code would actually change the state of the task to
    TASK_RUNNING internally, which makes that kind of traditional and
    working code not work at all, because now the nested sleep doesn't just
    sometimes cause the outer one to not block, but will cause it to happen
    every time.

    In particular, it will cause the cardbus kernel daemon (pccardd) to
    basically busy-loop doing scheduling, converting a laptop into a heater,
    as reported by Bruno Prémont. But there may be other legacy uses of
    that nested sleep model in other drivers that are also likely to never
    get converted to the new model.

    This fixes both cases:

    - don't set TASK_RUNNING when the nested condition happens (note: even
    if WARN_ONCE() only _warns_ once, the return value isn't whether the
    warning happened, but whether the condition for the warning was true.
    So despite the warning only happening once, the "if (WARN_ON(..))"
    would trigger for every nested sleep.)

    - in the cases where we knowingly disable the warning by using
    "sched_annotate_sleep()", don't change the task state (that is used
    for all core scheduling decisions), instead use '->task_state_change'
    that is used for the debugging decision itself.

    (Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
    avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
    suggested change to use 'task_state_change' as part of the test)

    Reported-and-bisected-by: Bruno Prémont
    Tested-by: Rafael J Wysocki
    Acked-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ilya Dryomov
    Cc: Mike Galbraith
    Cc: Ingo Molnar
    Cc: Peter Hurley
    Cc: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Jan, 2015

1 commit

  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, but also an event groups fix, two PMU driver
    fixes and a CPU model variant addition"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Tighten (and fix) the grouping condition
    perf/x86/intel: Add model number for Airmont
    perf/rapl: Fix crash in rapl_scale()
    perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization
    perf probe: Fix probing kretprobes
    perf symbols: Introduce 'for' method to iterate over the symbols with a given name
    perf probe: Do not rely on map__load() filter to find symbols
    perf symbols: Introduce method to iterate symbols ordered by name
    perf symbols: Return the first entry with a given name in find_by_name method
    perf annotate: Fix memory leaks in LOCK handling
    perf annotate: Handle ins parsing failures
    perf scripting perl: Force to use stdbool
    perf evlist: Remove extraneous 'was' on error message

    Linus Torvalds
     

28 Jan, 2015

2 commits

  • The fix from 9fc81d87420d ("perf: Fix events installation during
    moving group") was incomplete in that it failed to recognise that
    creating a group with events for different CPUs is semantically
    broken -- they cannot be co-scheduled.

    Furthermore, it leads to real breakage where, when we create an event
    for CPU Y and then migrate it to form a group on CPU X, the code gets
    confused about where the counter is programmed -- triggered in practice
    as well by me via the perf fuzzer.

    Fix this by tightening the rules for creating groups. Only allow
    grouping of counters that can be co-scheduled in the same context.
    This means for the same task and/or the same cpu.
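
    As a rough illustration of the tightened rule, a sketch using the raw
    perf_event_open(2) syscall (attr setup abbreviated, error handling
    elided):

        struct perf_event_attr attr = {
                .type   = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .size   = sizeof(attr),
        };
        /* group leader counting on CPU 0, any task */
        int leader = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
        /* a sibling bound to CPU 1 under the CPU 0 leader: the two can
           never be co-scheduled, so this is now expected to fail */
        int fd = syscall(__NR_perf_event_open, &attr, -1, 1, leader, 0);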

    Fixes: 9fc81d87420d ("perf: Fix events installation during moving group")
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull networking fixes from David Miller:

    1) Don't OOPS on socket AIO, from Christoph Hellwig.

    2) Scheduled scans should be aborted upon RFKILL, from Emmanuel
    Grumbach.

    3) Fix sleep in atomic context in kvaser_usb, from Ahmed S Darwish.

    4) Fix RCU locking across copy_to_user() in bpf code, from Alexei
    Starovoitov.

    5) Lots of crash, memory leak, short TX packet et al bug fixes in
    sh_eth from Ben Hutchings.

    6) Fix memory corruption in SCTP wrt. INIT collisions, from Daniel
    Borkmann.

    7) Fix return value logic for poll handlers in netxen, enic, and bnx2x.
    From Eric Dumazet and Govindarajulu Varadarajan.

    8) Header length calculation fix in mac80211 from Fred Chou.

    9) mv643xx_eth doesn't handle highmem correctly in non-TSO code paths.
    From Ezequiel Garcia.

    10) udp_diag has bogus logic in its hash chain skipping; copy the same
    fix tcp_diag used. From Herbert Xu.

    11) amd-xgbe programs wrong rx flow control register, from Thomas
    Lendacky.

    12) Fix race leading to use after free in ping receive path, from Subash
    Abhinov Kasiviswanathan.

    13) Cache redirect routes otherwise we can get a heavy backlog of rcu
    jobs liberating DST_NOCACHE entries. From Hannes Frederic Sowa.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
    net: don't OOPS on socket aio
    stmmac: prevent probe drivers to crash kernel
    bnx2x: fix napi poll return value for repoll
    ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
    sh_eth: Fix DMA-API usage for RX buffers
    sh_eth: Check for DMA mapping errors on transmit
    sh_eth: Ensure DMA engines are stopped before freeing buffers
    sh_eth: Remove RX overflow log messages
    ping: Fix race in free in receive path
    udp_diag: Fix socket skipping within chain
    can: kvaser_usb: Fix state handling upon BUS_ERROR events
    can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
    can: kvaser_usb: Send correct context to URB completion
    can: kvaser_usb: Do not sleep in atomic context
    ipv4: try to cache dst_entries which would cause a redirect
    samples: bpf: relax test_maps check
    bpf: rcu lock must not be held when calling copy_to_user()
    net: sctp: fix slab corruption from use after free on INIT collisions
    net: mv643xx_eth: Fix highmem support in non-TSO egress path
    sh_eth: Fix serialisation of interrupt disable with interrupt & NAPI handlers
    ...

    Linus Torvalds
     

27 Jan, 2015

2 commits

  • BUG: sleeping function called from invalid context at mm/memory.c:3732
    in_atomic(): 0, irqs_disabled(): 0, pid: 671, name: test_maps
    1 lock held by test_maps/671:
    #0: (rcu_read_lock){......}, at: [] map_lookup_elem+0xe8/0x260
    Call Trace:
    ([] show_trace+0x12e/0x150)
    [] show_stack+0xa0/0x100
    [] dump_stack+0x74/0xc8
    [] ___might_sleep+0x23a/0x248
    [] might_fault+0x70/0xe8
    [] map_lookup_elem+0x188/0x260
    [] SyS_bpf+0x20e/0x840

    Fix it by allocating a temporary buffer to store the map element value.
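
    In sketch form (a simplified rendering of the fixed lookup path, with
    error handling trimmed):

        value = kmalloc(map->value_size, GFP_USER);
        rcu_read_lock();
        ptr = map->ops->map_lookup_elem(map, key);
        if (ptr)
                memcpy(value, ptr, map->value_size);
        rcu_read_unlock();
        /* copy_to_user() may fault and sleep -- only legal now that
           the RCU read-side critical section has ended */
        if (ptr && copy_to_user(uvalue, value, map->value_size))
                err = -EFAULT;
        kfree(value);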

    Fixes: db20fd2b0108 ("bpf: add lookup/update/delete/iterate methods to BPF maps")
    Reported-by: Michael Holzheu
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Pull cgroup fix from Tejun Heo:
    "The lifetime rules of cgroup hierarchies always have been somewhat
    counter-intuitive and cgroup core tried to enforce that hierarchies
    w/o userland-visible usages must die in a finite amount of time so that
    the controllers can be reused for other hierarchies; unfortunately,
    this can't be implemented reasonably for the memory controller - the
    kmemcg part doesn't have any way to forcefully drain the existing
    usages, leading to an interruptible hang if a following mount attempts
    to use the controller in any way.

    So, it seems like we're stuck with "hierarchies live on till they die
    whenever that may be" at least for now. This pretty much confines
    attaching controllers to hierarchies to before the hierarchies are
    actively used by making dynamic configurations post active usages
    unreliable. This has never been reliable and should be fine in
    practice given how cgroups are used.

    After the patch, a hierarchy isn't killed if it isn't already
    drained. A following mount attempt with the same mount options will
    reuse the existing hierarchy. Mount attempts with differing options
    will fail w/ -EBUSY"

    * 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: prevent mount hang due to memory controller lifetime

    Linus Torvalds
     

26 Jan, 2015

2 commits

  • Pull x86 fixes from Thomas Gleixner:
    "Hopefully the last round of fixes for 3.19

    - regression fix for the LDT changes
    - regression fix for XEN interrupt handling caused by the APIC
    changes
    - regression fixes for the PAT changes
    - last minute fixes for the new MPX support
    - regression fix for 32bit UP
    - fix for a long standing relocation issue on 64bit tagged for stable
    - functional fix for the Hyper-V clocksource tagged for stable
    - downgrade of a pr_err which tends to confuse users

    Looks a bit on the large side, but almost half of it is valuable
    comments"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tsc: Change Fast TSC calibration failed from error to info
    x86/apic: Re-enable PCI_MSI support for non-SMP X86_32
    x86, mm: Change cachemode exports to non-gpl
    x86, tls: Interpret an all-zero struct user_desc as "no segment"
    x86, tls, ldt: Stop checking lm in LDT_empty
    x86, mpx: Strictly enforce empty prctl() args
    x86, mpx: Fix potential performance issue on unmaps
    x86, mpx: Explicitly disable 32-bit MPX support on 64-bit kernels
    x86, hyperv: Mark the Hyper-V clocksource as being continuous
    x86: Don't rely on VMWare emulating PAT MSR correctly
    x86, irq: Properly tag virtualization entry in /proc/interrupts
    x86, boot: Skip relocs when load address unchanged
    x86/xen: Override ACPI IRQ management callback __acpi_unregister_gsi
    ACPI: pci: Do not clear pci_dev->irq in acpi_pci_irq_disable()
    x86/xen: Treat SCI interrupt as normal GSI interrupt

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "A set of small fixes:

    - regression fix for exynos_mct clocksource

    - trivial build fix for kona clocksource

    - functional one liner fix for the sh_tmu clocksource

    - two validation fixes to prevent (root only) data corruption in the
    kernel via settimeofday and adjtimex. Tagged for stable"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: adjtimex: Validate the ADJ_FREQUENCY values
    time: settimeofday: Validate the values of tv from user
    clocksource: sh_tmu: Set cpu_possible_mask to fix SMP broadcast
    clocksource: kona: fix __iomem annotation
    clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write

    Linus Torvalds
     

23 Jan, 2015

2 commits

  • Description from Michael Kerrisk. He suggested an identical patch
    to one I had already coded up and tested.

    commit fe3d197f8431 "x86, mpx: On-demand kernel allocation of bounds
    tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
    PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure
    that unused arguments are zero, as is done in many existing prctl()s
    and as should be done for all new prctl()s. This patch adds the
    required checks.

    Suggested-by: Andy Lutomirski
    Suggested-by: Michael Kerrisk
    Signed-off-by: Dave Hansen
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • Pull module and param fixes from Rusty Russell:
    "Surprising number of fixes this merge window :(

    The first two are minor fallout from the param rework which went in
    this merge window.

    The next three are a series which fixes a longstanding (but never
    previously reported and unlikely, so no CC stable) race between
    kallsyms and freeing the init section.

    Finally, a minor cleanup as our module refcount will now be -1 during
    unload"

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    module: make module_refcount() a signed integer.
    module: fix race in kallsyms resolution during module load success.
    module: remove mod arg from module_free, rename module_memfree().
    module_arch_freeing_init(): new hook for archs before module->module_init freed.
    param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC
    param: initialize store function to NULL if not available.

    Linus Torvalds
     

22 Jan, 2015

3 commits

  • Since b2052564e66d ("mm: memcontrol: continue cache reclaim from
    offlined groups"), re-mounting the memory controller after using it is
    very likely to hang.

    The cgroup core assumes that any remaining references after deleting a
    cgroup are temporary in nature, and synchronously waits for them, but
    the above-mentioned commit has left-over page cache pin its css until
    it is reclaimed naturally. That being said, swap entries and charged
    kernel memory have been doing the same indefinite pinning forever; the
    bug is just more likely to trigger with left-over page cache.

    Reparenting kernel memory is highly impractical, which leaves changing
    the cgroup assumptions to reflect this: once a controller has been
    mounted and used, it has internal state that is independent from mount
    and cgroup lifetime. It can be unmounted and remounted, but it can't
    be reconfigured during subsequent mounts.

    Don't offline the controller root as long as there are any children,
    dead or alive. A remount will no longer wait for these old references
    to drain, it will simply mount the persistent controller state again.

    Reported-by: "Suzuki K. Poulose"
    Reported-by: Will Deacon
    Signed-off-by: Johannes Weiner
    Signed-off-by: Tejun Heo

    Johannes Weiner
     
  • …ultz/linux into timers/urgent

    Pull urgent fixes from John Stultz:

    Two urgent fixes for user triggerable time related overflow issues

    Thomas Gleixner
     
    James Bottomley points out that the module refcount will be -1 during
    unload. It's
    only used for diagnostics, so let's not hide that as it could be a
    clue as to what's gone wrong.

    Cc: Jason Wessel
    Acked-and-documentation-added-by: James Bottomley
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Rusty Russell

    Rusty Russell
     

21 Jan, 2015

1 commit

  • Pull workqueue fix from Tejun Heo:
    "The xfs folks have been running into weird and very rare lockups for
    some time now. I didn't think this could have been from workqueue
    side because no one else was reporting it. This time, Eric had a
    kdump which we looked into and it turned out this actually was a
    workqueue bug and the bug has been there since the beginning of
    concurrency managed workqueue.

    A worker pool ensures forward progress of the workqueues associated
    with it by always having at least one worker reserved from executing
    work items. When the pool is under contention, this idle worker tries
    to create more workers for the pool and, if that doesn't succeed
    quickly enough, it calls the rescuers to the pool.

    This logic had a subtle race condition in an early exit path. When a
    worker invokes this manager function, the function may return %false
    indicating that the caller may proceed to executing work items either
    because another worker is already performing the role or conditions
    have changed and the pool is no longer under contention.

    The latter part depended on the assumption that whether more workers
    are necessary or not remains stable while the pool is locked; however,
    pool->nr_running (the concurrency count) may change asynchronously, and
    its getting bumped from zero could send off the last idle
    worker to execute work items.

    The race window is fairly narrow, and, even when it gets triggered,
    the pool deadlocks iff all work items get blocked on pending work
    items of the pool, which is highly unlikely but can be triggered by
    xfs.

    The patch removes the race window by removing the early exit path,
    which doesn't serve any purpose anymore anyway"

    * 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix subtle pool management issue which can stall whole worker_pool

    Linus Torvalds
     

20 Jan, 2015

4 commits

  • The kallsyms routines (module_symbol_name, lookup_module_* etc) disable
    preemption to walk the modules rather than taking the module_mutex:
    this is because they are used for symbol resolution during oopses.

    This works because there are synchronize_sched() and synchronize_rcu()
    in the unload and failure paths. However, there's one case which doesn't
    have that: the normal case where module loading succeeds, and we free
    the init section.

    We don't want a synchronize_rcu() there, because it would slow down
    module loading: this bug was introduced in 2009 by the change made to
    speed up module loading in the first place.

    Thus, we want to do the free in an RCU callback. We do this in the
    simplest possible way by allocating a new rcu_head: if we put it in
    the module structure we'd have to worry about that getting freed.
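
    A sketch of the shape this takes (the struct and function names here
    are assumptions for illustration, not necessarily the ones in the
    patch):

        struct mod_initfree {
                struct rcu_head rcu;
                void *module_init;
        };

        static void do_free_init(struct rcu_head *head)
        {
                struct mod_initfree *m =
                        container_of(head, struct mod_initfree, rcu);
                module_memfree(m->module_init);
                kfree(m);
        }

        /* in the load-success path, instead of freeing synchronously: */
        call_rcu_sched(&freeinit->rcu, do_free_init);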

    Reported-by: Rui Xiang
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Nothing needs the module pointer any more, and the next patch will
    call it from RCU, where the module itself might no longer exist.
    Removing the arg is the safest approach.

    This just codifies the use of the module_alloc/module_free pattern
    which ftrace and bpf use.
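
    Roughly, the hook's signature changes like so:

        /* was: void module_free(struct module *mod, void *module_region); */
        void module_memfree(void *module_region);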

    Signed-off-by: Rusty Russell
    Acked-by: Alexei Starovoitov
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: Ralf Baechle
    Cc: Ley Foon Tan
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Steven Rostedt
    Cc: x86@kernel.org
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: Masami Hiramatsu
    Cc: linux-cris-kernel@axis.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mips@linux-mips.org
    Cc: nios2-dev@lists.rocketboards.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Cc: netdev@vger.kernel.org

    Rusty Russell
     
  • Archs have been abusing module_free() to clean up their arch-specific
    allocations. Since module_free() is also (ab)used by BPF and trace code,
    let's keep it to simple allocations, and provide a hook called before
    that.

    This means that avr32, ia64, parisc and s390 no longer need to implement
    their own module_free() at all. avr32 doesn't need module_finalize()
    either.
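
    The new hook can default to a no-op that the listed archs override; a
    minimal sketch:

        /* nothing to do unless the arch overrides this */
        void __weak module_arch_freeing_init(struct module *mod)
        {
        }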

    Signed-off-by: Rusty Russell
    Cc: Chris Metcalf
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linux-s390@vger.kernel.org

    Rusty Russell
     
    ignore_lockdep is uninitialized, and sysfs_attr_init() doesn't initialize
    it, so memset the attribute to 0.

    Reported-by: Huang Ying
    Cc: Eric W. Biederman
    Signed-off-by: Rusty Russell

    Rusty Russell
     

17 Jan, 2015

3 commits

  • Avoid overflow possibility.

    [ The overflow is purely theoretical, since this is used for memory
    ranges that aren't even close to using the full 64 bits, but this is
    the right thing to do regardless. - Linus ]

    Signed-off-by: Louis Langholtz
    Cc: Yinghai Lu
    Cc: Peter Anvin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Louis Langholtz
     
  • A worker_pool's forward progress is guaranteed by the fact that the
    last idle worker assumes the manager role to create more workers and
    summon the rescuers if creating workers doesn't succeed in a timely
    manner before proceeding to execute work items.

    This manager role is implemented in manage_workers(), which indicates
    whether the worker may proceed to work item execution with its return
    value. This is necessary because multiple workers may contend for the
    manager role, and, if there already is a manager, others should
    proceed to work item execution.

    Unfortunately, the function also indicates that the worker may proceed
    to work item execution if need_to_create_worker() is false at the head
    of the function. need_to_create_worker() tests the following
    conditions:

    pending work items && !nr_running && !nr_idle

    The first and third conditions are protected by pool->lock and thus
    won't change while holding pool->lock; however, nr_running can change
    asynchronously as other workers block and resume. While it's likely
    to be zero, as someone woke this worker up in the first place, some
    other workers could have become runnable in between, making it non-zero.

    If this happens, manage_workers() could return false even with zero
    nr_idle, making the worker, the last idle one, proceed to execute work
    items. If then all workers of the pool end up blocking on a resource
    which can only be released by a work item which is pending on that
    pool, the whole pool can deadlock as there's no one to create more
    workers or summon the rescuers.

    This patch fixes the problem by removing the early exit condition from
    maybe_create_worker() and making manage_workers() return false iff
    there's already another manager, which ensures that the last worker
    doesn't start executing work items.

    We could leave the early exit condition alone and just ignore the return
    value, but the only reason it was put there is that
    manage_workers() used to perform both creation and destruction of
    workers, and thus the function could be invoked while the pool was trying
    to reduce the number of workers. Now that manage_workers() is called
    only when more workers are needed, the only cases where this early exit
    condition triggers are rare races, rendering it pointless.

    Tested with a simulated workload and modified workqueue code which
    trigger the pool deadlock reliably without this patch.
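
    After the change, the manager logic reduces to roughly the following
    (a simplified sketch of that era's code; the early
    need_to_create_worker() bail-out at the top is gone):

        static bool manage_workers(struct worker *worker)
        {
                struct worker_pool *pool = worker->pool;

                /* another worker already holds the manager role:
                   the caller may proceed to execute work items */
                if (!mutex_trylock(&pool->manager_arb))
                        return false;

                maybe_create_worker(pool); /* until workers suffice */

                mutex_unlock(&pool->manager_arb);
                return true;    /* managed: the caller re-checks */
        }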

    Signed-off-by: Tejun Heo
    Reported-by: Eric Sandeen
    Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
    Cc: Dave Chinner
    Cc: Lai Jiangshan
    Cc: stable@vger.kernel.org

    Tejun Heo
     
  • …it/rostedt/linux-trace

    Pull ftrace fixes from Steven Rostedt:
    "This holds a few fixes to the ftrace infrastructure as well as the
    mixture of function graph tracing and kprobes.

    When jprobes and function graph tracing are enabled at the same time it
    will crash the system:

    # modprobe jprobe_example
    # echo function_graph > /sys/kernel/debug/tracing/current_tracer

    After the first fork (jprobe_example probes it), the system will
    crash.

    This is due to the way jprobes copies the stack frame and does not do
    a normal function return. This messes up the function graph
    tracing accounting which hijacks the return address from the stack and
    replaces it with a hook function. It saves the return addresses in a
    separate stack to put back the correct return address when done. But
    because the jprobe functions do not do a normal return, their stack
    addresses are not put back until the function they probe is called,
    which means that the probed function will get the return address of
    the jprobe handler instead of its own.

    The simple fix here was to disable function graph tracing while the
    jprobe handler is being called.

    While debugging this I found two minor bugs with the function graph
    tracing.

    The first was about the function graph tracer sharing its function
    hash with the function tracer (they both get filtered by the same
    input). Changing set_ftrace_filter would not sync the function
    records after a change if the function tracer was
    disabled but the function graph tracer was enabled. This was due to
    the update only checking one of the ops instead of the shared ops to
    see if they were enabled and should perform the sync. This caused the
    ftrace accounting to break and a ftrace_bug() would be triggered,
    disabling ftrace until a reboot.

    The second was that the check to update records only checked one of
    the filter hashes. It needs to test both the "filter" and "notrace"
    hashes. The "filter" hash determines what functions to trace, whereas
    the "notrace" hash determines what functions not to trace (trace all
    but these). Both hashes need to be passed to the update code to find
    out what change is being done during the update. This also broke the
    ftrace record accounting and triggered a ftrace_bug().

    This patch set also includes two more fixes that were reported
    separately from the kprobe issue.

    One was that init_ftrace_syscalls() was called twice at boot up. This
    is not a major bug, but that call performed a rather large kmalloc
    (NR_syscalls * sizeof(*syscalls_metadata)). The second call made the
    first one a memory leak, and wastes memory.

    The other fix is a regression caused by an update in the v3.19 merge
    window. The move to enable events early moved the enabling before
    PID 1 was created. The syscall events require setting the
    TIF_SYSCALL_TRACEPOINT flag for all tasks. But for_each_process_thread()
    does not include the swapper task (PID 0), so the enabling ended up
    being a nop.

    A suggested fix was to have init_task's flag set, but I
    didn't really want to mess with PID 0 for this minor bug. Instead I
    disable and re-enable events again at early_initcall(), where they used
    to be enabled. This also handles any other event that might have its own
    reg function that could break at early boot up"

    * tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix enabling of syscall events on the command line
    tracing: Remove extra call to init_ftrace_syscalls()
    ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
    ftrace: Check both notrace and filter for old hash
    ftrace: Fix updating of filters for shared global_ops filters

    Linus Torvalds
     

15 Jan, 2015

4 commits

  • Commit 5f893b2639b2 "tracing: Move enabling tracepoints to just after
    rcu_init()" broke the enabling of system call events from the command
    line. The reason was that the enabling of command line trace events
    was moved before PID 1 started, and the syscall tracepoints require
    that all tasks have the TIF_SYSCALL_TRACEPOINT flag set. But the
    swapper task (pid 0) is not part of that. Since the swapper task is the
    only task running this early in boot, no task gets the
    flag set, and the tracepoint never gets reached.

    Instead of setting the swapper task flag (there should be no reason to
    do that), re-enable trace events again after the init thread (PID 1)
    has been started. This requires disabling all command line events and
    re-enabling them, as just enabling them again will not reset the logic
    to set the TIF_SYSCALL_TRACEPOINT flag: the syscall tracepoint will
    be fooled into thinking that it was already set, and won't try setting
    it again. For this reason, we must first disable and then re-enable
    the events.

    Link: http://lkml.kernel.org/r/1421188517-18312-1-git-send-email-mpe@ellerman.id.au
    Link: http://lkml.kernel.org/r/20150115040506.216066449@goodmis.org

    Reported-by: Michael Ellerman
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • trace_init() calls init_ftrace_syscalls() and then calls trace_event_init()
    which also calls init_ftrace_syscalls(). It makes more sense to only
    call it from trace_event_init().

    Calling it twice wastes memory, as it allocates the syscall events twice,
    and leaks the first copy.

    Link: http://lkml.kernel.org/r/54AF53BD.5070303@huawei.com
    Link: http://lkml.kernel.org/r/20150115040505.930398632@goodmis.org

    Reported-by: Wang Nan
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Using just the filter for checking for trampolines or regs is not enough
    when updating the code against the records that represent all functions.
    Both the filter hash and the notrace hash need to be checked.

    To trigger this bug (using trace-cmd and perf):

    # perf probe -a do_fork
    # trace-cmd start -B foo -e probe
    # trace-cmd record -p function_graph -n do_fork sleep 1

    The trace-cmd record at the end clears the filter before it disables
    function_graph tracing, which then causes the accounting of the
    ftrace function records to become incorrect and causes ftrace to bug.

    Link: http://lkml.kernel.org/r/20150114154329.358378039@goodmis.org

    Cc: stable@vger.kernel.org
    [ still need to switch old_hash_ops to old_ops_hash ]
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • As the set_ftrace_filter affects both the function tracer as well as the
    function graph tracer, the ops that represent each have a shared
    ftrace_ops_hash structure. This allows both to be updated when the filter
    files are updated.

    But if function graph is enabled and the global_ops (function tracing) ops
    is not, then it is possible that the filter could be changed without the
    update happening for the function graph ops. This will cause the changes
    to not take place and may even cause a ftrace_bug to occur as it could mess
    with the trampoline accounting.

    The solution is, when an ops uses the shared global_ops filter but is
    not itself enabled, to check whether there's another ops that is
    enabled and also shares the global_ops filter. In that case, the
    modification still needs to be executed.
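
    In sketch form, the added check looks something like this (a
    hypothetical shape: the macros, flag and global_ops are real kernel
    names, but the surrounding logic is illustrative only):

        /* ops itself is disabled, but does an enabled ops share the
           same global filter hash?  Then still run the update. */
        if (ops->func_hash == &global_ops.local_hash) {
                struct ftrace_ops *op;

                do_for_each_ftrace_op(op, ftrace_ops_list) {
                        if (op != ops &&
                            op->func_hash == &global_ops.local_hash &&
                            op->flags & FTRACE_OPS_FL_ENABLED)
                                return true;
                } while_for_each_ftrace_op(op);
        }
        return false;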

    Link: http://lkml.kernel.org/r/20150114154329.055980438@goodmis.org

    Cc: stable@vger.kernel.org # 3.17+
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

12 Jan, 2015

3 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: group scheduling corner case fix, two deadline scheduler
    fixes, effective_load() overflow fix, nested sleep fix, 6144 CPUs
    system fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()
    sched/deadline: Avoid double-accounting in case of missed deadlines
    sched/deadline: Fix migration of SCHED_DEADLINE tasks
    sched: Fix odd values in effective_load() calculations
    sched, fanotify: Deal with nested sleeps
    sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, but also some kernel side fixes: uncore PMU
    driver fix, user regs sampling fix and an instruction decoder fix that
    unbreaks PEBS precise sampling"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/uncore/hsw-ep: Handle systems with only two SBOXes
    perf/x86_64: Improve user regs sampling
    perf: Move task_pt_regs sampling into arch code
    x86: Fix off-by-one in instruction decoder
    perf hists browser: Fix segfault when showing callchain
    perf callchain: Free callchains when hist entries are deleted
    perf hists: Fix children sort key behavior
    perf diff: Fix to sort by baseline field by default
    perf list: Fix --raw-dump option
    perf probe: Fix crash in dwarf_getcfi_elf
    perf probe: Fix to fall back to find probe point in symbols
    perf callchain: Append callchains only when requested
    perf ui/tui: Print backtrace symbols when segfault occurs
    perf report: Show progress bar for output resorting

    Linus Torvalds
     
  • Pull locking fixes from Ingo Molnar:
    "A liblockdep fix and a mutex_unlock() mutex-debugging fix"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mutex: Always clear owner field upon mutex_unlock()
    tools/liblockdep: Fix debug_check thinko in mutex destroy

    Linus Torvalds
     

10 Jan, 2015

1 commit

  • Pull kgdb/kdb fixes from Jason Wessel:
    "These have been around since 3.17 and in kgdb-next for the last 9
    weeks and some will go back to -stable.

    Summary of changes:

    Cleanups
    - kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE

    Fixes
    - kgdb/kdb: Allow access on a single core, if a CPU round up is
    deemed impossible, which will allow inspection of the now "trashed"
    kernel
    - kdb: Add enable mask for the command groups
    - kdb: access controls to restrict sensitive commands"

    * tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
    kernel/debug/debug_core.c: Logging clean-up
    kgdb: timeout if secondary CPUs ignore the roundup
    kdb: Allow access to sensitive commands to be restricted by default
    kdb: Add enable mask for groups of commands
    kdb: Categorize kdb commands (similar to SysRq categorization)
    kdb: Remove KDB_REPEAT_NONE flag
    kdb: Use KDB_REPEAT_* values as flags
    kdb: Rename kdb_register_repeat() to kdb_register_flags()
    kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags
    kdb: Remove currently unused kdbtab_t->cmd_flags

    Linus Torvalds
     

09 Jan, 2015

7 commits

    Currently if DEBUG_MUTEXES is enabled, the mutex->owner field is
    cleared only if debug_locks is active. This exposes a race to other users
    of the field where mutex->owner may still be set to a stale value,
    potentially upsetting mutex_spin_on_owner() among others.
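
    In sketch form (simplified; the point is only that the clearing moves
    out of the debug_locks-guarded block):

        void debug_mutex_unlock(struct mutex *lock)
        {
                if (likely(debug_locks)) {
                        DEBUG_LOCKS_WARN_ON(lock->magic != lock);
                        /* ... further sanity checks ... */
                }
                /* now unconditional: clear the owner even after
                   debug_locks has been switched off */
                mutex_clear_owner(lock);
        }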

    References: https://bugs.freedesktop.org/show_bug.cgi?id=87955
    Signed-off-by: Chris Wilson
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Davidlohr Bueso
    Cc: Daniel Vetter
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1420540175-30204-1-git-send-email-chris@chris-wilson.co.uk
    Signed-off-by: Ingo Molnar

    Chris Wilson
     
  • When alloc_fair_sched_group() in sched_create_group() fails,
    free_sched_group() is called, and free_fair_sched_group() is called by
    free_sched_group(). Since destroy_cfs_bandwidth() is called by
    free_fair_sched_group() without init_cfs_bandwidth() ever having been
    called, an RCU stall occurs at hrtimer_cancel():

    INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies g=13074 c=13073 q=0)
    Task dump for CPU 1:
    (fprintd) R running task 0 6249 1 0x00000088
    ...
    Call Trace:
    [] sched_show_task+0xa8/0x110
    [] dump_cpu_task+0x3d/0x50
    [] rcu_dump_cpu_stacks+0x90/0xd0
    [] rcu_check_callbacks+0x491/0x700
    [] update_process_times+0x4b/0x80
    [] tick_sched_handle.isra.20+0x36/0x50
    [] tick_sched_timer+0x42/0x70
    [] __run_hrtimer+0x69/0x1a0
    [] ? tick_sched_handle.isra.20+0x50/0x50
    [] hrtimer_interrupt+0xef/0x230
    [] local_apic_timer_interrupt+0x3b/0x70
    [] smp_apic_timer_interrupt+0x45/0x60
    [] apic_timer_interrupt+0x6d/0x80
    [] ? lock_hrtimer_base.isra.23+0x18/0x50
    [] ? __kmalloc+0x211/0x230
    [] hrtimer_try_to_cancel+0x22/0xd0
    [] ? __kmalloc+0x211/0x230
    [] hrtimer_cancel+0x22/0x30
    [] free_fair_sched_group+0x25/0xd0
    [] free_sched_group+0x16/0x40
    [] sched_create_group+0x4b/0x80
    [] sched_autogroup_create_attach+0x43/0x1c0
    [] sys_setsid+0x7c/0x110
    [] system_call_fastpath+0x12/0x17

    Check whether init_cfs_bandwidth() was called before calling
    destroy_cfs_bandwidth().
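
    Per the bracketed note below, the check sits in destroy_cfs_bandwidth()
    itself; roughly (a sketch exploiting the fact that the group is
    allocated zeroed, so the list head stays NULL until init runs):

        static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
        {
                /* init_cfs_bandwidth() was never called */
                if (!cfs_b->throttled_cfs_rq.next)
                        return;

                hrtimer_cancel(&cfs_b->period_timer);
                hrtimer_cancel(&cfs_b->slack_timer);
        }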

    Signed-off-by: Tetsuo Handa
    [ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Paul Turner
    Cc: Ben Segall
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
    Signed-off-by: Ingo Molnar

    Tetsuo Handa
     
    The dl_runtime_exceeded() function is supposed to check if
    a SCHED_DEADLINE task must be throttled, by checking if its
    current runtime is <= 0.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc:
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar

    Luca Abeni
     
  • According to global EDF, tasks should be migrated between runqueues
    without checking if their scheduling deadlines and runtimes are valid.
    However, SCHED_DEADLINE currently performs such a check:
    a migration happens doing:

    deactivate_task(rq, next_task, 0);
    set_task_cpu(next_task, later_rq->cpu);
    activate_task(later_rq, next_task, 0);

    which ends up calling dequeue_task_dl(), setting the new CPU, and then
    calling enqueue_task_dl().

    enqueue_task_dl() then calls enqueue_dl_entity(), which calls
    update_dl_entity(), which can modify scheduling deadline and runtime,
    breaking global EDF scheduling.

    As a result, some of the properties of global EDF are not respected:
    for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
    two cores can have unbounded response times for the third task even
    if 30/80+40/80+120/170 = 1.5809 < 2

    This can be fixed by invoking update_dl_entity() only in case of
    wakeup, or if this is a new SCHED_DEADLINE task.
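
    In sketch form, the enqueue path then refreshes the parameters only
    when that is semantically valid (simplified; flag handling
    abbreviated):

        /* in enqueue_dl_entity(): */
        if (dl_se->dl_new || (flags & ENQUEUE_WAKEUP))
                update_dl_entity(dl_se, pi_se);
        else if (flags & ENQUEUE_REPLENISH)
                replenish_dl_entity(dl_se, pi_se);
        /* plain migration: deadline and runtime are left untouched */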

    Signed-off-by: Luca Abeni
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc:
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar

    Luca Abeni
     
    In effective_load(), we have (long w * unsigned long tg->shares) /
    long W; when w is negative, it is cast to unsigned long and hence the
    product is insanely large. Fix this by casting tg->shares to long.
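
    The pitfall in miniature (plain C integer conversion rules, nothing
    kernel-specific):

        long w = -1024, W = 2048;
        unsigned long shares = 2048;
        /* w converts to unsigned long, so the intermediate product --
           and hence the quotient -- is a huge positive value */
        long bad  = w * shares / W;
        long good = w * (long)shares / W;       /* -1024, as intended */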

    Reported-by: Sasha Levin
    Signed-off-by: Yuyang Du
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dave Jones
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
    Signed-off-by: Ingo Molnar

    Yuyang Du
     
  • On x86_64, at least, task_pt_regs may be only partially initialized
    in many contexts, so x86_64 should not use it without extra care
    from interrupt context, let alone NMI context.

    This will allow x86_64 to override the logic and will supply some
    scratch space to use to make a cleaner copy of user regs.

    Tested-by: Jiri Olsa
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Stephane Eranian
    Cc: chenggang.qcg@taobao.com
    Cc: Wu Fengguang
    Cc: Namhyung Kim
    Cc: Mike Galbraith
    Cc: Arjan van de Ven
    Cc: David Ahern
    Cc: Arnaldo Carvalho de Melo
    Cc: Catalin Marinas
    Cc: Jean Pihet
    Cc: Linus Torvalds
    Cc: Mark Salter
    Cc: Russell King
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
    wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE, and
    both checks can fail if we race with an EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE
    change in between and gcc reloads p->exit_state after
    security_task_wait(). In this case ->notask_error will be wrongly
    cleared and do_wait() can hang forever if it was the last eligible
    child.

    Many thanks to Arne who carefully investigated the problem.

    Note: this bug is very old but it was purely theoretical until commit
    b3ab03160dfa ("wait: completely ignore the EXIT_DEAD tasks"). Before
    this commit "-O2" was probably enough to guarantee that the compiler
    wouldn't read ->exit_state twice.
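
    The shape of the fix, sketched: take one snapshot of the field so
    both checks see the same value:

        /* one read; the compiler can no longer reload p->exit_state
           between the EXIT_DEAD/EXIT_TRACE and EXIT_ZOMBIE checks */
        int exit_state = ACCESS_ONCE(p->exit_state);

        if (exit_state == EXIT_DEAD)
                return 0;
        /* ... */
        if (exit_state == EXIT_ZOMBIE) {
                /* ... reap the child ... */
        }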

    Signed-off-by: Oleg Nesterov
    Reported-by: Arne Goedeke
    Tested-by: Arne Goedeke
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Jan, 2015

2 commits

  • Verify that the frequency value from userspace is valid and makes sense.

    Unverified values can cause overflows later on.
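
    The bounds check amounts to roughly this (PPM_SCALE being the factor
    txc->freq is later multiplied by):

        if (txc->modes & ADJ_FREQUENCY) {
                if (LONG_MIN / PPM_SCALE > txc->freq)
                        return -EINVAL;
                if (LONG_MAX / PPM_SCALE < txc->freq)
                        return -EINVAL;
        }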

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: stable
    Signed-off-by: Sasha Levin
    [jstultz: Fix up bug for negative values and drop redundant cap check]
    Signed-off-by: John Stultz

    Sasha Levin
     
    An unvalidated user input is multiplied by a constant, which can result
    in undefined behaviour for large values. While this is validated later,
    we should avoid triggering undefined behaviour.
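
    The validation is a simple range check before conversion, along the
    lines of timeval_valid():

        /* dates before 1970 are bogus */
        if (tv->tv_sec < 0)
                return -EINVAL;
        /* can't have more microseconds than a second */
        if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC)
                return -EINVAL;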

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: stable
    Signed-off-by: Sasha Levin
    [jstultz: include trivial millisecond->microsecond correction noticed
    by Andy]
    Signed-off-by: John Stultz

    Sasha Levin
     

01 Jan, 2015

1 commit

  • Pull audit fix from Paul Moore:
    "One audit patch to resolve a panic/oops when recording filenames in
    the audit log, see the mail archive link below.

    The fix isn't as nice as I would like, as it involves an allocate/copy
    of the filename, but it solves the problem and the overhead should
    only affect users who have configured audit rules involving file
    names.

    We'll revisit this issue with future kernels in an attempt to make
    this suck less, but in the meantime I think this fix should go into
    the next release of v3.19-rcX.

    [ https://marc.info/?t=141986927600001&r=1&w=2 ]"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    audit: create private file name copies when auditing inodes

    Linus Torvalds
     

31 Dec, 2014

1 commit

  • Pull networking fixes from David Miller:

    1) Fix double SKB free in bluetooth 6lowpan layer, from Jukka Rissanen.

    2) Fix receive checksum handling in enic driver, from Govindarajulu
    Varadarajan.

    3) Fix NAPI poll list corruption in virtio_net and caif_virtio, from
    Herbert Xu. Also, add code to detect drivers that have this mistake
    in the future.

    4) Fix doorbell endianness handling in mlx4 driver, from Amir Vadai.

    5) Don't clobber IP6CB() before xfrm6_policy_check() is called in TCP
    input path, from Nicolas Dichtel.

    6) Fix MPLS action validation in openvswitch, from Pravin B Shelar.

    7) Fix double SKB free in vxlan driver, also from Pravin.

    8) When we scrub a packet, which happens when we are switching the
    context of the packet (namespace, etc.), we should reset the
    secmark. From Thomas Graf.

    9) ->ndo_gso_check() needs to do more than return true/false; it also
    has to allow the driver to clear netdev feature bits in order for
    the caller to be able to proceed properly. From Jesse Gross.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
    genetlink: A genl_bind() to an out-of-range multicast group should not WARN().
    netlink/genetlink: pass network namespace to bind/unbind
    ne2k-pci: Add pci_disable_device in error handling
    bonding: change error message to debug message in __bond_release_one()
    genetlink: pass multicast bind/unbind to families
    netlink: call unbind when releasing socket
    netlink: update listeners directly when removing socket
    genetlink: pass only network namespace to genl_has_listeners()
    netlink: rename netlink_unbind() to netlink_undo_bind()
    net: Generalize ndo_gso_check to ndo_features_check
    net: incorrect use of init_completion fixup
    neigh: remove next ptr from struct neigh_table
    net: xilinx: Remove unnecessary temac_property in the driver
    net: phy: micrel: use generic config_init for KSZ8021/KSZ8031
    net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
    openvswitch: fix odd_ptr_err.cocci warnings
    Bluetooth: Fix accepting connections when not using mgmt
    Bluetooth: Fix controller configuration with HCI_QUIRK_INVALID_BDADDR
    brcmfmac: Do not crash if platform data is not populated
    ipw2200: select CFG80211_WEXT
    ...

    Linus Torvalds