13 Apr, 2016

1 commit


05 Apr, 2016

2 commits

  • Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
    "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The first patch, with most of the changes, has been done with
    coccinelle. The second is manual fixups on top.

    The third patch removes the macro definitions"

    [ I was planning to apply this just before rc2, but then I spaced out,
    so here it is right _after_ rc2 instead.

    As Kirill suggested as a possibility, I could have decided to only
    merge the first two patches, and leave the old interfaces for
    compatibility, but I'd rather get it all done and any out-of-tree
    modules and patches can trivially do the conversion while still also
    working with older kernels, so there is little reason to try to
    maintain the redundant legacy model. - Linus ]

    * PAGE_CACHE_SIZE-removal:
    mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
    mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
    mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

    Linus Torvalds
     
    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it is a constant source of confusion whether the
    PAGE_CACHE_* or the PAGE_* constant should be used in a particular
    case, especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

03 Apr, 2016

2 commits

  • Pull perf fixes from Ingo Molnar:
    "Misc kernel side fixes:

    - fix event leak
    - fix AMD PMU driver bug
    - fix core event handling bug
    - fix build bug on certain randconfigs

    Plus misc tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/amd/ibs: Fix pmu::stop() nesting
    perf/core: Don't leak event in the syscall error path
    perf/core: Fix time tracking bug with multiplexing
    perf jit: genelf makes assumptions about endian
    perf hists: Fix determination of a callchain node's childlessness
    perf tools: Add missing initialization of perf_sample.cpumode in synthesized samples
    perf tools: Fix build break on powerpc
    perf/x86: Move events_sysfs_show() outside CPU_SUP_INTEL
    perf bench: Fix detached tarball building due to missing 'perf bench memcpy' headers
    perf tests: Fix tarpkg build test error output redirection

    Linus Torvalds
     
  • Pull core kernel fixes from Ingo Molnar:
    "This contains the nohz/atomic cleanup/fix for the fetch_or() ugliness
    you noted during the original nohz pull request, plus there's also
    misc fixes:

    - fix liblockdep build bug
    - fix uapi header build bug
    - print more lockdep hash collision info to help debug recent reports
    of hash collisions
    - update MAINTAINERS email address"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    MAINTAINERS: Update my email address
    locking/lockdep: Print chain_key collision information
    uapi/linux/stddef.h: Provide __always_inline to userspace headers
    tools/lib/lockdep: Fix unsupported 'basename -s' in run_tests.sh
    locking/atomic, sched: Unexport fetch_or()
    timers/nohz: Convert tick dependency mask to atomic_t
    locking/atomic: Introduce atomic_fetch_or()

    Linus Torvalds
     

02 Apr, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Missing device reference in IPSEC input path results in crashes
    during device unregistration. From Subash Abhinov Kasiviswanathan.

    2) Per-queue ISR register writes not being done properly in macb
    driver, from Cyrille Pitchen.

    3) Stats accounting bugs in bcmgenet, from Petri Gynther.

    4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
    Quentin Armitage.

    5) SXGBE driver has off-by-one in probe error paths, from Rasmus
    Villemoes.

    6) Fix race in save/swap/delete options in netfilter ipset, from
    Vishwanath Pai.

    7) Ageing time of bridge not set properly when not operating over a
    switchdev device. Fix from Haishuang Yan.

    8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
    Duyck.

    9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.

    10) FEC driver should only access registers that actually exist on the
    given chipset, fix from Fabio Estevam.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
    net: mvneta: fix changing MTU when using per-cpu processing
    stmmac: fix MDIO settings
    Revert "stmmac: Fix 'eth0: No PHY found' regression"
    stmmac: fix TX normal DESC
    net: mvneta: use cache_line_size() to get cacheline size
    net: mvpp2: use cache_line_size() to get cacheline size
    net: mvpp2: fix maybe-uninitialized warning
    tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
    net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
    rtnl: fix msg size calculation in if_nlmsg_size()
    fec: Do not access unexisting register in Coldfire
    net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
    net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
    net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
    net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
    bpf: make padding in bpf_tunnel_key explicit
    ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
    bnxt_en: Fix ethtool -a reporting.
    bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
    bnxt_en: Implement proper firmware message padding.
    ...

    Linus Torvalds
     

31 Mar, 2016

11 commits

  • A sequence of pairs [class_idx -> corresponding chain_key iteration]
    is printed for both the current held_lock chain and the cached chain.

    That exposes the two different class_idx sequences that led to that
    particular hash value.

    This helps with debugging hash chain collision reports.

    Signed-off-by: Alfredo Alvarez Fernandez
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-fsdevel@vger.kernel.org
    Cc: sedat.dilek@gmail.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Alfredo Alvarez Fernandez
     
    Convert perf_output_begin() to __perf_output_begin() and make the
    latter function able to write records from the end of the ring-buffer.

    Following commits will utilize the 'backward' flag.

    This is the core patch to support writing to the ring-buffer backwards,
    which will be introduced by upcoming patches to support reading from
    overwritable ring-buffers.

    In theory, this patch should not introduce any extra performance
    overhead since we use always_inline, but it does not hurt to double
    check that assumption:

    When CONFIG_OPTIMIZE_INLINING is disabled, the output object is nearly
    identical to the original one. See:

    http://lkml.kernel.org/g/56F52E83.70409@huawei.com

    When CONFIG_OPTIMIZE_INLINING is enabled, the resulting object file
    becomes smaller:

    $ size kernel/events/ring_buffer.o*
    text data bss dec hex filename
    4641 4 8 4653 122d kernel/events/ring_buffer.o.old
    4545 4 8 4557 11cd kernel/events/ring_buffer.o.new

    Performance testing results:

    Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
    duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
    the system calls. Times are in ns.

    Testing environment:

    CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
    Kernel : v4.5.0

    MEAN STDVAR
    BASE 800214.950 2853.083
    PRE 2253846.700 9997.014
    POST 2257495.540 8516.293

    Here 'BASE' is the performance without capturing, 'PRE' is the result
    for the plain v4.5.0 kernel, and 'POST' is the result with this patch
    applied.

    Considering the stdvar, this patch doesn't hurt performance; the
    difference is within the noise margin.

    For testing details, see:

    http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-4-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     
    Set a default event->overflow_handler in perf_event_alloc() so we
    don't need to check event->overflow_handler for NULL in
    __perf_event_overflow(). Following commits can install a different
    default overflow_handler.

    Initial idea comes from Peter:

    http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net

    Since the default value of event->overflow_handler is not NULL, existing
    'if (!overflow_handler)' checks need to be changed.

    is_default_overflow_handler() is introduced for this.

    No extra performance overhead is introduced into the hot path because
    the original code already had to read this handler from memory. A
    conditional branch is avoided, so we actually remove some instructions.

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     
  • Add new ioctl() to pause/resume ring-buffer output.

    In some situations we want to read from the ring-buffer only when we
    ensure nothing can write to the ring-buffer during reading. Without
    this patch we have to turn off all events attached to this ring-buffer
    to achieve this.

    This patch is a prerequisite for enabling overwrite support in the
    perf ring-buffer. Following commits will introduce new methods to
    support reading from overwritable ring buffers. Before reading, the
    caller must ensure the ring buffer is frozen, or the read is
    unreliable.

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     
    Currently we check the sample type of ftrace:function events
    even if the event was not created as a sampling event. That prevents
    creating an ftrace:function event in counting mode.

    Make sure we check sample types only for sampling events.

    Before:
    $ sudo perf stat -e ftrace:function ls
    ...

    Performance counter stats for 'ls':

    ftrace:function

    0.001983662 seconds time elapsed

    After:
    $ sudo perf stat -e ftrace:function ls
    ...

    Performance counter stats for 'ls':

    44,498 ftrace:function

    0.037534722 seconds time elapsed

    Suggested-by: Namhyung Kim
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1458138873-1553-2-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • In order to ensure safe AUX buffer management, we rely on the assumption
    that pmu::stop() stops its ongoing AUX transaction and not just the hw.

    This patch documents this requirement for the perf_aux_output_{begin,end}()
    APIs.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Poirier
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/1457098969-21595-4-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
    Now that we can ensure that new transactions won't start while the
    ring buffer's AUX area is on its way to getting unmapped, we only need
    to stop all events that could potentially be writing AUX data to our
    ring buffer.

    Having done that, we can safely free the AUX pages and corresponding
    PMU data, as this time it is guaranteed to be the last aux reference
    holder.

    This partially reverts:

    57ffc5ca679 ("perf: Fix AUX buffer refcounting")

    ... which was made to defer deallocation that was otherwise possible
    from an NMI context. Now it is no longer the case; the last call to
    rb_free_aux() that drops the last AUX reference has to happen in
    perf_mmap_close() on that AUX area.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
    When the ring buffer's AUX area is unmapped and rb->aux_mmap_count
    drops to zero, new AUX transactions into this buffer can still be
    started, even though the buffer is en route to deallocation.

    This patch adds a check to perf_aux_output_begin() for
    rb->aux_mmap_count being zero, in which case there is no point
    starting new transactions. In other words, events on ring buffers that
    pass a certain point in perf_mmap_close() stop sending new data, which
    clears the path for freeing those buffers' pages right there and then,
    provided that no active transactions are holding the AUX reference.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/1457098969-21595-2-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • There should (and can) only be a single PMU for perf_hw_context
    events.

    This is because of how we schedule events: once a hardware event fails to
    schedule (the PMU is 'full') we stop trying to add more. The trivial
    'fix' would break the Round-Robin scheduling we do.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In the error path, event_file not being NULL is used to determine
    whether the event itself still needs to be free'd, so fix it up to
    avoid leaking.

    Reported-by: Leon Yu
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 130056275ade ("perf: Do not double free")
    Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Stephane reported that commit:

    3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")

    introduced a regression wrt. time tracking, as easily observed by:

    > This patch introduce a bug in the time tracking of events when
    > multiplexing is used.
    >
    > The issue is easily reproducible with the following perf run:
    >
    > $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
    > 1.000730239 652,394 branches (66.41%)
    > 1.000730239 597,809 branches (66.41%)
    > 1.000730239 593,870 branches (66.63%)
    > 1.000730239 651,440 branches (67.03%)
    > 1.000730239 656,725 branches (66.96%)
    > 1.000730239 branches
    >
    > One branches event is shown as not having run. Yet, with
    > multiplexing, all events should run especially with a 1s (-I 1000)
    > interval. The delta for time_running comes out to 0. Yet, the event
    > has run because the kernel is actually multiplexing the events. The
    > problem is that the time tracking is the kernel and especially in
    > ctx_sched_out() is wrong now.
    >
    > The problem is that in case that the kernel enters ctx_sched_out() with the
    > following state:
    > ctx->is_active=0x7 event_type=0x1
    > Call Trace:
    > [] dump_stack+0x63/0x82
    > [] ctx_sched_out+0x2bc/0x2d0
    > [] perf_mux_hrtimer_handler+0xf6/0x2c0
    > [] ? __perf_install_in_context+0x130/0x130
    > [] __hrtimer_run_queues+0xf8/0x2f0
    > [] hrtimer_interrupt+0xb7/0x1d0
    > [] local_apic_timer_interrupt+0x38/0x60
    > [] smp_apic_timer_interrupt+0x3d/0x50
    > [] apic_timer_interrupt+0x8c/0xa0
    >
    > In that case, the test:
    > if (is_active & EVENT_TIME)
    >
    > will be false and the time will not be updated. Time must always be updated on
    > sched out.

    Fix this by always updating time if EVENT_TIME was set, as opposed to
    only updating time when EVENT_TIME changed.

    Reported-by: Stephane Eranian
    Tested-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: kan.liang@intel.com
    Cc: namhyung@kernel.org
    Fixes: 3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
    Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Mar, 2016

2 commits

  • This patch functionally reverts:

    5fd7a09cfb8c ("atomic: Export fetch_or()")

    During the merge Linus observed that the generic version of fetch_or()
    was messy:

    " This makes the ugly "fetch_or()" macro that the scheduler used
    internally a new generic helper, and does a bad job at it. "

    e23604edac2a Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Now that we have introduced atomic_fetch_or(), fetch_or() is only used
    by the scheduler in order to deal with thread_info flags, whose type
    can vary across architectures.

    Let's confine fetch_or() back to the scheduler so that we encourage
    future users to use the more robust and well-typed atomic_t version
    instead.

    While at it, fetch_or() gets robustified, pasting improvements from a
    previous patch by Ingo Molnar that avoids needless expression
    re-evaluations in the loop.

    Reported-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1458830281-4255-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
    The tick dependency mask was initially an unsigned long because that
    is the type clear_bit() operates on, and fetch_or() accepts it.

    But now that we have atomic_fetch_or(), we can instead use
    atomic_andnot() to clear the bit. This consolidates the type of our
    tick dependency mask, reduces its size in structures, and benefits
    from possible architecture optimizations of atomic_t operations.

    Suggested-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1458830281-4255-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

26 Mar, 2016

3 commits

    KASAN needs to know whether the allocation happens in an IRQ handler.
    This lets us strip everything below the IRQ entry point to reduce the
    number of unique stack traces that need to be stored.

    Move the definition of __irq_entry to <linux/interrupt.h> so that the
    users don't need to pull in <linux/ftrace.h>. Also introduce the
    __softirq_entry macro, which is similar to __irq_entry but puts the
    corresponding functions into the .softirqentry.text section.

    Signed-off-by: Alexander Potapenko
    Acked-by: Steven Rostedt
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
    When the oom_reaper manages to unmap all the eligible vmas, there
    shouldn't be much of the freeable memory held by the oom victim left
    anymore, so it makes sense to clear the TIF_MEMDIE flag for the victim
    and allow the OOM killer to select another task.

    The lack of TIF_MEMDIE also means that the victim cannot access memory
    reserves anymore, but that shouldn't be a problem because it would get
    the access again if it needs to allocate and hits the OOM killer again
    due to the fatal_signal_pending resp. PF_EXITING check. We can safely
    hide the task from the OOM killer because it is clearly not a good
    candidate anymore, as everything reclaimable has been torn down
    already.

    This patch allows capping the time an OOM victim can keep TIF_MEMDIE
    and thus hold off further global OOM killer actions, provided the oom
    reaper is able to take mmap_sem for the associated mm struct. This is
    not guaranteed now, but further steps should make sure that taking
    mmap_sem for write is blocked killably, which will help reduce such
    lock contention. That is not done by this patch.

    Note that exit_oom_victim might now be called on a remote task from
    __oom_reap_task, so we have to check and clear the flag atomically;
    otherwise we might race and underflow oom_victims or wake up waiters
    too early.

    Signed-off-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Tetsuo Handa
    Cc: Andrea Argangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This will be needed in the patch "mm, oom: introduce oom reaper".

    Acked-by: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

25 Mar, 2016

6 commits

    Add a map_flags attribute to bpf_map_show_fdinfo(), so that tools like
    tc can check it when loading objects from a pinned entry, e.g. if the
    user's intent wrt allocation (BPF_F_NO_PREALLOC) differs from that of
    the pinned object, it can bail out. Follow-up to 6c9059817432 ("bpf:
    pre-allocate hash map elements"), so that tc can still support this
    with v4.6.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pull more power management and ACPI updates from Rafael Wysocki:
    "The second batch of power management and ACPI updates for v4.6.

    Included are fixups on top of the previous PM/ACPI pull request and
    other material that didn't make it into that one but should still go
    into 4.6.

    Among other things, there's a fix for an intel_pstate driver issue
    uncovered by recent cpufreq changes, a workaround for a boot hang on
    Skylake-H related to the handling of deep C-states by the platform and
    a PCI/ACPI fix for the handling of IO port resources on non-x86
    architectures plus some new device IDs and similar.

    Specifics:

    - Fix for an intel_pstate driver issue related to the handling of MSR
    updates uncovered by the recent cpufreq rework (Rafael Wysocki).

    - cpufreq core cleanups related to starting governors and frequency
    synchronization during resume from system suspend and a locking fix
    for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).

    - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
    Michael Neuling, Richard Cochran, Shilpasri Bhat).

    - intel_idle driver update preventing some Skylake-H systems from
    hanging during initialization by disabling deep C-states mishandled
    by the platform in the problematic configurations (Len Brown).

    - Intel Xeon Phi Processor x200 support for intel_idle
    (Dasaratharaman Chandramouli).

    - cpuidle menu governor updates to make it always honor PM QoS
    latency constraints (and prevent C1 from being used as the fallback
    C-state on x86 when they are set below its exit latency), and to
    restore the pre-4.4 behavior of falling back to C1 if the next timer
    event is far enough in the future; the 4.4 change had led to an
    energy consumption regression (Rik van Riel, Rafael
    Wysocki).

    - New device ID for a future AMD UART controller in the ACPI driver
    for AMD SoCs (Wang Hongcheng).

    - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
    scaling (AVS) driver (David Wu).

    - ACPI PCI resources management fix for the handling of IO space
    resources on architectures where the IO space is memory mapped
    (IA64 and ARM64) broken by the introduction of common ACPI
    resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).

    - Fix for the ACPI backend of the generic device properties API to
    make it parse non-device (data node only) children of an ACPI
    device correctly (Irina Tirdea).

    - Fixes for the handling of global suspend flags (introduced in 4.4)
    during hibernation and resume from it (Lukas Wunner).

    - Support for obtaining configuration information from Device Trees
    in the PM clocks framework (Jon Hunter).

    - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
    King, Geert Uytterhoeven)"

    * tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399
    intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
    intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate
    cpufreq: governor: Always schedule work on the CPU running update
    cpufreq: Always update current frequency before startig governor
    cpufreq: Introduce cpufreq_update_current_freq()
    cpufreq: Introduce cpufreq_start_governor()
    cpufreq: powernv: Add sysfs attributes to show throttle stats
    cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
    PCI: ACPI: IA64: fix IO port generic range check
    ACPI / util: cast data to u64 before shifting to fix sign extension
    cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
    cpuidle: menu: Fall back to polling if next timer event is near
    cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
    intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
    cpufreq: Make cpufreq_quick_get() safe to call
    ACPI / property: fix data node parsing in acpi_get_next_subnode()
    ACPI / APD: Add device HID for future AMD UART controller
    ...

    Linus Torvalds
     
  • * pm-avs:
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399

    * pm-clk:
    PM / clk: Add support for obtaining clocks from device-tree

    * pm-devfreq:
    PM / devfreq: Spelling s/frequnecy/frequency/

    * pm-sleep:
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate

    Rafael J. Wysocki
     
  • Pull tracing updates from Steven Rostedt:
    "Nothing major this round. Mostly small clean ups and fixes.

    Some visible changes:

    - A new flag was added to distinguish traces done in NMI context.

    - Preempt tracer now shows functions where preemption is disabled but
    interrupts are still enabled.

    Other notes:

    - Updates were done to function tracing to allow better performance
    with perf.

    - Infrastructure code has been added to allow for a new histogram
    feature for recording live trace event histograms that can be
    configured by simple user commands. The feature itself was just
    finished, but needs a round in linux-next before being pulled.

    This only includes some infrastructure changes that will be needed"

    * tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits)
    tracing: Record and show NMI state
    tracing: Fix trace_printk() to print when not using bprintk()
    tracing: Remove redundant reset per-CPU buff in irqsoff tracer
    x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c
    tracing: Fix crash from reading trace_pipe with sendfile
    tracing: Have preempt(irqs)off trace preempt disabled functions
    tracing: Fix return while holding a lock in register_tracer()
    ftrace: Use kasprintf() in ftrace_profile_tracefs()
    ftrace: Update dynamic ftrace calls only if necessary
    ftrace: Make ftrace_hash_rec_enable return update bool
    tracing: Fix typoes in code comment and printk in trace_nop.c
    tracing, writeback: Replace cgroup path to cgroup ino
    tracing: Use flags instead of bool in trigger structure
    tracing: Add an unreg_all() callback to trigger commands
    tracing: Add needs_rec flag to event triggers
    tracing: Add a per-event-trigger 'paused' field
    tracing: Add get_syscall_name()
    tracing: Add event record param to trigger_ops.func()
    tracing: Make event trigger functions available
    tracing: Make ftrace_event_field checking functions available
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "This tree contains various perf fixes on the kernel side, plus three
    hw/event-enablement late additions:

    - Intel Memory Bandwidth Monitoring events and handling
    - the AMD Accumulated Power Mechanism reporting facility
    - more IOMMU events

    ... and a final round of perf tooling updates/fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    perf llvm: Use strerror_r instead of the thread unsafe strerror one
    perf llvm: Use realpath to canonicalize paths
    perf tools: Unexport some methods unused outside strbuf.c
    perf probe: No need to use formatting strbuf method
    perf help: Use asprintf instead of adhoc equivalents
    perf tools: Remove unused perf_pathdup, xstrdup functions
    perf tools: Do not include stringify.h from the kernel sources
    tools include: Copy linux/stringify.h from the kernel
    tools lib traceevent: Remove redundant CPU output
    perf tools: Remove needless 'extern' from function prototypes
    perf tools: Simplify die() mechanism
    perf tools: Remove unused DIE_IF macro
    perf script: Remove lots of unused arguments
    perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve
    perf machine: Rename perf_event__preprocess_sample to machine__resolve
    perf tools: Add cpumode to struct perf_sample
    perf tests: Forward the perf_sample in the dwarf unwind test
    perf tools: Remove misplaced __maybe_unused
    perf list: Fix documentation of :ppp
    perf bench numa: Fix assertion for nodes bitfield
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: a cgroup fix, a fair-scheduler migration accounting fix, a
    cputime fix and two cpuacct cleanups"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cpuacct: Simplify the cpuacct code
    sched/cpuacct: Rename parameter in cpuusage_write() for readability
    sched/fair: Add comments to explain select_idle_sibling()
    sched/fair: Fix fairness issue on migration
    sched/cgroup: Fix/cleanup cgroup teardown/init
    sched/cputime: Fix steal time accounting vs. CPU hotplug

    Linus Torvalds
     

23 Mar, 2016

12 commits

  • When suspending to RAM, waking up and later suspending to disk,
    we gratuitously runtime resume devices after the thaw phase.
    This does not occur if we always suspend to RAM or always to disk.

    pm_complete_with_resume_check(), which gets called from
    pci_pm_complete() among others, schedules a runtime resume
    if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
    a suspend-to-RAM cycle. It is cleared at the beginning of
    the suspend-to-RAM cycle but not afterwards and it is not
    cleared during a suspend-to-disk cycle at all. Fix it.

    Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement)
    Signed-off-by: Lukas Wunner
    Cc:  # 4.4+
    Signed-off-by: Rafael J. Wysocki

    Lukas Wunner
     
  • Use the more common logging method with the eventual goal of removing
    pr_warning altogether.

    Miscellanea:

    - Realign arguments
    - Coalesce formats
    - Add missing space between a few coalesced formats

    Signed-off-by: Joe Perches
    Acked-by: Rafael J. Wysocki [kernel/power/suspend.c]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Add a flag to memremap() for writecombine mappings. Mappings satisfied
    by this flag will not be cached, however writes may be delayed or
    combined into more efficient bursts. This is most suitable for buffers
    written sequentially by the CPU for use by other DMA devices.

    Signed-off-by: Brian Starkey
    Reviewed-by: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • These patches implement a MEMREMAP_WC flag for memremap(), which can be
    used to obtain writecombine mappings. This is then used for setting up
    dma_coherent_mem regions which use the DMA_MEMORY_MAP flag.

    The motivation is to fix an alignment fault on arm64, and the suggestion
    to implement MEMREMAP_WC for this case was made at [1]. That particular
    issue is handled in patch 4, which makes sure that the appropriate
    memset function is used when zeroing allocations mapped as IO memory.

    This patch (of 4):

    Don't modify the flags input argument to memremap(). MEMREMAP_WB is
    already a special case so we can check for it directly instead of
    clearing flag bits in each mapper.

    Signed-off-by: Brian Starkey
    Cc: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including
    padding) of the part of the struct siginfo that is before the union, and
    it is then used to calculate the needed padding (SI_PAD_SIZE) to make
    the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes.

    Depending on the target architecture and word width, it equals either
    3 or 4 times sizeof(int).

    Since the very beginning we had __ARCH_SI_PREAMBLE_SIZE wrong on the
    parisc architecture for the 64-bit kernel build. It's even more
    frustrating because it can easily be checked at compile time whether
    the value was defined correctly.

    This patch adds such a check for the correctness of
    __ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and
    future architectures from running into the same problem.

    I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in
    copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't
    make any difference and b) it's used in the Documentation/kmemcheck.txt
    example.

    I ran this patch through the 0-DAY kernel test infrastructure and only
    the parisc architecture triggered as expected. That means that this
    patch should be OK for all major architectures.

    Signed-off-by: Helge Deller
    Cc: Stephen Rothwell
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal, it does not collect coverage in
    soft/hard interrupts, and instrumentation of some inherently
    non-deterministic or uninteresting parts of the kernel (e.g. the
    scheduler and locking) is disabled.

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage set can be just a dozen basic blocks (e.g. an
    invalid input). In such a context gcov becomes prohibitively
    expensive, as the reset/collect steps depend on the total number of
    basic blocks/edges in the program (about 2M in the case of the
    kernel), whereas the cost of kcov depends only on the number of
    executed basic blocks/edges. On top of that, the kernel requires
    per-thread coverage, because there are always background threads and
    unrelated processes that also produce coverage. With inlined gcov
    instrumentation, per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space, which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • A couple of functions and variables in the profile implementation are
    used only on SMP systems by the procfs code, but are unused if either
    procfs is disabled or in uniprocessor kernels. gcc prints a harmless
    warning about the unused symbols:

    kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_flip_buffers(void)
    ^
    kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_discard_flip_buffers(void)
    ^
    kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
    static int profile_cpu_callback(struct notifier_block *info,
    ^

    This adds further #ifdefs to the file, to annotate exactly in which
    cases they are used. I have built several thousand ARM randconfig
    kernels with this patch applied and no longer get any warnings in
    this file.

    Signed-off-by: Arnd Bergmann
    Cc: Vlastimil Babka
    Cc: Robin Holt
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Commit 1717f2096b54 ("panic, x86: Fix re-entrance problem due to panic
    on NMI") and commit 58c5661f2144 ("panic, x86: Allow CPUs to save
    registers even if looping in NMI context") introduced nmi_panic() which
    prevents concurrent/recursive execution of panic(). It also saves
    registers for the crash dump on x86.

    However, there are some cases where NMI handlers still use panic().
    This patch set partially replaces them with nmi_panic() in those cases.

    Even after this patchset is applied, some NMI or similar handlers
    (e.g. the MCE handler) continue to use panic(). This is because I
    can't test them well and actual problems are unlikely to happen. For
    example, the possibility that a normal panic and a panic on MCE
    happen simultaneously is very low.

    This patch (of 3):

    Convert nmi_panic() to a proper function and export it instead of
    exporting internal implementation details to modules, for obvious
    reasons.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Borislav Petkov
    Acked-by: Michal Nazarewicz
    Cc: Michal Hocko
    Cc: Rasmus Villemoes
    Cc: Nicolas Iooss
    Cc: Javi Merino
    Cc: Gobinda Charan Maji
    Cc: "Steven Rostedt (Red Hat)"
    Cc: Thomas Gleixner
    Cc: Vitaly Kuznetsov
    Cc: HATAYAMA Daisuke
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • This test-case (a simplified version of one generated by syzkaller)

    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    void test(void)
    {
        for (;;) {
            if (fork()) {
                wait(NULL);
                continue;
            }

            ptrace(PTRACE_SEIZE, getppid(), 0, 0);
            ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
            _exit(0);
        }
    }

    int main(void)
    {
        int np;

        for (np = 0; np < 8; ++np)
            if (!fork())
                test();

        while (wait(NULL) > 0)
            ;
        return 0;
    }

    triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap(). The
    problem is that __ptrace_unlink() clears task->jobctl under siglock but
    task->ptrace is cleared without this lock held; this fools the "else"
    branch which assumes that !PT_SEIZED means PT_PTRACED.

    Note also that most of other PTRACE_SEIZE checks can race with detach
    from the exiting tracer too. Say, the callers of ptrace_trap_notify()
    assume that SEIZED can't go away after it was checked.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Tejun Heo
    Cc: syzkaller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Except on SPARC, this is what the code always did. SPARC compat seccomp
    was buggy, although the impact of the bug was limited because SPARC
    32-bit and 64-bit syscall numbers are the same.

    Signed-off-by: Andy Lutomirski
    Cc: Paul Moore
    Cc: Eric Paris
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Users of the 32-bit ptrace() ABI expect the full 32-bit ABI. siginfo
    translation should check ptrace() ABI, not caller task ABI.

    This is an ABI change on SPARC. Let's hope that no one relied on the
    old buggy ABI.

    Signed-off-by: Andy Lutomirski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski