Eric Lee / smarc-fsl-linux-kernel

12 Sep, 2013

40 commits

5173b414e aoe: remove do-nothing NAME="%k" term from example udev rules

When the example udev rules in the documentation are used without
modification, warnings like the one shown below appear in the system logs:

/var/log/messages:Aug 22 11:09:11 kung udevd[445]: NAME="%k" \
is superfluous and breaks kernel supplied names, please remove \
it from /etc/udev/rules.d/60-aoe.rules:26

Removing the term does not cause any problems with the creation of the
special character and block device nodes.

Signed-off-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ed Cashin
2013-09-12 06:59:28 +0800
fea1b1397 aoe: do not BUG if memory pressure prevented debugfs file creation

If the system has trouble allocating memory for the creation of the aoe
debugfs directory or of a file inside it, the debugfs member of an aoedev
can be NULL.

Do not treat a NULL debugfs pointer as a BUG on aoedev shutdown, avoiding
the user impact of an unnecessary panic.

Signed-off-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ed Cashin
2013-09-12 06:59:28 +0800
e0ec36059 aoe: suppress compiler warnings

This patch fixes the following compiler warnings:

drivers/block/aoe/aoecmd.c: In function `aoecmd_ata_rw':
drivers/block/aoe/aoecmd.c:383:17: warning: variable `t' set but not used [-Wunused-but-set-variable]
struct aoetgt *t;
^
drivers/block/aoe/aoecmd.c: In function `resend':
drivers/block/aoe/aoecmd.c:488:21: warning: variable `ah' set but not used [-Wunused-but-set-variable]
struct aoe_atahdr *ah;
^

Signed-off-by: Andy Shevchenko
Cc: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2013-09-12 06:59:27 +0800
a88c1f0ca aoe: remove custom implementation of kbasename()

The kernel already provides the kbasename() helper for this. This patch
replaces the driver's custom implementation with a call to that helper.
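
For readers unfamiliar with the helper, here is a userspace sketch of the
behaviour it provides. The name basename_sketch() is made up for
illustration; the driver itself calls the in-kernel kbasename().

```c
#include <string.h>

/*
 * Userspace sketch (hypothetical name): return the part of path that
 * follows the last '/', or the whole string when there is no '/'.
 */
static const char *basename_sketch(const char *path)
{
	const char *tail = strrchr(path, '/');

	return tail ? tail + 1 : path;
}
```

The returned pointer aliases the input string, so nothing is allocated and
nothing needs to be freed.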

Signed-off-by: Andy Shevchenko
Cc: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2013-09-12 06:59:26 +0800
896dcd9a6 aoe: update internal version number to 85

Signed-off-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ed Cashin
2013-09-12 06:59:26 +0800
ec345120c aoe: update copyright date

Signed-off-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ed Cashin
2013-09-12 06:59:25 +0800
2256c1c51 aoe: fill in per-AoE-target information for debugfs file

This information is presented in a compact format that has evolved for
easy routine scanning by expert humans, mostly developers and support
technicians helping to troubleshoot or test AoE-based systems.

Signed-off-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ed Cashin
2013-09-12 06:59:25 +0800
1cf94797c aoe: provide file operations for debugfs files

The placeholder in the file contents is filled out in the following
patch.

Signed-off-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ed Cashin
2013-09-12 06:59:24 +0800
e8866cf2b aoe: add AoE-target files to debugfs

Signed-off-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ed Cashin
2013-09-12 06:59:23 +0800
190519cd3 aoe: create and destroy debugfs directory for aoe

This series adds the debugging information that the coraid.com-distributed
aoe driver exports via sysfs, but instead of sysfs, it uses debugfs.

With these patches applied, even without AoE targets on the network, KEDR
reports new possible memory leaks, but these are from callers outside the
aoe driver that have used aoe_devnode to get the name of the character
devices through the aoe_class->devnode callback, and I believe they're
responsible for freeing that memory.

This patch:

Create and destroy the debugfs directory.

Signed-off-by: Ed Cashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ed Cashin
2013-09-12 06:59:22 +0800
0bd42136f mm/zswap: use postorder iteration when destroying rbtree

Signed-off-by: Cody P Schafer
Reviewed-by: Seth Jennings
Cc: David Woodhouse
Cc: Rik van Riel
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cody P Schafer
2013-09-12 06:59:21 +0800
7c993e11a rbtree: allow tests to run as builtin

There is no reason to require the rbtree test code to be a module, so allow
it to be built in (this streamlines my development process).

Signed-off-by: Cody P Schafer
Reviewed-by: Seth Jennings
Cc: David Woodhouse
Cc: Rik van Riel
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cody P Schafer
2013-09-12 06:59:20 +0800
a791a62fd rbtree_test: add test for postorder iteration

Just check that we examine all nodes in the tree for the postorder
iteration.

Signed-off-by: Cody P Schafer
Reviewed-by: Seth Jennings
Cc: David Woodhouse
Cc: Rik van Riel
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cody P Schafer
2013-09-12 06:59:20 +0800
2b5290892 rbtree: add rbtree_postorder_for_each_entry_safe() helper

Because deletion (of the entire tree) is a relatively common use of the
rbtree_postorder iteration, and because doing it safely means fiddling
with temporary storage, provide a helper to simplify postorder rbtree
iteration.

Signed-off-by: Cody P Schafer
Reviewed-by: Seth Jennings
Cc: David Woodhouse
Cc: Rik van Riel
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cody P Schafer
2013-09-12 06:59:20 +0800
9dee5c515 rbtree: add postorder iteration functions

Postorder iteration yields all of a node's children prior to yielding the
node itself, and this particular implementation also avoids examining the
leaf links in a node after that node has been yielded.

In what I expect will be its most common usage, postorder iteration allows
the deletion of every node in an rbtree without modifying the rbtree nodes
(no _requirement_ that they be nulled) while avoiding referencing child
nodes after they have been "deleted" (most commonly, freed).

I have only updated zswap to use this functionality at this point, but
numerous bits of code (most notably in the filesystem drivers) use a hand
rolled postorder iteration that NULLs child links as it traverses the
tree. Each of those instances could be replaced with this common
implementation.

1 & 2 add rbtree postorder iteration functions.
3 adds testing of the iteration to the rbtree runtime tests
4 allows building the rbtree runtime tests as builtins
5 updates zswap.

This patch:

Add postorder iteration functions for rbtree. These are useful for safely
freeing an entire rbtree without modifying the tree at all.
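
The idea can be sketched in userspace on a plain binary tree with parent
links. This is not the rbtree code from the patch: struct node and every
function name below are assumptions made for illustration, and the
successor test is written so that it never reads a link to a node that has
already been freed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Plain binary tree node with a parent link; a stand-in for struct rb_node. */
struct node {
	struct node *parent, *left, *right;
	int key;
};

/* Allocate a node and attach the given children to it. */
static struct node *new_node(int key, struct node *left, struct node *right)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		abort();
	n->key = key;
	n->parent = NULL;
	n->left = left;
	n->right = right;
	if (left)
		left->parent = n;
	if (right)
		right->parent = n;
	return n;
}

/* Descend as far as possible, preferring left links over right links. */
static struct node *left_deepest(struct node *n)
{
	for (;;) {
		if (n->left)
			n = n->left;
		else if (n->right)
			n = n->right;
		else
			return n;
	}
}

/* First node in postorder, or NULL for an empty tree. */
static struct node *first_postorder(struct node *root)
{
	return root ? left_deepest(root) : NULL;
}

/* Postorder successor of n; looks at n's parent, never at n's children. */
static struct node *next_postorder(struct node *n)
{
	struct node *parent = n->parent;

	/* n is a left child with an unvisited right sibling subtree. */
	if (parent && parent->right && n != parent->right)
		return left_deepest(parent->right);
	/* Otherwise the parent is next (NULL once the root is done). */
	return parent;
}

/* Free every node without writing to the tree; returns the number freed. */
static size_t free_tree(struct node *root)
{
	struct node *pos, *next;
	size_t freed = 0;

	for (pos = first_postorder(root); pos; pos = next) {
		next = next_postorder(pos);	/* fetch before pos goes away */
		free(pos);
		freed++;
	}
	return freed;
}
```

For a tree with root 4, left subtree 2 (children 1 and 3) and right child 5,
the iteration visits 1, 3, 2, 5, 4, so each node is reached only after both
of its children.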

Signed-off-by: Cody P Schafer
Reviewed-by: Seth Jennings
Cc: David Woodhouse
Cc: Rik van Riel
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cody P Schafer
2013-09-12 06:59:19 +0800
b4bc4a18a block/partitions/efi.c: consistently use pr_foo()

Cc: Davidlohr Bueso
Cc: Karel Zak
Cc: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-09-12 06:59:19 +0800
70f637e90 partitions/efi: some style cleanups

Trivial coding style cleanups - still plenty left.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Karel Zak
Acked-by: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:19 +0800
08009b30a partitions/efi: delete annoying emacs style comments

I love emacs, but these settings for coding style are annoying when trying
to open the efi.h file. More important, we already have checkpatch for
that.

Signed-off-by: Davidlohr Bueso
Reviewed-by: Karel Zak
Acked-by: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:18 +0800
aa054bc93 partitions/efi: compare first and last usable LBAs

When verifying GPT header integrity, make sure that first usable LBA is
smaller than last usable LBA.
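
A sketch of such a range check follows. The function name, the parameter
names and the extra comparison against the last LBA of the device are
assumptions for illustration, as is the choice to accept a range in which
both values are equal.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: accept a GPT header's usable range only if it lies
 * on the device and is not inverted. Values are in host byte order.
 */
static bool gpt_usable_range_ok(uint64_t first_usable, uint64_t last_usable,
				uint64_t device_last_lba)
{
	if (first_usable > device_last_lba || last_usable > device_last_lba)
		return false;
	return first_usable <= last_usable;
}
```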

Signed-off-by: Davidlohr Bueso
Reviewed-by: Karel Zak
Acked-by: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:18 +0800
27a7c6421 partitions/efi: account for pmbr size in lba

The partition record of type 0xEE (GPT protective) must have its size in
LBA field set to the lesser of the size of the disk minus one and
0xFFFFFFFF (the latter applies to larger disks).
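
The rule amounts to a clamped subtraction. A minimal sketch, assuming the
disk size is given as a sector count (the function name is hypothetical):

```c
#include <stdint.h>

/*
 * Expected size-in-LBA value for the 0xEE record of a disk with
 * total_sectors sectors: min(total_sectors - 1, 0xFFFFFFFF).
 */
static uint32_t pmbr_expected_size(uint64_t total_sectors)
{
	uint64_t size;

	if (total_sectors == 0)		/* avoid wrapping on an empty disk */
		return 0;
	size = total_sectors - 1;	/* everything after LBA 0 */
	return size > 0xFFFFFFFFu ? 0xFFFFFFFFu : (uint32_t)size;
}
```

A 1000-sector disk yields 999, while any disk of 2^32 sectors or more yields
0xFFFFFFFF because the field is only 32 bits wide.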

Signed-off-by: Davidlohr Bueso
Reviewed-by: Karel Zak
Acked-by: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:17 +0800
b05ebbbbe partitions/efi: detect hybrid MBRs

One of the biggest problems with GPT is compatibility with older, non-GPT
systems. The problem is addressed by creating hybrid MBRs, an extension,
or variant, of the traditional protective MBR. A hybrid MBR contains, apart
from the 0xEE partition, up to three additional primary partitions that point
to the same space marked by up to three GPT partitions. The result is that
legacy OSs can see the three required MBR partitions and at the same time
ignore the GPT-aware partitions that protect the GPT structures.

While hybrid MBRs are hacks, workarounds and simply not part of the GPT
standard, they do exist and we have no way around them. For instance, by
default, OSX creates a hybrid scheme when using multi-OS booting.

In order for Linux to properly discover protective MBRs, it must be made
aware of devices that have hybrid MBRs. No functionality is changed by
this patch, just a debug message informing the user of the MBR scheme that
is being used.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Karel Zak
Acked-by: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:16 +0800
3e69ac344 partitions/efi: do not require gpt partition to begin at sector 1

When detecting a valid protective MBR, the Linux kernel isn't picky about
the partition (1-4) the 0xEE is at, but, unlike other operating systems,
it does require it to begin at the second sector (sector 1). This check,
apart from it not being enforced by UEFI, and causing Linux to potentially
fail to detect any *valid* partitions on the disk, can present problems
when dealing with hybrid MBRs[1].

For compatibility reasons, if the first partition is hybridized, the 0xEE
partition must be small enough to ensure that it only protects the GPT
data structures, as opposed to the whole disk in a protective MBR.
This problem is very well described by Rod Smith[1]: MBR-only
partitioning programs (such as older versions of fdisk) can see some of
the disk space as unallocated, thus defeating the purpose of the 0xEE
partition's protection of GPT data structures.

By dropping this check, this patch enables Linux to be more flexible when
probing for GPT disklabels.

[1] http://www.rodsbooks.com/gdisk/hybrid.html#reactions

Signed-off-by: Davidlohr Bueso
Reviewed-by: Karel Zak
Acked-by: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:16 +0800
33afd7a7d partitions/efi: check pmbr record's starting lba

Per the UEFI Specs 2.4, June 2013, the starting lba of the partition that
has the EFI GPT (0xEE) must be set to 0x00000001 - this is obviously the
LBA of the GPT Partition Header.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Karel Zak
Acked-by: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:15 +0800
c2ebdc243 partitions/efi: use lba-aware partition records

The kernel's GPT implementation currently uses the generic 'struct
partition' type for dealing with legacy MBR partition records. While this
is useful for disklabels that were designed for CHS addressing, such as
msdos, it doesn't adapt well to newer standards that use LBA instead, such
as GUID partition tables. Furthermore, these generic partition structures
do not have all the required fields to properly follow the UEFI specs.

While a CHS address can be translated to LBA, it's much simpler and
cleaner to just replace the partition type. This patch adds a new
'gpt_record' type that is fully compliant with EFI and will allow later
patches to add more checks to properly verify a protective MBR,
which is paramount to probing a device that makes use of GPT.
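
To make the layout concrete, here is a userspace sketch of a 16-byte legacy
MBR partition record with explicit LBA fields, decoded from raw
little-endian bytes. The struct, its field names and the helpers are
illustrative assumptions and are not the definition added by the patch.

```c
#include <stdint.h>

/* Decoded view of one 16-byte legacy MBR partition record (sketch). */
struct mbr_record_sketch {
	uint8_t  boot_indicator;
	uint8_t  start_chs[3];	/* CHS start; not needed for GPT probing */
	uint8_t  os_type;	/* 0xEE marks the GPT protective partition */
	uint8_t  end_chs[3];	/* CHS end; not needed for GPT probing */
	uint32_t starting_lba;	/* host byte order after decoding */
	uint32_t size_in_lba;	/* host byte order after decoding */
};

/* Read a 32-bit little-endian value from four bytes. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode one on-disk record into the struct above. */
static struct mbr_record_sketch parse_mbr_record(const uint8_t raw[16])
{
	struct mbr_record_sketch r;
	int i;

	r.boot_indicator = raw[0];
	for (i = 0; i < 3; i++) {
		r.start_chs[i] = raw[1 + i];
		r.end_chs[i] = raw[5 + i];
	}
	r.os_type = raw[4];
	r.starting_lba = get_le32(raw + 8);
	r.size_in_lba = get_le32(raw + 12);
	return r;
}
```

Decoding byte by byte keeps the sketch independent of host endianness and
of struct packing rules.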

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Karel Zak
Acked-by: Matt Fleming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2013-09-12 06:59:15 +0800
6f79d3322 s390/vmcore: use vmcore for zfcpdump

Modify the s390 copy_oldmem_page() and remap_oldmem_pfn_range() functions
for zfcpdump to read from the HSA memory if memory below HSA_SIZE bytes is
requested. Otherwise real memory is used.

Signed-off-by: Michael Holzheu
Cc: HATAYAMA Daisuke
Cc: Jan Willeke
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Holzheu
2013-09-12 06:59:15 +0800
11e376a3f vmcore: enable /proc/vmcore mmap for s390

The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" now
allows mmap to be used on s390 as well.

So enable mmap for s390 again.

Signed-off-by: Michael Holzheu
Cc: HATAYAMA Daisuke
Cc: Jan Willeke
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Holzheu
2013-09-12 06:59:14 +0800
23df79da8 s390/vmcore: implement remap_oldmem_pfn_range for s390

Introduce the s390 specific way to map pages from oldmem. The memory area
below OLDMEM_SIZE is mapped with offset OLDMEM_BASE. The other old memory
is mapped directly.
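
The address rule can be sketched as a pure function. The name and the
explicit base and size parameters are assumptions for illustration; in the
kernel the two values are the OLDMEM_BASE and OLDMEM_SIZE mentioned above.

```c
#include <stdint.h>

/*
 * Translate an old-memory address to the address that gets mapped:
 * addresses below oldmem_size are shifted by oldmem_base, all others are
 * used unchanged.
 */
static uint64_t oldmem_translate(uint64_t addr, uint64_t oldmem_base,
				 uint64_t oldmem_size)
{
	return addr < oldmem_size ? addr + oldmem_base : addr;
}
```

A request that straddles the boundary would have to be split into one piece
below oldmem_size and one piece at or above it, each translated separately.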

Signed-off-by: Jan Willeke
Signed-off-by: Michael Holzheu
Cc: HATAYAMA Daisuke
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Willeke
2013-09-12 06:59:12 +0800
9cb218131 vmcore: introduce remap_oldmem_pfn_range()

For zfcpdump we can't map the HSA storage because it is only available via
a read interface. Therefore, for the new vmcore mmap feature we have to
introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything
except the HSA memory. For the areas that are not mapped by
remap_oldmem_pfn_range(), a new generic vmcore fault handler,
mmap_vmcore_fault(), is called.

This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
  Return that page
* If no:
  Fill page using __vmcore_read(), set PageUptodate, and return page
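
The steps above follow a fill-on-first-access pattern, which can be modelled
in userspace as below. Nothing here is kernel API: the fixed-size cache, the
slow_read() stand-in for the read-only data source, and all names are
assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16
#define SKETCH_NR_PAGES  4

struct cached_page {
	bool uptodate;			/* data has been filled in */
	unsigned char data[SKETCH_PAGE_SIZE];
};

static struct cached_page cache[SKETCH_NR_PAGES];
static unsigned int fill_count;		/* how often the slow path ran */

/* Stand-in for the slow read of old memory: page i is filled with 'A' + i. */
static void slow_read(unsigned char *buf, size_t index)
{
	memset(buf, (int)('A' + index), SKETCH_PAGE_SIZE);
	fill_count++;
}

/* Return the page for index, filling it on first access only. */
static struct cached_page *fault_page(size_t index)
{
	struct cached_page *page;

	if (index >= SKETCH_NR_PAGES)
		return NULL;
	page = &cache[index];		/* existing or new page */
	if (!page->uptodate) {		/* not filled yet: take the slow path */
		slow_read(page->data, index);
		page->uptodate = true;
	}
	return page;
}
```

A second fault on the same index returns the cached page without calling
slow_read() again, which is the point of the uptodate flag.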

Signed-off-by: Michael Holzheu
Acked-by: Vivek Goyal
Cc: HATAYAMA Daisuke
Cc: Jan Willeke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Holzheu
2013-09-12 06:59:10 +0800
97b0f6f9c s390/vmcore: use ELF header in new memory feature

Exchange the old relocate mechanism with the new arch function call
override mechanism that allows the ELF core header to be created in the
2nd kernel.

Signed-off-by: Michael Holzheu
Cc: HATAYAMA Daisuke
Cc: Jan Willeke
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Holzheu
2013-09-12 06:59:10 +0800
be8a8d069 vmcore: introduce ELF header in new memory feature

For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
(zfcpdump). We have support where the first HSA_SIZE bytes are saved into
a hypervisor owned memory area (HSA) before the kdump kernel is booted.
When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.

The advantages of this mechanism are:

* No crashkernel memory has to be defined in the old kernel.
* Early boot problems (before kexec_load has been done) can be dumped
* Non-Linux systems can be dumped.

We modify the s390 copy_oldmem_page() function to read from the HSA memory
if memory below HSA_SIZE bytes is requested.

Since we cannot use the kexec tool to load the kernel in this scenario,
we have to build the ELF header in the 2nd (kdump/new) kernel.

So with the following patch set we would like to introduce a new
mechanism that allows the ELF header for /proc/vmcore to be created in
the 2nd kernel memory.

The following steps are done during zfcpdump execution:

1. Production system crashes
2. User boots a SCSI disk that has been prepared with the zfcpdump tool
3. Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
4. Boot loader loads kernel into low memory area
5. Kernel boots and uses only HSA_SIZE bytes of memory
6. Kernel saves registers of non-boot CPUs
7. Kernel does memory detection for dump memory map
8. Kernel creates ELF header for /proc/vmcore
9. /proc/vmcore uses this header for initialization
10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
    - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
    - copy_oldmem_page() copies from real memory for memory above HSA_SIZE
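
The two copy rules in step 10 can be sketched in userspace as follows. The
function name, the buffer-based parameters and the handling of a request
that crosses the boundary are assumptions for illustration, not the s390
implementation.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy count bytes of old memory starting at address src into dst.
 * Addresses below hsa_size are served from the hsa buffer; addresses at
 * or above it are served from real_mem, which is indexed by absolute
 * address.
 */
static void copy_oldmem_sketch(unsigned char *dst, size_t src, size_t count,
			       const unsigned char *hsa, size_t hsa_size,
			       const unsigned char *real_mem)
{
	if (src < hsa_size) {
		size_t from_hsa = hsa_size - src;

		if (from_hsa > count)
			from_hsa = count;
		memcpy(dst, hsa + src, from_hsa);
		dst += from_hsa;
		src += from_hsa;
		count -= from_hsa;
	}
	if (count)
		memcpy(dst, real_mem + src, count);
}
```

The low part of real memory is never read in this sketch, which reflects
that the dump kernel itself occupies it and the saved copy lives in the HSA.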

Currently for s390 we create the ELF core header in the 2nd kernel with a
small trick. We relocate the addresses in the ELF header in a way that
for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
and the read_from_oldmem() returns the correct data. This allows the
/proc/vmcore code to use the ELF header in the 2nd kernel.

This patch:

Exchange the old mechanism with the new and much cleaner function call
override feature that now officially allows the ELF core header to be
created in the 2nd kernel.

To use the new feature the following functions have to be defined
by the architecture backend code to read from new memory:

* elfcorehdr_alloc: Allocate ELF header
* elfcorehdr_free: Free the memory of the ELF header
* elfcorehdr_read: Read from ELF header
* elfcorehdr_read_notes: Read from ELF notes

Signed-off-by: Michael Holzheu
Acked-by: Vivek Goyal
Cc: HATAYAMA Daisuke
Cc: Jan Willeke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Holzheu
2013-09-12 06:59:10 +0800
80c74f6a4 kexec: remove unnecessary return

Execution can never reach this point, so remove the unnecessary return.

Signed-off-by: Xishi Qiu
Suggested-by: Zhang Yanfei
Reviewed-by: Simon Horman
Reviewed-by: Zhang Yanfei
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xishi Qiu
2013-09-12 06:59:10 +0800
6b3c538f5 exec: cleanup the error handling in search_binary_handler()

The error handling and ret-from-loop look confusing and inconsistent.

- "retval >= 0" simply returns

- "!bprm->file" returns too but with read_unlock() because
  binfmt_lock was already re-acquired

- "retval != -ENOEXEC || bprm->mm == NULL" does "break" and
  relies on the same check after the main loop

Consolidate these checks into a single if/return statement.
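
As an illustration, the three conditions listed above collapse into one
predicate. The function name and the boolean parameters are hypothetical;
the kernel tests bprm->mm and bprm->file directly.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * True when the handler loop should stop and hand retval back to the
 * caller: the handler succeeded, failed with something other than
 * -ENOEXEC, or left the bprm without an mm or a file.
 */
static bool search_is_done(int retval, bool has_mm, bool has_file)
{
	return retval >= 0 || retval != -ENOEXEC || !has_mm || !has_file;
}
```

Only a plain -ENOEXEC with both mm and file still present lets the loop
move on to the next binary format.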

need_retry still checks "retval == -ENOEXEC", but this and -ENOENT before
the main loop are not needed. This is only for pathological and
impossible list_empty(&formats) case.

It is not clear why we check "bprm->mm == NULL"; probably this
should be removed.

Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Evgeniy Polyakov
Cc: Zach Levis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:09 +0800
4e0621a07 exec: don't retry if request_module() fails

A separate one-liner for better documentation.

It doesn't make sense to retry if request_module() fails to exec
/sbin/modprobe, so add the additional "request_module() < 0" check.

However, this logic still doesn't look exactly right:

1. It would be better to check "request_module() != 0", the user
   space modprobe process should report the correct exit code.
   But I didn't dare to add the user-visible change.

2. The whole ENOEXEC logic looks suboptimal. Suppose that we try
   to exec a "#!path-to-unsupported-binary" script. In this case
   request_module() + "retry" will be done twice: first by the
   "depth == 1" code, and then again by the "depth == 0" caller
   which doesn't make sense.

3. And note that in the case above bprm->buf was already changed
   by load_script()->prepare_binprm(), so this looks even more
   ugly.

Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Evgeniy Polyakov
Cc: Zach Levis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:07 +0800
cb7b6b1cb exec: cleanup the CONFIG_MODULES logic

search_binary_handler() uses "for (try=0; try
Acked-by: Kees Cook
Cc: Al Viro
Cc: Evgeniy Polyakov
Cc: Zach Levis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:05 +0800
92eaa565a exec: kill ->load_binary != NULL check in search_binary_handler()

search_binary_handler() checks ->load_binary != NULL for no reason; this
method should always be defined. Turn this check into WARN_ON() and move
it into __register_binfmt().

Also, kill the function pointer. The current code looks confusing, as if
->load_binary can go away after read_unlock(&binfmt_lock). But we rely on
module_get(fmt->module), this fmt can't be changed or unregistered,
otherwise this code is buggy anyway.

Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Evgeniy Polyakov
Cc: Zach Levis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:05 +0800
52f14282b exec: move allow_write_access/fput to exec_binprm()

When search_binary_handler() succeeds it does allow_write_access() and
fput(), then it clears bprm->file to ensure the caller will not do the
same.

We can simply move this code to exec_binprm() which is called only once.
In fact we could move this to free_bprm() and remove the same code in
do_execve_common's error path.

Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Evgeniy Polyakov
Cc: Zach Levis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:05 +0800
9beb266f2 exec: proc_exec_connector() should be called only once

A separate one-liner with the minor fix.

PROC_EVENT_EXEC reports the "exec" event, but this message is sent at
least twice if search_binary_handler() is called by ->load_binary()
recursively, say, load_script().

Move it to exec_binprm(), this is "depth == 0" code too.

Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Evgeniy Polyakov
Cc: Zach Levis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:05 +0800
131b2f9f1 exec: kill "int depth" in search_binary_handler()

Nobody except search_binary_handler() should touch ->recursion_depth, "int
depth" buys nothing but complicates the code, kill it.

Probably we should also kill "fn" and the !NULL check, ->load_binary
should be always defined. And it can not go away after read_unlock() or
this code is buggy anyway.

Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Evgeniy Polyakov
Cc: Zach Levis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:04 +0800
5d1baf3b6 exec: introduce exec_binprm() for "depth == 0" code

task_pid_nr_ns() and trace/ptrace code in the middle of the recursive
search_binary_handler() looks confusing and imho annoying. We only need
this code if "depth == 0", so let's add a simple helper which calls
search_binary_handler() and does trace_sched_process_exec() +
ptrace_event().

The patch also moves the setting of task->did_exec, we need to do this
only once.

Note: we can kill either task->did_exec or PF_FORKNOEXEC.

Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Evgeniy Polyakov
Cc: Zach Levis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:03 +0800
96d0df79f proc: make proc_fd_permission() thread-friendly

proc_fd_permission() says "process can still access /proc/self/fd after it
has executed a setuid()", but the "task_pid() = proc_pid()" check only
helps if the task is group leader, /proc/self points to
/proc/.

Change this check to use task_tgid() so that the whole thread group can
access its /proc/self/fd or /proc//fd.

Notes:
- CLONE_THREAD does not require CLONE_FILES so task->files
can differ, but I don't think this can lead to any security
problem. And this matches same_thread_group() in
__ptrace_may_access().

- /proc/self should probably point to /proc/, but
it is too late to change the rules. Perhaps it makes sense
to add /proc/thread though.

Test-case:

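#include <assert.h>
#include <dirent.h>
#include <pthread.h>
#include <stddef.h>

/* Runs in a non-leader thread and opens the fd directory via /proc/self. */
void *tfunc(void *arg)
{
	assert(opendir("/proc/self/fd"));
	return NULL;
}

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, tfunc, NULL);
	pthread_join(t, NULL);
	return 0;
}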

fails if, say, this executable is not readable and suid_dumpable = 0.

Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-09-12 06:59:03 +0800