Eric Lee / smarc-fsl-linux-kernel

26 Jun, 2018

1 commit

250edf958 fs/binfmt_misc.c: do not allow offset overflow ... Browse Code »

commit 5cc41e099504b77014358b58567c5ea6293dd220 upstream.

WHen registering a new binfmt_misc handler, it is possible to overflow
the offset to get a negative value, which might crash the system, or
possibly leak kernel data.

Here is a crash log when 2500000000 was used as an offset:

BUG: unable to handle kernel paging request at ffff989cfd6edca0
IP: load_misc_binary+0x22b/0x470 [binfmt_misc]
PGD 1ef3e067 P4D 1ef3e067 PUD 0
Oops: 0000 [#1] SMP NOPTI
Modules linked in: binfmt_misc kvm_intel ppdev kvm irqbypass joydev input_leds serio_raw mac_hid parport_pc qemu_fw_cfg parpy
CPU: 0 PID: 2499 Comm: bash Not tainted 4.15.0-22-generic #24-Ubuntu
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
RIP: 0010:load_misc_binary+0x22b/0x470 [binfmt_misc]
Call Trace:
search_binary_handler+0x97/0x1d0
do_execveat_common.isra.34+0x667/0x810
SyS_execve+0x31/0x40
do_syscall_64+0x73/0x130
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Use kstrtoint instead of simple_strtoul. It will work as the code
already set the delimiter byte to '\0' and we only do it when the field
is not empty.

Tested with offsets -1, 2500000000, UINT_MAX and INT_MAX. Also tested
with examples documented at Documentation/admin-guide/binfmt-misc.rst
and other registrations from packages on Ubuntu.

Link: http://lkml.kernel.org/r/20180529135648.14254-1-cascardo@canonical.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Thadeu Lima de Souza Cascardo
Reviewed-by: Andrew Morton
Cc: Alexander Viro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Thadeu Lima de Souza Cascardo
2018-06-26 08:06:33 +0800

14 Oct, 2017

1 commit

7e8660060 fs/binfmt_misc.c: node could be NULL when evicting inode ... Browse Code »

inode->i_private is assigned by a Node pointer only after registering a
new binary format, so it could be NULL if inode was created by
bm_fill_super() (or iput() was called by the error path in
bm_register_write()), and this could result in NULL pointer dereference
when evicting such an inode. e.g. mount binfmt_misc filesystem then
umount it immediately:

mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
umount /proc/sys/fs/binfmt_misc

will result in

BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
...
Call Trace:
evict+0xd3/0x1a0
iput+0x17d/0x1d0
dentry_unlink_inode+0xb9/0xf0
__dentry_kill+0xc7/0x170
shrink_dentry_list+0x122/0x280
shrink_dcache_parent+0x39/0x90
do_one_tree+0x12/0x40
shrink_dcache_for_umount+0x2d/0x90
generic_shutdown_super+0x1f/0x120
kill_litter_super+0x29/0x40
deactivate_locked_super+0x43/0x70
deactivate_super+0x45/0x60
cleanup_mnt+0x3f/0x70
__cleanup_mnt+0x12/0x20
task_work_run+0x86/0xa0
exit_to_usermode_loop+0x6d/0x99
syscall_return_slowpath+0xba/0xf0
entry_SYSCALL_64_fastpath+0xa3/0xa

Fix it by making sure Node (e) is not NULL.

Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
Fixes: 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
Signed-off-by: Eryu Guan
Acked-by: Oleg Nesterov
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eryu Guan
2017-10-14 07:18:33 +0800

04 Oct, 2017

5 commits

50097f749 exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array ... Browse Code »

After the previous change "fmt" can't go away, we can kill
iname/iname_addr and use fmt->interpreter.

Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc:
Cc: Travis Gummels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
43a4f2619 exec: binfmt_misc: fix race between load_misc_binary() and kill_node() ... Browse Code »

load_misc_binary() makes a local copy of fmt->interpreter under
entries_lock to avoid the race with kill_node() but this is not enough;
the whole Node can be freed after we drop entries_lock, not only the
->interpreter string.

Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
this Node.

Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc: Travis Gummels
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
eb23aa031 exec: binfmt_misc: remove the confusing e->interp_file != NULL checks ... Browse Code »

If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we
have a bug which should not be silently ignored.

Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc:
Cc: Travis Gummels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
83f918274 exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode() ... Browse Code »

To ensure that load_misc_binary() can't use the partially destroyed
Node, see also the next patch.

The current logic looks wrong in any case, once we close interp_file it
doesn't make any sense to delay kfree(inode->i_private), this Node is no
longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were
not racy (they are), load_misc_binary() should not try to reopen
->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.

And I can't understand why do we use filp_close(), not fput().

Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc:
Cc: Travis Gummels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
baba1b297 exec: binfmt_misc: don't nullify Node->dentry in kill_node() ... Browse Code »

kill_node() nullifies/checks Node->dentry to avoid double free. This
complicates the next changes and this is very confusing:

- we do not need to check dentry != NULL under entries_lock,
kill_node() is always called under inode_lock(d_inode(root)) and we
rely on this inode_lock() anyway, without this lock the
MISC_FMT_OPEN_FILE cleanup could race with itself.

- if kill_inode() was already called and ->dentry == NULL we should not
even try to close e->interp_file.

We can change bm_entry_write() to simply check !list_empty(list) before
kill_node. Again, we rely on inode_lock(), in particular it saves us
from the race with bm_status_write(), another caller of kill_node().

Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc:
Cc: Travis Gummels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800

05 Sep, 2017

1 commit

bdd1d2d3d fs: fix kernel_read prototype ... Browse Code »

Use proper ssize_t and size_t types for the return value and count
argument, move the offset last and make it an in/out argument like
all other read/write helpers, and make the buf argument a void pointer
to get rid of lots of casts in the callers.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2017-09-05 07:05:15 +0800

27 Apr, 2017

1 commit

cda37124f fs: constify tree_descr arrays passed to simple_fill_super() ... Browse Code »

simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory. Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.

Signed-off-by: Eric Biggers
Signed-off-by: Al Viro

Eric Biggers
2017-04-27 11:54:06 +0800

02 Mar, 2017

1 commit

589ee6284 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h> ... Browse Code »

Update code that relied on sched.h including various MM types for them.

This will allow us to remove the include from .

Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2017-03-02 15:42:37 +0800

28 Sep, 2016

1 commit

c2050a454 fs: Replace current_fs_time() with current_time() ... Browse Code »

current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.

Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.

Signed-off-by: Deepa Dinamani
Signed-off-by: Al Viro

Deepa Dinamani
2016-09-28 09:06:22 +0800

07 Aug, 2016

1 commit

e9d488c31 Merge tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc ... Browse Code »

Pull binfmt_misc update from James Bottomley:
"This update is to allow architecture emulation containers to function
such that the emulation binary can be housed outside the container
itself. The container and fs parts both have acks from relevant
experts.

To use the new feature you have to add an F option to your binfmt_misc
configuration"

From the docs:
"The usual behaviour of binfmt_misc is to spawn the binary lazily when
the misc format file is invoked. However, this doesn't work very well
in the face of mount namespaces and changeroots, so the F mode opens
the binary as soon as the emulation is installed and uses the opened
image to spawn the emulator, meaning it is always available once
installed, regardless of how the environment changes"

* tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc:
binfmt_misc: add F option description to documentation
binfmt_misc: add persistent opened binary handler for containers
fs: add filp_clone_open API

Linus Torvalds
2016-08-07 22:13:14 +0800

30 May, 2016

1 commit

ea7d4c046 binfmt_misc: ->s_root is not going anywhere ... Browse Code »

... no need to dget/dput it.

Signed-off-by: Al Viro

Al Viro
2016-05-30 07:14:03 +0800

31 Mar, 2016

1 commit

948b701a6 binfmt_misc: add persistent opened binary handler for containers ... Browse Code »

This patch adds a new flag 'F' to the binfmt handlers. If you pass in
'F' the binary that runs the emulation will be opened immediately and
in future, will be cloned from the open file.

The net effect is that the handler survives both changeroots and mount
namespace changes, making it easy to work with foreign architecture
containers without contaminating the container image with the
emulator.

Signed-off-by: James Bottomley
Acked-by: Serge Hallyn

James Bottomley
2016-03-31 05:12:22 +0800

23 Jan, 2016

1 commit

5955102c9 wrappers for ->i_mutex access ... Browse Code »

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro

Al Viro
2016-01-23 07:04:28 +0800

27 Apr, 2015

1 commit

9ec3a646f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only

Linus Torvalds
2015-04-27 08:22:07 +0800

17 Apr, 2015

1 commit

6ceafb880 fs/binfmt_misc.c: simplify entry_status() ... Browse Code »

sprintf() reliably returns the number of characters printed, so we don't
need to ask strlen() where we are. Also replace calling sprintf("%02x")
in a loop with the much simpler bin2hex().

[akpm@linux-foundation.org: it's odd to include kernel.h after everything else]
Signed-off-by: Rasmus Villemoes
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rasmus Villemoes
2015-04-17 21:03:59 +0800

16 Apr, 2015

1 commit

75c3cfa85 VFS: assorted weird filesystems: d_inode() annotations ... Browse Code »

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2015-04-16 03:06:58 +0800

17 Dec, 2014

1 commit

7d65cf10e unfuck binfmt_misc.c (broken by commit e6084d4 ) ... Browse Code »

scanarg(s, del) never returns s; the empty field results in s + 1.
Restore the correct checks, and move NUL-termination into scanarg(),
while we are at it.

Incidentally, mixing "coding style cleanups" (for small values of cleanup)
with functional changes is a Bad Idea(tm)...

Signed-off-by: Al Viro

Al Viro
2014-12-17 21:27:14 +0800

14 Dec, 2014

1 commit

51f39a1f0 syscalls: implement execveat() system call ... Browse Code »

This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.

Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.

Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).

Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.

This patch (of 4):

Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/" (and
so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/"
(for an empty filename) or "/dev/fd//", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).

Based on patches by Meredydd Luff.

Signed-off-by: David Drysdale
Cc: Meredydd Luff
Cc: Shuah Khan
Cc: "Eric W. Biederman"
Cc: Andy Lutomirski
Cc: Alexander Viro
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Kees Cook
Cc: Arnd Bergmann
Cc: Rich Felker
Cc: Christoph Hellwig
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Drysdale
2014-12-14 04:42:51 +0800

11 Dec, 2014

4 commits

f7e1ad1a1 fs/binfmt_misc.c: use GFP_KERNEL instead of GFP_USER ... Browse Code »

GFP_USER means "honour cpuset nodes-allowed beancounting". These are
regular old kernel objects and there seems no reason to give them this
treatment.

Acked-by: Mike Frysinger
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2014-12-11 09:41:12 +0800
e6084d4a0 binfmt_misc: clean up code style a bit ... Browse Code »

Clean up various coding style issues that checkpatch complains about.
No functional changes here.

Signed-off-by: Mike Frysinger
Cc: Al Viro
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2014-12-11 09:41:12 +0800
6b899c4e9 binfmt_misc: add comments & debug logs ... Browse Code »

When trying to develop a custom format handler, the errors returned all
effectively get bucketed as EINVAL with no kernel messages. The other
errors (ENOMEM/EFAULT) are internal/obvious and basic. Thus any time a
bad handler is rejected, the developer has to walk the dense code and
try to guess where it went wrong. Needing to dive into kernel code is
itself a fairly high barrier for a lot of people.

To improve this situation, let's deploy extensive pr_debug markers at
logical parse points, and add comments to the dense parsing logic. It
let's you see exactly where the parsing aborts, the string the kernel
received (useful when dealing with shell code), how it translated the
buffers to binary data, and how it will apply the mask at runtime.

Some example output:
$ echo ':qemu-foo:M::\x7fELF\xAD\xAD\x01\x00:\xff\xff\xff\xff\xff\x00\xff\x00:/usr/bin/qemu-foo:POC' > register
$ dmesg
binfmt_misc: register: received 92 bytes
binfmt_misc: register: delim: 0x3a {:}
binfmt_misc: register: name: {qemu-foo}
binfmt_misc: register: type: M (magic)
binfmt_misc: register: offset: 0x0
binfmt_misc: register: magic[raw]: 5c 78 37 66 45 4c 46 5c 78 41 44 5c 78 41 44 5c \x7fELF\xAD\xAD\
binfmt_misc: register: magic[raw]: 78 30 31 5c 78 30 30 00 x01\x00.
binfmt_misc: register: mask[raw]: 5c 78 66 66 5c 78 66 66 5c 78 66 66 5c 78 66 66 \xff\xff\xff\xff
binfmt_misc: register: mask[raw]: 5c 78 66 66 5c 78 30 30 5c 78 66 66 5c 78 30 30 \xff\x00\xff\x00
binfmt_misc: register: mask[raw]: 00 .
binfmt_misc: register: magic/mask length: 8
binfmt_misc: register: magic[decoded]: 7f 45 4c 46 ad ad 01 00 .ELF....
binfmt_misc: register: mask[decoded]: ff ff ff ff ff 00 ff 00 ........
binfmt_misc: register: magic[masked]: 7f 45 4c 46 ad 00 01 00 .ELF....
binfmt_misc: register: interpreter: {/usr/bin/qemu-foo}
binfmt_misc: register: flag: P (preserve argv0)
binfmt_misc: register: flag: O (open binary)
binfmt_misc: register: flag: C (preserve creds)

The [raw] lines show us exactly what was received from userspace. The
lines after that show us how the kernel has decoded things.

Signed-off-by: Mike Frysinger
Cc: Al Viro
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2014-12-11 09:41:12 +0800
c6cb898b5 binfmt_misc: replace get_unused_fd() with get_unused_fd_flags(0) ... Browse Code »

This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

In a further patch, get_unused_fd() will be removed so that new code start
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or choosen by userspace.

Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yann Droneaud
2014-12-11 09:41:10 +0800

14 Oct, 2014

2 commits

de8288b1f binfmt_misc: work around gcc-4.9 warning ... Browse Code »

gcc-4.9 on ARM gives us a mysterious warning about the binfmt_misc
parse_command function:

fs/binfmt_misc.c: In function 'parse_command.part.3':
fs/binfmt_misc.c:405:7: warning: array subscript is above array bounds [-Warray-bounds]

I've managed to trace this back to the ARM implementation of memset,
which is called from copy_from_user in case of a fault and which does

#define memset(p,v,n) \
({ \
void *__p = (p); size_t __n = n; \
if ((__n) != 0) { \
if (__builtin_constant_p((v)) && (v) == 0) \
__memzero((__p),(__n)); \
else \
memset((__p),(v),(__n)); \
} \
(__p); \
})

Apparently gcc gets confused by the check for "size != 0" and believes
that the size might be zero when it gets to the line that does "if
(s[count-1] == '\n')", so it would access data outside of the array.

gcc is clearly wrong here, since this condition was already checked
earlier in the function and the 'size' value can not change in the
meantime.

Fortunately, we can work around it and get rid of the warning by
rearranging the function to check for zero size after doing the
copy_from_user. It is still safe to pass a zero size into
copy_from_user, so it does not cause any side effects.

Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arnd Bergmann
2014-10-14 08:18:16 +0800
bbaecc088 binfmt_misc: expand the register format limit to 1920 bytes ... Browse Code »

The current code places a 256 byte limit on the registration format.
This ends up being fairly limited when you try to do matching against a
binary format like ELF:

- the magic & mask formats cannot have any embedded NUL chars
(string_unescape_inplace halts at the first NUL)
- each escape sequence quadruples the size: \x00 is needed for NUL
- trying to match bytes at the start of the file as well as further
on leads to a lot of \x00 sequences in the mask
- magic & mask have to be the same length (when decoded)
- still need bytes for the other fields
- impossible!

Let's look at a concrete (and common) example: using QEMU to run MIPS
ELFs. The name field uses 11 bytes "qemu-mipsel". The interp uses 20
bytes "/usr/bin/qemu-mipsel". The type & flags takes up 4 bytes. We
need 7 bytes for the delimiter (usually ":"). We can skip offset. So
already we're down to 107 bytes to use with the magic/mask instead of
the real limit of 128 (BINPRM_BUF_SIZE). If people use shell code to
register (which they do the majority of the time), they're down to ~26
possible bytes since the escape sequence must be \x##.

The ELF format looks like (both 32 & 64 bit):

e_ident: 16 bytes
e_type: 2 bytes
e_machine: 2 bytes

Those 20 bytes are enough for most architectures because they have so few
formats in the first place, thus they can be uniquely identified. That
also means for shell users, since 20 is smaller than 26, they can sanely
register a handler.

But for some targets (like MIPS), we need to poke further. The ELF fields
continue on:

e_entry: 4 or 8 bytes
e_phoff: 4 or 8 bytes
e_shoff: 4 or 8 bytes
e_flags: 4 bytes

We only care about e_flags here as that includes the bits to identify
whether the ELF is O32/N32/N64. But now we have to consume another 16
bytes (for 32 bit ELFs) or 28 bytes (for 64 bit ELFs) just to match the
flags. If every byte is escaped, we send 288 more bytes to the kernel
((20 {e_ident,e_type,e_machine} + 12 {e_entry,e_phoff,e_shoff} + 4
{e_flags}) * 2 {mask,magic} * 4 {escape}) and we've clearly blown our
budget.

Even if we try to be clever and do the decoding ourselves (rather than
relying on the kernel to process \x##), we still can't hit the mark --
string_unescape_inplace treats mask & magic as C strings so NUL cannot
be embedded. That leaves us with having to pass \x00 for the 12/24
entry/phoff/shoff bytes (as those will be completely random addresses),
and that is a minimum requirement of 48/96 bytes for the mask alone.
Add up the rest and we blow through it (this is for 64 bit ELFs):
magic: 20 {e_ident,e_type,e_machine} + 24 {e_entry,e_phoff,e_shoff} +
4 {e_flags} = 48 # ^^ See note below.
mask: 20 {e_ident,e_type,e_machine} + 96 {e_entry,e_phoff,e_shoff} +
4 {e_flags} = 120
Remember above we had 107 left over, and now we're at 168. This is of
course the *best* case scenario -- you'll also want to have NUL bytes
in the magic & mask too to match literal zeros.

Note: the reason we can use 24 in the magic is that we can work off of the
fact that for bytes the mask would clobber, we can stuff any value into
magic that we want. So when mask is \x00, we don't need the magic to also
be \x00, it can be an unescaped raw byte like '!'. This lets us handle
more formats (barely) under the current 256 limit, but that's a pretty
tall hoop to force people to jump through.

With all that said, let's bump the limit from 256 bytes to 1920. This way
we support escaping every byte of the mask & magic field (which is 1024
bytes by themselves -- 128 * 4 * 2), and we leave plenty of room for other
fields. Like long paths to the interpreter (when you have source in your
/really/long/homedir/qemu/foo). Since the current code stuffs more than
one structure into the same buffer, we leave a bit of space to easily
round up to 2k. 1920 is just as arbitrary as 256 ;).

Signed-off-by: Mike Frysinger
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2014-10-14 08:18:15 +0800

04 Apr, 2014

1 commit

b003f9650 binfmt_misc: add missing 'break' statement ... Browse Code »

A missing 'break' statement in bm_status_write() results in a user program
receiving '3' when doing the following:

write(fd, "-1", 2);

Signed-off-by: Luis Henriques
Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luis Henriques
2014-04-04 07:21:16 +0800

01 May, 2013

1 commit

8d82e180b binfmt_misc: reuse string_unescape_inplace() ... Browse Code »

There is string_unescape_inplace() function which decodes strings in generic
way. Let's use it.

Signed-off-by: Andy Shevchenko
Cc: Alexander Viro
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds

Andy Shevchenko
2013-05-01 08:04:03 +0800

04 Mar, 2013

1 commit

7f78e0351 fs: Limit sys_mount to only request filesystem modules. ... Browse Code »

Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Acked-by: Serge Hallyn
Acked-by: Kees Cook
Reported-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-03-04 11:36:31 +0800

23 Feb, 2013

1 commit

496ad9aa8 new helper: file_inode(file) ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-02-23 12:31:31 +0800

21 Dec, 2012

1 commit

b66c59840 exec: do not leave bprm->interp on stack ... Browse Code »

If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.

Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules. Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted. They leave bprm->interp
pointing to their local stack. This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.

After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules. As
such, we need to protect the changes to interp.

This changes the logic to require allocation for any changes to the
bprm->interp. To avoid adding a new kmalloc to every exec, the default
value is left as-is. Only when passing through binfmt_script or
binfmt_misc does an allocation take place.

For a proof of concept, see DoTest.sh from:

http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/

Signed-off-by: Kees Cook
Cc: halfdog
Cc: P J P
Cc: Alexander Viro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2012-12-21 09:40:19 +0800

18 Dec, 2012

1 commit

d74026986 exec: use -ELOOP for max recursion depth ... Browse Code »

To avoid an explosion of request_module calls on a chain of abusive
scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
as maximum recursion depth is hit, the error will fail all the way back
up the chain, aborting immediately.

This also has the side-effect of stopping the user's shell from attempting
to reexecute the top-level file as a shell script. As seen in the
dash source:

if (cmd != path_bshell && errno == ENOEXEC) {
*argv-- = cmd;
*argv = cmd = path_bshell;
goto repeat;
}

The above logic was designed for running scripts automatically that lacked
the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
things continue to behave as the shell expects.

Additionally, when tracking recursion, the binfmt handlers should not be
involved. The recursion being tracked is the depth of calls through
search_binary_handler(), so that function should be exclusively responsible
for tracking the depth.

Signed-off-by: Kees Cook
Cc: halfdog
Cc: P J P
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2012-12-18 09:15:23 +0800

29 Nov, 2012

2 commits

71613c3b8 get rid of pt_regs argument of ->load_binary() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-11-29 10:53:38 +0800
3c456bfc4 get rid of pt_regs argument of search_binary_handler() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-11-29 10:53:38 +0800

06 May, 2012

1 commit

dbd5768f8 vfs: Rename end_writeback() to clear_inode() ... Browse Code »

After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara
Signed-off-by: Fengguang Wu

Jan Kara
2012-05-06 13:43:41 +0800

24 Mar, 2012

1 commit

b502bd115 magic.h: move some FS magic numbers into magic.h ... Browse Code »

- Move open-coded filesystem magic numbers into magic.h

- Rearrange magic.h so that the filesystem-related constants are grouped
together.

Signed-off-by: Muthukumar R
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Muthu Kumar
2012-03-24 07:58:31 +0800

21 Mar, 2012

1 commit

8fc3dc5a3 __register_binfmt() made void ... Browse Code »

Just don't pass NULL to it - nobody does, anyway.

Signed-off-by: Al Viro

Al Viro
2012-03-21 09:29:46 +0800

07 Jan, 2012

1 commit

d8c9584ea vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-07 12:16:53 +0800

02 Nov, 2011

1 commit

6d6b77f16 filesystems: add missing nlink wrappers ... Browse Code »

Replace direct i_nlink updates with the respective updater function
(inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count).

Signed-off-by: Miklos Szeredi

Miklos Szeredi
2011-11-02 19:53:43 +0800

20 Jul, 2011

1 commit

1b5d783c9 consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling ... Browse Code »

new helper: would_dump(bprm, file). Checks if we are allowed to
read the file and if we are not - sets ENFORCE_NODUMP. Exported,
used in places that previously open-coded the same logics.

Signed-off-by: Al Viro

Al Viro
2011-07-20 13:43:10 +0800