Eric Lee / smarc-fsl-linux-kernel

13 Jun, 2013

1 commit

cf7df378a reboot: rigrate shutdown/reboot to boot cpu ... Browse Code »

We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus. The slowdown was tracked to commit
f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
kernel_restart()").

The current implementation does all the work of hot removing the cpus
before halting the system. We are switching to just migrating to the
boot cpu and then continuing with shutdown/reboot.

This also has the effect of not breaking x86's command line parameter
for specifying the reboot cpu. Note, this code was shamelessly copied
from arch/x86/kernel/reboot.c with bits removed pertaining to the
reboot_cpu command line parameter.

Signed-off-by: Robin Holt
Tested-by: Shawn Guo
Cc: "Srivatsa S. Bhat"
Cc: H. Peter Anvin
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Russ Anderson
Cc: Robin Holt
Cc: Russell King
Cc: Guan Xuetao
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
12 years ago

01 May, 2013

3 commits

08d767608 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions

Linus Torvalds
12 years ago
52b369415 kernel/sys.c: make prctl(PR_SET_MM) generally available ... Browse Code »

The purpose of this patch is to allow privileged processes to set
their own per-memory memory-region fields:

start_code, end_code, start_data, end_data, start_brk, brk,
start_stack, arg_start, arg_end, env_start, env_end.

This functionality is needed by any application or package that needs to
reconstruct Linux processes, that is, to start them in any way other than
by means of an "execve()" from an executable file. This includes:

1. Restoring processes from a checkpoint-file (by all potential
user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
and high-availablity).
4. Starting a process from an executable format that is not supported
by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or crypted
executable that, for confidentiality, licensing or other reasons,
may not be written to the local file-systems.

The code that does that was already included in the Linux kernel by the
CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
normally disabled. The patch removes those ifdefs.

Signed-off-by: Amnon Shiloh
Cc: Cyrill Gorcunov
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Amnon Shiloh
12 years ago
4a22f1663 kernel/timer.c: move some non timer related syscalls to kernel/sys.c ... Browse Code »

Andrew Morton noted:

akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
SYSCALL_DEFINE1(alarm, unsigned int, seconds)
SYSCALL_DEFINE0(getpid)
SYSCALL_DEFINE0(getppid)
SYSCALL_DEFINE0(getuid)
SYSCALL_DEFINE0(geteuid)
SYSCALL_DEFINE0(getgid)
SYSCALL_DEFINE0(getegid)
SYSCALL_DEFINE0(gettid)
SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

Only one of those should be in kernel/timer.c. Who wrote this thing?

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Rothwell
Acked-by: Thomas Gleixner
Cc: Guenter Roeck
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stephen Rothwell
12 years ago

09 Apr, 2013

1 commit

6f389a8f1 PM / reboot: call syscore_shutdown() after disable_nonboot_cpus() ... Browse Code »

As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core
subsystems PM) say, syscore_ops operations should be carried with one
CPU on-line and interrupts disabled. However, after commit f96972f2d
(kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()),
syscore_shutdown() is called before disable_nonboot_cpus(), so break
the rules. We have a MIPS machine with a 8259A PIC, and there is an
external timer (HPET) linked at 8259A. Since 8259A has been shutdown
too early (by syscore_shutdown()), disable_nonboot_cpus() runs without
timer interrupt, so it hangs and reboot fails. This patch call
syscore_shutdown() a little later (after disable_nonboot_cpus()) to
avoid reboot failure, this is the same way as poweroff does.

For consistency, add disable_nonboot_cpus() to kernel_halt().

Signed-off-by: Huacai Chen
Cc:
Signed-off-by: Rafael J. Wysocki

Huacai Chen
12 years ago

23 Mar, 2013

1 commit

2ca067efd poweroff: change orderly_poweroff() to use schedule_work() ... Browse Code »

David said:

Commit 6c0c0d4d1080 ("poweroff: fix bug in orderly_poweroff()")
apparently fixes one bug in orderly_poweroff(), but introduces
another. The comments on orderly_poweroff() claim it can be called
from any context - and indeed we call it from interrupt context in
arch/powerpc/platforms/pseries/ras.c for example. But since that
commit this is no longer safe, since call_usermodehelper_fns() is not
safe in interrupt context without the UMH_NO_WAIT option.

orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is
sleepable. Move the "force" logic into __orderly_poweroff() and change
orderly_poweroff() to use the global poweroff_work which simply calls
__orderly_poweroff().

While at it, remove the unneeded "int argc" and change argv_split() to
use GFP_KERNEL.

We use the global "bool poweroff_force" to pass the argument, this can
obviously affect the previous request if it is pending/running. So we
only allow the "false => true" transition assuming that the pending
"true" should succeed anyway. If schedule_work() fails after that we
know that work->func() was not called yet, it must see the new value.

This means that orderly_poweroff() becomes async even if we do not run
the command and always succeeds, schedule_work() can only fail if the
work is already pending. We can export __orderly_poweroff() and change
the non-atomic callers which want the old semantics.

Signed-off-by: Oleg Nesterov
Reported-by: Benjamin Herrenschmidt
Reported-by: David Gibson
Cc: Lucas De Marchi
Cc: Feng Hong
Cc: Kees Cook
Cc: Serge Hallyn
Cc: "Eric W. Biederman"
Cc: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
12 years ago

04 Mar, 2013

1 commit

8d2d5c4a2 switch getrusage() to COMPAT_SYSCALL_DEFINE ... Browse Code »

Signed-off-by: Al Viro

Al Viro
12 years ago

28 Feb, 2013

1 commit

7ff676406 usermodehelper: cleanup/fix __orderly_poweroff() && argv_free() ... Browse Code »

__orderly_poweroff() does argv_free() if call_usermodehelper_fns()
returns -ENOMEM. As Lucas pointed out, this can be wrong if -ENOMEM was
not triggered by the failing call_usermodehelper_setup(), in this case
both __orderly_poweroff() and argv_cleanup() can do kfree().

Kill argv_cleanup() and change __orderly_poweroff() to call argv_free()
unconditionally like do_coredump() does. This info->cleanup() is not
needed (and wrong) since 6c0c0d4d "fix bug in orderly_poweroff() which
did the UMH_NO_WAIT => UMH_WAIT_EXEC change, we can rely on the fact
that CLONE_VFORK can't return until do_execve() succeeds/fails.

Signed-off-by: Oleg Nesterov
Reported-by: Lucas De Marchi
Cc: David Howells
Cc: James Morris
Cc: hongfeng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
12 years ago

27 Feb, 2013

1 commit

d895cb1af Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.

The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.

Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.

PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...

Linus Torvalds
12 years ago

26 Feb, 2013

1 commit

94f2f1423 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.

I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.

There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.

The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.

XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...

Linus Torvalds
12 years ago

23 Feb, 2013

1 commit

496ad9aa8 new helper: file_inode(file) ... Browse Code »

Signed-off-by: Al Viro

Al Viro
12 years ago

22 Feb, 2013

2 commits

f3cbd435b sys_prctl(): coding-style cleanup ... Browse Code »

Remove a tabstop from the switch statement, in the usual fashion. A few
instances of weirdwrapping were removed as a result.

Cc: Chen Gang
Cc: Cyrill Gorcunov
Acked-by: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
12 years ago
7fe5e0429 sys_prctl(): arg2 is unsigned long which is never < 0 ... Browse Code »

arg2 will never < 0, for its type is 'unsigned long'

Also, use the provided macros.

Signed-off-by: Chen Gang
Reported-by: Cyrill Gorcunov
Acked-by: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen Gang
12 years ago

27 Dec, 2012

1 commit

923c75382 userns: Allow unprivileged reboot ... Browse Code »

In a container with its own pid namespace and user namespace, rebooting
the system won't reboot the host, but terminate all the processes in
it and thus have the container shutdown, so it's safe.

Signed-off-by: Li Zefan
Signed-off-by: Eric W. Biederman

Li Zefan
13 years ago

29 Nov, 2012

1 commit

e80d0a1ae cputime: Rename thread_group_times to thread_group_cputime_adjusted ... Browse Code »

We have thread_group_cputime() and thread_group_times(). The naming
doesn't provide enough information about the difference between
these two APIs.

To lower the confusion, rename thread_group_times() to
thread_group_cputime_adjusted(). This name better suggests that
it's a version of thread_group_cputime() that does some stabilization
on the raw cputime values. ie here: scale on top of CFS runtime
stats and bound lower value for monotonicity.

Signed-off-by: Frederic Weisbecker
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Steven Rostedt
Cc: Paul Gortmaker

Frederic Weisbecker
13 years ago

20 Oct, 2012

2 commits

31fd84b95 use clamp_t in UNAME26 fix ... Browse Code »

The min/max call needed to have explicit types on some architectures
(e.g. mn10300). Use clamp_t instead to avoid the warning:

kernel/sys.c: In function 'override_release':
kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default]

Reported-by: Fengguang Wu
Signed-off-by: Kees Cook
Signed-off-by: Linus Torvalds

Kees Cook
13 years ago
2702b1526 kernel/sys.c: fix stack memory content leak via UNAME26 ... Browse Code »

Calling uname() with the UNAME26 personality set allows a leak of kernel
stack contents. This fixes it by defensively calculating the length of
copy_to_user() call, making the len argument unsigned, and initializing
the stack buffer to zero (now technically unneeded, but hey, overkill).

CVE-2012-0957

Reported-by: PaX Team
Signed-off-by: Kees Cook
Cc: Andi Kleen
Cc: PaX Team
Cc: Brad Spengler
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
13 years ago

06 Oct, 2012

2 commits

6c0c0d4d1 poweroff: fix bug in orderly_poweroff() ... Browse Code »

orderly_poweroff is trying to poweroff platform in two steps:

step 1: Call user space application to poweroff
step 2: If user space poweroff fail, then do a force power off if force param
is set.

The bug here is, step 1 is always successful with param UMH_NO_WAIT, which obey
the design goal of orderly_poweroff.

We have two choices here:
UMH_WAIT_EXEC which means wait for the exec, but not the process;
UMH_WAIT_PROC which means wait for the process to complete.
we need to trade off the two choices:

If using UMH_WAIT_EXEC, there is potential issue comments by Serge E.
Hallyn: The exec will have started, but may for whatever (very unlikely)
reason fail.

If using UMH_WAIT_PROC, there is potential issue comments by Eric W.
Biederman: If the caller is not running in a kernel thread then we can
easily get into a case where the user space caller will block waiting for
us when we are waiting for the user space caller.

Thanks for their excellent ideas, based on the above discussion, we
finally choose UMH_WAIT_EXEC, which is much more safe, if the user
application really fails, we just complain the application itself, it
seems a better choice here.

Signed-off-by: Feng Hong
Acked-by: Kees Cook
Acked-by: Serge Hallyn
Cc: "Eric W. Biederman"
Acked-by: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

hongfeng
13 years ago
f96972f2d kernel/sys.c: call disable_nonboot_cpus() in kernel_restart() ... Browse Code »

As kernel_power_off() calls disable_nonboot_cpus(), we may also want to
have kernel_restart() call disable_nonboot_cpus(). Doing so can help
machines that require boot cpu be the last alive cpu during reboot to
survive with kernel restart.

This fixes one reboot issue seen on imx6q (Cortex-A9 Quad). The machine
requires that the restart routine be run on the primary cpu rather than
secondary ones. Otherwise, the secondary core running the restart
routine will fail to come to online after reboot.

Signed-off-by: Shawn Guo
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shawn Guo
13 years ago

27 Sep, 2012

2 commits

2903ff019 switch simple cases of fget_light to fdget ... Browse Code »

Signed-off-by: Al Viro

Al Viro
13 years ago
e10ce27f0 switch prctl_set_mm_exe_file() to fget_light() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
13 years ago

31 Jul, 2012

2 commits

b57b44ae6 kernel/sys.c: avoid argv_free(NULL) ... Browse Code »

If argv_split() failed, the code will end up calling argv_free(NULL). Fix
it up and clean things up a bit.

Addresses Coverity report 703573.

Cc: Cyrill Gorcunov
Cc: Kees Cook
Cc: Serge Hallyn
Cc: "Eric W. Biederman"
Cc: WANG Cong
Cc: Alan Cox
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
13 years ago
f1fd75bfa prctl: remove redunant assignment of "error" to zero ... Browse Code »

Just setting the "error" to error number is enough on failure and It
doesn't require to set "error" variable to zero in each switch case,
since it was already initialized with zero. And also removed return 0
in switch case with break statement

Signed-off-by: Sasikantha babu
Acked-by: Kees Cook
Acked-by: Serge E. Hallyn
Cc: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasikantha babu
13 years ago

12 Jul, 2012

1 commit

4229fb1dc c/r: prctl: less paranoid prctl_set_mm_exe_file() ... Browse Code »

"no other files mapped" requirement from my previous patch (c/r: prctl:
update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
paranoid, it forbids operation even if there mapped one shared-anon vma.

Let's check that current mm->exe_file already unmapped, in this case
exe_file symlink already outdated and its changing is reasonable.

Plus, this patch fixes exit code in case operation success.

Signed-off-by: Konstantin Khlebnikov
Reported-by: Cyrill Gorcunov
Tested-by: Cyrill Gorcunov
Cc: Oleg Nesterov
Cc: Matt Helsley
Cc: Kees Cook
Cc: KOSAKI Motohiro
Cc: Tejun Heo
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
13 years ago

21 Jun, 2012

1 commit

5702c5eea c/r: prctl: Move PR_GET_TID_ADDRESS to a proper place ... Browse Code »

During merging of PR_GET_TID_ADDRESS patch the code has been misplaced (it
happened to appear under PR_MCE_KILL) in result noone can use this option.

Fix it by moving code snippet to a proper place.

Signed-off-by: Cyrill Gorcunov
Acked-by: Kees Cook
Cc: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: Andrey Vagin
Cc: Serge Hallyn
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
13 years ago

08 Jun, 2012

4 commits

736f24d5e c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment ... Browse Code »

In commit b76437579d13 ("procfs: mark thread stack correctly in
proc//maps") the stack allocated via clone() is marked in
/proc//maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).

So to be able to restore mm->start_stack/end_stack drop vma flags test,
but still require the underlying VMA to exist.

As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.

Signed-off-by: Cyrill Gorcunov
Cc: Oleg Nesterov
Acked-by: Kees Cook
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
13 years ago
300f786b2 c/r: prctl: add ability to get clear_tid_address ... Browse Code »

Zero is written at clear_tid_address when the process exits. This
functionality is used by pthread_join().

We already have sys_set_tid_address() to change this address for the
current task but there is no way to obtain it from user space.

Without the ability to find this address and dump it we can't restore
pthread'ed apps which call pthread_join() once they have been restored.

This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
the current process to obtain own clear_tid_address.

This feature is available iif CONFIG_CHECKPOINT_RESTORE is set.

[akpm@linux-foundation.org: fix prctl numbering]
Signed-off-by: Andrew Vagin
Signed-off-by: Cyrill Gorcunov
Cc: Pedro Alves
Cc: Oleg Nesterov
Cc: Pavel Emelyanov
Cc: Tejun Heo
Acked-by: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
13 years ago
1ad75b9e1 c/r: prctl: add minimal address test to PR_SET_MM ... Browse Code »

Make sure the address being set is greater than mmap_min_addr (as
suggested by Kees Cook).

Signed-off-by: Cyrill Gorcunov
Acked-by: Kees Cook
Cc: Serge Hallyn
Cc: Tejun Heo
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
13 years ago
bafb282df c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal ... Browse Code »

A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new
mm_struct::exe_file").

After removing mm->num_exe_file_vmas kernel keeps mm->exe_file until
final mmput(), it never becomes NULL while task is alive.

We can check for other mapped files in mm instead of checking
mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in
order to forbid second changing of mm->exe_file.

Signed-off-by: Konstantin Khlebnikov
Reviewed-by: Cyrill Gorcunov
Cc: Oleg Nesterov
Cc: Matt Helsley
Cc: Kees Cook
Cc: KOSAKI Motohiro
Cc: Tejun Heo
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
13 years ago

01 Jun, 2012

4 commits

b32dfe377 c/r: prctl: add ability to set new mm_struct::exe_file ... Browse Code »
44

When we do restore we would like to have a way to setup a former
mm_struct::exe_file so that /proc/pid/exe would point to the original
executable file a process had at checkpoint time.

For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
file descriptor which will be set as a source for new /proc/$pid/exe
symlink.

Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE
vmas present for current process, simply because this feature is a special
to C/R and mm::num_exe_file_vmas become meaningless after that.

To minimize the amount of transition the /proc/pid/exe symlink might have,
this feature is implemented in one-shot manner. Thus once changed the
symlink can't be changed again. This should help sysadmins to monitor the
symlinks over all process running in a system.

In particular one could make a snapshot of processes and ring alarm if
there unexpected changes of /proc/pid/exe's in a system.

Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and
the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
request to change symlink will be rejected.

Signed-off-by: Cyrill Gorcunov
Reviewed-by: Oleg Nesterov
Cc: KOSAKI Motohiro
Cc: Pavel Emelyanov
Cc: Kees Cook
Cc: Tejun Heo
Cc: Matt Helsley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
13 years ago
fe8c7f5cb c/r: prctl: extend PR_SET_MM to set up more mm_struct entries ... Browse Code »

During checkpoint we dump whole process memory to a file and the dump
includes process stack memory. But among stack data itself, the stack
carries additional parameters such as command line arguments, environment
data and auxiliary vector.

So when we do restore procedure and once we've restored stack data itself
we need to setup mm_struct::arg_start/end, env_start/end, so restored
process would be able to find command line arguments and environment data
it had at checkpoint time. The same applies to auxiliary vector.

For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
ENV_END | AUXV) codes are introduced.

Signed-off-by: Cyrill Gorcunov
Acked-by: Kees Cook
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
13 years ago
81ab6e7b2 kmod: convert two call sites to call_usermodehelper_fns() ... Browse Code »

Both kernel/sys.c && security/keys/request_key.c where inlining the exact
same code as call_usermodehelper_fns(); So simply convert these sites to
directly use call_usermodehelper_fns().

Signed-off-by: Boaz Harrosh
Cc: Oleg Nesterov
Cc: Tetsuo Handa
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Boaz Harrosh
13 years ago
499eea6bf sethostname/setdomainname: notify userspace when there is a change in uts_kern_table ... Browse Code »

sethostname() and setdomainname() notify userspace on failure (without
modifying uts_kern_table). Change things so that we only notify userspace
on success, when uts_kern_table was actually modified.

Signed-off-by: Sasikantha babu
Cc: Paul Gortmaker
Cc: Greg Kroah-Hartman
Cc: WANG Cong
Reviewed-by: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasikantha babu
13 years ago

24 May, 2012

1 commit

644473e9c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace enhancements from Eric Biederman:
"This is a course correction for the user namespace, so that we can
reach an inexpensive, maintainable, and reasonably complete
implementation.

Highlights:
- Config guards make it impossible to enable the user namespace and
code that has not been converted to be user namespace safe.

- Use of the new kuid_t type ensures the if you somehow get past the
config guards the kernel will encounter type errors if you enable
user namespaces and attempt to compile in code whose permission
checks have not been updated to be user namespace safe.

- All uids from child user namespaces are mapped into the initial
user namespace before they are processed. Removing the need to add
an additional check to see if the user namespace of the compared
uids remains the same.

- With the user namespaces compiled out the performance is as good or
better than it is today.

- For most operations absolutely nothing changes performance or
operationally with the user namespace enabled.

- The worst case performance I could come up with was timing 1
billion cache cold stat operations with the user namespace code
enabled. This went from 156s to 164s on my laptop (or 156ns to
164ns per stat operation).

- (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
Most uid/gid setting system calls treat these value specially
anyway so attempting to use -1 as a uid would likely cause
entertaining failures in userspace.

- If setuid is called with a uid that can not be mapped setuid fails.
I have looked at sendmail, login, ssh and every other program I
could think of that would call setuid and they all check for and
handle the case where setuid fails.

- If stat or a similar system call is called from a context in which
we can not map a uid we lie and return overflowuid. The LFS
experience suggests not lying and returning an error code might be
better, but the historical precedent with uids is different and I
can not think of anything that would break by lying about a uid we
can't map.

- Capabilities are localized to the current user namespace making it
safe to give the initial user in a user namespace all capabilities.

My git tree covers all of the modifications needed to convert the core
kernel and enough changes to make a system bootable to runlevel 1."

Fix up trivial conflicts due to nearby independent changes in fs/stat.c

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
userns: Silence silly gcc warning.
cred: use correct cred accessor with regards to rcu read lock
userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
userns: Convert cgroup permission checks to use uid_eq
userns: Convert tmpfs to use kuid and kgid where appropriate
userns: Convert sysfs to use kgid/kuid where appropriate
userns: Convert sysctl permission checks to use kuid and kgids.
userns: Convert proc to use kuid/kgid where appropriate
userns: Convert ext4 to user kuid/kgid where appropriate
userns: Convert ext3 to use kuid/kgid where appropriate
userns: Convert ext2 to use kuid/kgid where appropriate.
userns: Convert devpts to use kuid/kgid where appropriate
userns: Convert binary formats to use kuid/kgid where appropriate
userns: Add negative depends on entries to avoid building code that is userns unsafe
userns: signal remove unnecessary map_cred_ns
userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
userns: Convert stat to return values mapped from kuids and kgids
userns: Convert user specfied uids and gids in chown into kuids and kgid
userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
...

Linus Torvalds
13 years ago

03 May, 2012

3 commits

5af662030 userns: Convert ptrace, kill, set_priority permission checks to work with kuids and kgids ... Browse Code »

Update the permission checks to use the new uid_eq and gid_eq helpers
and remove the now unnecessary user_ns equality comparison.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
13 years ago
a29c33f4e userns: Convert setting and getting uid and gid system calls to use kuid and kgid ... Browse Code »

Convert setregid, setgid, setreuid, setuid,
setresuid, getresuid, setresgid, getresgid, setfsuid, setfsgid,
getuid, geteuid, getgid, getegid,
waitpid, waitid, wait4.

Convert userspace uids and gids into kuids and kgids before
being placed on struct cred. Convert struct cred kuids and
kgids into userspace uids and gids when returning them.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
13 years ago
078de5f70 userns: Store uid and gid values in struct cred with kuid_t and kgid_t types ... Browse Code »

cred.h and a few trivial users of struct cred are changed. The rest of the users
of struct cred are left for other patches as there are too many changes to make
in one go and leave the change reviewable. If the user namespace is disabled and
CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile
and behave correctly.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
13 years ago

14 Apr, 2012

2 commits

e2cfabdfd seccomp: add system call filtering using BPF ... Browse Code »

[This patch depends on luto@mit.edu's no_new_privs patch:
https://lkml.org/lkml/2012/1/30/264
The whole series including Andrew's patches can be found here:
https://github.com/redpig/linux/tree/seccomp
Complete diff here:
https://github.com/redpig/linux/compare/1dc65fed...seccomp
]

This patch adds support for seccomp mode 2. Mode 2 introduces the
ability for unprivileged processes to install system call filtering
policy expressed in terms of a Berkeley Packet Filter (BPF) program.
This program will be evaluated in the kernel for each system call
the task makes and computes a result based on data in the format
of struct seccomp_data.

A filter program may be installed by calling:
struct sock_fprog fprog = { ... };
...
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);

The return value of the filter program determines if the system call is
allowed to proceed or denied. If the first filter program installed
allows prctl(2) calls, then the above call may be made repeatedly
by a task to further reduce its access to the kernel. All attached
programs must be evaluated before a system call will be allowed to
proceed.

Filter programs will be inherited across fork/clone and execve.
However, if the task attaching the filter is unprivileged
(!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
ensures that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., setuid binary).

There are a number of benefits to this approach. A few of which are
as follows:
- BPF has been exposed to userland for a long time
- BPF optimization (and JIT'ing) are well understood
- Userland already knows its ABI: system call numbers and desired
arguments
- No time-of-check-time-of-use vulnerable data accesses are possible.
- system call arguments are loaded on access only to minimize copying
required for system call policy decisions.

Mode 2 support is restricted to architectures that enable
HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
syscall_get_arguments(). The full desired scope of this feature will
add a few minor additional requirements expressed later in this series.
Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
the desired additional functionality.

No architectures are enabled in this patch.

Signed-off-by: Will Drewry
Acked-by: Serge Hallyn
Reviewed-by: Indan Zupancic
Acked-by: Eric Paris
Reviewed-by: Kees Cook

v18: - rebase to v3.4-rc2
- s/chk/check/ (akpm@linux-foundation.org,jmorris@namei.org)
- allocate with GFP_KERNEL|__GFP_NOWARN (indan@nul.nu)
- add a comment for get_u32 regarding endianness (akpm@)
- fix other typos, style mistakes (akpm@)
- added acked-by
v17: - properly guard seccomp filter needed headers (leann@ubuntu.com)
- tighten return mask to 0x7fff0000
v16: - no change
v15: - add a 4 instr penalty when counting a path to account for seccomp_filter
size (indan@nul.nu)
- drop the max insns to 256KB (indan@nul.nu)
- return ENOMEM if the max insns limit has been hit (indan@nul.nu)
- move IP checks after args (indan@nul.nu)
- drop !user_filter check (indan@nul.nu)
- only allow explicit bpf codes (indan@nul.nu)
- exit_code -> exit_sig
v14: - put/get_seccomp_filter takes struct task_struct
(indan@nul.nu,keescook@chromium.org)
- adds seccomp_chk_filter and drops general bpf_run/chk_filter user
- add seccomp_bpf_load for use by net/core/filter.c
- lower max per-process/per-hierarchy: 1MB
- moved nnp/capability check prior to allocation
(all of the above: indan@nul.nu)
v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
v12: - added a maximum instruction count per path (indan@nul.nu,oleg@redhat.com)
- removed copy_seccomp (keescook@chromium.org,indan@nul.nu)
- reworded the prctl_set_seccomp comment (indan@nul.nu)
v11: - reorder struct seccomp_data to allow future args expansion (hpa@zytor.com)
- style clean up, @compat dropped, compat_sock_fprog32 (indan@nul.nu)
- do_exit(SIGSYS) (keescook@chromium.org, luto@mit.edu)
- pare down Kconfig doc reference.
- extra comment clean up
v10: - seccomp_data has changed again to be more aesthetically pleasing
(hpa@zytor.com)
- calling convention is noted in a new u32 field using syscall_get_arch.
This allows for cross-calling convention tasks to use seccomp filters.
(hpa@zytor.com)
- lots of clean up (thanks, Indan!)
v9: - n/a
v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
- Lots of fixes courtesy of indan@nul.nu:
-- fix up load behavior, compat fixups, and merge alloc code,
-- renamed pc and dropped __packed, use bool compat.
-- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
dependencies
v7: (massive overhaul thanks to Indan, others)
- added CONFIG_HAVE_ARCH_SECCOMP_FILTER
- merged into seccomp.c
- minimal seccomp_filter.h
- no config option (part of seccomp)
- no new prctl
- doesn't break seccomp on systems without asm/syscall.h
(works but arg access always fails)
- dropped seccomp_init_task, extra free functions, ...
- dropped the no-asm/syscall.h code paths
- merges with network sk_run_filter and sk_chk_filter
v6: - fix memory leak on attach compat check failure
- require no_new_privs || CAP_SYS_ADMIN prior to filter
installation. (luto@mit.edu)
- s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com)
- cleaned up Kconfig (amwang@redhat.com)
- on block, note if the call was compat (so the # means something)
v5: - uses syscall_get_arguments
(indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
- uses union-based arg storage with hi/lo struct to
handle endianness. Compromises between the two alternate
proposals to minimize extra arg shuffling and account for
endianness assuming userspace uses offsetof().
(mcgrathr@chromium.org, indan@nul.nu)
- update Kconfig description
- add include/seccomp_filter.h and add its installation
- (naive) on-demand syscall argument loading
- drop seccomp_t (eparis@redhat.com)
v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
- now uses current->no_new_privs
(luto@mit.edu,torvalds@linux-foundation.com)
- assign names to seccomp modes (rdunlap@xenotime.net)
- fix style issues (rdunlap@xenotime.net)
- reworded Kconfig entry (rdunlap@xenotime.net)
v3: - macros to inline (oleg@redhat.com)
- init_task behavior fixed (oleg@redhat.com)
- drop creator entry and extra NULL check (oleg@redhat.com)
- alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
- adds tentative use of "always_unprivileged" as per
torvalds@linux-foundation.org and luto@mit.edu
v2: - (patch 2 only)
Signed-off-by: James Morris

Will Drewry
13 years ago
259e5e6c7 Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs ... Browse Code »

With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.

Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.

To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)

This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.

Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.

Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.

Signed-off-by: Andy Lutomirski
Signed-off-by: Will Drewry
Acked-by: Eric Paris
Acked-by: Kees Cook

v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris

Andy Lutomirski
13 years ago

08 Apr, 2012

1 commit

7b44ab978 userns: Disassociate user_struct from the user_namespace. ... Browse Code »

Modify alloc_uid to take a kuid and make the user hash table global.
Stop holding a reference to the user namespace in struct user_struct.

This simplifies the code and makes the per user accounting not
care about which user namespace a uid happens to appear in.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
13 years ago