13 Jun, 2013

1 commit

  • We recently noticed that reboot of a 1024 cpu machine takes approx 16
    minutes of just stopping the cpus. The slowdown was tracked to commit
    f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
    kernel_restart()").

    The current implementation does all the work of hot removing the cpus
    before halting the system. We are switching to just migrating to the
    boot cpu and then continuing with shutdown/reboot.

    This also has the effect of not breaking x86's command line parameter
    for specifying the reboot cpu. Note, this code was shamelessly copied
    from arch/x86/kernel/reboot.c with bits removed pertaining to the
    reboot_cpu command line parameter.

    Signed-off-by: Robin Holt
    Tested-by: Shawn Guo
    Cc: "Srivatsa S. Bhat"
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Russ Anderson
    Cc: Robin Holt
    Cc: Russell King
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

01 May, 2013

3 commits

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     
  • The purpose of this patch is to allow privileged processes to set
    their own per-process memory-region fields:

    start_code, end_code, start_data, end_data, start_brk, brk,
    start_stack, arg_start, arg_end, env_start, env_end.

    This functionality is needed by any application or package that needs to
    reconstruct Linux processes, that is, to start them in any way other than
    by means of an "execve()" from an executable file. This includes:

    1. Restoring processes from a checkpoint-file (by all potential
    user-level checkpointing packages, not only CRIU's).
    2. Restarting processes on another node after process migration.
    3. Starting duplicated copies of a running process (for reliability
    and high-availability).
    4. Starting a process from an executable format that is not supported
    by Linux, thus requiring a "manual execve" by a user-level utility.
    5. Similarly, starting a process from a networked and/or encrypted
    executable that, for confidentiality, licensing or other reasons,
    may not be written to the local file-systems.

    The code that does that was already included in the Linux kernel by the
    CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
    enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
    normally disabled. The patch removes those ifdefs.

    Signed-off-by: Amnon Shiloh
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amnon Shiloh
     
  • Andrew Morton noted:

    akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
    SYSCALL_DEFINE1(alarm, unsigned int, seconds)
    SYSCALL_DEFINE0(getpid)
    SYSCALL_DEFINE0(getppid)
    SYSCALL_DEFINE0(getuid)
    SYSCALL_DEFINE0(geteuid)
    SYSCALL_DEFINE0(getgid)
    SYSCALL_DEFINE0(getegid)
    SYSCALL_DEFINE0(gettid)
    SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
    COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

    Only one of those should be in kernel/timer.c. Who wrote this thing?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Rothwell
    Acked-by: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     

09 Apr, 2013

1 commit

  • As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core
    subsystems PM) says, syscore_ops operations should be carried out with
    one CPU on-line and interrupts disabled. However, after commit f96972f2d
    (kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()),
    syscore_shutdown() is called before disable_nonboot_cpus(), which
    breaks that rule. We have a MIPS machine with an 8259A PIC, and there
    is an external timer (HPET) connected to the 8259A. Since the 8259A is
    shut down too early (by syscore_shutdown()), disable_nonboot_cpus()
    runs without a timer interrupt, so it hangs and reboot fails. This
    patch calls syscore_shutdown() a little later (after
    disable_nonboot_cpus()) to avoid the reboot failure; this is the same
    order that poweroff uses.

    For consistency, add disable_nonboot_cpus() to kernel_halt().
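
    The ordering described above can be modeled in user space. This is a
    sketch only, not kernel code: the three *_stub functions and the trace
    buffer are hypothetical stand-ins that merely record the call order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical trace buffer: records the order in which the stubs run. */
static char trace[64];

static void record(const char *step)
{
	strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
}

/* Stand-ins for the kernel functions named in the commit message. */
static void disable_nonboot_cpus_stub(void) { record("cpus;"); }
static void syscore_shutdown_stub(void)     { record("syscore;"); }
static void machine_restart_stub(void)      { record("restart;"); }

/* Model of the fixed order: take the secondary CPUs down while the
 * interrupt controller still delivers timer interrupts, and only then
 * run the syscore shutdown hooks. */
static const char *model_kernel_restart(void)
{
	trace[0] = '\0';
	disable_nonboot_cpus_stub();
	syscore_shutdown_stub();
	machine_restart_stub();
	return trace;
}
```

    Swapping the first two calls in model_kernel_restart() reproduces the
    order that hung on the 8259A machine.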

    Signed-off-by: Huacai Chen
    Signed-off-by: Rafael J. Wysocki

    Huacai Chen
     

23 Mar, 2013

1 commit

  • David said:

    Commit 6c0c0d4d1080 ("poweroff: fix bug in orderly_poweroff()")
    apparently fixes one bug in orderly_poweroff(), but introduces
    another. The comments on orderly_poweroff() claim it can be called
    from any context - and indeed we call it from interrupt context in
    arch/powerpc/platforms/pseries/ras.c for example. But since that
    commit this is no longer safe, since call_usermodehelper_fns() is not
    safe in interrupt context without the UMH_NO_WAIT option.

    orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is
    sleepable. Move the "force" logic into __orderly_poweroff() and change
    orderly_poweroff() to use the global poweroff_work which simply calls
    __orderly_poweroff().

    While at it, remove the unneeded "int argc" and change argv_split() to
    use GFP_KERNEL.

    We use the global "bool poweroff_force" to pass the argument. This can
    obviously affect a previous request that is still pending or running,
    so we only allow the "false => true" transition, assuming that the
    pending "true" should succeed anyway. If schedule_work() fails after
    that, we know that work->func() has not been called yet, so it must
    see the new value.

    This means that orderly_poweroff() becomes async even if we do not run
    the command, and it always succeeds: schedule_work() can only fail if
    the work is already pending. We can export __orderly_poweroff() and
    change the non-atomic callers which want the old semantics.
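
    The "false => true" rule can be illustrated with a small user-space
    model. It is a sketch under assumptions: the model_* functions and the
    work_pending flag are hypothetical stand-ins for orderly_poweroff(),
    the work function and schedule_work(); only poweroff_force is a name
    taken from the text above.

```c
#include <stdbool.h>

/* User-space model; not the kernel implementation. */
static bool poweroff_force;   /* argument handed to the deferred work */
static bool work_pending;     /* hypothetical stand-in for the queued work */

/* Model of orderly_poweroff(): never sleeps, always reports success. */
static int model_orderly_poweroff(bool force)
{
	if (force)                /* only the false => true transition is allowed */
		poweroff_force = true;
	work_pending = true;      /* queueing again leaves a single pending item */
	return 0;
}

/* Model of the work function: runs later and reads the shared flag. */
static bool model_poweroff_work(void)
{
	work_pending = false;
	return poweroff_force;    /* value that would reach __orderly_poweroff() */
}
```

    A later request with force == false cannot downgrade an earlier forced
    request, because the flag is only ever set, never cleared, by callers.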

    Signed-off-by: Oleg Nesterov
    Reported-by: Benjamin Herrenschmidt
    Reported-by: David Gibson
    Cc: Lucas De Marchi
    Cc: Feng Hong
    Cc: Kees Cook
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

04 Mar, 2013

1 commit


28 Feb, 2013

1 commit

  • __orderly_poweroff() does argv_free() if call_usermodehelper_fns()
    returns -ENOMEM. As Lucas pointed out, this can be wrong if -ENOMEM was
    not triggered by a failing call_usermodehelper_setup(); in that case
    both __orderly_poweroff() and argv_cleanup() can do kfree().

    Kill argv_cleanup() and change __orderly_poweroff() to call argv_free()
    unconditionally, like do_coredump() does. This info->cleanup() is not
    needed (and wrong) since 6c0c0d4d ("poweroff: fix bug in
    orderly_poweroff()"), which did the UMH_NO_WAIT => UMH_WAIT_EXEC
    change: we can rely on the fact that CLONE_VFORK can't return until
    do_execve() succeeds/fails.

    Signed-off-by: Oleg Nesterov
    Reported-by: Lucas De Marchi
    Cc: David Howells
    Cc: James Morris
    Cc: hongfeng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

1 commit

  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhancements to the user
    namespace: reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


22 Feb, 2013

2 commits


27 Dec, 2012

1 commit

  • In a container with its own pid namespace and user namespace, rebooting
    the system won't reboot the host, but will terminate all the processes
    in the container and thus shut it down, so it's safe.

    Signed-off-by: Li Zefan
    Signed-off-by: Eric W. Biederman

    Li Zefan
     

29 Nov, 2012

1 commit

  • We have thread_group_cputime() and thread_group_times(). The naming
    doesn't provide enough information about the difference between
    these two APIs.

    To lower the confusion, rename thread_group_times() to
    thread_group_cputime_adjusted(). This name better suggests that
    it's a version of thread_group_cputime() that does some stabilization
    on the raw cputime values, i.e. here: scaling on top of CFS runtime
    stats and bounding the lower value for monotonicity.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Paul Gortmaker

    Frederic Weisbecker
     

20 Oct, 2012

2 commits

  • The min/max call needed to have explicit types on some architectures
    (e.g. mn10300). Use clamp_t instead to avoid the warning:

    kernel/sys.c: In function 'override_release':
    kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default]
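
    The idea behind clamp_t can be sketched in user space: cast every
    operand to one named type before comparing, so no mixed-type
    comparison is left for the compiler to warn about. CLAMP_T and
    bounded_len() below are illustrative stand-ins, not the kernel's macro
    or the code in override_release(), and the [1, bufsize] bounds are an
    assumption.

```c
#include <stddef.h>

/* User-space stand-in for a clamp_t(type, val, lo, hi) style macro.
 * Unlike a production macro, it evaluates its operands more than once,
 * so only pass side-effect-free expressions. */
#define CLAMP_T(type, val, lo, hi) \
	((type)(val) < (type)(lo) ? (type)(lo) : \
	 (type)(val) > (type)(hi) ? (type)(hi) : (type)(val))

/* Bound a caller-supplied length to the range [1, bufsize]. */
static size_t bounded_len(size_t len, size_t bufsize)
{
	return CLAMP_T(size_t, len, 1, bufsize);
}
```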

    Reported-by: Fengguang Wu
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Calling uname() with the UNAME26 personality set allows a leak of kernel
    stack contents. This fixes it by defensively calculating the length of
    the copy_to_user() call, making the len argument unsigned, and initializing
    the stack buffer to zero (now technically unneeded, but hey, overkill).
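
    The three defensive measures named above can be sketched in user
    space. This is an illustration under assumptions, not the kernel code:
    copy_release() and RELEASE_BUF are hypothetical names, memcpy() stands
    in for copy_to_user(), and the 65-byte size is assumed to match the
    utsname release field.

```c
#include <stdio.h>
#include <string.h>

#define RELEASE_BUF 65  /* assumed size of the release field */

/* Format a "2.6.x" release string into dst without ever copying more
 * than min(len, RELEASE_BUF) bytes. Returns the number of bytes copied,
 * including the terminating NUL, or 0 if nothing could be copied. */
static size_t copy_release(char *dst, size_t len, unsigned int minor)
{
	char buf[RELEASE_BUF] = { 0 };  /* zeroed: no stale stack bytes */
	size_t copy = len < sizeof(buf) ? len : sizeof(buf);  /* unsigned bound */
	int n;

	if (copy == 0)
		return 0;
	n = snprintf(buf, copy, "2.6.%u", minor);
	if (n < 0)
		return 0;
	/* snprintf reports the untruncated length; cap it to what was written. */
	if ((size_t)n >= copy)
		n = (int)copy - 1;
	memcpy(dst, buf, (size_t)n + 1);
	return (size_t)n + 1;
}
```

    Because the copy length is derived from what was actually formatted,
    a short or oversized len can never expose bytes beyond the string.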

    CVE-2012-0957

    Reported-by: PaX Team
    Signed-off-by: Kees Cook
    Cc: Andi Kleen
    Cc: PaX Team
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

06 Oct, 2012

2 commits

  • orderly_poweroff() tries to power off the platform in two steps:

    step 1: Call a user space application to power off
    step 2: If the user space poweroff fails, then do a forced power off if
    the force param is set.

    The bug here is that step 1 always succeeds with param UMH_NO_WAIT,
    which defeats the design goal of orderly_poweroff().

    We have two choices here:
    UMH_WAIT_EXEC, which means wait for the exec, but not the process;
    UMH_WAIT_PROC, which means wait for the process to complete.
    We need to trade off the two choices:

    If using UMH_WAIT_EXEC, there is a potential issue noted by Serge E.
    Hallyn: The exec will have started, but may for whatever (very unlikely)
    reason fail.

    If using UMH_WAIT_PROC, there is a potential issue noted by Eric W.
    Biederman: If the caller is not running in a kernel thread then we can
    easily get into a case where the user space caller will block waiting for
    us when we are waiting for the user space caller.

    Thanks to them for their excellent ideas. Based on the above
    discussion, we finally chose UMH_WAIT_EXEC, which is much safer: if the
    user application really fails, we just blame the application itself,
    which seems the better choice here.

    Signed-off-by: Feng Hong
    Acked-by: Kees Cook
    Acked-by: Serge Hallyn
    Cc: "Eric W. Biederman"
    Acked-by: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    hongfeng
     
  • As kernel_power_off() calls disable_nonboot_cpus(), we may also want to
    have kernel_restart() call disable_nonboot_cpus(). Doing so can help
    machines that require the boot cpu to be the last cpu alive during
    reboot to survive a kernel restart.

    This fixes one reboot issue seen on imx6q (Cortex-A9 Quad). The machine
    requires that the restart routine be run on the primary cpu rather than
    secondary ones. Otherwise, the secondary core running the restart
    routine will fail to come online after reboot.

    Signed-off-by: Shawn Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Guo
     

27 Sep, 2012

2 commits


31 Jul, 2012

2 commits

  • If argv_split() failed, the code will end up calling argv_free(NULL). Fix
    it up and clean things up a bit.

    Addresses Coverity report 703573.

    Cc: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: WANG Cong
    Cc: Alan Cox
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Setting "error" to the error number on failure is enough; there is no
    need to set the "error" variable to zero in each switch case, since it
    is already initialized to zero. Also replace "return 0" in the switch
    cases with a break statement.

    Signed-off-by: Sasikantha babu
    Acked-by: Kees Cook
    Acked-by: Serge E. Hallyn
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasikantha babu
     

12 Jul, 2012

1 commit

  • The "no other files mapped" requirement from my previous patch (c/r: prctl:
    update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
    paranoid: it forbids the operation even if a single shared-anon vma is
    mapped.

    Let's check instead that the current mm->exe_file is already unmapped; in
    this case the exe_file symlink is already outdated and changing it is
    reasonable.

    Plus, this patch fixes the exit code in case the operation succeeds.

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Cyrill Gorcunov
    Tested-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Cc: Matt Helsley
    Cc: Kees Cook
    Cc: KOSAKI Motohiro
    Cc: Tejun Heo
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

21 Jun, 2012

1 commit

  • During merging of the PR_GET_TID_ADDRESS patch the code was misplaced (it
    ended up under PR_MCE_KILL); as a result no one can use this option.

    Fix it by moving the code snippet to its proper place.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Andrey Vagin
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

08 Jun, 2012

4 commits

  • In commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") the stack allocated via clone() is marked in
    /proc/<pid>/maps as [stack:%d] thus it might be out of the former
    mm->start_stack/end_stack values (and even has some custom VMA flags
    set).

    So to be able to restore mm->start_stack/end_stack drop vma flags test,
    but still require the underlying VMA to exist.

    As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
    requires CAP_SYS_RESOURCE to be granted.

    Signed-off-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Zero is written at clear_tid_address when the process exits. This
    functionality is used by pthread_join().

    We already have sys_set_tid_address() to change this address for the
    current task but there is no way to obtain it from user space.

    Without the ability to find this address and dump it we can't restore
    pthread'ed apps which call pthread_join() once they have been restored.

    This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
    the current process to obtain its own clear_tid_address.

    This feature is available only if CONFIG_CHECKPOINT_RESTORE is set.
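
    From user space the option could be exercised roughly as follows. This
    is a minimal sketch: the two helper names are hypothetical, the
    fallback value 40 is only used if the installed headers lack the
    constant, and the call is expected to fail on kernels built without
    the required support.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PR_GET_TID_ADDRESS
#define PR_GET_TID_ADDRESS 40  /* fallback for older userspace headers */
#endif

/* Store the calling thread's clear_tid_address in *out.
 * Returns 0 on success or a negative errno value. */
static int get_clear_tid_address(int **out)
{
	if (prctl(PR_GET_TID_ADDRESS, (unsigned long)out, 0, 0, 0) != 0)
		return -errno;
	return 0;
}

/* Point clear_tid_address at a caller-chosen slot via set_tid_address(2).
 * The slot must stay valid until the thread exits. */
static void set_clear_tid_address(int *slot)
{
	syscall(SYS_set_tid_address, slot);
}
```

    A checkpoint tool would record the address returned by the first helper
    and hand it to the second one in the restored thread.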

    [akpm@linux-foundation.org: fix prctl numbering]
    Signed-off-by: Andrew Vagin
    Signed-off-by: Cyrill Gorcunov
    Cc: Pedro Alves
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Make sure the address being set is greater than mmap_min_addr (as
    suggested by Kees Cook).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Serge Hallyn
    Cc: Tejun Heo
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new
    mm_struct::exe_file").

    After removing mm->num_exe_file_vmas kernel keeps mm->exe_file until
    final mmput(), it never becomes NULL while task is alive.

    We can check for other mapped files in mm instead of checking
    mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in
    order to forbid second changing of mm->exe_file.

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Cc: Matt Helsley
    Cc: Kees Cook
    Cc: KOSAKI Motohiro
    Cc: Tejun Heo
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

01 Jun, 2012

4 commits

  • When we do restore we would like to have a way to set up a former
    mm_struct::exe_file so that /proc/pid/exe would point to the original
    executable file a process had at checkpoint time.

    For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
    file descriptor which will be set as the source for the new /proc/$pid/exe
    symlink.

    Note that changing /proc/$pid/exe is allowed only if there are no
    VM_EXECUTABLE vmas present for the current process, simply because this
    feature is specific to C/R and mm::num_exe_file_vmas becomes meaningless
    after that.

    To minimize the number of transitions the /proc/pid/exe symlink might have,
    this feature is implemented in a one-shot manner. Thus once changed, the
    symlink can't be changed again. This should help sysadmins to monitor the
    symlinks of all processes running in a system.

    In particular one could make a snapshot of processes and raise an alarm if
    there are unexpected changes of /proc/pid/exe's in a system.

    Note -- this feature is available only if CONFIG_CHECKPOINT_RESTORE is set
    and the caller has the CAP_SYS_RESOURCE capability granted; otherwise the
    request to change the symlink will be rejected.
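
    A caller might wrap the new option like this. It is a sketch only:
    set_exe_file() is a hypothetical helper, and the numeric fallbacks are
    used only when the installed headers do not define the constants.

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_MM
#define PR_SET_MM 35           /* fallback for older userspace headers */
#endif
#ifndef PR_SET_MM_EXE_FILE
#define PR_SET_MM_EXE_FILE 13  /* fallback for older userspace headers */
#endif

/* Ask the kernel to make /proc/self/exe point at the file behind fd.
 * Returns 0 on success or a negative errno value, for example when the
 * caller lacks the required capability or fd is not an open file. */
static int set_exe_file(int fd)
{
	if (prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0, 0) != 0)
		return -errno;
	return 0;
}
```

    A restore tool would open the original executable read-only and pass
    the descriptor to set_exe_file() once the old mappings are gone.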

    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Pavel Emelyanov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • During checkpoint we dump the whole process memory to a file and the dump
    includes the process stack memory. But besides the stack data itself, the
    stack carries additional parameters such as command line arguments,
    environment data and the auxiliary vector.

    So during the restore procedure, once we've restored the stack data
    itself, we need to set up mm_struct::arg_start/end and env_start/end, so
    that the restored process is able to find the command line arguments and
    environment data it had at checkpoint time. The same applies to the
    auxiliary vector.

    For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
    ENV_END | AUXV) codes are introduced.
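
    A restore tool could apply the saved ranges one option at a time, as in
    the sketch below. struct saved_ranges and restore_arg_env() are
    hypothetical names introduced for illustration, and the numeric
    fallbacks are used only when the installed headers lack the constants.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>

#ifndef PR_SET_MM
#define PR_SET_MM 35            /* fallbacks for older userspace headers */
#define PR_SET_MM_ARG_START 8
#define PR_SET_MM_ARG_END 9
#define PR_SET_MM_ENV_START 10
#define PR_SET_MM_ENV_END 11
#endif

/* Hypothetical record of the ranges saved at checkpoint time. */
struct saved_ranges {
	unsigned long arg_start, arg_end;
	unsigned long env_start, env_end;
};

/* Apply the four ranges with one prctl() call each.
 * Returns 0 on success or the negative errno of the first failure. */
static int restore_arg_env(const struct saved_ranges *r)
{
	const struct { int opt; unsigned long addr; } steps[] = {
		{ PR_SET_MM_ARG_START, r->arg_start },
		{ PR_SET_MM_ARG_END,   r->arg_end },
		{ PR_SET_MM_ENV_START, r->env_start },
		{ PR_SET_MM_ENV_END,   r->env_end },
	};
	size_t i;

	for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		if (prctl(PR_SET_MM, steps[i].opt, steps[i].addr, 0, 0) != 0)
			return -errno;
	}
	return 0;
}
```

    The helper stops at the first rejected address, so a partially applied
    set of ranges is reported to the caller rather than silently ignored.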

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Both kernel/sys.c and security/keys/request_key.c were inlining the exact
    same code as call_usermodehelper_fns(), so simply convert these sites to
    use call_usermodehelper_fns() directly.

    Signed-off-by: Boaz Harrosh
    Cc: Oleg Nesterov
    Cc: Tetsuo Handa
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • sethostname() and setdomainname() notify userspace on failure (without
    modifying uts_kern_table). Change things so that we only notify userspace
    on success, when uts_kern_table was actually modified.

    Signed-off-by: Sasikantha babu
    Cc: Paul Gortmaker
    Cc: Greg Kroah-Hartman
    Cc: WANG Cong
    Reviewed-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasikantha babu
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these values specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

03 May, 2012

3 commits


14 Apr, 2012

2 commits

  • [This patch depends on luto@mit.edu's no_new_privs patch:
    https://lkml.org/lkml/2012/1/30/264
    The whole series including Andrew's patches can be found here:
    https://github.com/redpig/linux/tree/seccomp
    Complete diff here:
    https://github.com/redpig/linux/compare/1dc65fed...seccomp
    ]

    This patch adds support for seccomp mode 2. Mode 2 introduces the
    ability for unprivileged processes to install system call filtering
    policy expressed in terms of a Berkeley Packet Filter (BPF) program.
    This program will be evaluated in the kernel for each system call
    the task makes and computes a result based on data in the format
    of struct seccomp_data.

    A filter program may be installed by calling:
    struct sock_fprog fprog = { ... };
    ...
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);

    The return value of the filter program determines if the system call is
    allowed to proceed or denied. If the first filter program installed
    allows prctl(2) calls, then the above call may be made repeatedly
    by a task to further reduce its access to the kernel. All attached
    programs must be evaluated before a system call will be allowed to
    proceed.
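
    A complete version of the call above might look like the following
    sketch, which installs a one-instruction filter that allows every
    system call. install_allow_all_filter() is a hypothetical helper. It
    sets no_new_privs explicitly first, because on mainline kernels a
    caller without CAP_SYS_ADMIN must do so before attaching a filter.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* Install a filter whose only instruction returns SECCOMP_RET_ALLOW.
 * Returns 0 on success or a negative errno value. */
static int install_allow_all_filter(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog fprog = {
		.len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
		.filter = insns,
	};

	/* Unprivileged tasks may only attach filters with no_new_privs set. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -errno;
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog) != 0)
		return -errno;
	return 0;
}
```

    A real policy would load fields of struct seccomp_data, such as the
    system call number, and branch on them before choosing a return value.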

    Filter programs will be inherited across fork/clone and execve.
    However, if the task attaching the filter is unprivileged
    (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
    ensures that unprivileged tasks cannot attach filters that affect
    privileged tasks (e.g., setuid binary).

    There are a number of benefits to this approach. A few of which are
    as follows:
    - BPF has been exposed to userland for a long time
    - BPF optimization (and JIT'ing) are well understood
    - Userland already knows its ABI: system call numbers and desired
    arguments
    - No time-of-check-time-of-use vulnerable data accesses are possible.
    - system call arguments are loaded on access only to minimize copying
    required for system call policy decisions.

    Mode 2 support is restricted to architectures that enable
    HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
    syscall_get_arguments(). The full desired scope of this feature will
    add a few minor additional requirements expressed later in this series.
    Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
    the desired additional functionality.

    No architectures are enabled in this patch.

    Signed-off-by: Will Drewry
    Acked-by: Serge Hallyn
    Reviewed-by: Indan Zupancic
    Acked-by: Eric Paris
    Reviewed-by: Kees Cook

    v18: - rebase to v3.4-rc2
    - s/chk/check/ (akpm@linux-foundation.org,jmorris@namei.org)
    - allocate with GFP_KERNEL|__GFP_NOWARN (indan@nul.nu)
    - add a comment for get_u32 regarding endianness (akpm@)
    - fix other typos, style mistakes (akpm@)
    - added acked-by
    v17: - properly guard seccomp filter needed headers (leann@ubuntu.com)
    - tighten return mask to 0x7fff0000
    v16: - no change
    v15: - add a 4 instr penalty when counting a path to account for seccomp_filter
    size (indan@nul.nu)
    - drop the max insns to 256KB (indan@nul.nu)
    - return ENOMEM if the max insns limit has been hit (indan@nul.nu)
    - move IP checks after args (indan@nul.nu)
    - drop !user_filter check (indan@nul.nu)
    - only allow explicit bpf codes (indan@nul.nu)
    - exit_code -> exit_sig
    v14: - put/get_seccomp_filter takes struct task_struct
    (indan@nul.nu,keescook@chromium.org)
    - adds seccomp_chk_filter and drops general bpf_run/chk_filter user
    - add seccomp_bpf_load for use by net/core/filter.c
    - lower max per-process/per-hierarchy: 1MB
    - moved nnp/capability check prior to allocation
    (all of the above: indan@nul.nu)
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - added a maximum instruction count per path (indan@nul.nu,oleg@redhat.com)
    - removed copy_seccomp (keescook@chromium.org,indan@nul.nu)
    - reworded the prctl_set_seccomp comment (indan@nul.nu)
    v11: - reorder struct seccomp_data to allow future args expansion (hpa@zytor.com)
    - style clean up, @compat dropped, compat_sock_fprog32 (indan@nul.nu)
    - do_exit(SIGSYS) (keescook@chromium.org, luto@mit.edu)
    - pare down Kconfig doc reference.
    - extra comment clean up
    v10: - seccomp_data has changed again to be more aesthetically pleasing
    (hpa@zytor.com)
    - calling convention is noted in a new u32 field using syscall_get_arch.
    This allows for cross-calling convention tasks to use seccomp filters.
    (hpa@zytor.com)
    - lots of clean up (thanks, Indan!)
    v9: - n/a
    v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
    - Lots of fixes courtesy of indan@nul.nu:
    -- fix up load behavior, compat fixups, and merge alloc code,
    -- renamed pc and dropped __packed, use bool compat.
    -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
    dependencies
    v7: (massive overhaul thanks to Indan, others)
    - added CONFIG_HAVE_ARCH_SECCOMP_FILTER
    - merged into seccomp.c
    - minimal seccomp_filter.h
    - no config option (part of seccomp)
    - no new prctl
    - doesn't break seccomp on systems without asm/syscall.h
    (works but arg access always fails)
    - dropped seccomp_init_task, extra free functions, ...
    - dropped the no-asm/syscall.h code paths
    - merges with network sk_run_filter and sk_chk_filter
    v6: - fix memory leak on attach compat check failure
    - require no_new_privs || CAP_SYS_ADMIN prior to filter
    installation. (luto@mit.edu)
    - s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com)
    - cleaned up Kconfig (amwang@redhat.com)
    - on block, note if the call was compat (so the # means something)
    v5: - uses syscall_get_arguments
    (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
    - uses union-based arg storage with hi/lo struct to
    handle endianness. Compromises between the two alternate
    proposals to minimize extra arg shuffling and account for
    endianness assuming userspace uses offsetof().
    (mcgrathr@chromium.org, indan@nul.nu)
    - update Kconfig description
    - add include/seccomp_filter.h and add its installation
    - (naive) on-demand syscall argument loading
    - drop seccomp_t (eparis@redhat.com)
    v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
    - now uses current->no_new_privs
    (luto@mit.edu,torvalds@linux-foundation.com)
    - assign names to seccomp modes (rdunlap@xenotime.net)
    - fix style issues (rdunlap@xenotime.net)
    - reworded Kconfig entry (rdunlap@xenotime.net)
    v3: - macros to inline (oleg@redhat.com)
    - init_task behavior fixed (oleg@redhat.com)
    - drop creator entry and extra NULL check (oleg@redhat.com)
    - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
    - adds tentative use of "always_unprivileged" as per
    torvalds@linux-foundation.org and luto@mit.edu
    v2: - (patch 2 only)
    Signed-off-by: James Morris

    Will Drewry
     
  • With this change, calling
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
    disables privilege-granting operations at execve-time. For example, a
    process will not be able to execute a setuid binary to change its uid
    or gid if this bit is set. The same is true for file capabilities.

    Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
    LSMs respect the requested behavior.

    To determine if the NO_NEW_PRIVS bit is set, a task may call
    prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
    It returns 1 if the bit is set and 0 if it is not. If any of the
    arguments are non-zero, it will return -1 and set errno to EINVAL.
    (PR_SET_NO_NEW_PRIVS behaves similarly.)
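
    A minimal sketch of the two calls, assuming headers that define
    PR_SET_NO_NEW_PRIVS and PR_GET_NO_NEW_PRIVS. The wrapper names
    set_no_new_privs and get_no_new_privs are hypothetical.

```c
#include <sys/prctl.h>

/* Sets the no_new_privs bit for the calling task.
 * Returns 0 on success, -1 with errno set on failure.
 * The unused arguments are passed as unsigned long zeros because
 * prctl() is variadic and the kernel rejects non-zero values. */
static int set_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL);
}

/* Returns 1 if the bit is set, 0 if it is not, -1 with errno set on error. */
static int get_no_new_privs(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0UL, 0UL, 0UL, 0UL);
}
```

    The patch provides no call to clear the bit, so a task would normally
    set it right before dropping into its restricted mode.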

    This functionality is desired for the proposed seccomp filter patch
    series. Using PR_SET_NO_NEW_PRIVS allows a task to modify the
    system call behavior for itself and its child tasks without being
    able to impact the behavior of a more privileged task.

    Another potential use is making certain privileged operations
    unprivileged. For example, chroot may be considered "safe" if it cannot
    affect privileged tasks.

    Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
    set and AppArmor is in use. It is fixed in a subsequent patch.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Will Drewry
    Acked-by: Eric Paris
    Acked-by: Kees Cook

    v18: updated change desc
    v17: using new define values as per 3.4
    Signed-off-by: James Morris

    Andy Lutomirski
     

08 Apr, 2012

1 commit