06 Oct, 2012

2 commits

  • orderly_poweroff tries to power off the platform in two steps:

    step 1: Call a user space application to power off.
    step 2: If the user space poweroff fails, do a forced power off if the
    force param is set.

    The bug here is that step 1 always succeeds with param UMH_NO_WAIT, which
    defeats the design goal of orderly_poweroff.

    We have two choices here:
    UMH_WAIT_EXEC, which means wait for the exec, but not the process;
    UMH_WAIT_PROC, which means wait for the process to complete.
    We need to trade off the two choices:

    If using UMH_WAIT_EXEC, there is a potential issue noted by Serge E.
    Hallyn: The exec will have started, but may for whatever (very unlikely)
    reason fail.

    If using UMH_WAIT_PROC, there is a potential issue noted by Eric W.
    Biederman: If the caller is not running in a kernel thread then we can
    easily get into a case where the user space caller will block waiting for
    us when we are waiting for the user space caller.

    Thanks to both for their excellent ideas. Based on the above discussion,
    we finally chose UMH_WAIT_EXEC, which is much safer. If the user
    application really fails, we just blame the application itself, which
    seems a better choice here.

    Signed-off-by: Feng Hong
    Acked-by: Kees Cook
    Acked-by: Serge Hallyn
    Cc: "Eric W. Biederman"
    Acked-by: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    hongfeng
     
  • As kernel_power_off() calls disable_nonboot_cpus(), we may also want to
    have kernel_restart() call disable_nonboot_cpus(). Doing so can help
    machines that require the boot cpu to be the last alive cpu during reboot
    to survive a kernel restart.

    This fixes one reboot issue seen on imx6q (Cortex-A9 Quad). The machine
    requires that the restart routine be run on the primary cpu rather than
    secondary ones. Otherwise, the secondary core running the restart
    routine will fail to come online after reboot.

    Signed-off-by: Shawn Guo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Guo
     

27 Sep, 2012

2 commits


31 Jul, 2012

2 commits

  • If argv_split() failed, the code will end up calling argv_free(NULL). Fix
    it up and clean things up a bit.

    Addresses Coverity report 703573.

    Cc: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: WANG Cong
    Cc: Alan Cox
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Setting "error" to the error number on failure is enough. There is no
    need to set the "error" variable to zero in each switch case, since it is
    already initialized to zero. Also replace "return 0" in the switch cases
    with a break statement.

    Signed-off-by: Sasikantha babu
    Acked-by: Kees Cook
    Acked-by: Serge E. Hallyn
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasikantha babu
     

12 Jul, 2012

1 commit

  • The "no other files mapped" requirement from my previous patch (c/r: prctl:
    update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
    paranoid: it forbids the operation even if only one shared-anon vma is
    mapped.

    Let's check instead that the current mm->exe_file is already unmapped; in
    this case the exe_file symlink is already outdated and changing it is
    reasonable.

    Plus, this patch fixes the exit code when the operation succeeds.

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Cyrill Gorcunov
    Tested-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Cc: Matt Helsley
    Cc: Kees Cook
    Cc: KOSAKI Motohiro
    Cc: Tejun Heo
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

21 Jun, 2012

1 commit

  • During merging of the PR_GET_TID_ADDRESS patch the code was misplaced (it
    ended up under PR_MCE_KILL); as a result no one can use this option.

    Fix it by moving the code snippet to the proper place.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Andrey Vagin
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

08 Jun, 2012

4 commits

  • In commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") the stack allocated via clone() is marked in
    /proc/<pid>/maps as [stack:%d], thus it might be outside the former
    mm->start_stack/end_stack values (and even have some custom VMA flags
    set).

    So, to be able to restore mm->start_stack/end_stack, drop the vma flags
    test, but still require the underlying VMA to exist.

    As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
    requires CAP_SYS_RESOURCE to be granted.

    Signed-off-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Zero is written at clear_tid_address when the process exits. This
    functionality is used by pthread_join().

    We already have sys_set_tid_address() to change this address for the
    current task but there is no way to obtain it from user space.

    Without the ability to find this address and dump it we can't restore
    pthread'ed apps which call pthread_join() once they have been restored.

    This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
    the current process to obtain its own clear_tid_address.

    This feature is available only if CONFIG_CHECKPOINT_RESTORE is set.

    [akpm@linux-foundation.org: fix prctl numbering]
    Signed-off-by: Andrew Vagin
    Signed-off-by: Cyrill Gorcunov
    Cc: Pedro Alves
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Make sure the address being set is greater than mmap_min_addr (as
    suggested by Kees Cook).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Serge Hallyn
    Cc: Tejun Heo
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new
    mm_struct::exe_file").

    After removing mm->num_exe_file_vmas the kernel keeps mm->exe_file until
    the final mmput(); it never becomes NULL while the task is alive.

    We can check for other mapped files in mm instead of checking
    mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in
    order to forbid changing mm->exe_file a second time.

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Cc: Matt Helsley
    Cc: Kees Cook
    Cc: KOSAKI Motohiro
    Cc: Tejun Heo
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

01 Jun, 2012

4 commits

  • When we do restore we would like to have a way to setup a former
    mm_struct::exe_file so that /proc/pid/exe would point to the original
    executable file a process had at checkpoint time.

    For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
    file descriptor which will be set as a source for new /proc/$pid/exe
    symlink.

    Note it allows changing /proc/$pid/exe only if there are no VM_EXECUTABLE
    vmas present for the current process, simply because this feature is
    specific to C/R and mm::num_exe_file_vmas becomes meaningless after that.

    To minimize the number of transitions the /proc/pid/exe symlink might
    have, this feature is implemented in a one-shot manner. Thus once changed
    the symlink can't be changed again. This should help sysadmins monitor the
    symlinks of all processes running in a system.

    In particular one could make a snapshot of processes and raise an alarm if
    there are unexpected changes of /proc/pid/exe's in a system.

    Note -- this feature is available only if CONFIG_CHECKPOINT_RESTORE is set
    and the caller must have the CAP_SYS_RESOURCE capability granted, otherwise
    the request to change the symlink will be rejected.

    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Pavel Emelyanov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • During checkpoint we dump whole process memory to a file and the dump
    includes process stack memory. But among stack data itself, the stack
    carries additional parameters such as command line arguments, environment
    data and auxiliary vector.

    So during the restore procedure, once we've restored the stack data
    itself, we need to set up mm_struct::arg_start/end and env_start/end so
    that the restored process is able to find the command line arguments and
    environment data it had at checkpoint time. The same applies to the
    auxiliary vector.

    For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
    ENV_END | AUXV) codes are introduced.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Both kernel/sys.c and security/keys/request_key.c were inlining the exact
    same code as call_usermodehelper_fns(), so simply convert these sites to
    directly use call_usermodehelper_fns().

    Signed-off-by: Boaz Harrosh
    Cc: Oleg Nesterov
    Cc: Tetsuo Handa
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • sethostname() and setdomainname() notify userspace on failure (without
    modifying uts_kern_table). Change things so that we only notify userspace
    on success, when uts_kern_table was actually modified.

    Signed-off-by: Sasikantha babu
    Cc: Paul Gortmaker
    Cc: Greg Kroah-Hartman
    Cc: WANG Cong
    Reviewed-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasikantha babu
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes, in performance or
    operationally, with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these values specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

03 May, 2012

3 commits


14 Apr, 2012

2 commits

  • [This patch depends on luto@mit.edu's no_new_privs patch:
    https://lkml.org/lkml/2012/1/30/264
    The whole series including Andrew's patches can be found here:
    https://github.com/redpig/linux/tree/seccomp
    Complete diff here:
    https://github.com/redpig/linux/compare/1dc65fed...seccomp
    ]

    This patch adds support for seccomp mode 2. Mode 2 introduces the
    ability for unprivileged processes to install system call filtering
    policy expressed in terms of a Berkeley Packet Filter (BPF) program.
    This program will be evaluated in the kernel for each system call
    the task makes and computes a result based on data in the format
    of struct seccomp_data.

    A filter program may be installed by calling:
    struct sock_fprog fprog = { ... };
    ...
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);

    The return value of the filter program determines if the system call is
    allowed to proceed or denied. If the first filter program installed
    allows prctl(2) calls, then the above call may be made repeatedly
    by a task to further reduce its access to the kernel. All attached
    programs must be evaluated before a system call will be allowed to
    proceed.

    Filter programs will be inherited across fork/clone and execve.
    However, if the task attaching the filter is unprivileged
    (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
    ensures that unprivileged tasks cannot attach filters that affect
    privileged tasks (e.g., setuid binary).

    There are a number of benefits to this approach. A few of which are
    as follows:
    - BPF has been exposed to userland for a long time
    - BPF optimization (and JIT'ing) are well understood
    - Userland already knows its ABI: system call numbers and desired
    arguments
    - No time-of-check-time-of-use vulnerable data accesses are possible.
    - system call arguments are loaded on access only to minimize copying
    required for system call policy decisions.

    Mode 2 support is restricted to architectures that enable
    HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
    syscall_get_arguments(). The full desired scope of this feature will
    add a few minor additional requirements expressed later in this series.
    Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
    the desired additional functionality.

    No architectures are enabled in this patch.

    Signed-off-by: Will Drewry
    Acked-by: Serge Hallyn
    Reviewed-by: Indan Zupancic
    Acked-by: Eric Paris
    Reviewed-by: Kees Cook

    v18: - rebase to v3.4-rc2
    - s/chk/check/ (akpm@linux-foundation.org,jmorris@namei.org)
    - allocate with GFP_KERNEL|__GFP_NOWARN (indan@nul.nu)
    - add a comment for get_u32 regarding endianness (akpm@)
    - fix other typos, style mistakes (akpm@)
    - added acked-by
    v17: - properly guard seccomp filter needed headers (leann@ubuntu.com)
    - tighten return mask to 0x7fff0000
    v16: - no change
    v15: - add a 4 instr penalty when counting a path to account for seccomp_filter
    size (indan@nul.nu)
    - drop the max insns to 256KB (indan@nul.nu)
    - return ENOMEM if the max insns limit has been hit (indan@nul.nu)
    - move IP checks after args (indan@nul.nu)
    - drop !user_filter check (indan@nul.nu)
    - only allow explicit bpf codes (indan@nul.nu)
    - exit_code -> exit_sig
    v14: - put/get_seccomp_filter takes struct task_struct
    (indan@nul.nu,keescook@chromium.org)
    - adds seccomp_chk_filter and drops general bpf_run/chk_filter user
    - add seccomp_bpf_load for use by net/core/filter.c
    - lower max per-process/per-hierarchy: 1MB
    - moved nnp/capability check prior to allocation
    (all of the above: indan@nul.nu)
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - added a maximum instruction count per path (indan@nul.nu,oleg@redhat.com)
    - removed copy_seccomp (keescook@chromium.org,indan@nul.nu)
    - reworded the prctl_set_seccomp comment (indan@nul.nu)
    v11: - reorder struct seccomp_data to allow future args expansion (hpa@zytor.com)
    - style clean up, @compat dropped, compat_sock_fprog32 (indan@nul.nu)
    - do_exit(SIGSYS) (keescook@chromium.org, luto@mit.edu)
    - pare down Kconfig doc reference.
    - extra comment clean up
    v10: - seccomp_data has changed again to be more aesthetically pleasing
    (hpa@zytor.com)
    - calling convention is noted in a new u32 field using syscall_get_arch.
    This allows for cross-calling convention tasks to use seccomp filters.
    (hpa@zytor.com)
    - lots of clean up (thanks, Indan!)
    v9: - n/a
    v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
    - Lots of fixes courtesy of indan@nul.nu:
    -- fix up load behavior, compat fixups, and merge alloc code,
    -- renamed pc and dropped __packed, use bool compat.
    -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
    dependencies
    v7: (massive overhaul thanks to Indan, others)
    - added CONFIG_HAVE_ARCH_SECCOMP_FILTER
    - merged into seccomp.c
    - minimal seccomp_filter.h
    - no config option (part of seccomp)
    - no new prctl
    - doesn't break seccomp on systems without asm/syscall.h
    (works but arg access always fails)
    - dropped seccomp_init_task, extra free functions, ...
    - dropped the no-asm/syscall.h code paths
    - merges with network sk_run_filter and sk_chk_filter
    v6: - fix memory leak on attach compat check failure
    - require no_new_privs || CAP_SYS_ADMIN prior to filter
    installation. (luto@mit.edu)
    - s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com)
    - cleaned up Kconfig (amwang@redhat.com)
    - on block, note if the call was compat (so the # means something)
    v5: - uses syscall_get_arguments
    (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
    - uses union-based arg storage with hi/lo struct to
    handle endianness. Compromises between the two alternate
    proposals to minimize extra arg shuffling and account for
    endianness assuming userspace uses offsetof().
    (mcgrathr@chromium.org, indan@nul.nu)
    - update Kconfig description
    - add include/seccomp_filter.h and add its installation
    - (naive) on-demand syscall argument loading
    - drop seccomp_t (eparis@redhat.com)
    v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
    - now uses current->no_new_privs
    (luto@mit.edu,torvalds@linux-foundation.com)
    - assign names to seccomp modes (rdunlap@xenotime.net)
    - fix style issues (rdunlap@xenotime.net)
    - reworded Kconfig entry (rdunlap@xenotime.net)
    v3: - macros to inline (oleg@redhat.com)
    - init_task behavior fixed (oleg@redhat.com)
    - drop creator entry and extra NULL check (oleg@redhat.com)
    - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
    - adds tentative use of "always_unprivileged" as per
    torvalds@linux-foundation.org and luto@mit.edu
    v2: - (patch 2 only)
    Signed-off-by: James Morris

    Will Drewry
     
  • With this change, calling
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
    disables privilege granting operations at execve-time. For example, a
    process will not be able to execute a setuid binary to change their uid
    or gid if this bit is set. The same is true for file capabilities.

    Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
    LSMs respect the requested behavior.

    To determine if the NO_NEW_PRIVS bit is set, a task may call
    prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
    It returns 1 if set and 0 if it is not set. If any of the arguments are
    non-zero, it will return -1 and set errno to EINVAL.
    (PR_SET_NO_NEW_PRIVS behaves similarly.)

    This functionality is desired for the proposed seccomp filter patch
    series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
    system call behavior for itself and its child tasks without being
    able to impact the behavior of a more privileged task.

    Another potential use is making certain privileged operations
    unprivileged. For example, chroot may be considered "safe" if it cannot
    affect privileged tasks.

    Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
    set and AppArmor is in use. It is fixed in a subsequent patch.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Will Drewry
    Acked-by: Eric Paris
    Acked-by: Kees Cook

    v18: updated change desc
    v17: using new define values as per 3.4
    Signed-off-by: James Morris

    Andy Lutomirski
     

08 Apr, 2012

4 commits

  • Modify alloc_uid to take a kuid and make the user hash table global.
    Stop holding a reference to the user namespace in struct user_struct.

    This simplifies the code and makes the per user accounting not
    care about which user namespace a uid happens to appear in.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Start distinguishing between internal kernel uids and gids and
    values that userspace can use. This is done by introducing two
    new types: kuid_t and kgid_t. These types and their associated
    functions and infrastructure are declared in the new header
    uidgid.h.

    Ultimately there will be a different implementation of the mapping
    functions for use with user namespaces. But to keep it simple
    we introduce the mapping functions first to separate the meat
    from the mechanical code conversions.

    Export overflowuid and overflowgid so we can use from_kuid_munged
    and from_kgid_munged in modular code.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Optimize performance and prepare for the removal of the user_ns reference
    from user_struct. Remove the slow long walk through cred->user->user_ns and
    instead go straight to cred->user_ns.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • In struct cred the user member is and has always been declared struct user_struct *user.
    At most a constant struct cred will have a constant pointer to non-constant user_struct
    so remove this unnecessary cast.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

29 Mar, 2012

1 commit

  • In the case of a child pid namespace, rebooting the system does not really
    make sense. When the pid namespace is used in conjunction with the other
    namespaces in order to create a linux container, the reboot syscall leads
    to some problems.

    A container can reboot the host. That can be fixed by dropping the
    CAP_SYS_BOOT capability, but then we are unable to correctly poweroff/
    halt/reboot a container and the container stays stuck at shutdown time
    with the container's init process waiting indefinitely.

    After several attempts, no solution from userspace was found to reliably
    handle the shutdown from a container.

    This patch proposes to make the init process of the child pid namespace
    exit with a signal status set to: SIGINT if the child pid namespace
    called "halt/poweroff" and SIGHUP if the child pid namespace called
    "reboot". When the reboot syscall is called and we are not in the initial
    pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
    "RESTART", and "RESTART2". Otherwise we return EINVAL.

    Returning EINVAL is also an easy way to check if this feature is supported
    by the kernel when invoking another 'reboot' option like CAD.

    This way the parent process of the child pid namespace knows whether it
    rebooted or not and can take the right decision.

    Test case:
    ==========

    #define _GNU_SOURCE /* for clone() */
    #include <alloca.h>
    #include <stdio.h>
    #include <sched.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/reboot.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include <linux/reboot.h>

    /* Runs as init of the new pid namespace and issues the reboot command. */
    static int do_reboot(void *arg)
    {
        int *cmd = arg;

        if (reboot(*cmd))
            printf("failed to reboot(%d): %m\n", *cmd);
        return 0;
    }

    /* Returns 0 if the child pid namespace was killed with signal 'sig'. */
    int test_reboot(int cmd, int sig)
    {
        long stack_size = 4096;
        void *stack = alloca(stack_size) + stack_size;
        int status;
        pid_t ret;

        ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
        if (ret < 0) {
            printf("failed to clone: %m\n");
            return -1;
        }

        if (wait(&status) < 0) {
            printf("unexpected wait error: %m\n");
            return -1;
        }

        if (!WIFSIGNALED(status)) {
            printf("child process exited but was not signaled\n");
            return -1;
        }

        if (WTERMSIG(status) != sig) {
            printf("signal termination is not the one expected\n");
            return -1;
        }

        return 0;
    }

    int main(int argc, char *argv[])
    {
        int status;

        status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
        if (status < 0)
            return 1;
        printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
        if (status < 0)
            return 1;
        printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
        if (status < 0)
            return 1;
        printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
        if (status < 0)
            return 1;
        printf("reboot(LINUX_REBOOT_CMD_POWER_OFF) succeed\n");

        /* CAD_ON must fail with EINVAL, so the child exits normally. */
        status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
        if (status >= 0) {
            printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
            return 1;
        }
        printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");

        return 0;
    }

    [akpm@linux-foundation.org: tweak and add comments]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Daniel Lezcano
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Reviewed-by: Oleg Nesterov
    Cc: Michael Kerrisk
    Cc: "Eric W. Biederman"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     

24 Mar, 2012

1 commit

  • Userspace service managers/supervisors need to track their started
    services. Many services daemonize by double-forking and get implicitly
    re-parented to PID 1. The service manager will no longer be able to
    receive the SIGCHLD signals for them, and is no longer in charge of
    reaping the children with wait(). All information about the children is
    lost at the moment PID 1 cleans up the re-parented processes.

    With this prctl, a service manager process can mark itself as a sort of
    'sub-init', able to stay as the parent for all orphaned processes
    created by the started services. All SIGCHLD signals will be delivered
    to the service manager.

    For a service manager, receiving SIGCHLD and doing wait() is much
    preferred over any possible asynchronous notification about specific
    PIDs, because the service manager has full access to the child process
    data in /proc and the PID cannot be re-used until the wait(), which the
    service manager itself is in charge of, has happened.

    As a side effect, the relevant parent PID information does not get lost
    by a double-fork, which results in a more elaborate process tree and
    'ps' output:

    before:
    # ps afx
    253 ? Ss 0:00 /bin/dbus-daemon --system --nofork
    294 ? Sl 0:00 /usr/libexec/polkit-1/polkitd
    328 ? S 0:00 /usr/sbin/modem-manager
    608 ? Sl 0:00 /usr/libexec/colord
    658 ? Sl 0:00 /usr/libexec/upowerd
    819 ? Sl 0:00 /usr/libexec/imsettings-daemon
    916 ? Sl 0:00 /usr/libexec/udisks-daemon
    917 ? S 0:00 \_ udisks-daemon: not polling any devices

    after:
    # ps afx
    294 ? Ss 0:00 /bin/dbus-daemon --system --nofork
    426 ? Sl 0:00 \_ /usr/libexec/polkit-1/polkitd
    449 ? S 0:00 \_ /usr/sbin/modem-manager
    635 ? Sl 0:00 \_ /usr/libexec/colord
    705 ? Sl 0:00 \_ /usr/libexec/upowerd
    959 ? Sl 0:00 \_ /usr/libexec/udisks-daemon
    960 ? S 0:00 | \_ udisks-daemon: not polling any devices
    977 ? Sl 0:00 \_ /usr/libexec/packagekitd

    This prctl is orthogonal to PID namespaces. PID namespaces are isolated
    from each other, while a service management process usually requires the
    services to live in the same namespace, to be able to talk to each
    other.

    Users of this will be the systemd per-user instance, which provides
    init-like functionality for the user's login session and D-Bus, which
    activates bus services on-demand. Both need init-like capabilities to
    be able to properly keep track of the services they start.

    Many thanks to Oleg for several rounds of review and insights.

    [akpm@linux-foundation.org: fix comment layout and spelling]
    [akpm@linux-foundation.org: add lengthy code comment from Oleg]
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Lennart Poettering
    Signed-off-by: Kay Sievers
    Acked-by: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lennart Poettering
     

16 Mar, 2012

1 commit

  • CAP_SYS_ADMIN is already overloaded left and right, so to have more
    fine-grained access control use CAP_SYS_RESOURCE here.

    CAP_SYS_RESOURCE is chosen because this prctl option allows the current
    process to adjust some fields of the memory map descriptor, which
    represent what the process owns: pointers to code, data, stack
    segments, command line, auxiliary vector data, etc.

    Suggested-by: Michael Kerrisk
    Acked-by: Kees Cook
    Acked-by: Michael Kerrisk
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Paul Bolle
    Cc: KOSAKI Motohiro
    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

13 Jan, 2012

1 commit

  • When we restore a task we need to set up text, data and data heap sizes
    from userspace to the values a task had at checkpoint time. This patch
    adds auxiliary prctl codes for that.

    While most of them have a statistical nature (their values are involved
    in the calculation of /proc/<pid>/statm output), the start_brk and brk
    values are used to compute the allowed size of program data segment
    expansion, which means arbitrary changes to these values might be a
    dangerous operation. So to restrict access the following requirements
    are applied to prctl calls:

    - The process has to have CAP_SYS_ADMIN capability granted.
    - For all opcodes except start_brk/brk members an appropriate
      VMA area must exist and should fit certain VMA flags,
      such as:
        - code segment must be executable but not writable;
        - data segment must not be executable.

    start_brk/brk values must not intersect with data segment and must not
    exceed RLIMIT_DATA resource limit.
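
    The rules above can be sketched as plain predicates. This is a
    user-space illustration, not the kernel code: the function names are
    hypothetical, PROT_* flags stand in for the kernel's VMA flags, and the
    way the RLIMIT_DATA comparison adds heap and data sizes is an
    assumption about how the limit is meant to be applied:

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Code segment: must be executable and must not be writable. */
bool code_prot_ok(int prot)
{
    return (prot & PROT_EXEC) && !(prot & PROT_WRITE);
}

/* Data segment: must not be executable. */
bool data_prot_ok(int prot)
{
    return !(prot & PROT_EXEC);
}

/* Heap range [start_brk, brk) must not intersect the data segment
 * [data_start, data_end), and heap plus data must fit in rlim_data. */
bool brk_range_ok(unsigned long start_brk, unsigned long brk,
                  unsigned long data_start, unsigned long data_end,
                  unsigned long rlim_data)
{
    if (brk < start_brk || data_end < data_start)
        return false;                   /* malformed ranges */
    if (start_brk < data_end && brk > data_start)
        return false;                   /* heap overlaps the data segment */
    return (brk - start_brk) + (data_end - data_start) <= rlim_data;
}
```

    Such checks only narrow what a privileged caller can do; they do not
    replace the capability check.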

    Still the main guard is CAP_SYS_ADMIN capability check.

    Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support
    otherwise these prctl calls will return -EINVAL.

    [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

15 Dec, 2011

1 commit


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

1 commit

  • Adding support for poll() in sysctl fs allows userspace to receive
    notifications of changes in sysctl entries. This adds an infrastructure to
    allow files in sysctl fs to be pollable and implements it for hostname and
    domainname.
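
    From user space this could be used roughly as follows. The helper names
    are hypothetical, and the choice of POLLPRI (with POLLERR, which poll()
    always reports) as the change indication is an assumption about how the
    notification is surfaced:

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Open a sysctl file such as /proc/sys/kernel/hostname for watching.
 * Returns a file descriptor, or -1 on error. */
int sysctl_watch_open(const char *path)
{
    return open(path, O_RDONLY);
}

/* Wait up to timeout_ms for the entry to change.
 * Returns 1 if a change was signalled, 0 on timeout, -1 on error. */
int sysctl_wait_change(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLPRI };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    if (pfd.revents & POLLNVAL)
        return -1;                      /* fd was not a valid open file */
    return (pfd.revents & (POLLERR | POLLPRI)) ? 1 : 0;
}
```

    After a change is signalled, a watcher would presumably seek back to
    offset 0 and read the new value before polling again.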

    [akpm@linux-foundation.org: s/declare/define/ for definitions]
    Signed-off-by: Lucas De Marchi
    Cc: Greg KH
    Cc: Kay Sievers
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas De Marchi
     

31 Oct, 2011

2 commits

  • These files were implicitly relying on coming in via
    module.h, as without it we get things like:

    kernel/power/suspend.c:100: error: implicit declaration of function ‘usermodehelper_disable’
    kernel/power/suspend.c:109: error: implicit declaration of function ‘usermodehelper_enable’
    kernel/power/user.c:254: error: implicit declaration of function ‘usermodehelper_disable’
    kernel/power/user.c:261: error: implicit declaration of function ‘usermodehelper_enable’

    kernel/sys.c:317: error: implicit declaration of function ‘usermodehelper_disable’
    kernel/sys.c:1816: error: implicit declaration of function ‘call_usermodehelper_setup’
    kernel/sys.c:1822: error: implicit declaration of function ‘call_usermodehelper_setfns’
    kernel/sys.c:1824: error: implicit declaration of function ‘call_usermodehelper_exec’

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include
    +#include

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

25 Oct, 2011

1 commit


17 Oct, 2011

1 commit

  • The size is always valid, but variable-length arrays generate worse code
    for no good reason (unless the function happens to be inlined and the
    compiler sees the length for the simple constant it is).

    Also, there seems to be some code generation problem on POWER, where
    Henrik Bakken reports that register r28 can get corrupted under some
    subtle circumstances (interrupt happening at the wrong time?). That all
    indicates some seriously broken compiler issues, but since variable
    length arrays are bad regardless, there's little point in trying to
    chase it down.
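
    As a generic illustration of the pattern (not the code touched by this
    commit), a scratch buffer can be given a fixed compile-time bound and
    the length clamped to it. SCRATCH_SIZE and copy_via_scratch() are
    hypothetical names:

```c
#include <stddef.h>
#include <string.h>

#define SCRATCH_SIZE 65   /* hypothetical compile-time upper bound */

/* Copy up to len bytes from src to dst through a fixed-size scratch
 * buffer. Returns the number of bytes actually copied. */
size_t copy_via_scratch(char *dst, const char *src, size_t len)
{
    char scratch[SCRATCH_SIZE];         /* instead of: char scratch[len]; */
    size_t n = len < sizeof(scratch) ? len : sizeof(scratch);

    memcpy(scratch, src, n);
    memcpy(dst, scratch, n);
    return n;
}
```

    The stack usage is then known at compile time, at the cost of an
    explicit clamp on the length.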

    "Just don't do that, then".

    Reported-by: Henrik Grindal Bakken
    Cc: Benjamin Herrenschmidt
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Sep, 2011

1 commit

  • Add an event to monitor comm value changes of tasks. Such an event
    becomes vital if someone wants to control the threads of a process in
    different ways.

    A natural characteristic of a thread is its comm value, and helpfully
    application developers can change it at runtime. Reporting such events
    via the proc connector allows fine-grained monitoring and control. For
    instance, a process control daemon listening to the proc connector and
    following comm value policies can place specific threads into assigned
    cgroup partitions.

    It might be possible to achieve a pale, partial, one-shot likeness
    without this update: an application changes the comm value of a thread
    generator task beforehand, a new thread is then cloned, and the proc
    connector listener gets the fork event and reads the new thread's comm
    value from the procfs stat file. But this change visibly simplifies and
    extends the matter.
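
    On the application side, changing the comm value at runtime can look
    like the sketch below, which assumes prctl(PR_SET_NAME) is the
    mechanism in use. The helper names are hypothetical; COMM_LEN mirrors
    the kernel's 16-byte comm field, terminating NUL included:

```c
#include <string.h>
#include <sys/prctl.h>

#define COMM_LEN 16   /* comm field size, including the terminating NUL */

/* Set the comm value of the calling thread. Longer names are truncated
 * by the kernel. Returns 0 on success, -1 on error. */
int set_thread_comm(const char *name)
{
    return prctl(PR_SET_NAME, (unsigned long)name, 0UL, 0UL, 0UL);
}

/* Read the comm value of the calling thread into buf.
 * Returns 0 on success, -1 on error. */
int get_thread_comm(char buf[COMM_LEN])
{
    memset(buf, 0, COMM_LEN);
    return prctl(PR_GET_NAME, (unsigned long)buf, 0UL, 0UL, 0UL);
}
```

    A worker thread could call set_thread_comm("worker-1") right after it
    starts, and a daemon following comm value policies could then act on
    the resulting event.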

    Signed-off-by: Vladimir Zapolskiy
    Acked-by: Evgeniy Polyakov
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Vladimir Zapolskiy
     

26 Aug, 2011

1 commit

  • I ran into a couple of programs which broke with the new Linux 3.0
    version. Some of those were binary only. I tried to use LD_PRELOAD to
    work around it, but it was quite difficult and in one case impossible
    because of a mix of 32bit and 64bit executables.

    For example, all kinds of management software from HP doesn't work, unless
    we pretend to run a 2.6 kernel.

    $ uname -a
    Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux

    $ hpacucli ctrl all show

    Error: No controllers detected.

    $ rpm -qf /usr/sbin/hpacucli
    hpacucli-8.75-12.0

    Another notable case is that Python now reports "linux3" from
    sys.platform, which in turn can break things that were checking
    sys.platform == "linux2":

    https://bugzilla.mozilla.org/show_bug.cgi?id=664564

    It seems pretty clear to me though it's a bug in the apps that are using
    '==' instead of .startswith(), but this allows us to unbreak broken
    programs.

    This patch adds a UNAME26 personality that makes the kernel report a
    2.6.40+x version number instead. The x is the x in 3.x.
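
    The mapping and the opt-in can be sketched in user space as below. This
    is not the kernel implementation: uname26_release() only reproduces the
    arithmetic described above for 3.x kernels, and enable_uname26() is a
    hypothetical helper that assumes the flag is exposed as UNAME26 in
    <sys/personality.h>:

```c
#include <stdio.h>
#include <sys/personality.h>

/* Format the release reported for kernel 3.x under UNAME26: "2.6.(40+x)".
 * Returns the value of snprintf(). */
int uname26_release(unsigned int x, char *buf, size_t size)
{
    return snprintf(buf, size, "2.6.%u", 40u + x);
}

/* Add UNAME26 to the current personality, keeping the other flags.
 * Returns 0 on success, -1 on error. */
int enable_uname26(void)
{
    int cur = personality(0xffffffffUL);    /* query without changing */

    if (cur == -1)
        return -1;
    return personality((unsigned long)cur | UNAME26) == -1 ? -1 : 0;
}
```

    A wrapper program would call enable_uname26() and then exec the target
    program, which keeps the personality across the exec.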

    I know this is somewhat ugly, but I didn't find a better workaround, and
    compatibility to existing programs is important.

    Some programs also read /proc/sys/kernel/osrelease. This can be worked
    around in user space with mount --bind (and a mount namespace).

    To use:

    wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
    gcc -o uname26 uname26.c
    ./uname26 program

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

12 Aug, 2011

1 commit

  • The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
    check in set_user() to check for NPROC exceeding via setuid() and
    similar functions.

    Before the check there was a possibility to greatly exceed the allowed
    number of processes by an unprivileged user if the program relied on
    rlimit only. But the check created a new security threat: many poorly
    written programs simply don't check the setuid() return code and
    believe it cannot fail if executed with root privileges. So the check
    is removed in this patch because privilege escalations related to
    buggy programs are too frequent.
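
    For reference, the defensive pattern those programs were missing is
    small. A minimal sketch, with switch_to_uid() as a hypothetical helper
    name:

```c
#include <sys/types.h>
#include <unistd.h>

/* Switch to the given uid and verify that the switch really happened.
 * Returns 0 on success, -1 if the caller must not continue. */
int switch_to_uid(uid_t uid)
{
    if (setuid(uid) != 0)
        return -1;                      /* never ignore this failure */
    if (getuid() != uid || geteuid() != uid)
        return -1;                      /* still running with other ids */
    return 0;
}
```

    A daemon would bail out instead of going on to execve() when this
    returns -1.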

    NPROC can still be enforced in the common code flow of daemons
    spawning user processes. Most daemons do fork()+setuid()+execve().
    The check introduced in execve() (1) enforces the same limit as in
    setuid() and (2) doesn't create similar security issues.

    Neil Brown suggested tracking which specific process has exceeded the
    limit by setting the PF_NPROC_EXCEEDED process flag. With the change
    only this process would fail on execve(), and other processes' execve()
    behaviour is not changed.

    Solar Designer suggested re-checking whether the NPROC limit is still
    exceeded at the moment of execve(). If the process was sleeping for
    days between set*uid() and execve(), and the NPROC counter stepped down
    under the limit, a deferred execve() failure because the NPROC limit
    was exceeded days ago would be unexpected. If the limit is not exceeded
    anymore, we clear the flag on successful calls to execve() and fork().

    The flag is also cleared on successful calls to set_user() as the limit
    was exceeded for the previous user, not the current one.

    A similar check was introduced in -ow patches (without the process flag).

    v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

    Reviewed-by: James Morris
    Signed-off-by: Vasiliy Kulikov
    Acked-by: NeilBrown
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov