21 Dec, 2012

3 commits

  • Merge the rest of Andrew's patches for -rc1:
    "A bunch of fixes and misc missed-out-on things.

    That'll do for -rc1. I still have a batch of IPC patches which still
    have a possible bug report which I'm chasing down."

    * emailed patches from Andrew Morton: (25 commits)
    keys: use keyring_alloc() to create module signing keyring
    keys: fix unreachable code
    sendfile: allows bypassing of notifier events
    SGI-XP: handle non-fatal traps
    fat: fix incorrect function comment
    Documentation: ABI: remove testing/sysfs-devices-node
    proc: fix inconsistent lock state
    linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
    memcg: don't register hotcpu notifier from ->css_alloc()
    checkpatch: warn on uapi #includes that #include
    mm: cma: WARN if freed memory is still in use
    exec: do not leave bprm->interp on stack
    ...

    Linus Torvalds
     
  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
     
  • If a series of scripts are executed, each triggering module loading via
    unprintable bytes in the script header, kernel stack contents can leak
    into the command line.

    Normally execution of binfmt_script and binfmt_misc happens recursively.
    However, when modules are enabled, and unprintable bytes exist in the
    bprm->buf, execution will restart after attempting to load matching
    binfmt modules. Unfortunately, the logic in binfmt_script and
    binfmt_misc does not expect to get restarted. They leave bprm->interp
    pointing to their local stack. This means on restart bprm->interp is
    left pointing into unused stack memory which can then be copied into the
    userspace argv areas.

    After additional study, it seems that both recursion and restart remain
    the desirable way to handle exec with scripts, misc, and modules. As
    such, we need to protect the changes to interp.

    This changes the logic to require allocation for any changes to the
    bprm->interp. To avoid adding a new kmalloc to every exec, the default
    value is left as-is. Only when passing through binfmt_script or
    binfmt_misc does an allocation take place.

    For a proof of concept, see DoTest.sh from:

    http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/

    Signed-off-by: Kees Cook
    Cc: halfdog
    Cc: P J P
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
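
The ownership rule described in this commit can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the kernel patch: `struct binprm_sketch`, `change_interp()` and `release_interp()` are hypothetical names, and `malloc`/`free` stand in for kernel allocation. The idea it shows is that `interp` either aliases `filename` (the default, with no allocation) or owns a heap copy, so it never points into a handler's stack frame.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the parts of the exec state that matter here. */
struct binprm_sketch {
    const char *filename; /* path being executed; not owned by this struct */
    const char *interp;   /* aliases filename by default, else a heap copy */
};

/*
 * Point interp at a private heap copy of new_interp.
 * Returns 0 on success, -1 if allocation fails (interp is left unchanged).
 */
int change_interp(struct binprm_sketch *bprm, const char *new_interp)
{
    size_t len;
    char *copy;

    assert(bprm != NULL && new_interp != NULL);
    len = strlen(new_interp) + 1;
    copy = malloc(len);
    if (copy == NULL)
        return -1;
    /* Copy before freeing, so new_interp may alias the current interp. */
    memcpy(copy, new_interp, len);

    /* Free a previous copy, but never the default alias of filename. */
    if (bprm->interp != bprm->filename)
        free((char *)bprm->interp);
    bprm->interp = copy;
    return 0;
}

/* Drop any heap copy and fall back to the default alias. */
void release_interp(struct binprm_sketch *bprm)
{
    assert(bprm != NULL);
    if (bprm->interp != bprm->filename)
        free((char *)bprm->interp);
    bprm->interp = bprm->filename;
}
```

A handler that builds the interpreter path in a local buffer would call `change_interp()` before returning; a later restart then still sees a valid string, because the copy outlives the handler's stack frame.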
     

20 Dec, 2012

1 commit

  • All architectures have
    CONFIG_GENERIC_KERNEL_THREAD
    CONFIG_GENERIC_KERNEL_EXECVE
    __ARCH_WANT_SYS_EXECVE
    None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
    of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
    Kill the conditionals and make both callers use do_execve().

    Signed-off-by: Al Viro

    Al Viro
     

18 Dec, 2012

3 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton: (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc//fdinfo/ fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc//fdinfo/ output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • To avoid an explosion of request_module calls on a chain of abusive
    scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
    as maximum recursion depth is hit, the error will fail all the way back
    up the chain, aborting immediately.

    This also has the side-effect of stopping the user's shell from attempting
    to reexecute the top-level file as a shell script. As seen in the
    dash source:

        if (cmd != path_bshell && errno == ENOEXEC) {
                *argv-- = cmd;
                *argv = cmd = path_bshell;
                goto repeat;
        }

    The above logic was designed for running scripts automatically that lacked
    the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
    things continue to behave as the shell expects.

    Additionally, when tracking recursion, the binfmt handlers should not be
    involved. The recursion being tracked is the depth of calls through
    search_binary_handler(), so that function should be exclusively responsible
    for tracking the depth.

    Signed-off-by: Kees Cook
    Cc: halfdog
    Cc: P J P
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc//ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc//ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allow the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc//ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     

29 Nov, 2012

5 commits


19 Nov, 2012

1 commit

  • When performing an exec where the binary lives in one user namespace and
    the execing process lives in another user namespace there is the possibility
    that the target uids can not be represented.

    Instead of failing the exec simply ignore the suid/sgid bits and run
    the binary with lower privileges. We already do this in the case
    of MNT_NOSUID so this should be a well tested code path.

    As the user and group are not changed this should not introduce any
    security issues.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

26 Oct, 2012

1 commit


13 Oct, 2012

2 commits

  • ...and fix up the callers. For do_file_open_root, just declare a
    struct filename on the stack and fill out the .name field. For
    do_filp_open, make it also take a struct filename pointer, and fix up its
    callers to call it appropriately.

    For filp_open, add a variant that takes a struct filename pointer and turn
    filp_open into a wrapper around it.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • getname() is intended to copy pathname strings from userspace into a
    kernel buffer. The result is just a string in kernel space. It would
    however be quite helpful to be able to attach some ancillary info to
    the string.

    For instance, we could attach some audit-related info to reduce the
    amount of audit-related processing needed. When auditing is enabled,
    we could also call getname() on the string more than once and not
    need to recopy it from userspace.

    This patchset converts the getname()/putname() interfaces to return
    a struct instead of a string. For now, the struct just tracks the
    string in kernel space and the original userland pointer for it.

    Later, we'll add other information to the struct as it becomes
    convenient.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
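
The shape of the converted interface can be sketched in user-space C. This is an illustrative analog only: `struct filename_sketch`, `getname_sketch()` and `putname_sketch()` are hypothetical names, and a plain `memcpy` stands in for the copy from userspace. It shows the two fields the message describes, a private copy of the string plus the original pointer it came from.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical analog of the struct returned by the converted getname(). */
struct filename_sketch {
    char *name;       /* private copy of the path string */
    const char *uptr; /* original caller-supplied pointer, kept for reference */
};

/* Copy the path and remember where it came from. Returns NULL on failure. */
struct filename_sketch *getname_sketch(const char *user_path)
{
    struct filename_sketch *fn;
    size_t len;

    assert(user_path != NULL);
    len = strlen(user_path) + 1;

    fn = malloc(sizeof(*fn));
    if (fn == NULL)
        return NULL;

    fn->name = malloc(len);
    if (fn->name == NULL) {
        free(fn);
        return NULL;
    }
    memcpy(fn->name, user_path, len);
    fn->uptr = user_path;
    return fn;
}

/* Release the copy and the struct together. NULL is accepted. */
void putname_sketch(struct filename_sketch *fn)
{
    if (fn == NULL)
        return;
    free(fn->name);
    free(fn);
}
```

Returning a struct instead of a bare string leaves room for further fields later without changing every caller again, which is the motivation the message gives.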
     

10 Oct, 2012

1 commit

  • Pull generic execve() changes from Al Viro:
    "This introduces the generic kernel_thread() and kernel_execve()
    functions, and switches x86, arm, alpha, um and s390 over to them."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (26 commits)
    s390: convert to generic kernel_execve()
    s390: switch to generic kernel_thread()
    s390: fold kernel_thread_helper() into ret_from_fork()
    s390: fold execve_tail() into start_thread(), convert to generic sys_execve()
    um: switch to generic kernel_thread()
    x86, um/x86: switch to generic sys_execve and kernel_execve
    x86: split ret_from_fork
    alpha: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
    alpha: switch to generic kernel_thread()
    alpha: switch to generic sys_execve()
    arm: get rid of execve wrapper, switch to generic execve() implementation
    arm: optimized current_pt_regs()
    arm: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
    arm: split ret_from_fork, simplify kernel_thread() [based on patch by rmk]
    generic sys_execve()
    generic kernel_execve()
    new helper: current_pt_regs()
    preparation for generic kernel_thread()
    um: kill thread->forking
    um: let signal_delivered() do SIGTRAP on singlestepping into handler
    ...

    Linus Torvalds
     

09 Oct, 2012

2 commits

  • During mremap(), the destination VMA is generally placed after the
    original vma in rmap traversal order: in move_vma(), we always have
    new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
    vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.

    When the destination VMA is placed after the original in rmap traversal
    order, we can avoid taking the rmap locks in move_ptes().

    Essentially, this reintroduces the optimization that had been disabled in
    "mm anon rmap: remove anon_vma_moveto_tail". The difference is that we
    don't try to impose the rmap traversal order; instead we just rely on
    things being in the desired order in the common case and fall back to
    taking locks in the uncommon case. Also we skip the i_mmap_mutex in
    addition to the anon_vma lock: in both cases, the vmas are traversed in
    increasing vm_pgoff order with ties resolved in tree insertion order.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Change de_thread() to use KILLABLE rather than UNINTERRUPTIBLE while
    waiting for other threads. The only complication is that we should
    clear ->group_exit_task and ->notify_count before we return, and we
    should do this under tasklist_lock. -EAGAIN is used to match the
    initial signal_group_exit() check/return, it doesn't really matter.

    This fixes the (unlikely) race with coredump. de_thread() checks
    signal_group_exit() before it starts to kill the subthreads, but this
    can't help if another CLONE_VM (but non CLONE_THREAD) task starts the
    coredumping after de_thread() unlocks ->siglock. In this case the
    killed sub-thread can block in exit_mm() waiting for coredump_finish(),
    execing thread waits for that sub-thread, and the coredumping thread
    waits for execing thread. Deadlock.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

06 Oct, 2012

2 commits

  • Cosmetic. Change setup_new_exec() and task_dumpable() to use
    SUID_DUMPABLE_ENABLED for /bin/grep.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Create a new header file, fs/coredump.h, which contains functions only
    used by the new coredump.c. It also moves do_coredump to the
    include/linux/coredump.h header file, for consistency.

    Signed-off-by: Alex Kelly
    Reviewed-by: Josh Triplett
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Kelly
     

03 Oct, 2012

1 commit

  • This prepares for making core dump functionality optional.

    The variable "suid_dumpable" and associated functions are left in fs/exec.c
    because they're used elsewhere, such as in ptrace.

    Signed-off-by: Alex Kelly
    Reviewed-by: Josh Triplett
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Alex Kelly
     

01 Oct, 2012

2 commits

  • Selected by __ARCH_WANT_SYS_EXECVE in unistd.h. Requires
    * working current_pt_regs()
    * *NOT* doing a syscall-in-kernel kind of kernel_execve()
      implementation. Using generic kernel_execve() is fine.

    Signed-off-by: Al Viro

    Al Viro
     
  • based mostly on arm and alpha versions. Architectures can define
    __ARCH_WANT_KERNEL_EXECVE and use it, provided that
    * they have working current_pt_regs(), even for kernel threads.
    * kernel_thread-spawned threads do have space for pt_regs
      in the normal location. Normally that's as simple as switching to
      generic kernel_thread() and making sure that kernel threads do *not*
      go through return from syscall path; call the payload from equivalent
      of ret_from_fork if we are in a kernel thread (or just have separate
      ret_from_kernel_thread and make copy_thread() use it instead of
      ret_from_fork in kernel thread case).
    * they have ret_from_kernel_execve(); it is called after
      successful do_execve() done by kernel_execve() and gets normal
      pt_regs location passed to it as argument. It's essentially
      a longjmp() analog - it should set sp, etc. to the situation
      expected at the return for syscall and go there. Eventually
      the need for that sucker will disappear, but that'll take some
      surgery on kernel_thread() payloads.

    Signed-off-by: Al Viro

    Al Viro
     

27 Sep, 2012

3 commits


20 Sep, 2012

1 commit


02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

31 Jul, 2012

3 commits

  • In commit 898b374af6f7 ("exec: replace call_usermodehelper_pipe with use
    of umh init function and resolve limit"), the core limits recursive
    check value was changed from 0 to 1, but the corresponding comments were
    not updated.

    Signed-off-by: Jovi Zhang
    Cc: Oleg Nesterov
    Cc: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jovi Zhang
     
  • When suid_dumpable=2, detect unsafe core_pattern settings and warn when
    they are seen.

    Signed-off-by: Kees Cook
    Suggested-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • When the suid_dumpable sysctl is set to "2", and there is no core dump
    pipe defined in the core_pattern sysctl, a local user can cause core files
    to be written to root-writable directories, potentially with
    user-controlled content.

    This means an admin can unknowingly reintroduce a variation of
    CVE-2006-2451, allowing local users to gain root privileges.

    $ cat /proc/sys/fs/suid_dumpable
    2
    $ cat /proc/sys/kernel/core_pattern
    core
    $ ulimit -c unlimited
    $ cd /
    $ ls -l core
    ls: cannot access core: No such file or directory
    $ touch core
    touch: cannot touch `core': Permission denied
    $ OHAI="evil-string-here" ping localhost >/dev/null 2>&1 &
    $ pid=$!
    $ sleep 1
    $ kill -SEGV $pid
    $ ls -l core
    -rw------- 1 root kees 458752 Jun 21 11:35 core
    $ sudo strings core | grep evil
    OHAI=evil-string-here

    While cron has been fixed to abort reading a file when there is any
    parse error, there are still other sensitive directories that will read
    any file present and skip unparsable lines.

    Instead of introducing a suid_dumpable=3 mode and breaking all users of
    mode 2, this only disables the unsafe portion of mode 2 (writing to disk
    via relative path). Most users of mode 2 (e.g. Chrome OS) already use
    a core dump pipe handler, so this change will not break them. For the
    situations where a pipe handler is not defined but mode 2 is still
    active, crash dumps will only be written to fully qualified paths. If a
    relative path is defined (e.g. the default "core" pattern), dump
    attempts will trigger a printk yelling about the lack of a fully
    qualified path.

    Signed-off-by: Kees Cook
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
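
The decision described in this commit can be sketched as a small classifier. This is a simplified model under stated assumptions: `classify_core_pattern()`, `enum dump_target` and `SUID_DUMPABLE_MODE2` are hypothetical names, and the function looks only at the sysctl value and the first character of the pattern, whereas the kernel makes the decision per dump. It shows the three outcomes: a pipe handler is always acceptable, a fully qualified path is acceptable, and a relative path is refused in mode 2.

```c
#include <assert.h>
#include <stddef.h>

/* Possible outcomes for a dump attempt in this sketch. */
enum dump_target {
    DUMP_TO_PIPE, /* core_pattern starts with '|': hand off to a helper */
    DUMP_TO_FILE, /* write a core file at the given path */
    DUMP_REFUSED  /* mode 2 with a relative path: do not write */
};

/* The suid_dumpable value "2" discussed above; the macro name is assumed. */
#define SUID_DUMPABLE_MODE2 2

enum dump_target classify_core_pattern(int suid_dumpable, const char *pattern)
{
    assert(pattern != NULL);

    /* A pipe handler receives the dump; no file lands in the cwd. */
    if (pattern[0] == '|')
        return DUMP_TO_PIPE;

    /* In mode 2, only fully qualified paths are allowed. */
    if (suid_dumpable == SUID_DUMPABLE_MODE2 && pattern[0] != '/')
        return DUMP_REFUSED;

    return DUMP_TO_FILE;
}
```

With the default `core` pattern and mode 2, this returns `DUMP_REFUSED`, which corresponds to the case where the message says a printk is emitted instead of a dump.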
     

30 Jul, 2012

1 commit


27 Jul, 2012

1 commit

  • Recently, glibc made a change to suppress sign-conversion warnings in
    FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the
    kernel's definition of __NFDBITS if applications #include
    after including . A build failure would
    be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
    flags to gcc.

    It was suggested that the kernel should either match the glibc
    definition of __NFDBITS or remove that entirely. The current in-kernel
    uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
    uses of the related __FDELT and __FDMASK defines. Given that, we'll
    continue the cleanup that was started with commit 8b3d1cda4f5f
    ("posix_types: Remove fd_set macros") and drop the remaining unused
    macros.

    Additionally, linux/time.h has similar macros defined that expand to
    nothing so we'll remove those at the same time.

    Reported-by: Jeff Law
    Suggested-by: Linus Torvalds
    CC:
    Signed-off-by: Josh Boyer
    [ .. and fix up whitespace as per akpm ]
    Signed-off-by: Linus Torvalds

    Josh Boyer
     

21 Jun, 2012

1 commit

  • do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does
    put_user(clear_child_tid) which can update task->rss_stat and thus make
    mm->rss_stat inconsistent. This triggers the "BUG:" printk in check_mm().

    Let's fix this bug in the safest way, and optimize/cleanup this later.

    Reported-by: Markus Trippelsdorf
    Signed-off-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Jun, 2012

2 commits

  • This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.

    It's horribly and utterly broken for at least the following reasons:

    - calling sync_mm_rss() from mmput() is fundamentally wrong, because
    there's absolutely no reason to believe that the task that does the
    mmput() always does it on its own VM. Example: fork, ptrace, /proc -
    you name it.

    - calling it *after* having done mmdrop() on it is doubly insane, since
    the mm struct may well be gone now.

    - testing mm against NULL before you call it is insane too, since a
    NULL mm there would have caused oopses long before.

    .. and those are just the three bugs I found before I decided to give up
    looking for more and revert it asap. I should have caught it before I
    even took it, but I trusted Andrew too much.

    Cc: Konstantin Khlebnikov
    Cc: Markus Trippelsdorf
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • mm->rss_stat counters have per-task delta: task->rss_stat. Before
    changing task->mm pointer the kernel must flush this delta with
    sync_mm_rss().

    do_exit() already calls sync_mm_rss() to flush the rss-counters before
    committing the rss statistics into task->signal->maxrss, taskstats,
    audit and other stuff. Unfortunately the kernel does this before
    calling mm_release(), which can call put_user() for processing
    task->clear_child_tid. So at this point we can trigger page-faults and
    task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
    inconsistent and check_mm() will print something like this:

    | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
    | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

    This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
    out of do_exit() and calls it earlier. After mm_release() there should
    be no pagefaults.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Markus Trippelsdorf
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: [3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
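
The counter inconsistency described in this commit can be modelled in a few lines of user-space C. This is a toy model, not kernel code: `struct mm_sketch`, `struct task_sketch` and the three functions are hypothetical, and a single counter stands in for the per-type rss counters. It shows why a page fault after the flush leaves a stale per-task delta behind.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: one shared counter per mm, one cached delta per task. */
struct mm_sketch {
    long rss;
};

struct task_sketch {
    struct mm_sketch *mm;
    long rss_delta; /* not yet folded into mm->rss */
};

/* A page fault is accounted in the per-task delta first. */
void account_fault(struct task_sketch *t, long pages)
{
    t->rss_delta += pages;
}

/* Fold the per-task delta into the shared counter. */
void sync_rss(struct task_sketch *t)
{
    assert(t->mm != NULL);
    t->mm->rss += t->rss_delta;
    t->rss_delta = 0;
}

/*
 * Detach the task from its mm. Returns 1 if the delta had been flushed,
 * 0 if a stale delta would have been lost (the "Bad rss-counter" case).
 */
int detach_mm(struct task_sketch *t)
{
    int clean = (t->rss_delta == 0);

    t->mm = NULL;
    return clean;
}
```

In this model, calling `account_fault()` between `sync_rss()` and `detach_mm()` plays the role of the late `put_user()` fault; flushing again immediately before detaching restores consistency.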
     

01 Jun, 2012

1 commit


24 May, 2012

2 commits

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes, in performance or
    operation, with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these values specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
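
The type-safety argument for kuid_t in this pull request can be sketched as follows. This is an illustration of the general technique (wrapping an integer in a single-member struct), with hypothetical names throughout: `kuid_sketch_t`, `make_kuid_sketch()`, `uid_eq_sketch()`, `uid_valid_sketch()` and `from_kuid_sketch()`. It leaves out namespace mapping entirely and only shows why unconverted comparisons stop compiling.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int raw_uid_t;

/*
 * A struct with one member is a distinct type: code that still writes
 * "a == b" or "a == 0" on these values fails to compile, which flags
 * permission checks that have not been converted.
 */
typedef struct {
    raw_uid_t val;
} kuid_sketch_t;

/* (raw_uid_t)-1 is reserved as the "no valid mapping" value. */
#define INVALID_RAW_UID ((raw_uid_t)-1)

static inline kuid_sketch_t make_kuid_sketch(raw_uid_t uid)
{
    kuid_sketch_t k = { uid };
    return k;
}

/* Comparisons go through an explicit helper instead of ==. */
static inline bool uid_eq_sketch(kuid_sketch_t a, kuid_sketch_t b)
{
    return a.val == b.val;
}

static inline bool uid_valid_sketch(kuid_sketch_t k)
{
    return k.val != INVALID_RAW_UID;
}

/* Unwrap for display or return to a caller; only valid ids are allowed. */
static inline raw_uid_t from_kuid_sketch(kuid_sketch_t k)
{
    assert(uid_valid_sketch(k));
    return k.val;
}
```

The wrapper is the same size as the plain integer, so this style of checking is done by the compiler and adds no data to each id.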
     
  • Pull fpu state cleanups from Ingo Molnar:
    "This tree streamlines further aspects of FPU handling by eliminating
    the prepare_to_copy() complication and moving that logic to
    arch_dup_task_struct().

    It also fixes the FPU dumps in threaded core dumps, removes an old
    (and now invalid) assumption plus micro-optimizes the exit path by
    avoiding an FPU save for dead tasks."

    Fixed up trivial add-add conflict in arch/sh/kernel/process.c that came
    in because we now do the FPU handling in arch_dup_task_struct() rather
    than the legacy (and now gone) prepare_to_copy().

    * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, fpu: drop the fpu state during thread exit
    x86, xsave: remove thread_has_fpu() bug check in __sanitize_i387_state()
    coredump: ensure the fpu state is flushed for proper multi-threaded core dump
    fork: move the real prepare_to_copy() users to arch_dup_task_struct()

    Linus Torvalds