13 Aug, 2020

5 commits

  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • The path_noexec() check, like the regular file check, was happening too
    late, letting LSMs see impossible execve()s. Check it earlier as well in
    may_open() and collect the redundant fs/exec.c path_noexec() test under
    the same robustness comment as the S_ISREG() check.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
    struct open_flags open_exec_flags = {
    .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
    .acc_mode = MAY_EXEC,
    ...
    do_filp_open(dfd, filename, open_flags)
    path_openat(nameidata, open_flags, flags)
    file = alloc_empty_file(open_flags, current_cred());
    do_open(nameidata, file, open_flags)
    may_open(path, acc_mode, open_flag)
    /* new location of MAY_EXEC vs path_noexec() test */
    inode_permission(inode, MAY_OPEN | acc_mode)
    security_inode_permission(inode, acc_mode)
    vfs_open(path, file)
    do_dentry_open(file, path->dentry->d_inode, open)
    security_file_open(f)
    open()
    /* old location of path_noexec() test */

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Aleksa Sarai
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The execve(2)/uselib(2) syscalls have always rejected non-regular files.
    Recently, it was noticed that a deadlock was introduced when trying to
    execute pipes, as the S_ISREG() test was happening too late. This was
    fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
    during execve()"), but it was added after inode_permission() had already
    run, which meant LSMs could see bogus attempts to execute non-regular
    files.

    Move the test into the other inode type checks (which already look for
    other pathological conditions[1]). Since there is no need to use
    FMODE_EXEC while we still have access to "acc_mode", also switch the test
    to MAY_EXEC.

    Also include a comment with the redundant S_ISREG() checks at the end of
    execve(2)/uselib(2) to note that they are present to avoid any mistakes.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
    struct open_flags open_exec_flags = {
    .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
    .acc_mode = MAY_EXEC,
    ...
    do_filp_open(dfd, filename, open_flags)
    path_openat(nameidata, open_flags, flags)
    file = alloc_empty_file(open_flags, current_cred());
    do_open(nameidata, file, open_flags)
    may_open(path, acc_mode, open_flag)
    /* new location of MAY_EXEC vs S_ISREG() test */
    inode_permission(inode, MAY_OPEN | acc_mode)
    security_inode_permission(inode, acc_mode)
    vfs_open(path, file)
    do_dentry_open(file, path->dentry->d_inode, open)
    /* old location of FMODE_EXEC vs S_ISREG() test */
    security_file_open(f)
    open()

    [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "Relocate execve() sanity checks", v2.

    While looking at the code paths for the proposed O_MAYEXEC flag, I saw
    some things that looked like they should be fixed up.

    exec: Change uselib(2) IS_SREG() failure to EACCES
    This just regularizes the return code on uselib(2).

    exec: Move S_ISREG() check earlier
    This moves the S_ISREG() check even earlier than it was already.

    exec: Move path_noexec() check earlier
    This adds the path_noexec() check to the same place as the
    S_ISREG() check.

    This patch (of 3):

    Change uselib(2)' S_ISREG() error return to EACCES instead of EINVAL so
    the behavior matches execve(2), and the seemingly documented value. The
    "not a regular file" failure mode of execve(2) is explicitly
    documented[1], but it is not mentioned in uselib(2)[2] which does,
    however, say that open(2) and mmap(2) errors may apply. The documentation
    for open(2) does not include a "not a regular file" error[3], but mmap(2)
    does[4], and it is EACCES.

    [1] http://man7.org/linux/man-pages/man2/execve.2.html#ERRORS
    [2] http://man7.org/linux/man-pages/man2/uselib.2.html#ERRORS
    [3] http://man7.org/linux/man-pages/man2/open.2.html#ERRORS
    [4] http://man7.org/linux/man-pages/man2/mmap.2.html#ERRORS

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Acked-by: Christian Brauner
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-1-keescook@chromium.org
    Link: http://lkml.kernel.org/r/20200605160013.3954297-2-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Both exec and exit want to ensure that the uaccess routines actually do
    access user pointers. Use the newly added force_uaccess_begin helper
    instead of an open coded set_fs for that to prepare for kernel builds
    where set_fs() does not exist.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Nick Hu
    Cc: Greentime Hu
    Cc: Vincent Chen
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

21 Jul, 2020

6 commits

  • To allow the kernel not to play games with set_fs to call exec
    implement kernel_execve. The function kernel_execve takes pointers
    into kernel memory and copies the values pointed to onto the new
    userspace stack.

    The calls with arguments from kernel space of do_execve are replaced
    with calls to kernel_execve.

    The calls do_execve and do_execveat are made static as there are now
    no callers outside of exec.

    The comments that mention do_execve are updated to refer to
    kernel_execve or execve depending on the circumstances. In addition
    to correcting the comments, this makes it easy to grep for do_execve
    and verify it is not used.

    Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for implementiong kernel_execve (which will take kernel
    pointers not userspace pointers) factor out bprm_stack_limits out of
    prepare_arg_pages. This separates the counting which depends upon the
    getting data from userspace from the calculations of the stack limits
    which is usable in kernel_execve.

    The remove prepare_args_pages and compute bprm->argc and bprm->envc
    directly in do_execveat_common, before bprm_stack_limits is called.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Factor bprm_execve
    out of do_execve_common to separate out the copying of arguments
    to the newe stack, and the rest of exec.

    In separating bprm_execve from do_execve_common the copying
    of the arguments onto the new stack happens earlier.

    As the copying of the arguments does not depend any security hooks,
    files, the file table, current->in_execve, current->fs->in_exec,
    bprm->unsafe, or creds this is safe.

    Likewise the security hook security_creds_for_exec does not depend upon
    preventing the argument copying from happening.

    In addition to making it possible to implement kernel_execve that
    performs the copying differently, this separation of bprm_execve from
    do_execve_common makes for a nice separation of responsibilities making
    the exec code easier to navigate.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code that
    launches init to use set_fs so that pages coming from the kernel look like
    they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument copying
    from userspace needs to happen earlier. Move the allocation and
    initialization of bprm->mm into alloc_bprm so that the bprm->mm is
    available early to store the new user stack into. This is a prerequisite
    for copying argv and envp into the new user stack early before ther rest of
    exec.

    To keep the things consistent the cleanup of bprm->mm is moved into
    free_bprm. So that bprm->mm will be cleaned up whenever bprm->mm is
    allocated and free_bprm are called.

    Moving bprm_mm_init earlier is safe as it does not depend on any files,
    current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file
    table is shared. (AKA bprm_mm_init does not depend on any of the code that
    happens between alloc_bprm and where it was previously called.)

    This moves bprm->mm cleanup after current->fs->in_exec is set to 0. This
    is safe because current->fs->in_exec is only used to preventy taking an
    additional reference on the fs_struct.

    This moves bprm->mm cleanup after current->in_execve is set to 0. This is
    safe because current->in_execve is only used by the lsms (apparmor and
    tomoyou) and always for LSM specific functions, never for anything to do
    with the mm.

    This adds bprm->mm cleanup into the successful return path. This is safe
    because being on the successful return path implies that begin_new_exec
    succeeded and set brpm->mm to NULL. As bprm->mm is NULL bprm cleanup I am
    moving into free_bprm will do nothing.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Move the computation
    of bprm->filename and possible allocation of a name in the case
    of execveat into alloc_bprm to make that possible.

    The exectuable name, the arguments, and the environment are
    copied into the new usermode stack which is stored in bprm
    until exec passes the point of no return.

    As the executable name is copied first onto the usermode stack
    it needs to be known. As there are no dependencies to computing
    the executable name, compute it early in alloc_bprm.

    As an implementation detail if the filename needs to be generated
    because it embeds a file descriptor store that filename in a new field
    bprm->fdpath, and free it in free_bprm. Previously this was done in
    an independent variable pathbuf. I have renamed pathbuf fdpath
    because fdpath is more suggestive of what kind of path is in the
    variable. I moved fdpath into struct linux_binprm because it is
    tightly tied to the other variables in struct linux_binprm, and as
    such is needed to allow the call alloc_binprm to move.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Move the allocation
    of the bprm into it's own function (alloc_bprm) and move the call of
    alloc_bprm before unshare_files so that bprm can ultimately be
    allocated, the arguments can be placed on the new stack, and then the
    bprm can be passed into the core of exec.

    Neither the allocation of struct binprm nor the unsharing depend upon each
    other so swapping the order in which they are called is trivially safe.

    To keep things consistent the order of cleanup at the end of
    do_execve_common swapped to match the order of initialization.

    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2020

1 commit

  • Now that the last callser has been removed remove this code from exec.

    For anyone thinking of resurrecing do_execve_file please note that
    the code was buggy in several fundamental ways.

    - It did not ensure the file it was passed was read-only and that
    deny_write_access had been called on it. Which subtlely breaks
    invaniants in exec.

    - The caller of do_execve_file was expected to hold and put a
    reference to the file, but an extra reference for use by exec was
    not taken so that when exec put it's reference to the file an
    underflow occured on the file reference count.

    - The point of the interface was so that a pathname did not need to
    exist. Which breaks pathname based LSMs.

    Tetsuo Handa originally reported these issues[1]. While it was clear
    that deny_write_access was missing the fundamental incompatibility
    with the passed in O_RDWR filehandle was not immediately recognized.

    All of these issues were fixed by modifying the usermode driver code
    to have a path, so it did not need this hack.

    Reported-by: Tetsuo Handa
    [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
    v1: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/87lfk54p0m.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-10-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

09 Jun, 2020

2 commits


05 Jun, 2020

5 commits

  • Merge yet more updates from Andrew Morton:

    - More MM work. 100ish more to go. Mike Rapoport's "mm: remove
    __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

    - Various other little subsystems

    * emailed patches from Andrew Morton : (127 commits)
    lib/ubsan.c: fix gcc-10 warnings
    tools/testing/selftests/vm: remove duplicate headers
    selftests: vm: pkeys: fix multilib builds for x86
    selftests: vm: pkeys: use the correct page size on powerpc
    selftests/vm/pkeys: override access right definitions on powerpc
    selftests/vm/pkeys: test correct behaviour of pkey-0
    selftests/vm/pkeys: introduce a sub-page allocator
    selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
    selftests/vm/pkeys: associate key on a mapped page and detect write violation
    selftests/vm/pkeys: associate key on a mapped page and detect access violation
    selftests/vm/pkeys: improve checks to determine pkey support
    selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
    selftests/vm/pkeys: fix number of reserved powerpc pkeys
    selftests/vm/pkeys: introduce powerpc support
    selftests/vm/pkeys: introduce generic pkey abstractions
    selftests: vm: pkeys: use the correct huge page size
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
    selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
    selftests/vm/pkeys: fix pkey_disable_clear()
    selftests: vm: pkeys: add helpers for pkey bits
    ...

    Linus Torvalds
     
  • Currently copy_string_kernel is just a wrapper around copy_strings that
    simplifies the calling conventions and uses set_fs to allow passing a
    kernel pointer. But due to the fact the we only need to handle a single
    kernel argument pointer, the logic can be sigificantly simplified while
    getting rid of the set_fs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200501104105.2621149-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • copy_strings_kernel is always used with a single argument,
    adjust the calling convention to that.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200501104105.2621149-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Pull execve updates from Eric Biederman:
    "Last cycle for the Nth time I ran into bugs and quality of
    implementation issues related to exec that could not be easily be
    fixed because of the way exec is implemented. So I have been digging
    into exec and cleanup up what I can.

    I don't think I have exec sorted out enough to fix the issues I
    started with but I have made some headway this cycle with 4 sets of
    changes.

    - promised cleanups after introducing exec_update_mutex

    - trivial cleanups for exec

    - control flow simplifications

    - remove the recomputation of bprm->cred

    The net result is code that is a bit easier to understand and work
    with and a decrease in the number of lines of code (if you don't count
    the added tests)"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
    exec: Compute file based creds only once
    exec: Add a per bprm->file version of per_clear
    binfmt_elf_fdpic: fix execfd build regression
    selftests/exec: Add binfmt_script regression test
    exec: Remove recursion from search_binary_handler
    exec: Generic execfd support
    exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
    exec: Move the call of prepare_binprm into search_binary_handler
    exec: Allow load_misc_binary to call prepare_binprm unconditionally
    exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
    exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
    exec: Teach prepare_exec_creds how exec treats uids & gids
    exec: Set the point of no return sooner
    exec: Move handling of the point of no return to the top level
    exec: Run sync_mm_rss before taking exec_update_mutex
    exec: Fix spelling of search_binary_handler in a comment
    exec: Move the comment from above de_thread to above unshare_sighand
    exec: Rename flush_old_exec begin_new_exec
    exec: Move most of setup_new_exec into flush_old_exec
    exec: In setup_new_exec cache current in the local variable me
    ...

    Linus Torvalds
     
  • Pull proc updates from Eric Biederman:
    "This has four sets of changes:

    - modernize proc to support multiple private instances

    - ensure we see the exit of each process tid exactly

    - remove has_group_leader_pid

    - use pids not tasks in posix-cpu-timers lookup

    Alexey updated proc so each mount of proc uses a new superblock. This
    allows people to actually use mount options with proc with no fear of
    messing up another mount of proc. Given the kernel's internal mounts
    of proc for things like uml this was a real problem, and resulted in
    Android's hidepid mount options being ignored and introducing security
    issues.

    The rest of the changes are small cleanups and fixes that came out of
    my work to allow this change to proc. In essence it is swapping the
    pids in de_thread during exec which removes a special case the code
    had to handle. Then updating the code to stop handling that special
    case"

    * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: proc_pid_ns takes super_block as an argument
    remove the no longer needed pid_alive() check in __task_pid_nr_ns()
    posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
    posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
    posix-cpu-timers: Extend rcu_read_lock removing task_struct references
    signal: Remove has_group_leader_pid
    exec: Remove BUG_ON(has_group_leader_pid)
    posix-cpu-timer: Unify the now redundant code in lookup_task
    posix-cpu-timer: Tidy up group_leader logic in lookup_task
    proc: Ensure we see the exit of each process tid exactly once
    rculist: Add hlists_swap_heads_rcu
    proc: Use PIDTYPE_TGID in next_tgid
    Use proc_pid_ns() to get pid_namespace from the proc superblock
    proc: use named enums for better readability
    proc: use human-readable values for hidepid
    docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
    proc: add option to mount only a pids subset
    proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
    proc: allow to mount many instances of proc in one pid namespace
    proc: rename struct proc_fs_info to proc_fs_opts

    Linus Torvalds
     

30 May, 2020

2 commits

  • Move the computation of creds from prepare_binfmt into begin_new_exec
    so that the creds need only be computed once. This is just code
    reorganization no semantic changes of any kind are made.

    Moving the computation is safe. I have looked through the kernel and
    verified none of the binfmts look at bprm->cred directly, and that
    there are no helpers that look at bprm->cred indirectly. Which means
    that it is not a problem to compute the bprm->cred later in the
    execution flow as it is not used until it becomes current->cred.

    A new function bprm_creds_from_file is added to contain the work that
    needs to be done. bprm_creds_from_file first computes which file
    bprm->executable or most likely bprm->file that the bprm->creds
    will be computed from.

    The funciton bprm_fill_uid is updated to receive the file instead of
    accessing bprm->file. The now unnecessary work needed to reset the
    bprm->cred->euid, and bprm->cred->egid is removed from brpm_fill_uid.
    A small comment to document that bprm_fill_uid now only deals with the
    work to handle suid and sgid files. The default case is already
    heandled by prepare_exec_creds.

    The function security_bprm_repopulate_creds is renamed
    security_bprm_creds_from_file and now is explicitly passed the file
    from which to compute the creds. The documentation of the
    bprm_creds_from_file security hook is updated to explain when the hook
    is called and what it needs to do. The file is passed from
    cap_bprm_creds_from_file into get_file_caps so that the caps are
    computed for the appropriate file. The now unnecessary work in
    cap_bprm_creds_from_file to reset the ambient capabilites has been
    removed. A small comment to document that the work of
    cap_bprm_creds_from_file is to read capabilities from the files
    secureity attribute and derive capabilities from the fact the
    user had uid 0 has been added.

    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • There is a small bug in the code that recomputes parts of bprm->cred
    for every bprm->file. The code never recomputes the part of
    clear_dangerous_personality_flags it is responsible for.

    Which means that in practice if someone creates a sgid script
    the interpreter will not be able to use any of:
    READ_IMPLIES_EXEC
    ADDR_NO_RANDOMIZE
    ADDR_COMPAT_LAYOUT
    MMAP_PAGE_ZERO.

    This accentially clearing of personality flags probably does
    not matter in practice because no one has complained
    but it does make the code more difficult to understand.

    Further remaining bug compatible prevents the recomputation from being
    removed and replaced by simply computing bprm->cred once from the
    final bprm->file.

    Making this change removes the last behavior difference between
    computing bprm->creds from the final file and recomputing
    bprm->cred several times. Which allows this behavior change
    to be justified for it's own reasons, and for any but hunts
    looking into why the behavior changed to wind up here instead
    of in the code that will follow that computes bprm->cred
    from the final bprm->file.

    This small logic bug appears to have existed since the code
    started clearing dangerous personality bits.

    History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support")
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 May, 2020

6 commits

  • Recursion in kernel code is generally a bad idea as it can overflow
    the kernel stack. Recursion in exec also hides that the code is
    looping and that the loop changes bprm->file.

    Instead of recursing in search_binary_handler have the methods that
    would recurse set bprm->interpreter and return 0. Modify exec_binprm
    to loop when bprm->interpreter is set. Consolidate all of the
    reassignments of bprm->file in that loop to make it clear what is
    going on.

    The structure of the new loop in exec_binprm is that all errors return
    immediately, while successful completion (ret == 0 &&
    !bprm->interpreter) just breaks out of the loop and runs what
    exec_bprm has always run upon successful completion.

    Fail if the an interpreter is being call after execfd has been set.
    The code has never properly handled an interpreter being called with
    execfd being set and with reassignments of bprm->file and the
    assignment of bprm->executable in generic code it has finally become
    possible to test and fail when if this problematic condition happens.

    With the reassignments of bprm->file and the assignment of
    bprm->executable moved into the generic code add a test to see if
    bprm->executable is being reassigned.

    In search_binary_handler remove the test for !bprm->file. With all
    reassignments of bprm->file moved to exec_binprm bprm->file can never
    be NULL in search_binary_handler.

    Link: https://lkml.kernel.org/r/87sgfwyd84.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Most of the support for passing the file descriptor of an executable
    to an interpreter already lives in the generic code and in binfmt_elf.
    Rework the fields in binfmt_elf that deal with executable file
    descriptor passing to make executable file descriptor passing a first
    class concept.

    Move the fd_install from binfmt_misc into begin_new_exec after the new
    creds have been installed. This means that accessing the file through
    /proc//fd/N is able to see the creds for the new executable
    before allowing access to the new executables files.

    Performing the install of the executables file descriptor after
    the point of no return also means that nothing special needs to
    be done on error. The exiting of the process will close all
    of it's open files.

    Move the would_dump from binfmt_misc into begin_new_exec right
    after would_dump is called on the bprm->file. This makes it
    obvious this case exists and that no nesting of bprm->file is
    currently supported.

    In binfmt_misc the movement of fd_install into generic code means
    that it's special error exit path is no longer needed.

    Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The code in prepare_binary_handler needs to be run every time
    search_binary_handler is called so move the call into search_binary_handler
    itself to make the code simpler and easier to understand.

    Link: https://lkml.kernel.org/r/87d070zrvx.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Reviewed-by: James Morris
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Add a flag preserve_creds that binfmt_misc can set to prevent
    credentials from being updated. This allows binfmt_misc to always
    call prepare_binprm. Allowing the credential computation logic to be
    consolidated.

    Not replacing the credentials with the interpreters credentials is
    safe because because an open file descriptor to the executable is
    passed to the interpreter. As the interpreter does not need to
    reopen the executable it is guaranteed to see the same file that
    exec sees.

    Ref: c407c033de84 ("[PATCH] binfmt_misc: improve calculation of interpreter's credentials")
    Link: https://lkml.kernel.org/r/87imgszrwo.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Rename bprm->cap_elevated to bprm->active_secureexec and initialize it
    in prepare_binprm instead of in cap_bprm_set_creds. Initializing
    bprm->active_secureexec in prepare_binprm allows multiple
    implementations of security_bprm_repopulate_creds to play nicely with
    each other.

    Rename security_bprm_set_creds to security_bprm_reopulate_creds to
    emphasize that this path recomputes part of bprm->cred. This
    recomputation avoids the time of check vs time of use problems that
    are inherent in unix #! interpreters.

    In short two renames and a move in the location of initializing
    bprm->active_secureexec.

    Link: https://lkml.kernel.org/r/87o8qkzrxp.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Today security_bprm_set_creds has several implementations:
    apparmor_bprm_set_creds, cap_bprm_set_creds, selinux_bprm_set_creds,
    smack_bprm_set_creds, and tomoyo_bprm_set_creds.

    Except for cap_bprm_set_creds they all test bprm->called_set_creds and
    return immediately if it is true. The function cap_bprm_set_creds
    ignores bprm->calld_sed_creds entirely.

    Create a new LSM hook security_bprm_creds_for_exec that is called just
    before prepare_binprm in __do_execve_file, resulting in a LSM hook
    that is called exactly once for the entire of exec. Modify the bits
    of security_bprm_set_creds that only want to be called once per exec
    into security_bprm_creds_for_exec, leaving only cap_bprm_set_creds
    behind.

    Remove bprm->called_set_creds all of it's former users have been moved
    to security_bprm_creds_for_exec.

    Add or upate comments a appropriate to bring them up to date and
    to reflect this change.

    Link: https://lkml.kernel.org/r/87v9kszrzh.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Acked-by: Casey Schaufler # For the LSM and Smack bits
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

18 May, 2020

1 commit


17 May, 2020

1 commit

  • I goofed when I added mm->user_ns support to would_dump. I missed the
    fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
    binfmt_script bprm->file is reassigned. Which made the move of
    would_dump from setup_new_exec to __do_execve_file before exec_binprm
    incorrect as it can result in would_dump running on the script instead
    of the interpreter of the script.

    The net result is that the code stopped making unreadable interpreters
    undumpable. Which allows them to be ptraced and written to disk
    without special permissions. Oops.

    The move was necessary because the call in set_new_exec was after
    bprm->mm was no longer valid.

    To correct this mistake move the misplaced would_dump from
    __do_execve_file into flos_old_exec, before exec_mmap is called.

    I tested and confirmed that without this fix I can attach with gdb to
    a script with an unreadable interpreter, and with this fix I can not.

    Cc: stable@vger.kernel.org
    Fixes: f84df2a6f268 ("exec: Ensure mm->user_ns contains the execed files")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 May, 2020

3 commits

  • Make the code more robust by marking the point of no return sooner.
    This ensures that future code changes don't need to worry about how
    they return errors if they are past this point.

    This results in no actual change in behavior as __do_execve_file does
    not force SIGSEGV when there is a pending fatal signal pending past
    the point of no return. Further the only error returns from de_thread
    and exec_mmap that can occur result in fatal signals being pending.

    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87sgga5klu.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Move the handing of the point of no return from search_binary_handler
    into __do_execve_file so that it is easier to find, and to keep
    things robust in the face of change.

    Make it clear that an existing fatal signal will take precedence over
    a forced SIGSEGV by not forcing SIGSEGV if a fatal signal is already
    pending. This does not change the behavior but it saves a reader
    of the code the tedium of reading and understanding force_sig
    and the signal delivery code.

    Update the comment in begin_new_exec about where SIGSEGV is forced.

    Keep point_of_no_return from being a mystery by documenting
    what the code is doing where it forces SIGSEGV if the
    code is past the point of no return.

    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87y2q25knl.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Like exec_mm_release sync_mm_rss is about flushing out the state of
    the old_mm, which does not need to happen under exec_update_mutex.

    Make this explicit by moving sync_mm_rss outside of exec_update_mutex.

    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/875zd66za3.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 May, 2020

2 commits


08 May, 2020

4 commits

  • There is and has been for a very long time been a lot more going on in
    flush_old_exec than just flushing the old state. After the movement
    of code from setup_new_exec there is a whole lot more going on than
    just flushing the old executables state.

    Rename flush_old_exec to begin_new_exec to more accurately reflect
    what this function does.

    Reviewed-by: Kees Cook
    Reviewed-by: Greg Ungerer
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The current idiom for the callers is:

    flush_old_exec(bprm);
    set_personality(...);
    setup_new_exec(bprm);

    In 2010 Linus split flush_old_exec into flush_old_exec and
    setup_new_exec. With the intention that setup_new_exec be what is
    called after the processes new personality is set.

    Move the code that doesn't depend upon the personality from
    setup_new_exec into flush_old_exec. This is to facilitate future
    changes by having as much code together in one function as possible.

    To see why it is safe to move this code please note that effectively
    this change moves the personality setting in the binfmt and the following
    three lines of code after everything except unlocking the mutexes:
    arch_pick_mmap_layout
    arch_setup_new_exec
    mm->task_size = TASK_SIZE

    The function arch_pick_mmap_layout at most sets:
    mm->get_unmapped_area
    mm->mmap_base
    mm->mmap_legacy_base
    mm->mmap_compat_base
    mm->mmap_compat_legacy_base
    which nothing in flush_old_exec or setup_new_exec depends on.

    The function arch_setup_new_exec only sets architecture specific
    state and the rest of the functions only deal in state that applies
    to all architectures.

    The last line just sets mm->task_size and again nothing in flush_old_exec
    or setup_new_exec depend on task_size.

    Ref: 221af7f87b97 ("Split 'flush_old_exec' into two functions")
    Reviewed-by: Kees Cook
    Reviewed-by: Greg Ungerer
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • At least gcc 8.3 when generating code for x86_64 has a hard time
    consolidating multiple calls to current aka get_current(), and winds
    up unnecessarily rereading %gs:current_task several times in
    setup_new_exec.

    Caching the value of current in the local variable of me generates
    slightly better and shorter assembly.

    Reviewed-by: Kees Cook
    Reviewed-by: Greg Ungerer
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The two functions are now always called one right after the
    other so merge them together to make future maintenance easier.

    Reviewed-by: Kees Cook
    Reviewed-by: Greg Ungerer
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman