09 Jan, 2021

1 commit

  • [ Upstream commit f7cfd871ae0c5008d94b6f66834e7845caa93c15 ]

    Recently syzbot reported[0] that there is a deadlock amongst the users
    of exec_update_mutex. The problematic lock ordering found by lockdep
    was:

    perf_event_open (exec_update_mutex -> ovl_i_mutex)
    chown (ovl_i_mutex -> sb_writes)
    sendfile (sb_writes -> p->lock)
    by reading from a proc file and writing to overlayfs
    proc_pid_syscall (p->lock -> exec_update_mutex)

    While looking at possible solutions it occured to me that all of the
    users and possible users involved only wanted to state of the given
    process to remain the same. They are all readers. The only writer is
    exec.

    There is no reason for readers to block on each other. So fix
    this deadlock by transforming exec_update_mutex into a rw_semaphore
    named exec_update_lock that only exec takes for writing.

    Cc: Jann Horn
    Cc: Vasiliy Kulikov
    Cc: Al Viro
    Cc: Bernd Edlinger
    Cc: Oleg Nesterov
    Cc: Christopher Yeoh
    Cc: Cyrill Gorcunov
    Cc: Sargun Dhillon
    Cc: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
    [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
    Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
    Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     

17 Oct, 2020

1 commit

  • Pull powerpc updates from Michael Ellerman:

    - A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
    it for powerpc, as well as a related fix for sparc.

    - Remove support for PowerPC 601.

    - Some fixes for watchpoints & addition of a new ptrace flag for
    detecting ISA v3.1 (Power10) watchpoint features.

    - A fix for kernels using 4K pages and the hash MMU on bare metal
    Power9 systems with > 16TB of RAM, or RAM on the 2nd node.

    - A basic idle driver for shallow stop states on Power10.

    - Tweaks to our sched domains code to better inform the scheduler about
    the hardware topology on Power9/10, where two SMT4 cores can be
    presented by firmware as an SMT8 core.

    - A series doing further reworks & cleanups of our EEH code.

    - Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
    to prevent root from overwriting kernel memory.

    - Other smaller features, fixes & cleanups.

    Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
    Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
    Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
    Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
    Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
    Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
    Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
    Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
    Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
    Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
    Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
    Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
    Yingliang, zhengbin.

    * tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
    Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
    selftests/powerpc: Fix eeh-basic.sh exit codes
    cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
    powerpc/time: Make get_tb() common to PPC32 and PPC64
    powerpc/time: Make get_tbl() common to PPC32 and PPC64
    powerpc/time: Remove get_tbu()
    powerpc/time: Avoid using get_tbl() and get_tbu() internally
    powerpc/time: Make mftb() common to PPC32 and PPC64
    powerpc/time: Rename mftbl() to mftb()
    powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
    powerpc/32s: Rename head_32.S to head_book3s_32.S
    powerpc/32s: Setup the early hash table at all time.
    powerpc/time: Remove ifdef in get_dec() and set_dec()
    powerpc: Remove get_tb_or_rtc()
    powerpc: Remove __USE_RTC()
    powerpc: Tidy up a bit after removal of PowerPC 601.
    powerpc: Remove support for PowerPC 601
    powerpc: Remove PowerPC 601
    powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
    powerpc: Remove CONFIG_PPC601_SYNC_FIX
    ...

    Linus Torvalds
     

16 Oct, 2020

1 commit

  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of char, misc, and other assorted driver subsystem
    patches for 5.10-rc1.

    There's a lot of different things in here, all over the drivers/
    directory. Some summaries:

    - soundwire driver updates

    - habanalabs driver updates

    - extcon driver updates

    - nitro_enclaves new driver

    - fsl-mc driver and core updates

    - mhi core and bus updates

    - nvmem driver updates

    - eeprom driver updates

    - binder driver updates and fixes

    - vbox minor bugfixes

    - fsi driver updates

    - w1 driver updates

    - coresight driver updates

    - interconnect driver updates

    - misc driver updates

    - other minor driver updates

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (396 commits)
    binder: fix UAF when releasing todo list
    docs: w1: w1_therm: Fix broken xref, mistakes, clarify text
    misc: Kconfig: fix a HISI_HIKEY_USB dependency
    LSM: Fix type of id parameter in kernel_post_load_data prototype
    misc: Kconfig: add a new dependency for HISI_HIKEY_USB
    firmware_loader: fix a kernel-doc markup
    w1: w1_therm: make w1_poll_completion static
    binder: simplify the return expression of binder_mmap
    test_firmware: Test partial read support
    firmware: Add request_partial_firmware_into_buf()
    firmware: Store opt_flags in fw_priv
    fs/kernel_file_read: Add "offset" arg for partial reads
    IMA: Add support for file reads without contents
    LSM: Add "contents" flag to kernel_read_file hook
    module: Call security_kernel_post_load_data()
    firmware_loader: Use security_post_load_data()
    LSM: Introduce kernel_post_load_data() hook
    fs/kernel_read_file: Add file_size output argument
    fs/kernel_read_file: Switch buffer size arg to size_t
    fs/kernel_read_file: Remove redundant size argument
    ...

    Linus Torvalds
     

05 Oct, 2020

3 commits

  • These routines are used in places outside of exec(2), so in preparation
    for refactoring them, move them into a separate source file,
    fs/kernel_read_file.c.

    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Reviewed-by: Luis Chamberlain
    Acked-by: Scott Branden
    Link: https://lore.kernel.org/r/20201002173828.2099543-5-keescook@chromium.org
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • Move kernel_read_file* out of linux/fs.h to its own linux/kernel_read_file.h
    include file. That header gets pulled in just about everywhere
    and doesn't really need functions not related to the general fs interface.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Scott Branden
    Signed-off-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mimi Zohar
    Reviewed-by: Luis Chamberlain
    Acked-by: Greg Kroah-Hartman
    Acked-by: James Morris
    Link: https://lore.kernel.org/r/20200706232309.12010-2-scott.branden@broadcom.com
    Link: https://lore.kernel.org/r/20201002173828.2099543-4-keescook@chromium.org
    Signed-off-by: Greg Kroah-Hartman

    Scott Branden
     
  • FIRMWARE_PREALLOC_BUFFER is a "how", not a "what", and confuses the LSMs
    that are interested in filtering between types of things. The "how"
    should be an internal detail made uninteresting to the LSMs.

    Fixes: a098ecd2fa7d ("firmware: support loading into a pre-allocated buffer")
    Fixes: fd90bc559bfb ("ima: based on policy verify firmware signatures (pre-allocated buffer)")
    Fixes: 4f0496d8ffa3 ("ima: based on policy warn about loading firmware (pre-allocated buffer)")
    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Reviewed-by: Luis Chamberlain
    Acked-by: Scott Branden
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20201002173828.2099543-2-keescook@chromium.org
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

01 Oct, 2020

1 commit

  • Grab actual references to the files_struct. To avoid circular references
    issues due to this, we add a per-task note that keeps track of what
    io_uring contexts a task has used. When the tasks execs or exits its
    assigned files, we cancel requests based on this tracking.

    With that, we can grab proper references to the files table, and no
    longer need to rely on stashing away ring_fd and ring_file to check
    if the ring_fd may have been closed.

    Cc: stable@vger.kernel.org # v5.5+
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Sep, 2020

1 commit

  • Reading and modifying current->mm and current->active_mm and switching
    mm should be done with irqs off, to prevent races seeing an intermediate
    state.

    This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
    invalidate"). At exec-time when the new mm is activated, the old one
    should usually be single-threaded and no longer used, unless something
    else is holding an mm_users reference (which may be possible).

    Absent other mm_users, there is also a race with preemption and lazy tlb
    switching. Consider the kernel_execve case where the current thread is
    using a lazy tlb active mm:

    call_usermodehelper()
    kernel_execve()
    old_mm = current->mm;
    active_mm = current->active_mm;
    *** preempt *** --------------------> schedule()
    prev->active_mm = NULL;
    mmdrop(prev active_mm);
    ...
    mm = mm;
    current->active_mm = mm;
    if (!old_mm)
    mmdrop(active_mm);

    If we switch back to the kernel thread from a different mm, there is a
    double free of the old active_mm, and a missing free of the new one.

    Closing this race only requires interrupts to be disabled while ->mm
    and ->active_mm are being switched, but the TLB problem requires also
    holding interrupts off over activate_mm. Unfortunately not all archs
    can do that yet, e.g., arm defers the switch if irqs are disabled and
    expects finish_arch_post_lock_switch() to be called to complete the
    flush; um takes a blocking lock in activate_mm().

    So as a first step, disable interrupts across the mm/active_mm updates
    to close the lazy tlb preempt race, and provide an arch option to
    extend that to activate_mm which allows architectures doing IPI based
    TLB shootdowns to close the second race.

    This is a bit ugly, but in the interest of fixing the bug and backporting
    before all architectures are converted this is a compromise.

    Signed-off-by: Nicholas Piggin
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com

    Nicholas Piggin
     

13 Aug, 2020

5 commits

  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • The path_noexec() check, like the regular file check, was happening too
    late, letting LSMs see impossible execve()s. Check it earlier as well in
    may_open() and collect the redundant fs/exec.c path_noexec() test under
    the same robustness comment as the S_ISREG() check.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
    struct open_flags open_exec_flags = {
    .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
    .acc_mode = MAY_EXEC,
    ...
    do_filp_open(dfd, filename, open_flags)
    path_openat(nameidata, open_flags, flags)
    file = alloc_empty_file(open_flags, current_cred());
    do_open(nameidata, file, open_flags)
    may_open(path, acc_mode, open_flag)
    /* new location of MAY_EXEC vs path_noexec() test */
    inode_permission(inode, MAY_OPEN | acc_mode)
    security_inode_permission(inode, acc_mode)
    vfs_open(path, file)
    do_dentry_open(file, path->dentry->d_inode, open)
    security_file_open(f)
    open()
    /* old location of path_noexec() test */

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Aleksa Sarai
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The execve(2)/uselib(2) syscalls have always rejected non-regular files.
    Recently, it was noticed that a deadlock was introduced when trying to
    execute pipes, as the S_ISREG() test was happening too late. This was
    fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
    during execve()"), but it was added after inode_permission() had already
    run, which meant LSMs could see bogus attempts to execute non-regular
    files.

    Move the test into the other inode type checks (which already look for
    other pathological conditions[1]). Since there is no need to use
    FMODE_EXEC while we still have access to "acc_mode", also switch the test
    to MAY_EXEC.

    Also include a comment with the redundant S_ISREG() checks at the end of
    execve(2)/uselib(2) to note that they are present to avoid any mistakes.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
    struct open_flags open_exec_flags = {
    .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
    .acc_mode = MAY_EXEC,
    ...
    do_filp_open(dfd, filename, open_flags)
    path_openat(nameidata, open_flags, flags)
    file = alloc_empty_file(open_flags, current_cred());
    do_open(nameidata, file, open_flags)
    may_open(path, acc_mode, open_flag)
    /* new location of MAY_EXEC vs S_ISREG() test */
    inode_permission(inode, MAY_OPEN | acc_mode)
    security_inode_permission(inode, acc_mode)
    vfs_open(path, file)
    do_dentry_open(file, path->dentry->d_inode, open)
    /* old location of FMODE_EXEC vs S_ISREG() test */
    security_file_open(f)
    open()

    [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "Relocate execve() sanity checks", v2.

    While looking at the code paths for the proposed O_MAYEXEC flag, I saw
    some things that looked like they should be fixed up.

    exec: Change uselib(2) IS_SREG() failure to EACCES
    This just regularizes the return code on uselib(2).

    exec: Move S_ISREG() check earlier
    This moves the S_ISREG() check even earlier than it was already.

    exec: Move path_noexec() check earlier
    This adds the path_noexec() check to the same place as the
    S_ISREG() check.

    This patch (of 3):

    Change uselib(2)' S_ISREG() error return to EACCES instead of EINVAL so
    the behavior matches execve(2), and the seemingly documented value. The
    "not a regular file" failure mode of execve(2) is explicitly
    documented[1], but it is not mentioned in uselib(2)[2] which does,
    however, say that open(2) and mmap(2) errors may apply. The documentation
    for open(2) does not include a "not a regular file" error[3], but mmap(2)
    does[4], and it is EACCES.

    [1] http://man7.org/linux/man-pages/man2/execve.2.html#ERRORS
    [2] http://man7.org/linux/man-pages/man2/uselib.2.html#ERRORS
    [3] http://man7.org/linux/man-pages/man2/open.2.html#ERRORS
    [4] http://man7.org/linux/man-pages/man2/mmap.2.html#ERRORS

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Acked-by: Christian Brauner
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-1-keescook@chromium.org
    Link: http://lkml.kernel.org/r/20200605160013.3954297-2-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Both exec and exit want to ensure that the uaccess routines actually do
    access user pointers. Use the newly added force_uaccess_begin helper
    instead of an open coded set_fs for that to prepare for kernel builds
    where set_fs() does not exist.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Nick Hu
    Cc: Greentime Hu
    Cc: Vincent Chen
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

21 Jul, 2020

6 commits

  • To allow the kernel not to play games with set_fs to call exec
    implement kernel_execve. The function kernel_execve takes pointers
    into kernel memory and copies the values pointed to onto the new
    userspace stack.

    The calls with arguments from kernel space of do_execve are replaced
    with calls to kernel_execve.

    The calls do_execve and do_execveat are made static as there are now
    no callers outside of exec.

    The comments that mention do_execve are updated to refer to
    kernel_execve or execve depending on the circumstances. In addition
    to correcting the comments, this makes it easy to grep for do_execve
    and verify it is not used.

    Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for implementiong kernel_execve (which will take kernel
    pointers not userspace pointers) factor out bprm_stack_limits out of
    prepare_arg_pages. This separates the counting which depends upon the
    getting data from userspace from the calculations of the stack limits
    which is usable in kernel_execve.

    The remove prepare_args_pages and compute bprm->argc and bprm->envc
    directly in do_execveat_common, before bprm_stack_limits is called.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Factor bprm_execve
    out of do_execve_common to separate out the copying of arguments
    to the newe stack, and the rest of exec.

    In separating bprm_execve from do_execve_common the copying
    of the arguments onto the new stack happens earlier.

    As the copying of the arguments does not depend any security hooks,
    files, the file table, current->in_execve, current->fs->in_exec,
    bprm->unsafe, or creds this is safe.

    Likewise the security hook security_creds_for_exec does not depend upon
    preventing the argument copying from happening.

    In addition to making it possible to implement kernel_execve that
    performs the copying differently, this separation of bprm_execve from
    do_execve_common makes for a nice separation of responsibilities making
    the exec code easier to navigate.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code that
    launches init to use set_fs so that pages coming from the kernel look like
    they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument copying
    from userspace needs to happen earlier. Move the allocation and
    initialization of bprm->mm into alloc_bprm so that the bprm->mm is
    available early to store the new user stack into. This is a prerequisite
    for copying argv and envp into the new user stack early before ther rest of
    exec.

    To keep the things consistent the cleanup of bprm->mm is moved into
    free_bprm. So that bprm->mm will be cleaned up whenever bprm->mm is
    allocated and free_bprm are called.

    Moving bprm_mm_init earlier is safe as it does not depend on any files,
    current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file
    table is shared. (AKA bprm_mm_init does not depend on any of the code that
    happens between alloc_bprm and where it was previously called.)

    This moves bprm->mm cleanup after current->fs->in_exec is set to 0. This
    is safe because current->fs->in_exec is only used to preventy taking an
    additional reference on the fs_struct.

    This moves bprm->mm cleanup after current->in_execve is set to 0. This is
    safe because current->in_execve is only used by the lsms (apparmor and
    tomoyou) and always for LSM specific functions, never for anything to do
    with the mm.

    This adds bprm->mm cleanup into the successful return path. This is safe
    because being on the successful return path implies that begin_new_exec
    succeeded and set brpm->mm to NULL. As bprm->mm is NULL bprm cleanup I am
    moving into free_bprm will do nothing.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Move the computation
    of bprm->filename and possible allocation of a name in the case
    of execveat into alloc_bprm to make that possible.

    The exectuable name, the arguments, and the environment are
    copied into the new usermode stack which is stored in bprm
    until exec passes the point of no return.

    As the executable name is copied first onto the usermode stack
    it needs to be known. As there are no dependencies to computing
    the executable name, compute it early in alloc_bprm.

    As an implementation detail if the filename needs to be generated
    because it embeds a file descriptor store that filename in a new field
    bprm->fdpath, and free it in free_bprm. Previously this was done in
    an independent variable pathbuf. I have renamed pathbuf fdpath
    because fdpath is more suggestive of what kind of path is in the
    variable. I moved fdpath into struct linux_binprm because it is
    tightly tied to the other variables in struct linux_binprm, and as
    such is needed to allow the call alloc_binprm to move.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Move the allocation
    of the bprm into it's own function (alloc_bprm) and move the call of
    alloc_bprm before unshare_files so that bprm can ultimately be
    allocated, the arguments can be placed on the new stack, and then the
    bprm can be passed into the core of exec.

    Neither the allocation of struct binprm nor the unsharing depend upon each
    other so swapping the order in which they are called is trivially safe.

    To keep things consistent the order of cleanup at the end of
    do_execve_common swapped to match the order of initialization.

    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2020

1 commit

  • Now that the last callser has been removed remove this code from exec.

    For anyone thinking of resurrecing do_execve_file please note that
    the code was buggy in several fundamental ways.

    - It did not ensure the file it was passed was read-only and that
    deny_write_access had been called on it. Which subtlely breaks
    invaniants in exec.

    - The caller of do_execve_file was expected to hold and put a
    reference to the file, but an extra reference for use by exec was
    not taken so that when exec put it's reference to the file an
    underflow occured on the file reference count.

    - The point of the interface was so that a pathname did not need to
    exist. Which breaks pathname based LSMs.

    Tetsuo Handa originally reported these issues[1]. While it was clear
    that deny_write_access was missing the fundamental incompatibility
    with the passed in O_RDWR filehandle was not immediately recognized.

    All of these issues were fixed by modifying the usermode driver code
    to have a path, so it did not need this hack.

    Reported-by: Tetsuo Handa
    [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
    v1: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/87lfk54p0m.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-10-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

09 Jun, 2020

2 commits


05 Jun, 2020

5 commits

  • Merge yet more updates from Andrew Morton:

    - More MM work. 100ish more to go. Mike Rapoport's "mm: remove
    __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

    - Various other little subsystems

    * emailed patches from Andrew Morton : (127 commits)
    lib/ubsan.c: fix gcc-10 warnings
    tools/testing/selftests/vm: remove duplicate headers
    selftests: vm: pkeys: fix multilib builds for x86
    selftests: vm: pkeys: use the correct page size on powerpc
    selftests/vm/pkeys: override access right definitions on powerpc
    selftests/vm/pkeys: test correct behaviour of pkey-0
    selftests/vm/pkeys: introduce a sub-page allocator
    selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
    selftests/vm/pkeys: associate key on a mapped page and detect write violation
    selftests/vm/pkeys: associate key on a mapped page and detect access violation
    selftests/vm/pkeys: improve checks to determine pkey support
    selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
    selftests/vm/pkeys: fix number of reserved powerpc pkeys
    selftests/vm/pkeys: introduce powerpc support
    selftests/vm/pkeys: introduce generic pkey abstractions
    selftests: vm: pkeys: use the correct huge page size
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
    selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
    selftests/vm/pkeys: fix pkey_disable_clear()
    selftests: vm: pkeys: add helpers for pkey bits
    ...

    Linus Torvalds
     
  • Currently copy_string_kernel is just a wrapper around copy_strings that
    simplifies the calling conventions and uses set_fs to allow passing a
    kernel pointer. But due to the fact the we only need to handle a single
    kernel argument pointer, the logic can be sigificantly simplified while
    getting rid of the set_fs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200501104105.2621149-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • copy_strings_kernel is always used with a single argument,
    adjust the calling convention to that.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200501104105.2621149-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Pull execve updates from Eric Biederman:
    "Last cycle for the Nth time I ran into bugs and quality of
    implementation issues related to exec that could not be easily be
    fixed because of the way exec is implemented. So I have been digging
    into exec and cleanup up what I can.

    I don't think I have exec sorted out enough to fix the issues I
    started with but I have made some headway this cycle with 4 sets of
    changes.

    - promised cleanups after introducing exec_update_mutex

    - trivial cleanups for exec

    - control flow simplifications

    - remove the recomputation of bprm->cred

    The net result is code that is a bit easier to understand and work
    with and a decrease in the number of lines of code (if you don't count
    the added tests)"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
    exec: Compute file based creds only once
    exec: Add a per bprm->file version of per_clear
    binfmt_elf_fdpic: fix execfd build regression
    selftests/exec: Add binfmt_script regression test
    exec: Remove recursion from search_binary_handler
    exec: Generic execfd support
    exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
    exec: Move the call of prepare_binprm into search_binary_handler
    exec: Allow load_misc_binary to call prepare_binprm unconditionally
    exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
    exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
    exec: Teach prepare_exec_creds how exec treats uids & gids
    exec: Set the point of no return sooner
    exec: Move handling of the point of no return to the top level
    exec: Run sync_mm_rss before taking exec_update_mutex
    exec: Fix spelling of search_binary_handler in a comment
    exec: Move the comment from above de_thread to above unshare_sighand
    exec: Rename flush_old_exec begin_new_exec
    exec: Move most of setup_new_exec into flush_old_exec
    exec: In setup_new_exec cache current in the local variable me
    ...

    Linus Torvalds
     
  • Pull proc updates from Eric Biederman:
    "This has four sets of changes:

    - modernize proc to support multiple private instances

    - ensure we see the exit of each process tid exactly

    - remove has_group_leader_pid

    - use pids not tasks in posix-cpu-timers lookup

    Alexey updated proc so each mount of proc uses a new superblock. This
    allows people to actually use mount options with proc with no fear of
    messing up another mount of proc. Given the kernel's internal mounts
    of proc for things like uml this was a real problem, and resulted in
    Android's hidepid mount options being ignored and introducing security
    issues.

    The rest of the changes are small cleanups and fixes that came out of
    my work to allow this change to proc. In essence it is swapping the
    pids in de_thread during exec which removes a special case the code
    had to handle. Then updating the code to stop handling that special
    case"

    * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: proc_pid_ns takes super_block as an argument
    remove the no longer needed pid_alive() check in __task_pid_nr_ns()
    posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
    posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
    posix-cpu-timers: Extend rcu_read_lock removing task_struct references
    signal: Remove has_group_leader_pid
    exec: Remove BUG_ON(has_group_leader_pid)
    posix-cpu-timer: Unify the now redundant code in lookup_task
    posix-cpu-timer: Tidy up group_leader logic in lookup_task
    proc: Ensure we see the exit of each process tid exactly once
    rculist: Add hlists_swap_heads_rcu
    proc: Use PIDTYPE_TGID in next_tgid
    Use proc_pid_ns() to get pid_namespace from the proc superblock
    proc: use named enums for better readability
    proc: use human-readable values for hidepid
    docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
    proc: add option to mount only a pids subset
    proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
    proc: allow to mount many instances of proc in one pid namespace
    proc: rename struct proc_fs_info to proc_fs_opts

    Linus Torvalds
     

30 May, 2020

2 commits

  • Move the computation of creds from prepare_binfmt into begin_new_exec
    so that the creds need only be computed once. This is just code
    reorganization no semantic changes of any kind are made.

    Moving the computation is safe. I have looked through the kernel and
    verified none of the binfmts look at bprm->cred directly, and that
    there are no helpers that look at bprm->cred indirectly. Which means
    that it is not a problem to compute the bprm->cred later in the
    execution flow as it is not used until it becomes current->cred.

    A new function bprm_creds_from_file is added to contain the work that
    needs to be done. bprm_creds_from_file first computes which file
    bprm->executable or most likely bprm->file that the bprm->creds
    will be computed from.

    The funciton bprm_fill_uid is updated to receive the file instead of
    accessing bprm->file. The now unnecessary work needed to reset the
    bprm->cred->euid, and bprm->cred->egid is removed from brpm_fill_uid.
    A small comment to document that bprm_fill_uid now only deals with the
    work to handle suid and sgid files. The default case is already
    heandled by prepare_exec_creds.

    The function security_bprm_repopulate_creds is renamed
    security_bprm_creds_from_file and now is explicitly passed the file
    from which to compute the creds. The documentation of the
    bprm_creds_from_file security hook is updated to explain when the hook
    is called and what it needs to do. The file is passed from
    cap_bprm_creds_from_file into get_file_caps so that the caps are
    computed for the appropriate file. The now unnecessary work in
    cap_bprm_creds_from_file to reset the ambient capabilites has been
    removed. A small comment to document that the work of
    cap_bprm_creds_from_file is to read capabilities from the files
    secureity attribute and derive capabilities from the fact the
    user had uid 0 has been added.

    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • There is a small bug in the code that recomputes parts of bprm->cred
    for every bprm->file. The code never recomputes the part of
    clear_dangerous_personality_flags it is responsible for.

    Which means that in practice if someone creates a sgid script
    the interpreter will not be able to use any of:
    READ_IMPLIES_EXEC
    ADDR_NO_RANDOMIZE
    ADDR_COMPAT_LAYOUT
    MMAP_PAGE_ZERO.

    This accentially clearing of personality flags probably does
    not matter in practice because no one has complained
    but it does make the code more difficult to understand.

    Further remaining bug compatible prevents the recomputation from being
    removed and replaced by simply computing bprm->cred once from the
    final bprm->file.

    Making this change removes the last behavior difference between
    computing bprm->creds from the final file and recomputing
    bprm->cred several times. Which allows this behavior change
    to be justified for it's own reasons, and for any but hunts
    looking into why the behavior changed to wind up here instead
    of in the code that will follow that computes bprm->cred
    from the final bprm->file.

    This small logic bug appears to have existed since the code
    started clearing dangerous personality bits.

    History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support")
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 May, 2020

6 commits

  • Recursion in kernel code is generally a bad idea as it can overflow
    the kernel stack. Recursion in exec also hides that the code is
    looping and that the loop changes bprm->file.

    Instead of recursing in search_binary_handler have the methods that
    would recurse set bprm->interpreter and return 0. Modify exec_binprm
    to loop when bprm->interpreter is set. Consolidate all of the
    reassignments of bprm->file in that loop to make it clear what is
    going on.

    The structure of the new loop in exec_binprm is that all errors return
    immediately, while successful completion (ret == 0 &&
    !bprm->interpreter) just breaks out of the loop and runs what
    exec_bprm has always run upon successful completion.

    Fail if the an interpreter is being call after execfd has been set.
    The code has never properly handled an interpreter being called with
    execfd being set and with reassignments of bprm->file and the
    assignment of bprm->executable in generic code it has finally become
    possible to test and fail when if this problematic condition happens.

    With the reassignments of bprm->file and the assignment of
    bprm->executable moved into the generic code add a test to see if
    bprm->executable is being reassigned.

    In search_binary_handler remove the test for !bprm->file. With all
    reassignments of bprm->file moved to exec_binprm bprm->file can never
    be NULL in search_binary_handler.

    Link: https://lkml.kernel.org/r/87sgfwyd84.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Most of the support for passing the file descriptor of an executable
    to an interpreter already lives in the generic code and in binfmt_elf.
    Rework the fields in binfmt_elf that deal with executable file
    descriptor passing to make executable file descriptor passing a first
    class concept.

    Move the fd_install from binfmt_misc into begin_new_exec after the new
    creds have been installed. This means that accessing the file through
    /proc//fd/N is able to see the creds for the new executable
    before allowing access to the new executables files.

    Performing the install of the executables file descriptor after
    the point of no return also means that nothing special needs to
    be done on error. The exiting of the process will close all
    of it's open files.

    Move the would_dump from binfmt_misc into begin_new_exec right
    after would_dump is called on the bprm->file. This makes it
    obvious this case exists and that no nesting of bprm->file is
    currently supported.

    In binfmt_misc the movement of fd_install into generic code means
    that it's special error exit path is no longer needed.

    Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The code in prepare_binary_handler needs to be run every time
    search_binary_handler is called so move the call into search_binary_handler
    itself to make the code simpler and easier to understand.

    Link: https://lkml.kernel.org/r/87d070zrvx.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Reviewed-by: James Morris
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Add a flag preserve_creds that binfmt_misc can set to prevent
    credentials from being updated. This allows binfmt_misc to always
    call prepare_binprm. Allowing the credential computation logic to be
    consolidated.

    Not replacing the credentials with the interpreters credentials is
    safe because because an open file descriptor to the executable is
    passed to the interpreter. As the interpreter does not need to
    reopen the executable it is guaranteed to see the same file that
    exec sees.

    Ref: c407c033de84 ("[PATCH] binfmt_misc: improve calculation of interpreter's credentials")
    Link: https://lkml.kernel.org/r/87imgszrwo.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Rename bprm->cap_elevated to bprm->active_secureexec and initialize it
    in prepare_binprm instead of in cap_bprm_set_creds. Initializing
    bprm->active_secureexec in prepare_binprm allows multiple
    implementations of security_bprm_repopulate_creds to play nicely with
    each other.

    Rename security_bprm_set_creds to security_bprm_reopulate_creds to
    emphasize that this path recomputes part of bprm->cred. This
    recomputation avoids the time of check vs time of use problems that
    are inherent in unix #! interpreters.

    In short two renames and a move in the location of initializing
    bprm->active_secureexec.

    Link: https://lkml.kernel.org/r/87o8qkzrxp.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Today security_bprm_set_creds has several implementations:
    apparmor_bprm_set_creds, cap_bprm_set_creds, selinux_bprm_set_creds,
    smack_bprm_set_creds, and tomoyo_bprm_set_creds.

    Except for cap_bprm_set_creds they all test bprm->called_set_creds and
    return immediately if it is true. The function cap_bprm_set_creds
    ignores bprm->calld_sed_creds entirely.

    Create a new LSM hook security_bprm_creds_for_exec that is called just
    before prepare_binprm in __do_execve_file, resulting in a LSM hook
    that is called exactly once for the entire of exec. Modify the bits
    of security_bprm_set_creds that only want to be called once per exec
    into security_bprm_creds_for_exec, leaving only cap_bprm_set_creds
    behind.

    Remove bprm->called_set_creds all of it's former users have been moved
    to security_bprm_creds_for_exec.

    Add or upate comments a appropriate to bring them up to date and
    to reflect this change.

    Link: https://lkml.kernel.org/r/87v9kszrzh.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Acked-by: Casey Schaufler # For the LSM and Smack bits
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

18 May, 2020

1 commit


17 May, 2020

1 commit

  • I goofed when I added mm->user_ns support to would_dump. I missed the
    fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
    binfmt_script bprm->file is reassigned. Which made the move of
    would_dump from setup_new_exec to __do_execve_file before exec_binprm
    incorrect as it can result in would_dump running on the script instead
    of the interpreter of the script.

    The net result is that the code stopped making unreadable interpreters
    undumpable. Which allows them to be ptraced and written to disk
    without special permissions. Oops.

    The move was necessary because the call in set_new_exec was after
    bprm->mm was no longer valid.

    To correct this mistake move the misplaced would_dump from
    __do_execve_file into flos_old_exec, before exec_mmap is called.

    I tested and confirmed that without this fix I can attach with gdb to
    a script with an unreadable interpreter, and with this fix I can not.

    Cc: stable@vger.kernel.org
    Fixes: f84df2a6f268 ("exec: Ensure mm->user_ns contains the execed files")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 May, 2020

1 commit

  • Make the code more robust by marking the point of no return sooner.
    This ensures that future code changes don't need to worry about how
    they return errors if they are past this point.

    This results in no actual change in behavior as __do_execve_file does
    not force SIGSEGV when there is a pending fatal signal pending past
    the point of no return. Further the only error returns from de_thread
    and exec_mmap that can occur result in fatal signals being pending.

    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87sgga5klu.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman