25 Feb, 2021

1 commit

  • Delete duplicate words in fs/*.c.
    The doubled words that are being dropped are:
    that, be, the, in, and, for

    Link: https://lkml.kernel.org/r/20201224052810.25315-1-rdunlap@infradead.org
    Signed-off-by: Randy Dunlap
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

24 Feb, 2021

1 commit

  • Pull idmapped mounts from Christian Brauner:
    "This introduces idmapped mounts, which have been in the making for some
    time. Simply put, different mounts can expose the same file or
    directory with different ownership. This initial implementation comes
    with ports for fat, ext4 and with Christoph's port for xfs with more
    filesystems being actively worked on by independent people and
    maintainers.

    Idmapped mounts handle a wide range of long-standing use-cases. Here
    are just a few:

    - Idmapped mounts make it possible to easily share files between
    multiple users or multiple machines especially in complex
    scenarios. For example, idmapped mounts will be used in the
    implementation of portable home directories in
    systemd-homed.service(8) where they allow users to move their home
    directory to an external storage device and use it on multiple
    computers where they are assigned different uids and gids. This
    effectively makes it possible to assign random uids and gids at
    login time.

    - It is possible to share files from the host with unprivileged
    containers without having to change ownership permanently through
    chown(2).

    - It is possible to idmap a container's rootfs without having to
    mangle every file. For example, Chromebooks use it to share the
    user's Download folder with their unprivileged containers in their
    Linux subsystem.

    - It is possible to share files between containers with
    non-overlapping idmappings.

    - Filesystems that lack a proper concept of ownership such as fat can
    use idmapped mounts to implement discretionary access (DAC)
    permission checking.

    - They allow users to efficiently change ownership on a per-mount
    basis without having to (recursively) chown(2) all files. In
    contrast to chown(2), changing ownership of large sets of files is
    instantaneous with idmapped mounts. This is especially useful when
    ownership of a whole root filesystem of a virtual machine or
    container is changed. With idmapped mounts a single
    mount_setattr syscall will be sufficient to change the ownership of
    all files.

    - Idmapped mounts always take the current ownership into account as
    idmappings specify what a given uid or gid is supposed to be mapped
    to. This contrasts with the chown(2) syscall which cannot by itself
    take the current ownership of the files it changes into account. It
    simply changes the ownership to the specified uid and gid. This is
    especially problematic when recursively chown(2)ing a large set of
    files, which is common with the aforementioned portable home
    directory, container and vm scenarios.

    - Idmapped mounts allow changing ownership locally, restricting it
    to specific mounts, and temporarily as the ownership changes only
    apply as long as the mount exists.

    Several userspace projects have either already put up patches and
    pull-requests for this feature or will do so should you decide to pull
    this:

    - systemd: In a wide variety of scenarios but especially right away
    in their implementation of portable home directories.

    https://systemd.io/HOME_DIRECTORY/

    - container runtimes: containerd, runC, LXD: To share data between
    host and unprivileged containers, unprivileged and privileged
    containers, etc. The pull request for idmapped mounts support in
    containerd, the default Kubernetes runtime, has been up for quite
    a while now: https://github.com/containerd/containerd/pull/4734

    - The virtio-fs developers and several users have expressed interest
    in using this feature with virtual machines once virtio-fs is
    ported.

    - ChromeOS: Sharing host-directories with unprivileged containers.

    I've tightly synced with all those projects and all of those listed
    here have also expressed their need/desire for this feature on the
    mailing list. For more info on how people use this there's a bunch of
    talks about this too. Here's just two recent ones:

    https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
    https://fosdem.org/2021/schedule/event/containers_idmap/

    This comes with an extensive xfstests suite covering both ext4 and
    xfs:

    https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

    It covers truncation, creation, opening, xattrs, vfscaps, setid
    execution, setgid inheritance and more both with idmapped and
    non-idmapped mounts. It already helped to discover an unrelated xfs
    setgid inheritance bug which has since been fixed in mainline. It will
    be sent for inclusion with the xfstests project should you decide to
    merge this.

    In order to support per-mount idmappings vfsmounts are marked with
    user namespaces. The idmapping of the user namespace will be used to
    map the ids of vfs objects when they are accessed through that mount.
    By default all vfsmounts are marked with the initial user namespace.
    The initial user namespace is used to indicate that a mount is not
    idmapped. All operations behave as before and this is verified in the
    testsuite.
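
    A minimal sketch of this per-mount translation, assuming a mapping
    made of simple (first id, first id, count) ranges. The struct and
    function names are hypothetical and this is a simplified model, not
    the kernel's helpers. An empty table stands in for a mount marked
    with the initial user namespace, where ids pass through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one idmapping range; not a kernel type. */
struct sketch_extent {
    uint32_t fs_first;   /* first id as stored by the filesystem */
    uint32_t mnt_first;  /* first id as seen through the mount */
    uint32_t count;      /* number of consecutive ids covered */
};

#define SKETCH_ID_UNMAPPED UINT32_MAX /* stands in for "no mapping exists" */

/* Translate an on-disk id into the id visible through the mount. */
static uint32_t sketch_map_to_mount(const struct sketch_extent *ext, size_t n,
                                    uint32_t fs_id)
{
    assert(ext != NULL || n == 0);
    if (n == 0)
        return fs_id; /* models the initial user namespace: no idmapping */
    for (size_t i = 0; i < n; i++) {
        if (fs_id >= ext[i].fs_first && fs_id - ext[i].fs_first < ext[i].count)
            return ext[i].mnt_first + (fs_id - ext[i].fs_first);
    }
    return SKETCH_ID_UNMAPPED;
}

/* Reverse direction: id used by a caller on the mount -> id stored on disk. */
static uint32_t sketch_map_from_mount(const struct sketch_extent *ext, size_t n,
                                      uint32_t mnt_id)
{
    assert(ext != NULL || n == 0);
    if (n == 0)
        return mnt_id;
    for (size_t i = 0; i < n; i++) {
        if (mnt_id >= ext[i].mnt_first && mnt_id - ext[i].mnt_first < ext[i].count)
            return ext[i].fs_first + (mnt_id - ext[i].mnt_first);
    }
    return SKETCH_ID_UNMAPPED;
}
```

    With a single range { 1000, 1001, 1 }, a file stored with uid 1000
    is reported as uid 1001 through the mount, and a file created by
    uid 1001 through the mount is stored as uid 1000. This corresponds
    to the "both:1000:1001:1" example further down, assuming the users
    ubuntu and u1001 there have uids 1000 and 1001.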

    Based on prior discussions we want to attach the whole user namespace
    and not just a dedicated idmapping struct. This allows us to reuse all
    the helpers that already exist for dealing with idmappings instead of
    introducing a whole new range of helpers. In addition, if we decide in
    the future that we are confident enough to enable unprivileged users
    to set up idmapped mounts the permission checking can take into account
    whether the caller is privileged in the user namespace the mount is
    currently marked with.

    The user namespace the mount will be marked with can be specified by
    passing a file descriptor referring to the user namespace as an
    argument to the new mount_setattr() syscall together with the new
    MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
    of extensibility.

    The following conditions must be met in order to create an idmapped
    mount:

    - The caller must currently have the CAP_SYS_ADMIN capability in the
    user namespace the underlying filesystem has been mounted in.

    - The underlying filesystem must support idmapped mounts.

    - The mount must not already be idmapped. This also implies that the
    idmapping of a mount cannot be altered once it has been idmapped.

    - The mount must be a detached/anonymous mount, i.e. it must have
    been created by calling open_tree() with the OPEN_TREE_CLONE flag
    and it must not already have been visible in the filesystem.

    The last two points guarantee easier semantics for userspace and the
    kernel and make the implementation significantly simpler.

    By default vfsmounts are marked with the initial user namespace and no
    behavioral or performance changes are observed.

    The manpage with a detailed description can be found here:

    https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8

    In order to support idmapped mounts, filesystems need to be changed
    and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
    patches to convert individual filesystems are not very large or
    complicated overall as can be seen from the included fat, ext4, and
    xfs ports. Patches for other filesystems are actively worked on and
    will be sent out separately. The xfstests suite can be used to verify
    that a port has been done correctly.

    The mount_setattr() syscall is motivated independent of the idmapped
    mounts patches and it's been around since July 2019. One of the most
    valuable features of the new mount api is the ability to perform
    mounts based on file descriptors only.

    Together with the lookup restrictions available in the openat2()
    RESOLVE_* flag namespace which we added in v5.6 this is the first time
    we are close to hardened and race-free (e.g. symlinks) mounting and
    path resolution.

    While userspace has started porting to the new mount api to mount
    proper filesystems and create new bind-mounts it is currently not
    possible to change mount options of an already existing bind mount in
    the new mount api since the mount_setattr() syscall is missing.

    With the addition of the mount_setattr() syscall we remove this last
    restriction and userspace can now fully port to the new mount api,
    covering every use-case the old mount api could. We also add the
    crucial ability to recursively change mount options for a whole mount
    tree, both removing and adding mount options at the same time. This
    syscall has been requested multiple times by various people and
    projects.

    There is a simple tool available at

    https://github.com/brauner/mount-idmapped

    that allows creating idmapped mounts so people can play with this
    patch series. I'll add support for the regular mount binary should you
    decide to pull this in the following weeks:

    Here's an example of a simple idmapped mount of another user's home
    directory:

    u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

    u1001@f2-vm:/$ ls -al /home/ubuntu/
    total 28
    drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
    drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
    -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
    -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ ls -al /mnt/
    total 28
    drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
    drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
    -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
    -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ touch /mnt/my-file

    u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

    u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

    u1001@f2-vm:/$ ls -al /mnt/my-file
    -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

    u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
    -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

    u1001@f2-vm:/$ getfacl /mnt/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: mnt/my-file
    # owner: u1001
    # group: u1001
    user::rw-
    user:u1001:rwx
    group::rw-
    mask::rwx
    other::r--

    u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: home/ubuntu/my-file
    # owner: ubuntu
    # group: ubuntu
    user::rw-
    user:ubuntu:rwx
    group::rw-
    mask::rwx
    other::r--"

    * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
    xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
    xfs: support idmapped mounts
    ext4: support idmapped mounts
    fat: handle idmapped mounts
    tests: add mount_setattr() selftests
    fs: introduce MOUNT_ATTR_IDMAP
    fs: add mount_setattr()
    fs: add attr_flags_to_mnt_flags helper
    fs: split out functions to hold writers
    namespace: only take read lock in do_reconfigure_mnt()
    mount: make {lock,unlock}_mount_hash() static
    namespace: take lock_mount_hash() directly when changing flags
    nfs: do not export idmapped mounts
    overlayfs: do not mount on top of idmapped mounts
    ecryptfs: do not mount on top of idmapped mounts
    ima: handle idmapped mounts
    apparmor: handle idmapped mounts
    fs: make helpers idmap mount aware
    exec: handle idmapped mounts
    would_dump: handle idmapped mounts
    ...

    Linus Torvalds
     

30 Jan, 2021

2 commits

  • The 'start' and 'end' arguments to tlb_gather_mmu() are no longer
    needed now that there is a separate function for 'fullmm' flushing.

    Remove the unused arguments and update all callers.

    Suggested-by: Linus Torvalds
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Yu Zhao
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Linus Torvalds
    Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com

    Will Deacon
     
  • Since commit 7a30df49f63a ("mm: mmu_gather: remove __tlb_reset_range()
    for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu()
    are no longer used, since we flush the whole mm in case of a nested
    invalidation.

    Remove the unused arguments and update all callers.

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Yu Zhao
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Linus Torvalds
    Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org

    Will Deacon
     

24 Jan, 2021

4 commits

  • When executing a setuid binary the kernel will verify in bprm_fill_uid()
    that the inode has a mapping in the caller's user namespace before
    setting the caller's uid and gid. Let bprm_fill_uid() handle idmapped
    mounts. If the inode is accessed through an idmapped mount it is mapped
    according to the mount's user namespace. Afterwards the checks are
    identical to non-idmapped mounts. If the initial user namespace is
    passed nothing changes so non-idmapped mounts will see identical
    behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-24-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • When determining whether or not to create a coredump the vfs will verify
    that the caller is privileged over the inode. Make the would_dump()
    helper handle idmapped mounts by passing down the mount's user namespace
    of the exec file. If the initial user namespace is passed nothing
    changes so non-idmapped mounts will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-23-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The two helpers inode_permission() and generic_permission() are used by
    the vfs to perform basic permission checking by verifying that the
    caller is privileged over an inode. In order to handle idmapped mounts
    we extend the two helpers with an additional user namespace argument.
    On idmapped mounts the two helpers will make sure to map the inode
    according to the mount's user namespace and then perform identical
    permission checks to inode_permission() and generic_permission(). If the
    initial user namespace is passed nothing changes so non-idmapped mounts
    will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • In order to determine whether a caller holds privilege over a given
    inode the capability framework exposes the two helpers
    privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
    verifies that the inode has a mapping in the caller's user namespace and
    the latter additionally verifies that the caller has the requested
    capability in their current user namespace.
    If the inode is accessed through an idmapped mount, map it into the
    mount's user namespace. Afterwards the checks are identical to
    non-idmapped inodes. If the initial user namespace is passed all
    operations are a nop so non-idmapped mounts will not see a change in
    behavior.

    Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Christian Brauner

    Christian Brauner
     

17 Dec, 2020

1 commit

  • Pull parisc updates from Helge Deller:
    "A change to increase the default maximum stack size on parisc to 100MB
    and the ability to further increase the stack hard limit size at
    runtime with ulimit for newly started processes.

    The other patches fix compile warnings, utilize the Kbuild logic and
    clean up the parisc arch code"

    * 'parisc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
    parisc: pci-dma: fix warning unused-function
    parisc/uapi: Use Kbuild logic to provide
    parisc: Make user stack size configurable
    parisc: Use _TIF_USER_WORK_MASK in entry.S
    parisc: Drop loops_per_jiffy from per_cpu struct

    Linus Torvalds
     

16 Dec, 2020

2 commits

  • …kernel/git/ebiederm/user-namespace

    Pull exec-update-lock update from Eric Biederman:
    "The key point of this is to transform exec_update_mutex into a
    rw_semaphore so readers can be separated from writers.

    This makes it easier to understand what the holders of the lock are
    doing, and makes it harder to contend or deadlock on the lock.

    The real deadlock fix wound up in perf_event_open"

    * 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    exec: Transform exec_update_mutex into a rw_semaphore

    Linus Torvalds
     
  • …biederm/user-namespace

    Pull execve updates from Eric Biederman:
    "This set of changes ultimately fixes the interaction of posix file
    lock and exec. Fundamentally most of the change is just moving where
    unshare_files is called during exec, and tweaking the users of
    files_struct so that the count of files_struct is not unnecessarily
    played with.

    Along the way fcheck and related helpers were renamed to more
    accurately reflect what they do.

    There were also many other small changes that fell out, as this is the
    first time in a long time much of this code has been touched.

    Benchmarks haven't turned up any practical issues but Al Viro has
    observed a possibility for a lot of pounding on task_lock. So I have
    some changes in progress to convert put_files_struct to always rcu
    free files_struct. That wasn't ready for the merge window so that will
    have to wait until next time"

    * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    exec: Move io_uring_task_cancel after the point of no return
    coredump: Document coredump code exclusively used by cell spufs
    file: Remove get_files_struct
    file: Rename __close_fd_get_file close_fd_get_file
    file: Replace ksys_close with close_fd
    file: Rename __close_fd to close_fd and remove the files parameter
    file: Merge __alloc_fd into alloc_fd
    file: In f_dupfd read RLIMIT_NOFILE once.
    file: Merge __fd_install into fd_install
    proc/fd: In fdinfo seq_show don't use get_files_struct
    bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
    proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
    file: Implement task_lookup_next_fd_rcu
    kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
    proc/fd: In tid_fd_mode use task_lookup_fd_rcu
    file: Implement task_lookup_fd_rcu
    file: Rename fcheck lookup_fd_rcu
    file: Replace fcheck_files with files_lookup_fd_rcu
    file: Factor files_lookup_fd_locked out of fcheck_files
    file: Rename __fcheck_files to files_lookup_fd_raw
    ...

    Linus Torvalds
     

11 Dec, 2020

5 commits

  • Recently syzbot reported[0] that there is a deadlock amongst the users
    of exec_update_mutex. The problematic lock ordering found by lockdep
    was:

    perf_event_open (exec_update_mutex -> ovl_i_mutex)
    chown (ovl_i_mutex -> sb_writes)
    sendfile (sb_writes -> p->lock)
    by reading from a proc file and writing to overlayfs
    proc_pid_syscall (p->lock -> exec_update_mutex)

    While looking at possible solutions it occurred to me that all of the
    users and possible users involved only wanted the state of the given
    process to remain the same. They are all readers. The only writer is
    exec.

    There is no reason for readers to block on each other. So fix
    this deadlock by transforming exec_update_mutex into a rw_semaphore
    named exec_update_lock that only exec takes for writing.

    Cc: Jann Horn
    Cc: Vasiliy Kulikov
    Cc: Al Viro
    Cc: Bernd Edlinger
    Cc: Oleg Nesterov
    Cc: Christopher Yeoh
    Cc: Cyrill Gorcunov
    Cc: Sargun Dhillon
    Cc: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
    [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
    Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
    Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Now that unshare_files happens in begin_new_exec after the point of no
    return, io_uring_task_cancel can also happen later.

    Effectively this means io_uring activities for a task are only canceled
    when exec succeeds.

    Link: https://lkml.kernel.org/r/878saih2op.fsf@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Now that exec no longer needs to return the unshared files to their
    previous value there is no reason to return displaced.

    Instead when unshare_fd creates a copy of the file table, call
    put_files_struct before returning from unshare_files.

    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Many moons ago the binfmts were doing some very questionable things
    with file descriptors and an unsharing of the file descriptor table
    was added to make things better[1][2]. The helper steal_locks was
    added to avoid breaking the userspace programs[3][4][6].

    Unfortunately it turned out that steal_locks did not work for network
    file systems[5], so it was removed to see if anyone would
    complain[7][8]. It was thought at the time that NPTL would not be
    affected as the unshare_files happened after the other threads were
    killed[8]. Unfortunately because there was an unshare_files in
    binfmt_elf.c before the threads were killed this analysis was
    incorrect.

    This unshare_files in binfmt_elf.c resulted in the unshare_files
    happening whenever threads were present. Which led to unshare_files
    being moved to the start of do_execve[9].

    Later the problems were rediscovered and the suggested approach was to
    readd steal_locks under a different name[10]. I happened to be
    reviewing patches and I noticed that this approach was a step
    backwards[11].

    I proposed simply moving unshare_files[12] and it was pointed
    out that moving unshare_files without auditing the code was
    also unsafe[13].

    There were then several attempts to solve this[14][15][16] and I even
    posted this set of changes[17]. Unfortunately because auditing all of
    execve is time consuming this change did not make it in at the time.

    Well now that I am cleaning up exec I have made the time to read
    through all of the binfmts and the only playing with file descriptors
    is either the security modules closing them in
    security_bprm_committing_creds or is in the generic code in fs/exec.c.
    None of it happens before begin_new_exec is called.

    So move unshare_files into begin_new_exec, after the point of no
    return. If memory is very very very low and the application calling
    exec is sharing file descriptor tables between processes we might fail
    past the point of no return. Which is unfortunate but no different
    than any of the other places where we allocate memory after the point
    of no return.

    This movement allows another process that shares the file table, or
    another thread of the same process, that closes files or changes
    their close-on-exec behavior and races with execve to cause some
    unexpected things to happen. There is only one time of check to time
    of use race and it is just there so that execve fails instead of
    an interpreter failing when it tries to open the file it is supposed
    to be interpreting. Failing later if userspace is being silly is
    not a problem.

    With this change the following description from the removal
    of steal_locks[8] finally becomes true.

    Apps using NPTL are not affected, since all other threads are killed before
    execve.

    Apps using LinuxThreads are only affected if they

    - have multiple threads during exec (LinuxThreads doesn't kill other
    threads, the app may do it with pthread_kill_other_threads_np())
    - rely on POSIX locks being inherited across exec

    Both conditions are documented, but not their interaction.

    Apps using clone() natively are affected if they

    - use clone(CLONE_FILES)
    - rely on POSIX locks being inherited across exec

    I have investigated some paths to make it possible to solve this
    without moving unshare_files but they all look more complicated[18].

    Reported-by: Daniel P. Berrangé
    Reported-by: Jeff Layton
    History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    [1] 02cda956de0b ("[PATCH] unshare_files")
    [2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper")
    [3] 088f5d7244de ("[PATCH] add steal_locks helper")
    [4] 02c541ec8ffa ("[PATCH] use new steal_locks helper")
    [5] https://lkml.kernel.org/r/E1FLIlF-0007zR-00@dorka.pomaz.szeredi.hu
    [6] https://lkml.kernel.org/r/0060321191605.GB15997@sorel.sous-sol.org
    [7] https://lkml.kernel.org/r/E1FLwjC-0000kJ-00@dorka.pomaz.szeredi.hu
    [8] c89681ed7d0e ("[PATCH] remove steal_locks()")
    [9] fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()")
    [10] https://lkml.kernel.org/r/20180317142520.30520-1-jlayton@kernel.org
    [11] https://lkml.kernel.org/r/87r2nwqk73.fsf@xmission.com
    [12] https://lkml.kernel.org/r/87bmfgvg8w.fsf@xmission.com
    [13] https://lkml.kernel.org/r/20180322111424.GE30522@ZenIV.linux.org.uk
    [14] https://lkml.kernel.org/r/20180827174722.3723-1-jlayton@kernel.org
    [15] https://lkml.kernel.org/r/20180830172423.21964-1-jlayton@kernel.org
    [16] https://lkml.kernel.org/r/20180914105310.6454-1-jlayton@kernel.org
    [17] https://lkml.kernel.org/r/87a7ohs5ow.fsf@xmission.com
    [18] https://lkml.kernel.org/r/87pn8c1uj6.fsf_-_@x220.int.ebiederm.org
    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200817220425.9389-1-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-1-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Al Viro pointed out that using the phrase "close_on_exec(fd,
    rcu_dereference_raw(current->files->fdt))" instead of wrapping it in
    rcu_read_lock(), rcu_read_unlock() is a very questionable
    optimization[1].

    Once wrapped with rcu_read_lock()/rcu_read_unlock() that phrase
    becomes equivalent to the helper function get_close_on_exec so
    simplify the code and make it more robust by simply using
    get_close_on_exec.

    [1] https://lkml.kernel.org/r/20201207222214.GA4115853@ZenIV.linux.org.uk
    Suggested-by: Al Viro
    Link: https://lkml.kernel.org/r/87k0tqr6zi.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

02 Dec, 2020

1 commit

  • Introduce a mechanism to quickly disable/enable syscall handling for a
    specific process and redirect to userspace via SIGSYS. This is useful
    for processes with parts that require syscall redirection and parts that
    don't, but who need to perform this boundary crossing really fast,
    without paying the cost of a system call to reconfigure syscall handling
    on each boundary transition. This is particularly important for Windows
    games running over Wine.

    The proposed interface looks like this:

    prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

    The range [<offset>,<offset>+<length>) is a part of the process memory
    map that is allowed to by-pass the redirection code and dispatch
    syscalls directly, such that in fast paths a process doesn't need to
    disable the trap, nor does the kernel have to check the selector. This is
    essential to return from SIGSYS to a blocked area without triggering
    another SIGSYS from rt_sigreturn.

    selector is an optional pointer to a char-sized userspace memory region
    that has a key switch for the mechanism. This key switch is set to
    either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable the
    redirection without calling the kernel.

    The feature is meant to be set per-thread and it is disabled on
    fork/clone/execv.
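
    The decision described above can be modelled as a pure function of
    the instruction pointer, the allowed range and the selector byte.
    This is a sketch with hypothetical names, not the kernel code. It
    assumes that a missing selector leaves redirection permanently
    enabled outside the allowed range, and it uses 0 and 1 as stand-in
    values for the "off" and "on" selector states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in selector values; hypothetical, local to this sketch. */
enum sketch_selector { SKETCH_DISPATCH_OFF = 0, SKETCH_DISPATCH_ON = 1 };

enum sketch_action { SKETCH_RUN_SYSCALL, SKETCH_RAISE_SIGSYS };

struct sketch_dispatch {
    uintptr_t offset;              /* start of the range that bypasses redirection */
    uintptr_t length;              /* size of that range in bytes */
    const unsigned char *selector; /* optional userspace switch, may be NULL */
};

/* Decide what happens to a syscall issued from instruction pointer 'ip'. */
static enum sketch_action sketch_decide(const struct sketch_dispatch *d, uintptr_t ip)
{
    assert(d != NULL);

    /*
     * Unsigned subtraction: an ip below offset wraps to a huge value and
     * fails the comparison, so one test covers both range bounds.
     */
    if (ip - d->offset < d->length)
        return SKETCH_RUN_SYSCALL; /* inside the range: selector is not read */

    if (d->selector != NULL && *d->selector == SKETCH_DISPATCH_OFF)
        return SKETCH_RUN_SYSCALL; /* redirection switched off by userspace */

    return SKETCH_RAISE_SIGSYS;
}
```

    A SIGSYS handler placed inside the allowed range can therefore
    return through rt_sigreturn without triggering another SIGSYS, and
    code outside the range toggles redirection by writing one byte.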

    Internally, this doesn't add overhead to the syscall hot path, and it
    requires very little per-architecture support. I avoided using seccomp,
    even though it duplicates some functionality, due to previous feedback
    that maybe it shouldn't mix with seccomp since it is not a security
    mechanism. And obviously, this should never be considered a security
    mechanism, since any part of the program can by-pass it by using the
    syscall dispatcher.

    For the sysinfo benchmark, which measures the overhead added to
    executing a native syscall that doesn't require interception, the
    overhead using only the direct dispatcher region to issue syscalls is
    pretty much irrelevant. The overhead of using the selector goes around
    40ns for a native (unredirected) syscall in my system, and it is (as
    expected) dominated by the supervisor-mode user-address access. In
    fact, with SMAP off, the overhead is consistently less than 5ns on my
    test box.

    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Andy Lutomirski
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Kees Cook
    Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com

    Gabriel Krisman Bertazi
     

11 Nov, 2020

1 commit

  • On parisc we need to initialize the memory layout for the user stack at
    process start time to a fixed size, which up until now was limited to
    the size as given by CONFIG_MAX_STACK_SIZE_MB at compile time.

    This hard limit was too small and showed problems when compiling
    ruby2.7, qmlcachegen and some Qt packages.

    This patch changes two things:
    a) It increases the default maximum stack size to 100MB.
    b) Users can modify the stack hard limit size with ulimit and then newly
    forked processes will use the given stack size which can even be bigger
    than the default 100MB.
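
    As a rough illustration of (b), the same limit that ulimit adjusts
    can be read and set from C through RLIMIT_STACK. This is a sketch
    only; get_stack_limits and set_stack_hard_limit are hypothetical
    helper names, and raising the hard limit above its current value
    normally requires privilege.

```c
#include <sys/resource.h>

/* Read the current soft and hard stack limits of the calling process. */
static int get_stack_limits(struct rlimit *rl)
{
    return getrlimit(RLIMIT_STACK, rl);
}

/*
 * Set the hard stack limit to new_max bytes, lowering the soft limit
 * too if it would otherwise exceed the new hard limit. Returns 0 on
 * success, -1 with errno set on failure.
 */
static int set_stack_hard_limit(rlim_t new_max)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    rl.rlim_max = new_max;
    if (rl.rlim_cur > new_max)
        rl.rlim_cur = new_max;
    return setrlimit(RLIMIT_STACK, &rl);
}
```

    The new limit applies to processes started afterwards, which matches
    the "newly forked processes" wording above.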

    Reported-by: John David Anglin
    Signed-off-by: Helge Deller

    Helge Deller
     

17 Oct, 2020

1 commit

  • Pull powerpc updates from Michael Ellerman:

    - A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
    it for powerpc, as well as a related fix for sparc.

    - Remove support for PowerPC 601.

    - Some fixes for watchpoints & addition of a new ptrace flag for
    detecting ISA v3.1 (Power10) watchpoint features.

    - A fix for kernels using 4K pages and the hash MMU on bare metal
    Power9 systems with > 16TB of RAM, or RAM on the 2nd node.

    - A basic idle driver for shallow stop states on Power10.

    - Tweaks to our sched domains code to better inform the scheduler about
    the hardware topology on Power9/10, where two SMT4 cores can be
    presented by firmware as an SMT8 core.

    - A series doing further reworks & cleanups of our EEH code.

    - Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
    to prevent root from overwriting kernel memory.

    - Other smaller features, fixes & cleanups.

    Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
    Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
    Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
    Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
    Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
    Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
    Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
    Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
    Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
    Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
    Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
    Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
    Yingliang, zhengbin.

    * tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
    Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
    selftests/powerpc: Fix eeh-basic.sh exit codes
    cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
    powerpc/time: Make get_tb() common to PPC32 and PPC64
    powerpc/time: Make get_tbl() common to PPC32 and PPC64
    powerpc/time: Remove get_tbu()
    powerpc/time: Avoid using get_tbl() and get_tbu() internally
    powerpc/time: Make mftb() common to PPC32 and PPC64
    powerpc/time: Rename mftbl() to mftb()
    powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
    powerpc/32s: Rename head_32.S to head_book3s_32.S
    powerpc/32s: Setup the early hash table at all time.
    powerpc/time: Remove ifdef in get_dec() and set_dec()
    powerpc: Remove get_tb_or_rtc()
    powerpc: Remove __USE_RTC()
    powerpc: Tidy up a bit after removal of PowerPC 601.
    powerpc: Remove support for PowerPC 601
    powerpc: Remove PowerPC 601
    powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
    powerpc: Remove CONFIG_PPC601_SYNC_FIX
    ...

    Linus Torvalds
     

16 Oct, 2020

1 commit

  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of char, misc, and other assorted driver subsystem
    patches for 5.10-rc1.

    There's a lot of different things in here, all over the drivers/
    directory. Some summaries:

    - soundwire driver updates

    - habanalabs driver updates

    - extcon driver updates

    - nitro_enclaves new driver

    - fsl-mc driver and core updates

    - mhi core and bus updates

    - nvmem driver updates

    - eeprom driver updates

    - binder driver updates and fixes

    - vbox minor bugfixes

    - fsi driver updates

    - w1 driver updates

    - coresight driver updates

    - interconnect driver updates

    - misc driver updates

    - other minor driver updates

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (396 commits)
    binder: fix UAF when releasing todo list
    docs: w1: w1_therm: Fix broken xref, mistakes, clarify text
    misc: Kconfig: fix a HISI_HIKEY_USB dependency
    LSM: Fix type of id parameter in kernel_post_load_data prototype
    misc: Kconfig: add a new dependency for HISI_HIKEY_USB
    firmware_loader: fix a kernel-doc markup
    w1: w1_therm: make w1_poll_completion static
    binder: simplify the return expression of binder_mmap
    test_firmware: Test partial read support
    firmware: Add request_partial_firmware_into_buf()
    firmware: Store opt_flags in fw_priv
    fs/kernel_file_read: Add "offset" arg for partial reads
    IMA: Add support for file reads without contents
    LSM: Add "contents" flag to kernel_read_file hook
    module: Call security_kernel_post_load_data()
    firmware_loader: Use security_post_load_data()
    LSM: Introduce kernel_post_load_data() hook
    fs/kernel_read_file: Add file_size output argument
    fs/kernel_read_file: Switch buffer size arg to size_t
    fs/kernel_read_file: Remove redundant size argument
    ...

    Linus Torvalds
     

05 Oct, 2020

3 commits

  • These routines are used in places outside of exec(2), so in preparation
    for refactoring them, move them into a separate source file,
    fs/kernel_read_file.c.

    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Reviewed-by: Luis Chamberlain
    Acked-by: Scott Branden
    Link: https://lore.kernel.org/r/20201002173828.2099543-5-keescook@chromium.org
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • Move kernel_read_file* out of linux/fs.h to its own linux/kernel_read_file.h
    include file. linux/fs.h gets pulled in just about everywhere
    and doesn't really need functions not related to the general fs interface.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Scott Branden
    Signed-off-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mimi Zohar
    Reviewed-by: Luis Chamberlain
    Acked-by: Greg Kroah-Hartman
    Acked-by: James Morris
    Link: https://lore.kernel.org/r/20200706232309.12010-2-scott.branden@broadcom.com
    Link: https://lore.kernel.org/r/20201002173828.2099543-4-keescook@chromium.org
    Signed-off-by: Greg Kroah-Hartman

    Scott Branden
     
  • FIRMWARE_PREALLOC_BUFFER is a "how", not a "what", and confuses the LSMs
    that are interested in filtering between types of things. The "how"
    should be an internal detail made uninteresting to the LSMs.

    Fixes: a098ecd2fa7d ("firmware: support loading into a pre-allocated buffer")
    Fixes: fd90bc559bfb ("ima: based on policy verify firmware signatures (pre-allocated buffer)")
    Fixes: 4f0496d8ffa3 ("ima: based on policy warn about loading firmware (pre-allocated buffer)")
    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Reviewed-by: Luis Chamberlain
    Acked-by: Scott Branden
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20201002173828.2099543-2-keescook@chromium.org
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

01 Oct, 2020

1 commit

  • Grab actual references to the files_struct. To avoid circular reference
    issues due to this, we add a per-task note that keeps track of what
    io_uring contexts a task has used. When the task execs or exits its
    assigned files, we cancel requests based on this tracking.

    With that, we can grab proper references to the files table, and no
    longer need to rely on stashing away ring_fd and ring_file to check
    if the ring_fd may have been closed.

    Cc: stable@vger.kernel.org # v5.5+
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Sep, 2020

1 commit

  • Reading and modifying current->mm and current->active_mm and switching
    mm should be done with irqs off, to prevent races seeing an intermediate
    state.

    This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
    invalidate"). At exec-time when the new mm is activated, the old one
    should usually be single-threaded and no longer used, unless something
    else is holding an mm_users reference (which may be possible).

    Absent other mm_users, there is also a race with preemption and lazy tlb
    switching. Consider the kernel_execve case where the current thread is
    using a lazy tlb active mm:

    call_usermodehelper()
      kernel_execve()
        old_mm = current->mm;
        active_mm = current->active_mm;
        *** preempt *** -------------------->  schedule()
                                                 prev->active_mm = NULL;
                                                 mmdrop(prev active_mm);
                                               ...
                        <--------------------  schedule()
        current->mm = mm;
        current->active_mm = mm;
        if (!old_mm)
            mmdrop(active_mm);

    If we switch back to the kernel thread from a different mm, there is a
    double free of the old active_mm, and a missing free of the new one.

    Closing this race only requires interrupts to be disabled while ->mm
    and ->active_mm are being switched, but the TLB problem requires also
    holding interrupts off over activate_mm. Unfortunately not all archs
    can do that yet, e.g., arm defers the switch if irqs are disabled and
    expects finish_arch_post_lock_switch() to be called to complete the
    flush; um takes a blocking lock in activate_mm().

    So as a first step, disable interrupts across the mm/active_mm updates
    to close the lazy tlb preempt race, and provide an arch option to
    extend that to activate_mm which allows architectures doing IPI based
    TLB shootdowns to close the second race.
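
    The two-step approach can be modelled in userspace as follows. This
    is a sketch, not the kernel code: struct mm, struct task,
    model_exec_mmap and the irq stubs are hypothetical stand-ins, and
    the want_irqs_off_activate_mm flag models the arch option
    (ARCH_WANT_IRQS_OFF_ACTIVATE_MM).

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; every name below is part of a hypothetical model. */
struct mm { int id; };
struct task { struct mm *mm, *active_mm; };

static bool irqs_off;                 /* models the CPU interrupt flag */
static bool switched_with_irqs_off;   /* recorded for inspection */
static bool activated_with_irqs_off;  /* recorded for inspection */

static void local_irq_disable(void) { irqs_off = true; }
static void local_irq_enable(void)  { irqs_off = false; }

static void activate_mm(struct mm *prev, struct mm *next)
{
    (void)prev;
    (void)next;
    activated_with_irqs_off = irqs_off;
}

/*
 * Swap in the new mm. The ->mm/->active_mm update always runs with
 * interrupts off; activate_mm() does too only when the architecture
 * opts in. Returns the previous active_mm for the caller to drop.
 */
static struct mm *model_exec_mmap(struct task *tsk, struct mm *mm,
                                  bool want_irqs_off_activate_mm)
{
    struct mm *old_active;

    local_irq_disable();
    old_active = tsk->active_mm;
    tsk->active_mm = mm;
    tsk->mm = mm;
    switched_with_irqs_off = irqs_off;
    if (!want_irqs_off_activate_mm)
        local_irq_enable();
    activate_mm(old_active, mm);
    if (want_irqs_off_activate_mm)
        local_irq_enable();
    return old_active;
}
```

    In the model, the preempt window from the diagram above cannot fall
    between reading active_mm and writing the new pointers, because both
    happen inside the irqs-off region.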

    This is a bit ugly, but in the interest of fixing the bug and backporting
    before all architectures are converted this is a compromise.

    Signed-off-by: Nicholas Piggin
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com

    Nicholas Piggin
     

13 Aug, 2020

5 commits

  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • The path_noexec() check, like the regular file check, was happening too
    late, letting LSMs see impossible execve()s. Check it earlier as well in
    may_open() and collect the redundant fs/exec.c path_noexec() test under
    the same robustness comment as the S_ISREG() check.
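
    A simplified userspace model of the inode type checks after this
    change, for illustration only. model_may_open and the MODEL_MAY_*
    bits are hypothetical names; the real may_open() takes a struct path
    and performs further checks (device access, inode_permission())
    that are omitted here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical stand-ins for the kernel's access-mode bits. */
#define MODEL_MAY_EXEC  0x1
#define MODEL_MAY_WRITE 0x2

/*
 * Reject impossible exec attempts by file type before any permission
 * hook would run. mount_noexec models what path_noexec() reports.
 */
static int model_may_open(mode_t mode, int acc_mode, bool mount_noexec)
{
    switch (mode & S_IFMT) {
    case S_IFLNK:
        return -ELOOP;
    case S_IFDIR:
        if (acc_mode & MODEL_MAY_WRITE)
            return -EISDIR;
        if (acc_mode & MODEL_MAY_EXEC)
            return -EACCES;
        break;
    case S_IFBLK:
    case S_IFCHR:
    case S_IFIFO:
    case S_IFSOCK:
        /* Non-regular files can never be executed. */
        if (acc_mode & MODEL_MAY_EXEC)
            return -EACCES;
        break;
    case S_IFREG:
        /* Regular files are executable only on exec-permitting mounts. */
        if ((acc_mode & MODEL_MAY_EXEC) && mount_noexec)
            return -EACCES;
        break;
    }
    return 0; /* the real code continues with permission checks */
}
```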

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
        struct open_flags open_exec_flags = {
            .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
            .acc_mode = MAY_EXEC,
            ...
        do_filp_open(dfd, filename, open_flags)
            path_openat(nameidata, open_flags, flags)
                file = alloc_empty_file(open_flags, current_cred());
                do_open(nameidata, file, open_flags)
                    may_open(path, acc_mode, open_flag)
                        /* new location of MAY_EXEC vs path_noexec() test */
                        inode_permission(inode, MAY_OPEN | acc_mode)
                            security_inode_permission(inode, acc_mode)
                    vfs_open(path, file)
                        do_dentry_open(file, path->dentry->d_inode, open)
                            security_file_open(f)
                            open()
        /* old location of path_noexec() test */

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Aleksa Sarai
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The execve(2)/uselib(2) syscalls have always rejected non-regular files.
    Recently, it was noticed that a deadlock was introduced when trying to
    execute pipes, as the S_ISREG() test was happening too late. This was
    fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
    during execve()"), but it was added after inode_permission() had already
    run, which meant LSMs could see bogus attempts to execute non-regular
    files.

    Move the test into the other inode type checks (which already look for
    other pathological conditions[1]). Since there is no need to use
    FMODE_EXEC while we still have access to "acc_mode", also switch the test
    to MAY_EXEC.

    Also include a comment with the redundant S_ISREG() checks at the end of
    execve(2)/uselib(2) to note that they are present to avoid any mistakes.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
        struct open_flags open_exec_flags = {
            .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
            .acc_mode = MAY_EXEC,
            ...
        do_filp_open(dfd, filename, open_flags)
            path_openat(nameidata, open_flags, flags)
                file = alloc_empty_file(open_flags, current_cred());
                do_open(nameidata, file, open_flags)
                    may_open(path, acc_mode, open_flag)
                        /* new location of MAY_EXEC vs S_ISREG() test */
                        inode_permission(inode, MAY_OPEN | acc_mode)
                            security_inode_permission(inode, acc_mode)
                    vfs_open(path, file)
                        do_dentry_open(file, path->dentry->d_inode, open)
                            /* old location of FMODE_EXEC vs S_ISREG() test */
                            security_file_open(f)
                            open()

    [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "Relocate execve() sanity checks", v2.

    While looking at the code paths for the proposed O_MAYEXEC flag, I saw
    some things that looked like they should be fixed up.

    exec: Change uselib(2) IS_SREG() failure to EACCES
    This just regularizes the return code on uselib(2).

    exec: Move S_ISREG() check earlier
    This moves the S_ISREG() check even earlier than it was already.

    exec: Move path_noexec() check earlier
    This adds the path_noexec() check to the same place as the
    S_ISREG() check.

    This patch (of 3):

    Change uselib(2)'s S_ISREG() error return to EACCES instead of EINVAL so
    the behavior matches execve(2), and the seemingly documented value. The
    "not a regular file" failure mode of execve(2) is explicitly
    documented[1], but it is not mentioned in uselib(2)[2] which does,
    however, say that open(2) and mmap(2) errors may apply. The documentation
    for open(2) does not include a "not a regular file" error[3], but mmap(2)
    does[4], and it is EACCES.
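
    The execve(2) side of this can be observed from userspace with a
    small probe. This is a sketch; exec_errno_nonregular is a
    hypothetical helper that refuses regular files so the calling
    process is never replaced. It uses execve(2) rather than uselib(2),
    which most C libraries no longer wrap.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Try to execve() a path that is not a regular file and report the
 * resulting errno. Returns -1 without calling execve() if the path
 * cannot be stat'ed or is a regular file.
 */
static int exec_errno_nonregular(const char *path)
{
    struct stat st;
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };

    if (stat(path, &st) != 0 || S_ISREG(st.st_mode))
        return -1;
    execve(path, argv, envp);   /* only returns on failure */
    return errno;
}
```

    Calling exec_errno_nonregular("/") on a Linux system should report
    EACCES, the value documented in execve(2) for a file that is not a
    regular file.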

    [1] http://man7.org/linux/man-pages/man2/execve.2.html#ERRORS
    [2] http://man7.org/linux/man-pages/man2/uselib.2.html#ERRORS
    [3] http://man7.org/linux/man-pages/man2/open.2.html#ERRORS
    [4] http://man7.org/linux/man-pages/man2/mmap.2.html#ERRORS

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Acked-by: Christian Brauner
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-1-keescook@chromium.org
    Link: http://lkml.kernel.org/r/20200605160013.3954297-2-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Both exec and exit want to ensure that the uaccess routines actually do
    access user pointers. Use the newly added force_uaccess_begin helper
    instead of an open coded set_fs for that to prepare for kernel builds
    where set_fs() does not exist.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Nick Hu
    Cc: Greentime Hu
    Cc: Vincent Chen
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

21 Jul, 2020

6 commits

  • To allow the kernel to call exec without playing games with set_fs,
    implement kernel_execve. The function kernel_execve takes pointers
    into kernel memory and copies the values pointed to onto the new
    userspace stack.

    The calls with arguments from kernel space of do_execve are replaced
    with calls to kernel_execve.

    The calls do_execve and do_execveat are made static as there are now
    no callers outside of exec.

    The comments that mention do_execve are updated to refer to
    kernel_execve or execve depending on the circumstances. In addition
    to correcting the comments, this makes it easy to grep for do_execve
    and verify it is not used.

    Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • In preparation for implementing kernel_execve (which will take kernel
    pointers not userspace pointers) factor bprm_stack_limits out of
    prepare_arg_pages. This separates the counting, which depends upon
    getting data from userspace, from the calculation of the stack limits,
    which is usable in kernel_execve.

    Then remove prepare_arg_pages and compute bprm->argc and bprm->envc
    directly in do_execveat_common, before bprm_stack_limits is called.
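
    The split between counting and limit calculation can be sketched in
    userspace as below. The arithmetic is not spelled out in this commit
    message; it is an assumption based on the limits fs/exec.c is known
    to apply (a quarter of the stack rlimit, capped at three quarters of
    the default stack limit, with an ARG_MAX floor). All names and
    constants here are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed constants for the model. */
#define MODEL_STK_LIM (8UL * 1024 * 1024)  /* default stack rlimit */
#define MODEL_ARG_MAX 131072UL             /* 32 pages of 4 KiB */

/* Step 1: counting has to walk the vector, wherever it lives. */
static int count_strings(const char *const *vec)
{
    int n = 0;

    while (vec && vec[n])
        n++;
    return n;
}

/*
 * Step 2: pure arithmetic on the counts. Returns the byte budget left
 * for argument and environment strings, or -E2BIG if the pointer
 * arrays alone would use it up.
 */
static long stack_string_budget(unsigned long rlim_cur, int argc, int envc)
{
    unsigned long limit = MODEL_STK_LIM / 4 * 3;
    unsigned long ptr_size;

    if (rlim_cur / 4 < limit)
        limit = rlim_cur / 4;
    if (limit < MODEL_ARG_MAX)
        limit = MODEL_ARG_MAX;

    ptr_size = ((unsigned long)argc + (unsigned long)envc) * sizeof(void *);
    if (limit <= ptr_size)
        return -E2BIG;
    return (long)(limit - ptr_size);
}
```

    Only the first step needs access to the argument vectors, so a
    caller holding kernel pointers can count them its own way and reuse
    the second step unchanged.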

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Factor bprm_execve
    out of do_execve_common to separate the copying of arguments
    to the new stack from the rest of exec.

    In separating bprm_execve from do_execve_common the copying
    of the arguments onto the new stack happens earlier.

    As the copying of the arguments does not depend on any security hooks,
    files, the file table, current->in_execve, current->fs->in_exec,
    bprm->unsafe, or creds, this is safe.

    Likewise the security hook security_creds_for_exec does not depend upon
    preventing the argument copying from happening.

    In addition to making it possible to implement kernel_execve that
    performs the copying differently, this separation of bprm_execve from
    do_execve_common makes for a nice separation of responsibilities making
    the exec code easier to navigate.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code that
    launches init to use set_fs so that pages coming from the kernel look like
    they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument copying
    from userspace needs to happen earlier. Move the allocation and
    initialization of bprm->mm into alloc_bprm so that the bprm->mm is
    available early to store the new user stack into. This is a prerequisite
    for copying argv and envp into the new user stack early, before the rest
    of exec.

    To keep things consistent the cleanup of bprm->mm is moved into
    free_bprm, so that bprm->mm will be cleaned up whenever bprm->mm is
    allocated and free_bprm is called.

    Moving bprm_mm_init earlier is safe as it does not depend on any files,
    current->in_execve, current->fs->in_exec, bprm->unsafe, or whether the
    file table is shared. (AKA bprm_mm_init does not depend on any of the
    code that happens between alloc_bprm and where it was previously called.)

    This moves bprm->mm cleanup after current->fs->in_exec is set to 0. This
    is safe because current->fs->in_exec is only used to prevent taking an
    additional reference on the fs_struct.

    This moves bprm->mm cleanup after current->in_execve is set to 0. This is
    safe because current->in_execve is only used by the LSMs (apparmor and
    tomoyo) and always for LSM specific functions, never for anything to do
    with the mm.

    This adds bprm->mm cleanup into the successful return path. This is safe
    because being on the successful return path implies that begin_new_exec
    succeeded and set bprm->mm to NULL. As bprm->mm is NULL, the bprm->mm
    cleanup I am moving into free_bprm will do nothing.
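
    The ownership rule in the last paragraph can be modelled in
    userspace. This is a sketch with hypothetical model_* names, not the
    kernel structures: the cleanup routine frees the mm only while the
    bprm still owns it, so it can run on both the failure and the
    success path.

```c
#include <stdlib.h>

/* Hypothetical userspace stand-ins for struct linux_binprm and its mm. */
struct model_mm { int unused; };
struct model_bprm { struct model_mm *mm; };

static int mm_frees; /* counts how often cleanup actually released an mm */

/* Allocate the bprm together with its mm, as alloc_bprm now does. */
static struct model_bprm *model_alloc_bprm(void)
{
    struct model_bprm *bprm = calloc(1, sizeof(*bprm));

    if (!bprm)
        return NULL;
    bprm->mm = calloc(1, sizeof(*bprm->mm));
    if (!bprm->mm) {
        free(bprm);
        return NULL;
    }
    return bprm;
}

/* Models the point of no return: the task takes ownership of the mm. */
static struct model_mm *model_begin_new_exec(struct model_bprm *bprm)
{
    struct model_mm *mm = bprm->mm;

    bprm->mm = NULL;
    return mm;
}

/* One cleanup routine serves both the failure and the success path. */
static void model_free_bprm(struct model_bprm *bprm)
{
    if (bprm->mm) {
        free(bprm->mm);
        mm_frees++;
    }
    free(bprm);
}
```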

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Move the computation
    of bprm->filename and possible allocation of a name in the case
    of execveat into alloc_bprm to make that possible.

    The executable name, the arguments, and the environment are
    copied into the new usermode stack which is stored in bprm
    until exec passes the point of no return.

    As the executable name is copied first onto the usermode stack
    it needs to be known. As there are no dependencies to computing
    the executable name, compute it early in alloc_bprm.

    As an implementation detail, if the filename needs to be generated
    because it embeds a file descriptor, store that filename in a new field
    bprm->fdpath, and free it in free_bprm. Previously this was done in
    an independent variable pathbuf. I have renamed pathbuf to fdpath
    because fdpath is more suggestive of what kind of path is in the
    variable. I moved fdpath into struct linux_binprm because it is
    tightly tied to the other variables in struct linux_binprm, and as
    such is needed to allow the call to alloc_bprm to move.

    Reviewed-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Currently it is necessary for the usermode helper code and the code
    that launches init to use set_fs so that pages coming from the kernel
    look like they are coming from userspace.

    To allow that usage of set_fs to be removed cleanly the argument
    copying from userspace needs to happen earlier. Move the allocation
    of the bprm into its own function (alloc_bprm) and move the call of
    alloc_bprm before unshare_files so that bprm can ultimately be
    allocated, the arguments can be placed on the new stack, and then the
    bprm can be passed into the core of exec.

    Neither the allocation of struct linux_binprm nor the unsharing depends
    upon the other, so swapping the order in which they are called is
    trivially safe.

    To keep things consistent the order of cleanup at the end of
    do_execve_common is swapped to match the order of initialization.

    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2020

1 commit

  • Now that the last caller has been removed, remove this code from exec.

    For anyone thinking of resurrecting do_execve_file please note that
    the code was buggy in several fundamental ways.

    - It did not ensure the file it was passed was read-only and that
    deny_write_access had been called on it, which subtly breaks
    invariants in exec.

    - The caller of do_execve_file was expected to hold and put a
    reference to the file, but an extra reference for use by exec was
    not taken so that when exec put its reference to the file an
    underflow occurred on the file reference count.

    - The point of the interface was so that a pathname did not need to
    exist, which breaks pathname based LSMs.

    Tetsuo Handa originally reported these issues[1]. While it was clear
    that deny_write_access was missing, the fundamental incompatibility
    with the passed in O_RDWR filehandle was not immediately recognized.

    All of these issues were fixed by modifying the usermode driver code
    to have a path, so it did not need this hack.

    Reported-by: Tetsuo Handa
    [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
    v1: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/87lfk54p0m.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-10-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

09 Jun, 2020

1 commit