08 Apr, 2022

4 commits

  • commit 390031c942116d4733310f0684beb8db19885fe6 upstream.

    Matthew Wilcox reported that there is a missing mmap_lock in
    fill_files_note that could possibly lead to a use after free.

    Solve this by using the existing vma snapshot for consistency
    and to avoid the need to take the mmap_lock anywhere in the
    coredump code except for dump_vma_snapshot.

    Update the dump_vma_snapshot to capture vm_pgoff and vm_file
    that are needed by fill_files_note.

    Add free_vma_snapshot to free the captured values of vm_file.

    Reported-by: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org
    Cc: stable@vger.kernel.org
    Fixes: a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
    Fixes: 2aa362c49c31 ("coredump: extend core dump note section to contain file names of mapped files")
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 9ec7d3230717b4fe9b6c7afeb4811909c23fa1d7 upstream.

    Instead of individually passing cprm->siginfo and cprm->regs
    into fill_note_info pass all of struct coredump_params.

    This is preparation to allow fill_files_note to use the existing
    vma snapshot.

    Reviewed-by: Jann Horn
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 95c5436a4883841588dae86fb0b9325f47ba5ad3 upstream.

    Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the
    individual coredump routines into do_coredump itself. This makes
    the code less error prone and easier to maintain.

    Make the vma snapshot available to the coredump routines
    in struct coredump_params. This makes it easier to
    change and update what is captured in the vma snapshot
    and will be needed for fixing fill_files_note.

    Reviewed-by: Jann Horn
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • [ Upstream commit 0da1d5002745cdc721bc018b582a8a9704d56c42 ]

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=197921

    As pointed out in the discussion of buglink, we cannot calculate AT_PHDR
    as the sum of load_addr and exec->e_phoff.

    : The AT_PHDR of ELF auxiliary vectors should point to the memory address
    : of program header. But binfmt_elf.c calculates this address as follows:
    :
    : NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
    :
    : which is wrong since e_phoff is the file offset of program header and
    : load_addr is the memory base address from PT_LOAD entry.
    :
    : The ld.so uses AT_PHDR as the memory address of program header. In normal
    : case, since the e_phoff is usually 64 and in the first PT_LOAD region, it
    : is the correct program header address.
    :
    : But if the address of program header isn't equal to the first PT_LOAD
    : address + e_phoff (e.g. Put the program header in other non-consecutive
    : PT_LOAD region), ld.so will try to read program header from wrong address
    : then crash or use incorrect program header.

    This is because exec->e_phoff
    is the offset of PHDRs in the file and the address of PHDRs in the
    memory may differ from it. This patch fixes the bug by calculating the
    address of program headers from PT_LOADs directly.
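
    The calculation described above can be sketched in user space as
    follows. This is only an illustration, not the kernel code: the
    struct, the SKETCH_PT_LOAD constant and the function name are
    hypothetical, and the patch itself may handle details differently.
    The idea is to find the PT_LOAD segment whose file range covers
    e_phoff and translate that file offset into a memory address.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PT_LOAD 1u  /* same value as PT_LOAD in the ELF specification */

/* Hypothetical, trimmed-down program header; only the fields used below. */
struct sketch_phdr {
    uint32_t p_type;
    uint64_t p_offset;  /* file offset of the segment */
    uint64_t p_vaddr;   /* link-time virtual address of the segment */
    uint64_t p_filesz;  /* bytes of the segment backed by the file */
};

/*
 * Return the in-memory address of the program header table, or 0 if no
 * PT_LOAD segment covers file offset e_phoff.
 */
static uint64_t sketch_phdr_addr(const struct sketch_phdr *ph, size_t n,
                                 uint64_t e_phoff, uint64_t load_bias)
{
    for (size_t i = 0; i < n; i++) {
        if (ph[i].p_type != SKETCH_PT_LOAD)
            continue;
        /* Is e_phoff inside the file range of this segment? */
        if (e_phoff >= ph[i].p_offset &&
            e_phoff - ph[i].p_offset < ph[i].p_filesz)
            return e_phoff - ph[i].p_offset + ph[i].p_vaddr + load_bias;
    }
    return 0;
}
```

    When the headers sit in the first PT_LOAD at file offset 0, this
    gives the same answer as load_addr + e_phoff; the two differ only
    when the headers live in a later, non-consecutive segment.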

    Signed-off-by: Akira Kawata
    Reported-by: kernel test robot
    Acked-by: Kees Cook
    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/20220127124014.338760-2-akirakawata1@gmail.com
    Signed-off-by: Sasha Levin

    Akira Kawata
     

04 Oct, 2021

1 commit

  • In commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf
    executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
    load_elf_interp.

    Unfortunately, this will cause the kernel to fail to start with:

    1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
    Failed to execute /init (error -17)

    The reason is that the elf interpreter (ld.so) has overlapping segments.

    readelf -l ld-2.31.so
    Program Headers:
    Type Offset VirtAddr PhysAddr
    FileSiz MemSiz Flags Align
    LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x000000000002c94c 0x000000000002c94c R E 0x10000
    LOAD 0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
    0x00000000000021e8 0x0000000000002320 RW 0x10000
    LOAD 0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
    0x00000000000011ac 0x0000000000001328 RW 0x10000

    The reason for this problem is the same as described in commit
    ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments").

    Not only executable binaries, elf interpreters (e.g. ld.so) can have
    overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
    back to MAP_FIXED in load_elf_interp.
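
    To see why the segments listed above collide, the page ranges they
    occupy can be compared. The following is a small sketch, not kernel
    code; the helper names are hypothetical and the page size is an
    assumption passed in by the caller (the commit message does not
    state it).

```c
#include <stdbool.h>
#include <stdint.h>

/* Round down to a page boundary; page_size must be a power of two. */
static uint64_t sketch_page_start(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Round up to a page boundary; page_size must be a power of two. */
static uint64_t sketch_page_end(uint64_t addr, uint64_t page_size)
{
    return (addr + page_size - 1) & ~(page_size - 1);
}

/* True if the pages backing [vaddr_a, +memsz_a) and [vaddr_b, +memsz_b) intersect. */
static bool sketch_pages_overlap(uint64_t vaddr_a, uint64_t memsz_a,
                                 uint64_t vaddr_b, uint64_t memsz_b,
                                 uint64_t page_size)
{
    uint64_t a_start = sketch_page_start(vaddr_a, page_size);
    uint64_t a_end = sketch_page_end(vaddr_a + memsz_a, page_size);
    uint64_t b_start = sketch_page_start(vaddr_b, page_size);
    uint64_t b_end = sketch_page_end(vaddr_b + memsz_b, page_size);

    return a_start < b_end && b_start < a_end;
}
```

    With the second and third LOAD entries from the readelf output
    (0x3dae0/0x2320 and 0x3fe00/0x1328), the ranges share a page for
    both 4 KiB and 64 KiB pages, so a mapping that refuses to replace
    an existing one fails on the third segment.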

    Fixes: 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map")
    Cc: # v4.19
    Cc: Andrew Morton
    Cc: Michal Hocko
    Signed-off-by: Chen Jingwen
    Signed-off-by: Linus Torvalds

    Chen Jingwen
     

04 Sep, 2021

2 commits

  • At exec time when we mmap the new executable via MAP_DENYWRITE we have it
    opened via do_open_execat() and already deny_write_access()'ed the file
    successfully. Once exec completes, we allow_write_access(); however,
    we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
    also deny_write_access() as long as mm->exe_file remains set. We'll
    effectively deny write access to our executable via mm->exe_file
    until mm->exe_file is changed -- when the process is removed, on new
    exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).

    Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
    mm->exe_file.

    In case of an elf interpreter, we'll now only deny write access to the file
    during exec. This is somewhat okay, because the interpreter behaves like
    (and sometimes is) a shared library; all shared libraries, especially the
    ones loaded directly in user space like via dlopen() won't ever be mapped
    via MAP_DENYWRITE, because we ignore that from user space completely;
    these shared libraries can always be modified while mapped and executed.
    Let's only special-case the main executable, denying write access while
    being executed by a process. This can be considered a minor user space
    visible change.

    While this is a cleanup, it also fixes part of a problem reported with
    VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
    this patch and will be removed next:
    "Overlayfs did not honor positive i_writecount on realfile for
    VM_DENYWRITE mappings." [1]

    [1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/

    Reported-by: Chengguang Xu
    Acked-by: "Eric W. Biederman"
    Acked-by: Christian König
    Signed-off-by: David Hildenbrand

    David Hildenbrand
     
  • uselib() is the legacy system call for loading shared libraries.
    Nowadays, applications use dlopen() to load shared libraries, completely
    implemented in user space via mmap().

    For example, glibc uses MAP_COPY to mmap shared libraries. While this
    maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
    MAP_DENYWRITE specification from user space in mmap.

    With this change, all remaining in-tree users of MAP_DENYWRITE use it
    to map an executable. We will be able to open shared libraries loaded
    via uselib() writable, just as we already can via dlopen() from user
    space.

    This is one step into the direction of removing MAP_DENYWRITE from the
    kernel. This can be considered a minor user space visible change.

    Acked-by: "Eric W. Biederman"
    Acked-by: Christian König
    Signed-off-by: David Hildenbrand

    David Hildenbrand
     

30 Jun, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "191 patches.

    Subsystems affected by this patch series: kthread, ia64, scripts,
    ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
    slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
    mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
    pagealloc, and memory-failure)"

    * emailed patches from Andrew Morton: (191 commits)
    mm,hwpoison: make get_hwpoison_page() call get_any_page()
    mm,hwpoison: send SIGBUS with error virutal address
    mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
    mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
    mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
    mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
    docs: remove description of DISCONTIGMEM
    arch, mm: remove stale mentions of DISCONIGMEM
    mm: remove CONFIG_DISCONTIGMEM
    m68k: remove support for DISCONTIGMEM
    arc: remove support for DISCONTIGMEM
    arc: update comment about HIGHMEM implementation
    alpha: remove DISCONTIGMEM and NUMA
    mm/page_alloc: move free_the_page
    mm/page_alloc: fix counting of managed_pages
    mm/page_alloc: improve memmap_pages dbg msg
    mm: drop SECTION_SHIFT in code comments
    mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
    mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
    mm/page_alloc: scale the number of pages that are batch freed
    ...

    Linus Torvalds
     
  • Ever since commit e9714acf8c43 ("mm: kill vma flag VM_EXECUTABLE and
    mm->num_exe_file_vmas"), VM_EXECUTABLE is gone and MAP_EXECUTABLE is
    essentially completely ignored. Let's remove all usage of MAP_EXECUTABLE.

    [akpm@linux-foundation.org: fix blooper in fs/binfmt_aout.c. per David]

    Link: https://lkml.kernel.org/r/20210421093453.6904-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: "Eric W. Biederman"
    Reviewed-by: Kees Cook
    Cc: Alexander Shishkin
    Cc: Alexander Viro
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Don Zickus
    Cc: Feng Tang
    Cc: Greg Ungerer
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Kevin Brodsky
    Cc: Mark Rutland
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

18 Jun, 2021

1 commit

  • Change the type and name of task_struct::state. Drop the volatile and
    shrink it to an 'unsigned int'. Rename it in order to find all uses
    such that we can use READ_ONCE/WRITE_ONCE as appropriate.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Acked-by: Will Deacon
    Acked-by: Daniel Thompson
    Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org

    Peter Zijlstra
     

08 Mar, 2021

1 commit

  • have dump_skip() just remember how much needs to be skipped,
    leave actual seeks/writing zeroes to the next dump_emit()
    or the end of coredump output, whichever comes first.
    And instead of playing with do_truncate() in the end, just
    write one NUL at the end of the last gap (if any).
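
    The deferred-skip idea can be sketched with an in-memory buffer
    standing in for the core file. All names here are hypothetical and
    this is not the kernel implementation: skip only records the gap,
    and the gap is materialised by the next emit or by the final flush.
    A file-backed version could seek over the gap instead of filling
    it, which is why only one NUL is needed at the very end.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical dump state: an output buffer plus a pending gap. */
struct sketch_dump {
    unsigned char *buf;
    size_t cap;      /* capacity of buf */
    size_t pos;      /* bytes already produced */
    size_t to_skip;  /* gap remembered by sketch_skip(), not yet produced */
};

/* Only remember how much needs to be skipped. */
static void sketch_skip(struct sketch_dump *d, size_t n)
{
    d->to_skip += n;
}

/* Turn the pending gap into zero bytes; returns 0 if it does not fit. */
static int sketch_flush_gap(struct sketch_dump *d)
{
    if (d->to_skip > d->cap - d->pos)
        return 0;
    memset(d->buf + d->pos, 0, d->to_skip);
    d->pos += d->to_skip;
    d->to_skip = 0;
    return 1;
}

/* Emit data, first settling any gap left by earlier skips. */
static int sketch_emit(struct sketch_dump *d, const void *data, size_t n)
{
    if (!sketch_flush_gap(d))
        return 0;
    if (n > d->cap - d->pos)
        return 0;
    memcpy(d->buf + d->pos, data, n);
    d->pos += n;
    return 1;
}

/* End of output: settle the last gap, if any. */
static int sketch_finish(struct sketch_dump *d)
{
    return sketch_flush_gap(d);
}
```

    Consecutive skips simply accumulate, so a run of skipped regions
    costs one operation instead of one per region.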

    Signed-off-by: Al Viro

    Al Viro
     

22 Feb, 2021

1 commit

  • Pull parisc updates from Helge Deller:

    - Optimize parisc page table locks by using the existing
    page_table_lock

    - Export argv0-preserve flag in binfmt_misc for usage in qemu-user

    - Fix interrupt table (IVT) checksum so firmware will call crash
    handler (HPMC)

    - Increase IRQ stack to 64kb on 64-bit kernel

    - Switch to common devmem_is_allowed() implementation

    - Minor fix to get_wchan()

    * 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
    binfmt_misc: pass binfmt_misc flags to the interpreter
    parisc: Optimize per-pagetable spinlocks
    parisc: Replace test_ti_thread_flag() with test_tsk_thread_flag()
    parisc: Bump 64-bit IRQ stack size to 64 KB
    parisc: Fix IVT checksum calculation wrt HPMC
    parisc: Use the generic devmem_is_allowed()
    parisc: Drop out of get_whan() if task is running again

    Linus Torvalds
     

16 Feb, 2021

1 commit

  • It can be useful to the interpreter to know which flags are in use.

    For instance, knowing if the preserve-argv[0] is in use would
    allow to skip the pathname argument.

    This patch uses an unused auxiliary vector, AT_FLAGS, to add a
    flag to inform interpreter if the preserve-argv[0] is enabled.

    Note by Helge Deller:
    The real-world user of this patch is qemu-user, which needs to know
    if it has to preserve the argv[0]. See Debian bug #970460.
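
    On the interpreter side, the check could look roughly like the
    sketch below. The constant is defined locally and is assumed to
    mirror the AT_FLAGS_PRESERVE_ARGV0 bit exported by the kernel
    headers; the function name is hypothetical.

```c
#include <stdbool.h>

/*
 * Assumption: mirrors AT_FLAGS_PRESERVE_ARGV0 (bit 0) from the kernel
 * UAPI headers. Defined here so the sketch does not depend on them.
 */
#define SKETCH_AT_FLAGS_PRESERVE_ARGV0 (1UL << 0)

/* True if binfmt_misc was configured to preserve argv[0]. */
static bool sketch_preserve_argv0(unsigned long at_flags)
{
    return (at_flags & SKETCH_AT_FLAGS_PRESERVE_ARGV0) != 0;
}
```

    On Linux with glibc, the value to pass in can be read with
    getauxval(AT_FLAGS) from <sys/auxv.h>.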

    Signed-off-by: Laurent Vivier
    Reviewed-by: YunQiang Su
    URL: http://bugs.debian.org/970460
    Signed-off-by: Helge Deller

    Laurent Vivier
     

06 Jan, 2021

1 commit

  • Preparations to doing i386 compat elf_prstatus sanely - rather than duplicating
    the beginning of compat_elf_prstatus, take these fields into a separate
    structure (compat_elf_prstatus_common), so that it could be reused. Due to
    the incestuous relationship between binfmt_elf.c and compat_binfmt_elf.c we
    need the same shape change done to native struct elf_prstatus, gathering the
    fields prior to pr_reg into a new structure (struct elf_prstatus_common).

    Fortunately, offset of pr_reg is always a multiple of 16 with no padding
    right before it, so it's possible to turn all the stuff prior to it into
    a single member without disturbing the layout.

    [build fix from Geert Uytterhoeven folded in]

    Signed-off-by: Al Viro

    Al Viro
     

05 Jan, 2021

1 commit

  • On 64bit architectures that support 32bit processes there are
    two possible layouts for NT_PRSTATUS note in ELF coredumps.
    For one thing, several fields are 64bit for native processes
    and 32bit for compat ones (pr_sigpend, etc.). For another,
    the register dump is obviously different - the size and number
    of registers are not going to be the same for 32bit and 64bit
    variants of processor.

    Usually that's handled by having two structures - elf_prstatus
    for native layout and compat_elf_prstatus for 32bit one.
    32bit processes are handled by fs/compat_binfmt_elf.c, which
    defines a macro called 'elf_prstatus' that expands to compat_elf_prstatus.
    Then it includes fs/binfmt_elf.c, which makes all references to
    struct elf_prstatus to be textually replaced with struct
    compat_elf_prstatus. Ugly and somewhat brittle, but it works.

    However, amd64 is worse - there are _three_ possible layouts.
    One for native 64bit processes, another for i386 (32bit) processes
    and yet another for x32 (32bit address space with full 64bit
    registers).

    Both i386 and x32 processes are handled by fs/compat_binfmt_elf.c,
    with usual compat_binfmt_elf.c trickery. However, the layouts
    for i386 and x32 are not identical - they have the common beginning,
    but the register dump part (pr_reg) is bigger on x32. Worse, pr_reg
    is not the last field - it's followed by int pr_fpvalid, so that
    field ends up at different offsets for i386 and x32 layouts.

    Fortunately, there's not much code that cares about any of that -
    it's all encapsulated in fill_thread_core_info(). Since x32
    variant is bigger, we define compat_elf_prstatus to match that
    layout. That way i386 processes have enough space to fit
    their layout into.

    Moreover, since these layouts are identical prior to pr_reg,
    we don't need to distinguish x32 and i386 cases when we are
    setting the fields prior to pr_reg.

    Filling pr_reg itself is done by calling ->get() method of
    appropriate regset, and that method knows what layout (and size)
    to use.

    We do need to distinguish x32 and i386 cases only for two
    things: setting ->pr_fpvalid (offset differs for x32 and
    i386) and choosing the right size for our note.

    The way it's done is Not Nice, for the lack of more accurate
    printable description. There are two macros (PRSTATUS_SIZE and
    SET_PR_FPVALID), that default essentially to sizeof(struct elf_prstatus)
    and (S)->pr_fpvalid = 1. On x86 asm/compat.h provides its own
    variants.

    Unfortunately, quite a few things go wrong there:
    * PRSTATUS_SIZE doesn't use the normal test for process
    being an x32 one; it compares the size reported by regset with
    the size of pr_reg.
    * it hardcodes the sizes of x32 and i386 variants (296 and 144
    resp.), so if some change in includes leads to asm/compat.h pulled
    in by fs/binfmt_elf.c we are in trouble - it will end up using
    the size of x32 variant for 64bit processes.
    * it's in the wrong place; asm/compat.h couldn't define
    the structure for i386 layout, since it lacks quite a few types
    needed for it. Hardcoded sizes are largely due to that.

    The proper fix would be to have an explicitly defined i386 variant
    of structure and have PRSTATUS_SIZE/SET_PR_FPVALID check for
    TIF_X32 to choose the variant that should be used. Unfortunately,
    that requires some manipulations of headers; we'll do that later
    in the series, but for now let's go with the minimal variant -
    rename PRSTATUS_SIZE in asm/compat.h to COMPAT_PRSTATUS_SIZE,
    have fs/compat_binfmt_elf.c define PRSTATUS_SIZE to COMPAT_PRSTATUS_SIZE
    and use the normal TIF_X32 check in that macro. The size of i386 variant
    is kept hardcoded for now. Similar story for SET_PR_FPVALID.

    Signed-off-by: Al Viro

    Al Viro
     

16 Dec, 2020

1 commit

  • …biederm/user-namespace

    Pull execve updates from Eric Biederman:
    "This set of changes ultimately fixes the interaction of posix file
    lock and exec. Fundamentally most of the change is just moving where
    unshare_files is called during exec, and tweaking the users of
    files_struct so that the count of files_struct is not unnecessarily
    played with.

    Along the way fcheck and related helpers were renamed to more
    accurately reflect what they do.

    There were also many other small changes that fell out, as this is the
    first time in a long time much of this code has been touched.

    Benchmarks haven't turned up any practical issues but Al Viro has
    observed a possibility for a lot of pounding on task_lock. So I have
    some changes in progress to convert put_files_struct to always rcu
    free files_struct. That wasn't ready for the merge window so that will
    have to wait until next time"

    * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    exec: Move io_uring_task_cancel after the point of no return
    coredump: Document coredump code exclusively used by cell spufs
    file: Remove get_files_struct
    file: Rename __close_fd_get_file close_fd_get_file
    file: Replace ksys_close with close_fd
    file: Rename __close_fd to close_fd and remove the files parameter
    file: Merge __alloc_fd into alloc_fd
    file: In f_dupfd read RLIMIT_NOFILE once.
    file: Merge __fd_install into fd_install
    proc/fd: In fdinfo seq_show don't use get_files_struct
    bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
    proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
    file: Implement task_lookup_next_fd_rcu
    kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
    proc/fd: In tid_fd_mode use task_lookup_fd_rcu
    file: Implement task_lookup_fd_rcu
    file: Rename fcheck lookup_fd_rcu
    file: Replace fcheck_files with files_lookup_fd_rcu
    file: Factor files_lookup_fd_locked out of fcheck_files
    file: Rename __fcheck_files to files_lookup_fd_raw
    ...

    Linus Torvalds
     

15 Dec, 2020

1 commit

  • Pull x86 cleanups from Borislav Petkov:
    "Another branch with a nicely negative diffstat, just the way I
    like 'em:

    - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in
    the end (Gabriel Krisman Bertazi)

    - All kinds of minor cleanups all over the tree"

    * tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    x86/ia32_signal: Propagate __user annotation properly
    x86/alternative: Update text_poke_bp() kernel-doc comment
    x86/PCI: Make a kernel-doc comment a normal one
    x86/asm: Drop unused RDPID macro
    x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
    x86/head64: Remove duplicate include
    x86/mm: Declare 'start' variable where it is used
    x86/head/64: Remove unused GET_CR2_INTO() macro
    x86/boot: Remove unused finalize_identity_maps()
    x86/uaccess: Document copy_from_user_nmi()
    x86/dumpstack: Make show_trace_log_lvl() static
    x86/mtrr: Fix a kernel-doc markup
    x86/setup: Remove unused MCA variables
    x86, libnvdimm/test: Remove COPY_MC_TEST
    x86: Reclaim TIF_IA32 and TIF_X32
    x86/mm: Convert mmu context ia32_compat into a proper flags field
    x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages()
    elf: Expose ELF header on arch_setup_additional_pages()
    x86/elf: Use e_machine to select start_thread for x32
    elf: Expose ELF header in compat_start_thread()
    ...

    Linus Torvalds
     

11 Dec, 2020

1 commit

  • Oleg Nesterov recently asked[1] why is there an unshare_files in
    do_coredump. After digging through all of the callers of lookup_fd it
    turns out that it is
    arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
    that needs the unshare_files in do_coredump.

    Looking at the history[2] this code was also the only piece of coredump code
    that required the unshare_files when the unshare_files was added.

    Looking at that code it turns out that cell is also the only
    architecture that implements elf_coredump_extra_notes_size and
    elf_coredump_extra_notes_write.

    Looking at the gdb repo[3], support for cell has been removed[4] in binutils
    2.34. Geoff Levand reports he is still getting questions on how to
    run modern kernels on the PS3, from people using 3rd party firmware so
    this code is not dead. According to Wikipedia the last PS3 shipped in
    Japan sometime in 2017. So it will probably be a little while before
    everyone's hardware dies.

    Add some comments briefly documenting the coredump code that exists
    only to support cell spufs to make it easier to understand the
    coredump code. Eventually the hardware will be dead, or there won't
    be userspace tools, or the coredump code will be refactored and it
    will be too difficult to update a dead architecture and these comments
    make it easy to tell where to pull to remove cell spufs support.

    [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
    [2] 179e037fc137 ("do_coredump(): make sure that descriptor table isn't shared")
    [3] git://sourceware.org/git/binutils-gdb.git
    [4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
    Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

30 Oct, 2020

1 commit

  • There is a regular need in the kernel to provide a way to declare having a
    dynamically sized set of trailing elements in a structure. Kernel code should
    always use “flexible array members”[1] for these cases. The older style of
    one-element or zero-length arrays should no longer be used[2].

    [1] https://en.wikipedia.org/wiki/Flexible_array_member
    [2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays
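
    As a user-space illustration of the recommended style (the struct
    and function names are hypothetical, and kernel code would normally
    size the allocation with the struct_size() helper rather than by
    hand):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* A header followed by a dynamically sized set of trailing elements. */
struct sketch_list {
    size_t count;
    unsigned long values[];  /* flexible array member: no size, must be last */
};

/* Allocate a list with room for count zeroed elements; NULL on failure. */
static struct sketch_list *sketch_list_alloc(size_t count)
{
    struct sketch_list *l;

    /* Reject sizes that would overflow the allocation computation. */
    if (count > (SIZE_MAX - sizeof(*l)) / sizeof(l->values[0]))
        return NULL;

    l = malloc(sizeof(*l) + count * sizeof(l->values[0]));
    if (!l)
        return NULL;

    l->count = count;
    for (size_t i = 0; i < count; i++)
        l->values[i] = 0;
    return l;
}
```

    Unlike a one-element array, the flexible array member does not
    count towards sizeof(struct sketch_list), so the size computation
    needs no "minus one" adjustment.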

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

26 Oct, 2020

2 commits

  • Like it is done for SET_PERSONALITY with ARM, which requires the ELF
    header to select correct personality parameters, x86 requires the
    headers when selecting which VDSO to load, instead of relying on the
    going-away TIF_IA32/X32 flags.

    Add an indirection macro to arch_setup_additional_pages(), that x86 can
    reimplement to receive the extra parameter just for ELF files. This
    requires no changes to other architectures, who can continue to use the
    original arch_setup_additional_pages for ELF and non-ELF binaries.
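
    The indirection can be pictured with the following stand-alone
    sketch. Every name is a hypothetical stand-in for the kernel one,
    and the return values exist only to make the two paths
    distinguishable; the actual patch may differ in detail.

```c
struct sketch_binprm { int unused; };
struct sketch_elfhdr { unsigned short e_machine; };

/* Existing hook shared by ELF and non-ELF loaders; it never sees the header. */
static int sketch_arch_setup_additional_pages(struct sketch_binprm *bprm,
                                              int uses_interp)
{
    (void)bprm;
    (void)uses_interp;
    return 0;  /* 0 stands for "generic path taken" in this sketch */
}

#ifdef SKETCH_ARCH_WANTS_ELF_HEADER
/* An architecture that needs the header supplies its own ELF-only variant. */
static int sketch_elf_setup_additional_pages(struct sketch_binprm *bprm,
                                             const struct sketch_elfhdr *ex,
                                             int uses_interp)
{
    (void)bprm;
    (void)uses_interp;
    return ex->e_machine;  /* stands for "choice made from the header" */
}
#define SKETCH_ARCH_SETUP_ADDITIONAL_PAGES(bprm, ex, interp) \
    sketch_elf_setup_additional_pages(bprm, ex, interp)
#endif

/* Default: drop the header argument and call the original hook. */
#ifndef SKETCH_ARCH_SETUP_ADDITIONAL_PAGES
#define SKETCH_ARCH_SETUP_ADDITIONAL_PAGES(bprm, ex, interp) \
    sketch_arch_setup_additional_pages(bprm, interp)
#endif

/* The ELF loader always goes through the macro. */
static int sketch_load_elf(struct sketch_binprm *bprm,
                           const struct sketch_elfhdr *ex, int uses_interp)
{
    (void)ex;  /* unused when the default macro drops it */
    return SKETCH_ARCH_SETUP_ADDITIONAL_PAGES(bprm, ex, uses_interp);
}
```

    Architectures that do not define the macro compile exactly as
    before, which is why no other architecture needs to change.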

    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20201004032536.1229030-8-krisman@collabora.com

    Gabriel Krisman Bertazi
     
  • Like it is done for SET_PERSONALITY with x86, which requires the ELF header
    to select correct personality parameters, x86 requires the headers on
    compat_start_thread() to choose starting CS for ELF32 binaries, instead of
    relying on the going-away TIF_IA32/X32 flags.

    Add an indirection macro to ELF invocations of START_THREAD, that x86 can
    reimplement to receive the extra parameter just for ELF files. This
    requires no changes to other architectures who don't need the header
    information, they can continue to use the original start_thread for ELF and
    non-ELF binaries, and it prevents affecting non-ELF code paths for x86.

    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20201004032536.1229030-6-krisman@collabora.com

    Gabriel Krisman Bertazi
     

19 Oct, 2020

1 commit

  • create_elf_tables() runs after setup_new_exec(), so other tasks can
    already access our new mm and do things like process_madvise() on it. (At
    the time I'm writing this commit, process_madvise() is not in mainline
    yet, but has been in akpm's tree for some time.)

    While I believe that there are currently no APIs that would actually allow
    another process to mess up our VMA tree (process_madvise() is limited to
    MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
    under which no syscalls have been executed yet), this seems like an
    accident waiting to happen.

    Let's make sure that we always take the mmap lock around GUP paths as long
    as another process might be able to see the mm.

    (Yes, this diff looks suspicious because we drop the lock before doing
    anything with `vma`, but that's because we actually don't do anything with
    it apart from the NULL check.)

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Michel Lespinasse
    Cc: "Eric W . Biederman"
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Mauro Carvalho Chehab
    Cc: Sakari Ailus
    Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     

17 Oct, 2020

4 commits

  • In both binfmt_elf and binfmt_elf_fdpic, use a new helper
    dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
    VMA, if we have one) while protected by the mmap_lock, and then use that
    snapshot instead of walking the VMA list without locking.

    An alternative approach would be to keep the mmap_lock held across the
    entire core dumping operation; however, keeping the mmap_lock locked while
    we may be blocked for an unbounded amount of time (e.g. because we're
    dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
    blocks things like the ->release handler of userfaultfd, and we don't
    really want critical system daemons to grind to a halt just because
    someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
    something like that.

    Since both the normal ELF code and the FDPIC ELF code need this
    functionality (and if any other binfmt wants to add coredump support in
    the future, they'd probably need it, too), implement this with a common
    helper in fs/coredump.c.

    A downside of this approach is that we now need a bigger amount of kernel
    memory per userspace VMA in the normal ELF case, and that we need O(n)
    kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
    be terribly bad.

    There currently is a data race between stack expansion and anything that
    reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
    mitigate that for core dumping, take the mmap_lock in write mode when
    taking a snapshot of the VMA hierarchy. (If we only took the mmap_lock in
    read mode, we could end up with a corrupted core dump if someone does
    get_user_pages_remote() concurrently. Not really a major problem, but
    taking the mmap_lock either way works here, so we might as well avoid the
    issue.) (This doesn't do anything about the existing data races with stack
    expansion in other mm code.)
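
    The snapshot pattern can be sketched in user space as below. The
    types and functions are hypothetical, and the lock is a plain flag
    standing in for the mmap_lock so that the sketch stays
    self-contained; it is not the kernel helper. The point is that the
    list is walked only while the lock is held, and everything
    afterwards works on the private copy.

```c
#include <stddef.h>
#include <stdlib.h>

/* Live, lock-protected list node (stand-in for a VMA). */
struct sketch_vma {
    unsigned long start, end;
    struct sketch_vma *next;
};

/* Private copy of the fields the dumper needs. */
struct sketch_vma_meta {
    unsigned long start, end;
};

struct sketch_mm {
    struct sketch_vma *vmas;
    int map_count;
    int locked;  /* stand-in for the real lock */
};

static void sketch_lock(struct sketch_mm *mm)   { mm->locked = 1; }
static void sketch_unlock(struct sketch_mm *mm) { mm->locked = 0; }

/* Copy the list while locked; returns 1 on success, 0 on allocation failure. */
static int sketch_snapshot(struct sketch_mm *mm,
                           struct sketch_vma_meta **out, int *count)
{
    struct sketch_vma_meta *meta;
    struct sketch_vma *v;
    int i = 0;

    sketch_lock(mm);
    meta = calloc(mm->map_count > 0 ? (size_t)mm->map_count : 1, sizeof(*meta));
    if (!meta) {
        sketch_unlock(mm);
        return 0;
    }
    for (v = mm->vmas; v && i < mm->map_count; v = v->next, i++) {
        meta[i].start = v->start;
        meta[i].end = v->end;
    }
    sketch_unlock(mm);

    /* From here on the caller may block for as long as it likes. */
    *out = meta;
    *count = i;
    return 1;
}
```

    The cost is the extra memory per entry that the commit message
    mentions; the caller frees the array once the dump is written.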

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
    different code to figure out which VMAs should be dumped, and if so,
    whether the dump should contain the entire VMA or just its first page.

    Eliminate duplicate code by reworking the binfmt_elf version into a
    generic core dumping helper in coredump.c.

    As part of that, change the heuristic for detecting executable/library
    header pages to check whether the inode is executable instead of looking
    at the file mode.

    This is less problematic in terms of locking because it lets us avoid
    get_user() under the mmap_sem. (And arguably it looks nicer and makes
    more sense in generic code.)

    Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
    only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
    has been written to.

    Suggested-by: Linus Torvalds
    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of
    pages into the coredump file. Extract that logic into a common helper.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-4-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Patch series "Selecting Load Addresses According to p_align", v3.

    The current ELF loading mechanism provides page-aligned mappings. This
    can lead to the program being loaded in a way unsuitable for file-backed,
    transparent huge pages when handling PIE executables.

    While specifying -z,max-page-size=0x200000 to the linker will generate
    suitably aligned segments for huge pages on x86_64, the executable needs
    to be loaded at a suitably aligned address as well. This alignment
    requires the binary's cooperation, as distinct segments need to be
    appropriately padded to be eligible for THP.

    For binaries built with increased alignment, this limits the number of
    bits usable for ASLR, but provides some randomization over using fixed
    load addresses/non-PIE binaries.

    This patch (of 2):

    The current ELF loading mechanism provides page-aligned mappings. This
    can lead to the program being loaded in a way unsuitable for file-backed,
    transparent huge pages when handling PIE executables.

    For binaries built with increased alignment, this limits the number of
    bits usable for ASLR, but provides some randomization over using fixed
    load addresses/non-PIE binaries.

    Tested by verifying program with -Wl,-z,max-page-size=0x200000 loading.
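
    As a rough illustration of the idea described above, the sketch below
    picks the largest power-of-two p_align among loadable segments and
    rounds a candidate load address down to it. This is a minimal userspace
    model, not the kernel's code: struct phdr_sketch, SKETCH_PT_LOAD,
    SKETCH_PAGE_SIZE, max_load_alignment() and align_load_address() are
    names assumed for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PT_LOAD   1u     /* loadable segment type (assumed name) */
#define SKETCH_PAGE_SIZE 4096u  /* base page size assumed for this sketch */

/* Reduced stand-in for an ELF program header: only the fields used here. */
struct phdr_sketch {
    uint32_t p_type;
    uint64_t p_align;
};

static int is_power_of_two(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Return the largest power-of-two p_align found on a loadable segment,
 * never less than the base page size. Non-loadable segments and
 * alignments that are not powers of two are ignored.
 */
static uint64_t max_load_alignment(const struct phdr_sketch *ph, size_t n)
{
    uint64_t align = SKETCH_PAGE_SIZE;

    for (size_t i = 0; i < n; i++) {
        if (ph[i].p_type != SKETCH_PT_LOAD)
            continue;
        if (!is_power_of_two(ph[i].p_align))
            continue;
        if (ph[i].p_align > align)
            align = ph[i].p_align;
    }
    return align;
}

/* Round a candidate load address down to a power-of-two alignment. */
static uint64_t align_load_address(uint64_t addr, uint64_t align)
{
    assert(is_power_of_two(align));
    return addr & ~(align - 1);
}
```

    With 4 KiB pages and a 2 MiB alignment, the low 21 bits of the load
    address are fixed instead of the low 12, so 9 fewer bits remain for
    randomization. That is the ASLR trade-off the message mentions.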

    [akpm@linux-foundation.org: fix max() warning]
    [ckennelly@google.com: augment comment]
    Link: https://lkml.kernel.org/r/20200821233848.3904680-2-ckennelly@google.com

    Signed-off-by: Chris Kennelly
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Song Liu
    Cc: David Rientjes
    Cc: Ian Rogers
    Cc: Hugh Dickins
    Cc: Suren Baghdasaryan
    Cc: Sandeep Patil
    Cc: Fangrui Song
    Cc: Nick Desaulniers
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Link: https://lkml.kernel.org/r/20200820170541.1132271-1-ckennelly@google.com
    Link: https://lkml.kernel.org/r/20200820170541.1132271-2-ckennelly@google.com
    Signed-off-by: Linus Torvalds

    Chris Kennelly
     

28 Jul, 2020

2 commits

  • all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
    been defined on any architecture since 2010

    Signed-off-by: Al Viro

    Al Viro
     
  • Two new helpers: given a process and regset, dump into a buffer.
    regset_get() takes a buffer and size, regset_get_alloc() takes size
    and allocates a buffer.

    Return value in both cases is the amount of data actually dumped in
    case of success or -E... on error.

    In both cases the size is capped by regset->n * regset->size, so
    ->get() is called with offset 0 and size no more than what regset
    expects.

    binfmt_elf.c callers of ->get() are switched to using those; the other
    caller (copy_regset_to_user()) will need some preparations to switch.
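
    The contract described above (size capped at n * size, return value is
    the number of bytes dumped or a negative errno) can be modelled in
    userspace roughly as follows. This is a sketch under stated
    assumptions, not the kernel helpers themselves: struct regset_model
    and the *_model() functions are hypothetical names, and the real
    helpers operate on a task and a regset rather than a flat byte array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a register set: n registers of `size` bytes each. */
struct regset_model {
    unsigned int n;
    unsigned int size;
    const unsigned char *regs;  /* n * size bytes of register data */
};

/* Clamp a requested size to what the register set can provide. */
static unsigned int regset_cap(const struct regset_model *rs, unsigned int size)
{
    size_t cap = (size_t)rs->n * rs->size;

    return size > cap ? (unsigned int)cap : size;
}

/* Caller supplies the buffer; returns bytes dumped or -errno. */
static int regset_get_model(const struct regset_model *rs,
                            unsigned int size, void *buf)
{
    assert(rs != NULL);
    if (!buf)
        return -EINVAL;
    size = regset_cap(rs, size);
    memcpy(buf, rs->regs, size);
    return (int)size;
}

/* Allocates the buffer itself; on success *out owns it and must be freed. */
static int regset_get_alloc_model(const struct regset_model *rs,
                                  unsigned int size, void **out)
{
    void *buf;
    int ret;

    assert(rs != NULL);
    size = regset_cap(rs, size);
    buf = calloc(1, size ? size : 1);
    if (!buf)
        return -ENOMEM;
    ret = regset_get_model(rs, size, buf);
    if (ret < 0) {
        free(buf);
        return ret;
    }
    *out = buf;
    return ret;
}
```

    Because both variants apply the same cap before copying, a caller that
    asks for more than the register set holds simply gets a shorter dump
    and learns the actual length from the return value.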

    Signed-off-by: Al Viro

    Al Viro
     

11 Jun, 2020

1 commit

  • Pull misc uaccess updates from Al Viro:
    "Assorted uaccess patches for this cycle - the stuff that didn't fit
    into thematic series"

    * 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    bpf: make bpf_check_uarg_tail_zero() use check_zeroed_user()
    x86: kvm_hv_set_msr(): use __put_user() instead of 32bit __clear_user()
    user_regset_copyout_zero(): use clear_user()
    TEST_ACCESS_OK _never_ had been checked anywhere
    x86: switch cp_stat64() to unsafe_put_user()
    binfmt_flat: don't use __put_user()
    binfmt_elf_fdpic: don't use __... uaccess primitives
    binfmt_elf: don't bother with __{put,copy_to}_user()
    pselect6() and friends: take handling the combined 6th/7th args into helper

    Linus Torvalds
     

05 Jun, 2020

3 commits

  • Merge yet more updates from Andrew Morton:

    - More MM work. 100ish more to go. Mike Rapoport's "mm: remove
    __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

    - Various other little subsystems

    * emailed patches from Andrew Morton: (127 commits)
    lib/ubsan.c: fix gcc-10 warnings
    tools/testing/selftests/vm: remove duplicate headers
    selftests: vm: pkeys: fix multilib builds for x86
    selftests: vm: pkeys: use the correct page size on powerpc
    selftests/vm/pkeys: override access right definitions on powerpc
    selftests/vm/pkeys: test correct behaviour of pkey-0
    selftests/vm/pkeys: introduce a sub-page allocator
    selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
    selftests/vm/pkeys: associate key on a mapped page and detect write violation
    selftests/vm/pkeys: associate key on a mapped page and detect access violation
    selftests/vm/pkeys: improve checks to determine pkey support
    selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
    selftests/vm/pkeys: fix number of reserved powerpc pkeys
    selftests/vm/pkeys: introduce powerpc support
    selftests/vm/pkeys: introduce generic pkey abstractions
    selftests: vm: pkeys: use the correct huge page size
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
    selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
    selftests/vm/pkeys: fix pkey_disable_clear()
    selftests: vm: pkeys: add helpers for pkey bits
    ...

    Linus Torvalds
     
  • The ifndef was added a long time ago to support archs that would define
    their own mapping function. The last user was the metag arch which was
    removed from the tree, and as such there are no users left. Let's kill
    it.

    Signed-off-by: Anthony Iliopoulos
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200402161543.4119-1-ailiop@suse.com
    Signed-off-by: Linus Torvalds

    Anthony Iliopoulos
     
  • Pull execve updates from Eric Biederman:
    "Last cycle for the Nth time I ran into bugs and quality of
    implementation issues related to exec that could not easily be
    fixed because of the way exec is implemented. So I have been digging
    into exec and cleaning up what I can.

    I don't think I have exec sorted out enough to fix the issues I
    started with but I have made some headway this cycle with 4 sets of
    changes.

    - promised cleanups after introducing exec_update_mutex

    - trivial cleanups for exec

    - control flow simplifications

    - remove the recomputation of bprm->cred

    The net result is code that is a bit easier to understand and work
    with and a decrease in the number of lines of code (if you don't count
    the added tests)"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
    exec: Compute file based creds only once
    exec: Add a per bprm->file version of per_clear
    binfmt_elf_fdpic: fix execfd build regression
    selftests/exec: Add binfmt_script regression test
    exec: Remove recursion from search_binary_handler
    exec: Generic execfd support
    exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
    exec: Move the call of prepare_binprm into search_binary_handler
    exec: Allow load_misc_binary to call prepare_binprm unconditionally
    exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
    exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
    exec: Teach prepare_exec_creds how exec treats uids & gids
    exec: Set the point of no return sooner
    exec: Move handling of the point of no return to the top level
    exec: Run sync_mm_rss before taking exec_update_mutex
    exec: Fix spelling of search_binary_handler in a comment
    exec: Move the comment from above de_thread to above unshare_sighand
    exec: Rename flush_old_exec begin_new_exec
    exec: Move most of setup_new_exec into flush_old_exec
    exec: In setup_new_exec cache current in the local variable me
    ...

    Linus Torvalds
     

04 Jun, 2020

1 commit


02 Jun, 2020

2 commits

  • Pull uaccess/coredump updates from Al Viro:
    "set_fs() removal in coredump-related area - mostly Christoph's
    stuff..."

    * 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump
    binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
    binfmt_elf: remove the set_fs in fill_siginfo_note
    signal: refactor copy_siginfo_to_user32
    powerpc/spufs: simplify spufs core dumping
    powerpc/spufs: stop using access_ok
    powerpc/spufs: fix copy_to_user while atomic

    Linus Torvalds
     
  • Pull arm64 updates from Will Deacon:
    "A sizeable pile of arm64 updates for 5.8.

    Summary below, but the big two features are support for Branch Target
    Identification and Clang's Shadow Call stack. The latter is currently
    arm64-only, but the high-level parts are all in core code so it could
    easily be adopted by other architectures pending toolchain support

    Branch Target Identification (BTI):

    - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
    branch targets to limit the types of branch from which they can be
    called and additionally prevents branching to arbitrary code,
    although kernel support requires a very recent toolchain.

    - Function annotation via SYM_FUNC_START() so that assembly functions
    are wrapped with the relevant "landing pad" instructions.

    - BPF and vDSO updates to use the new instructions.

    - Addition of a new HWCAP and exposure of BTI capability to userspace
    via ID register emulation, along with ELF loader support for the
    BTI feature in .note.gnu.property.

    - Non-critical fixes to CFI unwind annotations in the sigreturn
    trampoline.

    Shadow Call Stack (SCS):

    - Support for Clang's Shadow Call Stack feature, which reserves
    platform register x18 to point at a separate stack for each task
    that holds only return addresses. This protects function return
    control flow from buffer overruns on the main stack.

    - Save/restore of x18 across problematic boundaries (user-mode,
    hypervisor, EFI, suspend, etc).

    - Core support for SCS, should other architectures want to use it
    too.

    - SCS overflow checking on context-switch as part of the existing
    stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

    CPU feature detection:

    - Removed numerous "SANITY CHECK" errors when running on a system
    with mismatched AArch32 support at EL1. This is primarily a concern
    for KVM, which disabled support for 32-bit guests on such a system.

    - Addition of new ID registers and fields as the architecture has
    been extended.

    Perf and PMU drivers:

    - Minor fixes and cleanups to system PMU drivers.

    Hardware errata:

    - Unify KVM workarounds for VHE and nVHE configurations.

    - Sort vendor errata entries in Kconfig.

    Secure Monitor Call Calling Convention (SMCCC):

    - Update to the latest specification from Arm (v1.2).

    - Allow PSCI code to query the SMCCC version.

    Software Delegated Exception Interface (SDEI):

    - Unexport a bunch of unused symbols.

    - Minor fixes to handling of firmware data.

    Pointer authentication:

    - Add support for dumping the kernel PAC mask in vmcoreinfo so that
    the stack can be unwound by tools such as kdump.

    - Simplification of key initialisation during CPU bringup.

    BPF backend:

    - Improve immediate generation for logical and add/sub instructions.

    vDSO:

    - Minor fixes to the linker flags for consistency with other
    architectures and support for LLVM's unwinder.

    - Clean up logic to initialise and map the vDSO into userspace.

    ACPI:

    - Work around for an ambiguity in the IORT specification relating to
    the "num_ids" field.

    - Support _DMA method for all named components rather than only PCIe
    root complexes.

    - Minor other IORT-related fixes.

    Miscellaneous:

    - Initialise debug traps early for KGDB and fix KDB cacheflushing
    deadlock.

    - Minor tweaks to early boot state (documentation update, set
    TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

    - Refactoring and cleanup"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
    KVM: arm64: Check advertised Stage-2 page size capability
    arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
    ACPI/IORT: Remove the unused __get_pci_rid()
    arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
    arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
    arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
    arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
    arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
    arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
    arm64/cpufeature: Introduce ID_MMFR5 CPU register
    arm64/cpufeature: Introduce ID_DFR1 CPU register
    arm64/cpufeature: Introduce ID_PFR2 CPU register
    arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
    arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
    arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
    arm64: mm: Add asid_gen_match() helper
    firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
    arm64: vdso: Fix CFI directives in sigreturn trampoline
    arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
    ...

    Linus Torvalds
     

29 May, 2020

1 commit

  • KMSAN reported uninitialized data being written to disk when dumping
    core. As a result, several kilobytes of kmalloc memory may be written
    to the core file and then read by a non-privileged user.
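
    The message above does not spell out the fix, so the following only
    illustrates one common remedy for this class of leak: zero-fill a
    record before partially populating it, so that any bytes the payload
    does not cover are written out as zeros rather than stale heap
    contents. build_note_record() is a hypothetical helper for this
    sketch, not a function from the kernel.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a fixed-size record whose payload may be shorter than the record.
 * calloc() zero-fills the allocation, so the unused tail never exposes
 * whatever the allocator previously stored there. Returns NULL on bad
 * arguments or allocation failure; the caller frees the result.
 */
static unsigned char *build_note_record(const void *payload,
                                        size_t payload_len,
                                        size_t record_len)
{
    unsigned char *rec;

    if (!payload || record_len == 0 || payload_len > record_len)
        return NULL;

    rec = calloc(1, record_len);
    if (!rec)
        return NULL;

    memcpy(rec, payload, payload_len);
    return rec;
}
```

    Had the record been obtained from a non-zeroing allocator, writing all
    record_len bytes to a file would also write the uninitialized tail.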

    Reported-by: sam
    Signed-off-by: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc:
    Link: http://lkml.kernel.org/r/20200419100848.63472-1-glider@google.com
    Link: https://github.com/google/kmsan/issues/76
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

21 May, 2020

1 commit

  • Most of the support for passing the file descriptor of an executable
    to an interpreter already lives in the generic code and in binfmt_elf.
    Rework the fields in binfmt_elf that deal with executable file
    descriptor passing to make executable file descriptor passing a first
    class concept.

    Move the fd_install from binfmt_misc into begin_new_exec after the new
    creds have been installed. This means that accessing the file through
    /proc/PID/fd/N is able to see the creds for the new executable
    before allowing access to the new executable's files.

    Performing the install of the executable's file descriptor after
    the point of no return also means that nothing special needs to
    be done on error. The exiting of the process will close all
    of its open files.

    Move the would_dump from binfmt_misc into begin_new_exec right
    after would_dump is called on the bprm->file. This makes it
    obvious this case exists and that no nesting of bprm->file is
    currently supported.

    In binfmt_misc the movement of fd_install into generic code means
    that its special error exit path is no longer needed.

    Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 May, 2020

2 commits

    There is, and has been for a very long time, a lot more going on in
    flush_old_exec than just flushing the old state. After the movement
    of code from setup_new_exec there is a whole lot more going on than
    just flushing the old executable's state.

    Rename flush_old_exec to begin_new_exec to more accurately reflect
    what this function does.

    Reviewed-by: Kees Cook
    Reviewed-by: Greg Ungerer
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The two functions are now always called one right after the
    other so merge them together to make future maintenance easier.

    Reviewed-by: Kees Cook
    Reviewed-by: Greg Ungerer
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

06 May, 2020

1 commit