01 Feb, 2020

2 commits

  • Pull updates from Andrew Morton:
    "Most of -mm and quite a number of other subsystems: hotfixes, scripts,
    ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.

    MM is fairly quiet this time. Holidays, I assume"

    * emailed patches from Andrew Morton : (118 commits)
    kcov: ignore fault-inject and stacktrace
    include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
    execve: warn if process starts with executable stack
    reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
    init/main.c: fix misleading "This architecture does not have kernel memory protection" message
    init/main.c: fix quoted value handling in unknown_bootoption
    init/main.c: remove unnecessary repair_env_string in do_initcall_level
    init/main.c: log arguments and environment passed to init
    fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
    fs/binfmt_elf.c: coredump: delete duplicated overflow check
    fs/binfmt_elf.c: coredump: allocate core ELF header on stack
    fs/binfmt_elf.c: make BAD_ADDR() unlikely
    fs/binfmt_elf.c: better codegen around current->mm
    fs/binfmt_elf.c: don't copy ELF header around
    fs/binfmt_elf.c: fix ->start_code calculation
    fs/binfmt_elf.c: smaller code generation around auxv vector fill
    lib/find_bit.c: uninline helper _find_next_bit()
    lib/find_bit.c: join _find_next_bit{_le}
    uapi: rename ext2_swab() to swab() and share globally in swab.h
    lib/scatterlist.c: adjust indentation in __sg_alloc_table
    ...

    Linus Torvalds
     
  • There have been a few episodes of silent downgrade to an executable
    stack over the years:

    1) linking an innocent-looking assembly file will silently add an
    executable stack if the proper linker options are not given:

    $ cat f.S
    .intel_syntax noprefix
    .text
    .globl f
    f:
            ret

    $ cat main.c
    void f(void);
    int main(void)
    {
            f();
            return 0;
    }

    $ gcc main.c f.S
    $ readelf -l ./a.out
      GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                     0x0000000000000000 0x0000000000000000  RWE    0x10
                                                            ^^^

    2) converting a C99 nested function into a closure
    https://nullprogram.com/blog/2019/11/15/

    #include <stdlib.h>     /* qsort(), size_t */

    void intsort2(int *base, size_t nmemb, _Bool invert)
    {
            /* nested function capturing 'invert' (GCC extension) */
            int cmp(const void *a, const void *b)
            {
                    int r = *(int *)a - *(int *)b;
                    return invert ? -r : r;
            }
            qsort(base, nmemb, sizeof(*base), cmp);
    }

    will silently require stack trampolines, while the non-closure version
    will not.
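
    For illustration (not part of the original message), such a non-closure
    version passes the context explicitly, e.g. with glibc's qsort_r(), so
    no trampoline and no executable stack are needed:

    #include <stdlib.h>

    /* Ordinary function: 'invert' arrives via qsort_r()'s extra
       comparator argument instead of being captured from the enclosing
       stack frame, so no trampoline is generated. */
    static int cmp_arg(const void *a, const void *b, void *arg)
    {
            int r = *(const int *)a - *(const int *)b;
            return *(_Bool *)arg ? -r : r;
    }

    void intsort2(int *base, size_t nmemb, _Bool invert)
    {
            qsort_r(base, nmemb, sizeof(*base), cmp_arg, &invert);
    }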

    No doubt this behaviour is documented somewhere; add a warning so that
    developers and users can at least notice. After so many years of x86_64
    having proper executable-stack support, it should not cause too many
    problems.
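
    The warning itself is small — roughly the following check, simplified
    from the patch's hunk in fs/exec.c:setup_arg_pages():

    /* Warn once if the freshly set up stack VMA ended up executable. */
    if (unlikely(vm_flags & VM_EXEC)) {
            pr_warn_once("process '%pD4' started with executable stack\n",
                         bprm->file);
    }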

    Link: http://lkml.kernel.org/r/20191208171918.GC19716@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Dan Carpenter
    Cc: Will Deacon
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

31 Jan, 2020

1 commit

  • Pull x86 MPX removal from Dave Hansen:
    "MPX requires recompiling applications, which requires compiler
    support. Unfortunately, GCC 9.1 is expected to be released without
    support for MPX. This means that there was only a relatively small
    window where folks could have ever used MPX. It failed to gain wide
    adoption in the industry, and Linux was the only mainstream OS to ever
    support it widely.

    Support for the feature may also disappear on future processors.

    This set completes the process that we started during the 5.4 merge
    window when the MPX prctl()s were removed. XSAVE support is left in
    place, which allows MPX-using KVM guests to continue to function"

    * tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx:
    x86/mpx: remove MPX from arch/x86
    mm: remove arch_bprm_mm_init() hook
    x86/mpx: remove bounds exception code
    x86/mpx: remove build infrastructure
    x86/alternatives: add missing insn.h include

    Linus Torvalds
     

24 Jan, 2020

1 commit

  • From: Dave Hansen

    MPX is being removed from the kernel due to a lack of support
    in the toolchain going forward (gcc).

    arch_bprm_mm_init() is used at execve() time. The only non-stub
    implementation is on x86 for MPX. Remove the hook entirely from
    all architectures and generic code.
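
    For reference, the x86 hook being removed was a thin MPX call — roughly
    (per the x86 mmu_context header, simplified):

    static inline void arch_bprm_mm_init(struct mm_struct *mm,
                                         struct vm_area_struct *vma)
    {
            mpx_mm_init(mm);        /* MPX setup at execve() time */
    }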

    Cc: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: x86@kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Anton Ivanov
    Cc: Guan Xuetao
    Cc: Andrew Morton
    Signed-off-by: Dave Hansen

    Dave Hansen
     

04 Dec, 2019

1 commit

  • Pull timer updates from Ingo Molnar:
    "The main changes in the timer code in this cycle were:

    - Clockevent updates:

    - timer-of framework cleanups. (Geert Uytterhoeven)

    - Use timer-of for the renesas-ostm driver and use the device name to
    prevent name collisions in case of multiple timers. (Geert Uytterhoeven)

    - Check if there is an error after calling of_clk_get in asm9260
    (Chuhong Yuan)

    - ABI fix: Zero out high order bits of nanoseconds on compat
    syscalls. This got broken a year ago, with apparently no side
    effects so far.

    Since the kernel would use random data otherwise I don't think we'd
    have other options but to fix the bug, even if there was a side
    effect to applications (Dmitry Safonov)

    - Optimize ns_to_timespec64() on 32-bit systems: move away from
    div_s64_rem() which can be slow, to div_u64_rem() which is faster
    (Arnd Bergmann)

    - Annotate KCSAN-reported false positive data races in
    hrtimer_is_queued() users by moving timer->state handling over to
    the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses
    (Eric Dumazet)

    - Misc cleanups and small fixes"

    [ I undid the "ABI fix" and updated the comments instead. The reason
    there were apparently no side effects is that the fix was a no-op.

    The updated comment is to say _why_ it was a no-op. - Linus ]
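
    The fix (later reduced to a comment update by Linus, as noted above)
    amounts to masking the padding bits when copying a user
    __kernel_timespec — a sketch, assuming the upstream get_timespec64()
    shape:

    /* tv_nsec is a 32-bit 'long' for 32-bit/compat userspace, so the
     * upper half of the 64-bit field is padding and must not leak into
     * the kernel value. */
    ts->tv_sec = kts.tv_sec;
    if (in_compat_syscall())
            kts.tv_nsec &= 0xFFFFFFFFUL;    /* zero the padding bits */
    ts->tv_nsec = kts.tv_nsec;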

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: Zero the upper 32-bits in __kernel_timespec on 32-bit
    time: Rename tsk->real_start_time to ->start_boottime
    hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ
    time: Fix spelling mistake in comment
    time: Optimize ns_to_timespec64()
    hrtimer: Annotate lockless access to timer->state
    clocksource/drivers/asm9260: Add a check for of_clk_get
    clocksource/drivers/renesas-ostm: Use unique device name instead of ostm
    clocksource/drivers/renesas-ostm: Convert to timer_of
    clocksource/drivers/timer-of: Use unique device name instead of timer
    clocksource/drivers/timer-of: Convert last full_name to %pOF

    Linus Torvalds
     

01 Dec, 2019

1 commit

  • …ux/kernel/git/dhowells/linux-fs

    Pull pipe rework from David Howells:
    "This is my set of preparatory patches for building a general
    notification queue on top of pipes. It makes a number of significant
    changes:

    - It removes the nr_exclusive argument from __wake_up_sync_key() as
    this is always 1. This prepares for the next step:

    - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
    woken up from a function that's holding the poll waitqueue
    spinlock.

    - Change the pipe buffer ring to be managed in terms of unbounded
    head and tail indices rather than bounded index and length. This
    means that reading the pipe only needs to modify one index, not
    two.

    - A selection of helper functions are provided to query the state of
    the pipe buffer, plus a couple to apply updates to the pipe
    indices.

    - The pipe ring is allowed to have kernel-reserved slots. This allows
    many notification messages to be spliced in by the kernel without
    allowing userspace to pin too many pages if it writes to the same
    pipe.

    - Advance the head and tail indices inside the pipe waitqueue lock
    and use wake_up_interruptible_sync_poll_locked() to poke poll
    without having to take the lock twice.

    - Rearrange pipe_write() to preallocate the buffer it is going to
    write into and then drop the spinlock. This allows kernel
    notifications to then be added to the ring whilst it is filling the
    buffer it allocated. The read side is stalled because the pipe
    mutex is still held.

    - Don't wake up readers on a pipe if there was already data in it
    when we added more.

    - Don't wake up writers on a pipe if the ring wasn't full before we
    removed a buffer"

    * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    pipe: Remove sync on wake_ups
    pipe: Increase the writer-wakeup threshold to reduce context-switch count
    pipe: Check for ring full inside of the spinlock in pipe_write()
    pipe: Remove redundant wakeup from pipe_write()
    pipe: Rearrange sequence in pipe_write() to preallocate slot
    pipe: Conditionalise wakeup in pipe_read()
    pipe: Advance tail pointer inside of wait spinlock in pipe_read()
    pipe: Allow pipes to have kernel-reserved slots
    pipe: Use head and tail pointers for the ring, not cursor and length
    Add wake_up_interruptible_sync_poll_locked()
    Remove the nr_exclusive argument from __wake_up_sync_key()
    pipe: Reduce #inclusion of pipe_fs_i.h

    Linus Torvalds
     

20 Nov, 2019

1 commit

  • mm_release() contains the futex exit handling. mm_release() is called from
    do_exit()->exit_mm() and from exec()->exec_mm().

    In the exit_mm() case PF_EXITING is set and the futex state is updated.
    In the exec_mm() case these states are not touched.

    As the futex exit code needs further protections against exit races, this
    needs to be split into two functions.

    Preparatory only, no functional change.
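
    The split ends up looking roughly like this (names per the series;
    futex_exit_release()/futex_exec_release() carry the differing futex
    handling):

    void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm)
    {
            futex_exit_release(tsk);        /* exit: futex exit state */
            mm_release(tsk, mm);
    }

    void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm)
    {
            futex_exec_release(tsk);        /* exec: no PF_EXITING side */
            mm_release(tsk, mm);
    }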

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de

    Thomas Gleixner
     

13 Nov, 2019

1 commit


24 Oct, 2019

1 commit


25 Sep, 2019

1 commit

  • The membarrier_state field is located within the mm_struct, which
    is not guaranteed to exist when used from runqueue-lock-free iteration
    on runqueues by the membarrier system call.

    Copy the membarrier_state from the mm_struct into the scheduler runqueue
    when the scheduler switches between mm.

    When registering membarrier for mm, after setting the registration bit
    in the mm membarrier state, issue a synchronize_rcu() to ensure the
    scheduler observes the change. In order to take care of the case
    where a runqueue keeps executing the target mm without switching to
    another mm, iterate over each runqueue and issue an IPI to copy the
    membarrier_state from the mm_struct into each runqueue that is
    running the mm whose state has just been modified.

    Move the mm membarrier_state field closer to pgd in mm_struct to use
    a cache line already touched by the scheduler switch_mm.

    The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
    clear the runqueue's membarrier state in addition to clearing the mm
    membarrier state, so move its implementation into the scheduler
    membarrier code so it can access the runqueue structure.

    Add memory barrier in membarrier_exec_mmap() prior to clearing
    the membarrier state, ensuring memory accesses executed prior to exec
    are not reordered with the stores clearing the membarrier state.

    As suggested by Linus, move all membarrier.c RCU read-side locks outside
    of the for each cpu loops.
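
    The mm-to-runqueue copy on context switch is small — a sketch along
    the lines of the scheduler-side helper this patch adds:

    /* Called at switch_mm time: keep rq->membarrier_state in sync with
     * the incoming mm so membarrier can read it without the mm_struct. */
    static inline void membarrier_switch_mm(struct rq *rq,
                                            struct mm_struct *prev_mm,
                                            struct mm_struct *next_mm)
    {
            int membarrier_state;

            if (prev_mm == next_mm)
                    return;
            membarrier_state = atomic_read(&next_mm->membarrier_state);
            if (READ_ONCE(rq->membarrier_state) == membarrier_state)
                    return;
            WRITE_ONCE(rq->membarrier_state, membarrier_state);
    }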

    Suggested-by: Linus Torvalds
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     

25 Jul, 2019

1 commit

  • When going through execve(), zero out the NUMA fault statistics instead of
    freeing them.

    During execve, the task is reachable through procfs and the scheduler. A
    concurrent /proc/*/sched reader can read data from a freed ->numa_faults
    allocation (confirmed by KASAN) and write it back to userspace.
    I believe that it would also be possible for a use-after-free read to occur
    through a race between a NUMA fault and execve(): task_numa_fault() can
    lead to task_numa_compare(), which invokes task_weight() on the currently
    running task of a different CPU.

    Another way to fix this would be to make ->numa_faults RCU-managed or add
    extra locking, but it seems easier to wipe the NUMA fault statistics on
    execve.
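
    A sketch of the resulting task_numa_free(), which gains a 'final' flag
    (numa_group detach elided): exec passes final=false so the stats are
    wiped in place, while the exit path passes final=true so they are
    really freed.

    void task_numa_free(struct task_struct *p, bool final)
    {
            unsigned long *numa_faults = p->numa_faults;
            int i;

            if (!numa_faults)
                    return;

            if (final) {
                    p->numa_faults = NULL;          /* exit: really free */
                    kfree(numa_faults);
            } else {
                    p->total_numa_faults = 0;       /* exec: just wipe */
                    for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
                            numa_faults[i] = 0;
            }
    }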

    Signed-off-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()")
    Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
    Signed-off-by: Ingo Molnar

    Jann Horn
     

09 Jul, 2019

1 commit

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because
    force_sig takes a task parameter, the function has been abused for
    sending other kinds of signals over the years. Slowly those have been
    fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"
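
    The net API change is easy to state — the task parameter disappears:

    /* Before the series: tempted callers into passing non-current tasks. */
    void force_sig(int sig, struct task_struct *p);

    /* After the series: synchronous signals always target current. */
    void force_sig(int sig);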

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     

27 May, 2019

1 commit


21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
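
    In a C source file the tag goes on the first line, e.g.:

    // SPDX-License-Identifier: GPL-2.0-only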

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit


08 Mar, 2019

2 commits

  • Large enterprise clients often run applications out of networked file
    systems where the IT mandated layout of project volumes can end up
    leading to paths that are longer than 128 characters. Bumping this up
    to the next order of two solves this problem in all but the most
    egregious case while still fitting into a 512b slab.
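
    The change itself is a single constant (sketch; BINPRM_BUF_SIZE lives
    in include/uapi/linux/binfmts.h):

    #define BINPRM_BUF_SIZE 256     /* was 128; bprm buffer still fits a 512b slab */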

    [oleg@redhat.com: update comment, per Kees]
    Link: http://lkml.kernel.org/r/20181112160956.GA28472@redhat.com
    Signed-off-by: Oleg Nesterov
    Reported-by: Ben Woodard
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Kees Cook
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Link: http://lkml.kernel.org/r/1548275584-18096-2-git-send-email-vgupta@synopsys.com
    Link: http://lkml.kernel.org/g/20150807115710.GA16897@redhat.com
    Signed-off-by: Vineet Gupta
    Reviewed-by: Anthony Yznaga
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Peter Zijlstra (Intel)
    Cc: Chris Wilson
    Cc: Ingo Molnar
    Cc: Jani Nikula
    Cc: Miklos Szeredi
    Cc: Theodore Ts'o
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     

07 Mar, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - refcount conversions

    - Solve the rq->leaf_cfs_rq_list can of worms for real.

    - improve power-aware scheduling

    - add sysctl knob for Energy Aware Scheduling

    - documentation updates

    - misc other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    kthread: Do not use TIMER_IRQSAFE
    kthread: Convert worker lock to raw spinlock
    sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
    sched/fair: Remove unused 'sd' parameter from select_idle_smt()
    sched/wait: Use freezable_schedule() when possible
    sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
    sched/fair: Explain LLC nohz kick condition
    sched/fair: Simplify nohz_balancer_kick()
    sched/topology: Fix percpu data types in struct sd_data & struct s_data
    sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
    sched/fair: Fix O(nr_cgroups) in the load balancing path
    sched/fair: Optimize update_blocked_averages()
    sched/fair: Fix insertion in rq->leaf_cfs_rq_list
    sched/fair: Add tmp_alone_branch assertion
    sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
    sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
    sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
    sched/fair: Update scale invariance of PELT
    sched/fair: Move the rq_of() helper function
    sched/core: Convert task_struct.stack_refcount to refcount_t
    ...

    Linus Torvalds
     

19 Feb, 2019

1 commit

  • syzkaller reported this:
    BUG: memory leak
    unreferenced object 0xffffc9000488d000 (size 9195520):
    comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
    hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00 ................
    02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff ..........z.....
    backtrace:
    [] __vmalloc_node mm/vmalloc.c:1795 [inline]
    [] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
    [] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
    [] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
    [] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
    [] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
    [] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    It should go to the 'out_free' label to free the allocated buf when
    kernel_read() fails.
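
    The fix is a one-liner in the kernel_read_file() read loop — jump to
    the label that frees the vmalloc'ed buffer (sketch):

    while (pos < i_size) {
            bytes = kernel_read(file, *buf + pos, i_size - pos, &pos);
            if (bytes < 0) {
                    ret = bytes;
                    goto out_free;  /* was 'goto out', leaking *buf */
            }
            if (bytes == 0)
                    break;
    }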

    Fixes: 39d637af5aa7 ("vfs: forbid write access when reading a file into memory")
    Signed-off-by: YueHaibing
    Signed-off-by: Al Viro

    YueHaibing
     

04 Feb, 2019

1 commit

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situations and be exploitable.

    The variable sighand_struct.count is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and will hopefully soon be
    merged into the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the sighand_struct.count it might make a difference
    in following places:

    - __cleanup_sighand: decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart
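
    Mechanically, the conversion consists of before/after pairs like these
    (sketch):

    refcount_set(&sighand->count, 1);               /* was atomic_set()          */
    refcount_inc(&sighand->count);                  /* was atomic_inc()          */
    if (refcount_dec_and_test(&sighand->count))     /* was atomic_dec_and_test() */
            kmem_cache_free(sighand_cachep, sighand);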

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     

06 Jan, 2019

1 commit


05 Jan, 2019

2 commits

  • This is already done for us internally by the signal machinery.

    [akpm@linux-foundation.org: fix fs/buffer.c]
    Link: http://lkml.kernel.org/r/20181116002713.8474-7-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • get_arg_page() checks bprm->rlim_stack.rlim_cur and re-calculates the
    "extra" size for argv/envp pointers every time, this is a bit ugly and
    even not strictly correct: acct_arg_size() must not account this size.

    Remove all the rlimit code in get_arg_page(). Instead, add bprm->argmin
    calculated once at the start of __do_execve_file() and change
    copy_strings to check bprm->p >= bprm->argmin.

    The patch adds the new helper, prepare_arg_pages() which initializes
    bprm->argc/envc and bprm->argmin.

    [oleg@redhat.com: fix !CONFIG_MMU version of get_arg_page()]
    Link: http://lkml.kernel.org/r/20181126122307.GA1660@redhat.com
    [akpm@linux-foundation.org: use max_t]
    Link: http://lkml.kernel.org/r/20181112160910.GA28440@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Tested-by: Guenter Roeck
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

10 Dec, 2018

1 commit


04 Dec, 2018

1 commit

  • Revert commit c22397888f1e "exec: make de_thread() freezable" as
    requested by Ingo Molnar:

    "So there's a new regression in v4.20-rc4, my desktop produces this
    lockdep splat:

    [ 1772.588771] WARNING: pkexec/4633 still has locks held!
    [ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
    [ 1772.588775] ------------------------------------
    [ 1772.588776] 1 lock held by pkexec/4633:
    [ 1772.588778] #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
    [ 1772.588786] stack backtrace:
    [ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
    [ 1772.588792] Call Trace:
    [ 1772.588800] dump_stack+0x85/0xcb
    [ 1772.588803] flush_old_exec+0x116/0x890
    [ 1772.588807] ? load_elf_phdrs+0x72/0xb0
    [ 1772.588809] load_elf_binary+0x291/0x1620
    [ 1772.588815] ? sched_clock+0x5/0x10
    [ 1772.588817] ? search_binary_handler+0x6d/0x240
    [ 1772.588820] search_binary_handler+0x80/0x240
    [ 1772.588823] load_script+0x201/0x220
    [ 1772.588825] search_binary_handler+0x80/0x240
    [ 1772.588828] __do_execve_file.isra.32+0x7d2/0xa60
    [ 1772.588832] ? strncpy_from_user+0x40/0x180
    [ 1772.588835] __x64_sys_execve+0x34/0x40
    [ 1772.588838] do_syscall_64+0x60/0x1c0

    The warning gets triggered by an ancient lockdep check in the freezer:

    (gdb) list *0xffffffff812ece06
    0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
    52 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
    53 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
    54 */
    55 static inline bool try_to_freeze_unsafe(void)
    56 {
    57 might_sleep();
    58 if (likely(!freezing(current)))
    59 return false;
    60 return __refrigerator(false);
    61 }

    I reviewed the ->cred_guard_mutex code, and the mutex is held across all
    of exec() - and we always did this.

    But there's this recent -rc4 commit:

    > Chanho Min (1):
    > exec: make de_thread() freezable

    c22397888f1e: exec: make de_thread() freezable

    I believe this commit is bogus, you cannot call try_to_freeze() from
    de_thread(), because it's holding the ->cred_guard_mutex."

    Reported-by: Ingo Molnar
    Tested-by: Ingo Molnar
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

19 Nov, 2018

1 commit

  • Suspend fails due to the exec family of functions blocking the freezer.
    The cause is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for
    all sub-threads to die, and we have the deadlock if one of them is frozen.
    This also can occur with the schedule() waiting for the group thread leader
    to exit if it is frozen.

    On our machine, it causes a freeze timeout as below.

    Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0):
    setcpushares-ls D ffffffc00008ed70 0 5817 1483 0x0040000d
    Call trace:
    [] __switch_to+0x88/0xa0
    [] __schedule+0x1bc/0x720
    [] schedule+0x40/0xa8
    [] flush_old_exec+0xdc/0x640
    [] load_elf_binary+0x2a8/0x1090
    [] search_binary_handler+0x9c/0x240
    [] load_script+0x20c/0x228
    [] search_binary_handler+0x9c/0x240
    [] do_execveat_common.isra.14+0x4f8/0x6e8
    [] compat_SyS_execve+0x38/0x48
    [] el0_svc_naked+0x24/0x28

    To fix this, make de_thread() freezable. It looks safe and works fine.
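
    The change boils down to using the freezer-aware sleep in de_thread()'s
    wait loops (sketch):

    /* de_thread() wait loop: the freezer can now freeze this task while
     * it sleeps waiting for sub-threads to die. */
    __set_current_state(TASK_KILLABLE);
    spin_unlock_irq(lock);
    freezable_schedule();   /* was: schedule() */
    spin_lock_irq(lock);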

    Suggested-by: Oleg Nesterov
    Signed-off-by: Chanho Min
    Acked-by: Oleg Nesterov
    Acked-by: Pavel Machek
    Acked-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Chanho Min
     

11 Oct, 2018

1 commit

  • On 32-bit systems, the buffer allocated by kernel_read_file() is too
    small if the file size is > SIZE_MAX, due to truncation to size_t.

    Fortunately, since the 'count' argument to kernel_read() is also
    truncated to size_t, only the allocated space is filled; then, -EIO is
    returned since 'pos != i_size' after the read loop.

    But this is not obvious and seems incidental. We should be more
    explicit about this case. So, fail early if i_size > SIZE_MAX.
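
    A sketch of the early check (shape per kernel_read_file(); SIZE_MAX
    guards the conversion of the loff_t file size to size_t on 32-bit):

    i_size = i_size_read(file_inode(file));
    if (i_size <= 0) {
            ret = -EINVAL;
            goto out;
    }
    if (i_size > SIZE_MAX || (max_size > 0 && i_size > max_size)) {
            ret = -EFBIG;
            goto out;
    }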

    Signed-off-by: Eric Biggers
    Signed-off-by: Mimi Zohar

    Eric Biggers
     

22 Aug, 2018

1 commit

  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead
    something only for very special cases. The part starts using
    PIDTYPE_TGID enough so that in __send_signal where signals are
    actually delivered we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

27 Jul, 2018

1 commit

  • vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
    VMA. This is unreliable as ->mmap may not set ->vm_ops.

    False-positive vma_is_anonymous() may lead to crashes:

    next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
    prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
    pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
    flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
    ------------[ cut here ]------------
    kernel BUG at mm/memory.c:1422!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
    01/01/2011
    RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
    RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
    RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
    RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
    Call Trace:
    unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
    zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
    unmap_mapping_range_vma mm/memory.c:2792 [inline]
    unmap_mapping_range_tree mm/memory.c:2813 [inline]
    unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
    unmap_mapping_range+0x48/0x60 mm/memory.c:2880
    truncate_pagecache+0x54/0x90 mm/truncate.c:800
    truncate_setsize+0x70/0xb0 mm/truncate.c:826
    simple_setattr+0xe9/0x110 fs/libfs.c:409
    notify_change+0xf13/0x10f0 fs/attr.c:335
    do_truncate+0x1ac/0x2b0 fs/open.c:63
    do_sys_ftruncate+0x492/0x560 fs/open.c:205
    __do_sys_ftruncate fs/open.c:215 [inline]
    __se_sys_ftruncate fs/open.c:213 [inline]
    __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reproducer: a syzkaller-generated program exercising the KCOV debugfs
    interface (the KCOV_INIT_TRACE, KCOV_ENABLE and KCOV_DISABLE ioctls);
    its include lines and body were mangled in extraction and are elided
    here.

    This can be fixed by assigning anonymous VMAs own vm_ops and not relying
    on it being NULL.

    If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
    dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.
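
    A sketch of the two sides of the fix (helper name per the series):

    /* Anonymous VMAs now opt in explicitly... */
    static inline void vma_set_anonymous(struct vm_area_struct *vma)
    {
            vma->vm_ops = NULL;
    }

    /* ...and mmap_region() papers over drivers whose ->mmap() forgot to
     * set ->vm_ops, so !vma->vm_ops reliably means "anonymous": */
    if (!vma->vm_ops)
            vma->vm_ops = &dummy_vm_ops;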

    Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jul, 2018

2 commits

  • Like vm_area_dup(), it initializes the anon_vma_chain head, and the
    basic mm pointer.

    The rest of the fields end up being different for different users,
    although the plan is to also initialize the 'vm_ops' field to a dummy
    entry.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The vm_area_struct is one of the most fundamental memory management
    objects, but the management of it is entirely open-coded everywhere,
    ranging from allocation and freeing (using kmem_cache_[z]alloc and
    kmem_cache_free) to initializing all the fields.

    We want to unify this in order to end up having some unified
    initialization of the vmas, and the first step to this is to at least
    have basic allocation functions.

    Right now those functions are literally just wrappers around the
    kmem_cache_*() calls. This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

    to the point where the old vma passed in to the vm_area_dup() function
    isn't even used yet (because I've left all the old manual initialization
    alone).
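
    Since the message says these are literally just wrappers at this
    stage, the sketch is short:

    struct vm_area_struct *vm_area_alloc(void)
    {
            return kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    }

    struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
    {
            /* 'orig' is not used yet; callers still copy fields manually. */
            return kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
    }

    void vm_area_free(struct vm_area_struct *vma)
    {
            kmem_cache_free(vm_area_cachep, vma);
    }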

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jul, 2018

1 commit

  • Everywhere except in the pid array we distinguish between a task's pid
    and a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.
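
    The enumeration after the change (per the series):

    enum pid_type {
            PIDTYPE_PID,
            PIDTYPE_TGID,   /* now a first-class type with its own pids[] slot */
            PIDTYPE_PGID,
            PIDTYPE_SID,
            PIDTYPE_MAX,
    };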

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

11 Jun, 2018

1 commit

  • Pull restartable sequence support from Thomas Gleixner:
    "The restartable sequences syscall (finally):

    After a lot of back and forth discussion and massive delays caused by
    the speculative distraction of maintainers, the core set of
    restartable sequences has finally reached a consensus.

    It comes with the basic non disputed core implementation along with
    support for arm, powerpc and x86 and a full set of selftests

    It was exposed to linux-next earlier this week, so it does not fully
    comply with the merge window requirements, but there is really no
    point to drag it out for yet another cycle"

    * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rseq/selftests: Provide Makefile, scripts, gitignore
    rseq/selftests: Provide parametrized tests
    rseq/selftests: Provide basic percpu ops test
    rseq/selftests: Provide basic test
    rseq/selftests: Provide rseq library
    selftests/lib.mk: Introduce OVERRIDE_TARGETS
    powerpc: Wire up restartable sequences system call
    powerpc: Add syscall detection for restartable sequences
    powerpc: Add support for restartable sequences
    x86: Wire up restartable sequence system call
    x86: Add support for restartable sequences
    arm: Wire up restartable sequences system call
    arm: Add syscall detection for restartable sequences
    arm: Add restartable sequences support
    rseq: Introduce restartable sequences system call
    uapi/headers: Provide types_32_64.h

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • Expose a new system call allowing each thread to register one userspace
    memory area to be used as an ABI between kernel and user-space for two
    purposes: user-space restartable sequences and quick access to read the
    current CPU number value from user-space.

    * Restartable sequences (per-cpu atomics)

    Restartable sequences allow user-space to perform update operations on
    per-cpu data without requiring heavy-weight atomic operations.

    The restartable critical sections (percpu atomics) work has been started
    by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
    critical sections. [1] [2] The re-implementation proposed here brings a
    few simplifications to the ABI which facilitates porting to other
    architectures and speeds up the user-space fast path.

    Here are benchmarks of various rseq use-cases.

    Test hardware:

    arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
    x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

    The following benchmarks were all performed on a single thread.

    * Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:      344.0                    31.4            11.0
    x86-64:     15.3                     2.0             7.7

    * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
    per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:      2502.0                   2250.0          1.1
    x86-64:     117.4                    98.0            1.2

    * liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:      751.0                    128.5           5.8
    x86-64:     53.4                     28.6            1.9

    * jemalloc memory allocator adapted to use rseq

    Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
    rseq 2016 implementation):

    The production workload response-time has 1-2% gain avg. latency, and
    the P99 overall latency drops by 2-3%.

    * Reading the current CPU number

    Speeding up reading the current CPU number on which the caller thread is
    running is done by keeping the current CPU number up to date within the
    cpu_id field of the memory area registered by the thread. This is done
    by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
    current thread. Upon return to user-space, a notify-resume handler
    updates the current CPU value within the registered user-space memory
    area. User-space can then read the current CPU number directly from
    memory.
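
    Illustrative only (not from the patch): once a thread has registered
    its struct rseq area via the rseq syscall, reading the CPU number is a
    single memory load. 'rseq_area' below is a hypothetical name for that
    registered per-thread area.

    #include <linux/rseq.h>         /* uapi struct rseq */

    /* Hypothetical: this thread is assumed to have registered 'rseq_area'
     * with the rseq system call beforehand. */
    static __thread volatile struct rseq rseq_area
                    __attribute__((aligned(32)));

    static inline int current_cpu(void)
    {
            return rseq_area.cpu_id;        /* kernel keeps this up to date */
    }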

    Keeping the current cpu id in a memory area shared between kernel and
    user-space is an improvement over current mechanisms available to read
    the current CPU number, which has the following benefits over
    alternative approaches:

    - 35x speedup on ARM vs system call through glibc
    - 20x speedup on x86 compared to calling glibc, which calls vdso
    executing a "lsl" instruction,
    - 14x speedup on x86 compared to inlined "lsl" instruction,
    - Unlike vdso approaches, this cpu_id value can be read from an inline
    assembly, which makes it a useful building block for restartable
    sequences.
    - The approach of reading the cpu id through memory mapping shared
    between kernel and user-space is portable (e.g. ARM), which is not the
    case for the lsl-based x86 vdso.

    On x86, yet another possible approach would be to use the gs segment
    selector to point to user-space per-cpu data. This approach performs
    similarly to the cpu id cache, but it has two disadvantages: it is
    not portable, and it is incompatible with existing applications already
    using the gs segment selector for other purposes.

    Benchmarking various approaches for reading the current CPU number:

    ARMv7 Processor rev 4 (v7l)
    Machine model: Cubietruck
    - Baseline (empty loop): 8.4 ns
    - Read CPU from rseq cpu_id: 16.7 ns
    - Read CPU from rseq cpu_id (lazy register): 19.8 ns
    - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
    - getcpu system call: 234.9 ns

    x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
    - Baseline (empty loop): 0.8 ns
    - Read CPU from rseq cpu_id: 0.8 ns
    - Read CPU from rseq cpu_id (lazy register): 0.8 ns
    - Read using gs segment selector: 0.8 ns
    - "lsl" inline assembly: 13.0 ns
    - glibc 2.19-0ubuntu6 getcpu: 16.6 ns
    - getcpu system call: 53.9 ns

    - Speed (benchmark taken on v8 of patchset)

    Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
    expectations, that enabling CONFIG_RSEQ slightly accelerates the
    scheduler:

    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
    restartable sequences series applied.

    * CONFIG_RSEQ=n

    avg.: 41.37 s
    std.dev.: 0.36 s

    * CONFIG_RSEQ=y

    avg.: 40.46 s
    std.dev.: 0.33 s

    - Size

    On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
    567 bytes, and the data size increase of vmlinux is 5696 bytes.

    [1] https://lwn.net/Articles/650333/
    [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Joel Fernandes
    Cc: Catalin Marinas
    Cc: Dave Watson
    Cc: Will Deacon
    Cc: Andi Kleen
    Cc: "H . Peter Anvin"
    Cc: Chris Lameter
    Cc: Russell King
    Cc: Andrew Hunter
    Cc: Michael Kerrisk
    Cc: "Paul E . McKenney"
    Cc: Paul Turner
    Cc: Boqun Feng
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Ben Maurer
    Cc: Alexander Viro
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
    Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
    Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
     

24 May, 2018

1 commit

  • Introduce helper:
    int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
    struct umh_info {
    struct file *pipe_to_umh;
    struct file *pipe_from_umh;
    pid_t pid;
    };

    that GPLed kernel modules (signed or unsigned) can use to execute part
    of their own data as a swappable user mode process.

    The kernel will do:
    - allocate a unique file in tmpfs
    - populate that file with [data, data + len] bytes
    - user-mode-helper code will do_execve that file and, before the process
    starts, the kernel will create two unix pipes for bidirectional
    communication between kernel module and umh
    - close tmpfs file, effectively deleting it
    - the fork_usermode_blob will return zero on success and populate
    'struct umh_info' with two unix pipes and the pid of the user process

    As the first step in the development of the bpfilter project
    the fork_usermode_blob() helper is introduced to allow user mode code
    to be invoked from a kernel module. The idea is that user mode code plus
    normal kernel module code are built as part of the kernel build
    and installed as traditional kernel module into distro specified location,
    such that from a distribution point of view, there is
    no difference between regular kernel modules and kernel modules + umh code.
    Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
    by a kernel module doesn't make it any special from kernel and user space
    tooling point of view.

    Such approach enables kernel to delegate functionality traditionally done
    by the kernel modules into the user space processes (either root or !root) and
    reduces security attack surface of the new code. The buggy umh code would crash
    the user process, but not the kernel. Another advantage is that umh code
    of the kernel module can be debugged and tested out of user space
    (e.g. opening the possibility to run clang sanitizers, fuzzers or
    user space test suites on the umh code).
    In case of the bpfilter project such architecture allows complex control plane
    to be done in the user space while bpf based data plane stays in the kernel.

    Since umh can crash, can be oom-ed by the kernel, killed by the admin,
    the kernel module that uses them (like bpfilter) needs to manage life
    time of umh on its own via two unix pipes and the pid of umh.

    The exit code of such a kernel module should kill the umh it started,
    so that rmmod of the kernel module will clean up the corresponding umh.
    Just like if the kernel module does kmalloc() it should kfree() it
    in the exit code.
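
    A hypothetical module-side usage (the blob symbol names are
    illustrative; the umh program is linked into the module as a byte
    array):

    extern char my_umh_start[], my_umh_end[];   /* illustrative blob symbols */
    static struct umh_info info;

    static int __init my_mod_init(void)
    {
            /* Launch the embedded user mode helper; on success 'info'
             * holds the two pipes and the helper's pid. */
            return fork_usermode_blob(my_umh_start,
                                      my_umh_end - my_umh_start, &info);
    }

    static void __exit my_mod_exit(void)
    {
            /* The module owns the helper's lifetime: signal it via
             * info.pid and close the pipes here. */
    }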

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

12 Apr, 2018

3 commits

  • Since the stack rlimit is used in multiple places during exec and it can
    be changed via other threads (via setrlimit()) or processes (via
    prlimit()), the assumption that the value doesn't change cannot be made.
    This leads to races with mm layout selection and argument size
    calculations. This changes the exec path to use the rlimit stored in
    bprm instead of in current. Before starting the thread, the bprm stack
    rlimit is stored back to current.
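
    The shape of the fix, per the description (locking elided):

    /* Early in exec: snapshot the limit into the bprm... */
    bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];

    /* ...exec-time code reads bprm->rlim_stack instead of current, and
     * just before start_thread() the value is written back: */
    current->signal->rlim[RLIMIT_STACK] = bprm->rlim_stack;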

    Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
    Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
    Signed-off-by: Kees Cook
    Reported-by: Ben Hutchings
    Reported-by: Andy Lutomirski
    Reported-by: Brad Spengler
    Acked-by: Michal Hocko
    Cc: Ben Hutchings
    Cc: Greg KH
    Cc: Hugh Dickins
    Cc: "Jason A. Donenfeld"
    Cc: Laura Abbott
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Provide a final callback into fs/exec.c before start_thread() takes
    over, to handle any last-minute changes, like the coming restoration of
    the stack limit.
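
    The hook starts out empty (sketch); the stack-rlimit patch described
    above then fills it in:

    /* fs/exec.c: called by binfmt handlers (e.g. the ELF loader) as the
     * last step before start_thread(). */
    void finalize_exec(struct linux_binprm *bprm)
    {
            /* empty for now; the rlimit patch adds:
             * current->signal->rlim[RLIMIT_STACK] = bprm->rlim_stack; */
    }
    EXPORT_SYMBOL(finalize_exec);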

    Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Cc: Greg KH
    Cc: Hugh Dickins
    Cc: "Jason A. Donenfeld"
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "exec: Pin stack limit during exec".

    Attempts to solve problems with the stack limit changing during exec
    continue to be frustrated[1][2]. In addition to the specific issues
    around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
    other places during exec where the stack limit is used and is assumed to
    be unchanging. Given the many places it gets used and the fact that it
    can be manipulated/raced via setrlimit() and prlimit(), I think the only
    way to handle this is to move away from the "current" view of the stack
    limit and instead attach it to the bprm, and plumb this down into the
    functions that need to know the stack limits. This series implements
    the approach.

    [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
    [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
    [3] to security@kernel.org, "Subject: existing rlimit races?"

    This patch (of 3):

    Since it is possible that the stack rlimit can change externally during
    exec (either via another thread calling setrlimit() or another process
    calling prlimit()), provide a way to pass the rlimit down into the
    per-architecture mm layout functions so that the rlimit can stay in the
    bprm structure instead of sitting in the signal structure until exec is
    finalized.

    Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Michal Hocko
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: "Jason A. Donenfeld"
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Greg KH
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

19 Mar, 2018

1 commit

  • The LSM check should happen after the file has been confirmed to be
    unchanging. Without this, we could have a race between the Time of Check
    (the call to security_kernel_read_file() which could read the file and
    make access policy decisions) and the Time of Use (starting with
    kernel_read_file()'s reading of the file contents). In theory, file
    contents could change between the two.
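
    A sketch of the reordered prologue of kernel_read_file():

    /* Lock out writers first, so the contents the LSM inspects are the
     * contents the kernel will read (closing the TOCTOU window): */
    ret = deny_write_access(file);
    if (ret)
            return ret;

    ret = security_kernel_read_file(file, id);
    if (ret)
            goto out;       /* drops the write denial */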

    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Signed-off-by: James Morris

    Kees Cook
     

04 Jan, 2018

1 commit

  • This is a logical revert of commit e37fdb785a5f ("exec: Use secureexec
    for setting dumpability")

    This weakens dumpability back to checking only for uid/gid changes in
    current (which is useless), but userspace depends on dumpability not
    being tied to secureexec.

    https://bugzilla.redhat.com/show_bug.cgi?id=1528633

    Reported-by: Tom Horsley
    Fixes: e37fdb785a5f ("exec: Use secureexec for setting dumpability")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook