01 Oct, 2020

1 commit

  • [ Upstream commit eea9673250db4e854e9998ef9da6d4584857f0ea ]

    The cred_guard_mutex is problematic as it is held over possibly
    indefinite waits for userspace. The possible indefinite waits for
    userspace that I have identified are:

    - The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the
    tracer.

    - The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)"
    in exit_mm().

    - The cred_guard_mutex is held over "get_user(futex_offset, ...)" in
    exit_robust_list().

    - The cred_guard_mutex is held over copy_strings().

    The functions get_user and put_user can trigger a page fault which can
    potentially wait indefinitely in the case of userfaultfd or if
    userspace implements part of the page fault path.

    In any of those cases the userspace process that the kernel is waiting
    for might make a different system call that winds up taking the
    cred_guard_mutex and result in deadlock.

    Holding a mutex over any of those possibly indefinite waits for
    userspace does not appear necessary. Add exec_update_mutex that will
    just cover updating the process during exec where the permissions and
    the objects pointed to by the task struct may be out of sync.

    The plan is to switch the users of cred_guard_mutex to
    exec_update_mutex one by one. This lets us move forward while still
    being careful and not introducing any regressions.
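
    A sketch of the shape of the change (abbreviated from the patch;
    initialization and the unlock paths are elided):

        /* include/linux/sched/signal.h: a second, narrower mutex
           next to cred_guard_mutex */
        struct mutex exec_update_mutex;

        /* fs/exec.c, exec_mmap(): held only while the task's mm,
           credentials and related objects are being swapped */
        ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
        if (ret)
                return ret;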

    Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
    Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
    Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
    Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
    Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
    Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
    Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
    Reviewed-by: Kirill Tkhai
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     

20 May, 2020

1 commit

  • commit f87d1c9559164294040e58f5e3b74a162bf7c6e8 upstream.

    I goofed when I added mm->user_ns support to would_dump. I missed the
    fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
    binfmt_script bprm->file is reassigned. Which made the move of
    would_dump from setup_new_exec to __do_execve_file before exec_binprm
    incorrect as it can result in would_dump running on the script instead
    of the interpreter of the script.

    The net result is that the code stopped making unreadable interpreters
    undumpable. Which allows them to be ptraced and written to disk
    without special permissions. Oops.

    The move was necessary because the call in setup_new_exec was after
    bprm->mm was no longer valid.

    To correct this mistake move the misplaced would_dump from
    __do_execve_file into flush_old_exec, before exec_mmap is called.

    I tested and confirmed that without this fix I can attach with gdb to
    a script with an unreadable interpreter, and with this fix I can not.
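
    The shape of the fix (a sketch of the upstream hunk; the misplaced
    call in __do_execve_file is simply deleted):

        /* fs/exec.c, flush_old_exec(): bprm->file now points at the
           final interpreter, and bprm->mm is still valid here */
        would_dump(bprm, bprm->file);

        retval = exec_mmap(bprm->mm);
        if (retval)
                goto out;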

    Cc: stable@vger.kernel.org
    Fixes: f84df2a6f268 ("exec: Ensure mm->user_ns contains the execed files")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

17 Apr, 2020

1 commit

  • commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream.

    Replace the 32bit exec_id with a 64bit exec_id to make it impossible
    to wrap the exec_id counter. With care an attacker can cause exec_id
    wrap and send arbitrary signals to a newly exec'd parent. This
    bypasses the signal sending checks if the parent changes their
    credentials during exec.

    The severity of this problem can be seen in that, in my limited
    testing of a 32bit exec_id, it can take as little as 19s to exec 65536
    times. Which means that it can take as little as 14 days to wrap a
    32bit exec_id. Adam Zabrocki has succeeded in wrapping the self_exe_id
    in 7 days. Even my slower timing is within the uptime of a typical
    server. Which means self_exec_id is simply a speed bump today, and if
    exec gets noticeably faster self_exec_id won't even be a speed bump.

    Extending self_exec_id to 64bits introduces a problem on 32bit
    architectures where reading self_exec_id is no longer atomic and can
    take two read instructions. Which means that it is possible to hit
    a window where the read value of exec_id does not match the written
    value. So with very lucky timing, after this change this still
    remains exploitable.

    I have updated the update of exec_id on exec to use WRITE_ONCE
    and the read of exec_id in do_notify_parent to use READ_ONCE
    to make it clear that there is no locking between these two
    locations.
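
    In sketch form (the counter is now u64; the surrounding conditions
    in do_notify_parent() are abbreviated):

        /* fs/exec.c, on exec: the only writer of self_exec_id */
        WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);

        /* kernel/signal.c, do_notify_parent(): lockless reader */
        if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
                sig = SIGCHLD;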

    Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
    Fixes: 2.3.23pre2
    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

29 Nov, 2019

1 commit

  • commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream.

    mm_release() contains the futex exit handling. mm_release() is called from
    do_exit()->exit_mm() and from exec()->exec_mm().

    In the exit_mm() case PF_EXITING and the futex state is updated. In the
    exec_mm() case these states are not touched.

    As the futex exit code needs further protections against exit races, this
    needs to be split into two functions.

    Preparatory only, no functional change.
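
    The resulting split looks roughly like this (a sketch; the
    futex_exit_release()/futex_exec_release() bodies are filled in by
    the later patches of this series):

        static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
        {
                /* common teardown: uprobes, clear_child_tid, vfork done */
        }

        void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm)
        {
                futex_exit_release(tsk);        /* exit: PF_EXITING side */
                mm_release(tsk, mm);
        }

        void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm)
        {
                futex_exec_release(tsk);        /* exec: no exit state */
                mm_release(tsk, mm);
        }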

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Sep, 2019

1 commit

  • The membarrier_state field is located within the mm_struct, which
    is not guaranteed to exist when used from runqueue-lock-free iteration
    on runqueues by the membarrier system call.

    Copy the membarrier_state from the mm_struct into the scheduler runqueue
    when the scheduler switches between mm.

    When registering membarrier for mm, after setting the registration bit
    in the mm membarrier state, issue a synchronize_rcu() to ensure the
    scheduler observes the change. In order to take care of the case
    where a runqueue keeps executing the target mm without switching to
    another mm, iterate over each runqueue and issue an IPI to copy the
    membarrier_state from the mm_struct into each runqueue whose current
    mm is the one whose state has just been modified.

    Move the mm membarrier_state field closer to pgd in mm_struct to use
    a cache line already touched by the scheduler switch_mm.

    The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
    clear the runqueue's membarrier state in addition to clearing the mm
    membarrier state, so move its implementation into the scheduler
    membarrier code so it can access the runqueue structure.

    Add memory barrier in membarrier_exec_mmap() prior to clearing
    the membarrier state, ensuring memory accesses executed prior to exec
    are not reordered with the stores clearing the membarrier state.

    As suggested by Linus, move all membarrier.c RCU read-side locks outside
    of the for each cpu loops.
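
    The runqueue-side copy is a small helper on the context-switch path
    (sketch of the code added to kernel/sched/sched.h):

        static inline void membarrier_switch_mm(struct rq *rq,
                                                struct mm_struct *prev_mm,
                                                struct mm_struct *next_mm)
        {
                int membarrier_state;

                if (prev_mm == next_mm)
                        return;

                membarrier_state = atomic_read(&next_mm->membarrier_state);
                if (READ_ONCE(rq->membarrier_state) == membarrier_state)
                        return;

                WRITE_ONCE(rq->membarrier_state, membarrier_state);
        }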

    Suggested-by: Linus Torvalds
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     

25 Jul, 2019

1 commit

  • When going through execve(), zero out the NUMA fault statistics instead of
    freeing them.

    During execve, the task is reachable through procfs and the scheduler. A
    concurrent /proc/*/sched reader can read data from a freed ->numa_faults
    allocation (confirmed by KASAN) and write it back to userspace.
    I believe that it would also be possible for a use-after-free read to occur
    through a race between a NUMA fault and execve(): task_numa_fault() can
    lead to task_numa_compare(), which invokes task_weight() on the currently
    running task of a different CPU.

    Another way to fix this would be to make ->numa_faults RCU-managed or add
    extra locking, but it seems easier to wipe the NUMA fault statistics on
    execve.
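
    Roughly the shape of the fix (a sketch: task_numa_free() gains a
    'final' flag, true only from the exit path; the numa_group accounting
    and the exact array bound are abbreviated):

        void task_numa_free(struct task_struct *p, bool final)
        {
                unsigned long *numa_faults = p->numa_faults;
                int i;

                if (!numa_faults)
                        return;

                /* numa_group accounting drops out here */

                if (final) {
                        /* exit: really free the allocation */
                        p->numa_faults = NULL;
                        kfree(numa_faults);
                } else {
                        /* exec: keep the allocation, wipe the contents */
                        p->total_numa_faults = 0;
                        for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
                                numa_faults[i] = 0;
                }
        }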

    Signed-off-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()")
    Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
    Signed-off-by: Ingo Molnar

    Jann Horn
     

09 Jul, 2019

1 commit

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because
    force_sig takes a task parameter, the function has been abused for
    sending other kinds of signals over the years. Slowly those
    have been fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"
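
    The signature change at the core of the series, in sketch form:

        /* before: a task parameter that was only safe when p == current */
        void force_sig(int sig, struct task_struct *p);

        /* after: can only ever act on current */
        void force_sig(int sig);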

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
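
    Concretely, each affected file now opens with a tag of this form
    (comment style varies with the file type):

        // SPDX-License-Identifier: GPL-2.0-only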

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 Mar, 2019

2 commits

  • Large enterprise clients often run applications out of networked file
    systems where the IT mandated layout of project volumes can end up
    leading to paths that are longer than 128 characters. Bumping this up
    to the next order of two solves this problem in all but the most
    egregious case while still fitting into a 512b slab.
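
    The change itself is a one-liner in include/uapi/linux/binfmts.h
    (sketch):

        /* sizeof(linux_binprm->buf): the #! interpreter line must fit */
        -#define BINPRM_BUF_SIZE 128
        +#define BINPRM_BUF_SIZE 256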

    [oleg@redhat.com: update comment, per Kees]
    Link: http://lkml.kernel.org/r/20181112160956.GA28472@redhat.com
    Signed-off-by: Oleg Nesterov
    Reported-by: Ben Woodard
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Kees Cook
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Link: http://lkml.kernel.org/r/1548275584-18096-2-git-send-email-vgupta@synopsys.com
    Link: http://lkml.kernel.org/g/20150807115710.GA16897@redhat.com
    Signed-off-by: Vineet Gupta
    Reviewed-by: Anthony Yznaga
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Peter Zijlstra (Intel)
    Cc: Chris Wilson
    Cc: Ingo Molnar
    Cc: Jani Nikula
    Cc: Miklos Szeredi
    Cc: Theodore Ts'o
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     

07 Mar, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - refcount conversions

    - Solve the rq->leaf_cfs_rq_list can of worms for real.

    - improve power-aware scheduling

    - add sysctl knob for Energy Aware Scheduling

    - documentation updates

    - misc other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    kthread: Do not use TIMER_IRQSAFE
    kthread: Convert worker lock to raw spinlock
    sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
    sched/fair: Remove unused 'sd' parameter from select_idle_smt()
    sched/wait: Use freezable_schedule() when possible
    sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
    sched/fair: Explain LLC nohz kick condition
    sched/fair: Simplify nohz_balancer_kick()
    sched/topology: Fix percpu data types in struct sd_data & struct s_data
    sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
    sched/fair: Fix O(nr_cgroups) in the load balancing path
    sched/fair: Optimize update_blocked_averages()
    sched/fair: Fix insertion in rq->leaf_cfs_rq_list
    sched/fair: Add tmp_alone_branch assertion
    sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
    sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
    sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
    sched/fair: Update scale invariance of PELT
    sched/fair: Move the rq_of() helper function
    sched/core: Convert task_struct.stack_refcount to refcount_t
    ...

    Linus Torvalds
     

19 Feb, 2019

1 commit

  • syzkaller reports this:
    BUG: memory leak
    unreferenced object 0xffffc9000488d000 (size 9195520):
    comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
    hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00 ................
    02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff ..........z.....
    backtrace:
    [] __vmalloc_node mm/vmalloc.c:1795 [inline]
    [] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
    [] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
    [] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
    [] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
    [] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
    [] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    It should goto the 'out_free' label to free the allocated buf when
    kernel_read fails.
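
    The fix, in sketch form (the kernel_read() failure path in
    fs/exec.c:kernel_read_file()):

        bytes = kernel_read(file, *buf + pos, i_size - pos, &pos);
        if (bytes < 0) {
                ret = bytes;
                goto out_free;  /* was 'goto out', leaking the buffer */
        }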

    Fixes: 39d637af5aa7 ("vfs: forbid write access when reading a file into memory")
    Signed-off-by: YueHaibing
    Signed-off-by: Al Viro

    YueHaibing
     

04 Feb, 2019

1 commit

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable sighand_struct.count is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the sighand_struct.count it might make a difference
    in following places:

    - __cleanup_sighand: decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart
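
    In sketch form, the conversion is mechanical:

        -       atomic_t                count;  /* struct sighand_struct */
        +       refcount_t              count;

        -       atomic_set(&sig->count, 1);
        +       refcount_set(&sig->count, 1);

        -       if (atomic_dec_and_test(&sighand->count))
        +       if (refcount_dec_and_test(&sighand->count))
                        kmem_cache_free(sighand_cachep, sighand);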

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     

05 Jan, 2019

2 commits

  • This is already done for us internally by the signal machinery.

    [akpm@linux-foundation.org: fix fs/buffer.c]
    Link: http://lkml.kernel.org/r/20181116002713.8474-7-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • get_arg_page() checks bprm->rlim_stack.rlim_cur and re-calculates the
    "extra" size for argv/envp pointers every time; this is a bit ugly and
    not even strictly correct: acct_arg_size() must not account this size.

    Remove all the rlimit code in get_arg_page(). Instead, add bprm->argmin
    calculated once at the start of __do_execve_file() and change
    copy_strings to check bprm->p >= bprm->argmin.

    The patch adds the new helper, prepare_arg_pages() which initializes
    bprm->argc/envc and bprm->argmin.
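
    In sketch form (abbreviated; the E2BIG check and the !CONFIG_MMU
    variant are elided):

        /* prepare_arg_pages(): compute the floor once */
        limit = min_t(unsigned long, _STK_LIM / 4 * 3,
                      bprm->rlim_stack.rlim_cur / 4);
        /* leave room for the argv/envp pointer arrays themselves */
        limit -= (bprm->argc + bprm->envc) * sizeof(void *);
        bprm->argmin = bprm->p - limit;

        /* copy_strings(): the only check left on the copy path */
        if (bprm->p < bprm->argmin)
                goto out;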

    [oleg@redhat.com: fix !CONFIG_MMU version of get_arg_page()]
    Link: http://lkml.kernel.org/r/20181126122307.GA1660@redhat.com
    [akpm@linux-foundation.org: use max_t]
    Link: http://lkml.kernel.org/r/20181112160910.GA28440@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Tested-by: Guenter Roeck
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

04 Dec, 2018

1 commit

  • Revert commit c22397888f1e "exec: make de_thread() freezable" as
    requested by Ingo Molnar:

    "So there's a new regression in v4.20-rc4, my desktop produces this
    lockdep splat:

    [ 1772.588771] WARNING: pkexec/4633 still has locks held!
    [ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
    [ 1772.588775] ------------------------------------
    [ 1772.588776] 1 lock held by pkexec/4633:
    [ 1772.588778] #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
    [ 1772.588786] stack backtrace:
    [ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
    [ 1772.588792] Call Trace:
    [ 1772.588800] dump_stack+0x85/0xcb
    [ 1772.588803] flush_old_exec+0x116/0x890
    [ 1772.588807] ? load_elf_phdrs+0x72/0xb0
    [ 1772.588809] load_elf_binary+0x291/0x1620
    [ 1772.588815] ? sched_clock+0x5/0x10
    [ 1772.588817] ? search_binary_handler+0x6d/0x240
    [ 1772.588820] search_binary_handler+0x80/0x240
    [ 1772.588823] load_script+0x201/0x220
    [ 1772.588825] search_binary_handler+0x80/0x240
    [ 1772.588828] __do_execve_file.isra.32+0x7d2/0xa60
    [ 1772.588832] ? strncpy_from_user+0x40/0x180
    [ 1772.588835] __x64_sys_execve+0x34/0x40
    [ 1772.588838] do_syscall_64+0x60/0x1c0

    The warning gets triggered by an ancient lockdep check in the freezer:

    (gdb) list *0xffffffff812ece06
    0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
    52 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
    53 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
    54 */
    55 static inline bool try_to_freeze_unsafe(void)
    56 {
    57 might_sleep();
    58 if (likely(!freezing(current)))
    59 return false;
    60 return __refrigerator(false);
    61 }

    I reviewed the ->cred_guard_mutex code, and the mutex is held across all
    of exec() - and we always did this.

    But there's this recent -rc4 commit:

    > Chanho Min (1):
    > exec: make de_thread() freezable

    c22397888f1e: exec: make de_thread() freezable

    I believe this commit is bogus, you cannot call try_to_freeze() from
    de_thread(), because it's holding the ->cred_guard_mutex."

    Reported-by: Ingo Molnar
    Tested-by: Ingo Molnar
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

19 Nov, 2018

1 commit

  • Suspend fails due to the exec family of functions blocking the freezer.
    The cause is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for
    all sub-threads to die, and we have a deadlock if one of them is frozen.
    This can also occur with the schedule() waiting for the group thread leader
    to exit if it is frozen.

    On our machine, it causes a freeze timeout as below.

    Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0):
    setcpushares-ls D ffffffc00008ed70 0 5817 1483 0x0040000d
    Call trace:
    [] __switch_to+0x88/0xa0
    [] __schedule+0x1bc/0x720
    [] schedule+0x40/0xa8
    [] flush_old_exec+0xdc/0x640
    [] load_elf_binary+0x2a8/0x1090
    [] search_binary_handler+0x9c/0x240
    [] load_script+0x20c/0x228
    [] search_binary_handler+0x9c/0x240
    [] do_execveat_common.isra.14+0x4f8/0x6e8
    [] compat_SyS_execve+0x38/0x48
    [] el0_svc_naked+0x24/0x28

    To fix this, make de_thread() freezable. It looks safe and works fine.
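
    The change itself is tiny (sketch): both wait loops in de_thread()
    swap a bare schedule() for its freezer-aware variant.

        -               schedule();
        +               freezable_schedule();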

    Suggested-by: Oleg Nesterov
    Signed-off-by: Chanho Min
    Acked-by: Oleg Nesterov
    Acked-by: Pavel Machek
    Acked-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Chanho Min
     

11 Oct, 2018

1 commit

  • On 32-bit systems, the buffer allocated by kernel_read_file() is too
    small if the file size is > SIZE_MAX, due to truncation to size_t.

    Fortunately, since the 'count' argument to kernel_read() is also
    truncated to size_t, only the allocated space is filled; then, -EIO is
    returned since 'pos != i_size' after the read loop.

    But this is not obvious and seems incidental. We should be more
    explicit about this case. So, fail early if i_size > SIZE_MAX.
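
    A sketch of the early check (as described above; surrounding context
    abbreviated):

        i_size = i_size_read(file_inode(file));
        if (i_size <= 0) {
                ret = -EINVAL;
                goto out;
        }
        if (i_size > SIZE_MAX || (max_size > 0 && i_size > max_size)) {
                ret = -EFBIG;
                goto out;
        }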

    Signed-off-by: Eric Biggers
    Signed-off-by: Mimi Zohar

    Eric Biggers
     

22 Aug, 2018

1 commit

  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead
    something only for very special cases. The part starts using
    PIDTYPE_TGID enough so that in __send_signal where signals are
    actually delivered we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

27 Jul, 2018

1 commit

  • vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
    VMA. This is unreliable as ->mmap may not set ->vm_ops.

    False-positive vma_is_anonymous() may lead to crashes:

    next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
    prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
    pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
    flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
    ------------[ cut here ]------------
    kernel BUG at mm/memory.c:1422!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
    01/01/2011
    RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
    RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
    RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
    RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
    Call Trace:
    unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
    zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
    unmap_mapping_range_vma mm/memory.c:2792 [inline]
    unmap_mapping_range_tree mm/memory.c:2813 [inline]
    unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
    unmap_mapping_range+0x48/0x60 mm/memory.c:2880
    truncate_pagecache+0x54/0x90 mm/truncate.c:800
    truncate_setsize+0x70/0xb0 mm/truncate.c:826
    simple_setattr+0xe9/0x110 fs/libfs.c:409
    notify_change+0xf13/0x10f0 fs/attr.c:335
    do_truncate+0x1ac/0x2b0 fs/open.c:63
    do_sys_ftruncate+0x492/0x560 fs/open.c:205
    __do_sys_ftruncate fs/open.c:215 [inline]
    __se_sys_ftruncate fs/open.c:213 [inline]
    __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reproducer:

    [The reproducer is a kcov-based C program; its #include lines and
    main() body did not survive formatting, and only the ioctl
    definitions below are recoverable. See the upstream commit for the
    complete source.]

    #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
    #define KCOV_ENABLE _IO('c', 100)
    #define KCOV_DISABLE _IO('c', 101)

    This can be fixed by assigning anonymous VMAs own vm_ops and not relying
    on it being NULL.

    If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
    dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.
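
    In sketch form, the invariant described above:

        /* include/linux/mm.h: anonymous VMAs are marked explicitly */
        static inline void vma_set_anonymous(struct vm_area_struct *vma)
        {
                vma->vm_ops = NULL;
        }

        /* mm/mmap.c: a file mapping may never keep a NULL ->vm_ops */
        static const struct vm_operations_struct dummy_vm_ops = {};

        /* in mmap_region(), after call_mmap(file, vma): */
        if (!vma->vm_ops)
                vma->vm_ops = &dummy_vm_ops;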

    Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jul, 2018

2 commits

  • Like vm_area_dup(), it initializes the anon_vma_chain head, and the
    basic mm pointer.

    The rest of the fields end up being different for different users,
    although the plan is to also initialize the 'vm_ops' field to a dummy
    entry.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The vm_area_struct is one of the most fundamental memory management
    objects, but the management of it is entirely open-coded everywhere,
    ranging from allocation and freeing (using kmem_cache_[z]alloc and
    kmem_cache_free) to initializing all the fields.

    We want to unify this in order to end up having some unified
    initialization of the vmas, and the first step to this is to at least
    have basic allocation functions.

    Right now those functions are literally just wrappers around the
    kmem_cache_*() calls. This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

    to the point where the old vma passed in to the vm_area_dup() function
    isn't even used yet (because I've left all the old manual initialization
    alone).
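
    At this stage the wrappers are literal (sketch; later patches fold
    the initialization in):

        struct vm_area_struct *vm_area_alloc(void)
        {
                return kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
        }

        struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
        {
                return kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
        }

        void vm_area_free(struct vm_area_struct *vma)
        {
                kmem_cache_free(vm_area_cachep, vma);
        }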

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jul, 2018

1 commit

  • Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.
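
    The enumeration afterwards (sketch):

        enum pid_type {
                PIDTYPE_PID,
                PIDTYPE_TGID,   /* new first-class member */
                PIDTYPE_PGID,
                PIDTYPE_SID,
                PIDTYPE_MAX,
        };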

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

11 Jun, 2018

1 commit

  • Pull restartable sequence support from Thomas Gleixner:
    "The restartable sequences syscall (finally):

    After a lot of back and forth discussion and massive delays caused by
    the speculative distraction of maintainers, the core set of
    restartable sequences has finally reached a consensus.

    It comes with the basic non-disputed core implementation along with
    support for arm, powerpc and x86, and a full set of selftests.

    It was exposed to linux-next earlier this week, so it does not fully
    comply with the merge window requirements, but there is really no
    point to drag it out for yet another cycle"

    * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rseq/selftests: Provide Makefile, scripts, gitignore
    rseq/selftests: Provide parametrized tests
    rseq/selftests: Provide basic percpu ops test
    rseq/selftests: Provide basic test
    rseq/selftests: Provide rseq library
    selftests/lib.mk: Introduce OVERRIDE_TARGETS
    powerpc: Wire up restartable sequences system call
    powerpc: Add syscall detection for restartable sequences
    powerpc: Add support for restartable sequences
    x86: Wire up restartable sequence system call
    x86: Add support for restartable sequences
    arm: Wire up restartable sequences system call
    arm: Add syscall detection for restartable sequences
    arm: Add restartable sequences support
    rseq: Introduce restartable sequences system call
    uapi/headers: Provide types_32_64.h

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • Expose a new system call allowing each thread to register one userspace
    memory area to be used as an ABI between kernel and user-space for two
    purposes: user-space restartable sequences and quick access to read the
    current CPU number value from user-space.

    * Restartable sequences (per-cpu atomics)

    Restartable sequences allow user-space to perform update operations on
    per-cpu data without requiring heavy-weight atomic operations.

    The restartable critical sections (percpu atomics) work has been started
    by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
    critical sections. [1] [2] The re-implementation proposed here brings a
    few simplifications to the ABI which facilitates porting to other
    architectures and speeds up the user-space fast path.

    Here are benchmarks of various rseq use-cases.

    Test hardware:

    arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
    x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

    The following benchmarks were all performed on a single thread.

    * Per-CPU statistic counter increment

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 344.0 31.4 11.0
    x86-64: 15.3 2.0 7.7

    * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
    per-cpu buffer

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 2502.0 2250.0 1.1
    x86-64: 117.4 98.0 1.2

    * liburcu percpu: lock-unlock pair, dereference, read/compare word

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 751.0 128.5 5.8
    x86-64: 53.4 28.6 1.9

    * jemalloc memory allocator adapted to use rseq

    Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
    rseq 2016 implementation):

    The production workload response-time has 1-2% gain avg. latency, and
    the P99 overall latency drops by 2-3%.

    * Reading the current CPU number

    Speeding up reading the current CPU number on which the caller thread is
    running is done by keeping the current CPU number up to date within the
    cpu_id field of the memory area registered by the thread. This is done
    by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
    current thread. Upon return to user-space, a notify-resume handler
    updates the current CPU value within the registered user-space memory
    area. User-space can then read the current CPU number directly from
    memory.
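
    A user-space sketch of that fast path (assuming kernel headers that
    ship the rseq uapi; the RSEQ_SIG value is the application's choice):

        #include <linux/rseq.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #define RSEQ_SIG 0x53053053     /* must match on unregister */

        static __thread volatile struct rseq rseq_area
                        __attribute__((aligned(32)));

        static int rseq_register(void)  /* once per thread */
        {
                return syscall(__NR_rseq, &rseq_area, sizeof(rseq_area),
                               0, RSEQ_SIG);
        }

        static int rseq_current_cpu(void)
        {
                return rseq_area.cpu_id;  /* plain load; kernel keeps it fresh */
        }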

    Keeping the current cpu id in a memory area shared between kernel and
    user-space is an improvement over current mechanisms available to read
    the current CPU number, which has the following benefits over
    alternative approaches:

    - 35x speedup on ARM vs system call through glibc
    - 20x speedup on x86 compared to calling glibc, which calls vdso
    executing a "lsl" instruction,
    - 14x speedup on x86 compared to inlined "lsl" instruction,
    - Unlike vdso approaches, this cpu_id value can be read from an inline
    assembly, which makes it a useful building block for restartable
    sequences.
    - The approach of reading the cpu id through memory mapping shared
    between kernel and user-space is portable (e.g. ARM), which is not the
    case for the lsl-based x86 vdso.

    On x86, yet another possible approach would be to use the gs segment
    selector to point to user-space per-cpu data. This approach performs
    similarly to the cpu id cache, but it has two disadvantages: it is
    not portable, and it is incompatible with existing applications already
    using the gs segment selector for other purposes.

    Benchmarking various approaches for reading the current CPU number:

    ARMv7 Processor rev 4 (v7l)
    Machine model: Cubietruck
    - Baseline (empty loop): 8.4 ns
    - Read CPU from rseq cpu_id: 16.7 ns
    - Read CPU from rseq cpu_id (lazy register): 19.8 ns
    - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
    - getcpu system call: 234.9 ns

    x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
    - Baseline (empty loop): 0.8 ns
    - Read CPU from rseq cpu_id: 0.8 ns
    - Read CPU from rseq cpu_id (lazy register): 0.8 ns
    - Read using gs segment selector: 0.8 ns
    - "lsl" inline assembly: 13.0 ns
    - glibc 2.19-0ubuntu6 getcpu: 16.6 ns
    - getcpu system call: 53.9 ns

    - Speed (benchmark taken on v8 of patchset)

    Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
    expectations, that enabling CONFIG_RSEQ slightly accelerates the
    scheduler:

    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
    restartable sequences series applied.

    * CONFIG_RSEQ=n

    avg.: 41.37 s
    std.dev.: 0.36 s

    * CONFIG_RSEQ=y

    avg.: 40.46 s
    std.dev.: 0.33 s

    - Size

    On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
    567 bytes, and the data size increase of vmlinux is 5696 bytes.

    [1] https://lwn.net/Articles/650333/
    [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Joel Fernandes
    Cc: Catalin Marinas
    Cc: Dave Watson
    Cc: Will Deacon
    Cc: Andi Kleen
    Cc: "H . Peter Anvin"
    Cc: Chris Lameter
    Cc: Russell King
    Cc: Andrew Hunter
    Cc: Michael Kerrisk
    Cc: "Paul E . McKenney"
    Cc: Paul Turner
    Cc: Boqun Feng
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Ben Maurer
    Cc: Alexander Viro
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
    Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
    Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
     

24 May, 2018

1 commit

  • Introduce helper:
    int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
    struct umh_info {
    struct file *pipe_to_umh;
    struct file *pipe_from_umh;
    pid_t pid;
    };

    that GPLed kernel modules (signed or unsigned) can use to execute part
    of their own data as a swappable user mode process.

    The kernel will do:
    - allocate a unique file in tmpfs
    - populate that file with [data, data + len] bytes
    - user-mode-helper code will do_execve that file and, before the process
    starts, the kernel will create two unix pipes for bidirectional
    communication between kernel module and umh
    - close tmpfs file, effectively deleting it
    - the fork_usermode_blob will return zero on success and populate
    'struct umh_info' with two unix pipes and the pid of the user process
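
    A sketch of how a module might call it (the blob bounds here are
    hypothetical symbols; real users link the umh image into the module):

        extern char my_umh_start[], my_umh_end[];  /* hypothetical */
        static struct umh_info info;

        static int __init my_module_init(void)
        {
                return fork_usermode_blob(my_umh_start,
                                          my_umh_end - my_umh_start,
                                          &info);
        }
        /* module exit must kill info.pid and close the two pipes */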

    As the first step in the development of the bpfilter project
    the fork_usermode_blob() helper is introduced to allow user mode code
    to be invoked from a kernel module. The idea is that user mode code plus
    normal kernel module code are built as part of the kernel build
    and installed as traditional kernel module into distro specified location,
    such that from a distribution point of view, there is
    no difference between regular kernel modules and kernel modules + umh code.
    Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
    by a kernel module doesn't make it any special from kernel and user space
    tooling point of view.

    Such approach enables kernel to delegate functionality traditionally done
    by the kernel modules into the user space processes (either root or !root) and
    reduces security attack surface of the new code. The buggy umh code would crash
    the user process, but not the kernel. Another advantage is that umh code
    of the kernel module can be debugged and tested out of user space
    (e.g. opening the possibility to run clang sanitizers, fuzzers or
    user space test suites on the umh code).
    In case of the bpfilter project such architecture allows complex control plane
    to be done in the user space while bpf based data plane stays in the kernel.

    Since umh can crash, can be oom-ed by the kernel, killed by the admin,
    the kernel module that uses them (like bpfilter) needs to manage life
    time of umh on its own via two unix pipes and the pid of umh.

    The module exit code of such a kernel module should kill the umh it
    started, so that rmmod of the kernel module will clean up the
    corresponding umh.
    Just like if the kernel module does kmalloc() it should kfree() it
    in the exit code.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

12 Apr, 2018

3 commits

  • Since the stack rlimit is used in multiple places during exec and it can
    be changed via other threads (via setrlimit()) or processes (via
    prlimit()), the assumption that the value doesn't change cannot be made.
    This leads to races with mm layout selection and argument size
    calculations. This changes the exec path to use the rlimit stored in
    bprm instead of in current. Before starting the thread, the bprm stack
    rlimit is stored back to current.
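
    The snapshot side, in sketch form (taken once, early in exec):

        /* fs/exec.c, bprm_mm_init() */
        task_lock(current->group_leader);
        bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
        task_unlock(current->group_leader);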

    Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
    Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
    Signed-off-by: Kees Cook
    Reported-by: Ben Hutchings
    Reported-by: Andy Lutomirski
    Reported-by: Brad Spengler
    Acked-by: Michal Hocko
    Cc: Ben Hutchings
    Cc: Greg KH
    Cc: Hugh Dickins
    Cc: "Jason A. Donenfeld"
    Cc: Laura Abbott
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Provide a final callback into fs/exec.c before start_thread() takes
    over, to handle any last-minute changes, like the coming restoration of
    the stack limit.
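
    With the stack-limit pinning applied on top of this, the hook body
    ends up as (sketch), called from the binfmt loaders just before
    start_thread():

        void finalize_exec(struct linux_binprm *bprm)
        {
                /* Store any stack rlimit changes before starting thread. */
                task_lock(current->group_leader);
                current->signal->rlim[RLIMIT_STACK] = bprm->rlim_stack;
                task_unlock(current->group_leader);
        }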

    Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Cc: Greg KH
    Cc: Hugh Dickins
    Cc: "Jason A. Donenfeld"
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "exec: Pin stack limit during exec".

    Attempts to solve problems with the stack limit changing during exec
    continue to be frustrated[1][2]. In addition to the specific issues
    around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
    other places during exec where the stack limit is used and is assumed to
    be unchanging. Given the many places it gets used and the fact that it
    can be manipulated/raced via setrlimit() and prlimit(), I think the only
    way to handle this is to move away from the "current" view of the stack
    limit and instead attach it to the bprm, and plumb this down into the
    functions that need to know the stack limits. This series implements
    the approach.

    [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
    [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
    [3] to security@kernel.org, "Subject: existing rlimit races?"

    This patch (of 3):

    Since it is possible that the stack rlimit can change externally during
    exec (either via another thread calling setrlimit() or another process
    calling prlimit()), provide a way to pass the rlimit down into the
    per-architecture mm layout functions so that the rlimit can stay in the
    bprm structure instead of sitting in the signal structure until exec is
    finalized.
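
    The plumbing itself is a signature change (sketch):

        -void arch_pick_mmap_layout(struct mm_struct *mm);
        +void arch_pick_mmap_layout(struct mm_struct *mm,
        +                          struct rlimit *rlim_stack);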

    Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Michal Hocko
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: "Jason A. Donenfeld"
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Greg KH
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

19 Mar, 2018

1 commit

  • The LSM check should happen after the file has been confirmed to be
    unchanging. Without this, we could have a race between the Time of Check
    (the call to security_kernel_read_file() which could read the file and
    make access policy decisions) and the Time of Use (starting with
    kernel_read_file()'s reading of the file contents). In theory, file
    contents could change between the two.
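
    A sketch of the reordering in kernel_read_file() (error handling
    abbreviated):

        /* first: freeze the contents */
        ret = deny_write_access(file);
        if (ret)
                return ret;

        /* only then: let the LSM look at a file that can't change */
        ret = security_kernel_read_file(file, id);
        if (ret)
                goto out;       /* re-allows write access */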

    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Signed-off-by: James Morris

    Kees Cook
     

04 Jan, 2018

1 commit

  • This is a logical revert of commit e37fdb785a5f ("exec: Use secureexec
    for setting dumpability")

    This weakens dumpability back to checking only for uid/gid changes in
    current (which is useless), but userspace depends on dumpability not
    being tied to secureexec.

    https://bugzilla.redhat.com/show_bug.cgi?id=1528633

    Reported-by: Tom Horsley
    Fixes: e37fdb785a5f ("exec: Use secureexec for setting dumpability")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

18 Dec, 2017

1 commit

  • This reverts commit 04e35f4495dd560db30c25efca4eecae8ec8c375.

    SELinux runs with secureexec for all non-"noatsecure" domain transitions,
    which means lots of processes end up hitting the stack hard-limit change
    that was introduced in order to fix a race with prlimit(). That race fix
    will need to be redesigned.

    Reported-by: Laura Abbott
    Reported-by: Tomáš Trnka
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

15 Dec, 2017

1 commit

  • gcc-8 warns about using strncpy() with the source size as the limit:

    fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]

    This is indeed slightly suspicious, as it protects us from source
    arguments without NUL-termination, but does not guarantee that the
    destination is terminated.

    This keeps the strncpy() to ensure we have properly padded target
    buffer, but ensures that we use the correct length, by passing the
    actual length of the destination buffer as well as adding a build-time
    check to ensure it is exactly TASK_COMM_LEN.

    There are only 23 callsites, all of which I reviewed to ensure this is
    currently the case. We could get away with doing only the check or
    passing the right length, but it doesn't hurt to do both.
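
    The resulting helper pair, in sketch form:

        char *__get_task_comm(char *buf, size_t buf_size,
                              struct task_struct *tsk)
        {
                task_lock(tsk);
                strncpy(buf, tsk->comm, buf_size);
                task_unlock(tsk);
                return buf;
        }

        #define get_task_comm(buf, tsk) ({                      \
                BUILD_BUG_ON(sizeof(buf) != TASK_COMM_LEN);     \
                __get_task_comm(buf, sizeof(buf), tsk);         \
        })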

    Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Suggested-by: Kees Cook
    Acked-by: Kees Cook
    Acked-by: Ingo Molnar
    Cc: Alexander Viro
    Cc: Peter Zijlstra
    Cc: Serge Hallyn
    Cc: James Morris
    Cc: Aleksa Sarai
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

30 Nov, 2017

1 commit

  • While the defense-in-depth RLIMIT_STACK limit on setuid processes was
    protected against races from other threads calling setrlimit(), I missed
    protecting it against races from external processes calling prlimit().
    This adds locking around the change and makes sure that rlim_max is set
    too.
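
    A sketch of the guarded clamp on the secureexec path (placement and
    surrounding context abbreviated):

        struct rlimit *rlim_stack;

        task_lock(current->group_leader);
        rlim_stack = &current->signal->rlim[RLIMIT_STACK];
        if (rlim_stack->rlim_cur > _STK_LIM)
                rlim_stack->rlim_cur = _STK_LIM;
        if (rlim_stack->rlim_max > _STK_LIM)
                rlim_stack->rlim_max = _STK_LIM;
        task_unlock(current->group_leader);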

    Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast
    Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
    Signed-off-by: Kees Cook
    Reported-by: Ben Hutchings
    Reported-by: Brad Spengler
    Acked-by: Serge Hallyn
    Cc: James Morris
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc: Jiri Slaby
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly, instead please re-run the
    coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
     

20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if users register their intent to use this functionality early on, for
    example, before the process spawns the second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.
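
    From user-space the resulting two-step usage looks like this
    (a minimal sketch):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* once, early -- e.g. before the second thread is spawned */
        syscall(__NR_membarrier,
                MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);

        /* fast path; fails with EPERM if the process never registered */
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);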

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers