29 Nov, 2019

11 commits

  • commit 3ef240eaff36b8119ac9e2ea17cbf41179c930ba upstream.

    Oleg provided the following test case:

    int main(void)
    {
            struct sched_param sp = {};

            sp.sched_priority = 2;
            assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

            int lock = vfork();
            if (!lock) {
                    sp.sched_priority = 1;
                    assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
                    _exit(0);
            }

            syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
            return 0;
    }

    This creates an unkillable RT process spinning in futex_lock_pi() on a UP
    machine or if the process is affine to a single CPU. The reason is:

    parent                                  child

    set FIFO prio 2

    vfork()                 ->              set FIFO prio 1
      implies wait_for_child()              sched_setscheduler(...)
                                            exit()
                                            do_exit()
                                            ....
                                            mm_release()
                                              tsk->futex_state = FUTEX_STATE_EXITING;
                                              exit_futex(); (NOOP in this case)
                                              complete() --> wakes parent
    sys_futex()
      loop infinite because
      tsk->futex_state == FUTEX_STATE_EXITING

    The same problem can happen just by regular preemption as well:

    task holds futex
    ...
    do_exit()
      tsk->futex_state = FUTEX_STATE_EXITING;

    --> preemption (unrelated wakeup of some other higher prio task, e.g. timer)

    switch_to(other_task)

    return to user
    sys_futex()
      loop infinite as above

    Just for the fun of it the futex exit cleanup could trigger the wakeup
    itself before the task sets its futex state to DEAD.

    To cure this, the handling of the exiting owner is changed so:

    - A refcount is held on the task

    - The task pointer is stored in a caller visible location

    - The caller drops all locks (hash bucket, mmap_sem) and blocks
    on task::futex_exit_mutex. When the mutex is acquired then
    the exiting task has completed the cleanup and the state
    is consistent and can be reevaluated.

    This is not a pretty solution, but the only alternative would be returning
    an error code to user space, which would break the state consistency
    guarantee and open another can of problems, including regressions.
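
    Roughly, the waiter side pattern described above looks as follows
    (condensed sketch, not the literal mainline code; helper names are
    approximate):

    /*
     * Sketch: handle an owner which is in futex exit. 'exiting' was
     * stored by the attach path together with a task reference.
     */
    static void wait_for_owner_exiting(int ret, struct task_struct *exiting)
    {
            if (ret != -EBUSY)
                    return;

            /* All locks (hash bucket, mmap_sem) were dropped by the caller. */
            mutex_lock(&exiting->futex_exit_mutex);
            /*
             * Acquiring the mutex means the exiting task has completed its
             * futex exit cleanup, so the state is consistent again and the
             * caller can safely retry and reevaluate it.
             */
            mutex_unlock(&exiting->futex_exit_mutex);
            put_task_struct(exiting);
    }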

    For stable backports the preparatory commits ac31c7ff8624 .. ba31c1a48538
    are required as well, but for anything older than 5.3.y the backports are
    going to be provided when this hits mainline as the other dependencies for
    those kernels are definitely not stable material.

    Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems")
    Reported-by: Oleg Nesterov
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Cc: Stable Team
    Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit ac31c7ff8624409ba3c4901df9237a616c187a5d upstream.

    attach_to_pi_owner() returns -EAGAIN for various cases:

    - Owner task is exiting
    - Futex value has changed

    The caller drops the held locks (hash bucket, mmap_sem) and retries the
    operation. In case the owner task is exiting this can result in a live
    lock.

    As a preparatory step for separating those cases, provide a distinct return
    value (EBUSY) for the owner exiting case.

    No functional change.
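
    Illustration of the distinction (sketch, not the literal diff):

    /* attach_to_pi_owner(), owner lookup result handling */
    if (unlikely(p->futex_state == FUTEX_STATE_EXITING))
            ret = -EBUSY;   /* owner is exiting, handled separately later */
    else
            ret = -EAGAIN;  /* transient condition, drop locks and retry  */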

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 3f186d974826847a07bc7964d79ec4eded475ad9 upstream.

    The mutex will be used in subsequent changes to replace the busy looping of
    a waiter when the futex owner is currently executing the exit cleanup to
    prevent a potential live lock.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit af8cbda2cfcaa5515d61ec500498d46e9a8247e2 upstream.

    exec() attempts to handle potentially held futexes gracefully by running
    the futex exit handling code like exit() does.

    The current implementation has no protection against concurrent incoming
    waiters. The reason is that the futex state cannot be set to
    FUTEX_STATE_DEAD after the cleanup because the task struct is still active
    and just about to execute the new binary.

    While it's arguably buggy when a task holds a futex over exec(), for
    consistency's sake the state handling can at least cover the actual futex
    exit cleanup section. This provides state consistency protection across
    the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the
    cleanup has finished, this cannot prevent subsequent attempts to
    attach to the task in case the cleanup was not successful in mopping
    up all leftovers.
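
    A condensed sketch of the resulting exec path (helper names approximate,
    not the literal code):

    void futex_exec_release(struct task_struct *tsk)
    {
            futex_exit_begin(tsk);          /* futex_state = FUTEX_STATE_EXITING */
            futex_cleanup(tsk);             /* robust list / pi state cleanup    */
            futex_cleanup_end(tsk, FUTEX_STATE_OK); /* task struct stays in use  */
    }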

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 4a8e991b91aca9e20705d434677ac013974e0e30 upstream.

    Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock,
    move the state setting into the lock section.
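
    In effect (sketch of the before/after shape):

    /* before */
    tsk->futex_state = FUTEX_STATE_EXITING;
    smp_mb();
    raw_spin_lock_irq(&tsk->pi_lock);
    raw_spin_unlock_irq(&tsk->pi_lock);

    /* after */
    raw_spin_lock_irq(&tsk->pi_lock);
    tsk->futex_state = FUTEX_STATE_EXITING;
    raw_spin_unlock_irq(&tsk->pi_lock);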

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream.

    Instead of relying on PF_EXITING use an explicit state for the futex exit
    and set it in the futex exit function. This moves the smp barrier and the
    lock/unlock serialization into the futex code.

    As with the DEAD state, this is restricted to the exit path as exec
    continues to use the same task struct.

    This allows that logic to be simplified in a subsequent step.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit f24f22435dcc11389acc87e5586239c1819d217c upstream.

    Setting task::futex_state in do_exit() is rather arbitrarily placed.
    Move it into the futex code.

    Note, this is only done for the exit cleanup as the exec cleanup cannot set
    the state to FUTEX_STATE_DEAD because the task struct is still in active
    use.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 150d71584b12809144b8145b817e83b81158ae5f upstream.

    To allow separate handling of the futex exit state in the futex exit code
    for exit and exec, split futex_mm_release() into two functions and invoke
    them from the corresponding exit/exec_mm_release() callsites.

    Preparatory only, no functional change.
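
    Roughly (sketch of the resulting callsites, not the literal diff):

    /* kernel/fork.c */
    void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm)
    {
            futex_exit_release(tsk);
            mm_release(tsk, mm);
    }

    void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm)
    {
            futex_exec_release(tsk);
            mm_release(tsk, mm);
    }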

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream.

    The futex exit handling relies on PF_ flags. That's suboptimal as it
    requires a smp_mb() and an ugly lock/unlock of the exiting task's pi_lock in
    the middle of do_exit() to enforce the observability of PF_EXITING in the
    futex code.

    Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
    over to the new state. The PF_EXITING dependency will be cleaned up in a
    later step.

    This prepares for handling various futex exit issues later.
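
    Sketch of the new state (names as used throughout this series):

    enum {
            FUTEX_STATE_OK,
            FUTEX_STATE_EXITING,
            FUTEX_STATE_DEAD,
    };

    /* task_struct gains, under CONFIG_FUTEX: */
    unsigned int futex_state;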

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit ba31c1a48538992316cc71ce94fa9cd3e7b427c0 upstream.

    The futex exit handling is #ifdeffed into mm_release() which is not pretty
    to begin with. But upcoming changes to address futex exit races need to add
    more functionality to this exit code.

    Split it out into a function, move it into futex code and make the various
    futex exit functions static.

    Preparatory only and no functional change.

    Folded build fix from Borislav.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit ca16d5bee59807bf04deaab0a8eccecd5061528c upstream.

    Robust futexes utilize the robust_list mechanism to allow the kernel to
    release futexes which are held when a task exits. The exit can be voluntary
    or caused by a signal or fault. This prevents waiters from blocking forever.

    The futex operations in user space store a pointer to the futex they are
    either locking or unlocking in the op_pending member of the per task robust
    list.

    After a lock operation has succeeded the futex is queued in the robust list
    linked list and the op_pending pointer is cleared.

    After an unlock operation has succeeded the futex is removed from the
    robust list linked list and the op_pending pointer is cleared.

    The robust list exit code checks for the pending operation and any futex
    which is queued in the linked list. It carefully checks whether the futex
    value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
    tries to wake up a potential waiter.

    This is race free for the lock operation but unlock has two race scenarios
    where waiters might not be woken up. These issues can be observed with
    regular robust pthread mutexes. PI aware pthread mutexes are not affected.

    (1) Unlocking task is killed after unlocking the futex value in user space
    before being able to wake a waiter.

    pthread_mutex_unlock()
            |
            V
    atomic_exchange_rel (&mutex->__data.__lock, 0)
    wakeup()
            |
            | (return to userspace)
            | (__lock = 0)
            |
            V
    oldval = mutex->__data.__lock
    atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
            id | assume_other_futex_waiters, 0)
            |
            | (enter kernel)
            |
            V
    do_exit()
            |
            V
    handle_futex_death()
            |
            | (__lock = 0)
            | (uval = 0)
            |
            V
    if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
            return 0;

    The sanity check which ensures that the user space futex is owned
    by the exiting task prevents the wakeup of waiters, which seems to
    be correct as the exiting task does not own the futex value, but
    the consequence is that other waiters won't be woken up and block
    infinitely.

    In both scenarios the following conditions are true:

    - task->robust_list->list_op_pending != NULL
    - user space futex value == 0
    - Regular futex (not PI)

    If these conditions are met then it is reasonably safe to wake up a
    potential waiter in order to prevent the above problems.

    As this might be a false positive it can cause spurious wakeups, but the
    waiter side has to handle other types of unrelated wakeups, e.g. signals,
    gracefully anyway. So such a spurious wakeup will not affect the
    correctness of these operations.

    This workaround must not touch the user space futex value and cannot set
    the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
    OWNER_DIED in this case would result in inconsistent state and subsequently
    in malfunction of the owner died handling in user space.

    The rest of the user space state is still consistent as no other task can
    observe the list_op_pending entry in the exiting task's robust list.

    The eventually woken up waiter will observe the uncontended lock value and
    take it over.
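
    Sketch of the resulting check in handle_futex_death() (approximate; the
    pending_op argument name is assumed):

    /*
     * pending_op: this entry came from ->list_op_pending rather than from
     * the robust list itself.
     */
    if (pending_op && !pi && !uval) {
            /* Uncontended value: nothing to fix up, just wake a waiter. */
            futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
            return 0;
    }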

    [ tglx: Massaged changelog and comment. Made the return explicit and not
    depend on the subsequent check and added constants to hand into
    handle_futex_death() instead of plain numbers. Fixed a few coding
    style issues. ]

    Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
    Signed-off-by: Yang Tao
    Signed-off-by: Yi Wang
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn
    Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Yang Tao
     

01 Aug, 2019

2 commits

  • hrtimer_sleepers will gain a scheduling class dependent treatment on
    PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make
    that possible.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
    object which is embedded into the hrtimer_sleeper.

    Combine the initialization and spare a function call. Fixup all call sites.

    This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
    specific initializations of the embedded hrtimer without modifying any of
    the call sites.

    No functional change.
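
    For a call site this amounts to (sketch):

    /* before: two step initialization */
    hrtimer_init(&to->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
    hrtimer_init_sleeper(to, current);

    /* after: combined; the task argument was removed separately */
    hrtimer_init_sleeper(to, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);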

    [ anna-maria: Minor cleanups ]
    [ tglx: Adapted to the removal of the task argument of
    hrtimer_init_sleeper() and trivial polishing.
    Folded a fix from Stephen Rothwell for the vsoc code ]

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de

    Sebastian Andrzej Siewior
     

31 Jul, 2019

1 commit


03 Jun, 2019

1 commit


31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details you
    should have received a copy of the gnu general public license along
    with this program if not write to the free software foundation inc
    59 temple place suite 330 boston ma 02111 1307 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1334 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Richard Fontana
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 May, 2019

1 commit

  • Add a new futex_setup_timer() helper function to consolidate all the
    hrtimer_sleeper setup code.
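
    Sketch of a converted call site (abridged):

    struct hrtimer_sleeper timeout, *to;

    to = futex_setup_timer(abs_time, &timeout, flags,
                           current->timer_slack_ns);
    ...
    if (to) {
            hrtimer_cancel(&to->timer);
            destroy_hrtimer_on_stack(&to->timer);
    }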

    Signed-off-by: Waiman Long
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Link: https://lkml.kernel.org/r/20190528160345.24017-1-longman@redhat.com

    Waiman Long
     

15 May, 2019

1 commit

  • To facilitate additional options to get_user_pages_fast() change the
    singular write parameter to be gup_flags.

    This patch does not change any functionality. New functionality will
    follow in subsequent patches.

    Some of the get_user_pages_fast() call sites were unchanged because they
    already passed FOLL_WRITE or 0 for the write parameter.
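
    For such a caller the conversion is mechanical (sketch):

    /* before: write = 1 */
    ret = get_user_pages_fast(addr, 1, 1, &page);

    /* after: pass FOLL_WRITE (or 0 for read-only access) */
    ret = get_user_pages_fast(addr, 1, FOLL_WRITE, &page);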

    NOTE: It was suggested to change the ordering of the get_user_pages_fast()
    arguments to ensure that callers were converted. This breaks the current
    GUP call site convention of having the returned pages be the final
    parameter. So the suggestion was rejected.

    Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
    Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Mike Marshall
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Michal Hocko
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
     

26 Apr, 2019

1 commit

  • Some futex() operations, including FUTEX_WAKE_OP, require the kernel to
    perform an atomic read-modify-write of the futex word via the userspace
    mapping. These operations are implemented by each architecture in
    arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
    are called in atomic context with the relevant hash bucket locks held.

    Although these routines may return -EFAULT in response to a page fault
    generated when accessing userspace, they are expected to succeed (i.e.
    return 0) in all other cases. This poses a problem for architectures
    that do not provide bounded forward progress guarantees or fairness of
    contended atomic operations and can lead to starvation in some cases.

    In these problematic scenarios, we must return back to the core futex
    code so that we can drop the hash bucket locks and reschedule if
    necessary, much like we do in the case of a page fault.

    Allow architectures to return -EAGAIN from their implementations of
    arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
    will cause the core futex code to reschedule if necessary and return
    back to the architecture code later on.
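
    On the core side this ends up looking roughly like the following
    (simplified sketch; the lock handling differs per call site):

    ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
    if (ret == -EAGAIN) {
            /* Drop the hash bucket locks, give the CPU away and retry. */
            double_unlock_hb(hb1, hb2);
            cond_resched();
            goto retry_private;
    }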

    Cc:
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Will Deacon

    Will Deacon
     

22 Mar, 2019

1 commit

    The futex code requires that the user space addresses of futexes are 32bit
    aligned. sys_futex() checks this in get_futex_key() but the robust list
    code has no alignment check in place.

    As a consequence the kernel crashes on architectures with strict alignment
    requirements in handle_futex_death() when trying to cmpxchg() on an
    unaligned futex address which was retrieved from the robust list.
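
    The added check boils down to (sketch):

    /* handle_futex_death(): robust list entries must be u32 aligned */
    if ((((unsigned long)uaddr) % sizeof(*uaddr)) != 0)
            return -1;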

    [ tglx: Rewrote changelog, proper sizeof() based alignment check and add
    comment ]

    Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
    Signed-off-by: Chen Jie
    Signed-off-by: Thomas Gleixner
    Cc:
    Cc:
    Cc:
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/1552621478-119787-1-git-send-email-chenjie6@huawei.com

    Chen Jie
     

06 Mar, 2019

2 commits

  • Pull locking updates from Ingo Molnar:
    "The biggest part of this tree is the new auto-generated atomics API
    wrappers by Mark Rutland.

    The primary motivation was to allow instrumentation without uglifying
    the primary source code.

    The linecount increase comes from adding the auto-generated files to
    the Git space as well:

    include/asm-generic/atomic-instrumented.h | 1689 ++++++++++++++++--
    include/asm-generic/atomic-long.h | 1174 ++++++++++---
    include/linux/atomic-fallback.h | 2295 +++++++++++++++++++++++++
    include/linux/atomic.h | 1241 +------------

    I preferred this approach, so that the full call stack of the (already
    complex) locking APIs is still fully visible in 'git grep'.

    But if this is excessive we could certainly hide them.

    There's a separate build-time mechanism to determine whether the
    headers are out of date (they should never be stale if we do our job
    right).

    Anyway, nothing from this should be visible to regular kernel
    developers.

    Other changes:

    - Add support for dynamic keys, which removes a source of false
    positives in the workqueue code, among other things (Bart Van
    Assche)

    - Updates to tools/memory-model (Andrea Parri, Paul E. McKenney)

    - qspinlock, wake_q and lockdep micro-optimizations (Waiman Long)

    - misc other updates and enhancements"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
    locking/lockdep: Shrink struct lock_class_key
    locking/lockdep: Add module_param to enable consistency checks
    lockdep/lib/tests: Test dynamic key registration
    lockdep/lib/tests: Fix run_tests.sh
    kernel/workqueue: Use dynamic lockdep keys for workqueues
    locking/lockdep: Add support for dynamic keys
    locking/lockdep: Verify whether lock objects are small enough to be used as class keys
    locking/lockdep: Check data structure consistency
    locking/lockdep: Reuse lock chains that have been freed
    locking/lockdep: Fix a comment in add_chain_cache()
    locking/lockdep: Introduce lockdep_next_lockchain() and lock_chain_count()
    locking/lockdep: Reuse list entries that are no longer in use
    locking/lockdep: Free lock classes that are no longer in use
    locking/lockdep: Update two outdated comments
    locking/lockdep: Make it easy to detect whether or not inside a selftest
    locking/lockdep: Split lockdep_free_key_range() and lockdep_reset_lock()
    locking/lockdep: Initialize the locks_before and locks_after lists earlier
    locking/lockdep: Make zap_class() remove all matching lock order entries
    locking/lockdep: Reorder struct lock_class members
    locking/lockdep: Avoid that add_chain_cache() adds an invalid chain to the cache
    ...

    Linus Torvalds
     
  • Pull year 2038 updates from Thomas Gleixner:
    "Another round of changes to make the kernel ready for 2038. After lots
    of preparatory work this is the first set of syscalls which are 2038
    safe:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceiv_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

    The syscall numbers are identical all over the architectures"

    * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    riscv: Use latest system call ABI
    checksyscalls: fix up mq_timedreceive and stat exceptions
    unicore32: Fix __ARCH_WANT_STAT64 definition
    asm-generic: Make time32 syscall numbers optional
    asm-generic: Drop getrlimit and setrlimit syscalls from default list
    32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
    compat ABI: use non-compat openat and open_by_handle_at variants
    y2038: add 64-bit time_t syscalls to all 32-bit architectures
    y2038: rename old time and utime syscalls
    y2038: remove struct definition redirects
    y2038: use time32 syscall names on 32-bit
    syscalls: remove obsolete __IGNORE_ macros
    y2038: syscalls: rename y2038 compat syscalls
    x86/x32: use time64 versions of sigtimedwait and recvmmsg
    timex: change syscalls to use struct __kernel_timex
    timex: use __kernel_timex internally
    sparc64: add custom adjtimex/clock_adjtime functions
    time: fix sys_timer_settime prototype
    time: Add struct __kernel_timex
    time: make adjtime compat handling available for 32 bit
    ...

    Linus Torvalds
     

28 Feb, 2019

1 commit


11 Feb, 2019

2 commits

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable futex_pi_state.refcount is used as pure
    reference counter. Convert it to refcount_t and fix up
    the operations.
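
    The conversion is mechanical for the basic operations (sketch):

    atomic_set(&pi_state->refcount, 1);        -> refcount_set(&pi_state->refcount, 1);
    atomic_inc(&pi_state->refcount);           -> refcount_inc(&pi_state->refcount);
    atomic_inc_not_zero(&pi_state->refcount);  -> refcount_inc_not_zero(&pi_state->refcount);
    atomic_dec_and_test(&pi_state->refcount);  -> refcount_dec_and_test(&pi_state->refcount);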

    **Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
    for more information.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.
    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the futex_pi_state.refcount it might make a difference
    in following places:

    - get_pi_state() and exit_pi_state_list(): increment in
    refcount_inc_not_zero() only guarantees control dependency
    on success vs. fully ordered atomic counterpart
    - put_pi_state(): decrement in refcount_dec_and_test() provides
    RELEASE ordering and ACQUIRE ordering on success
    vs. fully ordered atomic counterpart

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: dvhart@infradead.org
    Link: http://lkml.kernel.org/r/1549369467-3505-1-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     
  • …/arnd/playground into timers/2038

    Pull y2038 - time64 system calls from Arnd Bergmann:

    This series finally gets us to the point of having system calls with 64-bit
    time_t on all architectures, after a long time of incremental preparation
    patches.

    There was actually one conversion that I missed during the summer,
    i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
    and review comments.

    The following system calls are now added on all 32-bit architectures using
    the same system call numbers:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceiv_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

    Each one of these corresponds directly to an existing system call that
    includes a 'struct timespec' argument, or a structure containing a timespec
    or (in case of clock_adjtime) timeval. Not included here are new versions
    of getitimer/setitimer and getrusage/waitid, which are planned for the
    future but only needed to make a consistent API rather than for correct
    operation beyond y2038. These four system calls are based on 'timeval', and
    it has not been finally decided what the replacement kernel interface will
    use instead.

    So far, I have done a lot of build testing across most architectures, which
    has found a number of bugs. Runtime testing so far included testing LTP on
    32-bit ARM with the existing system calls, to ensure we do not regress for
    existing binaries, and a test with a 32-bit x86 build of LTP against a
    modified version of the musl C library that has been adapted to the new
    system call interface [3]. This library can be used for testing on all
    architectures supported by musl-1.1.21, but it is not how the support is
    getting integrated into the official musl release. Official musl support is
    planned but will require more invasive changes to the library.

    Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
    Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
    Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]

    Thomas Gleixner
     

08 Feb, 2019

2 commits

    commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
    rtmutex") changed the locking rules in the futex code so that the hash
    bucket lock is no longer held while the waiter is enqueued into the
    rtmutex wait list. This made the lock and the unlock path symmetric, but
    unfortunately the possible early exit from __rt_mutex_proxy_start() due to
    a detected deadlock was not updated accordingly. That allows a concurrent
    unlocker to observe inconsistent state which triggers the warning in the
    unlock path.

    futex_lock_pi()                         futex_unlock_pi()
      lock(hb->lock)
      queue(hb_waiter)                      lock(hb->lock)
      lock(rtmutex->wait_lock)
      unlock(hb->lock)
                                            // acquired hb->lock
                                            hb_waiter = futex_top_waiter()
                                            lock(rtmutex->wait_lock)
      __rt_mutex_proxy_start()
         ---> fail
              remove(rtmutex_waiter);
         ---> returns -EDEADLOCK
      unlock(rtmutex->wait_lock)
                                            // acquired wait_lock
                                            wake_futex_pi()
                                            rt_mutex_next_owner()
                                              --> returns NULL
                                              --> WARN

      lock(hb->lock)
      unqueue(hb_waiter)

    The problem is caused by the remove(rtmutex_waiter) in the failure case of
    __rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
    hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

    The original commit handles this correctly for the other early return cases
    (timeout, signal) by delaying the removal of the rtmutex waiter until the
    returning task reacquired the hash bucket lock.

    Treat the failure case of __rt_mutex_proxy_start() in the same way and let
    the existing cleanup code handle the eventual handover of the rtmutex
    gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
    removal for the failure case, so that the other callsites are still
    operating correctly.

    Add proper comments to the code so all these details are fully documented.

    Thanks to Peter for helping with the analysis and writing the really
    valuable code comments.

    Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
    Reported-by: Heiko Carstens
    Co-developed-by: Peter Zijlstra
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    Tested-by: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: linux-s390@vger.kernel.org
    Cc: Stefan Liebler
    Cc: Sebastian Sewior
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de

    Thomas Gleixner
     
    The current comment for the barrier which guarantees that the waiter
    increment is always done before taking the hb spinlock (barrier (A)) is
    misplaced and needs to be fixed.

    This is obviously referring to hb_waiters_inc, which is a full barrier.

    Reported-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190206185602.949-1-dave@stgolabs.net

    Davidlohr Bueso
     

07 Feb, 2019

1 commit

  • A lot of system calls that pass a time_t somewhere have an implementation
    using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
    been reworked so that this implementation can now be used on 32-bit
    architectures as well.

    The missing step is to redefine them using the regular SYSCALL_DEFINEx()
    to get them out of the compat namespace and make it possible to build them
    on 32-bit architectures.

    Any system call that ends in 'time' gets a '32' suffix on its name for
    that version, while the others get a '_time32' suffix, to distinguish
    them from the normal version, which takes a 64-bit time argument in the
    future.

    In this step, only 64-bit architectures are changed, doing this rename
    first lets us avoid touching the 32-bit architectures twice.

    Acked-by: Catalin Marinas
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

04 Feb, 2019

1 commit

    Some users, specifically futexes and rwsems, required fixes
    that allowed the callers to be safe when wakeups occur before
    they are expected by wake_up_q(). Such scenarios also play
    games and rely on reference counting, and until now were
    pivoting on wake_q doing it. With the wake_q_add() call being
    moved down, this can no longer be the case. As such we end up
    with a double task refcounting overhead; and these callers
    care enough about this (being rather core-ish).

    This patch introduces a wake_q_add_safe() call that serves
    for callers that have already done refcounting and therefore the
    task is 'safe' from wake_q's point of view (in that a reference
    is held throughout the entire queue/wakeup cycle). In the one
    case it has internal reference counting, in the other case it
    consumes the reference counting.
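
    Usage difference in a nutshell (sketch):

    /* wake_q_add(): wake_q takes and drops its own reference */
    wake_q_add(&wake_q, p);
    put_task_struct(p);             /* caller drops its own reference      */

    /* wake_q_add_safe(): consumes the reference the caller already holds */
    get_task_struct(p);
    wake_q_add_safe(&wake_q, p);    /* no put_task_struct() by the caller  */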

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Xie Yongji
    Cc: Yongji Xie
    Cc: andrea.parri@amarulasolutions.com
    Cc: lilin24@baidu.com
    Cc: liuqi16@baidu.com
    Cc: nixun@baidu.com
    Cc: yuanlinsi01@baidu.com
    Cc: zhangyu31@baidu.com
    Link: https://lkml.kernel.org/r/20181218195352.7orq3upiwfdbrdne@linux-r8p5
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

30 Jan, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the return
    value. The function can work or not, but the code logic should never do
    something different based on this.

    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart (VMware)
    Cc: Peter Zijlstra
    Link: https://lkml.kernel.org/r/20190122152151.16139-40-gregkh@linuxfoundation.org

    Greg Kroah-Hartman
     

21 Jan, 2019

1 commit

  • We must not rely on wake_q_add() to delay the wakeup; in particular
    commit:

    1d0dcb3ad9d3 ("futex: Implement lockless wakeups")

    moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
    could result in futex_wait() waking before observing ->lock_ptr ==
    NULL and going back to sleep again.
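
    The required ordering on the futex side (simplified sketch of
    mark_wake_futex()):

    __unqueue_futex(q);
    /*
     * The waiter can free *q as soon as lock_ptr is set to NULL, so the
     * wakeup must not be queued before this release store.
     */
    smp_store_release(&q->lock_ptr, NULL);
    wake_q_add(wake_q, p);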

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
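
    i.e. the trivial cases become (sketch):

    /* before */
    if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
            return -EFAULT;

    /* after */
    if (!access_ok(uaddr, sizeof(u32)))
            return -EFAULT;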

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Dec, 2018

1 commit

  • Pull y2038 updates from Arnd Bergmann:
    "More syscalls and cleanups

    This concludes the main part of the system call rework for 64-bit
    time_t, which has spread over most of year 2018, the last six system
    calls being

    - ppoll
    - pselect6
    - io_pgetevents
    - recvmmsg
    - futex
    - rt_sigtimedwait

    As before, nothing changes for 64-bit architectures, while 32-bit
    architectures gain another entry point that differs only in the layout
    of the timespec structure. Hopefully in the next release we can wire
    up all 22 of those system calls on all 32-bit architectures, which
    gives us a baseline version for glibc to start using them.

    This does not include the clock_adjtime, getrusage/waitid, and
    getitimer/setitimer system calls. I still plan to have new versions of
    those as well, but they are not required for correct operation of the
    C library since they can be emulated using the old 32-bit time_t based
    system calls.

    Aside from the system calls, there are also a few cleanups here,
    removing old kernel internal interfaces that have become unused after
    all references got removed. The arch/sh cleanups are part of this,
    there were posted several times over the past year without a reaction
    from the maintainers, while the corresponding changes made it into all
    other architectures"

    * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
    timekeeping: remove obsolete time accessors
    vfs: replace current_kernel_time64 with ktime equivalent
    timekeeping: remove timespec_add/timespec_del
    timekeeping: remove unused {read,update}_persistent_clock
    sh: remove board_time_init() callback
    sh: remove unused rtc_sh_get/set_time infrastructure
    sh: sh03: rtc: push down rtc class ops into driver
    sh: dreamcast: rtc: push down rtc class ops into driver
    y2038: signal: Add compat_sys_rt_sigtimedwait_time64
    y2038: signal: Add sys_rt_sigtimedwait_time32
    y2038: socket: Add compat_sys_recvmmsg_time64
    y2038: futex: Add support for __kernel_timespec
    y2038: futex: Move compat implementation into futex.c
    io_pgetevents: use __kernel_timespec
    pselect6: use __kernel_timespec
    ppoll: use __kernel_timespec
    signal: Add restore_user_sigmask()
    signal: Add set_user_sigmask()

    Linus Torvalds
     

19 Dec, 2018

1 commit

    Stefan reported that the glibc tst-robustpi4 test case fails
    occasionally. That case creates the following race between
    sys_exit() and sys_futex_lock_pi():

    CPU 0                               CPU 1

    sys_exit()                          sys_futex()
     do_exit()                           futex_lock_pi()
      exit_signals(tsk)                   No waiters:
       tsk->flags |= PF_EXITING;          *uaddr == 0x00000PID
      mm_release(tsk)                     Set waiter bit
       exit_robust_list(tsk) {            *uaddr = 0x80000PID;
           Set owner died                 attach_to_pi_owner() {
        *uaddr = 0xC0000000;               tsk = get_task(PID);
       }                                   if (!tsk->flags & PF_EXITING) {
      ...                                    attach();
      tsk->flags |= PF_EXITPIDONE;         } else {
                                             if (!(tsk->flags & PF_EXITPIDONE))
                                               return -EAGAIN;
                                             return -ESRCH;
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Heiko Carstens
    Cc: Darren Hart
    Cc: Ingo Molnar
    Cc: Sasha Levin
    Cc: stable@vger.kernel.org
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
    Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de

    Thomas Gleixner
     

08 Dec, 2018

2 commits

  • This prepares sys_futex for y2038 safe calling: the native
    syscall is changed to receive a __kernel_timespec argument, which
    will be switched to 64-bit time_t in the future. All the internal
    time handling gets changed to timespec64, and the compat_sys_futex
    entry point is moved under the CONFIG_COMPAT_32BIT_TIME check
    to provide compatibility for existing 32-bit architectures.
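
    Abridged sketch of the resulting native entry point (the check for a
    command that takes a timeout is paraphrased as cmd_has_timeout()):

    SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
                    const struct __kernel_timespec __user *, utime,
                    u32 __user *, uaddr2, u32, val3)
    {
            struct timespec64 ts;
            ktime_t t, *tp = NULL;

            if (utime && cmd_has_timeout(op)) {
                    if (get_timespec64(&ts, utime))
                            return -EFAULT;
                    if (!timespec64_valid(&ts))
                            return -EINVAL;
                    t = timespec64_to_ktime(ts);
                    tp = &t;
            }
            ...
    }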

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • We are going to share the compat_sys_futex() handler between 64-bit
    architectures and 32-bit architectures that need to deal with both 32-bit
    and 64-bit time_t, and this is easier if both entry points are in the
    same file.

    In fact, most other system call handlers do the same thing these days, so
    let's follow the trend here and merge all of futex_compat.c into futex.c.

    In the process, a few minor changes have to be done to make sure everything
    still makes sense: handle_futex_death() and futex_cmpxchg_enabled() become
    local symbols, and the compat version of the fetch_robust_entry() function
    gets renamed to compat_fetch_robust_entry() to avoid a symbol clash.

    This is intended as a purely cosmetic patch, no behavior should
    change.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

31 Oct, 2018

1 commit

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of duplicated '#include <linux/memblock.h>'

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

09 Oct, 2018

1 commit

  • lockdep_assert_held() is better suited for checking locking requirements,
    since it won't get confused when the lock is held by some other task. This
    is also a step towards possibly removing spin_is_locked().
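
    i.e. (sketch; the 'before' form is approximate):

    /* before */
    WARN_ON(!spin_is_locked(q->lock_ptr));

    /* after */
    lockdep_assert_held(q->lock_ptr);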

    Signed-off-by: Lance Roy
    Signed-off-by: Thomas Gleixner
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Link: https://lkml.kernel.org/r/20181003053902.6910-12-ldr709@gmail.com

    Lance Roy
     

21 Aug, 2018

1 commit


07 Feb, 2018

1 commit

  • There are several functions that do find_task_by_vpid() followed by
    get_task_struct(). We can use a helper function instead.
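
    With such a helper (find_get_task_by_vpid() upstream) the pattern
    collapses from (sketch):

    /* before */
    rcu_read_lock();
    p = find_task_by_vpid(pid);
    if (p)
            get_task_struct(p);
    rcu_read_unlock();

    /* after */
    p = find_get_task_by_vpid(pid);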

    Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport