05 Mar, 2020

1 commit

  • commit a030f9767da1a6bbcec840fc54770eb11c2414b6 upstream.

    It was found that two lines in the output of /proc/lockdep_stats have
    an indentation problem:

    # cat /proc/lockdep_stats
     :
    in-process chains:               25057
    stack-trace entries:            137827 [max: 524288]
    number of stack traces:        7973
    number of stack hash chains:   6355
    combined max dependencies:  1356414598
    hardirq-safe locks:                 57
    hardirq-unsafe locks:             1286
     :

    All the numbers displayed in /proc/lockdep_stats except the two stack
    trace numbers are formatted with a field width of 11. To properly align
    all the numbers, a field width of 11 is now applied to the two stack
    trace numbers as well.
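
    For illustration only, a small standalone C program (not the kernel
    diff itself) showing how a printf-style field width of 11 produces the
    desired right alignment; /proc/lockdep_stats is generated with the
    kernel's seq_printf(), which uses the same format conventions:

      #include <stdio.h>

      int main(void)
      {
              /* "%11lu" right-justifies each value in an 11-character field,
               * so the numbers line up in one column regardless of how many
               * digits they have. */
              printf("%-28s%11lu\n", " in-process chains:", 25057UL);
              printf("%-28s%11lu\n", " number of stack traces:", 7973UL);
              printf("%-28s%11lu\n", " combined max dependencies:", 1356414598UL);
              return 0;
      }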

    Fixes: 8c779229d0f4 ("locking/lockdep: Report more stack trace statistics")
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Bart Van Assche
    Link: https://lkml.kernel.org/r/20191211213139.29934-1-longman@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

23 Jan, 2020

2 commits

  • commit d91f3057263ceb691ef527e71b41a56b17f6c869 upstream.

    If the lockdep code really runs out of stack_trace entries, it is
    likely that a buffer overrun can happen and the data immediately
    after stack_trace[] will be corrupted.

    If there are fewer than LOCK_TRACE_SIZE_IN_LONGS entries left before
    the call to save_trace(), the max_entries computation will leave it
    with a very large positive number because of its unsigned nature. The
    subsequent call to stack_trace_save() will then corrupt the data after
    stack_trace[]. Fix that by changing max_entries to a signed integer
    and checking for a negative value before calling stack_trace_save().
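
    The failure mode can be illustrated with a standalone sketch (the
    constants and names below are stand-ins, not the lockdep code): an
    unsigned subtraction silently wraps to a huge value, while a signed
    intermediate lets the caller detect that it has run out of entries.

      #include <stdio.h>

      #define MAX_ENTRIES      16UL  /* stand-in for MAX_STACK_TRACE_ENTRIES */
      #define TRACE_SIZE_LONGS  4UL  /* stand-in for LOCK_TRACE_SIZE_IN_LONGS */

      int main(void)
      {
              unsigned long nr_entries = 14;      /* only 2 slots left */

              /* Buggy pattern: unsigned arithmetic wraps around. */
              unsigned long bad_max = MAX_ENTRIES - nr_entries - TRACE_SIZE_LONGS;
              printf("unsigned max_entries = %lu (bogus, would overrun)\n", bad_max);

              /* Fixed pattern: compute with a signed type and bail out when
               * the result goes negative instead of saving the trace. */
              long max_entries = (long)(MAX_ENTRIES - nr_entries) - (long)TRACE_SIZE_LONGS;
              if (max_entries <= 0)
                      printf("signed max_entries = %ld -> out of entries\n", max_entries);
              return 0;
      }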

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Bart Van Assche
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 12593b7467f9 ("locking/lockdep: Reduce space occupied by stack traces")
    Link: https://lkml.kernel.org/r/20191220135128.14876-1-longman@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     
  • commit 39e7234f00bc93613c086ae42d852d5f4147120a upstream.

    Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer
    optimistically spin on owner") allows a recently woken up waiting
    writer to spin on the owner. Unfortunately, if the owner happens to be
    RWSEM_OWNER_UNKNOWN, the code will incorrectly spin on it, leading to a
    kernel crash. This is fixed by passing the proper non-spinnable bits
    to rwsem_spin_on_owner() so that RWSEM_OWNER_UNKNOWN will be treated
    as a non-spinnable target.

    Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")

    Reported-by: Christoph Hellwig
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200115154336.8679-1-longman@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

12 Jan, 2020

1 commit

  • [ Upstream commit 1a365e822372ba24c9da0822bc583894f6f3d821 ]

    This fixes various data races in spinlock_debug. When testing with
    KCSAN, the console gets spammed with data race reports, suggesting
    these races are extremely frequent.

    Example data race report:

    read to 0xffff8ab24f403c48 of 4 bytes by task 221 on cpu 2:
    debug_spin_lock_before kernel/locking/spinlock_debug.c:85 [inline]
    do_raw_spin_lock+0x9b/0x210 kernel/locking/spinlock_debug.c:112
    __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
    _raw_spin_lock+0x39/0x40 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:338 [inline]
    get_partial_node.isra.0.part.0+0x32/0x2f0 mm/slub.c:1873
    get_partial_node mm/slub.c:1870 [inline]

    write to 0xffff8ab24f403c48 of 4 bytes by task 167 on cpu 3:
    debug_spin_unlock kernel/locking/spinlock_debug.c:103 [inline]
    do_raw_spin_unlock+0xc9/0x1a0 kernel/locking/spinlock_debug.c:138
    __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:159 [inline]
    _raw_spin_unlock_irqrestore+0x2d/0x50 kernel/locking/spinlock.c:191
    spin_unlock_irqrestore include/linux/spinlock.h:393 [inline]
    free_debug_processing+0x1b3/0x210 mm/slub.c:1214
    __slab_free+0x292/0x400 mm/slub.c:2864

    As a side-effect, with KCSAN, this eventually locks up the console, most
    likely due to deadlock, e.g. .. -> printk lock -> spinlock_debug ->
    KCSAN detects data race -> kcsan_print_report() -> printk lock ->
    deadlock.

    This fix will 1) avoid the data races, and 2) allow using lock debugging
    together with KCSAN.
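
    The changelog does not spell out the mechanism, but the usual cure for
    benign debug-field races of this kind is to mark the racy accesses (with
    the kernel's READ_ONCE()/WRITE_ONCE() annotations) so that the compiler
    and KCSAN treat them as intentional. A rough userspace analogue using
    relaxed C11 atomics:

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      /* Debug-only bookkeeping touched from several threads without a common
       * lock. Marking the accesses (relaxed atomics here, READ_ONCE()/
       * WRITE_ONCE() in the kernel) keeps the race benign and well-defined. */
      static atomic_int owner_cpu = -1;

      static void *locker(void *arg)
      {
              atomic_store_explicit(&owner_cpu, 1, memory_order_relaxed);
              atomic_store_explicit(&owner_cpu, -1, memory_order_relaxed);
              return NULL;
      }

      static void *checker(void *arg)
      {
              int cpu = atomic_load_explicit(&owner_cpu, memory_order_relaxed);
              printf("debug check saw owner_cpu=%d\n", cpu);
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              pthread_create(&a, NULL, locker, NULL);
              pthread_create(&b, NULL, checker, NULL);
              pthread_join(a, NULL);
              pthread_join(b, NULL);
              return 0;
      }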

    Reported-by: Qian Cai
    Signed-off-by: Marco Elver
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20191120155715.28089-1-elver@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Marco Elver
     

25 Sep, 2019

1 commit

    This patch reverts commit 75437bb304b20 (locking/pvqspinlock: Don't
    wait if vCPU is preempted), which caused a large performance regression
    in over-subscription scenarios.

    The test was run on a Xeon Skylake box, 2 sockets, 40 cores, 80 threads,
    with three VMs of 80 vCPUs each. The score of ebizzy -M is reduced from
    13000-14000 records/s to 1700-1800 records/s:

    Host                            Guest      score

    vanilla w/o kvm optimizations   upstream   1700-1800 records/s
    vanilla w/o kvm optimizations   revert     13000-14000 records/s
    vanilla w/ kvm optimizations    upstream   4500-5000 records/s
    vanilla w/ kvm optimizations    revert     14000-15500 records/s

    Exit from aggressive wait-early mechanism can result in premature yield
    and extra scheduling latency.

    Actually, only 6% of wait_early events are caused by vcpu_is_preempted()
    being true. However, when one vCPU voluntarily releases its CPU, all the
    subsequent waiters in the queue will do the same, and the cascading
    effect leads to bad performance.

    kvm optimizations:
    [1] commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts)
    [2] commit 266e85a5ec9 (KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption)

    Tested-by: loobinliu@tencent.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Waiman Long
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: loobinliu@tencent.com
    Cc: stable@vger.kernel.org
    Fixes: 75437bb304b20 (locking/pvqspinlock: Don't wait if vCPU is preempted)
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

17 Sep, 2019

2 commits

  • Pull scheduler updates from Ingo Molnar:

    - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
    Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
    Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

    As perf and the scheduler are getting bigger and more complex,
    document the status quo of current responsibilities and interests,
    and spread the review pain^H^H^H^H fun via an increase in the Cc:
    linecount generated by scripts/get_maintainer.pl. :-)

    - Add another series of patches that brings the -rt (PREEMPT_RT) tree
    closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
    into a new CONFIG_PREEMPTION category that will allow the eventual
    introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
    to go though.

    - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
    allow the finer shaping of CPU bandwidth usage.

    - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

    - Improve the behavior of high CPU count, high thread count
    applications running under cpu.cfs_quota_us constraints.

    - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

    - Improve CPU isolation housekeeping CPU allocation NUMA locality.

    - Fix deadline scheduler bandwidth calculations and logic when cpusets
    rebuilds the topology, or when it gets deadline-throttled while it's
    being offlined.

    - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
    setscheduler() system calls without creating global serialization.
    Add new synchronization between cpuset topology-changing events and
    the deadline acceptance tests in setscheduler(), which were broken
    before.

    - Rework the active_mm state machine to be less confusing and more
    optimal.

    - Rework (simplify) the pick_next_task() slowpath.

    - Improve load-balancing on AMD EPYC systems.

    - ... and misc cleanups, smaller fixes and improvements - please see
    the Git log for more details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
    sched/psi: Correct overly pessimistic size calculation
    sched/fair: Speed-up energy-aware wake-ups
    sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
    sched/uclamp: Update CPU's refcount on TG's clamp changes
    sched/uclamp: Use TG's clamps to restrict TASK's clamps
    sched/uclamp: Propagate system defaults to the root group
    sched/uclamp: Propagate parent clamps
    sched/uclamp: Extend CPU's cgroup controller
    sched/topology: Improve load balancing on AMD EPYC systems
    arch, ia64: Make NUMA select SMP
    sched, perf: MAINTAINERS update, add submaintainers and reviewers
    sched/fair: Use rq_lock/unlock in online_fair_sched_group
    cpufreq: schedutil: fix equation in comment
    sched: Rework pick_next_task() slow-path
    sched: Allow put_prev_task() to drop rq->lock
    sched/fair: Expose newidle_balance()
    sched: Add task_struct pointer to sched_class::set_curr_task
    sched: Rework CPU hotplug task selection
    sched/{rt,deadline}: Fix set_next_task vs pick_next_task
    sched: Fix kerneldoc comment for ia64_set_curr_task
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:

    - improve rwsem scalability

    - add uninitialized rwsem debugging check

    - reduce lockdep's stacktrace memory usage and add diagnostics

    - misc cleanups, code consolidation and constification

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mutex: Fix up mutex_waiter usage
    locking/mutex: Use mutex flags macro instead of hard code
    locking/mutex: Make __mutex_owner static to mutex.c
    locking/qspinlock,x86: Clarify virt_spin_lock_key
    locking/rwsem: Check for operations on an uninitialized rwsem
    locking/rwsem: Make handoff writer optimistically spin on owner
    locking/lockdep: Report more stack trace statistics
    locking/lockdep: Reduce space occupied by stack traces
    stacktrace: Constify 'entries' arguments
    locking/lockdep: Make it clear that what lock_class::key points at is not modified

    Linus Torvalds
     

08 Aug, 2019

1 commit

    The patch moving bits into mutex.c was a little too much; by also
    moving struct mutex_waiter, a few less common CONFIGs would no longer
    build.

    Fixes: 5f35d5a66b3e ("locking/mutex: Make __mutex_owner static to mutex.c")
    Signed-off-by: Peter Zijlstra (Intel)

    Peter Zijlstra
     

06 Aug, 2019

4 commits

    Use the mutex flag macro instead of a hard-coded value inside
    __mutex_owner().

    Signed-off-by: Mukesh Ojha
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mingo@redhat.com
    Cc: will@kernel.org
    Link: https://lkml.kernel.org/r/1564585504-3543-2-git-send-email-mojha@codeaurora.org

    Mukesh Ojha
     
    __mutex_owner() should only be used by the mutex APIs. So, to enforce
    this restriction, move the __mutex_owner() function definition from
    linux/mutex.h to the mutex.c file.

    There are functions that use __mutex_owner(), like mutex_is_locked()
    and mutex_trylock_recursive(), so to keep the legacy behaviour intact
    move them as well and export them.

    Also move the mutex_waiter structure to keep it private to the file.

    Signed-off-by: Mukesh Ojha
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mingo@redhat.com
    Cc: will@kernel.org
    Link: https://lkml.kernel.org/r/1564585504-3543-1-git-send-email-mojha@codeaurora.org

    Mukesh Ojha
     
    Currently the rwsem is the only locking primitive that lacks this
    debug feature. Add it under CONFIG_DEBUG_RWSEMS and do the magic
    checking in the locking fastpath (trylock) operation such that we
    cover all cases. The unlocking part is pretty straightforward.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: mingo@kernel.org
    Cc: Davidlohr Bueso
    Link: https://lkml.kernel.org/r/20190729044735.9632-1-dave@stgolabs.net

    Davidlohr Bueso
     
    When the handoff bit is set by a writer, no task other than the
    setting writer itself is allowed to acquire the lock. If the
    to-be-handoff'ed writer goes to sleep, there will be a wakeup latency
    period where the lock is free, but no one can acquire it. That is less
    than ideal.

    To reduce that latency, the handoff writer will now optimistically spin
    on the owner if it happens to be an on-CPU writer. It will spin until
    the owner releases the lock, at which point the to-be-handoff'ed writer
    can acquire the lock immediately without any delay. Of course, if the
    owner is not an on-CPU writer, the to-be-handoff'ed writer will have to
    sleep anyway.

    The optimistic spinning code is also modified to not stop spinning
    when the handoff bit is set. This prevents an occasional setting of the
    handoff bit from pushing a bunch of optimistic spinners into the wait
    queue and causing a significant reduction in throughput.

    On a 1-socket 22-core 44-thread Skylake system, the AIM7 shared_memory
    workload was run with 7000 users. The throughput (jobs/min) of the
    following kernels were as follows:

    1) 5.2-rc6                                     - 8,092,486
    2) 5.2-rc6 + tip's rwsem patches               - 7,567,568
    3) 5.2-rc6 + tip's rwsem patches + this patch  - 7,954,545

    Using perf-record(1), the %cpu time used by rwsem_down_write_slowpath(),
    rwsem_down_write_failed() and their callees for the 3 kernels were 1.70%,
    5.46% and 2.08% respectively.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: x86@kernel.org
    Cc: Ingo Molnar
    Cc: Will Deacon
    Cc: huang ying
    Cc: Tim Chen
    Cc: Linus Torvalds
    Cc: Borislav Petkov
    Cc: Thomas Gleixner
    Cc: Davidlohr Bueso
    Cc: "H. Peter Anvin"
    Link: https://lkml.kernel.org/r/20190625143913.24154-1-longman@redhat.com

    Waiman Long
     

25 Jul, 2019

11 commits

  • Returning the pointer that was passed in allows us to write
    slightly more idiomatic code. Convert a few users.
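
    The changelog doesn't name the converted function, so as a generic,
    self-contained illustration of the pattern (hypothetical names): when an
    init-style helper returns the pointer it was given, callers can nest the
    call instead of needing a separate statement.

      #include <stdio.h>

      struct point { int x, y; };

      /* Returning the passed-in pointer lets callers chain or nest the call. */
      static struct point *point_init(struct point *p, int x, int y)
      {
              p->x = x;
              p->y = y;
              return p;
      }

      static void print_point(const struct point *p)
      {
              printf("(%d, %d)\n", p->x, p->y);
      }

      int main(void)
      {
              struct point p;

              /* Slightly more idiomatic one-liner enabled by the return value. */
              print_point(point_init(&p, 1, 2));
              return 0;
      }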

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
    Signed-off-by: Ingo Molnar

    Matthew Wilcox (Oracle)
     
    Report the number of stack traces and the number of stack trace hash
    chains. These two numbers are useful because they allow one to estimate
    the number of stack trace hash collisions.
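
    As a worked example using the /proc/lockdep_stats sample quoted earlier
    in this log (7973 stack traces spread over 6355 hash chains): at least
    7973 - 6355 = 1618 of those traces must have hashed into a chain that
    already held another trace, which gives a rough lower bound on the
    number of collisions.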

    Signed-off-by: Bart Van Assche
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20190722182443.216015-5-bvanassche@acm.org
    Signed-off-by: Ingo Molnar

    Bart Van Assche
     
    Although commit 669de8bda87b ("kernel/workqueue: Use dynamic lockdep keys
    for workqueues") unregisters dynamic lockdep keys when a workqueue is
    destroyed, a side effect of that commit is that all stack traces
    associated with the lockdep key are leaked when a workqueue is destroyed.
    Fix this by storing each unique stack trace once. Other changes in this
    patch are:

    - Use NULL instead of { .nr_entries = 0 } to represent 'no trace'.
    - Store a pointer to a stack trace in struct lock_class and struct
      lock_list instead of storing 'nr_entries' and 'offset'.

    This patch prevents the following program from triggering the "BUG:
    MAX_STACK_TRACE_ENTRIES too low!" complaint:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            for (;;) {
                    int fd = open("/dev/infiniband/rdma_cm", O_RDWR);
                    close(fd);
            }
    }

    Suggested-by: Peter Zijlstra
    Reported-by: Eric Biggers
    Signed-off-by: Bart Van Assche
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Yuyang Du
    Link: https://lkml.kernel.org/r/20190722182443.216015-4-bvanassche@acm.org
    Signed-off-by: Ingo Molnar

    Bart Van Assche
     
  • This patch does not change the behavior of the lockdep code.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20190722182443.216015-2-bvanassche@acm.org
    Signed-off-by: Ingo Molnar

    Bart Van Assche
     
    An uninitialized/zeroed mutex will go unnoticed because there is no
    check for it. There is a magic check in the unlock slowpath which
    might go unnoticed if the unlock happens in the fastpath.

    Add a ->magic check early in the mutex_lock() and mutex_trylock() path.
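
    A standalone sketch of the ->magic self-check pattern (toy types and an
    assert() standing in for the real CONFIG_DEBUG_MUTEXES machinery): the
    debug field points back at the lock itself, so a zeroed or never
    initialized lock is caught on the very first lock attempt.

      #include <assert.h>
      #include <string.h>

      struct toy_mutex {
              int locked;
              void *magic;    /* debug: points back at the mutex once initialized */
      };

      static void toy_mutex_init(struct toy_mutex *m)
      {
              m->locked = 0;
              m->magic = m;
      }

      static void toy_mutex_lock(struct toy_mutex *m)
      {
              /* Early check, analogous to the one added here: catch
               * uninitialized/zeroed mutexes even on the fastpath. */
              assert(m->magic == m && "lock on uninitialized mutex");
              m->locked = 1;
      }

      int main(void)
      {
              struct toy_mutex good, bad;

              toy_mutex_init(&good);
              toy_mutex_lock(&good);          /* fine */

              memset(&bad, 0, sizeof(bad));   /* never initialized */
              toy_mutex_lock(&bad);           /* the assertion fires here */
              return 0;
      }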

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190703092125.lsdf4gpsh2plhavb@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     
  • As Will Deacon points out, CONFIG_PROVE_LOCKING implies TRACE_IRQFLAGS,
    so the conditions I added in the previous patch, and some others in the
    same file can be simplified by only checking for the former.

    No functional change.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Bart Van Assche
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Yuyang Du
    Fixes: 886532aee3cd ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
    Link: https://lkml.kernel.org/r/20190628102919.2345242-1-arnd@arndb.de
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     
  • The usage is now hidden in an #ifdef, so we need to move
    the variable itself in there as well to avoid this warning:

    kernel/locking/lockdep_proc.c:203:21: error: unused variable 'class' [-Werror,-Wunused-variable]

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Bart Van Assche
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Will Deacon
    Cc: Yuyang Du
    Cc: frederic@kernel.org
    Fixes: 68d41d8c94a3 ("locking/lockdep: Fix lock used or unused stats error")
    Link: https://lkml.kernel.org/r/20190715092809.736834-1-arnd@arndb.de
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     
    Since we just reviewed read_slowpath for ACQUIRE correctness, add a
    few comments to retain our findings.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While reviewing another read_slowpath patch, both Will and I noticed
    another missing ACQUIRE, namely:

    X = 0;

    CPU0                                    CPU1

    rwsem_down_read()
      for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);

                                            X = 1;
                                            rwsem_up_write();
                                              rwsem_mark_wake()
                                                atomic_long_add(adjustment, &sem->count);
                                                smp_store_release(&waiter->task, NULL);

        if (!waiter.task)
          break;

        ...
      }

    r = X;

    Allows 'r == 0'.
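
    As a userspace C11 analogue of the ordering requirement above (an
    illustration, not the kernel fix itself): the waiter's load of
    waiter->task needs ACQUIRE semantics to pair with the waker's
    smp_store_release(), otherwise the subsequent read of X may observe the
    old value.

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static int X;                              /* data published by the waker */
      static atomic_uintptr_t waiter_task = 1;   /* non-NULL while still queued */

      static void *waker(void *arg)
      {
              X = 1;                             /* publish the data ...          */
              atomic_store_explicit(&waiter_task, 0,
                                    memory_order_release); /* ... then release the
                                                            * waiter (store-release) */
              return NULL;
      }

      static void *waiter(void *arg)
      {
              /* Must be an ACQUIRE load to pair with the release above; with a
               * plain/relaxed load, reading X == 0 after the loop is allowed. */
              while (atomic_load_explicit(&waiter_task, memory_order_acquire))
                      ;
              printf("r = %d\n", X);   /* guaranteed 1 with acquire/release pairing */
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              pthread_create(&a, NULL, waiter, NULL);
              pthread_create(&b, NULL, waker, NULL);
              pthread_join(a, NULL);
              pthread_join(b, NULL);
              return 0;
      }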

    Reported-by: Peter Zijlstra (Intel)
    Reported-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    LTP mtest06 has been observed to occasionally hit "still mapped when
    deleted" and the following BUG_ON on arm64.

    The extra mapcount originated from pagefault handler, which handled
    pagefault for vma that has already been detached. vma is detached
    under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
    also invalidates vmacache.

    When the pagefault handler (under mmap_sem read lock) calls
    find_vma(), vmacache_valid() wrongly reports vmacache as valid.

    After rwsem down_read() returns via 'queue empty' path (as of v5.2),
    it does so without an ACQUIRE on sem->count:

    down_read()
      __down_read()
        rwsem_down_read_failed()
          __rwsem_down_read_failed_common()
            raw_spin_lock_irq(&sem->wait_lock);
            if (list_empty(&sem->wait_list)) {
              if (atomic_long_read(&sem->count) >= 0) {
                raw_spin_unlock_irq(&sem->wait_lock);
                return sem;

    The problem can be reproduced by running LTP mtest06 in a loop and
    building the kernel (-j $NCPUS) in parallel. It has been reproducible
    since v4.20 on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes),
    triggering reliably in about an hour.

    The patched kernel ran fine for 10+ hours.

    Signed-off-by: Jan Stancek
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Will Deacon
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dbueso@suse.de
    Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
    Link: https://lkml.kernel.org/r/50b8914e20d1d62bb2dee42d342836c2c16ebee7.1563438048.git.jstancek@redhat.com
    Signed-off-by: Ingo Molnar

    Jan Stancek
     
    For a writer, the owner value is cleared on unlock. For a reader, it is
    left intact on unlock to provide a better debugging aid on crash dumps,
    and because the unlock of one reader does not mean the lock is free.

    As a result, owner_on_cpu() shouldn't be used on a read-owner, as the
    task pointer value may not be valid and the task might have been freed.
    That is the case in rwsem_spin_on_owner(), but not in
    rwsem_can_spin_on_owner(). This can lead to a use-after-free error
    reported by KASAN. For example,

    BUG: KASAN: use-after-free in rwsem_down_write_slowpath
    (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669
    /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)

    Fix this by checking for RWSEM_READER_OWNED flag before calling
    owner_on_cpu().

    Reported-by: Luis Henriques
    Tested-by: Luis Henriques
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Jeff Layton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Fixes: 94a9717b3c40e ("locking/rwsem: Make rwsem->owner an atomic_long_t")
    Link: https://lkml.kernel.org/r/81e82d5b-5074-77e8-7204-28479bbe0df0@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

15 Jul, 2019

1 commit

    Convert the locking documents to ReST and add them to the kernel
    development book where they belong.

    Most of the work here is just making Sphinx properly parse the text
    files; they're already in good shape and don't require massive changes
    in order to be parsed.

    The conversion is actually:
    - add blank lines and indentation in order to identify paragraphs;
    - fix table markups;
    - add some list markups;
    - mark literal blocks;
    - adjust title markups.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Federico Vaga

    Mauro Carvalho Chehab
     

13 Jul, 2019

1 commit

    The stats variable nr_unused_locks is incremented every time a new lock
    class is registered and decremented when the lock is first used in
    __lock_acquire(). It is then shown and checked in lockdep_stats.

    However, under configurations where either CONFIG_TRACE_IRQFLAGS or
    CONFIG_PROVE_LOCKING is not defined:

    The commit:

    091806515124b20 ("locking/lockdep: Consolidate lock usage bit initialization")

    missed marking the LOCK_USED flag at IRQ usage initialization because
    mark_usage() is not called. And the commit:

    886532aee3cd42d ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")

    further left mark_lock() undefined, so that LOCK_USED cannot be marked
    at all when the lock is first acquired.

    As a result, fix this by not showing and checking these stats in
    lockdep_stats under such configurations.
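
    A compile-time sketch of the resulting behaviour, with toy macros
    standing in for the kernel CONFIG_* symbols (build with
    -DTRACE_IRQFLAGS -DPROVE_LOCKING to model both options enabled):

      #include <stdio.h>

      int main(void)
      {
              printf(" lock classes: %11d\n", 42);
      #if defined(TRACE_IRQFLAGS) && defined(PROVE_LOCKING)
              /* Only meaningful when LOCK_USED is actually marked on first use,
               * so the stat is shown (and checked) only in this configuration. */
              unsigned long nr_unused = 3;
              printf(" unused locks: %11lu\n", nr_unused);
      #endif
              return 0;
      }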

    Reported-by: Qian Cai
    Signed-off-by: Yuyang Du
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: arnd@arndb.de
    Cc: frederic@kernel.org
    Link: https://lkml.kernel.org/r/20190709101522.9117-1-duyuyang@gmail.com
    Signed-off-by: Ingo Molnar

    Yuyang Du
     

09 Jul, 2019

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - rwsem scalability improvements, phase #2, by Waiman Long, which are
    rather impressive:

    "On a 2-socket 40-core 80-thread Skylake system with 40 reader
    and writer locking threads, the min/mean/max locking operations
    done in a 5-second testing window before the patchset were:

    40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
    40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

    After the patchset, they became:

    40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
    40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

    There's a lot of changes to the locking implementation that makes
    it similar to qrwlock, including owner handoff for more fair
    locking.

    Another microbenchmark shows how across the spectrum the
    improvements are:

    "With a locking microbenchmark running on 5.1 based kernel, the
    total locking rates (in kops/s) on a 2-socket Skylake system
    with equal numbers of readers and writers (mixed) before and
    after this patchset were:

    # of Threads   Before Patch   After Patch
    ------------   ------------   -----------
         2             2,618         4,193
         4             1,202         3,726
         8               802         3,622
        16               729         3,359
        32               319         2,826
        64               102         2,744"

    The changes are extensive and the patch-set has been through
    several iterations addressing various locking workloads. There
    might be more regressions, but unless they are pathological I
    believe we want to use this new implementation as the baseline
    going forward.

    - jump-label optimizations by Daniel Bristot de Oliveira: the primary
    motivation was to remove IPI disturbance of isolated RT-workload
    CPUs, which resulted in the implementation of batched jump-label
    updates. Beyond the improvement of the real-time characteristics of
    the kernel, in one test this patchset improved static key update
    overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
    as well.

    - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
    ~10 years of atomic64_t existence the various types used by the
    APIs only had to be self-consistent within each architecture -
    which means they became wildly inconsistent across architectures.
    Mark puts an end to this by reworking all the atomic64
    implementations to use 's64' as the base type for atomic64_t, and
    to ensure that this type is consistently used for parameters and
    return values in the API, avoiding further problems in this area.

    - A large set of small improvements to lockdep by Yuyang Du: type
    cleanups, output cleanups, function return type and other cleanups
    all around the place.

    - A set of percpu ops cleanups and fixes by Peter Zijlstra.

    - Misc other changes - please see the Git log for more details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
    locking/lockdep: increase size of counters for lockdep statistics
    locking/atomics: Use sed(1) instead of non-standard head(1) option
    locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
    x86/jump_label: Make tp_vec_nr static
    x86/percpu: Optimize raw_cpu_xchg()
    x86/percpu, sched/fair: Avoid local_clock()
    x86/percpu, x86/irq: Relax {set,get}_irq_regs()
    x86/percpu: Relax smp_processor_id()
    x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
    locking/rwsem: Guard against making count negative
    locking/rwsem: Adaptive disabling of reader optimistic spinning
    locking/rwsem: Enable time-based spinning on reader-owned rwsem
    locking/rwsem: Make rwsem->owner an atomic_long_t
    locking/rwsem: Enable readers spinning on writer
    locking/rwsem: Clarify usage of owner's nonspinaable bit
    locking/rwsem: Wake up almost all readers in wait queue
    locking/rwsem: More optimal RT task handling of null owner
    locking/rwsem: Always release wait_lock before waking up tasks
    locking/rwsem: Implement lock handoff to prevent lock starvation
    locking/rwsem: Make rwsem_spin_on_owner() return owner state
    ...

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • …k/linux-rcu into core/rcu

    Pull rcu/next + tools/memory-model changes from Paul E. McKenney:

    - RCU flavor consolidation cleanups and optimizations
    - Documentation updates
    - Miscellaneous fixes
    - SRCU updates
    - RCU-sync flavor consolidation
    - Torture-test updates
    - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

25 Jun, 2019

2 commits

    When the system has been running for a long time, signed integer
    counters are not enough for some lockdep statistics. Using
    unsigned long counters can satisfy the requirement. Besides,
    most of the lockdep statistics are unsigned, so it is better to use
    unsigned int instead of int.

    Remove unused variables.
    - max_recursion_depth
    - nr_cyclic_check_recursions
    - nr_find_usage_forwards_recursions
    - nr_find_usage_backwards_recursions

    Signed-off-by: Kobe Wu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc:
    Cc: Eason Lin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/1561365348-16050-1-git-send-email-kobe-cp.wu@mediatek.com
    Signed-off-by: Ingo Molnar

    Kobe Wu
     
  • The last cleanup patch triggered another issue, as now another function
    should be moved into the same section:

    kernel/locking/lockdep.c:3580:12: error: 'mark_lock' defined but not used [-Werror=unused-function]
    static int mark_lock(struct task_struct *curr, struct held_lock *this,

    Move mark_lock() into the same #ifdef section as its only caller, and
    remove the now-unused mark_lock_irq() stub helper.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Bart Van Assche
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Yuyang Du
    Fixes: 0d2cc3b34532 ("locking/lockdep: Move valid_state() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
    Link: https://lkml.kernel.org/r/20190617124718.1232976-1-arnd@arndb.de
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     

17 Jun, 2019

8 commits

    The upper bits of the count field are used as the reader count. When a
    sufficient number of active readers are present, the most significant
    bit will be set and the count becomes negative. If the number of active
    readers keeps piling up, we may eventually overflow the reader count.
    This is not likely to happen unless the number of bits reserved for the
    reader count is reduced because those bits are needed for other purposes.

    To prevent this count overflow from happening, the most significant
    bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
    attempts will now fail for both the fast and slow paths whenever this
    bit is set. So all those extra readers will be put to sleep in the wait
    list. Wakeup will not happen until the reader count reaches 0.
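
    A toy standalone model of the guard-bit idea (a 16-bit count and made-up
    constants, not the kernel's real layout): each reader adds a bias to the
    count, and once the most significant bit is reached further read-lock
    attempts see the guard bit and fail instead of wrapping the reader field.

      #include <stdint.h>
      #include <stdio.h>

      #define READER_BIAS (1u << 3)    /* low bits reserved for flags in this toy */
      #define READ_FAIL   (1u << 15)   /* most significant bit of the toy count */

      static uint16_t count;

      static int read_trylock(void)
      {
              uint16_t new = (uint16_t)(count + READER_BIAS);

              /* Guard bit set: too many readers, fail and queue up instead
               * of letting the reader count overflow into the flag bits. */
              if (new & READ_FAIL)
                      return 0;
              count = new;
              return 1;
      }

      int main(void)
      {
              unsigned long readers = 0;

              while (read_trylock())
                      readers++;
              printf("accepted %lu readers before the guard bit blocked further read locks\n",
                     readers);
              return 0;
      }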

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-17-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    Reader optimistic spinning is helpful when the reader critical section
    is short and there aren't that many readers around. It makes readers
    relatively more preferred than writers. When a writer times out spinning
    on a reader-owned lock and sets the nonspinnable bits, there are two main
    reasons for that:

    1) The reader critical section is long, perhaps the task sleeps after
    acquiring the read lock.
    2) There are just too many readers contending the lock, causing it to
    take a while to service all of them.

    In the former case, a long reader critical section will impede the progress
    of writers, which is usually more important for system performance.
    In the latter case, reader optimistic spinning tends to make the reader
    groups (readers that acquire the lock together) smaller, leading to more
    of them. That may hurt performance in some cases. In other words, the
    setting of the nonspinnable bits indicates that reader optimistic spinning
    may not be helpful for the workloads that cause it.

    Therefore, any writers that have observed the setting of the writer
    nonspinnable bit for a given rwsem after they fail to acquire the lock
    via optimistic spinning will set the reader nonspinnable bit once they
    acquire the write lock. Similarly, readers that observe the setting
    of reader nonspinnable bit at slowpath entry will also set the reader
    nonspinnable bit when they acquire the read lock via the wakeup path.

    Once the reader nonspinnable bit is on, it will only be reset when
    a writer is able to acquire the rwsem in the fast path or somehow a
    reader or writer in the slowpath doesn't observe the nonspinnable bit.

    This is to discourage reader optimistic spinning on that particular
    rwsem and make writers more preferred. This adaptive disabling of reader
    optimistic spinning will alleviate some of the negative side effects of
    this feature.

    In addition, this patch tries to make readers in the spinning queue
    follow the phase-fair principle after quitting optimistic spinning
    by checking if another reader has somehow acquired a read lock after
    this reader enters the optimistic spinning queue. If so and the rwsem
    is still reader-owned, this reader is in the right read-phase and can
    attempt to acquire the lock.

    On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
    the will-it-scale benchmark was run with various number of threads. The
    number of operations done before reader optimistic spinning patches,
    this patch and after this patch were:

    Threads   Before rspin   Before patch   After patch   %change
    -------   ------------   ------------   -----------   -------
      20         5541068        5345484        5455667     -3.5%/ +2.1%
      40        10185150        7292313        9219276    -28.5%/+26.4%
      60         8196733        6460517        7181209    -21.2%/+11.2%
      80         9508864        6739559        8107025    -29.1%/+20.3%

    This patch doesn't recover all the lost performance, but it is more
    than half. Given the fact that reader optimistic spinning does benefit
    some workloads, this is a good compromise.

    Using the rwsem locking microbenchmark with very short critical section,
    this patch doesn't have too much impact on locking performance as shown
    by the locking rates (kops/s) below with equal numbers of readers and
    writers before and after this patch:

    # of Threads   Pre-patch   Post-patch
    ------------   ---------   ----------
          2           4,730       4,969
          4           4,814       4,786
          8           4,866       4,815
         16           4,715       4,511
         32           3,338       3,500
         64           3,212       3,389
         80           3,110       3,044

    When running the locking microbenchmark with 40 dedicated reader and writer
    threads, however, the reader performance is curtailed to favor the writer.

    Before patch:

    40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
    40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644

    After patch:

    40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
    40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • When the rwsem is owned by reader, writers stop optimistic spinning
    simply because there is no easy way to figure out if all the readers
    are actively running or not. However, there are scenarios where
    the readers are unlikely to sleep and optimistic spinning can help
    performance.

    This patch provides a simple mechanism for spinning on a reader-owned
    rwsem by a writer. It is a time threshold based spinning where the
    allowable spinning time can vary from 10us to 25us depending on the
    condition of the rwsem.

    When the time threshold is exceeded, the nonspinnable bits will be set
    in the owner field to indicate that no more optimistic spinning will
    be allowed on this rwsem until it becomes writer-owned again. Not even
    readers are allowed to acquire the reader-locked rwsem via optimistic
    spinning, for fairness.

    We also want a writer to acquire the lock after the readers hold the
    lock for a relatively long time. In order to give preference to writers
    under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
    into two - one for readers and one for writers. When optimistic spinning
    is disabled, both bits will be set. When the reader count drops down
    to 0, the writer nonspinnable bit will be cleared to allow writers to
    spin on the lock, but not the readers. When a writer acquires the lock,
    it will write its own task structure pointer into sem->owner and clear
    the reader nonspinnable bit in the process.
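
    As a simplified userspace sketch of the time-threshold idea (the budget,
    lock word and flag handling below are illustrative, not the kernel
    implementation): spin on the reader count for a bounded time, and if the
    budget expires mark the lock nonspinnable so later spinners sleep instead.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      #define SPIN_BUDGET_NS 25000   /* ~25us, the upper bound mentioned above */

      static atomic_int reader_count = 5;   /* pretend the rwsem stays reader-owned */
      static atomic_bool nonspinnable;

      static long long now_ns(void)
      {
              struct timespec ts;

              clock_gettime(CLOCK_MONOTONIC, &ts);
              return ts.tv_sec * 1000000000LL + ts.tv_nsec;
      }

      /* Returns true if the writer saw the lock become free within its budget. */
      static bool writer_spin_on_readers(void)
      {
              long long deadline = now_ns() + SPIN_BUDGET_NS;

              while (atomic_load(&reader_count) != 0) {
                      if (now_ns() > deadline) {
                              /* Budget exhausted: disable optimistic spinning
                               * on this lock and go to sleep instead. */
                              atomic_store(&nonspinnable, true);
                              return false;
                      }
              }
              return true;
      }

      int main(void)
      {
              bool got_it = writer_spin_on_readers();

              printf("spin %s, nonspinnable=%d\n",
                     got_it ? "succeeded" : "timed out",
                     (int)atomic_load(&nonspinnable));
              return 0;
      }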

    The time taken for each iteration of the reader-owned rwsem spinning
    loop varies. Below are sample minimum elapsed times for 16 iterations
    of the loop.

    System                       Time for 16 Iterations
    ------                       ----------------------
    1-socket Skylake                     ~800ns
    4-socket Broadwell                   ~300ns
    2-socket ThunderX2 (arm64)           ~250ns

    When the lock cacheline is contended, we can see up to almost 10X
    increase in elapsed time. So 25us will be at most 500, 1300 and 1600
    iterations for each of the above systems.

    With a locking microbenchmark running on a 5.1 based kernel, the total
    locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
    equal numbers of readers and writers before and after this patch were
    as follows:

    # of Threads   Pre-patch   Post-patch
    ------------   ---------   ----------
          2           1,759       6,684
          4           1,684       6,738
          8           1,074       7,222
         16             900       7,163
         32             458       7,316
         64             208         520
        128             168         425
        240             143         474

    This patch gives a big boost in performance for mixed reader/writer
    workloads.

    With 32 locking threads, the rwsem lock event data were:

    rwsem_opt_fail=79850
    rwsem_opt_nospin=5069
    rwsem_opt_rlock=597484
    rwsem_opt_wlock=957339
    rwsem_sleep_reader=57782
    rwsem_sleep_writer=55663

    With 64 locking threads, the data looked like:

    rwsem_opt_fail=346723
    rwsem_opt_nospin=6293
    rwsem_opt_rlock=1127119
    rwsem_opt_wlock=1400628
    rwsem_sleep_reader=308201
    rwsem_sleep_writer=72281

    So a lot more threads acquired the lock in the slowpath and more threads
    went to sleep.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    The rwsem->owner field contains not just the task structure pointer; it
    also holds some flags for storing the current state of the rwsem. Some of
    the flags may have to be atomically updated. To reflect the new reality,
    the owner field is now changed to an atomic_long_t type.

    New helper functions are added to properly separate out the task
    structure pointer and the embedded flags.
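
    A standalone sketch of the packed owner word (the flag values and helper
    names below are illustrative, not the kernel's definitions): since task
    structures are well aligned, the low bits of the pointer are free to hold
    state flags, and small helpers split the word back apart.

      #include <stdalign.h>
      #include <stdio.h>

      #define OWNER_READER       0x1UL   /* illustrative flags stored in the */
      #define OWNER_NONSPINNABLE 0x2UL   /* always-zero low bits of the pointer */
      #define OWNER_FLAG_MASK    (OWNER_READER | OWNER_NONSPINNABLE)

      struct task { char name[16]; };

      /* Pack a task pointer together with its state flags into one word. */
      static unsigned long owner_pack(struct task *t, unsigned long flags)
      {
              return (unsigned long)t | (flags & OWNER_FLAG_MASK);
      }

      /* Helpers in the spirit the changelog describes: split the word apart. */
      static struct task *owner_task(unsigned long owner)
      {
              return (struct task *)(owner & ~OWNER_FLAG_MASK);
      }

      static unsigned long owner_flags(unsigned long owner)
      {
              return owner & OWNER_FLAG_MASK;
      }

      int main(void)
      {
              /* Over-align the toy task so its low bits are guaranteed zero. */
              static alignas(8) struct task me = { "writer0" };
              unsigned long owner = owner_pack(&me, OWNER_NONSPINNABLE);

              printf("task=%s flags=%#lx\n", owner_task(owner)->name,
                     owner_flags(owner));
              return 0;
      }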

    Suggested-by: Peter Zijlstra
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-14-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch enables readers to optimistically spin on a
    rwsem when it is owned by a writer instead of going to sleep
    directly. The rwsem_can_spin_on_owner() function is extracted
    out of rwsem_optimistic_spin() and is called directly by
    rwsem_down_read_slowpath() and rwsem_down_write_slowpath().

    With a locking microbenchmark running on a 5.1 based kernel, the total
    locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
    numbers of readers and writers before and after the patch were as
    follows:

    # of Threads   Pre-patch   Post-patch
    ------------   ---------   ----------
          4           1,674       1,684
          8           1,062       1,074
         16             924         900
         32             300         458
         64             195         208
        128             164         168
        240             149         143

    The performance change wasn't significant in this case, but this change
    is required by a follow-on patch.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-13-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
    anonymous owner - readers or an anonymous writer. The setting of this
    anonymous bit is used as an indicator that optimistic spinning cannot
    be done on this rwsem.

    With the upcoming reader optimistic spinning patches, a reader-owned
    rwsem can be spun on for a limited period of time. We still need
    this bit to indicate that a rwsem is nonspinnable, but its absence no
    longer means that the owner is known. So rename the bit
    to RWSEM_NONSPINNABLE to clarify its meaning.

    This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-12-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    When the front of the wait queue is a reader, other readers
    immediately following the first reader will also be woken up at the
    same time. However, if there is a writer in between, the readers
    behind that writer will not be woken up.

    Because of optimistic spinning, the lock acquisition order is not FIFO
    anyway. The lock handoff mechanism will ensure that lock starvation
    will not happen.

    Assuming that the lock hold times of the other readers still in the
    queue will be about the same as the readers that are being woken up,
    there is really not much additional cost other than the additional
    latency due to the wakeup of additional tasks by the waker. Therefore
    all the readers up to a maximum of 256 in the queue are woken up when
    the first waiter is a reader to improve reader throughput. This is
    somewhat similar in concept to a phase-fair R/W lock.

    With a locking microbenchmark running on a 5.1 based kernel, the total
    locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
    equal numbers of readers and writers before and after this patch were
    as follows:

    # of Threads   Pre-Patch   Post-patch
    ------------   ---------   ----------
          4           1,641       1,674
          8             731       1,062
         16             564         924
         32              78         300
         64              38         195
        240              50         149

    There is no performance gain at low contention level. At high contention
    level, however, this patch gives a pretty decent performance boost.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-11-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    An RT task can do optimistic spinning only if the lock holder is
    actually running. If the state of the lock holder isn't known, there
    is a possibility that the high priority of the RT task may block the
    forward progress of the lock holder if it happens to reside on the same
    CPU. This will lead to deadlock. So we have to make sure that an RT task
    will not spin on a reader-owned rwsem.

    When the owner is temporarily set to NULL, there are two cases
    where we may want to continue spinning:

    1) The lock owner is in the process of releasing the lock, sem->owner
    is cleared but the lock has not been released yet.

    2) The lock was free and the owner cleared, but another task just comes
    in and acquires the lock before we try to get it. The new owner may
    be a spinnable writer.

    So an RT task is now made to retry one more time to see if it can
    acquire the lock or continue spinning on the new owning writer.

    When testing on an 8-socket IvyBridge-EX system, the one additional retry
    seems to improve the locking performance of RT write locking threads under
    heavy contention. The table below shows the locking rates (in kops/s)
    with various numbers of write locking threads before and after the patch.

    Locking threads   Pre-patch   Post-patch
    ---------------   ---------   ----------
           4            2,753       2,608
           8            2,529       2,520
          16            1,727       1,918
          32            1,263       1,956
          64              889       1,343

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-10-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long