05 Mar, 2020

1 commit

  • commit a030f9767da1a6bbcec840fc54770eb11c2414b6 upstream.

    It was found that two lines in the output of /proc/lockdep_stats have
    an indentation problem:

    # cat /proc/lockdep_stats
     :
    in-process chains:               25057
    stack-trace entries:            137827 [max: 524288]
    number of stack traces:        7973
    number of stack hash chains:   6355
    combined max dependencies:  1356414598
    hardirq-safe locks:                 57
    hardirq-unsafe locks:             1286
     :

    All the numbers displayed in /proc/lockdep_stats except the two stack
    trace numbers are formatted with a field width of 11. To properly align
    all the numbers, a field width of 11 is now applied to the two stack
    trace numbers as well.
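
    For illustration only, a small standalone C program (not the kernel
    diff itself) showing how a printf-style field width of 11 produces the
    desired right alignment; /proc/lockdep_stats is generated with the
    kernel's seq_printf(), which uses the same format conventions:

      #include <stdio.h>

      int main(void)
      {
              /* "%11lu" right-justifies each value in an 11-character field,
               * so the numbers line up in one column regardless of how many
               * digits they have. */
              printf("%-28s%11lu\n", " in-process chains:", 25057UL);
              printf("%-28s%11lu\n", " number of stack traces:", 7973UL);
              printf("%-28s%11lu\n", " combined max dependencies:", 1356414598UL);
              return 0;
      }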

    Fixes: 8c779229d0f4 ("locking/lockdep: Report more stack trace statistics")
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Bart Van Assche
    Link: https://lkml.kernel.org/r/20191211213139.29934-1-longman@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

23 Jan, 2020

2 commits

  • commit d91f3057263ceb691ef527e71b41a56b17f6c869 upstream.

    If the lockdep code really runs out of stack_trace entries, it is
    likely that a buffer overrun can happen and the data immediately
    after stack_trace[] will be corrupted.

    If there are fewer than LOCK_TRACE_SIZE_IN_LONGS entries left before
    the call to save_trace(), the max_entries computation will leave it
    with a very large positive number because of its unsigned nature. The
    subsequent call to stack_trace_save() will then corrupt the data after
    stack_trace[]. Fix that by changing max_entries to a signed integer
    and checking for a negative value before calling stack_trace_save().
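
    The failure mode can be illustrated with a standalone sketch (the
    constants and names below are stand-ins, not the lockdep code): an
    unsigned subtraction silently wraps to a huge value, while a signed
    intermediate lets the caller detect that it has run out of entries.

      #include <stdio.h>

      #define MAX_ENTRIES      16UL  /* stand-in for MAX_STACK_TRACE_ENTRIES */
      #define TRACE_SIZE_LONGS  4UL  /* stand-in for LOCK_TRACE_SIZE_IN_LONGS */

      int main(void)
      {
              unsigned long nr_entries = 14;      /* only 2 slots left */

              /* Buggy pattern: unsigned arithmetic wraps around. */
              unsigned long bad_max = MAX_ENTRIES - nr_entries - TRACE_SIZE_LONGS;
              printf("unsigned max_entries = %lu (bogus, would overrun)\n", bad_max);

              /* Fixed pattern: compute with a signed type and bail out when
               * the result goes negative instead of saving the trace. */
              long max_entries = (long)(MAX_ENTRIES - nr_entries) - (long)TRACE_SIZE_LONGS;
              if (max_entries <= 0)
                      printf("signed max_entries = %ld -> out of entries\n", max_entries);
              return 0;
      }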

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Bart Van Assche
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 12593b7467f9 ("locking/lockdep: Reduce space occupied by stack traces")
    Link: https://lkml.kernel.org/r/20191220135128.14876-1-longman@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     
  • commit 39e7234f00bc93613c086ae42d852d5f4147120a upstream.

    Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer
    optimistically spin on owner") allows a recently woken up waiting
    writer to spin on the owner. Unfortunately, if the owner happens to be
    RWSEM_OWNER_UNKNOWN, the code will incorrectly spin on it, leading to a
    kernel crash. This is fixed by passing the proper non-spinnable bits
    to rwsem_spin_on_owner() so that RWSEM_OWNER_UNKNOWN will be treated
    as a non-spinnable target.

    Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")

    Reported-by: Christoph Hellwig
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200115154336.8679-1-longman@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

12 Jan, 2020

1 commit

  • [ Upstream commit 1a365e822372ba24c9da0822bc583894f6f3d821 ]

    This fixes various data races in spinlock_debug. When testing with
    KCSAN, the console gets spammed with data race reports, suggesting
    these races are extremely frequent.

    Example data race report:

    read to 0xffff8ab24f403c48 of 4 bytes by task 221 on cpu 2:
    debug_spin_lock_before kernel/locking/spinlock_debug.c:85 [inline]
    do_raw_spin_lock+0x9b/0x210 kernel/locking/spinlock_debug.c:112
    __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
    _raw_spin_lock+0x39/0x40 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:338 [inline]
    get_partial_node.isra.0.part.0+0x32/0x2f0 mm/slub.c:1873
    get_partial_node mm/slub.c:1870 [inline]

    write to 0xffff8ab24f403c48 of 4 bytes by task 167 on cpu 3:
    debug_spin_unlock kernel/locking/spinlock_debug.c:103 [inline]
    do_raw_spin_unlock+0xc9/0x1a0 kernel/locking/spinlock_debug.c:138
    __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:159 [inline]
    _raw_spin_unlock_irqrestore+0x2d/0x50 kernel/locking/spinlock.c:191
    spin_unlock_irqrestore include/linux/spinlock.h:393 [inline]
    free_debug_processing+0x1b3/0x210 mm/slub.c:1214
    __slab_free+0x292/0x400 mm/slub.c:2864

    As a side-effect, with KCSAN, this eventually locks up the console, most
    likely due to deadlock, e.g. .. -> printk lock -> spinlock_debug ->
    KCSAN detects data race -> kcsan_print_report() -> printk lock ->
    deadlock.

    This fix will 1) avoid the data races, and 2) allow using lock debugging
    together with KCSAN.
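
    The changelog does not spell out the mechanism, but the usual cure for
    benign debug-field races of this kind is to mark the racy accesses (with
    the kernel's READ_ONCE()/WRITE_ONCE() annotations) so that the compiler
    and KCSAN treat them as intentional. A rough userspace analogue using
    relaxed C11 atomics:

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      /* Debug-only bookkeeping touched from several threads without a common
       * lock. Marking the accesses (relaxed atomics here, READ_ONCE()/
       * WRITE_ONCE() in the kernel) keeps the race benign and well-defined. */
      static atomic_int owner_cpu = -1;

      static void *locker(void *arg)
      {
              atomic_store_explicit(&owner_cpu, 1, memory_order_relaxed);
              atomic_store_explicit(&owner_cpu, -1, memory_order_relaxed);
              return NULL;
      }

      static void *checker(void *arg)
      {
              int cpu = atomic_load_explicit(&owner_cpu, memory_order_relaxed);
              printf("debug check saw owner_cpu=%d\n", cpu);
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              pthread_create(&a, NULL, locker, NULL);
              pthread_create(&b, NULL, checker, NULL);
              pthread_join(a, NULL);
              pthread_join(b, NULL);
              return 0;
      }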

    Reported-by: Qian Cai
    Signed-off-by: Marco Elver
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20191120155715.28089-1-elver@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Marco Elver
     

25 Sep, 2019

1 commit

    This patch reverts commit 75437bb304b20 (locking/pvqspinlock: Don't
    wait if vCPU is preempted), which caused a large performance regression
    in over-subscription scenarios.

    The test was run on a Xeon Skylake box, 2 sockets, 40 cores, 80 threads,
    with three VMs of 80 vCPUs each. The score of ebizzy -M is reduced from
    13000-14000 records/s to 1700-1800 records/s:

    Host                            Guest      score

    vanilla w/o kvm optimizations   upstream   1700-1800 records/s
    vanilla w/o kvm optimizations   revert     13000-14000 records/s
    vanilla w/ kvm optimizations    upstream   4500-5000 records/s
    vanilla w/ kvm optimizations    revert     14000-15500 records/s

    Exit from aggressive wait-early mechanism can result in premature yield
    and extra scheduling latency.

    Actually, only 6% of wait_early events are caused by vcpu_is_preempted()
    being true. However, when one vCPU voluntarily releases its CPU, all the
    subsequent waiters in the queue will do the same, and the cascading
    effect leads to bad performance.

    kvm optimizations:
    [1] commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts)
    [2] commit 266e85a5ec9 (KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption)

    Tested-by: loobinliu@tencent.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Waiman Long
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: loobinliu@tencent.com
    Cc: stable@vger.kernel.org
    Fixes: 75437bb304b20 (locking/pvqspinlock: Don't wait if vCPU is preempted)
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

17 Sep, 2019

2 commits

  • Pull scheduler updates from Ingo Molnar:

    - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
    Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
    Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

    As perf and the scheduler are getting bigger and more complex,
    document the status quo of current responsibilities and interests,
    and spread the review pain^H^H^H^H fun via an increase in the Cc:
    linecount generated by scripts/get_maintainer.pl. :-)

    - Add another series of patches that brings the -rt (PREEMPT_RT) tree
    closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
    into a new CONFIG_PREEMPTION category that will allow the eventual
    introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
    to go though.

    - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
    allow the finer shaping of CPU bandwidth usage.

    - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

    - Improve the behavior of high CPU count, high thread count
    applications running under cpu.cfs_quota_us constraints.

    - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

    - Improve CPU isolation housekeeping CPU allocation NUMA locality.

    - Fix deadline scheduler bandwidth calculations and logic when cpusets
    rebuilds the topology, or when it gets deadline-throttled while it's
    being offlined.

    - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
    setscheduler() system calls without creating global serialization.
    Add new synchronization between cpuset topology-changing events and
    the deadline acceptance tests in setscheduler(), which were broken
    before.

    - Rework the active_mm state machine to be less confusing and more
    optimal.

    - Rework (simplify) the pick_next_task() slowpath.

    - Improve load-balancing on AMD EPYC systems.

    - ... and misc cleanups, smaller fixes and improvements - please see
    the Git log for more details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
    sched/psi: Correct overly pessimistic size calculation
    sched/fair: Speed-up energy-aware wake-ups
    sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
    sched/uclamp: Update CPU's refcount on TG's clamp changes
    sched/uclamp: Use TG's clamps to restrict TASK's clamps
    sched/uclamp: Propagate system defaults to the root group
    sched/uclamp: Propagate parent clamps
    sched/uclamp: Extend CPU's cgroup controller
    sched/topology: Improve load balancing on AMD EPYC systems
    arch, ia64: Make NUMA select SMP
    sched, perf: MAINTAINERS update, add submaintainers and reviewers
    sched/fair: Use rq_lock/unlock in online_fair_sched_group
    cpufreq: schedutil: fix equation in comment
    sched: Rework pick_next_task() slow-path
    sched: Allow put_prev_task() to drop rq->lock
    sched/fair: Expose newidle_balance()
    sched: Add task_struct pointer to sched_class::set_curr_task
    sched: Rework CPU hotplug task selection
    sched/{rt,deadline}: Fix set_next_task vs pick_next_task
    sched: Fix kerneldoc comment for ia64_set_curr_task
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:

    - improve rwsem scalability

    - add uninitialized rwsem debugging check

    - reduce lockdep's stacktrace memory usage and add diagnostics

    - misc cleanups, code consolidation and constification

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mutex: Fix up mutex_waiter usage
    locking/mutex: Use mutex flags macro instead of hard code
    locking/mutex: Make __mutex_owner static to mutex.c
    locking/qspinlock,x86: Clarify virt_spin_lock_key
    locking/rwsem: Check for operations on an uninitialized rwsem
    locking/rwsem: Make handoff writer optimistically spin on owner
    locking/lockdep: Report more stack trace statistics
    locking/lockdep: Reduce space occupied by stack traces
    stacktrace: Constify 'entries' arguments
    locking/lockdep: Make it clear that what lock_class::key points at is not modified

    Linus Torvalds
     

08 Aug, 2019

1 commit

    The patch moving bits into mutex.c was a little too much; by also
    moving struct mutex_waiter, a few less common CONFIGs would no longer
    build.

    Fixes: 5f35d5a66b3e ("locking/mutex: Make __mutex_owner static to mutex.c")
    Signed-off-by: Peter Zijlstra (Intel)

    Peter Zijlstra
     

06 Aug, 2019

4 commits

    Use the mutex flag macro instead of a hard-coded value inside
    __mutex_owner().

    Signed-off-by: Mukesh Ojha
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mingo@redhat.com
    Cc: will@kernel.org
    Link: https://lkml.kernel.org/r/1564585504-3543-2-git-send-email-mojha@codeaurora.org

    Mukesh Ojha
     
    __mutex_owner() should only be used by the mutex APIs. So, to enforce
    this restriction, move the __mutex_owner() function definition from
    linux/mutex.h to the mutex.c file.

    There are functions that use __mutex_owner(), like mutex_is_locked()
    and mutex_trylock_recursive(), so to keep the legacy behaviour intact
    move them as well and export them.

    Also move the mutex_waiter structure to keep it private to the file.

    Signed-off-by: Mukesh Ojha
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: mingo@redhat.com
    Cc: will@kernel.org
    Link: https://lkml.kernel.org/r/1564585504-3543-1-git-send-email-mojha@codeaurora.org

    Mukesh Ojha
     
    Currently the rwsem is the only locking primitive that lacks this
    debug feature. Add it under CONFIG_DEBUG_RWSEMS and do the magic
    checking in the locking fastpath (trylock) operation such that we
    cover all cases. The unlocking part is pretty straightforward.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: mingo@kernel.org
    Cc: Davidlohr Bueso
    Link: https://lkml.kernel.org/r/20190729044735.9632-1-dave@stgolabs.net

    Davidlohr Bueso
     
    When the handoff bit is set by a writer, no task other than the
    setting writer itself is allowed to acquire the lock. If the
    to-be-handoff'ed writer goes to sleep, there will be a wakeup latency
    period where the lock is free, but no one can acquire it. That is less
    than ideal.

    To reduce that latency, the handoff writer will now optimistically spin
    on the owner if it happens to be an on-CPU writer. It will spin until
    the owner releases the lock, at which point the to-be-handoff'ed writer
    can acquire the lock immediately without any delay. Of course, if the
    owner is not an on-CPU writer, the to-be-handoff'ed writer will have to
    sleep anyway.

    The optimistic spinning code is also modified to not stop spinning
    when the handoff bit is set. This prevents an occasional setting of the
    handoff bit from pushing a bunch of optimistic spinners into the wait
    queue and causing a significant reduction in throughput.

    On a 1-socket 22-core 44-thread Skylake system, the AIM7 shared_memory
    workload was run with 7000 users. The throughput (jobs/min) of the
    following kernels were as follows:

    1) 5.2-rc6                                     - 8,092,486
    2) 5.2-rc6 + tip's rwsem patches               - 7,567,568
    3) 5.2-rc6 + tip's rwsem patches + this patch  - 7,954,545

    Using perf-record(1), the %cpu time used by rwsem_down_write_slowpath(),
    rwsem_down_write_failed() and their callees for the 3 kernels were 1.70%,
    5.46% and 2.08% respectively.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: x86@kernel.org
    Cc: Ingo Molnar
    Cc: Will Deacon
    Cc: huang ying
    Cc: Tim Chen
    Cc: Linus Torvalds
    Cc: Borislav Petkov
    Cc: Thomas Gleixner
    Cc: Davidlohr Bueso
    Cc: "H. Peter Anvin"
    Link: https://lkml.kernel.org/r/20190625143913.24154-1-longman@redhat.com

    Waiman Long
     

25 Jul, 2019

11 commits

  • Returning the pointer that was passed in allows us to write
    slightly more idiomatic code. Convert a few users.
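
    The changelog doesn't name the converted function, so as a generic,
    self-contained illustration of the pattern (hypothetical names): when an
    init-style helper returns the pointer it was given, callers can nest the
    call instead of needing a separate statement.

      #include <stdio.h>

      struct point { int x, y; };

      /* Returning the passed-in pointer lets callers chain or nest the call. */
      static struct point *point_init(struct point *p, int x, int y)
      {
              p->x = x;
              p->y = y;
              return p;
      }

      static void print_point(const struct point *p)
      {
              printf("(%d, %d)\n", p->x, p->y);
      }

      int main(void)
      {
              struct point p;

              /* Slightly more idiomatic one-liner enabled by the return value. */
              print_point(point_init(&p, 1, 2));
              return 0;
      }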

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
    Signed-off-by: Ingo Molnar

    Matthew Wilcox (Oracle)
     
    Report the number of stack traces and the number of stack trace hash
    chains. These two numbers are useful because they allow one to estimate
    the number of stack trace hash collisions.
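
    As a worked example using the /proc/lockdep_stats sample quoted earlier
    in this log (7973 stack traces spread over 6355 hash chains): at least
    7973 - 6355 = 1618 of those traces must have hashed into a chain that
    already held another trace, which gives a rough lower bound on the
    number of collisions.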

    Signed-off-by: Bart Van Assche
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20190722182443.216015-5-bvanassche@acm.org
    Signed-off-by: Ingo Molnar

    Bart Van Assche
     
    Although commit 669de8bda87b ("kernel/workqueue: Use dynamic lockdep keys
    for workqueues") unregisters dynamic lockdep keys when a workqueue is
    destroyed, a side effect of that commit is that all stack traces
    associated with the lockdep key are leaked when a workqueue is destroyed.
    Fix this by storing each unique stack trace once. Other changes in this
    patch are:

    - Use NULL instead of { .nr_entries = 0 } to represent 'no trace'.
    - Store a pointer to a stack trace in struct lock_class and struct
      lock_list instead of storing 'nr_entries' and 'offset'.

    This patch prevents the following program from triggering the "BUG:
    MAX_STACK_TRACE_ENTRIES too low!" complaint:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            for (;;) {
                    int fd = open("/dev/infiniband/rdma_cm", O_RDWR);
                    close(fd);
            }
    }

    Suggested-by: Peter Zijlstra
    Reported-by: Eric Biggers
    Signed-off-by: Bart Van Assche
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Yuyang Du
    Link: https://lkml.kernel.org/r/20190722182443.216015-4-bvanassche@acm.org
    Signed-off-by: Ingo Molnar

    Bart Van Assche
     
  • This patch does not change the behavior of the lockdep code.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20190722182443.216015-2-bvanassche@acm.org
    Signed-off-by: Ingo Molnar

    Bart Van Assche
     
    An uninitialized/zeroed mutex will go unnoticed because there is no
    check for it. There is a magic check in the unlock slowpath which
    might go unnoticed if the unlock happens in the fastpath.

    Add a ->magic check early in the mutex_lock() and mutex_trylock() path.
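
    A standalone sketch of the ->magic self-check pattern (toy types and an
    assert() standing in for the real CONFIG_DEBUG_MUTEXES machinery): the
    debug field points back at the lock itself, so a zeroed or never
    initialized lock is caught on the very first lock attempt.

      #include <assert.h>
      #include <string.h>

      struct toy_mutex {
              int locked;
              void *magic;    /* debug: points back at the mutex once initialized */
      };

      static void toy_mutex_init(struct toy_mutex *m)
      {
              m->locked = 0;
              m->magic = m;
      }

      static void toy_mutex_lock(struct toy_mutex *m)
      {
              /* Early check, analogous to the one added here: catch
               * uninitialized/zeroed mutexes even on the fastpath. */
              assert(m->magic == m && "lock on uninitialized mutex");
              m->locked = 1;
      }

      int main(void)
      {
              struct toy_mutex good, bad;

              toy_mutex_init(&good);
              toy_mutex_lock(&good);          /* fine */

              memset(&bad, 0, sizeof(bad));   /* never initialized */
              toy_mutex_lock(&bad);           /* the assertion fires here */
              return 0;
      }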

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190703092125.lsdf4gpsh2plhavb@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     
  • As Will Deacon points out, CONFIG_PROVE_LOCKING implies TRACE_IRQFLAGS,
    so the conditions I added in the previous patch, and some others in the
    same file can be simplified by only checking for the former.

    No functional change.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Bart Van Assche
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Yuyang Du
    Fixes: 886532aee3cd ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
    Link: https://lkml.kernel.org/r/20190628102919.2345242-1-arnd@arndb.de
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     
  • The usage is now hidden in an #ifdef, so we need to move
    the variable itself in there as well to avoid this warning:

    kernel/locking/lockdep_proc.c:203:21: error: unused variable 'class' [-Werror,-Wunused-variable]

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Bart Van Assche
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Will Deacon
    Cc: Yuyang Du
    Cc: frederic@kernel.org
    Fixes: 68d41d8c94a3 ("locking/lockdep: Fix lock used or unused stats error")
    Link: https://lkml.kernel.org/r/20190715092809.736834-1-arnd@arndb.de
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     
    Since we just reviewed read_slowpath for ACQUIRE correctness, add a
    few comments to retain our findings.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While reviewing another read_slowpath patch, both Will and I noticed
    another missing ACQUIRE, namely:

    X = 0;

    CPU0                                    CPU1

    rwsem_down_read()
      for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);

                                            X = 1;
                                            rwsem_up_write();
                                              rwsem_mark_wake()
                                                atomic_long_add(adjustment, &sem->count);
                                                smp_store_release(&waiter->task, NULL);

        if (!waiter.task)
          break;

        ...
      }

    r = X;

    Allows 'r == 0'.
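
    As a userspace C11 analogue of the ordering requirement above (an
    illustration, not the kernel fix itself): the waiter's load of
    waiter->task needs ACQUIRE semantics to pair with the waker's
    smp_store_release(), otherwise the subsequent read of X may observe the
    old value.

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static int X;                              /* data published by the waker */
      static atomic_uintptr_t waiter_task = 1;   /* non-NULL while still queued */

      static void *waker(void *arg)
      {
              X = 1;                             /* publish the data ...          */
              atomic_store_explicit(&waiter_task, 0,
                                    memory_order_release); /* ... then release the
                                                            * waiter (store-release) */
              return NULL;
      }

      static void *waiter(void *arg)
      {
              /* Must be an ACQUIRE load to pair with the release above; with a
               * plain/relaxed load, reading X == 0 after the loop is allowed. */
              while (atomic_load_explicit(&waiter_task, memory_order_acquire))
                      ;
              printf("r = %d\n", X);   /* guaranteed 1 with acquire/release pairing */
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              pthread_create(&a, NULL, waiter, NULL);
              pthread_create(&b, NULL, waker, NULL);
              pthread_join(a, NULL);
              pthread_join(b, NULL);
              return 0;
      }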

    Reported-by: Peter Zijlstra (Intel)
    Reported-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    LTP mtest06 has been observed to occasionally hit "still mapped when
    deleted" and the following BUG_ON on arm64.

    The extra mapcount originated from pagefault handler, which handled
    pagefault for vma that has already been detached. vma is detached
    under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
    also invalidates vmacache.

    When the pagefault handler (under mmap_sem read lock) calls
    find_vma(), vmacache_valid() wrongly reports vmacache as valid.

    After rwsem down_read() returns via 'queue empty' path (as of v5.2),
    it does so without an ACQUIRE on sem->count:

    down_read()
      __down_read()
        rwsem_down_read_failed()
          __rwsem_down_read_failed_common()
            raw_spin_lock_irq(&sem->wait_lock);
            if (list_empty(&sem->wait_list)) {
              if (atomic_long_read(&sem->count) >= 0) {
                raw_spin_unlock_irq(&sem->wait_lock);
                return sem;

    The problem can be reproduced by running LTP mtest06 in a loop and
    building the kernel (-j $NCPUS) in parallel. It has been reproducible
    since v4.20 on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes),
    triggering reliably in about an hour.

    The patched kernel ran fine for 10+ hours.

    Signed-off-by: Jan Stancek
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Will Deacon
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dbueso@suse.de
    Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
    Link: https://lkml.kernel.org/r/50b8914e20d1d62bb2dee42d342836c2c16ebee7.1563438048.git.jstancek@redhat.com
    Signed-off-by: Ingo Molnar

    Jan Stancek
     
    For a writer, the owner value is cleared on unlock. For a reader, it is
    left intact on unlock to provide a better debugging aid on crash dumps,
    and because the unlock of one reader does not mean the lock is free.

    As a result, owner_on_cpu() shouldn't be used on a read-owner, as the
    task pointer value may not be valid and the task might have been freed.
    That is the case in rwsem_spin_on_owner(), but not in
    rwsem_can_spin_on_owner(). This can lead to a use-after-free error
    reported by KASAN. For example,

    BUG: KASAN: use-after-free in rwsem_down_write_slowpath
    (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669
    /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)

    Fix this by checking for RWSEM_READER_OWNED flag before calling
    owner_on_cpu().

    Reported-by: Luis Henriques
    Tested-by: Luis Henriques
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Jeff Layton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Fixes: 94a9717b3c40e ("locking/rwsem: Make rwsem->owner an atomic_long_t")
    Link: https://lkml.kernel.org/r/81e82d5b-5074-77e8-7204-28479bbe0df0@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

15 Jul, 2019

1 commit

    Convert the locking documents to ReST and add them to the kernel
    development book where they belong.

    Most of the work here is just making Sphinx properly parse the text
    files; they're already in good shape and don't require massive changes
    in order to be parsed.

    The conversion is actually:
    - add blank lines and indentation in order to identify paragraphs;
    - fix table markups;
    - add some list markups;
    - mark literal blocks;
    - adjust title markups.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Federico Vaga

    Mauro Carvalho Chehab
     

13 Jul, 2019

1 commit

    The stats variable nr_unused_locks is incremented every time a new lock
    class is registered and decremented when the lock is first used in
    __lock_acquire(). It is then shown and checked in lockdep_stats.

    However, under configurations where either CONFIG_TRACE_IRQFLAGS or
    CONFIG_PROVE_LOCKING is not defined:

    The commit:

    091806515124b20 ("locking/lockdep: Consolidate lock usage bit initialization")

    missed marking the LOCK_USED flag at IRQ usage initialization because
    mark_usage() is not called. And the commit:

    886532aee3cd42d ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")

    further left mark_lock() undefined, so that LOCK_USED cannot be marked
    at all when the lock is first acquired.

    As a result, fix this by not showing and checking these stats in
    lockdep_stats under such configurations.
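
    A compile-time sketch of the resulting behaviour, with toy macros
    standing in for the kernel CONFIG_* symbols (build with
    -DTRACE_IRQFLAGS -DPROVE_LOCKING to model both options enabled):

      #include <stdio.h>

      int main(void)
      {
              printf(" lock classes: %11d\n", 42);
      #if defined(TRACE_IRQFLAGS) && defined(PROVE_LOCKING)
              /* Only meaningful when LOCK_USED is actually marked on first use,
               * so the stat is shown (and checked) only in this configuration. */
              unsigned long nr_unused = 3;
              printf(" unused locks: %11lu\n", nr_unused);
      #endif
              return 0;
      }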

    Reported-by: Qian Cai
    Signed-off-by: Yuyang Du
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: arnd@arndb.de
    Cc: frederic@kernel.org
    Link: https://lkml.kernel.org/r/20190709101522.9117-1-duyuyang@gmail.com
    Signed-off-by: Ingo Molnar

    Yuyang Du
     

09 Jul, 2019

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - rwsem scalability improvements, phase #2, by Waiman Long, which are
    rather impressive:

    "On a 2-socket 40-core 80-thread Skylake system with 40 reader
    and writer locking threads, the min/mean/max locking operations
    done in a 5-second testing window before the patchset were:

    40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
    40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

    After the patchset, they became:

    40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
    40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

    There's a lot of changes to the locking implementation that makes
    it similar to qrwlock, including owner handoff for more fair
    locking.

    Another microbenchmark shows how across the spectrum the
    improvements are:

    "With a locking microbenchmark running on 5.1 based kernel, the
    total locking rates (in kops/s) on a 2-socket Skylake system
    with equal numbers of readers and writers (mixed) before and
    after this patchset were:

    # of Threads   Before Patch   After Patch
    ------------   ------------   -----------
         2             2,618         4,193
         4             1,202         3,726
         8               802         3,622
        16               729         3,359
        32               319         2,826
        64               102         2,744"

    The changes are extensive and the patch-set has been through
    several iterations addressing various locking workloads. There
    might be more regressions, but unless they are pathological I
    believe we want to use this new implementation as the baseline
    going forward.

    - jump-label optimizations by Daniel Bristot de Oliveira: the primary
    motivation was to remove IPI disturbance of isolated RT-workload
    CPUs, which resulted in the implementation of batched jump-label
    updates. Beyond the improvement of the real-time characteristics of
    the kernel, in one test this patchset improved static key update
    overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
    as well.

    - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
    ~10 years of atomic64_t existence the various types used by the
    APIs only had to be self-consistent within each architecture -
    which means they became wildly inconsistent across architectures.
    Mark puts an end to this by reworking all the atomic64
    implementations to use 's64' as the base type for atomic64_t, and
    to ensure that this type is consistently used for parameters and
    return values in the API, avoiding further problems in this area.

    - A large set of small improvements to lockdep by Yuyang Du: type
    cleanups, output cleanups, function return type and other cleanups
    all around the place.

    - A set of percpu ops cleanups and fixes by Peter Zijlstra.

    - Misc other changes - please see the Git log for more details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
    locking/lockdep: increase size of counters for lockdep statistics
    locking/atomics: Use sed(1) instead of non-standard head(1) option
    locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
    x86/jump_label: Make tp_vec_nr static
    x86/percpu: Optimize raw_cpu_xchg()
    x86/percpu, sched/fair: Avoid local_clock()
    x86/percpu, x86/irq: Relax {set,get}_irq_regs()
    x86/percpu: Relax smp_processor_id()
    x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
    locking/rwsem: Guard against making count negative
    locking/rwsem: Adaptive disabling of reader optimistic spinning
    locking/rwsem: Enable time-based spinning on reader-owned rwsem
    locking/rwsem: Make rwsem->owner an atomic_long_t
    locking/rwsem: Enable readers spinning on writer
    locking/rwsem: Clarify usage of owner's nonspinaable bit
    locking/rwsem: Wake up almost all readers in wait queue
    locking/rwsem: More optimal RT task handling of null owner
    locking/rwsem: Always release wait_lock before waking up tasks
    locking/rwsem: Implement lock handoff to prevent lock starvation
    locking/rwsem: Make rwsem_spin_on_owner() return owner state
    ...

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • …k/linux-rcu into core/rcu

    Pull rcu/next + tools/memory-model changes from Paul E. McKenney:

    - RCU flavor consolidation cleanups and optimizations
    - Documentation updates
    - Miscellaneous fixes
    - SRCU updates
    - RCU-sync flavor consolidation
    - Torture-test updates
    - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

25 Jun, 2019

2 commits

    When the system has been running for a long time, signed integer
    counters are not enough for some lockdep statistics. Using
    unsigned long counters can satisfy the requirement. Besides,
    most of the lockdep statistics are unsigned, so it is better to use
    unsigned int instead of int.

    Remove unused variables.
    - max_recursion_depth
    - nr_cyclic_check_recursions
    - nr_find_usage_forwards_recursions
    - nr_find_usage_backwards_recursions

    Signed-off-by: Kobe Wu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc:
    Cc: Eason Lin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/1561365348-16050-1-git-send-email-kobe-cp.wu@mediatek.com
    Signed-off-by: Ingo Molnar

    Kobe Wu
     
  • The last cleanup patch triggered another issue, as now another function
    should be moved into the same section:

    kernel/locking/lockdep.c:3580:12: error: 'mark_lock' defined but not used [-Werror=unused-function]
    static int mark_lock(struct task_struct *curr, struct held_lock *this,

    Move mark_lock() into the same #ifdef section as its only caller, and
    remove the now-unused mark_lock_irq() stub helper.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Bart Van Assche
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Yuyang Du
    Fixes: 0d2cc3b34532 ("locking/lockdep: Move valid_state() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
    Link: https://lkml.kernel.org/r/20190617124718.1232976-1-arnd@arndb.de
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     

17 Jun, 2019

8 commits

    The upper bits of the count field are used as the reader count. When a
    sufficient number of active readers are present, the most significant
    bit will be set and the count becomes negative. If the number of active
    readers keeps piling up, we may eventually overflow the reader count.
    This is not likely to happen unless the number of bits reserved for the
    reader count is reduced because those bits are needed for other purposes.

    To prevent this count overflow from happening, the most significant
    bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
    attempts will now fail for both the fast and slow paths whenever this
    bit is set. So all those extra readers will be put to sleep in the wait
    list. Wakeup will not happen until the reader count reaches 0.
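
    A toy standalone model of the guard-bit idea (a 16-bit count and made-up
    constants, not the kernel's real layout): each reader adds a bias to the
    count, and once the most significant bit is reached further read-lock
    attempts see the guard bit and fail instead of wrapping the reader field.

      #include <stdint.h>
      #include <stdio.h>

      #define READER_BIAS (1u << 3)    /* low bits reserved for flags in this toy */
      #define READ_FAIL   (1u << 15)   /* most significant bit of the toy count */

      static uint16_t count;

      static int read_trylock(void)
      {
              uint16_t new = (uint16_t)(count + READER_BIAS);

              /* Guard bit set: too many readers, fail and queue up instead
               * of letting the reader count overflow into the flag bits. */
              if (new & READ_FAIL)
                      return 0;
              count = new;
              return 1;
      }

      int main(void)
      {
              unsigned long readers = 0;

              while (read_trylock())
                      readers++;
              printf("accepted %lu readers before the guard bit blocked further read locks\n",
                     readers);
              return 0;
      }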

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-17-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    Reader optimistic spinning is helpful when the reader critical section
    is short and there aren't that many readers around. It makes readers
    relatively more preferred than writers. When a writer times out spinning
    on a reader-owned lock and sets the nonspinnable bits, there are two main
    reasons for that:

    1) The reader critical section is long, perhaps the task sleeps after
    acquiring the read lock.
    2) There are just too many readers contending the lock, causing it to
    take a while to service all of them.

    In the former case, a long reader critical section will impede the progress
    of writers, which is usually more important for system performance.
    In the latter case, reader optimistic spinning tends to make the reader
    groups (readers that acquire the lock together) smaller, leading to more
    of them. That may hurt performance in some cases. In other words, the
    setting of the nonspinnable bits indicates that reader optimistic spinning
    may not be helpful for the workloads that cause it.

    Therefore, any writers that have observed the setting of the writer
    nonspinnable bit for a given rwsem after they fail to acquire the lock
    via optimistic spinning will set the reader nonspinnable bit once they
    acquire the write lock. Similarly, readers that observe the setting
    of reader nonspinnable bit at slowpath entry will also set the reader
    nonspinnable bit when they acquire the read lock via the wakeup path.

    Once the reader nonspinnable bit is on, it will only be reset when
    a writer is able to acquire the rwsem in the fast path or somehow a
    reader or writer in the slowpath doesn't observe the nonspinnable bit.

    This is to discourage reader optimistic spinning on that particular
    rwsem and make writers more preferred. This adaptive disabling of reader
    optimistic spinning will alleviate some of the negative side effects of
    this feature.

    In addition, this patch tries to make readers in the spinning queue
    follow the phase-fair principle after quitting optimistic spinning
    by checking if another reader has somehow acquired a read lock after
    this reader enters the optimistic spinning queue. If so and the rwsem
    is still reader-owned, this reader is in the right read-phase and can
    attempt to acquire the lock.

    On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
    the will-it-scale benchmark was run with various number of threads. The
    number of operations done before reader optimistic spinning patches,
    this patch and after this patch were:

    Threads   Before rspin   Before patch   After patch   %change
    -------   ------------   ------------   -----------   -------
      20         5541068        5345484        5455667     -3.5%/ +2.1%
      40        10185150        7292313        9219276    -28.5%/+26.4%
      60         8196733        6460517        7181209    -21.2%/+11.2%
      80         9508864        6739559        8107025    -29.1%/+20.3%

    This patch doesn't recover all the lost performance, but it is more
    than half. Given the fact that reader optimistic spinning does benefit
    some workloads, this is a good compromise.

    Using the rwsem locking microbenchmark with very short critical section,
    this patch doesn't have too much impact on locking performance as shown
    by the locking rates (kops/s) below with equal numbers of readers and
    writers before and after this patch:

    # of Threads   Pre-patch   Post-patch
    ------------   ---------   ----------
          2           4,730       4,969
          4           4,814       4,786
          8           4,866       4,815
         16           4,715       4,511
         32           3,338       3,500
         64           3,212       3,389
         80           3,110       3,044

    When running the locking microbenchmark with 40 dedicated reader and writer
    threads, however, the reader performance is curtailed to favor the writer.

    Before patch:

    40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
    40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644

    After patch:

    40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
    40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • When the rwsem is owned by reader, writers stop optimistic spinning
    simply because there is no easy way to figure out if all the readers
    are actively running or not. However, there are scenarios where
    the readers are unlikely to sleep and optimistic spinning can help
    performance.

    This patch provides a simple mechanism for spinning on a reader-owned
    rwsem by a writer. It is a time threshold based spinning where the
    allowable spinning time can vary from 10us to 25us depending on the
    condition of the rwsem.

    When the time threshold is exceeded, the nonspinnable bits will be set
    in the owner field to indicate that no more optimistic spinning will
    be allowed on this rwsem until it becomes writer-owned again. Not even
    readers are allowed to acquire the reader-locked rwsem via optimistic
    spinning, for fairness.

    We also want a writer to acquire the lock after the readers hold the
    lock for a relatively long time. In order to give preference to writers
    under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
    into two - one for readers and one for writers. When optimistic spinning
    is disabled, both bits will be set. When the reader count drops down
    to 0, the writer nonspinnable bit will be cleared to allow writers to
    spin on the lock, but not the readers. When a writer acquires the lock,
    it will write its own task structure pointer into sem->owner and clear
    the reader nonspinnable bit in the process.
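
    As a simplified userspace sketch of the time-threshold idea (the budget,
    lock word and flag handling below are illustrative, not the kernel
    implementation): spin on the reader count for a bounded time, and if the
    budget expires mark the lock nonspinnable so later spinners sleep instead.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      #define SPIN_BUDGET_NS 25000   /* ~25us, the upper bound mentioned above */

      static atomic_int reader_count = 5;   /* pretend the rwsem stays reader-owned */
      static atomic_bool nonspinnable;

      static long long now_ns(void)
      {
              struct timespec ts;

              clock_gettime(CLOCK_MONOTONIC, &ts);
              return ts.tv_sec * 1000000000LL + ts.tv_nsec;
      }

      /* Returns true if the writer saw the lock become free within its budget. */
      static bool writer_spin_on_readers(void)
      {
              long long deadline = now_ns() + SPIN_BUDGET_NS;

              while (atomic_load(&reader_count) != 0) {
                      if (now_ns() > deadline) {
                              /* Budget exhausted: disable optimistic spinning
                               * on this lock and go to sleep instead. */
                              atomic_store(&nonspinnable, true);
                              return false;
                      }
              }
              return true;
      }

      int main(void)
      {
              bool got_it = writer_spin_on_readers();

              printf("spin %s, nonspinnable=%d\n",
                     got_it ? "succeeded" : "timed out",
                     (int)atomic_load(&nonspinnable));
              return 0;
      }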

    The time taken for each iteration of the reader-owned rwsem spinning
    loop varies. Below are sample minimum elapsed times for 16 iterations
    of the loop.

    System                       Time for 16 Iterations
    ------                       ----------------------
    1-socket Skylake                     ~800ns
    4-socket Broadwell                   ~300ns
    2-socket ThunderX2 (arm64)           ~250ns

    When the lock cacheline is contended, we can see up to almost 10X
    increase in elapsed time. So 25us will be at most 500, 1300 and 1600
    iterations for each of the above systems.

    With a locking microbenchmark running on a 5.1 based kernel, the total
    locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
    equal numbers of readers and writers before and after this patch were
    as follows:

    # of Threads   Pre-patch   Post-patch
    ------------   ---------   ----------
          2           1,759       6,684
          4           1,684       6,738
          8           1,074       7,222
         16             900       7,163
         32             458       7,316
         64             208         520
        128             168         425
        240             143         474

    This patch gives a big boost in performance for mixed reader/writer
    workloads.

    With 32 locking threads, the rwsem lock event data were:

    rwsem_opt_fail=79850
    rwsem_opt_nospin=5069
    rwsem_opt_rlock=597484
    rwsem_opt_wlock=957339
    rwsem_sleep_reader=57782
    rwsem_sleep_writer=55663

    With 64 locking threads, the data looked like:

    rwsem_opt_fail=346723
    rwsem_opt_nospin=6293
    rwsem_opt_rlock=1127119
    rwsem_opt_wlock=1400628
    rwsem_sleep_reader=308201
    rwsem_sleep_writer=72281

    So a lot more threads acquired the lock in the slowpath and more threads
    went to sleep.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    The rwsem->owner field contains not just the task structure pointer; it
    also holds some flags for storing the current state of the rwsem. Some of
    the flags may have to be atomically updated. To reflect the new reality,
    the owner field is now changed to an atomic_long_t type.

    New helper functions are added to properly separate out the task
    structure pointer and the embedded flags.
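
    A standalone sketch of the packed owner word (the flag values and helper
    names below are illustrative, not the kernel's definitions): since task
    structures are well aligned, the low bits of the pointer are free to hold
    state flags, and small helpers split the word back apart.

      #include <stdalign.h>
      #include <stdio.h>

      #define OWNER_READER       0x1UL   /* illustrative flags stored in the */
      #define OWNER_NONSPINNABLE 0x2UL   /* always-zero low bits of the pointer */
      #define OWNER_FLAG_MASK    (OWNER_READER | OWNER_NONSPINNABLE)

      struct task { char name[16]; };

      /* Pack a task pointer together with its state flags into one word. */
      static unsigned long owner_pack(struct task *t, unsigned long flags)
      {
              return (unsigned long)t | (flags & OWNER_FLAG_MASK);
      }

      /* Helpers in the spirit the changelog describes: split the word apart. */
      static struct task *owner_task(unsigned long owner)
      {
              return (struct task *)(owner & ~OWNER_FLAG_MASK);
      }

      static unsigned long owner_flags(unsigned long owner)
      {
              return owner & OWNER_FLAG_MASK;
      }

      int main(void)
      {
              /* Over-align the toy task so its low bits are guaranteed zero. */
              static alignas(8) struct task me = { "writer0" };
              unsigned long owner = owner_pack(&me, OWNER_NONSPINNABLE);

              printf("task=%s flags=%#lx\n", owner_task(owner)->name,
                     owner_flags(owner));
              return 0;
      }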

    Suggested-by: Peter Zijlstra
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-14-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch enables readers to optimistically spin on a
    rwsem when it is owned by a writer instead of going to sleep
    directly. The rwsem_can_spin_on_owner() function is extracted
    out of rwsem_optimistic_spin() and is called directly by
    rwsem_down_read_slowpath() and rwsem_down_write_slowpath().

    With a locking microbenchmark running on a 5.1 based kernel, the total
    locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
    numbers of readers and writers before and after the patch were as
    follows:

    # of Threads   Pre-patch   Post-patch
    ------------   ---------   ----------
          4           1,674       1,684
          8           1,062       1,074
         16             924         900
         32             300         458
         64             195         208
        128             164         168
        240             149         143

    The performance change wasn't significant in this case, but this change
    is required by a follow-on patch.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-13-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
    anonymous owner - readers or an anonymous writer. The setting of this
    anonymous bit is used as an indicator that optimistic spinning cannot
    be done on this rwsem.

    With the upcoming reader optimistic spinning patches, a reader-owned
    rwsem can be spun on for a limited period of time. We still need
    this bit to indicate that a rwsem is nonspinnable, but its absence no
    longer means that the owner is known. So rename the bit
    to RWSEM_NONSPINNABLE to clarify its meaning.

    This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-12-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    When the front of the wait queue is a reader, other readers
    immediately following the first reader will also be woken up at the
    same time. However, if there is a writer in between, the readers
    behind that writer will not be woken up.

    Because of optimistic spinning, the lock acquisition order is not FIFO
    anyway. The lock handoff mechanism will ensure that lock starvation
    will not happen.

    Assuming that the lock hold times of the other readers still in the
    queue will be about the same as the readers that are being woken up,
    there is really not much additional cost other than the additional
    latency due to the wakeup of additional tasks by the waker. Therefore
    all the readers up to a maximum of 256 in the queue are woken up when
    the first waiter is a reader to improve reader throughput. This is
    somewhat similar in concept to a phase-fair R/W lock.

    With a locking microbenchmark running on a 5.1 based kernel, the total
    locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
    equal numbers of readers and writers before and after this patch were
    as follows:

    # of Threads   Pre-Patch   Post-patch
    ------------   ---------   ----------
          4           1,641       1,674
          8             731       1,062
         16             564         924
         32              78         300
         64              38         195
        240              50         149

    There is no performance gain at low contention level. At high contention
    level, however, this patch gives a pretty decent performance boost.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-11-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    An RT task can do optimistic spinning only if the lock holder is
    actually running. If the state of the lock holder isn't known, there
    is a possibility that the high priority of the RT task may block the
    forward progress of the lock holder if it happens to reside on the same
    CPU. This will lead to deadlock. So we have to make sure that an RT task
    will not spin on a reader-owned rwsem.

    When the owner is temporarily set to NULL, there are two cases
    where we may want to continue spinning:

    1) The lock owner is in the process of releasing the lock, sem->owner
    is cleared but the lock has not been released yet.

    2) The lock was free and the owner cleared, but another task just comes
    in and acquires the lock before we try to get it. The new owner may
    be a spinnable writer.

    So an RT task is now made to retry one more time to see if it can
    acquire the lock or continue spinning on the new owning writer.

    When testing on an 8-socket IvyBridge-EX system, the one additional retry
    seems to improve the locking performance of RT write locking threads under
    heavy contention. The table below shows the locking rates (in kops/s)
    with various numbers of write locking threads before and after the patch.

    Locking threads   Pre-patch   Post-patch
    ---------------   ---------   ----------
           4            2,753       2,608
           8            2,529       2,520
          16            1,727       1,918
          32            1,263       1,956
          64              889       1,343

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Cc: huang ying
    Link: https://lkml.kernel.org/r/20190520205918.22251-10-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long