29 Oct, 2018

1 commit

  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c and should be moved to the
    relevant header.

    Now that they are getting moved to the right header file, reuse them in
    schedutil governor as well (that required rename of show/store
    routines).

    Also create gov_attr_wo() macro for write-only sysfs files, this will be
    used by Interactive governor in a later patch.
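    For illustration, such a write-only macro would likely mirror the existing
    gov_attr_ro()/gov_attr_rw() pattern (a sketch, not necessarily the exact
    definition):

        /* sysfs attribute with only a store() method (mode 0200) */
        #define gov_attr_wo(_name)                              \
        static struct governor_attr _name =                     \
        __ATTR(_name, 0200, NULL, store_##_name)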

    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

29 Sep, 2018

1 commit

  • commit d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe upstream.

    When a task which previously ran on a given CPU is remotely queued to
    wake up on that same CPU, there is a period where the task's state is
    TASK_WAKING and its vruntime is not normalized. This is not accounted
    for in vruntime_normalized() which will cause an error in the task's
    vruntime if it is switched from the fair class during this time.

    For example if it is boosted to RT priority via rt_mutex_setprio(),
    rq->min_vruntime will not be subtracted from the task's vruntime but
    it will be added again when the task returns to the fair class. The
    task's vruntime will have been erroneously doubled and the effective
    priority of the task will be reduced.

    Note this will also lead to inflation of all vruntimes since the doubled
    vruntime value will become the rq's min_vruntime when other tasks leave
    the rq. This leads to repeated doubling of the vruntime and priority
    penalty.

    Fix this by recognizing a WAKING task's vruntime as normalized only if
    sched_remote_wakeup is true. This indicates a migration, in which case
    the vruntime would have been normalized in migrate_task_rq_fair().
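    A minimal sketch of the resulting condition (assuming it lives in
    vruntime_normalized() in kernel/sched/fair.c; not the literal diff):

        /* treat a WAKING task as normalized only if it was remotely queued */
        if (!se->sum_exec_runtime ||
            (p->state == TASK_WAKING && p->sched_remote_wakeup))
                return true;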

    Based on a similar patch from John Dias .

    Suggested-by: Peter Zijlstra
    Tested-by: Dietmar Eggemann
    Signed-off-by: Steve Muckle
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Redpath
    Cc: John Dias
    Cc: Linus Torvalds
    Cc: Miguel de Dios
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Paul Turner
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: kernel-team@android.com
    Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
    Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steve Muckle
     

26 Sep, 2018

2 commits

  • [ Upstream commit 8fe5c5a937d0f4e84221631833a2718afde52285 ]

    When a new task wakes-up for the first time, its initial utilization
    is set to half of the spare capacity of its CPU. The current
    implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
    directly as a capacity reference. As a result, on a big.LITTLE system, a
    new task waking up on an idle little CPU will be given ~512 of util_avg,
    even if the CPU's capacity is significantly less than that.

    Fix this by computing the spare capacity with arch_scale_cpu_capacity().
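    Roughly, the spare-capacity computation in post_init_entity_util_avg()
    becomes something like this (a sketch; field and helper names as in
    mainline of that era):

        long cpu_scale = arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
        long cap = (long)(cpu_scale - cfs_rq->avg.util_avg) / 2;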

    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Vincent Guittot
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: morten.rasmussen@arm.com
    Cc: patrick.bellasi@arm.com
    Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Quentin Perret
     
  • [ Upstream commit 76e079fefc8f62bd9b2cd2950814d1ee806e31a5 ]

    wake_woken_function() synchronizes with wait_woken() as follows:

    [wait_woken]                          [wake_woken_function]

    entry->flags &= ~wq_flag_woken;       condition = true;
    smp_mb();                             smp_wmb();
    if (condition)                        wq_entry->flags |= wq_flag_woken;
       break;

    This commit replaces the above smp_wmb() with an smp_mb() in order to
    guarantee that either wait_woken() sees the wait condition being true
    or the store to wq_entry->flags in woken_wake_function() follows the
    store in wait_woken() in the coherence order (so that the former can
    eventually be observed by wait_woken()).

    The commit also fixes a comment associated with set_current_state() in
    wait_woken(): the comment pairs the barrier in set_current_state() with
    the above smp_wmb(), while the actual pairing involves the barrier in
    set_current_state() and the barrier executed by the try_to_wake_up()
    in wake_woken_function().
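
    A sketch of the resulting wake-side code (kernel/sched/wait.c; the entry
    field names follow mainline):

        static int woken_wake_function(struct wait_queue_entry *wq_entry,
                                       unsigned mode, int sync, void *key)
        {
                /* Pairs with the barrier implied by set_current_state() in wait_woken() */
                smp_mb();       /* was smp_wmb() */
                wq_entry->flags |= WQ_FLAG_WOKEN;

                return default_wake_function(wq_entry, mode, sync, key);
        }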

    Signed-off-by: Andrea Parri
    Signed-off-by: Paul E. McKenney
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akiyks@gmail.com
    Cc: boqun.feng@gmail.com
    Cc: dhowells@redhat.com
    Cc: j.alglave@ucl.ac.uk
    Cc: linux-arch@vger.kernel.org
    Cc: luc.maranget@inria.fr
    Cc: npiggin@gmail.com
    Cc: parri.andrea@gmail.com
    Cc: stern@rowland.harvard.edu
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/20180716180605.16115-10-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrea Parri
     

15 Sep, 2018

1 commit

  • commit 295d6d5e373607729bcc8182c25afe964655714f upstream.

    Fix a bug introduced in:

    72f9f3fdc928 ("sched/deadline: Remove dl_new from struct sched_dl_entity")

    After that commit, when switching to -deadline if the scheduling
    deadline of a task is in the past then switched_to_dl() calls
    setup_new_entity() to properly initialize the scheduling deadline
    and runtime.

    The problem is that the task is enqueued _before_ having its parameters
    initialized by setup_new_entity(), and this can cause problems.
    For example, a task with its out-of-date deadline in the past will
    potentially be enqueued as the highest priority one; however, its
    adjusted deadline may not be the earliest one.

    This patch fixes the problem by initializing the task's parameters before
    enqueuing it.
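
    Roughly, switched_to_dl() ends up doing the setup first (a sketch, not the
    literal diff):

        /* initialize the scheduling parameters before any enqueue/preemption work */
        if (dl_time_before(p->dl.deadline, rq_clock(rq)))
                setup_new_dl_entity(&p->dl);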

    Signed-off-by: luca abeni
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mathieu Poirier
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1504778971-13573-3-git-send-email-luca.abeni@santannapisa.it
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Luca Abeni
     

05 Sep, 2018

1 commit

  • [ Upstream commit f3d133ee0a17d5694c6f21873eec9863e11fa423 ]

    The NO_RT_RUNTIME_SHARE feature is used to prevent a CPU from borrowing
    runtime from other CPUs when it runs a spinning RT task.

    However, if the RT_RUNTIME_SHARE feature is enabled and an rt_rq has already
    borrowed rt_runtime, that rt_runtime is not restored to its initial
    bandwidth after RT_RUNTIME_SHARE is disabled.

    E.g. on my PC with 4 cores, procedure to reproduce:
    1) Make sure RT_RUNTIME_SHARE is enabled
    cat /sys/kernel/debug/sched_features
    GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
    CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
    LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
    NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
    ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
    2) Start a spin-rt-task
    ./loop_rr &
    3) set affinity to the last cpu
    taskset -p 8 $pid_of_loop_rr
    4) Observe that the last CPU has borrowed runtime.
    cat /proc/sched_debug | grep rt_runtime
    .rt_runtime : 950.000000
    .rt_runtime : 900.000000
    .rt_runtime : 950.000000
    .rt_runtime : 1000.000000
    5) Disable RT_RUNTIME_SHARE
    echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
    6) Observe that rt_runtime has not been restored
    cat /proc/sched_debug | grep rt_runtime
    .rt_runtime : 950.000000
    .rt_runtime : 900.000000
    .rt_runtime : 950.000000
    .rt_runtime : 1000.000000

    This patch restores rt_runtime after RT_RUNTIME_SHARE is disabled.
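
    One way to do that, sketched under the assumption that the reset happens in
    do_sched_rt_period_timer() while rt_rq->rt_runtime_lock is held:

        /* put borrowed runtime back once sharing has been turned off */
        if (!sched_feat(RT_RUNTIME_SHARE) && rt_rq->rt_runtime != RUNTIME_INF)
                rt_rq->rt_runtime = rt_b->rt_runtime;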

    Signed-off-by: Hailong Liu
    Signed-off-by: Jiang Biao
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: zhong.weidong@zte.com.cn
    Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Hailong Liu
     

16 Aug, 2018

1 commit

  • commit ba2591a5993eabcc8e874e30f361d8ffbb10d6d4 upstream

    The static key sched_smt_present is only updated at boot time when SMT
    siblings have been detected. Booting with maxcpus=1 and bringing the
    siblings online after boot rebuilds the scheduling domains correctly but
    does not update the static key, so the SMT code is not enabled.

    Let the key be updated in the scheduler CPU hotplug code to fix this.
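
    A sketch of the hotplug-path update (assuming it sits in sched_cpu_activate()):

        #ifdef CONFIG_SCHED_SMT
                /* a second sibling coming online means SMT is actually in use */
                if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
                        static_branch_enable_cpuslocked(&sched_smt_present);
        #endif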

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

08 Jul, 2018

2 commits

  • [ Upstream commit 7af443ee1697607541c6346c87385adab2214743 ]

    select_task_rq() is used in a few paths to select the CPU upon which a
    thread should be run - for example it is used by try_to_wake_up() & by
    fork or exec balancing. As-is it allows use of any online CPU that is
    present in the task's cpus_allowed mask.

    This presents a problem because there is a period whilst CPUs are
    brought online where a CPU is marked online, but is not yet fully
    initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <
    CPUHP_ONLINE. Usually we don't run any user tasks during this window,
    but there are corner cases where this can happen. An example observed
    is:

    - Some user task A, running on CPU X, forks to create task B.

    - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
    task_struct::cpu field to X.

    - CPU X is offlined.

    - Task A, currently somewhere between the __set_task_cpu() in
    copy_process() and the call to wake_up_new_task(), is migrated to
    CPU Y by migrate_tasks() when CPU X is offlined.

    - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
    scheduler is now active on CPU X, but there are no user tasks on
    the runqueue.

    - Task A runs on CPU Y & reaches wake_up_new_task(). This calls
    select_task_rq() with cpu=X, taken from task B's task_struct,
    and select_task_rq() allows CPU X to be returned.

    - Task A enqueues task B on CPU X's runqueue, via activate_task() &
    enqueue_task().

    - CPU X now has a user task on its runqueue before it has reached the
    CPUHP_ONLINE state.

    In most cases, the user tasks that schedule on the newly onlined CPU
    have no idea that anything went wrong, but one case observed to be
    problematic is if the task goes on to invoke the sched_setaffinity
    syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
    before the CPU that brought it online calls stop_machine_unpark(). This
    means that for a portion of the window of time between
    CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
    cpu_stopper has its enabled field set to false. If a user thread is
    executed on the CPU during this window and it invokes sched_setaffinity
    with a CPU mask that does not include the CPU it's running on, then when
    __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
    migration_cpu_stop() and perform the actual migration away from the CPU
    it will simply return -ENOENT rather than calling migration_cpu_stop().
    We then return from the sched_setaffinity syscall back to the user task
    that is now running on a CPU which it just asked not to run on, and
    which is not present in its cpus_allowed mask.

    This patch resolves the problem by having select_task_rq() enforce that
    user tasks run on CPUs that are active - the same requirement that
    select_fallback_rq() already enforces. This should ensure that newly
    onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
    schedule user tasks, and also implies that bringup_wait_for_ap() will
    have called stop_machine_unpark() which resolves the sched_setaffinity
    issue above.
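
    A minimal sketch of the resulting check in select_task_rq(), assuming an
    is_cpu_allowed()-style helper like the one sketched under the next commit
    below:

        if (unlikely(!is_cpu_allowed(p, cpu)))
                cpu = select_fallback_rq(task_cpu(p), p);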

    I haven't yet investigated them, but it may be of interest to review
    whether any of the actions performed by hotplug states between
    CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
    effects on user tasks that might schedule before they are reached, which
    might widen the scope of the problem from just affecting the behaviour
    of sched_setaffinity.

    Signed-off-by: Paul Burton
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul Burton
     
  • [ Upstream commit 175f0e25abeaa2218d431141ce19cf1de70fa82d ]

    As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
    for running on an online && !active CPU are stricter than just being a
    kthread: you need to be a per-CPU kthread.

    If you're not strictly per-CPU, you have better CPUs to run on and
    don't need the partially booted one to get your work done.

    The exception is to allow smpboot threads to bootstrap the CPU itself
    and get kernel 'services' initialized before we allow userspace on it.
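
    A sketch of what such a helper could look like (a hypothetical shape, with
    is_per_cpu_kthread() standing in for the per-CPU kthread test):

        static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
        {
                if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
                        return false;

                /* only strictly per-CPU kthreads may use an online && !active CPU */
                if (is_per_cpu_kthread(p))
                        return cpu_online(cpu);

                return cpu_active(cpu);
        }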

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs")
    Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

21 Jun, 2018

3 commits

  • [ Upstream commit 3febfc8a219a036633b57a34c6678e21b6a0580d ]

    Since the grub_reclaim() function can be made static, make it so.

    Silences the following GCC warning (W=1):

    kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]

    Signed-off-by: Mathieu Malaterre
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit f6a3463063f42d9fb2c78f386437a822e0ad1792 ]

    In the following commit:

    6b55c9654fcc ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")

    the print_cfs_rq() prototype was added to kernel/sched/sched.h,
    right next to the prototypes for print_cfs_stats(), print_rt_stats()
    and print_dl_stats().

    Finish that previous commit's work and also move the related prototypes
    for print_rt_rq() and print_dl_rq().

    Remove the existing extern declarations now that they are no longer needed.

    Silences the following GCC warning, triggered by W=1:

    kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
    kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]

    Signed-off-by: Mathieu Malaterre
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit b5bf9a90bbebffba888c9144c5a8a10317b04064 ]

    Gaurav reported a perceived problem with TASK_PARKED, which turned out
    to be a broken wait-loop pattern in __kthread_parkme(), but the
    reported issue can (and does) in fact happen for states that do not do
    condition based sleeps.

    When the 'current->state = TASK_RUNNING' store of a previous
    (concurrent) try_to_wake_up() collides with the setting of a 'special'
    sleep state, we can lose the sleep state.

    Normal condition-based wait-loops are immune to this problem, but
    sleep states that are not condition based are subject to it.

    There already is a fix for TASK_DEAD. Abstract that and also apply it
    to TASK_STOPPED and TASK_TRACED, both of which also lack a
    condition-based wait-loop.
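
    The abstraction is essentially a helper that sets the state under a lock so
    a concurrent wakeup cannot overwrite it; a sketch (names follow the upstream
    set_special_state() idea):

        #define set_special_state(state_value)                                  \
                do {                                                            \
                        unsigned long flags;                                    \
                        raw_spin_lock_irqsave(&current->pi_lock, flags);        \
                        current->state = (state_value);                         \
                        raw_spin_unlock_irqrestore(&current->pi_lock, flags);   \
                } while (0)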

    Reported-by: Gaurav Kohli
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

30 May, 2018

1 commit

  • [ Upstream commit d29a20645d5e929aa7e8616f28e5d8e1c49263ec ]

    While running rt-tests' pi_stress program I got the following splat:

    rq->clock_update_flags < RQCF_ACT_SKIP
    WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20

    [...]


    enqueue_top_rt_rq+0xf4/0x150
    ? cpufreq_dbs_governor_start+0x170/0x170
    sched_rt_rq_enqueue+0x65/0x80
    sched_rt_period_timer+0x156/0x360
    ? sched_rt_rq_enqueue+0x80/0x80
    __hrtimer_run_queues+0xfa/0x260
    hrtimer_interrupt+0xcb/0x220
    smp_apic_timer_interrupt+0x62/0x120
    apic_timer_interrupt+0xf/0x20

    [...]

    do_idle+0x183/0x1e0
    cpu_startup_entry+0x5f/0x70
    start_secondary+0x192/0x1d0
    secondary_startup_64+0xa5/0xb0

    We can get rid of it by the "traditional" means of adding an
    update_rq_clock() call after acquiring the rq->lock in
    do_sched_rt_period_timer().

    The RT task throttling case (which this workload also hits) can be
    ignored, since the skip_update call there is actually bogus and quite
    the contrary (the request bits are removed/reverted).

    By setting RQCF_UPDATED we really don't care whether the skip happens
    or not, and the assert_clock_updated() check is therefore satisfied.
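
    Sketch of the change in the per-rq loop of do_sched_rt_period_timer():

        raw_spin_lock(&rq->lock);
        update_rq_clock(rq);    /* keeps assert_clock_updated() happy downstream */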

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Matt Fleming
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: linux-kernel@vger.kernel.org
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Davidlohr Bueso
     

16 May, 2018

2 commits

  • commit 354d7793070611b4df5a79fbb0f12752d0ed0cc5 upstream.

    > kernel/sched/autogroup.c:230 proc_sched_autogroup_set_nice() warn: potential spectre issue 'sched_prio_to_weight'

    Userspace controls @nice, sanitize the array index.
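
    Roughly (a sketch; sched_prio_to_weight has 40 entries indexed by nice + 20,
    and array_index_nospec() comes from <linux/nospec.h>):

        idx = array_index_nospec(nice + 20, 40);
        shares = scale_load(sched_prio_to_weight[idx]);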

    Reported-by: Dan Carpenter
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 97739501f207efe33145b918817f305b822987f8 upstream.

    If the next_freq field of struct sugov_policy is set to UINT_MAX,
    it shouldn't be used for updating the CPU frequency (this is a
    special "invalid" value), but after commit b7eaf1aab9f8 (cpufreq:
    schedutil: Avoid reducing frequency of busy CPUs prematurely) it
    may be passed as the new frequency to sugov_update_commit() in
    sugov_update_single().

    Fix that by adding an extra check for the special UINT_MAX value
    of next_freq to sugov_update_single().
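
    A sketch of the extra check in sugov_update_single() (field names as in the
    schedutil code of that era):

        if (busy && next_f < sg_policy->next_freq &&
            sg_policy->next_freq != UINT_MAX)
                next_f = sg_policy->next_freq;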

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Reported-by: Viresh Kumar
    Cc: 4.12+ # 4.12+
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

19 Mar, 2018

2 commits

  • [ Upstream commit a0982dfa03efca6c239c52cabebcea4afb93ea6b ]

    The rcutorture test suite occasionally provokes a splat due to invoking
    resched_cpu() on an offline CPU:

    WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
    Modules linked in:
    CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff902ede9daf00 task.stack: ffff96c50010c000
    RIP: 0010:native_smp_send_reschedule+0x37/0x40
    RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
    RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
    RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
    RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
    R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
    FS: 0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
    Call Trace:
    resched_curr+0x8f/0x1c0
    resched_cpu+0x2c/0x40
    rcu_implicit_dynticks_qs+0x152/0x220
    force_qs_rnp+0x147/0x1d0
    ? sync_rcu_exp_select_cpus+0x450/0x450
    rcu_gp_kthread+0x5a9/0x950
    kthread+0x142/0x180
    ? force_qs_rnp+0x1d0/0x1d0
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x27/0x40
    Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
    ---[ end trace 26df9e5df4bba4ac ]---

    This splat cannot be generated by expedited grace periods because they
    always invoke resched_cpu() on the current CPU, which is good because
    expedited grace periods require that resched_cpu() unconditionally
    succeed. However, other parts of RCU can tolerate resched_cpu() acting
    as a no-op, at least as long as it doesn't happen too often.

    This commit therefore makes resched_cpu() invoke resched_curr() only if
    the CPU is either online or is the current CPU.
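
    Sketched, the guard in resched_cpu() looks like:

        if (cpu_online(cpu) || cpu == smp_processor_id())
                resched_curr(rq);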

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • [ Upstream commit 2fe2582649aa2355f79acddb86bd4d6c5363eb63 ]

    The rcutorture test suite occasionally provokes a splat due to invoking
    rt_mutex_lock() which needs to boost the priority of a task currently
    sitting on a runqueue that belongs to an offline CPU:

    WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
    Modules linked in:
    CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000
    RIP: 0010:native_smp_send_reschedule+0x37/0x40
    RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082
    RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004
    RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff
    RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d
    R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d
    R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200
    FS: 0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0
    Call Trace:
    resched_curr+0x61/0xd0
    switched_to_rt+0x8f/0xa0
    rt_mutex_setprio+0x25c/0x410
    task_blocks_on_rt_mutex+0x1b3/0x1f0
    rt_mutex_slowlock+0xa9/0x1e0
    rt_mutex_lock+0x29/0x30
    rcu_boost_kthread+0x127/0x3c0
    kthread+0x104/0x140
    ? rcu_report_unblock_qs_rnp+0x90/0x90
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x22/0x30
    Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48

    But the target task's priority has already been adjusted, so the only
    purpose of switched_to_rt() invoking resched_curr() is to wake up the
    CPU running some task that needs to be preempted by the boosted task.
    But the CPU is offline, which presumably means that the task must be
    migrated to some other CPU, and that this other CPU will undertake any
    needed preemption at the time of migration. Because the runqueue lock
    is held when resched_curr() is invoked, we know that the boosted task
    cannot go anywhere, so it is not necessary to invoke resched_curr()
    in this particular case.

    This commit therefore makes switched_to_rt() refrain from invoking
    resched_curr() when the target CPU is offline.
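
    Sketch of the guard in switched_to_rt():

        /* an offline CPU will sort out preemption when its tasks are migrated */
        if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
                resched_curr(rq);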

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     

17 Feb, 2018

3 commits

  • commit 364f56653708ba8bcdefd4f0da2a42904baa8eeb upstream.

    When issuing an IPI RT push, where an IPI is sent to each CPU that has more
    than one RT task scheduled on it, it references the root domain's rto_mask,
    that contains all the CPUs within the root domain that has more than one RT
    task in the runable state. The problem is, after the IPIs are initiated, the
    rq->lock is released. This means that the root domain that is associated to
    the run queue could be freed while the IPIs are going around.

    Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
    the root domain's ref count respectively. This way when initiating the IPIs,
    the scheduler will up the root domain's ref count before releasing the
    rq->lock, ensuring that the root domain does not go away until the IPI round
    is complete.
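
    A sketch of the two helpers (assuming the root domain is already freed via
    RCU through free_rootdomain()):

        void sched_get_rd(struct root_domain *rd)
        {
                atomic_inc(&rd->refcount);
        }

        void sched_put_rd(struct root_domain *rd)
        {
                if (!atomic_dec_and_test(&rd->refcount))
                        return;

                call_rcu_sched(&rd->rcu, free_rootdomain);
        }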

    Reported-by: Pavan Kondeti
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit ad0f1d9d65938aec72a698116cd73a980916895e upstream.

    When the rto_push_irq_work_func() is called, it looks at the RT overloaded
    bitmask in the root domain via the runqueue (rq->rd). The problem is that
    during CPU up and down, nothing here stops rq->rd from changing between
    taking the rq->rd->rto_lock and releasing it. That means the lock that is
    released is not the same lock that was taken.

    Instead of using this_rq()->rd to get the root domain, as the irq work is
    part of the root domain, we can simply get the root domain from the irq work
    that is passed to the routine:

    container_of(work, struct root_domain, rto_push_work)

    This keeps the root domain consistent.
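
    Sketched:

        void rto_push_irq_work_func(struct irq_work *work)
        {
                /* the irq work is embedded in the root domain it belongs to */
                struct root_domain *rd =
                        container_of(work, struct root_domain, rto_push_work);

                /* ... operate on rd->rto_lock / rd->rto_cpu, not this_rq()->rd ... */
        }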

    Reported-by: Pavan Kondeti
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit c6b9d9a33029014446bd9ed84c1688f6d3d4eab9 upstream.

    The following cleanup commit:

    50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries")

    ... unintentionally changed the behavior of add_wait_queue() from
    inserting the wait entry at the head of the wait queue to the tail
    of the wait queue.

    Beyond a negative performance impact this change in behavior
    theoretically also breaks wait queues which mix exclusive and
    non-exclusive waiters, as non-exclusive waiters will not be
    woken up if they are queued behind enough exclusive waiters.
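
    The fix essentially puts add_wait_queue() back to head insertion; a sketch:

        void add_wait_queue(struct wait_queue_head *wq_head,
                            struct wait_queue_entry *wq_entry)
        {
                unsigned long flags;

                wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
                spin_lock_irqsave(&wq_head->lock, flags);
                __add_wait_queue(wq_head, wq_entry);    /* head, not tail */
                spin_unlock_irqrestore(&wq_head->lock, flags);
        }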

    Signed-off-by: Omar Sandoval
    Reviewed-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
    Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     

24 Jan, 2018

1 commit

  • commit c96f5471ce7d2aefd0dda560cc23f08ab00bc65d upstream.

    Before commit:

    e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

    delayacct_blkio_end() was called after context-switching into the task which
    completed I/O.

    This resulted in double counting: the task would account a delay both waiting
    for I/O and for time spent in the runqueue.

    With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up().
    In ttwu, we have not yet context-switched. This is more correct, in that
    the delay accounting ends when the I/O is complete.

    But delayacct_blkio_end() relies on 'get_current()', and we have not yet
    context-switched into the task whose I/O completed. This results in the
    wrong task having its delay accounting statistics updated.

    Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
    so that it can update the statistics of the correct task.
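
    A sketch of the wake-up side (assuming the accounting sits on the
    try_to_wake_up() path):

        if (p->in_iowait) {
                delayacct_blkio_end(p);         /* charge the woken task, not current */
                atomic_dec(&task_rq(p)->nr_iowait);
        }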

    Signed-off-by: Josh Snyder
    Acked-by: Tejun Heo
    Acked-by: Balbir Singh
    Cc: Brendan Gregg
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-block@vger.kernel.org
    Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
    Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Josh Snyder
     

17 Jan, 2018

1 commit

  • commit 541676078b52f365f53d46ee5517d305cd1b6350 upstream.

    smp_call_function_many() requires disabling preemption around the call.
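
    Sketched (tmpmask stands in for the computed set of target CPUs):

        preempt_disable();
        smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
        preempt_enable();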

    Signed-off-by: Mathieu Desnoyers
    Cc: Andrea Parri
    Cc: Andrew Hunter
    Cc: Avi Kivity
    Cc: Benjamin Herrenschmidt
    Cc: Boqun Feng
    Cc: Dave Watson
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Maged Michael
    Cc: Michael Ellerman
    Cc: Paul E . McKenney
    Cc: Paul E. McKenney
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Desnoyers
     

03 Jan, 2018

1 commit

  • commit 466a2b42d67644447a1765276259a3ea5531ddff upstream.

    Since the recent remote cpufreq callback work, it's possible that a cpufreq
    update is triggered from a remote CPU. For single policies, however, the
    current code uses the local CPU when trying to determine whether the remote
    sg_cpu entered idle or is busy. This is incorrect. To remedy this, compare
    with the nohz tick idle_calls counter of the remote CPU.
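
    A sketch of the busy check, assuming a per-CPU variant of the idle_calls
    accessor as added by this change:

        static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
        {
                unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
                bool ret = idle_calls == sg_cpu->saved_idle_calls;

                sg_cpu->saved_idle_calls = idle_calls;
                return ret;
        }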

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Acked-by: Viresh Kumar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Joel Fernandes
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes
     

20 Dec, 2017

1 commit

  • commit f73c52a5bcd1710994e53fbccc378c42b97a06b6 upstream.

    Daniel Wagner reported a crash on the BeagleBone Black SoC.

    This is a single-CPU architecture, and it does not have a functional
    arch_send_call_function_single_ipi() implementation; calling it can
    crash the kernel.

    As there is only one CPU, that function shouldn't be called, but if the
    kernel is compiled for SMP, the push/pull RT scheduling logic now calls
    it for irq_work when the one CPU is overloaded, which makes the CPU use
    that function to send an IPI to itself and crash the kernel.

    Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
    only has a single CPU. But SCHED_FEAT is a constant if sched debugging
    is turned off. Another fix can also be used, and this should also help
    with normal SMP machines. That is, do not initiate the pull code if
    there's only one RT overloaded CPU, and that CPU happens to be the
    current CPU that is scheduling in a lower priority task.

    Even on a system with many CPUs, if there's many RT tasks waiting to
    run on a single CPU, and that CPU schedules in another RT task of lower
    priority, it will initiate the PULL logic in case there's a higher
    priority RT task on another CPU that is waiting to run. But if there is
    no other CPU with waiting RT tasks, it will initiate the RT pull logic
    on itself (as it still has RT tasks waiting to run). This is a wasted
    effort.

    Not only does this help with SMP code where the current CPU is the only
    one with RT overloaded tasks, it should also solve the issue that
    Daniel encountered, because it will prevent the PULL logic from
    executing, as there's only one CPU on the system, and the check added
    here will cause it to exit the RT pull code.
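
    Sketch of the added check at the top of the pull path:

        /* if we are the only overloaded CPU there is nothing to pull */
        if (rt_overload_count == 1 &&
            cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
                return;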

    Reported-by: Daniel Wagner
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: linux-rt-users
    Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt
     

30 Nov, 2017

3 commits

  • commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

    When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only having one CPU having
    overloaded RT tasks and a bunch of other CPUs lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run queue
    to send its own IPI. As all CPUs with overloaded tasks must be scanned
    regardless if there's one or many CPUs lowering their priority, because
    there's no current way to find the CPU with the highest priority task that
    can schedule to one of these CPUs, there really only needs to be one IPI
    being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work  - the irq work to process each CPU set in rto_mask
    rto_lock       - the lock to protect some of the other rto fields
    rto_loop_start - an atomic that keeps contention down on rto_lock;
                     the first CPU scheduling in a lower priority task
                     is the one to kick off the process
    rto_loop_next  - an atomic that gets incremented for each CPU that
                     schedules in a lower priority task
    rto_loop       - a variable protected by rto_lock that is used to
                     compare against rto_loop_next
    rto_cpu        - the CPU to send the next IPI to, also protected by
                     the rto_lock

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    it is done, as another CPU is already kicking off the IPI loop. If the old
    value is "0", it takes the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
    in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks that
    are queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there are no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task while the
    IPI was being passed around, and the process needs to start again. The first
    CPU in rto_mask is sent the IPI.

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.
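
    For reference, those helpers are essentially a tiny acquire/release pair
    around the rto_loop_start atomic (a sketch):

        static inline bool rto_start_trylock(atomic_t *v)
        {
                return !atomic_cmpxchg_acquire(v, 0, 1);
        }

        static inline void rto_start_unlock(atomic_t *v)
        {
                atomic_set_release(v, 0);
        }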

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.

    The current implementation of synchronize_sched_expedited() incorrectly
    assumes that resched_cpu() is unconditional, which it is not. This means
    that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
    fails as follows (analysis by Neeraj Upadhyay):

    o CPU1 is waiting for expedited wait to complete:

    sync_rcu_exp_select_cpus
    rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
    IPI sent to CPU5

    synchronize_sched_expedited_wait
    ret = swait_event_timeout(rsp->expedited_wq,
    sync_rcu_preempt_exp_done(rnp_root),
    jiffies_stall);

    expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

    o CPU5 handles IPI and fails to acquire rq lock.

    Handles IPI
    sync_sched_exp_handler
    resched_cpu
    returns after failing to acquire rq->lock with trylock
    need_resched is not set

    o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
    idle (schedule() is not called).

    o CPU 1 reports RCU stall.

    Given that resched_cpu() is now used only by RCU, this commit fixes the
    assumption by making resched_cpu() unconditional.
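
    Sketch of the unconditional version (the trylock replaced by a plain lock):

        void resched_cpu(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                unsigned long flags;

                raw_spin_lock_irqsave(&rq->lock, flags);
                resched_curr(rq);
                raw_spin_unlock_irqrestore(&rq->lock, flags);
        }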

    Reported-by: Neeraj Upadhyay
    Suggested-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 upstream.

    'cached_raw_freq' is used to get the next frequency quickly but should
    always be in sync with sg_policy->next_freq. There is a case where it is
    not and in such cases it should be reset to avoid switching to incorrect
    frequencies.

    Consider this case for example:

    - policy->cur is 1.2 GHz (Max)
    - New request comes for 780 MHz and we store that in cached_raw_freq.
    - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
    - We then see the CPU wasn't idle recently and choose to keep the next
    freq as 1.2 GHz.
    - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
    1.2 GHz.
    - Now if the utilization doesn't change in the next request, the
    next target frequency will still be 780 MHz and it will match
    cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
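
    Sketch of the fix in sugov_update_single():

        if (busy && next_f < sg_policy->next_freq) {
                next_f = sg_policy->next_freq;

                /* restore the cached_raw_freq / next_freq invariant */
                sg_policy->cached_raw_freq = 0;
        }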

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Viresh Kumar
     

09 Nov, 2017

1 commit


05 Nov, 2017

1 commit

  • After commit 674e75411fc2 (sched: cpufreq: Allow remote cpufreq
    callbacks) we no longer always read the utilization for the CPU running
    the governor; instead we read it for the CPU which we've been told has
    updated utilization. That CPU is stored in sugov_cpu->cpu.

    The value is set in sugov_register() but we clear it in sugov_start()
    which leads to always looking at the utilization of CPU0 instead of
    the correct one.

    Fix this by consolidating the initialization code into sugov_start().
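
    Roughly, sugov_start() then initializes each per-CPU structure in one place
    (a sketch; field names as assumed from the description above):

        for_each_cpu(cpu, policy->cpus) {
                struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

                memset(sg_cpu, 0, sizeof(*sg_cpu));
                sg_cpu->cpu = cpu;              /* no longer wiped after registration */
                sg_cpu->sg_policy = sg_policy;
        }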

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Signed-off-by: Chris Redpath
    Reviewed-by: Patrick Bellasi
    Reviewed-by: Brendan Jackman
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Chris Redpath
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if the user registers intent to use this functionality early on, for
    example before the process spawns its second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.
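
    From user-space the intended flow looks roughly like this (a sketch; the
    command names are the ones this change introduces for 4.14):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* once per process, e.g. before spawning additional threads */
        syscall(__NR_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);

        /* later, from any thread; fails with EPERM without prior registration */
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);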

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

10 Oct, 2017

3 commits

  • While load_balance() masks the source CPUs against active_mask, it had
    a hole against the destination CPU. Ensure the destination CPU is also
    part of the 'domain-mask & active-mask' set.
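
    A minimal sketch of the idea, assuming the check is made against the
    balance environment's CPU mask (which is already 'domain-mask & active-mask'):

        /* in should_we_balance(): bail if the destination CPU dropped out */
        if (!cpumask_test_cpu(env->dst_cpu, env->cpus))
                return 0;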

    Reported-by: Levin, Alexander (Sasha Levin)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The trivial wake_affine_idle() implementation is very good for a
    number of workloads, but it comes apart the moment there are no
    idle CPUs left, IOW the overloaded case.

    hackbench:

    NO_WA_WEIGHT WA_WEIGHT

    hackbench-20 : 7.362717561 seconds 6.450509391 seconds

    (win)

    netperf:

    NO_WA_WEIGHT WA_WEIGHT

    TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3
    TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3
    TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3
    TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12
    TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4

    TCP_STREAM-1 : Avg: 41448.3 Avg: 42254
    TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9
    TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4
    TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57
    TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41

    TCP_RR-1 : Avg: 80473.5 Avg: 82638.8
    TCP_RR-10 : Avg: 72660.5 Avg: 73265.1
    TCP_RR-20 : Avg: 52607.1 Avg: 52634.5
    TCP_RR-40 : Avg: 57199.2 Avg: 56302.3
    TCP_RR-80 : Avg: 25330.3 Avg: 26867.9

    UDP_RR-1 : Avg: 108266 Avg: 107844
    UDP_RR-10 : Avg: 95480 Avg: 95245.2
    UDP_RR-20 : Avg: 68770.8 Avg: 68673.7
    UDP_RR-40 : Avg: 76231 Avg: 75419.1
    UDP_RR-80 : Avg: 34578.3 Avg: 35639.1

    UDP_STREAM-1 : Avg: 64684.3 Avg: 66606
    UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5
    UDP_STREAM-20 : Avg: 30376.4 Avg: 29704
    UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5
    UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97

    (wins and losses)

    sysbench:

    NO_WA_WEIGHT WA_WEIGHT

    sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec.
    sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec.
    sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec.
    sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec.
    sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec.
    sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec.

    sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec.
    sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec.
    sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec.
    sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec.
    sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec.
    sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec.

    (increase on the top end)

    tbench:

    NO_WA_WEIGHT

    Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms
    Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms
    Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms
    Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms
    Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms
    Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms

    WA_WEIGHT

    Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms
    Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms
    Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms
    Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms
    Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms
    Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms

    (increase on the top end)

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Eric reported a sysbench regression against commit:

    3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")

    Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
    against his v3.10 enterprise kernel.

    PRE (current tip/master):

    ivb-ep sysbench:

    2: [30 secs] transactions: 64110 (2136.94 per sec.)
    5: [30 secs] transactions: 143644 (4787.99 per sec.)
    10: [30 secs] transactions: 274298 (9142.93 per sec.)
    20: [30 secs] transactions: 418683 (13955.45 per sec.)
    40: [30 secs] transactions: 320731 (10690.15 per sec.)
    80: [30 secs] transactions: 355096 (11834.28 per sec.)

    hsw-ex NAS:

    OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
    OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
    OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
    lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
    lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
    lu.C.x_threads_144_run_3.log: Time in seconds = 433.83

    POST (+patch):

    ivb-ep sysbench:

    2: [30 secs] transactions: 64494 (2149.75 per sec.)
    5: [30 secs] transactions: 145114 (4836.99 per sec.)
    10: [30 secs] transactions: 278311 (9276.69 per sec.)
    20: [30 secs] transactions: 437169 (14571.60 per sec.)
    40: [30 secs] transactions: 669837 (22326.73 per sec.)
    80: [30 secs] transactions: 631739 (21055.88 per sec.)

    hsw-ex NAS:

    lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
    lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
    lu.C.x_threads_144_run_3.log: Time in seconds = 22.52

    This patch takes out all the shiny wake_affine() stuff and goes back to
    utter basics. Between the two CPUs involved with the wakeup (the CPU
    doing the wakeup and the CPU we ran on previously) pick the CPU we can
    run on _now_.

    This restores much of the regressions against the older kernels,
    but leaves some ground in the overloaded case. The default-enabled
    WA_WEIGHT (which will be introduced in the next patch) is an attempt
    to address the overloaded situation.
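
    For context, the "utter basics" amount to roughly this shape of the trivial
    wake_affine_idle() referenced in the previous entry (a sketch, not the
    exact code):

        static bool
        wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
                         int this_cpu, int prev_cpu, int sync)
        {
                /* prefer the waking CPU if it is idle, or about to be on a sync wakeup */
                if (idle_cpu(this_cpu))
                        return true;

                if (sync && cpu_rq(this_cpu)->nr_running == 1)
                        return true;

                return false;
        }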

    Reported-by: Eric Farman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Linus Torvalds
    Cc: Matthew Rosato
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: jinpuwang@gmail.com
    Cc: vcaputo@pengaru.com
    Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Sep, 2017

2 commits


15 Sep, 2017

2 commits

  • Now that we have added breaks in the wait queue scan and allow bookmark
    on scan position, we put this logic in the wake_up_page_bit function.

    We can have very long page wait lists on large systems where multiple
    pages share the same wait list. We break the wake-up walk here to give
    other CPUs a chance to access the list, and to avoid keeping interrupts
    disabled while traversing the list for too long. This reduces the interrupt
    and rescheduling latency, and excessive page wait queue lock hold time.

    [ v2: Remove bookmark_wake_function ]

    Signed-off-by: Tim Chen
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • We encountered workloads that have very long wake up lists on large
    systems. A waker takes a long time to traverse the entire wake list and
    execute all the wake functions.

    We saw page wait lists that are up to 3700+ entries long in tests of
    large 4 and 8 socket systems. It took 0.8 sec to traverse such a list
    during wake up. Any other CPU that contends for the list spin lock will
    spin for a long time. This is a result of the NUMA balancing migration of
    hot pages that are shared by many threads.

    Multiple CPUs waking are queued up behind the lock, and the last one
    queued has to wait until all CPUs did all the wakeups.

    The page wait list is traversed with interrupt disabled, which caused
    various problems. This was the original cause that triggered the NMI
    watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
    extending the NMI watch dog timer there helped.

    This patch bookmarks the waker's scan position in the wake list and breaks
    up the wake-up walk, to allow access to the list before the waker resumes
    its walk down the rest of the wait list. It lowers the interrupt and
    rescheduling latency.

    This patch also provides a performance boost when combined with the next
    patch to break up page wakeup list walk. We saw 22% improvement in the
    will-it-scale file pread2 test on a Xeon Phi system running 256 threads.

    [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
    simply access to flags. ]
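
    Sketch of the resulting walk (names such as WQ_FLAG_BOOKMARK and the extra
    bookmark argument to __wake_up_common() follow this patch's description):

        spin_lock_irqsave(&wq_head->lock, flags);
        nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                        wake_flags, key, &bookmark);
        spin_unlock_irqrestore(&wq_head->lock, flags);

        /* resume from the bookmark, giving other CPUs a window to take the lock */
        while (bookmark.flags & WQ_FLAG_BOOKMARK) {
                spin_lock_irqsave(&wq_head->lock, flags);
                nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                                wake_flags, key, &bookmark);
                spin_unlock_irqrestore(&wq_head->lock, flags);
        }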

    Reported-by: Kan Liang
    Tested-by: Kan Liang
    Signed-off-by: Tim Chen
    Signed-off-by: Linus Torvalds

    Tim Chen
     

14 Sep, 2017

1 commit


13 Sep, 2017

1 commit