12 Oct, 2012

2 commits

  • Pull scheduler fixes from Ingo Molnar:
    "A CPU hotplug related crash fix and a nohz accounting fixlet."

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Update sched_domains_numa_masks[][] when new cpus are onlined
    sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions
    nohz: Fix one jiffy count too far in idle cputime

    Linus Torvalds
     
  • Pull pile 2 of execve and kernel_thread unification work from Al Viro:
    "Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for
    several more architectures plus assorted signal fixes and cleanups.

    There'll be more (in particular, real fixes for the alpha
    do_notify_resume() irq mess)..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits)
    alpha: don't open-code trace_report_syscall_{enter,exit}
    Uninclude linux/freezer.h
    m32r: trim masks
    avr32: trim masks
    tile: don't bother with SIGTRAP in setup_frame
    microblaze: don't bother with SIGTRAP in setup_rt_frame()
    mn10300: don't bother with SIGTRAP in setup_frame()
    frv: no need to raise SIGTRAP in setup_frame()
    x86: get rid of duplicate code in case of CONFIG_VM86
    unicore32: remove pointless test
    h8300: trim _TIF_WORK_MASK
    parisc: decide whether to go to slow path (tracesys) based on thread flags
    parisc: don't bother looping in do_signal()
    parisc: fix double restarts
    bury the rest of TIF_IRET
    sanitize tsk_is_polling()
    bury _TIF_RESTORE_SIGMASK
    unicore32: unobfuscate _TIF_WORK_MASK
    mips: NOTIFY_RESUME is not needed in TIF masks
    mips: merge the identical "return from syscall" per-ABI code
    ...

    Conflicts:
    arch/arm/include/asm/thread_info.h

    Linus Torvalds
     

05 Oct, 2012

2 commits

  • Once the array sched_domains_numa_masks[][] is defined, it is never
    updated.

    When a new cpu on a new node is onlined, the corresponding member in
    sched_domains_numa_masks[][] is not initialized, and all the masks are 0.
    As a result, build_overlap_sched_groups() will initialize a NULL
    sched_group for the new cpu on the new node, which leads to a kernel panic:

    [ 3189.403280] Call Trace:
    [ 3189.403286] [] warn_slowpath_common+0x7f/0xc0
    [ 3189.403289] [] warn_slowpath_null+0x1a/0x20
    [ 3189.403292] [] build_sched_domains+0x467/0x470
    [ 3189.403296] [] partition_sched_domains+0x307/0x510
    [ 3189.403299] [] ? partition_sched_domains+0x142/0x510
    [ 3189.403305] [] cpuset_update_active_cpus+0x83/0x90
    [ 3189.403308] [] cpuset_cpu_active+0x38/0x70
    [ 3189.403316] [] notifier_call_chain+0x67/0x150
    [ 3189.403320] [] ? native_cpu_up+0x18a/0x1b5
    [ 3189.403328] [] __raw_notifier_call_chain+0xe/0x10
    [ 3189.403333] [] __cpu_notify+0x20/0x40
    [ 3189.403337] [] _cpu_up+0xe9/0x131
    [ 3189.403340] [] cpu_up+0xdb/0xee
    [ 3189.403348] [] store_online+0x9c/0xd0
    [ 3189.403355] [] dev_attr_store+0x20/0x30
    [ 3189.403361] [] sysfs_write_file+0xa3/0x100
    [ 3189.403368] [] vfs_write+0xd0/0x1a0
    [ 3189.403371] [] sys_write+0x54/0xa0
    [ 3189.403375] [] system_call_fastpath+0x16/0x1b
    [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
    [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

    This patch registers a new notifier on the cpu hotplug notify chain and
    updates sched_domains_numa_masks[][] every time a cpu is onlined or
    offlined (a sketch follows this entry).

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    [ fixed compile warning ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Tang Chen
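
    A minimal sketch of the mechanism described above, not the literal patch:
    a callback on the cpu hotplug notifier chain that adds or removes the
    onlined/offlined cpu in sched_domains_numa_masks[][]. The set/clear helper
    names below follow the description and should be treated as assumptions.

      static int sched_domains_numa_masks_update(struct notifier_block *nfb,
                                                  unsigned long action, void *hcpu)
      {
          int cpu = (long)hcpu;

          switch (action & ~CPU_TASKS_FROZEN) {
          case CPU_ONLINE:
              /* add the new cpu to every mask whose node is close enough */
              sched_domains_numa_masks_set(cpu);
              break;
          case CPU_DEAD:
              /* drop the cpu from all masks so stale bits never linger */
              sched_domains_numa_masks_clear(cpu);
              break;
          default:
              return NOTIFY_DONE;
          }
          return NOTIFY_OK;
      }

      /* registered early (e.g. via hotcpu_notifier()) so later onlines are seen */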
     
  • Keep 'sched_domains_numa_levels' at 0 while sched_init_numa() is still
    building its data. If allocating memory for the array
    sched_domains_numa_masks[][] fails, the array will contain fewer than
    'level' members, which is dangerous when other functions use
    'sched_domains_numa_levels' to iterate over sched_domains_numa_masks[][].

    This patch sets sched_domains_numa_levels to 0 before initializing the
    sched_domains_numa_masks[][] array, and resets it to 'level' only when
    sched_domains_numa_masks[][] is fully initialized (see the ordering sketch
    after this entry).

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Tang Chen
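
    A sketch of the ordering inside sched_init_numa() after the fix (allocation
    details simplified and assumed): the published level count stays 0 until
    the whole masks array exists.

      sched_domains_numa_levels = 0;      /* nobody may iterate the array yet */

      sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
      if (!sched_domains_numa_masks)
          return;                         /* levels stays 0: iterators see nothing */

      for (i = 0; i < level; i++) {
          sched_domains_numa_masks[i] =
              kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
          if (!sched_domains_numa_masks[i])
              return;                     /* partially built: leave levels at 0 */
      }

      /* only now is it safe for other functions to walk all 'level' entries */
      sched_domains_numa_levels = level;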
     

02 Oct, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Make the default just return 0. The current default (checking
    TIF_POLLING_NRFLAG) is moved to the architectures that need it; ones
    that don't do polling in their idle threads don't need to define
    TIF_POLLING_NRFLAG at all (see the sketch after this entry).

    ia64 defined both TS_POLLING (used by its tsk_is_polling())
    and TIF_POLLING_NRFLAG (not used at all). Killed the latter...

    Signed-off-by: Al Viro

    Al Viro
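
    The default reduces to something like this sketch; architectures that
    really poll in their idle loop (via TIF_POLLING_NRFLAG or, on ia64,
    TS_POLLING) provide their own definition.

      /* generic fallback: the arch does not poll in its idle loop */
      #ifndef tsk_is_polling
      #define tsk_is_polling(t) 0
      #endif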
     

26 Sep, 2012

4 commits

  • When exceptions or irq are about to resume userspace, if
    the task needs to be rescheduled, the arch low level code
    calls schedule() directly.

    If we call it, it is because we have the TIF_RESCHED flag:

    - It can be set after random local calls to set_need_resched()
    (RCU, drm, ...)

    - A wake up happened and the CPU needs preemption. This can
    happen in several ways:

    * Remotely: the remote waking CPU has set TIF_RESCHED and sends the
    wakee an IPI to schedule the new task.
    * Remotely enqueued: the remote waking CPU sends an IPI to the target
    and the wake up is made by the target.
    * Locally: waking CPU == wakee CPU and the wakeup is done locally.
    set_need_resched() is called without IPI.

    In the case of local and remotely enqueued wake ups, the tick can
    be restarted when we enqueue the new task, and RCU can exit the
    extended quiescent state at the same time. Then by the time we reach
    the irq exit path and call schedule(), we are not in RCU user mode.

    But if we call schedule() only because something called set_need_resched(),
    RCU may still be in user mode when we reach schedule.

    Also, if a wake up is done remotely, the CPU might see the TIF_RESCHED
    flag and call schedule() while the IPI has not yet arrived to restart
    the tick and exit RCU user mode.

    We need to manually protect against these corner cases.

    Create a new API schedule_user() that calls schedule() inside an
    rcu_user_exit()-rcu_user_enter() pair in order to protect it (a sketch
    follows this entry). Archs will now need to rely on it to implement
    user preemption safely.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
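
    A sketch of the new helper, close to what the description calls for: wrap
    schedule() in an RCU user-mode exit/re-enter pair so the corner cases
    above cannot run the scheduler while RCU still thinks the CPU is in
    userspace.

      asmlinkage void __sched schedule_user(void)
      {
          /*
           * We may be here after a random set_need_resched(), or after a
           * remote wakeup whose IPI has not arrived yet: RCU may still be
           * in user (extended quiescent) mode, so leave it around schedule().
           */
          rcu_user_exit();
          schedule();
          rcu_user_enter();
      }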
     
  • When an exception or an irq exits, and we are going to resume into
    interrupted kernel code, the low level architecture code calls
    preempt_schedule_irq() if there is a need to reschedule.

    If the interrupt/exception occurred between a call to rcu_user_enter()
    (from syscall exit, exception exit, do_notify_resume exit, ...) and
    the real resume to userspace (iret, ...), preempt_schedule_irq() can be
    called while RCU thinks we are in userspace. But preempt_schedule_irq()
    is going to run kernel code, possibly including RCU read-side critical
    sections. We must exit the userspace extended quiescent state before
    we call it.

    To solve this, just call rcu_user_exit() at the beginning of
    preempt_schedule_irq() (a sketch follows this entry).

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
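
    A trimmed sketch of where the call lands (the body of
    preempt_schedule_irq() is abridged): exit the user extended quiescent
    state before any kernel code that might contain RCU read-side critical
    sections runs.

      asmlinkage void __sched preempt_schedule_irq(void)
      {
          struct thread_info *ti = current_thread_info();

          /* catch callers which need to be fixed */
          BUG_ON(ti->preempt_count || !irqs_disabled());

          rcu_user_exit();        /* we may still be in RCU user mode here */

          do {
              add_preempt_count(PREEMPT_ACTIVE);
              local_irq_enable();
              __schedule();
              local_irq_disable();
              sub_preempt_count(PREEMPT_ACTIVE);
              barrier();
          } while (need_resched());
      }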
     
  • Clear the syscalls hook of a task when it's scheduled out so that if
    the task migrates, it doesn't run the syscall slow path on a CPU
    that might not need it.

    Also set the syscalls hook on the next task if needed (a hypothetical
    sketch follows this entry).

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
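
    A purely hypothetical sketch of the hand-off; the flag and helper names
    below are illustrative and not the patch's. The idea: drop the per-task
    syscall slow-path flag from the outgoing task and arm it on the incoming
    one only if this CPU wants the hook.

      /* TIF_SYSCALL_HOOK and cpu_wants_syscall_hook() are made-up names */
      static inline void syscall_hook_task_switch(struct task_struct *prev,
                                                  struct task_struct *next)
      {
          clear_tsk_thread_flag(prev, TIF_SYSCALL_HOOK);

          if (cpu_wants_syscall_hook(smp_processor_id()))
              set_tsk_thread_flag(next, TIF_SYSCALL_HOOK);
      }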
     
  • Resolved conflict in kernel/sched/core.c using Peter Zijlstra's
    approach from https://lkml.org/lkml/2012/9/5/585.

    Paul E. McKenney
     

25 Sep, 2012

2 commits

  • Move the code that works out to which context we account the
    cputime into the generic layer.

    Archs that consider the whole time spent in the idle task as idle
    time (ia64, powerpc) can rely on the generic vtime_account()
    and implement vtime_account_system() and vtime_account_idle(),
    letting the generic code decide when to call which API (a sketch
    follows this entry).

    Archs that have their own meaning of idle time, such as s390,
    which only considers the time spent in CPU low power mode as idle
    time, can just override vtime_account().

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
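
    A sketch of the generic dispatcher this describes: the context decision
    (system vs. idle) lives in common code, while vtime_account_system() and
    vtime_account_idle() stay arch-provided, and an arch such as s390 can
    still override vtime_account() entirely.

      void vtime_account(struct task_struct *tsk)
      {
          unsigned long flags;

          local_irq_save(flags);

          if (in_interrupt() || !is_idle_task(tsk))
              vtime_account_system(tsk);      /* arch-provided */
          else
              vtime_account_idle(tsk);        /* arch-provided */

          local_irq_restore(flags);
      }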
     
  • Use a naming based on vtime as a prefix for virtual based
    cputime accounting APIs:

    - account_system_vtime() -> vtime_account()
    - account_switch_vtime() -> vtime_task_switch()

    This makes it easier to add further variants such
    as vtime_account_system(), vtime_account_idle(), ... if we
    want to work out, from generic code, the context we account to.

    It also makes it clearer which subsystem these APIs
    belong to.

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

23 Sep, 2012

1 commit

  • Rakib and Paul reported two different issues related to the same few
    lines of code.

    Rakib's issue is that the nr_uninterruptible migration code is wrong in
    that he sees artifacts due to this (Rakib, please do expand in more
    detail).

    Paul's issue is that this code as it stands relies on us using
    stop_machine() for unplug; we would all like to remove this assumption
    so that eventually we can remove this stop_machine() usage altogether.

    The only reason we'd have to migrate nr_uninterruptible is so that we
    could use for_each_online_cpu() loops in favour of
    for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
    only such loop and it is already using possible, let's not bother at all.

    The problem Rakib sees is (probably) caused by the fact that by
    migrating nr_uninterruptible we screw up rq->calc_load_active for both
    rqs involved.

    So don't bother with fancy migration schemes (meaning we now have to
    keep using for_each_possible_cpu()) and instead fold any nr_active delta
    after we migrate all tasks away, to make sure we don't have any skewed
    nr_active accounting (a sketch of the fold follows this entry).

    [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
    miscounting noted by Rakib. ]

    Reported-by: Rakib Mullick
    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
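
    A sketch of the fold described above: once the dead CPU's tasks have been
    migrated away, any remaining nr_active delta is pushed into the global
    count instead of being migrated between runqueues.

      static void calc_load_migrate(struct rq *rq)
      {
          long delta = calc_load_fold_active(rq);

          if (delta)
              atomic_long_add(delta, &calc_load_tasks);
      }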
     

17 Sep, 2012

1 commit

  • This reverts commit 970e178985cadbca660feb02f4d2ee3a09f7fdda.

    Nikolay Ulyanitsky reported that the 3.6-rc5 kernel has a 15-20%
    performance drop on PostgreSQL 9.2 on his machine (running "pgbench").

    Borislav Petkov was able to reproduce this, and bisected it to this
    commit 970e178985ca ("sched: Improve scalability via 'CPU buddies' ...")
    apparently because the new single-idle-buddy model simply doesn't find
    idle CPUs to reschedule on aggressively enough.

    Mike Galbraith suspects that it is likely due to the user-mode spinlocks
    in PostgreSQL not reacting well to preemption, but we don't really know
    the details - I'll just revert the commit for now.

    There are hopefully other approaches to improve scheduler scalability
    without it causing these kinds of downsides.

    Reported-by: Nikolay Ulyanitsky
    Bisected-by: Borislav Petkov
    Acked-by: Mike Galbraith
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Sep, 2012

5 commits

  • Heterogeneous ARM platforms use the arch_scale_freq_power function
    to reflect the relative capacity of each core (a sketch follows this
    entry).

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1341826026-6504-6-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
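
    A hypothetical arch-side sketch (the per-cpu table name 'cpu_capacity' is
    an assumption, not taken from the patch): the hook returns each core's
    relative capacity, pre-scaled against SCHED_POWER_SCALE, so the load
    balancer weighs big cores more heavily than little ones.

      /* illustrative only: 'cpu_capacity' is a made-up per-cpu table */
      static DEFINE_PER_CPU(unsigned long, cpu_capacity) = SCHED_POWER_SCALE;

      unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
      {
          return per_cpu(cpu_capacity, cpu);
      }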
     
  • There is no load_balancer to be selected now. It just sets the
    state of the nohz tick to stop.

    So rename the function, pass the 'cpu' as a parameter and then
    remove the useless call from tick_nohz_restart_sched_tick().

    [ s/set_nohz_tick_stopped/nohz_balance_enter_idle/g
    s/clear_nohz_tick_stopped/nohz_balance_exit_idle/g ]
    Signed-off-by: Alex Shi
    Acked-by: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1347261059-24747-1-git-send-email-alex.shi@intel.com
    Signed-off-by: Ingo Molnar

    Alex Shi
     
  • Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
    incomplete fix:

    In particular, the problem is that at the point it calls
    calc_load_migrate(), nr_running := 1 (the stopper thread), so move the
    call to CPU_DEAD where we're sure that nr_running := 0 (see the notifier
    sketch after this entry).

    Also note that we can call calc_load_migrate() without serialization; we
    know the state of the rq is stable since its cpu is dead, and we modify
    the global state using appropriate atomic ops.

    Suggested-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
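
    A trimmed sketch of the placement in the migration notifier (heavily
    abridged; only the relevant case is shown): by CPU_DEAD even the stopper
    thread is gone, so folding the load average there is exact.

      static int migration_call(struct notifier_block *nfb,
                                unsigned long action, void *hcpu)
      {
          struct rq *rq = cpu_rq((long)hcpu);

          switch (action & ~CPU_TASKS_FROZEN) {
          case CPU_DEAD:
              calc_load_migrate(rq);  /* nr_running is really 0 here */
              break;
          }
          return NOTIFY_OK;
      }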
     
  • Now that the last architecture to use this has stopped doing so (ARM,
    thanks Catalin!) we can remove this complexity from the scheduler
    core.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Catalin Marinas
    Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • On tickless systems, one CPU runs load balance for all idle CPUs.

    The cpu_load of this CPU is updated before starting the load balance
    of each other idle CPU. We should instead update the cpu_load of
    the balance_cpu (a sketch follows this entry).

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Cc: Venkatesh Pallipadi
    Cc: Suresh Siddha
    Link: http://lkml.kernel.org/r/1347509486-8688-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
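
    A sketch of the fix inside nohz_idle_balance(): refresh the clock and
    cpu_load of the CPU being balanced rather than of the CPU doing the
    balancing.

      struct rq *rq = cpu_rq(balance_cpu);

      raw_spin_lock_irq(&rq->lock);
      update_rq_clock(rq);
      update_idle_cpu_load(rq);       /* previously refreshed this_rq's load */
      raw_spin_unlock_irq(&rq->lock);

      rebalance_domains(balance_cpu, CPU_IDLE);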
     

04 Sep, 2012

7 commits

  • It's impossible to enter the else branch if we have set
    skip_clock_update in task_yield_fair(), as yield_to_task_fair()
    will directly return true after invoking task_yield_fair().

    Signed-off-by: Michael Wang
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FF2925A.9060005@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Michael Wang
     
  • The various sd->*_idx values are used to refer to the rq's load average
    table when selecting a cpu to run. However, they can be set to any number
    via sysctl knobs, which can crash the kernel if a bad value is given.
    Fix it by limiting them to the valid range (see the sketch after this
    entry).

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
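
    The patch bounds the values at the sysctl level; an equivalent guard, as a
    purely hypothetical sketch (the helper name is made up), would clamp any
    supplied index before it touches rq->cpu_load[]:

      /* hypothetical: keep any sysctl-supplied index inside the table */
      static inline int bounded_load_idx(int idx)
      {
          return clamp(idx, 0, CPU_LOAD_IDX_MAX - 1);
      }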
     
  • Commit beac4c7e4a1c ("sched: Remove AFFINE_WAKEUPS feature") removed
    use of the flag but left the definition. Get rid of it.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1345090865-20851-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     
  • Merge in the current fixes branch, we are going to apply dependent patches.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Fix two kernel-doc warnings in kernel/sched/fair.c:

    Warning(kernel/sched/fair.c:3660): Excess function parameter 'cpus' description in 'update_sg_lb_stats'
    Warning(kernel/sched/fair.c:3806): Excess function parameter 'cpus' description in 'update_sd_lb_stats'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/50303714.3090204@xenotime.net
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     
  • migrate_tasks() uses _pick_next_task_rt() to get tasks from the
    real-time runqueues to be migrated. When the rt_rq is throttled,
    _pick_next_task_rt() won't return anything, in which case
    migrate_tasks() can't move all threads over and gets stuck in an
    infinite loop.

    Instead unthrottle rt runqueues before migrating tasks.

    Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

    Signed-off-by: Peter Boonstoppel
    Signed-off-by: Peter Zijlstra
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
    Signed-off-by: Ingo Molnar

    Peter Boonstoppel
     
  • Rakib and Paul reported two different issues related to the same few
    lines of code.

    Rakib's issue is that the nr_uninterruptible migration code is wrong in
    that he sees artifacts due to this (Rakib, please do expand in more
    detail).

    Paul's issue is that this code as it stands relies on us using
    stop_machine() for unplug; we would all like to remove this assumption
    so that eventually we can remove this stop_machine() usage altogether.

    The only reason we'd have to migrate nr_uninterruptible is so that we
    could use for_each_online_cpu() loops in favour of
    for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
    only such loop and it is already using possible, let's not bother at all.

    The problem Rakib sees is (probably) caused by the fact that by
    migrating nr_uninterruptible we screw up rq->calc_load_active for both
    rqs involved.

    So don't bother with fancy migration schemes (meaning we now have to
    keep using for_each_possible_cpu()) and instead fold any nr_active delta
    after we migrate all tasks away, to make sure we don't have any skewed
    nr_active accounting.

    Reported-by: Rakib Mullick
    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Aug, 2012

2 commits

  • The archs that implement virtual cputime accounting all
    flush the cputime of a task when it gets descheduled,
    and sometimes set up some initial state so that the
    next task's cputime can be accounted.

    These archs all put their own hooks in their context
    switch callbacks and handle the off-case themselves.

    Consolidate this by creating a new account_switch_vtime()
    callback, called from generic code right after a context switch,
    that these archs must implement to flush the prev task's
    cputime and initialize the next task's cputime-related state
    (a sketch follows this entry).

    Signed-off-by: Frederic Weisbecker
    Acked-by: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
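
    A sketch of the hook's shape: archs with CONFIG_VIRT_CPU_ACCOUNTING supply
    the implementation, everyone else gets an empty stub, and generic code
    calls it right after the context switch (e.g. from finish_task_switch()).

      #ifdef CONFIG_VIRT_CPU_ACCOUNTING
      extern void account_switch_vtime(struct task_struct *prev);
      #else
      static inline void account_switch_vtime(struct task_struct *prev) { }
      #endif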
     
  • Extract the cputime code from the giant sched/core.c and
    put it in its own file. This makes it easier to deal with
    this particular area and de-bloats core.c a bit more.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

14 Aug, 2012

10 commits

  • Since the power-saving code has been removed from the scheduler, its
    implementation in this function is dead code and even pollutes other
    logic: for example, 'want_sd' never gets a chance to be set to 0, which
    removes the effect of SD_WAKE_AFFINE here.

    So, clean up the obsolete code, including SD_PREFER_LOCAL.

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
    Signed-off-by: Thomas Gleixner

    Alex Shi
     
  • As we already have dst_rq in lb_env, using or changing "this_rq" does not
    make sense.

    This patch replaces "this_rq" with dst_rq in load_balance(), so we
    no longer need to change "this_rq" while processing LBF_SOME_PINNED.

    Signed-off-by: Michael Wang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/501F8357.3070102@linux.vnet.ibm.com
    Signed-off-by: Thomas Gleixner

    Michael Wang
     
  • This patch adds a comment on top of the schedule() function to explain
    to scheduler newbies how the main scheduler function is entered.

    Acked-by: Randy Dunlap
    Explained-by: Ingo Molnar
    Explained-by: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.org
    Signed-off-by: Thomas Gleixner

    Pekka Enberg
     
  • It should be sched_nr_latency so fix it before it annoys me more.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344435364-18632-1-git-send-email-bp@amd64.org
    Signed-off-by: Thomas Gleixner

    Borislav Petkov
     
  • Thomas Gleixner
     
  • Make the stop scheduler class do the same accounting as the other classes
    (a sketch follows this entry).

    Migration threads can be caught in the act while doing exec balancing,
    leading to the output below due to use of an unmaintained ->se.exec_start.
    The load that triggered this particular instance was an apparently
    out-of-control, heavily threaded application doing system monitoring in
    what amounted to an exec bomb, with one of the VERY frequently migrated
    tasks being ps.

    %CPU PID USER CMD
    99.3 45 root [migration/10]
    97.7 53 root [migration/12]
    97.0 57 root [migration/13]
    90.1 49 root [migration/11]
    89.6 65 root [migration/15]
    88.7 17 root [migration/3]
    80.4 37 root [migration/8]
    78.1 41 root [migration/9]
    44.2 13 root [migration/2]

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
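
    A hedged sketch of the accounting the stop class gains, modeled on what
    the fair and rt classes already do when a task is switched out (simplified,
    not the literal patch):

      static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
      {
          u64 delta_exec = rq->clock_task - prev->se.exec_start;

          if (unlikely((s64)delta_exec < 0))
              delta_exec = 0;

          prev->se.sum_exec_runtime += delta_exec;
          account_group_exec_runtime(prev, delta_exec);
          prev->se.exec_start = rq->clock_task;   /* keep exec_start maintained */
      }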
     
  • Root task group bandwidth replenishment must service all CPUs, regardless of
    where the timer was last started, and regardless of the isolation mechanism,
    lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
     
  • With multiple instances of task_groups, for_each_rt_rq() is a noop,
    no task groups having been added to the rt.c list instance. This
    renders __enable/disable_runtime() and print_rt_stats() noop, the
    user (non) visible effect being that rt task groups are missing in
    /proc/sched_debug.

    Signed-off-by: Mike Galbraith
    Cc: stable@kernel.org # v3.3+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
     
  • On architectures where cputime_t is a 64-bit type, it is possible to
    trigger a divide by zero on the do_div(temp, (__force u32) total) line if
    total is non-zero but has its lower 32 bits zeroed. Removing the cast is
    not a good solution, since some do_div() implementations cast to u32
    internally; a sketch of a 64-bit-safe scaling helper follows this entry.

    This problem can be triggered in practice on very long-lived processes:

    PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
    #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
    #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
    #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
    #3 [ffff880472a51cd0] die at ffffffff8100f26b
    #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
    #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
    #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
    [exception RIP: thread_group_times+0x56]
    RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
    RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
    RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
    RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
    R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
    R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
    #8 [ffff880472a51f40] sys_times at ffffffff81088524
    #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
    RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
    RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
    RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
    R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
    R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
    ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b

    Cc: stable@vger.kernel.org
    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
    Signed-off-by: Thomas Gleixner

    Stanislaw Gruszka
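
    A sketch of the 64-bit-safe scaling the description calls for: do_div()
    truncates the divisor to u32, so divide with full 64-bit math when
    cputime_t is 64 bits (div_u64()/div64_u64() are the stock kernel helpers).

      static cputime_t scale_utime(cputime_t utime, cputime_t rtime, cputime_t total)
      {
          u64 temp = (__force u64) rtime;

          temp *= (__force u64) utime;

          if (sizeof(cputime_t) == 4)
              /* 32-bit cputime_t: a u32 divisor cannot silently become 0 */
              temp = div_u64(temp, (__force u32) total);
          else
              /* 64-bit cputime_t: keep the full divisor */
              temp = div64_u64(temp, (__force u64) total);

          return (__force cputime_t) temp;
      }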
     
  • Peter Portante reported that for large cgroup hierarchies (and/or on
    large CPU counts) we get immense lock contention on rq->lock and stuff
    stops working properly.

    His workload was a ton of processes, each in their own cgroup,
    everybody idling except for a sporadic wakeup once every so often.

    It was found that:

    schedule()
      idle_balance()
        load_balance()
          local_irq_save()
          double_rq_lock()
          update_h_load()
            walk_tg_tree(tg_load_down)
              tg_load_down()

    Results in an entire cgroup hierarchy walk under rq->lock for every
    new-idle balance, and since new-idle balance isn't throttled this
    results in a lot of work while holding the rq->lock.

    This patch does two things: it removes the work from under rq->lock,
    based on the good principle of race-and-pray which is widely employed
    in the load balancer as a whole, and it throttles the
    update_h_load() calculation to at most once per jiffy (a sketch follows
    this entry).

    I considered excluding update_h_load() for new-idle balance
    all-together, but purely relying on regular balance passes to update
    this data might not work out under some rare circumstances where the
    new-idle busiest isn't the regular busiest for a while (unlikely, but
    a nightmare to debug if someone hits it and suffers).

    Cc: pjt@google.com
    Cc: Larry Woodman
    Cc: Mike Galbraith
    Reported-by: Peter Portante
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
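
    A sketch of the once-per-jiffy throttle (the rq field name follows the
    description and should be treated as illustrative): the expensive
    walk_tg_tree() hierarchy walk is skipped if it already ran in the current
    jiffy.

      static void update_h_load(long cpu)
      {
          struct rq *rq = cpu_rq(cpu);
          unsigned long now = jiffies;

          if (rq->h_load_throttle == now)
              return;                 /* already refreshed this jiffy */

          rq->h_load_throttle = now;

          rcu_read_lock();
          walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
          rcu_read_unlock();
      }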
     

04 Aug, 2012

1 commit