12 Oct, 2012

2 commits

  • Pull scheduler fixes from Ingo Molnar:
    "A CPU hotplug related crash fix and a nohz accounting fixlet."

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Update sched_domains_numa_masks[][] when new cpus are onlined
    sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions
    nohz: Fix one jiffy count too far in idle cputime

    Linus Torvalds
     
  • Pull pile 2 of execve and kernel_thread unification work from Al Viro:
    "Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for
    several more architectures plus assorted signal fixes and cleanups.

    There'll be more (in particular, real fixes for the alpha
    do_notify_resume() irq mess)..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits)
    alpha: don't open-code trace_report_syscall_{enter,exit}
    Uninclude linux/freezer.h
    m32r: trim masks
    avr32: trim masks
    tile: don't bother with SIGTRAP in setup_frame
    microblaze: don't bother with SIGTRAP in setup_rt_frame()
    mn10300: don't bother with SIGTRAP in setup_frame()
    frv: no need to raise SIGTRAP in setup_frame()
    x86: get rid of duplicate code in case of CONFIG_VM86
    unicore32: remove pointless test
    h8300: trim _TIF_WORK_MASK
    parisc: decide whether to go to slow path (tracesys) based on thread flags
    parisc: don't bother looping in do_signal()
    parisc: fix double restarts
    bury the rest of TIF_IRET
    sanitize tsk_is_polling()
    bury _TIF_RESTORE_SIGMASK
    unicore32: unobfuscate _TIF_WORK_MASK
    mips: NOTIFY_RESUME is not needed in TIF masks
    mips: merge the identical "return from syscall" per-ABI code
    ...

    Conflicts:
    arch/arm/include/asm/thread_info.h

    Linus Torvalds
     

05 Oct, 2012

2 commits

  • Once array sched_domains_numa_masks[][] is defined, it is never updated.

    When a new cpu on a new node is onlined, the corresponding member in
    sched_domains_numa_masks[][] is not initialized, and all the masks are 0.
    As a result, build_overlap_sched_groups() will initialize a NULL
    sched_group for the new cpu on the new node, which leads to a kernel panic:

    [ 3189.403280] Call Trace:
    [ 3189.403286] [] warn_slowpath_common+0x7f/0xc0
    [ 3189.403289] [] warn_slowpath_null+0x1a/0x20
    [ 3189.403292] [] build_sched_domains+0x467/0x470
    [ 3189.403296] [] partition_sched_domains+0x307/0x510
    [ 3189.403299] [] ? partition_sched_domains+0x142/0x510
    [ 3189.403305] [] cpuset_update_active_cpus+0x83/0x90
    [ 3189.403308] [] cpuset_cpu_active+0x38/0x70
    [ 3189.403316] [] notifier_call_chain+0x67/0x150
    [ 3189.403320] [] ? native_cpu_up+0x18a/0x1b5
    [ 3189.403328] [] __raw_notifier_call_chain+0xe/0x10
    [ 3189.403333] [] __cpu_notify+0x20/0x40
    [ 3189.403337] [] _cpu_up+0xe9/0x131
    [ 3189.403340] [] cpu_up+0xdb/0xee
    [ 3189.403348] [] store_online+0x9c/0xd0
    [ 3189.403355] [] dev_attr_store+0x20/0x30
    [ 3189.403361] [] sysfs_write_file+0xa3/0x100
    [ 3189.403368] [] vfs_write+0xd0/0x1a0
    [ 3189.403371] [] sys_write+0x54/0xa0
    [ 3189.403375] [] system_call_fastpath+0x16/0x1b
    [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
    [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

    This patch registers a new notifier for the cpu hotplug notify chain, and
    updates sched_domains_numa_masks every time a new cpu is onlined or
    offlined (see the sketch below).

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    [ fixed compile warning ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Tang Chen
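
    A minimal sketch of the notifier approach described in this commit
    (editor's illustration based on the existing sched_domains_numa_masks[][]
    and sched_domains_numa_distance[] arrays; not the literal patch):

    static void sched_domains_numa_masks_set(int cpu)
    {
            int i, j, node = cpu_to_node(cpu);

            for (i = 0; i < sched_domains_numa_levels; i++)
                    for (j = 0; j < nr_node_ids; j++)
                            /* the cpu joins every mask whose node is close enough */
                            if (node_distance(j, node) <= sched_domains_numa_distance[i])
                                    cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
    }

    static int sched_domains_numa_masks_update(struct notifier_block *nfb,
                                               unsigned long action, void *hcpu)
    {
            int cpu = (long)hcpu;

            switch (action & ~CPU_TASKS_FROZEN) {
            case CPU_ONLINE:
                    sched_domains_numa_masks_set(cpu);
                    break;
            case CPU_DEAD:
                    /* mirror image: clear this cpu from the masks again */
                    break;
            default:
                    return NOTIFY_DONE;
            }
            return NOTIFY_OK;
    }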
     
  • We should temporarily reset 'sched_domains_numa_levels' to 0 while
    sched_domains_numa_masks[][] is being built in sched_init_numa(). If
    allocating memory for the array fails, the array will contain fewer than
    'level' members, which is dangerous when other functions use the count
    to iterate over sched_domains_numa_masks[][].

    This patch sets sched_domains_numa_levels to 0 before initializing the
    sched_domains_numa_masks[][] array, and resets it to 'level' only once
    sched_domains_numa_masks[][] is fully initialized (see the sketch below).

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Tang Chen
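
    A sketch of the intended ordering in sched_init_numa() (editor's
    illustration; nr_levels_detected() and alloc_node_masks() are hypothetical
    helpers standing in for the real allocation code):

    static void sched_init_numa(void)
    {
            int i, level = nr_levels_detected();

            /* Report 0 levels until the masks are fully built. */
            sched_domains_numa_levels = 0;

            sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
            if (!sched_domains_numa_masks)
                    return;                 /* count stays 0, array never used */

            for (i = 0; i < level; i++) {
                    sched_domains_numa_masks[i] = alloc_node_masks(i);
                    if (!sched_domains_numa_masks[i])
                            return;         /* partially built, count still 0 */
            }

            /* Only now is it safe for other code to walk the array. */
            sched_domains_numa_levels = level;
    }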
     

02 Oct, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Make the default just return 0. The current default (checking
    TIF_POLLING_NRFLAG) is moved into the architectures that need it;
    architectures that don't do polling in their idle threads don't need
    to define TIF_POLLING_NRFLAG at all.

    ia64 defined both TS_POLLING (used by its tsk_is_polling())
    and TIF_POLLING_NRFLAG (not used at all). Killed the latter...

    Signed-off-by: Al Viro

    Al Viro
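
    The resulting default, roughly (editor's sketch; architectures that poll
    in their idle loop override this in their own headers):

    /* default in the scheduler's private header when the arch does not poll */
    #ifndef tsk_is_polling
    #define tsk_is_polling(t) 0
    #endif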
     

26 Sep, 2012

4 commits

  • When exceptions or irq are about to resume userspace, if
    the task needs to be rescheduled, the arch low level code
    calls schedule() directly.

    If we call it, it is because we have the TIF_RESCHED flag:

    - It can be set after random local calls to set_need_resched()
    (RCU, drm, ...)

    - A wake up happened and the CPU needs preemption. This can
    happen in several ways:

    * Remotely: the remote waking CPU has set TIF_RESCHED and sends the
    wakee an IPI to schedule the new task.
    * Remotely enqueued: the remote waking CPU sends an IPI to the target
    and the wake up is made by the target.
    * Locally: waking CPU == wakee CPU and the wakeup is done locally.
    set_need_resched() is called without IPI.

    In the case of local and remotely enqueued wake ups, the tick can
    be restarted when we enqueue the new task, and RCU can exit the
    extended quiescent state at the same time. Then by the time we reach
    the irq exit path and call schedule(), we are no longer in RCU user mode.

    But if we call schedule() only because something called set_need_resched(),
    RCU may still be in user mode when we reach schedule().

    Also, if a wake up is done remotely, the CPU might see the TIF_RESCHED
    flag and call schedule() while the IPI has not yet arrived to restart the
    tick and exit RCU user mode.

    We need to manually protect against these corner cases.

    Create a new API schedule_user() that calls schedule() inside an
    rcu_user_exit()-rcu_user_enter() pair in order to protect it (see the
    sketch below). Archs will need to rely on it now to implement user
    preemption safely.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
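
    A minimal sketch of the wrapper (editor's illustration):

    /*
     * Called from arch code on the resume-to-userspace path instead of
     * schedule(), so RCU knows we are back in the kernel before the
     * scheduler runs any kernel code.
     */
    asmlinkage void __sched schedule_user(void)
    {
            rcu_user_exit();        /* leave the userspace extended quiescent state */
            schedule();
            rcu_user_enter();       /* re-enter it before resuming userspace */
    }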
     
  • When an exception or an irq exits, and we are going to resume into
    interrupted kernel code, the low level architecture code calls
    preempt_schedule_irq() if there is a need to reschedule.

    If the interrupt/exception occurred between a call to rcu_user_enter()
    (from syscall exit, exception exit, do_notify_resume exit, ...) and
    a real resume to userspace (iret, ...), preempt_schedule_irq() can be
    called while RCU thinks we are in userspace. But preempt_schedule_irq()
    is going to run kernel code and may run RCU read-side critical
    sections. We must exit the userspace extended quiescent state before
    we call it.

    To solve this, just call rcu_user_exit() at the beginning of
    preempt_schedule_irq() (see the sketch below).

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
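
    A sketch of where the new call sits (editor's abbreviation of
    preempt_schedule_irq(); unrelated details omitted):

    asmlinkage void __sched preempt_schedule_irq(void)
    {
            /* Catch callers which need to be fixed */
            BUG_ON(preempt_count() || !irqs_disabled());

            rcu_user_exit();        /* kernel code (and RCU reads) follow */

            do {
                    add_preempt_count(PREEMPT_ACTIVE);
                    local_irq_enable();
                    __schedule();
                    local_irq_disable();
                    sub_preempt_count(PREEMPT_ACTIVE);

                    /* re-check in case we missed another preemption point */
                    barrier();
            } while (need_resched());
    }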
     
  • Clear the syscalls hook of a task when it's scheduled out so that if
    the task migrates, it doesn't run the syscall slow path on a CPU
    that might not need it.

    Also set the syscalls hook on the next task if needed.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • Resolved conflict in kernel/sched/core.c using Peter Zijlstra's
    approach from https://lkml.org/lkml/2012/9/5/585.

    Paul E. McKenney
     

25 Sep, 2012

1 commit

  • Use a vtime-based naming as the prefix for the virtual cputime
    accounting APIs:

    - account_system_vtime() -> vtime_account()
    - account_switch_vtime() -> vtime_task_switch()

    This makes it easier to add further variants such as
    vtime_account_system(), vtime_account_idle(), ... if we want to find
    out, from generic code, which context we account to.

    It also makes it clearer which subsystem these APIs belong to (see the
    summary below).

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
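
    The renames, summarized as declarations (editor's sketch; signatures as
    the editor understands them):

    /* before */
    extern void account_system_vtime(struct task_struct *tsk);
    extern void account_switch_vtime(struct task_struct *prev);

    /* after: a common vtime_ prefix, leaving room for vtime_account_system(),
     * vtime_account_idle(), ... later on */
    extern void vtime_account(struct task_struct *tsk);
    extern void vtime_task_switch(struct task_struct *prev);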
     

23 Sep, 2012

1 commit

  • Rakib and Paul reported two different issues related to the same few
    lines of code.

    Rakib's issue is that the nr_uninterruptible migration code is wrong in
    that he sees artifacts due to it (Rakib, please do expand in more
    detail).

    Paul's issue is that this code as it stands relies on us using
    stop_machine() for unplug, we all would like to remove this assumption
    so that eventually we can remove this stop_machine() usage altogether.

    The only reason we'd have to migrate nr_uninterruptible is so that we
    could use for_each_online_cpu() loops instead of
    for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
    only such loop and it's using the possible mask, let's not bother at all.

    The problem Rakib sees is (probably) caused by the fact that by
    migrating nr_uninterruptible we screw up rq->calc_load_active for both
    rqs involved.

    So don't bother with fancy migration schemes (meaning we now have to
    keep using for_each_possible_cpu()) and instead fold any nr_active delta
    after we migrate all tasks away, to make sure we don't have any skewed
    nr_active accounting (see the sketch below).

    [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
    miscounting noted by Rakib. ]

    Reported-by: Rakib Mullick
    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
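
    A sketch of the fold-on-offline idea (editor's illustration, close to but
    not necessarily the exact patch):

    /*
     * After all tasks have been migrated away from the dead cpu, fold its
     * remaining nr_active delta into the global count instead of migrating
     * nr_uninterruptible between runqueues.
     */
    static void calc_load_migrate(struct rq *rq)
    {
            long delta = calc_load_fold_active(rq);

            if (delta)
                    atomic_long_add(delta, &calc_load_tasks);
    }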
     

17 Sep, 2012

1 commit

  • This reverts commit 970e178985cadbca660feb02f4d2ee3a09f7fdda.

    Nikolay Ulyanitsky reported that the 3.6-rc5 kernel has a 15-20%
    performance drop on PostgreSQL 9.2 on his machine (running "pgbench").

    Borislav Petkov was able to reproduce this, and bisected it to this
    commit 970e178985ca ("sched: Improve scalability via 'CPU buddies' ...")
    apparently because the new single-idle-buddy model simply doesn't find
    idle CPU's to reschedule on aggressively enough.

    Mike Galbraith suspects that it is likely due to the user-mode spinlocks
    in PostgreSQL not reacting well to preemption, but we don't really know
    the details - I'll just revert the commit for now.

    There are hopefully other approaches to improve scheduler scalability
    without it causing these kinds of downsides.

    Reported-by: Nikolay Ulyanitsky
    Bisected-by: Borislav Petkov
    Acked-by: Mike Galbraith
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Sep, 2012

2 commits

  • Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
    incomplete fix:

    In particular, the problem is that at the point it calls
    calc_load_migrate() nr_running := 1 (the stopper thread), so move the
    call to CPU_DEAD where we're sure that nr_running := 0.

    Also note that we can call calc_load_migrate() without serialization: we
    know the state of the rq is stable since its cpu is dead, and we modify
    the global state using appropriate atomic ops (see the sketch below).

    Suggested-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
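
    A sketch of the hotplug-notifier placement (editor's abbreviation of
    migration_call(); the other hotplug cases are omitted):

    static int __cpuinit
    migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
    {
            int cpu = (long)hcpu;
            struct rq *rq = cpu_rq(cpu);

            switch (action & ~CPU_TASKS_FROZEN) {
            /* ... other hotplug states ... */
            case CPU_DEAD:
                    /* nr_running is 0 here: even the stopper thread is gone */
                    calc_load_migrate(rq);
                    break;
            }
            return NOTIFY_OK;
    }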
     
  • Now that the last architecture to use this has stopped doing so (ARM,
    thanks Catalin!) we can remove this complexity from the scheduler
    core.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Catalin Marinas
    Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Sep, 2012

5 commits

  • It's impossible to enter the else branch if we have set
    skip_clock_update in task_yield_fair(), as yield_to_task_fair()
    will directly return true after invoking task_yield_fair().

    Signed-off-by: Michael Wang
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FF2925A.9060005@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Michael Wang
     
  • Various sd->*_idx values are used to index the rq's load average table
    when selecting a cpu to run. However, they can be set to any number
    through sysctl knobs, so a bad value can crash the kernel. Fix it by
    limiting them to the actual range (see the sketch below).

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
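
    One way to express the clamping with the standard sysctl helpers
    (editor's sketch; the placeholder names are not from the patch):

    static int min_load_idx;                          /* 0 */
    static int max_load_idx = CPU_LOAD_IDX_MAX - 1;   /* last valid rq->cpu_load[] slot */
    static int demo_busy_idx;                         /* stand-in for a real sd->busy_idx */

    static struct ctl_table sd_idx_table[] = {
            {
                    .procname     = "busy_idx",
                    .data         = &demo_busy_idx,
                    .maxlen       = sizeof(int),
                    .mode         = 0644,
                    .proc_handler = proc_dointvec_minmax,  /* rejects out-of-range writes */
                    .extra1       = &min_load_idx,
                    .extra2       = &max_load_idx,
            },
            { }
    };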
     
  • Merge in the current fixes branch; we are going to apply dependent patches.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • migrate_tasks() uses _pick_next_task_rt() to get tasks from the
    real-time runqueues to be migrated. When rt_rq is throttled
    _pick_next_task_rt() won't return anything, in which case
    migrate_tasks() can't move all threads over and gets stuck in an
    infinite loop.

    Instead unthrottle rt runqueues before migrating tasks.

    Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

    Signed-off-by: Peter Boonstoppel
    Signed-off-by: Peter Zijlstra
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
    Signed-off-by: Ingo Molnar

    Peter Boonstoppel
     
  • Rakib and Paul reported two different issues related to the same few
    lines of code.

    Rakib's issue is that the nr_uninterruptible migration code is wrong in
    that he sees artifacts due to it (Rakib, please do expand in more
    detail).

    Paul's issue is that this code as it stands relies on us using
    stop_machine() for unplug, we all would like to remove this assumption
    so that eventually we can remove this stop_machine() usage altogether.

    The only reason we'd have to migrate nr_uninterruptible is so that we
    could use for_each_online_cpu() loops instead of
    for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
    only such loop and it's using the possible mask, let's not bother at all.

    The problem Rakib sees is (probably) caused by the fact that by
    migrating nr_uninterruptible we screw up rq->calc_load_active for both
    rqs involved.

    So don't bother with fancy migration schemes (meaning we now have to
    keep using for_each_possible_cpu()) and instead fold any nr_active delta
    after we migrate all tasks away, to make sure we don't have any skewed
    nr_active accounting.

    Reported-by: Rakib Mullick
    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Aug, 2012

2 commits

  • The archs that implement virtual cputime accounting all
    flush the cputime of a task when it gets descheduled
    and sometimes set up some ground initialization for the
    next task to account its cputime.

    These archs all put their own hooks in their context
    switch callbacks and handle the off-case themselves.

    Consolidate this by creating a new account_switch_vtime() callback,
    called in generic code right after a context switch, which these archs
    must implement to flush the prev task's cputime and initialize the next
    task's cputime-related state (see the sketch below).

    Signed-off-by: Frederic Weisbecker
    Acked-by: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
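
    A sketch of the consolidation (editor's illustration; header and file
    placement assumed): the generic context-switch tail calls one hook, and
    architectures without virtual cputime accounting get an empty stub.

    #ifdef CONFIG_VIRT_CPU_ACCOUNTING
    extern void account_switch_vtime(struct task_struct *prev);
    #else
    static inline void account_switch_vtime(struct task_struct *prev) { }
    #endif

    /* in the scheduler core, right after a context switch */
    static void finish_task_switch(struct rq *rq, struct task_struct *prev)
    {
            /* ... */
            account_switch_vtime(prev);     /* flush prev, prime the next task */
            /* ... */
    }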
     
  • Extract the cputime code from the giant sched/core.c and
    put it in its own file. This makes it easier to deal with
    this particular area and de-bloats core.c a bit more.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

14 Aug, 2012

5 commits

  • Since the power-saving code has been removed from the scheduler, the
    related code in this function is dead and even pollutes other logic:
    for example, 'want_sd' never gets a chance to be set to 0, which
    defeats the effect of SD_WAKE_AFFINE here.

    So, clean up the obsolete code, including SD_PREFER_LOCAL.

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
    Signed-off-by: Thomas Gleixner

    Alex Shi
     
  • This patch adds a comment on top of the schedule() function to explain
    to scheduler newbies how the main scheduler function is entered.

    Acked-by: Randy Dunlap
    Explained-by: Ingo Molnar
    Explained-by: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.org
    Signed-off-by: Thomas Gleixner

    Pekka Enberg
     
  • Thomas Gleixner
     
  • With multiple instances of task_groups, for_each_rt_rq() is a noop,
    no task groups having been added to the rt.c list instance. This
    renders __enable/disable_runtime() and print_rt_stats() noop, the
    user (non) visible effect being that rt task groups are missing in
    /proc/sched_debug.

    Signed-off-by: Mike Galbraith
    Cc: stable@kernel.org # v3.3+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
     
  • On architectures where cputime_t is a 64-bit type, it is possible to
    trigger a divide-by-zero on the do_div(temp, (__force u32) total) line
    if total is non-zero but has its lower 32 bits zeroed. Removing the cast
    is not a good solution, since some do_div() implementations cast to u32
    internally (see the demonstration below).

    This problem can be triggered in practice on very long-lived processes:

    PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
    #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
    #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
    #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
    #3 [ffff880472a51cd0] die at ffffffff8100f26b
    #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
    #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
    #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
    [exception RIP: thread_group_times+0x56]
    RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
    RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
    RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
    RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
    R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
    R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
    #8 [ffff880472a51f40] sys_times at ffffffff81088524
    #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
    RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
    RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
    RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
    R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
    R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
    ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b

    Cc: stable@vger.kernel.org
    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
    Signed-off-by: Thomas Gleixner

    Stanislaw Gruszka
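
    A tiny userspace illustration of the trigger (editor's demo, not kernel
    code): a 64-bit total can be non-zero while its low 32 bits, the part
    do_div() would divide by after the cast, are all zero.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /* about 2^32 time units: non-zero, but the low 32 bits are zero */
            uint64_t total = 0x100000000ULL;
            uint32_t truncated = (uint32_t)total;

            printf("total = %llu, (u32)total = %u\n",
                   (unsigned long long)total, truncated);

            if (truncated == 0)
                    printf("do_div(temp, (u32)total) would divide by zero here\n");

            return 0;
    }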
     

01 Aug, 2012

1 commit

  • Pull perf updates from Ingo Molnar:
    "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
    updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
    fixes) now that Arnaldo is back from vacation."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    uprobes: __replace_page() needs munlock_vma_page()
    uprobes: Rename vma_address() and make it return "unsigned long"
    uprobes: Fix register_for_each_vma()->vma_address() check
    uprobes: Introduce vaddr_to_offset(vma, vaddr)
    uprobes: Teach build_probe_list() to consider the range
    uprobes: Remove insert_vm_struct()->uprobe_mmap()
    uprobes: Remove copy_vma()->uprobe_mmap()
    uprobes: Fix overflow in vma_address()/find_active_uprobe()
    uprobes: Suppress uprobe_munmap() from mmput()
    uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
    uprobes: Clean up and document write_opcode()->lock_page(old_page)
    uprobes: Kill write_opcode()->lock_page(new_page)
    uprobes: __replace_page() should not use page_address_in_vma()
    uprobes: Don't recheck vma/f_mapping in write_opcode()
    perf/x86: Fix missing struct before structure name
    perf/x86: Fix format definition of SNB-EP uncore QPI box
    perf/x86: Make bitfield unsigned
    perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
    perf/x86: Add Intel Nehalem-EX uncore support
    perf/x86: Fix typo in format definition of uncore PCU filter
    ...

    Linus Torvalds
     

26 Jul, 2012

2 commits

  • Otherwise they can't be filtered for a defined task:

    perf record -e sched:sched_switch ./foo

    This command doesn't report any events without this patch.

    I think it isn't a security concern if someone knows who will
    be executed next - this can already be observed by polling /proc
    state. By default perf is disabled for non-root users in any case.

    I need these events for profiling sleep times. sched_switch is used for
    getting callchains and sched_stat_* is used for getting time periods.
    These events are combined in user space, then it can be analyzed by
    perf tools.

    Signed-off-by: Andrew Vagin
    Signed-off-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Arun Sharma
    Link: http://lkml.kernel.org/r/1342088069-1005148-1-git-send-email-avagin@openvz.org
    Signed-off-by: Ingo Molnar

    Andrew Vagin
     
  • It seems there's no specific reason to open-code it. I guess
    commit 0122ec5b02f76 ("sched: Add p->pi_lock to task_rq_lock()")
    simply missed it. Let's be consistent with others.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1341647342-6742-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     

24 Jul, 2012

4 commits

  • Stefan reported a crash on a kernel before a3e5d1091c1 ("sched:
    Don't call task_group() too many times in set_task_rq()"), he
    found the reason to be that the multiple task_group()
    invocations in set_task_rq() returned different values.

    Looking at all that I found a lack of serialization and plain
    wrong comments.

    The below tries to fix it using an extra pointer which is
    updated under the appropriate scheduler locks. It's not pretty,
    but I can't really see another way given how all the cgroup
    stuff works.

    Reported-and-tested-by: Stefan Bader
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Traversing an entire package is not only expensive, it also leads to tasks
    bouncing all over a partially idle and possibly quite large package. Fix
    that up by assigning each CPU a 'buddy' CPU to try to motivate: each CPU
    may try to motivate that one other CPU; if it's busy, tough, it may then
    try its SMT sibling, but that's all this optimization is allowed to cost
    (see the sketch below).

    Sibling cache buddies are cross-wired to prevent bouncing.

    4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:

    clients       1      2      4      8     16     32     64    128
    ..................................................................
    pre          30     41    118    645   3769   6214  12233  14312
    post        299    603   1211   2418   4697   6847  11606  14557

    A nice increase in performance.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
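
    A rough sketch of the buddy idea (editor's illustration; the field name
    'idle_buddy' and the surrounding details are assumptions, not a quote of
    the patch): instead of scanning every core in the package for an idle
    CPU, only the pre-assigned buddy at each domain level is checked.

    static int select_idle_sibling(struct task_struct *p, int target)
    {
            struct sched_domain *sd;

            if (idle_cpu(target))
                    return target;

            /*
             * Only poke the pre-wired buddy of each domain level (cache
             * buddy, then its SMT sibling) instead of traversing the
             * whole package.
             */
            for (sd = rcu_dereference(per_cpu(sd_llc, target)); sd; sd = sd->child) {
                    if (!cpumask_test_cpu(sd->idle_buddy, tsk_cpus_allowed(p)))
                            continue;
                    if (idle_cpu(sd->idle_buddy))
                            return sd->idle_buddy;
            }

            return target;
    }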
     
  • Separate out the cpuset related handling for CPU/Memory online/offline.
    This also helps us exploit the most obvious and basic level of optimization
    that any notification mechanism (CPU/Mem online/offline) has to offer us:
    "We *know* why we have been invoked. So stop pretending that we are lost,
    and do only the necessary amount of processing!".

    And while at it, rename scan_for_empty_cpusets() to
    scan_cpusets_upon_hotplug(), which is more appropriate considering how
    it is restructured.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
    masks as and when necessary to ensure that the tasks belonging to the cpusets
    have some place (online CPUs) to run on. And regular CPU hotplug is
    destructive in the sense that the kernel doesn't remember the original cpuset
    configurations set by the user, across hotplug operations.

    However, suspend/resume (which uses CPU hotplug) is a special case in which
    the kernel has the responsibility to restore the system (during resume), to
    exactly the same state it was in before suspend.

    In order to achieve that, do the following:

    1. Don't modify cpusets during suspend/resume. At all.
    In particular, don't move the tasks from one cpuset to another, and
    don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
    during the CPU hotplug operations that are carried out in the
    suspend/resume path.

    2. However, cpusets and sched domains are related. We just want to avoid
    altering cpusets alone. So, to keep the sched domains updated, build
    a single sched domain (containing all active cpus) during each of the
    CPU hotplug operations carried out in s/r path, effectively ignoring
    the cpusets' cpus_allowed masks.

    (Since userspace is frozen while doing all this, it will go unnoticed.)

    3. During the last CPU online operation during resume, build the sched
    domains by looking up the (unaltered) cpusets' cpus_allowed masks.
    That will bring back the system to the same original state as it was in
    before suspend.

    Ultimately, this will not only solve the cpuset problem related to
    suspend/resume (i.e., it restores the cpusets to exactly what they were
    before suspend, by not touching them at all) but also speeds up
    suspend/resume, because we avoid running the cpuset update code for
    every CPU being offlined/onlined.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     

15 Jul, 2012

1 commit

  • …t-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull RCU, perf, and scheduler fixes from Ingo Molnar.

    The RCU fix is a revert for an optimization that could cause deadlocks.

    One of the scheduler commits (164c33c6adee "sched: Fix fork() error path
    to not crash") is correct but not complete (some architectures like Tile
    are not covered yet) - the resulting additional fixes are still WIP and
    Ingo did not want to delay these pending fixes. See this thread on
    lkml:

    [PATCH] fork: fix error handling in dup_task()

    The perf fixes are just trivial oneliners.

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf kvm: Fix segfault with report and mixed guestmount use
    perf kvm: Fix regression with guest machine creation
    perf script: Fix format regression due to libtraceevent merge
    ring-buffer: Fix accounting of entries when removing pages
    ring-buffer: Fix crash due to uninitialized new_pages list head

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    MAINTAINERS/sched: Update scheduler file pattern
    sched/nohz: Rewrite and fix load-avg computation -- again
    sched: Fix fork() error path to not crash

    Linus Torvalds
     

06 Jul, 2012

1 commit

  • Thanks to Charles Wang for spotting the defects in the current code:

    - If we go idle during the sample window -- after sampling, we get a
    negative bias because we can negate our own sample.

    - If we wake up during the sample window we get a positive bias
    because we push the sample to a known active period.

    So rewrite the entire nohz load-avg muck once again, now adding
    copious documentation to the code.

    Reported-and-tested-by: Doug Smythies
    Reported-and-tested-by: Charles Wang
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
    [ minor edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Jul, 2012

1 commit

  • This reverts commit 616c310e83b872024271c915c1b9ab505b9efad9.
    (Move PREEMPT_RCU preemption to switch_to() invocation).
    Testing by Sasha Levin showed that this
    can result in deadlock due to invoking the scheduler when one of
    the runqueue locks is held. Because this commit was simply a
    performance optimization, revert it.

    Reported-by: Sasha Levin
    Signed-off-by: Paul E. McKenney
    Tested-by: Sasha Levin

    Paul E. McKenney
     

06 Jun, 2012

1 commit

  • Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine
    calls the set_domain_attribute() routine prior to setting sd->level;
    however, set_domain_attribute() relies on sd->level to decide whether
    idle load balancing will be off/on.

    The requested level does not get processed because sched_domain_level_max
    is 0 at the time that setup_relax_domain_level() is run.

    Simply accept the value as it is, as we don't know the value of
    sched_domain_level_max until sched domain construction is completed.

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
    Signed-off-by: Ingo Molnar

    Dimitri Sivanich