30 Sep, 2011

1 commit

  • David reported:

    Attached below is a watered-down version of rt/tst-cpuclock2.c from
    GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
    similar.

    Run it several times, and you will see cases where the main thread
    will measure a process clock difference before and after the nanosleep
    which is smaller than the cpu-burner thread's individual thread clock
    difference. This doesn't make any sense since the cpu-burner thread
    is part of the top-level process's thread group.

    I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
    64-bit binaries).

    For example:

    [davem@boricha build-x86_64-linux]$ ./test
    process: before(0.001221967) after(0.498624371) diff(497402404)
    thread: before(0.000081692) after(0.498316431) diff(498234739)
    self: before(0.001223521) after(0.001240219) diff(16698)
    [davem@boricha build-x86_64-linux]$

    The diff of 'process' should always be >= the diff of 'thread'.

    I make sure to wrap the 'thread' clock measurements the most tightly
    around the nanosleep() call, and that the 'process' clock measurements
    are the outer-most ones.

    ---
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <fcntl.h>
    #include <string.h>
    #include <errno.h>
    #include <pthread.h>

    static pthread_barrier_t barrier;

    static void *chew_cpu(void *arg)
    {
        pthread_barrier_wait(&barrier);
        while (1)
            __asm__ __volatile__("" : : : "memory");
        return NULL;
    }

    int main(void)
    {
        clockid_t process_clock, my_thread_clock, th_clock;
        struct timespec process_before, process_after;
        struct timespec me_before, me_after;
        struct timespec th_before, th_after;
        struct timespec sleeptime;
        unsigned long diff;
        pthread_t th;
        int err;

        err = clock_getcpuclockid(0, &process_clock);
        if (err)
            return 1;

        err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
        if (err)
            return 1;

        pthread_barrier_init(&barrier, NULL, 2);
        err = pthread_create(&th, NULL, chew_cpu, NULL);
        if (err)
            return 1;

        err = pthread_getcpuclockid(th, &th_clock);
        if (err)
            return 1;

        pthread_barrier_wait(&barrier);

        err = clock_gettime(process_clock, &process_before);
        if (err)
            return 1;

        err = clock_gettime(my_thread_clock, &me_before);
        if (err)
            return 1;

        err = clock_gettime(th_clock, &th_before);
        if (err)
            return 1;

        sleeptime.tv_sec = 0;
        sleeptime.tv_nsec = 500000000;
        nanosleep(&sleeptime, NULL);

        err = clock_gettime(th_clock, &th_after);
        if (err)
            return 1;

        err = clock_gettime(my_thread_clock, &me_after);
        if (err)
            return 1;

        err = clock_gettime(process_clock, &process_after);
        if (err)
            return 1;

        diff = process_after.tv_nsec - process_before.tv_nsec;
        printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
               process_before.tv_sec, process_before.tv_nsec,
               process_after.tv_sec, process_after.tv_nsec, diff);
        diff = th_after.tv_nsec - th_before.tv_nsec;
        printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
               th_before.tv_sec, th_before.tv_nsec,
               th_after.tv_sec, th_after.tv_nsec, diff);
        diff = me_after.tv_nsec - me_before.tv_nsec;
        printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
               me_before.tv_sec, me_before.tv_nsec,
               me_after.tv_sec, me_after.tv_nsec, diff);

        return 0;
    }
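
    Note that the diff computation above subtracts only tv_nsec; that is
    sufficient here because every measured interval stays well under one
    second. A full timespec difference, should the test ever be extended,
    could use a small helper along these lines (hypothetical, not part of
    the original test):

    static long long ts_diff_ns(const struct timespec *before,
                                const struct timespec *after)
    {
        /* Fold seconds and nanoseconds into one signed nanosecond count. */
        return (long long)(after->tv_sec - before->tv_sec) * 1000000000LL
               + (after->tv_nsec - before->tv_nsec);
    }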

    This is due to us using p->se.sum_exec_runtime in
    thread_group_cputime() where we iterate the thread group and sum all
    data. This does not take time since the last schedule operation (tick
    or otherwise) into account. We can cure this by using
    task_sched_runtime() at the cost of having to take locks.

    This also means we can (and must) do away with
    thread_group_sched_runtime() since the modified thread_group_cputime()
    is now more accurate and would deadlock when called from
    thread_group_sched_runtime().

    Aside from that, it makes the function safe on 32-bit systems. The old
    code added t->se.sum_exec_runtime unprotected; sum_exec_runtime is a
    64-bit value and could be changed on another CPU at the same time.
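
    A rough sketch of the summation change described above (simplified;
    not the verbatim patch, locking and cputime conversions omitted):

    do {
        times->utime = cputime_add(times->utime, t->utime);
        times->stime = cputime_add(times->stime, t->stime);
        /* was: times->sum_exec_runtime += t->se.sum_exec_runtime; */
        times->sum_exec_runtime += task_sched_runtime(t);
    } while_each_thread(tsk, t);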

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
    Tested-by: David Miller
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

26 Sep, 2011

1 commit

  • Commit c259e01a1ec ("sched: Separate the scheduler entry for
    preemption") contained a boo-boo wrecking wchan output: it forgot to
    put the new schedule() function in the __sched section, so schedule()
    no longer gets properly ignored for things like wchan.

    Tested-by: Simon Kirby
    Cc: stable@kernel.org # 2.6.39+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca
    Signed-off-by: Ingo Molnar

    Simon Kirby
     

29 Aug, 2011

4 commits

  • The current cgroup context switch code was incorrect, leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any
    cgroup switches in case the outgoing and incoming tasks are
    in the same cgroup.

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct count for the non-cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed
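
    The ping/pong benchmark itself is not included in the report. A minimal
    sketch of such a pairwise context-switch test (hypothetical: two
    processes pinned to CPU1 with sched_setaffinity(), bouncing a byte over
    two pipes for ten seconds) might look like this:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <time.h>

    static void pin_to_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set))
            perror("sched_setaffinity");
    }

    int main(void)
    {
        int ping[2], pong[2];
        unsigned long iters = 0;
        time_t start;
        char c = 0;

        if (pipe(ping) || pipe(pong))
            return 1;

        if (fork() == 0) {
            /* Child: echo every byte straight back. */
            pin_to_cpu(1);
            while (read(ping[0], &c, 1) == 1)
                if (write(pong[1], &c, 1) != 1)
                    break;
            _exit(0);
        }

        pin_to_cpu(1);
        start = time(NULL);
        while (time(NULL) - start < 10) {
            if (write(ping[1], &c, 1) != 1 || read(pong[0], &c, 1) != 1)
                break;
            iters++;
        }
        close(ping[1]);             /* child's read() now returns 0, it exits */
        wait(NULL);

        /* Each round trip forces roughly two context switches. */
        printf("%.2f ctxsw/s\n", (double)(2 * iters) / 10.0);
        return 0;
    }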

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • This patch fixes the following memory leak:

    unreferenced object 0xffff880107266800 (size 512):
    comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] create_object+0x187/0x28b
    [] kmemleak_alloc+0x73/0x98
    [] __kmalloc_node+0x104/0x159
    [] kzalloc_node.clone.97+0x15/0x17
    [] build_sched_domains+0xb7/0x7f3
    [] partition_sched_domains+0x1db/0x24a
    [] do_rebuild_sched_domains+0x3b/0x47
    [] rebuild_sched_domains+0x10/0x12
    [] sched_power_savings_store+0x6c/0x7b
    [] sched_mc_power_savings_store+0x16/0x18
    [] sysdev_class_store+0x20/0x22
    [] sysfs_write_file+0x108/0x144
    [] vfs_write+0xaf/0x102
    [] sys_write+0x4d/0x74
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    Signed-off-by: WANG Cong
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org # 3.0
    Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
    Signed-off-by: Ingo Molnar

    WANG Cong
     
  • There is no real reason to run blk_schedule_flush_plug() with
    interrupts and preemption disabled.

    Move it into schedule() and call it when the task is going voluntarily
    to sleep. There might be false positives when the task is woken
    between that call and actually scheduling, but that's not really
    different from being woken immediately after switching away.

    This fixes a deadlock in the scheduler where the
    blk_schedule_flush_plug() callchain enables interrupts and thereby
    allows a wakeup to happen of the task that's going to sleep.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: stable@kernel.org # 2.6.39+
    Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Block-IO and workqueues call into notifier functions from the
    scheduler core code with interrupts and preemption disabled. These
    calls should be made before entering the scheduler core.

    To simplify this, separate the scheduler core code into
    __schedule(). __schedule() is directly called from the places which
    set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
    checks into schedule(), so they are only called when a task voluntarily
    goes to sleep.
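
    Together with the blk_schedule_flush_plug() change above, the resulting
    shape of the scheduler entry is roughly the following (a simplified
    sketch, not necessarily the exact code):

    static inline void sched_submit_work(struct task_struct *tsk)
    {
        if (!tsk->state)
            return;                 /* not going to sleep, nothing to do */
        /*
         * Going to sleep with plugged block IO queued: submit it here,
         * with interrupts and preemption still enabled, to avoid deadlocks.
         */
        if (blk_needs_flush_plug(tsk))
            blk_schedule_flush_plug(tsk);
    }

    asmlinkage void __sched schedule(void)
    {
        sched_submit_work(current);
        __schedule();               /* core entry, also called directly by
                                       the PREEMPT_ACTIVE paths */
    }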

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: stable@kernel.org # 2.6.39+
    Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

26 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    fs: Merge split strings
    treewide: fix potentially dangerous trailing ';' in #defined values/expressions
    uwb: Fix misspelling of neighbourhood in comment
    net, netfilter: Remove redundant goto in ebt_ulog_packet
    trivial: don't touch files that are removed in the staging tree
    lib/vsprintf: replace link to Draft by final RFC number
    doc: Kconfig: `to be' -> `be'
    doc: Kconfig: Typo: square -> squared
    doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
    drivers/net: static should be at beginning of declaration
    drivers/media: static should be at beginning of declaration
    drivers/i2c: static should be at beginning of declaration
    XTENSA: static should be at beginning of declaration
    SH: static should be at beginning of declaration
    MIPS: static should be at beginning of declaration
    ARM: static should be at beginning of declaration
    rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
    Update my e-mail address
    PCIe ASPM: forcedly -> forcibly
    gma500: push through device driver tree
    ...

    Fix up trivial conflicts:
    - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
    - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
    - drivers/net/r8169.c (just context changes)

    Linus Torvalds
     

25 Jul, 2011

1 commit

  • * 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
    KVM: IOMMU: Disable device assignment without interrupt remapping
    KVM: MMU: trace mmio page fault
    KVM: MMU: mmio page fault support
    KVM: MMU: reorganize struct kvm_shadow_walk_iterator
    KVM: MMU: lockless walking shadow page table
    KVM: MMU: do not need atomicly to set/clear spte
    KVM: MMU: introduce the rules to modify shadow page table
    KVM: MMU: abstract some functions to handle fault pfn
    KVM: MMU: filter out the mmio pfn from the fault pfn
    KVM: MMU: remove bypass_guest_pf
    KVM: MMU: split kvm_mmu_free_page
    KVM: MMU: count used shadow pages on prepareing path
    KVM: MMU: rename 'pt_write' to 'emulate'
    KVM: MMU: cleanup for FNAME(fetch)
    KVM: MMU: optimize to handle dirty bit
    KVM: MMU: cache mmio info on page fault path
    KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
    KVM: MMU: do not update slot bitmap if spte is nonpresent
    KVM: MMU: fix walking shadow page table
    KVM guest: KVM Steal time registration
    ...

    Linus Torvalds
     

23 Jul, 2011

3 commits

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
    sched: Cleanup duplicate local variable in [enqueue|dequeue]_task_fair
    sched: Replace use of entity_key()
    sched: Separate group-scheduling code more clearly
    sched: Reorder root_domain to remove 64 bit alignment padding
    sched: Do not attempt to destroy uninitialized rt_bandwidth
    sched: Remove unused function cpu_cfs_rq()
    sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED'
    sched, cgroup: Optimize load_balance_fair()
    sched: Don't update shares twice on on_rq parent
    sched: update correct entity's runtime in check_preempt_wakeup()
    xtensa: Use generic config PREEMPT definition
    h8300: Use generic config PREEMPT definition
    m32r: Use generic PREEMPT config
    sched: Skip autogroup when looking for all rt sched groups
    sched: Simplify mutex_spin_on_owner()
    sched: Remove rcu_read_lock() from wake_affine()
    sched: Generalize sleep inside spinlock detection
    sched: Make sleeping inside spinlock detection working in !CONFIG_PREEMPT
    sched: Isolate preempt counting in its own config option
    sched: Remove pointless in_atomic() definition check
    ...

    Linus Torvalds
     
  • …git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (123 commits)
    perf: Remove the nmi parameter from the oprofile_perf backend
    x86, perf: Make copy_from_user_nmi() a library function
    perf: Remove perf_event_attr::type check
    x86, perf: P4 PMU - Fix typos in comments and style cleanup
    perf tools: Make test use the preset debugfs path
    perf tools: Add automated tests for events parsing
    perf tools: De-opt the parse_events function
    perf script: Fix display of IP address for non-callchain path
    perf tools: Fix endian conversion reading event attr from file header
    perf tools: Add missing 'node' alias to the hw_cache[] array
    perf probe: Support adding probes on offline kernel modules
    perf probe: Add probed module in front of function
    perf probe: Introduce debuginfo to encapsulate dwarf information
    perf-probe: Move dwarf library routines to dwarf-aux.{c, h}
    perf probe: Remove redundant dwarf functions
    perf probe: Move strtailcmp to string.c
    perf probe: Rename DIE_FIND_CB_FOUND to DIE_FIND_CB_END
    tracing/kprobe: Update symbol reference when loading module
    tracing/kprobes: Support module init function probing
    kprobes: Return -ENOENT if probe point doesn't exist
    ...

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    lockdep: Fix lockdep_no_validate against IRQ states
    mutex: Make mutex_destroy() an inline function
    plist: Remove the need to supply locks to plist heads
    lockup detector: Fix reference to the non-existent CONFIG_DETECT_SOFTLOCKUP option

    Linus Torvalds
     

22 Jul, 2011

6 commits

  • Clean up cfs/rt runqueue initialization by moving group-scheduling
    related code into the corresponding functions.

    Also, keep group scheduling as an add-on, so that things are only done
    additionally, i.e. remove the init_*_rq() calls from init_tg_*_entry().
    (This removes a redundant initialization during sched_init().)

    In case of group scheduling, rt_rq->highest_prio.curr is now initialized
    twice, but adding another #ifdef seems not worth it.

    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1310661163-16606-1-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan H. Schönherr
     
  • Reorder root_domain to remove 8 bytes of alignment padding on 64-bit
    builds; this shrinks the size from 1736 to 1728 bytes, using one fewer
    cacheline.
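
    As a generic illustration of this kind of change (hypothetical struct,
    not the actual root_domain layout): grouping the two ints avoids the
    padding otherwise inserted before each 8-byte pointer on 64-bit.

    #include <stdio.h>

    struct padded    { int a; void *p; int b; void *q; };  /* 32 bytes on x86-64 */
    struct reordered { int a; int b; void *p; void *q; };  /* 24 bytes on x86-64 */

    int main(void)
    {
        printf("%zu %zu\n", sizeof(struct padded), sizeof(struct reordered));
        return 0;
    }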

    Signed-off-by: Richard Kennedy
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1310726492.1977.5.camel@castor.rsk
    Signed-off-by: Ingo Molnar

    Richard Kennedy
     
  • If a task group is to be created and alloc_fair_sched_group() fails,
    then the rt_bandwidth of the corresponding task group is not yet
    initialized. The caller, sched_create_group(), starts a clean up
    procedure which calls free_rt_sched_group() which unconditionally
    destroys the not yet initialized rt_bandwidth.

    This crashes or hangs the system in lock_hrtimer_base(): UP systems
    dereference a NULL pointer, while SMP systems loop endlessly on a
    condition that cannot become true.

    This patch simply avoids the destruction of rt_bandwidth when the
    initialization code path was not reached.

    (This was discovered by accident with a custom kernel modification.)

    Signed-off-by: Bianca Lutz
    Signed-off-by: Jan Schoenherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1310580816-10861-7-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Bianca Lutz
     
  • This patch fixes a typo located in a comment.

    Signed-off-by: Jan Schoenherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1310580816-10861-2-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan Schoenherr
     
  • Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(); this
    way load_balance_fair() only iterates those task_groups that actually
    have tasks on busiest, and we iterate bottom-up, trying to move light
    groups before the heavier ones.

    No idea if it will actually work out to be beneficial in practice; does
    anybody have a cgroup workload that might show a difference one way or
    the other?

    [ Also move update_h_load to sched_fair.c, losing the #ifdef-ery ]

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Paul Turner
    Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: pick up the latest scheduler fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Jul, 2011

6 commits

  • …l/git/tip/linux-2.6-tip

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    signal: align __lock_task_sighand() irq disabling and RCU
    softirq,rcu: Inform RCU of irq_exit() activity
    sched: Add irq_{enter,exit}() to scheduler_ipi()
    rcu: protect __rcu_read_unlock() against scheduler-using irq handlers
    rcu: Streamline code produced by __rcu_read_unlock()
    rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special
    rcu: decrease rcu_report_exp_rnp coupling with scheduler

    Linus Torvalds
     
  • …ck/linux-2.6-rcu into core/urgent

    Ingo Molnar
     
  • Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual
    work. Traditionally we never did any actual work from the resched IPI
    and all magic happened in the return from interrupt path.

    Now that we do do some work, we need to ensure irq_{enter,exit} are
    called so that we don't confuse things.

    This affects things like timekeeping, NO_HZ and RCU, basically
    everything with a hook in irq_enter/exit.

    Explicit examples of things going wrong are:

    sched_clock_cpu() -- has a callback when leaving NO_HZ state to take
    a new reading from GTOD and TSC. Without this
    callback, time is stuck in the past.

    RCU -- needs in_irq() to work in order to avoid some nasty deadlocks
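
    In other words, the handler only pays for irq_enter()/irq_exit() when
    there is actual work pending; schematically (the early-out check below
    is a placeholder, not the real test):

    void scheduler_ipi(void)
    {
        if (no_pending_ipi_work())      /* placeholder for the real check */
            return;

        irq_enter();
        /* ... the actual resched-IPI work ... */
        irq_exit();
    }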

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
     
  • When creating sched_domains, stop when we've covered the entire
    target span instead of continuing to create domains, only to
    later find they're redundant and throw them away again.

    This keeps single node systems from touching the funny NUMA
    sched_domain creation code and reduces the risks of the new
    SD_OVERLAP code.

    Requested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Cc: Anton Blanchard
    Cc: mahesh@linux.vnet.ibm.com
    Cc: benh@kernel.crashing.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Allow for sched_domain spans that overlap by giving such domains their
    own sched_group list instead of sharing the sched_groups amongst
    each-other.

    This is needed for machines with more than 16 nodes, because
    sched_domain_node_span() will generate a node mask from the
    16 nearest nodes without regard to whether these masks overlap.

    Currently sched_domains have a sched_group that maps to their child
    sched_domain span, and since there is no overlap we share the
    sched_group between the sched_domains of the various CPUs. If however
    there is overlap, we would need to link the sched_group list in
    different ways for each cpu, and hence sharing isn't possible.

    In order to solve this, allocate private sched_groups for each CPU's
    sched_domain but have the sched_groups share a sched_group_power
    structure such that we can uniquely track the power.

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to prepare for non-unique sched_groups per domain, we need to
    carry the cpu_power elsewhere, so put a level of indirection in.
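
    Schematically, the indirection looks roughly like this (field set
    simplified, not the full definitions):

    struct sched_group_power {
        atomic_t ref;                   /* shared by the per-cpu group copies */
        unsigned int power;
    };

    struct sched_group {
        struct sched_group *next;
        struct sched_group_power *sgp;  /* was: cpu_power carried directly */
        unsigned long cpumask[0];
    };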

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

16 Jul, 2011

1 commit

  • Commit 3fe1698b7fe0 ("sched: Deal with non-atomic min_vruntime reads
    on 32bit") forgot to initialize min_vruntime_copy which could lead to
    an infinite while loop in task_waking_fair() under some circumstances
    (early boot, lucky timing).

    [ This bug was also reported by others who blamed it on RCU
    initialization problems ]
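
    The 32-bit read side pairs min_vruntime with a copy and spins until
    both reads match (a sketch of the pattern); with the copy never
    initialized, the loop can fail to terminate:

    /* reader side on 32-bit (sketch) */
    do {
        min_vruntime_copy = cfs_rq->min_vruntime_copy;
        smp_rmb();
        min_vruntime = cfs_rq->min_vruntime;
    } while (min_vruntime != min_vruntime_copy);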

    Reported-and-tested-by: Bruno Wolff III
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

14 Jul, 2011

2 commits

  • This patch makes update_rq_clock() aware of steal time.
    The mechanism of operation is not different from irq_time,
    and follows the same principles. This lives in a CONFIG
    option itself, and can be compiled out independently of
    the rest of steal time reporting. The effect of disabling it
    is that the scheduler will still report steal time (that cannot be
    disabled), but won't use this information for cpu power adjustments.

    Every time update_rq_clock_task() is invoked, we query how much time
    was stolen since the last call and feed it into sched_rt_avg_update().

    Although steal time reporting in account_process_tick() keeps
    track of the last time we read the steal clock, in prev_steal_time,
    this patch does it independently using another field,
    prev_steal_time_rq. This is because otherwise, information about time
    accounted in update_process_tick() would never reach us in update_rq_clock().
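
    For reference, the steal time the kernel accounts ends up visible to
    userspace as the eighth value on the cpu lines of /proc/stat; a minimal
    reader (illustrative only, separate from the kernel-side change):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long usr, nic, sys, idle, iow, hirq, sirq, steal;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
            return 1;
        if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &usr, &nic, &sys, &idle, &iow, &hirq, &sirq, &steal) == 8)
            printf("steal: %llu ticks\n", steal);
        fclose(f);
        return 0;
    }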

    Signed-off-by: Glauber Costa
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Tested-by: Eric B Munson
    CC: Jeremy Fitzhardinge
    CC: Anthony Liguori
    Signed-off-by: Avi Kivity

    Glauber Costa
     
  • This patch accounts steal time in account_process_tick().
    If one or more ticks are considered stolen in the current
    accounting cycle, user/system accounting is skipped. Idle is fine,
    since the hypervisor does not report steal time if the guest
    is halted.

    Accounting steal time from the core scheduler gives us the
    advantage of direct access to the runqueue data. In a later
    opportunity, it can be used to tweak cpu power and make
    the scheduler aware of the time it lost.

    [avi: doesn't exist on many archs]

    Signed-off-by: Glauber Costa
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Tested-by: Eric B Munson
    CC: Jeremy Fitzhardinge
    CC: Anthony Liguori
    Signed-off-by: Avi Kivity

    Glauber Costa
     

08 Jul, 2011

1 commit

  • This was legacy code brought over from the RT tree and
    is no longer necessary.

    Signed-off-by: Dima Zavin
    Acked-by: Thomas Gleixner
    Cc: Daniel Walker
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/1310084879-10351-2-git-send-email-dima@android.com
    Signed-off-by: Ingo Molnar

    Dima Zavin
     

01 Jul, 2011

5 commits

  • …ederic/random-tracing into sched/core

    Ingo Molnar
     
  • The nmi parameter indicated if we could do wakeups from the current
    context, if not, we would set some state and self-IPI and let the
    resulting interrupt do the wakeup.

    For the various event classes:

    - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
    the PMI-tail (ARM etc.)
    - tracepoint: nmi=0; since tracepoint could be from NMI context.
    - software: nmi=[0,1]; some, like the schedule thing cannot
    perform wakeups, and hence need 0.

    As one can see, there is very little nmi=1 usage, and the down-side of
    not using it is that on some platforms some software events can have a
    jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).

    The up-side however is that we can remove the nmi parameter and save a
    bunch of conditionals in fast paths.

    Signed-off-by: Peter Zijlstra
    Cc: Michael Cree
    Cc: Will Deacon
    Cc: Deng-Cheng Zhu
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Frederic Weisbecker
    Cc: Jason Wessel
    Cc: Don Zickus
    Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • It does not make sense to rcu_read_lock/unlock() in every loop
    iteration while spinning on the mutex.

    Move the rcu protection outside the loop. Also simplify the
    return path to always check for lock->owner == NULL which
    meets the requirements of both owner changed and need_resched()
    caused loop exits.
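
    Schematically, the resulting spin loop and return path look like this
    (simplified sketch; owner_running() stands in for the owner-still-on-cpu
    check):

    rcu_read_lock();
    while (owner_running(lock, owner)) {
        if (need_resched())
            break;
        cpu_relax();
    }
    rcu_read_unlock();

    /*
     * We left the loop because the owner changed or because we need to
     * resched; both indicate contention, so only report success when
     * the lock has actually been released.
     */
    return lock->owner == NULL;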

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1106101458350.11814@ionos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Merge reason: Move to a (much) newer base.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit c8b28116 ("sched: Increase SCHED_LOAD_SCALE resolution")
    intended to have no user-visible effect, but allows setting
    cpu.shares to < MIN_SHARES, which the user then sees.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Nikhil Rao
    Link: http://lkml.kernel.org/r/1307192600.8618.3.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

23 Jun, 2011

1 commit

  • The sleeping inside spinlock detection is actually used
    for more general sleeping inside atomic sections
    debugging: preemption disabled, rcu read side critical
    sections, interrupts, interrupts disabled, etc...

    Change the name of the config and its help section to
    reflect its more general role.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Acked-by: Randy Dunlap
    Cc: Peter Zijlstra
    Cc: Ingo Molnar

    Frederic Weisbecker
     

10 Jun, 2011

1 commit

  • Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec
    of the preempt count offset independently, so that the offset
    can be updated by preempt_disable() and preempt_enable()
    even without CONFIG_PREEMPT being set.

    This prepares for making CONFIG_DEBUG_SPINLOCK_SLEEP work
    with !CONFIG_PREEMPT, where it currently doesn't detect
    code that sleeps inside explicit preemption-disabled
    sections.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

07 Jun, 2011

1 commit

  • Sergey reported a CONFIG_PROVE_RCU warning in push_rt_task where
    set_task_cpu() was called with both relevant rq->locks held, which
    should be sufficient for running tasks since holding its rq->lock
    will serialize against sched_move_task().

    Update the comments and fix the task_group() lockdep test.

    Reported-and-tested-by: Sergey Senozhatsky
    Cc: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1307115427.2353.3456.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Jun, 2011

1 commit