07 Nov, 2011

1 commit

  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
    powerpc/p3060qds: Add support for P3060QDS board
    powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX
    powerpc/85xx: Make kexec to interate over online cpus
    powerpc/fsl_booke: Fix comment in head_fsl_booke.S
    powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices
    powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver
    powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver
    powerpc/86xx: Correct Gianfar support for GE boards
    powerpc/cpm: Clear muram before it is in use.
    drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager
    powerpc/fsl_msi: add support for "msi-address-64" property
    powerpc/85xx: Setup secondary cores PIR with hard SMP id
    powerpc/fsl-booke: Fix settlbcam for 64-bit
    powerpc/85xx: Adding DCSR node to dtsi device trees
    powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards
    powerpc/85xx: fix PHYS_64BIT selection for P1022DS
    powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map
    powerpc: respect mem= setting for early memory limit setup
    powerpc: Update corenet64_smp_defconfig
    powerpc: Update mpc85xx/corenet 32-bit defconfigs
    ...

    Fix up trivial conflicts in:
    - arch/powerpc/configs/40x/hcu4_defconfig
    removed stale file, edited elsewhere
    - arch/powerpc/include/asm/udbg.h, arch/powerpc/kernel/udbg.c:
    added opal and gelic drivers vs added ePAPR driver
    - drivers/tty/serial/8250.c
    moved UPIO_TSI to powerpc vs removed UPIO_DWAPB support

    Linus Torvalds
     

26 Oct, 2011

2 commits

  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    llist: Add back llist_add_batch() and llist_del_first() prototypes
    sched: Don't use tasklist_lock for debug prints
    sched: Warn on rt throttling
    sched: Unify the ->cpus_allowed mask copy
    sched: Wrap scheduler p->cpus_allowed access
    sched: Request for idle balance during nohz idle load balance
    sched: Use resched IPI to kick off the nohz idle balance
    sched: Fix idle_cpu()
    llist: Remove cpu_relax() usage in cmpxchg loops
    sched: Convert to struct llist
    llist: Add llist_next()
    irq_work: Use llist in the struct irq_work logic
    llist: Return whether list is empty before adding in llist_add()
    llist: Move cpu_relax() to after the cmpxchg()
    llist: Remove the platform-dependent NMI checks
    llist: Make some llist functions inline
    sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch
    sched: Remove redundant test in check_preempt_tick()
    sched: Add documentation for bandwidth control
    sched: Return unused runtime on group dequeue
    ...

    Linus Torvalds
     
  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp()
    rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states
    rcu: Wire up RCU_BOOST_PRIO for rcutree
    rcu: Make rcu_torture_boost() exit loops at end of test
    rcu: Make rcu_torture_fqs() exit loops at end of test
    rcu: Permit rt_mutex_unlock() with irqs disabled
    rcu: Avoid having just-onlined CPU resched itself when RCU is idle
    rcu: Suppress NMI backtraces when stall ends before dump
    rcu: Prohibit grace periods during early boot
    rcu: Simplify unboosting checks
    rcu: Prevent early boot set_need_resched() from __rcu_pending()
    rcu: Dump local stack if cannot dump all CPUs' stacks
    rcu: Move __rcu_read_unlock()'s barrier() within if-statement
    rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentation
    rcu: Make rcu_assign_pointer() unconditionally insert a memory barrier
    rcu: Make rcu_implicit_dynticks_qs() locals be correct size
    rcu: Eliminate in_irq() checks in rcu_enter_nohz()
    nohz: Remove nohz_cpu_mask
    rcu: Document interpretation of RCU-lockdep splats
    rcu: Allow rcutorture's stat_interval parameter to be changed at runtime
    ...

    Linus Torvalds
     

25 Oct, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
    MAINTAINERS: linux-m32r is moderated for non-subscribers
    linux@lists.openrisc.net is moderated for non-subscribers
    Drop default from "DM365 codec select" choice
    parisc: Kconfig: cleanup Kernel page size default
    Kconfig: remove redundant CONFIG_ prefix on two symbols
    cris: remove arch/cris/arch-v32/lib/nand_init.S
    microblaze: add missing CONFIG_ prefixes
    h8300: drop puzzling Kconfig dependencies
    MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
    tty: drop superfluous dependency in Kconfig
    ARM: mxc: fix Kconfig typo 'i.MX51'
    Fix file references in Kconfig files
    aic7xxx: fix Kconfig references to READMEs
    Fix file references in drivers/ide/
    thinkpad_acpi: Fix printk typo 'bluestooth'
    bcmring: drop commented out line in Kconfig
    btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
    doc: raw1394: Trivial typo fix
    CIFS: Don't free volume_info->UNC until we are entirely done with it.
    treewide: Correct spelling of successfully in comments
    ...

    Linus Torvalds
     

06 Oct, 2011

5 commits

  • Avoid taking locks from debug prints; this avoids latencies on -rt
    and improves the reliability of the debug code.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Currently every sched_class::set_cpus_allowed() implementation has to
    copy the cpumask into task_struct::cpus_allowed; this is pointless, so
    do that copy once in the generic code (a sketch follows this entry).

    Signed-off-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Link: http://lkml.kernel.org/n/tip-jhl5s9fckd9ptw1fzbqqlrd3@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
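
    A minimal sketch of pulling that copy into one generic helper; the
    helper name do_set_cpus_allowed() and the exact hook usage here are
    assumptions, not a quote of the patch:

        void do_set_cpus_allowed(struct task_struct *p,
                                 const struct cpumask *new_mask)
        {
                /* class-specific bookkeeping, if the class has any */
                if (p->sched_class->set_cpus_allowed)
                        p->sched_class->set_cpus_allowed(p, new_mask);

                /* the copy every implementation used to do, done once here */
                cpumask_copy(&p->cpus_allowed, new_mask);
        }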
     
  • This patch is preparatory for the migrate_disable() implementation, but
    stands on its own and provides a cleanup.

    It currently only converts the sites required for task placement (an
    accessor sketch follows this entry). Kosaki-san once mentioned replacing
    cpus_allowed with a proper cpumask_t instead of the NR_CPUS-sized array
    it currently is; that would also require something like this.

    Signed-off-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Cc: KOSAKI Motohiro
    Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
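
    A minimal sketch of the wrapper this introduces; the accessor name
    tsk_cpus_allowed() is an assumption based on the description:

        /* single point of access for a task's allowed-cpus mask */
        static inline const struct cpumask *tsk_cpus_allowed(struct task_struct *p)
        {
                return &p->cpus_allowed;
        }

        /* callers then write cpumask_test_cpu(cpu, tsk_cpus_allowed(p))
         * rather than touching p->cpus_allowed directly */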
     
  • rq's idle_at_tick is set to idle/busy during the timer tick depending on
    whether the cpu was idle or not. This is used later by the load balancing
    that is done in softirq context (which is a process context in -RT
    kernels).

    For nohz kernels, the cpu doing nohz idle load balancing on behalf of all
    the idle cpus may have a stale value in its rq->idle_at_tick (recorded at
    its last timer tick, presumably while it was busy).

    As the nohz idle load balancing is done in the same place as the regular
    load balancing, it was bailing out whenever it saw the rq's idle_at_tick
    not set, leading to poor system utilization.

    Rename rq's idle_at_tick to idle_balance and set it when someone requests
    nohz idle balancing on an idle cpu.

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • The current use of the smp call function machinery to kick the nohz idle
    balance can deadlock in the following scenario:

    1. cpu-A did a generic_exec_single() to cpu-B and, after queuing its call
    single data (csd) to the call single queue, cpu-A took a timer interrupt.
    The actual IPI to cpu-B to process the call single queue has not yet been
    sent.

    2. As part of the timer interrupt handler, cpu-A decides to kick cpu-B
    for idle load balancing (it sets cpu-B's rq->nohz_balance_kick to 1)
    and __smp_call_function_single() with nowait queues the csd to cpu-B's
    queue. But generic_exec_single() won't send an IPI to cpu-B, as the call
    single queue was not empty.

    3. cpu-A is busy with a lot of interrupts.

    4. Meanwhile cpu-B is entering and exiting idle and notices that its
    rq->nohz_balance_kick is set to '1'. So it goes ahead, does the idle
    load balancing and clears its rq->nohz_balance_kick.

    5. At this point, the csd queued in step 2 above is still locked and
    waiting to be serviced on cpu-B.

    6. cpu-A is still busy with interrupt load, gets another timer interrupt,
    and as part of it decides to kick cpu-B for another round of idle load
    balancing (since it finds cpu-B's rq->nohz_balance_kick cleared in step 4
    above) and does __smp_call_function_single() with the same csd, which is
    still locked.

    7. And we get a deadlock waiting for the csd_lock() in the
    __smp_call_function_single().

    The main issue here is that cpu-B can service the idle load balancing
    kick request from cpu-A even without receiving the IPI, and this leads
    to multiple __smp_call_function_single() calls on the same csd,
    resulting in the deadlock.

    To kick a cpu, the scheduler already has the reschedule vector reserved.
    Use that mechanism (kick_process()) instead of the generic smp call
    function mechanism to kick off the nohz idle load balancing and avoid
    the deadlock (a sketch follows this entry).

    [ This issue is present in 2.6.35+ kernels, but it is marked for -stable
    only from v3.0+ as the proposed fix depends on the scheduler_ipi() that
    was introduced recently. ]

    Reported-by: Prarit Bhargava
    Signed-off-by: Suresh Siddha
    Cc: stable@kernel.org # v3.0+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
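
    A minimal sketch of kicking the target cpu through the reserved
    reschedule vector instead of a csd; the helper name is hypothetical and
    only rq->nohz_balance_kick is taken from the description above:

        /* hypothetical kick helper */
        static void nohz_kick_ilb_cpu(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);

                if (rq->nohz_balance_kick)
                        return;                 /* a kick is already pending */

                rq->nohz_balance_kick = 1;
                smp_send_reschedule(cpu);       /* reserved resched vector */
        }

        /* on the target, scheduler_ipi() sees the pending kick while idle
         * and raises SCHED_SOFTIRQ to run the nohz idle load balance */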
     

04 Oct, 2011

3 commits

  • On -rt we observed hackbench waking all 400 tasks to a single cpu.
    This is because of select_idle_sibling()'s interaction with the new
    IPI-based wakeup scheme.

    The existing idle_cpu() test only checks whether the current task on
    that cpu is the idle task; it does not take already queued tasks into
    account, nor tasks that are queued to be woken (a stricter check is
    sketched after this entry).

    If the remote wakeup IPIs come hard enough, there won't be time to
    schedule away from the idle task, and we would thus keep thinking the
    cpu was in fact idle, even though several hundred tasks were already
    runnable.

    We couldn't reproduce on mainline, but there's no reason it couldn't
    happen.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-3o30p18b2paswpc9ohy2gltp@git.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
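
    A minimal sketch of the stricter test described above; the rq->wake_list
    llist field is assumed from the related wake-list entries in this series:

        int idle_cpu(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);

                if (rq->curr != rq->idle)
                        return 0;               /* something else is running */

                if (rq->nr_running)
                        return 0;               /* tasks already queued */

        #ifdef CONFIG_SMP
                if (!llist_empty(&rq->wake_list))
                        return 0;               /* tasks queued to be woken */
        #endif

                return 1;
        }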
     
  • Use the generic llist primitives.

    We had a private lockless list implementation in the scheduler's
    wake-list code; now that we have a generic llist implementation that
    provides all the required operations, switch to it (a usage sketch
    follows this entry).

    This patch is not expected to change any behavior.

    Signed-off-by: Peter Zijlstra
    Cc: Huang Ying
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
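
    A minimal, self-contained sketch of the generic llist API being adopted;
    the surrounding struct and helper names are illustrative, not the
    scheduler's actual ones:

        #include <linux/kernel.h>
        #include <linux/llist.h>

        struct wake_entry {
                struct llist_node node;
                int cpu;
        };

        static LLIST_HEAD(pending_wakeups);

        static void queue_wakeup(struct wake_entry *e)
        {
                /* llist_add() returns true if the list was empty beforehand,
                 * telling the producer whether an IPI is needed */
                if (llist_add(&e->node, &pending_wakeups)) {
                        /* list was empty: e.g. smp_send_reschedule(e->cpu); */
                }
        }

        static void process_wakeups(void)
        {
                struct llist_node *head = llist_del_all(&pending_wakeups);
                struct wake_entry *e;

                llist_for_each_entry(e, head, node)
                        pr_info("waking entry for cpu %d\n", e->cpu);
        }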
     
  • Merge reason: pick up the latest fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

30 Sep, 2011

1 commit

  • David reported:

    Attached below is a watered-down version of rt/tst-cpuclock2.c from
    GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
    similar.

    Run it several times, and you will see cases where the main thread
    will measure a process clock difference before and after the nanosleep
    which is smaller than the cpu-burner thread's individual thread clock
    difference. This doesn't make any sense since the cpu-burner thread
    is part of the top-level process's thread group.

    I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
    64-bit binaries).

    For example:

    [davem@boricha build-x86_64-linux]$ ./test
    process: before(0.001221967) after(0.498624371) diff(497402404)
    thread: before(0.000081692) after(0.498316431) diff(498234739)
    self: before(0.001223521) after(0.001240219) diff(16698)
    [davem@boricha build-x86_64-linux]$

    The diff of 'process' should always be >= the diff of 'thread'.

    I make sure to wrap the 'thread' clock measurements the most tightly
    around the nanosleep() call, and that the 'process' clock measurements
    are the outer-most ones.

    ---
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>

    static pthread_barrier_t barrier;

    /* burner thread: spins so that its per-thread clock keeps advancing */
    static void *chew_cpu(void *arg)
    {
            pthread_barrier_wait(&barrier);
            while (1)
                    __asm__ __volatile__("" : : : "memory");
            return NULL;
    }

    int main(void)
    {
            clockid_t process_clock, my_thread_clock, th_clock;
            struct timespec process_before, process_after;
            struct timespec me_before, me_after;
            struct timespec th_before, th_after;
            struct timespec sleeptime;
            unsigned long diff;
            pthread_t th;
            int err;

            err = clock_getcpuclockid(0, &process_clock);
            if (err)
                    return 1;

            err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
            if (err)
                    return 1;

            pthread_barrier_init(&barrier, NULL, 2);
            err = pthread_create(&th, NULL, chew_cpu, NULL);
            if (err)
                    return 1;

            err = pthread_getcpuclockid(th, &th_clock);
            if (err)
                    return 1;

            pthread_barrier_wait(&barrier);

            err = clock_gettime(process_clock, &process_before);
            if (err)
                    return 1;

            err = clock_gettime(my_thread_clock, &me_before);
            if (err)
                    return 1;

            err = clock_gettime(th_clock, &th_before);
            if (err)
                    return 1;

            /* main thread sleeps 0.5s while the burner thread runs */
            sleeptime.tv_sec = 0;
            sleeptime.tv_nsec = 500000000;
            nanosleep(&sleeptime, NULL);

            err = clock_gettime(th_clock, &th_after);
            if (err)
                    return 1;

            err = clock_gettime(my_thread_clock, &me_after);
            if (err)
                    return 1;

            err = clock_gettime(process_clock, &process_after);
            if (err)
                    return 1;

            diff = process_after.tv_nsec - process_before.tv_nsec;
            printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   process_before.tv_sec, process_before.tv_nsec,
                   process_after.tv_sec, process_after.tv_nsec, diff);
            diff = th_after.tv_nsec - th_before.tv_nsec;
            printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   th_before.tv_sec, th_before.tv_nsec,
                   th_after.tv_sec, th_after.tv_nsec, diff);
            diff = me_after.tv_nsec - me_before.tv_nsec;
            printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
                   me_before.tv_sec, me_before.tv_nsec,
                   me_after.tv_sec, me_after.tv_nsec, diff);

            return 0;
    }

    This is due to us using p->se.sum_exec_runtime in
    thread_group_cputime() where we iterate the thread group and sum all
    data. This does not take time since the last schedule operation (tick
    or otherwise) into account. We can cure this by using
    task_sched_runtime() at the cost of having to take locks.

    This also means we can (and must) do away with
    thread_group_sched_runtime() since the modified thread_group_cputime()
    is now more accurate and would deadlock when called from
    thread_group_sched_runtime().

    Aside from that, it makes the function safe on 32-bit systems. The old
    code added t->se.sum_exec_runtime unprotected; sum_exec_runtime is a
    64-bit value and could be changed on another cpu at the same time.

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
    Tested-by: David Miller
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

29 Sep, 2011

2 commits

  • RCU no longer uses this global variable, nor does anyone else. This
    commit therefore removes this variable. This reduces memory footprint
    and also removes some atomic instructions and memory barriers from
    the dyntick-idle path.

    Signed-off-by: Alex Shi
    Signed-off-by: Paul E. McKenney

    Shi, Alex
     
  • Long ago, using TREE_RCU with PREEMPT would result in "scheduling
    while atomic" diagnostics if you blocked in an RCU read-side critical
    section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
    this diagnostic. This commit therefore adds a replacement diagnostic
    based on PROVE_RCU.

    Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
    used for things that have nothing to do with rcu_dereference(), rename
    lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
    argument that is a string indicating what is suspicious. This third
    argument is passed in from a new third argument to rcu_lockdep_assert().
    Update all calls to rcu_lockdep_assert() to add an informative third
    argument (a sketch of the resulting assertion follows this entry).

    Also, add a pair of rcu_lockdep_assert() calls from within
    rcu_note_context_switch(), one complaining if a context switch occurs
    in an RCU-bh read-side critical section and another complaining if a
    context switch occurs in an RCU-sched read-side critical section.
    These are present only if the PROVE_RCU kernel parameter is enabled.

    Finally, fix some checkpatch whitespace complaints in lockdep.c.

    Again, you must enable PROVE_RCU to see these new diagnostics. But you
    are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
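
    A minimal sketch of what the three-argument assertion described above
    ends up looking like; the macro body is inferred from this description,
    so treat the details as assumptions:

        #define rcu_lockdep_assert(c, s)                                        \
                do {                                                            \
                        static bool __warned;                                   \
                        if (debug_lockdep_rcu_enabled() && !__warned && !(c)) { \
                                __warned = true;                                \
                                lockdep_rcu_suspicious(__FILE__, __LINE__, s);  \
                        }                                                       \
                } while (0)

        /* a caller of the kind added to rcu_note_context_switch()
         * (the message string is illustrative): */
        rcu_lockdep_assert(!rcu_read_lock_bh_held(),
                           "context switch inside an RCU-bh read-side critical section");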
     

26 Sep, 2011

1 commit

  • Commit c259e01a1ec ("sched: Separate the scheduler entry for
    preemption") contained a boo-boo wrecking wchan output. It forgot to
    put the new schedule() function in the __sched section, so it no longer
    gets properly ignored for things like wchan (see the illustration after
    this entry).

    Tested-by: Simon Kirby
    Cc: stable@kernel.org # 2.6.39+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca
    Signed-off-by: Ingo Molnar

    Simon Kirby
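
    For context, a small illustration of the annotation in question; the
    function here is hypothetical, the point is only where __sched puts it:

        #include <linux/sched.h>

        /* __sched places the code in the .sched.text section, which the
         * get_wchan()/wchan stack walk skips over */
        static int __sched wait_for_my_event(void)      /* hypothetical */
        {
                schedule();
                return 0;
        }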
     

29 Aug, 2011

4 commits

  • The current cgroup context switch code was incorrect leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 <not counted> cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any cgroup
    switches in case the outgoing and incoming tasks are in the same cgroup
    (a sketch of that check follows this entry).

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ /pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 <not counted> cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct counts for the non-cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
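
    A minimal sketch of the bypass described above: skip the perf cgroup
    switch when the outgoing and incoming tasks share a cgroup (the helper
    shown is a simplification, not the full patch):

        static void perf_cgroup_task_switch(struct task_struct *prev,
                                            struct task_struct *next)
        {
                struct perf_cgroup *cgrp_prev = perf_cgroup_from_task(prev);
                struct perf_cgroup *cgrp_next = perf_cgroup_from_task(next);

                /* same cgroup: nothing to switch, skip the expensive path */
                if (cgrp_prev == cgrp_next)
                        return;

                perf_cgroup_switch(prev, PERF_CGROUP_SWOUT);
                perf_cgroup_switch(next, PERF_CGROUP_SWIN);
        }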
     
  • This patch fixes the following memory leak:

    unreferenced object 0xffff880107266800 (size 512):
    comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] create_object+0x187/0x28b
    [] kmemleak_alloc+0x73/0x98
    [] __kmalloc_node+0x104/0x159
    [] kzalloc_node.clone.97+0x15/0x17
    [] build_sched_domains+0xb7/0x7f3
    [] partition_sched_domains+0x1db/0x24a
    [] do_rebuild_sched_domains+0x3b/0x47
    [] rebuild_sched_domains+0x10/0x12
    [] sched_power_savings_store+0x6c/0x7b
    [] sched_mc_power_savings_store+0x16/0x18
    [] sysdev_class_store+0x20/0x22
    [] sysfs_write_file+0x108/0x144
    [] vfs_write+0xaf/0x102
    [] sys_write+0x4d/0x74
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    Signed-off-by: WANG Cong
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org # 3.0
    Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
    Signed-off-by: Ingo Molnar

    WANG Cong
     
  • There is no real reason to run blk_schedule_flush_plug() with
    interrupts and preemption disabled.

    Move it into schedule() and call it when the task is going voluntarily
    to sleep. There might be false positives when the task is woken
    between that call and actually scheduling, but that's not really
    different from being woken immediately after switching away.

    This fixes a deadlock in the scheduler where the
    blk_schedule_flush_plug() callchain enables interrupts and thereby
    allows a wakeup to happen of the task that's going to sleep.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: stable@kernel.org # 2.6.39+
    Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Block-IO and workqueues call into notifier functions from the
    scheduler core code with interrupts and preemption disabled. These
    calls should be made before entering the scheduler core.

    To simplify this, separate the scheduler core code into
    __schedule(). __schedule() is directly called from the places which
    set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
    checks into schedule(), so they are only called when a task voluntarily
    goes to sleep (a sketch of the split follows this entry).

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: stable@kernel.org # 2.6.39+
    Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
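
    A minimal sketch of the resulting split, covering this entry and the
    blk_schedule_flush_plug() entry above; sched_submit_work() is an assumed
    helper name:

        static inline void sched_submit_work(struct task_struct *tsk)
        {
                if (!tsk->state)
                        return;         /* still runnable: not going to sleep */

                /* flush the block plug only on a voluntary sleep, while
                 * interrupts and preemption are still enabled */
                if (blk_needs_flush_plug(tsk))
                        blk_schedule_flush_plug(tsk);
        }

        asmlinkage void __sched schedule(void)
        {
                struct task_struct *tsk = current;

                sched_submit_work(tsk);
                __schedule();   /* core entry, also called directly from the
                                   PREEMPT_ACTIVE preemption paths */
        }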
     

14 Aug, 2011

15 commits

  • When a local cfs_rq blocks we return the majority of its remaining quota to the
    global bandwidth pool for use by other runqueues.

    We do this only when the quota is current and there is more than
    min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

    In the case where there are throttled runqueues and we have sufficient
    bandwidth to meter out a slice, a second timer is kicked off to handle this
    delivery, unthrottling where appropriate.

    Using a 'worst case' antagonist which executes on each cpu
    for 1ms before moving onto the next on a fairly large machine:

    no quota generations:

    197.47 ms /cgroup/a/cpuacct.usage
    199.46 ms /cgroup/a/cpuacct.usage
    205.46 ms /cgroup/a/cpuacct.usage
    198.46 ms /cgroup/a/cpuacct.usage
    208.39 ms /cgroup/a/cpuacct.usage

    Since we are allowed to use "stale" quota our usage is effectively bounded by
    the rate of input into the global pool and performance is relatively stable.

    with quota generations [1s increments]:

    119.58 ms /cgroup/a/cpuacct.usage
    119.65 ms /cgroup/a/cpuacct.usage
    119.64 ms /cgroup/a/cpuacct.usage
    119.63 ms /cgroup/a/cpuacct.usage
    119.60 ms /cgroup/a/cpuacct.usage

    The large deficit here is due to quota generations (/intentionally/) preventing
    us from now using previously stranded slack quota. The cost is that this quota
    becomes unavailable.

    with quota generations and quota return:

    200.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    198.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    200.06 ms /cgroup/a/cpuacct.usage

    By returning unused quota we're able to both stably consume our desired
    quota and prevent unintentional overages due to the abuse of slack quota
    from previous quota periods (especially on a large machine); the
    slack-return path is sketched after this entry.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
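
    A minimal sketch of the slack-return path on dequeue; field names follow
    this and the related entries, and the exact conditions are assumptions:

        static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                s64 slack = cfs_rq->runtime_remaining - min_cfs_rq_quota;

                /* keep min_cfs_rq_quota locally, and only return quota that
                 * is still current (same expiration as the global pool) */
                if (slack <= 0 ||
                    cfs_rq->runtime_expires != cfs_b->runtime_expires)
                        return;

                raw_spin_lock(&cfs_b->lock);
                if (cfs_b->quota != RUNTIME_INF)
                        cfs_b->runtime += slack;
                raw_spin_unlock(&cfs_b->lock);

                cfs_rq->runtime_remaining -= slack;
        }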
     
  • This change introduces statistics exports for the cpu sub-system; these
    are added through the use of a stat file similar to that exported by
    other subsystems (the read side is sketched after this entry).

    The following exports are included:

    nr_periods: number of periods in which execution occurred
    nr_throttled: the number of the above periods in which execution was
    throttled
    throttled_time: cumulative wall-time that any cpus have been throttled
    for this group

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
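
    A hedged sketch of the cgroupfs read side for these statistics; the
    cftype callback shown follows the cgroup map-file interface of that era
    and should be read as illustrative:

        static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
                                  struct cgroup_map_cb *cb)
        {
                struct task_group *tg = cgroup_tg(cgrp);
                struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;

                cb->fill(cb, "nr_periods", cfs_b->nr_periods);
                cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
                cb->fill(cb, "throttled_time", cfs_b->throttled_time);

                return 0;
        }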
     
  • Throttled tasks are invisible to cpu-offline since they are not eligible
    for selection by pick_next_task(). The regular 'escape' path for a thread
    blocked at offline is via ttwu->select_task_rq; however, this will not
    handle a throttled group since there are no individual thread wakeups on
    an unthrottle.

    Resolve this by unthrottling offline cpus so that threads can be migrated.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • From the perspective of load-balance and shares distribution, throttled
    entities should be invisible.

    However, both of these operations work on 'active' lists and are not
    inherently aware of what group hierarchies may be present. In some cases this
    may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
    while in others (e.g. update_shares()) it is more difficult to compute without
    incurring some O(n^2) costs.

    Instead, track hierarchical throttled state at the time of transition.
    This allows us to easily identify whether an entity belongs to a throttled
    hierarchy and avoid incorrect interactions with it (a small check is
    sketched after this entry).

    Also, when an entity leaves a throttled hierarchy we need to advance its
    time averaging for shares averaging so that the elapsed throttled time is not
    considered as part of the cfs_rq's operation.

    We also use this information to prevent buddy interactions in the wakeup and
    yield_to() paths.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
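
    A minimal sketch of the hierarchical state check; the per-cfs_rq counter
    name is an assumption consistent with the description (it counts how
    many levels, including itself, are currently throttled):

        static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
        {
                return cfs_rq->throttled;          /* throttled directly */
        }

        static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
        {
                return cfs_rq->throttle_count > 0; /* self or any ancestor */
        }

        /* load balance and update_shares() then skip entities for which
         * throttled_hierarchy() is true */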
     
  • Extend walk_tg_tree to accept a positional argument

    static int walk_tg_tree_from(struct task_group *from,
    tg_visitor down, tg_visitor up, void *data)

    Existing semantics are preserved; the caller must hold rcu_read_lock()
    or a sufficient analogue (a sketch of the walk follows this entry).

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
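
    A sketch of the positional walk, mirroring the existing walk_tg_tree()
    pattern (down() applied on the way down, up() on the way back, starting
    at 'from'); treat it as illustrative rather than the exact patch:

        static int walk_tg_tree_from(struct task_group *from,
                                     tg_visitor down, tg_visitor up, void *data)
        {
                struct task_group *parent, *child;
                int ret;

                parent = from;
        down:
                ret = (*down)(parent, data);
                if (ret)
                        goto out;
                list_for_each_entry_rcu(child, &parent->children, siblings) {
                        parent = child;
                        goto down;
        up:
                        continue;
                }
                ret = (*up)(parent, data);
                if (ret || parent == from)
                        goto out;

                child = parent;
                parent = parent->parent;
                if (parent)
                        goto up;
        out:
                return ret;
        }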
     
  • At the start of each period we refresh the global bandwidth pool. At this
    time we must also unthrottle any cfs_rq entities that are now within
    bandwidth once more (as quota permits).

    Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
    and their entities re-enqueued.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Now that consumption is tracked (via update_curr()) we add support to
    throttle group entities (and their corresponding cfs_rqs) in the case
    where there is no run-time remaining.

    Throttled entities are dequeued to prevent scheduling; additionally we
    mark them as throttled (using cfs_rq->throttled) to prevent them from
    becoming re-enqueued until they are unthrottled. A list of a task_group's
    throttled entities is maintained on the cfs_bandwidth structure (a
    condensed sketch follows this entry).

    Note: While the machinery for throttling is added in this patch the act of
    throttling an entity exceeding its bandwidth is deferred until later within
    the series.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
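
    A condensed sketch of the mechanics: dequeue the group's entity so it
    cannot be picked, flag the cfs_rq, and park it on the bandwidth
    structure's list (the hierarchy walk and statistics updates are omitted;
    list and field names are assumptions):

        static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

                if (se->on_rq)
                        dequeue_entity(cfs_rq_of(se), se, DEQUEUE_SLEEP);

                cfs_rq->throttled = 1;

                raw_spin_lock(&cfs_b->lock);
                list_add_tail_rcu(&cfs_rq->throttled_list,
                                  &cfs_b->throttled_cfs_rq);
                raw_spin_unlock(&cfs_b->lock);
        }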
     
  • Since quota is managed using a global state but consumed on a per-cpu basis
    we need to ensure that our per-cpu state is appropriately synchronized.
    Most importantly, runtime that is stale (from a previous period) should
    not be locally consumable.

    We take advantage of existing sched_clock synchronization about the jiffy to
    efficiently detect whether we have (globally) crossed a quota boundary above.

    One catch is that the direction of spread on sched_clock is undefined;
    specifically, we don't know whether our local clock is behind or ahead
    of the one responsible for the current expiration time.

    Fortunately we can differentiate these cases by considering whether the
    global deadline has advanced. If it has not, then we assume our clock to
    be "fast" and advance our local expiration; otherwise, we know the
    deadline has truly passed and we expire our local runtime (this check is
    sketched after this entry).

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
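
    A minimal sketch of the expiry decision described above; field names
    follow the surrounding entries and the exact form is an assumption:

        static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                struct rq *rq = rq_of(cfs_rq);

                /* nothing to do while the local deadline is still ahead */
                if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)
                        return;

                if (cfs_rq->runtime_expires == cfs_b->runtime_expires) {
                        /* the global deadline has not advanced: our clock is
                         * merely "fast", so push the local expiry out a bit */
                        cfs_rq->runtime_expires += TICK_NSEC;
                } else {
                        /* the deadline truly passed: expire local runtime */
                        cfs_rq->runtime_remaining = 0;
                }
        }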
     
  • This patch adds a per-task_group timer which handles the refresh of the global
    CFS bandwidth pool.

    Since the RT pool is using a similar timer there's some small refactoring to
    share this support.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Account bandwidth usage on the cfs_rq level versus the task_groups to which
    they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
    under cfs_rq->runtime_enabled.

    cfs_rq's which belong to a bandwidth constrained task_group have their runtime
    accounted via the update_curr() path, which withdraws bandwidth from the global
    pool as desired. Updates involving the global pool are currently protected
    under cfs_bandwidth->lock, local runtime is protected by rq->lock.

    This patch only assigns and tracks quota; no action is taken in the case
    that cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned (the charging
    path is sketched after this entry).

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
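
    A minimal sketch of the charging path described above: update_curr()
    charges the local cfs_rq and refills from the global pool under
    cfs_bandwidth->lock when the local allotment runs out (the slice helper
    name is an assumption):

        static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                u64 amount = 0;

                if (!cfs_rq->runtime_enabled)
                        return;

                cfs_rq->runtime_remaining -= delta_exec;
                if (cfs_rq->runtime_remaining > 0)
                        return;

                /* out of local runtime: try to draw a slice from the pool */
                raw_spin_lock(&cfs_b->lock);
                if (cfs_b->runtime > 0) {
                        amount = min(cfs_b->runtime, sched_cfs_bandwidth_slice());
                        cfs_b->runtime -= amount;
                }
                raw_spin_unlock(&cfs_b->lock);

                cfs_rq->runtime_remaining += amount;
        }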
     
  • Add constraints validation for CFS bandwidth hierarchies.

    Validate that:
    max(child bandwidth) <= parent bandwidth
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • In this patch we introduce the notion of CFS bandwidth, partitioned into
    globally unassigned bandwidth, and locally claimed bandwidth.

    - The global bandwidth is per task_group; it represents a pool of
    unclaimed bandwidth that cfs_rqs can allocate from.
    - The local bandwidth is tracked per cfs_rq; this represents allotments
    from the global pool, i.e. bandwidth assigned to a specific cpu.

    Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
    - cpu.cfs_period_us : the bandwidth period in usecs
    - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be
    allowed to consume over the period above.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Introduce hierarchical task accounting for the group scheduling case in
    CFS, and promote the responsibility for maintaining rq->nr_running to
    the scheduling classes.

    The primary motivation for this is that with scheduling classes supporting
    bandwidth throttling it is possible for entities participating in throttled
    sub-trees to not have root visible changes in rq->nr_running across activate
    and de-activate operations. This in turn leads to incorrect idle and
    weight-per-task load balance decisions.

    This also allows us to make a small fixlet to the fastpath in pick_next_task()
    under group scheduling.

    Note: this issue also exists with the existing sched_rt throttling mechanism.
    This patch does not address that.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has
    been handled for an RT parent gives birth to a deranged mutant child with
    non-RT policy, but RT prio and sched_class.

    Move PI leakage protection up, always set priorities and weight, and if the
    child is leaving RT class, reset rt_priority to the proper value.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Since commit a2d47777 ("sched: fix stale value in average load per task")
    the variable rq->avg_load_per_task is no longer required. Remove it.

    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan H. Schönherr
     

26 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    fs: Merge split strings
    treewide: fix potentially dangerous trailing ';' in #defined values/expressions
    uwb: Fix misspelling of neighbourhood in comment
    net, netfilter: Remove redundant goto in ebt_ulog_packet
    trivial: don't touch files that are removed in the staging tree
    lib/vsprintf: replace link to Draft by final RFC number
    doc: Kconfig: `to be' -> `be'
    doc: Kconfig: Typo: square -> squared
    doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
    drivers/net: static should be at beginning of declaration
    drivers/media: static should be at beginning of declaration
    drivers/i2c: static should be at beginning of declaration
    XTENSA: static should be at beginning of declaration
    SH: static should be at beginning of declaration
    MIPS: static should be at beginning of declaration
    ARM: static should be at beginning of declaration
    rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
    Update my e-mail address
    PCIe ASPM: forcedly -> forcibly
    gma500: push through device driver tree
    ...

    Fix up trivial conflicts:
    - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
    - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
    - drivers/net/r8169.c (just context changes)

    Linus Torvalds