14 Jan, 2011

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    Documentation/trace/events.txt: Remove obsolete sched_signal_send.
    writeback: fix global_dirty_limits comment runtime -> real-time
    ppc: fix comment typo singal -> signal
    drivers: fix comment typo diable -> disable.
    m68k: fix comment typo diable -> disable.
    wireless: comment typo fix diable -> disable.
    media: comment typo fix diable -> disable.
    remove doc for obsolete dynamic-printk kernel-parameter
    remove extraneous 'is' from Documentation/iostats.txt
    Fix spelling milisec -> ms in snd_ps3 module parameter description
    Fix spelling mistakes in comments
    Revert conflicting V4L changes
    i7core_edac: fix typos in comments
    mm/rmap.c: fix comment
    sound, ca0106: Fix assignment to 'channel'.
    hrtimer: fix a typo in comment
    init/Kconfig: fix typo
    anon_inodes: fix wrong function name in comment
    fix comment typos concerning "consistent"
    poll: fix a typo in comment
    ...

    Fix up trivial conflicts in:
    - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
    - fs/ext4/ext4.h

    Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

    Linus Torvalds
     

07 Jan, 2011

5 commits

  • One of the operands, buf, is incorrect: buf gets stripped, so the
    address used for the subsequent string comparison can change if any
    leading white space is removed from buf.

    Fix it by replacing buf with cmp.

    Signed-off-by: Hillf Danton
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hillf Danton
     
  • Seems I lost a change somewhere, leaking memory.

    sched: fix struct autogroup memory leak

    Add missing change to actually use autogroup_free().

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • root_task_group is a leftover of USER_SCHED and is now always the
    same as init_task_group.
    But as Mike suggested, root_task_group may be the more suitable name
    to keep in the tree.
    So this patch renames:
    init_task_group --> root_task_group
    init_task_group_load --> root_task_group_load
    INIT_TASK_GROUP_LOAD --> ROOT_TASK_GROUP_LOAD

    Suggested-by: Mike Galbraith
    Signed-off-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yong Zhang
     
  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
    sched: Change wait_for_completion_*_timeout() to return a signed long
    sched, autogroup: Fix reference leak
    sched, autogroup: Fix potential access to freed memory
    sched: Remove redundant CONFIG_CGROUP_SCHED ifdef
    sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight
    sched: Move periodic share updates to entity_tick()
    printk: Use this_cpu_{read|write} api on printk_pending
    sched: Make pushable_tasks CONFIG_SMP dependant
    sched: Add 'autogroup' scheduling feature: automated per session task groups
    sched: Fix unregister_fair_sched_group()
    sched: Remove unused argument dest_cpu to migrate_task()
    mutexes, sched: Introduce arch_mutex_cpu_relax()
    sched: Add some clock info to sched_debug
    cpu: Remove incorrect BUG_ON
    cpu: Remove unused variable
    sched: Fix UP build breakage
    sched: Make task dump print all 15 chars of proc comm
    sched: Update tg->shares after cpu.shares write
    sched: Allow update_cfs_load() to update global load
    sched: Implement demand based update_cfs_load()
    ...

    Linus Torvalds
     
  • …git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (146 commits)
    tools, perf: Documentation for the power events API
    perf: Add calls to suspend trace point
    perf script: Make some lists static
    perf script: Use the default lost event handler
    perf session: Warn about errors when processing pipe events too
    perf tools: Fix perf_event.h header usage
    perf test: Clarify some error reports in the open syscall test
    x86, NMI: Add touch_nmi_watchdog to io_check_error delay
    x86: Avoid calling arch_trigger_all_cpu_backtrace() at the same time
    x86: Only call smp_processor_id in non-preempt cases
    perf timechart: Adjust perf timechart to the new power events
    perf: Clean up power events by introducing new, more generic ones
    perf: Do not export power_frequency, but power_start event
    perf test: Add test for counting open syscalls
    perf evsel: Auto allocate resources needed for some methods
    perf evsel: Use {cpu,thread}_map to shorten list of parameters
    perf tools: Refactor all_tids to hold nr and the map
    perf tools: Refactor cpumap to hold nr and the map
    perf evsel: Introduce per cpu and per thread open helpers
    perf evsel: Steal the counter reading routines from stat
    ...

    Linus Torvalds
     

05 Jan, 2011

2 commits

  • wait_for_completion_*_timeout() can return:

    0: if the wait timed out
    -ve: if the wait was interrupted
    +ve: if the completion was completed.

    As they currently return an 'unsigned long', the last two cases
    are not easily distinguished, which can easily result in buggy
    code, as is the case for the recently added
    wait_for_completion_interruptible_timeout() call in
    net/sunrpc/cache.c.

    So change them both to return 'long'. As MAX_SCHEDULE_TIMEOUT
    is LONG_MAX, a large +ve return value should never overflow.
    (A caller sketch follows this entry.)

    Signed-off-by: NeilBrown
    Cc: Peter Zijlstra
    Cc: J. Bruce Fields
    Cc: Andrew Morton
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    NeilBrown
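
    A minimal caller sketch of the new semantics (hypothetical code, not
    from the patch; the completion and the 500 ms timeout are invented
    for illustration):

        #include <linux/completion.h>
        #include <linux/errno.h>
        #include <linux/jiffies.h>
        #include <linux/printk.h>

        /* Wait up to 500 ms on @done and map the three outcomes to an errno. */
        static int wait_for_done(struct completion *done)
        {
                long ret = wait_for_completion_interruptible_timeout(done,
                                                msecs_to_jiffies(500));

                if (ret == 0)
                        return -ETIMEDOUT;      /* 0: the wait timed out */
                if (ret < 0)
                        return ret;             /* -ve: interrupted      */
                pr_debug("completed, %ld jiffies left\n", ret); /* +ve  */
                return 0;
        }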
     
  • Merge reason: Merge the final .37 tree.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

04 Jan, 2011

1 commit


23 Dec, 2010

2 commits


22 Dec, 2010

1 commit


20 Dec, 2010

1 commit

  • Linus reported that the new warning introduced by commit f26f9aff6aaf
    "Sched: fix skip_clock_update optimization" triggers. The need_resched
    flag can be set by other CPUs asynchronously so this debug check is
    bogus - remove it.

    Reported-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 Dec, 2010

3 commits

  • Currently we call perf_event_init() from sched_init(). In order to
    make it more obvious, move it to the canonical location.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since the irqtime accounting is using non-atomic u64 and can be read
    from remote cpus (writes are strictly cpu local, reads are not) we
    have to deal with observing partial updates.

    When we do observe partial updates, the clock movement (in particular,
    ->clock_task movement) will go funny (in either direction), and a
    subsequent clock update (observing the full update) will make it go
    funny in the opposite direction.

    Since we rely on these clocks to be strictly monotonic we cannot
    suffer backwards motion. One possible solution would be to simply
    ignore all backwards deltas, but that would lead to accounting
    artefacts, most notably clock_task + irq_time != clock, and this
    inaccuracy would end up in user-visible stats.

    Therefore serialize the reads using a seqcount; a schematic
    reader/writer pair follows this entry.

    Reviewed-by: Venkatesh Pallipadi
    Reported-by: Mikael Pettersson
    Tested-by: Mikael Pettersson
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
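
    The shape of the serialization, as a schematic sketch (names and
    layout invented here; the actual patch wraps the per-cpu irq-time
    fields in kernel/sched.c):

        #include <linux/seqlock.h>
        #include <linux/types.h>

        static seqcount_t irq_time_seq;   /* zero-init is a valid idle seqcount */
        static u64 cpu_irq_time;          /* non-atomic u64 being protected     */

        static void irq_time_add(u64 delta)       /* writes stay cpu-local  */
        {
                write_seqcount_begin(&irq_time_seq);
                cpu_irq_time += delta;
                write_seqcount_end(&irq_time_seq);
        }

        static u64 irq_time_read(void)            /* readers may be remote  */
        {
                unsigned int seq;
                u64 val;

                do {                              /* retry on a torn update */
                        seq = read_seqcount_begin(&irq_time_seq);
                        val = cpu_irq_time;
                } while (read_seqcount_retry(&irq_time_seq, seq));

                return val;
        }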
     
  • Some ARM systems have a short sched_clock() [ which needs to be fixed
    too ], but this exposed a bug in the irq_time code as well: it doesn't
    deal with wraps at all.

    Fix the irq_time code to deal with u64 wraps by rewriting it to only
    use delta increments, which avoids the whole issue (a sketch of the
    idea follows this entry).

    Reviewed-by: Venkatesh Pallipadi
    Reported-by: Mikael Pettersson
    Tested-by: Mikael Pettersson
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
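
    A sketch of the delta-increment idea (variable and function names are
    made up; the real code keeps this state per cpu):

        #include <linux/sched.h>
        #include <linux/smp.h>
        #include <linux/types.h>

        static u64 irq_start_time;   /* last clock snapshot             */
        static u64 irq_time_total;   /* only ever grows by small deltas */

        /* Fold the time since the last snapshot into the running total. */
        static void account_irq_delta(void)
        {
                u64 now = sched_clock_cpu(smp_processor_id());

                /*
                 * Unsigned subtraction yields the correct small delta even
                 * when 'now' has wrapped past 'irq_start_time'.
                 */
                irq_time_total += now - irq_start_time;
                irq_start_time = now;
        }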
     

09 Dec, 2010

3 commits

  • As noted by Peter Zijlstra at https://lkml.org/lkml/2010/11/10/391
    (while reviewing other stuff, though), tracking pushable tasks
    only makes sense on SMP systems.

    Signed-off-by: Dario Faggioli
    Acked-by: Steven Rostedt
    Acked-by: Gregory Haskins
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • idle_balance() drops/retakes rq->lock, leaving the previous task
    vulnerable to set_tsk_need_resched(). Clear it after we return
    from balancing instead, and in setup_thread_stack() as well, so
    no successfully descheduled or never scheduled task has it set.

    The need_resched flag confused the skip_clock_update logic, which
    assumes that the next call to update_rq_clock() will come nearly
    immediately after the flag is set. Make the optimization robust
    against the case of waking a sleeper before it successfully
    deschedules by checking that the current task has not been dequeued
    before setting the flag, since it is that useless clock update we're
    trying to save, and clear the flag unconditionally in schedule()
    proper instead of conditionally in put_prev_task().

    Signed-off-by: Mike Galbraith
    Reported-by: Bjoern B. Brandenburg
    Tested-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • There's a long-running regression that proved difficult to fix and
    which is hitting certain people and is rather annoying in its effects.

    Damien reported that after 74f5187ac8 (sched: Cure load average vs
    NO_HZ woes) his load average is unnaturally high; he also noted that
    even with that patch reverted the load average numbers are not
    correct.

    The problem is that the previous patch only solved half the NO_HZ
    problem: it addressed going into NO_HZ mode, but not coming out of
    it. This patch implements that missing half.

    When coming out of NO_HZ mode there are two important things to take
    care of:

    - Folding the pending idle delta into the global active count.
    - Correctly aging the averages for the idle-duration.

    So with this patch the NO_HZ interaction should be complete and
    behaviour between CONFIG_NO_HZ=[yn] should be equivalent.

    Furthermore, this patch slightly changes the load average computation
    by adding a rounding term to the fixed point multiplication (see the
    sketch after this entry).

    Reported-by: Damien Wyart
    Reported-by: Tim McGrath
    Tested-by: Damien Wyart
    Tested-by: Orion Poplawski
    Tested-by: Kyle McMartin
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Cc: Chase Douglas
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
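
    Roughly, the fixed-point step with the new rounding term (FSHIFT and
    FIXED_1 mirror the definitions in include/linux/sched.h; treat this
    as a sketch rather than the exact diff):

        #define FSHIFT   11                 /* bits of fractional precision */
        #define FIXED_1  (1 << FSHIFT)      /* 1.0 in fixed point           */

        static unsigned long calc_load(unsigned long load, unsigned long exp,
                                       unsigned long active)
        {
                load *= exp;                      /* decay the old average     */
                load += active * (FIXED_1 - exp); /* blend in the active count */
                load += 1UL << (FSHIFT - 1);      /* rounding term: round to   */
                return load >> FSHIFT;            /* nearest, not truncate     */
        }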
     

30 Nov, 2010

3 commits

  • A recurring complaint from CFS users is that parallel kbuild has
    a negative impact on desktop interactivity. This patch
    implements an idea from Linus, to automatically create task
    groups. Currently, only per session autogroups are implemented,
    but the patch leaves the way open for enhancement.

    Implementation: each task's signal struct contains an inherited
    pointer to a refcounted autogroup struct containing a task group
    pointer, the default for all tasks pointing to the
    init_task_group. When a task calls setsid(), a new task group
    is created, the process is moved into the new task group, and a
    reference to the previous task group is dropped. Child
    processes inherit this task group thereafter, and increase its
    refcount. When the last thread of a process exits, the
    process's reference is dropped, such that when the last process
    referencing an autogroup exits, the autogroup is destroyed.

    At runqueue selection time, IFF a task has no cgroup assignment,
    its current autogroup is used.

    Autogroup bandwidth is controllable via setting its nice level
    through the proc filesystem:

    cat /proc/<pid>/autogroup

    Displays the task's group and the group's nice level.

    echo <nice level> > /proc/<pid>/autogroup

    Sets the task group's shares to the weight of a nice <level> task.
    Setting nice level is rate limited for !admin users due to the
    abuse risk of task group locking.

    The feature is enabled from boot by default if
    CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
    the boot option noautogroup, and can also be turned on/off on
    the fly via:

    echo [01] > /proc/sys/kernel/sched_autogroup_enabled

    ... which will automatically move tasks to/from the root task group.

    Signed-off-by: Mike Galbraith
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Cc: Markus Trippelsdorf
    Cc: Mathieu Desnoyers
    Cc: Paul Turner
    Cc: Oleg Nesterov
    [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
    Signed-off-by: Ingo Molnar
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • In the flipping and flopping between calling
    unregister_fair_sched_group() on a per-cpu versus per-group basis
    we ended up in a bad state.

    Remove from the list for the passed cpu as opposed to some
    arbitrary index.

    ( This fixes explosions w/ autogroup as well as a group
    creation/destruction stress test. )

    Reported-by: Stephen Rothwell
    Signed-off-by: Paul Turner
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • The first version of synchronize_sched_expedited() used the migration
    code in the scheduler, and was therefore implemented in kernel/sched.c.
    However, the more recent version of this code no longer uses the
    migration code, so this commit moves it to the main RCU source files.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Paul E. McKenney

    Lai Jiangshan
     

26 Nov, 2010

3 commits

  • Remove unused argument, 'dest_cpu' of migrate_task(), and pass runqueue,
    as it is always known at the call site.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Nikanth Karthikesan
     
  • The spinning mutex implementation uses cpu_relax() in busy loops as a
    compiler barrier. Depending on the architecture, cpu_relax() may do more
    than needed in these specific mutex spin loops. On System z we also give
    up the time slice of the virtual cpu in cpu_relax(), which prevents
    effective spinning on the mutex.

    This patch replaces cpu_relax() in the spinning mutex code with
    arch_mutex_cpu_relax(), which can be defined by each architecture that
    selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax(), so
    this patch should not affect architectures other than System z for now.
    (A sketch of the default mapping follows this entry.)

    Signed-off-by: Gerald Schaefer
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Gerald Schaefer
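
    A sketch of the default mapping (roughly what the generic header
    gains; an architecture selecting HAVE_ARCH_MUTEX_CPU_RELAX supplies
    its own definition instead):

        /* Fallback: mutex spinning keeps using the ordinary cpu_relax(). */
        #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
        #define arch_mutex_cpu_relax()  cpu_relax()
        #endif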
     
  • Merge reason: Pick up latest fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

23 Nov, 2010

1 commit


18 Nov, 2010

8 commits

  • Formerly sched_group_set_shares would force a rebalance by overflowing domain
    share sums. Now that per-cpu averages are maintained we can set the true value
    by issuing an update_cfs_shares() following a tg->shares update.

    Also initialize tg se->load to 0 for consistency since we'll now set correct
    weights on enqueue.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • When the system is busy, dilation of rq->next_balance makes lb->update_shares()
    insufficiently frequent for threads which don't sleep (no dequeue/enqueue
    updates). Adjust for this by making demand-based updates whenever the
    accumulated execution time is sufficient to wrap our averaging window.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with
    the put path since nr_running accounting occurs at deactivation.

    It's also not safe to make the removal decision based on load_avg as this fails
    with both high periods and low shares. Resolve this by clipping history after
    4 periods without activity.

    Note: the above will always occur from update_shares() since in the
    last-task-sleep-case that task will still be cfs_rq->curr when update_cfs_load
    is called.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Make tg_shares_up() use the active cgroup list; this means we cannot
    do a strict bottom-up walk of the hierarchy, but assuming it's a very
    wide tree with a small number of active groups it should be a win.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make certain load-balance actions scale per number of active cgroups
    instead of the number of existing cgroups.

    This makes wakeup/sleep paths more expensive, but is a win for systems
    where the vast majority of existing cgroups are idle.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • By tracking a per-cpu load-avg for each cfs_rq and folding it into a
    global task_group load on each tick we can rework tg_shares_up to be
    strictly per-cpu.

    This should improve cpu-cgroup performance for smp systems
    significantly.

    [ Paul: changed to use queueing cfs_rq + bug fixes ]

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While discussing the need for sched_idle_next(), Oleg remarked that
    since try_to_wake_up() ensures sleeping tasks will end up running on a
    sane cpu, we can do away with migrate_live_tasks().

    If we then extend the existing hack of migrating current from
    CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the
    need for the sched_idle_next() abomination disappears as well, since
    idle will be the only possible thread left after the migration thread
    stops.

    This greatly simplifies the hot-unplug task migration path, as can be
    seen from the resulting code reduction (and about half the new lines
    are comments).

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: Move to a .37-rc base.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

11 Nov, 2010

2 commits

  • Instead of dealing with sched classes inside each check_preempt_curr()
    implementation, pull out this logic into the generic wakeup preemption
    path.

    This fixes a hang in KVM (and others) where we are waiting for the
    stop machine thread to run ...

    Reported-by: Markus Trippelsdorf
    Tested-by: Marcelo Tosatti
    Tested-by: Sergey Senozhatsky
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we consider a sched domain to be well balanced when the imbalance
    is less than the domain's imbalance_pct. As the number of cores and threads
    increases, current values of imbalance_pct (for example 25% for a
    NUMA domain) are not enough to detect imbalances like:

    a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
    24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on the other,
    leaving an HT cpu idle (see the worked example after this entry).

    b) On a hypothetical 2-socket NHM-EX system (each socket having 8 cores and
    16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
    socket and 7 on the other, leaving one core in a socket idle
    while in the other socket a core has both its HT siblings busy.

    While this issue can be fixed by decreasing the domain's imbalance_pct
    (by making it a function of the number of logical cpus in the domain), it
    can potentially cause more task migrations across sched groups in an
    overloaded case.

    Fix this by using imbalance_pct only during newly_idle and busy
    load balancing. During idle load balancing, instead check whether the
    number of idle cpus differs between the busiest group and this
    sched_group, or whether the busiest group has more tasks than its weight
    which the idle cpu in this_group can pull.

    Reported-by: Nikhil Rao
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
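
    The worked example for case (a), using a simplified form of the
    balance cut-off (the scheduler's actual check compares per-group
    average load rather than raw task counts, so treat this as an
    illustration only):

        #include <linux/types.h>

        /* Simplified: is the busiest group "balanced enough" vs. this group? */
        static bool considered_balanced(unsigned long busiest_load,
                                        unsigned long this_load,
                                        unsigned int imbalance_pct)
        {
                return 100 * busiest_load <= imbalance_pct * this_load;
        }

        /*
         * Example (a): equal-weight cpu hogs, 13 on the busiest socket and
         * 11 here, with imbalance_pct = 125 (25%) for the domain:
         *
         *     100 * 13 = 1300  <=  125 * 11 = 1375
         *
         * so the domain is declared balanced and the HT cpu stays idle.
         */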
     

02 Nov, 2010

1 commit

  • "gadget", "through", "command", "maintain", "maintain", "controller", "address",
    "between", "initiali[zs]e", "instead", "function", "select", "already",
    "equal", "access", "management", "hierarchy", "registration", "interest",
    "relative", "memory", "offset", "already",

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Jiri Kosina

    Uwe Kleine-König
     

29 Oct, 2010

1 commit


23 Oct, 2010

1 commit


22 Oct, 2010

1 commit