14 Aug, 2013

1 commit


13 Aug, 2013

1 commit

  • This is only theoretical, but after try_to_wake_up(p) was changed
    to check p->state under p->pi_lock, code like

    __set_current_state(TASK_INTERRUPTIBLE);
    schedule();

    can miss a signal. This is the special case of wait-for-condition:
    it relies on the try_to_wake_up()/schedule() interaction and thus does
    not need an mb() between __set_current_state() and the
    if (signal_pending()) check.

    However, this __set_current_state() can move into the critical
    section protected by rq->lock; now that try_to_wake_up() takes
    another lock, we need to ensure that it can't be reordered with
    the "if (signal_pending(current))" check inside that section.

    The patch is actually a one-liner: it simply adds smp_wmb() before
    spin_lock_irq(rq->lock). This is what try_to_wake_up() already
    does, for the same reason.

    We turn this wmb() into the new helper, smp_mb__before_spinlock(),
    for better documentation and to allow the architectures to change
    the default implementation.

    While at it, kill smp_mb__after_lock(), it has no callers.

    Perhaps we can also add smp_mb__before/after_spinunlock() for
    prepare_to_wait().

    Signed-off-by: Oleg Nesterov
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
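
    A minimal sketch of the helper and its placement (an editor's
    illustration based on the description above, not the verbatim patch):

        /* include/linux/spinlock.h: default definition, which an
         * architecture may override with something cheaper or stronger. */
        #ifndef smp_mb__before_spinlock
        #define smp_mb__before_spinlock()       smp_wmb()
        #endif

        /* kernel/sched/core.c, inside __schedule(), simplified: the barrier
         * keeps the caller's __set_current_state(TASK_INTERRUPTIBLE) from
         * being reordered past the signal_pending_state() test done under
         * rq->lock, so a concurrent signal_wake_up() cannot be missed. */
        smp_mb__before_spinlock();
        raw_spin_lock_irq(&rq->lock);

        if (prev->state && unlikely(signal_pending_state(prev->state, prev)))
                prev->state = TASK_RUNNING;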
     

18 Jul, 2013

1 commit

  • When building the htmldocs (in verbose mode), scripts/kernel-doc
    reports the following type of warning:

    Warning(kernel/sched/core.c:936): No description found for return value of 'task_curr'
    ...

    Fix those by:

    - adding the missing descriptions
    - using "Return" sections for the descriptions

    Signed-off-by: Yacine Belkadi
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373654747-2389-1-git-send-email-yacine.belkadi.1@gmail.com
    [ While at it, fix the cpupri_set() explanation. ]
    Signed-off-by: Ingo Molnar

    Yacine Belkadi
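
    For reference, a kernel-doc comment in the style the patch adds (an
    editor's illustration for task_curr(); the wording paraphrases the
    warning above rather than quoting the patch):

        /**
         * task_curr - is this task currently executing on a CPU?
         * @p: the task in question.
         *
         * Return: 1 if the task is currently executing, 0 otherwise.
         */
        inline int task_curr(const struct task_struct *p)
        {
                return cpu_curr(task_cpu(p)) == p;
        }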
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. The fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
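
    The change itself is mechanical; a hypothetical hunk in the style of the
    patch (function chosen for illustration only):

        -static int __cpuinit sched_cpu_active(struct notifier_block *nfb,
        -                                      unsigned long action, void *hcpu)
        +static int sched_cpu_active(struct notifier_block *nfb,
        +                            unsigned long action, void *hcpu)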
     

12 Jul, 2013

1 commit

  • David reported that the HRTICK sched feature was borken; which was enough
    motivation for me to finally fix it ;-)

    We should not allow hrtimer code to do softirq wakeups while holding scheduler
    locks. The hrtimer code only needs this when we accidentally try to program an
    expired time. We don't much care about those anyway since we have the regular
    tick to fall back to.

    Reported-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130628091853.GE29209@dyad.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Jul, 2013

1 commit

  • Merge in a recent upstream commit:

    c2853c8df57f include/linux/math64.h: add div64_ul()

    because:

    72a4cf20cb71 sched: Change cfs_rq load avg to unsigned long

    relies on it.

    [ We don't rebase sched/core for this, because the handful of
    followup commits after the broken commit are not behavioral
    changes so are unlikely to be needed during bisection. ]

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

27 Jun, 2013

3 commits

  • To get the latest runnable info, we need to do this cpuload update
    after task_tick().

    Signed-off-by: Alex Shi
    Reviewed-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1371694737-29336-6-git-send-email-alex.shi@intel.com
    Signed-off-by: Ingo Molnar

    Alex Shi
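
    A simplified sketch of the resulting ordering in scheduler_tick()
    (editor's illustration of the description above, not the full function):

        void scheduler_tick(void)
        {
                int cpu = smp_processor_id();
                struct rq *rq = cpu_rq(cpu);
                struct task_struct *curr = rq->curr;

                raw_spin_lock(&rq->lock);
                update_rq_clock(rq);
                curr->sched_class->task_tick(rq, curr, 0);
                /* Moved after task_tick() so the cpu load update sees the
                 * freshly updated runnable averages. */
                update_cpu_load_active(rq);
                raw_spin_unlock(&rq->lock);
        }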
     
  • We need to initialize se.avg.{decay_count, load_avg_contrib} for a
    newly forked task. Otherwise random values of the above variables cause
    a mess when a new task is enqueued:

    enqueue_task_fair
    enqueue_entity
    enqueue_entity_load_avg

    and make fork balancing imbalanced due to an incorrect load_avg_contrib.

    Furthermore, Morten Rasmussen noticed that some tasks were not launched
    immediately after being created. So Paul and Peter suggested giving the
    new task's runnable avg a start value equal to sched_slice().

    PeterZ said:

    > So the 'problem' is that our running avg is a 'floating' average; ie. it
    > decays with time. Now we have to guess about the future of our newly
    > spawned task -- something that is nigh impossible seeing these CPU
    > vendors keep refusing to implement the crystal ball instruction.
    >
    > So there's two asymptotic cases we want to deal well with; 1) the case
    > where the newly spawned program will be 'nearly' idle for its lifetime;
    > and 2) the case where it's cpu-bound.
    >
    > Since we have to guess, we'll go for worst case and assume it's
    > cpu-bound; now we don't want to make the avg so heavy adjusting to the
    > near-idle case takes forever. We want to be able to quickly adjust and
    > lower our running avg.
    >
    > Now we also don't want to make our avg too light, such that it gets
    > decremented just for the new task not having had a chance to run yet --
    > even if when it would run, it would be more cpu-bound than not.
    >
    > So what we do is we make the initial avg of the same duration as that we
    > guess it takes to run each task on the system at least once -- aka
    > sched_slice().
    >
    > Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
    > case it should be good enough.

    Paul also contributed most of the code comments in this commit.

    Signed-off-by: Alex Shi
    Reviewed-by: Gu Zheng
    Reviewed-by: Paul Turner
    [peterz; added explanation of sched_slice() usage]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1371694737-29336-4-git-send-email-alex.shi@intel.com
    Signed-off-by: Ingo Molnar

    Alex Shi
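
    A sketch close to what the patch adds (editor's illustration; helper and
    field names follow the 3.11-era per-entity load tracking code and should
    be treated as indicative, not authoritative):

        void init_task_runnable_average(struct task_struct *p)
        {
                u32 slice;

                p->se.avg.decay_count = 0;
                /* Give the new task a full sched_slice() worth of runnable
                 * history so its initial load_avg_contrib is non-zero. */
                slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
                p->se.avg.runnable_avg_sum = slice;
                p->se.avg.runnable_avg_period = slice;
                __update_task_entity_contrib(&p->se);
        }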
     
  • Remove the CONFIG_FAIR_GROUP_SCHED guards that cover the runnable info,
    so that we can use the runnable load variables.

    Also remove 2 CONFIG_FAIR_GROUP_SCHED guards which are not in the
    reverted patch (they were introduced in 9ee474f) but also need to be
    reverted.

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com
    Signed-off-by: Ingo Molnar

    Alex Shi
     

21 Jun, 2013

1 commit


19 Jun, 2013

11 commits

  • Just use struct ctl_table.

    Signed-off-by: Joe Perches
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1371063336.2069.22.camel@joe-AO722
    Signed-off-by: Ingo Molnar

    Joe Perches
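
    An illustrative hunk (editor's example; sd_ctl_dir is one of the affected
    declarations in kernel/sched/core.c, assuming the code of that era):

        -static ctl_table sd_ctl_dir[] = {
        +static struct ctl_table sd_ctl_dir[] = {
                 {
                         .procname       = "sched_domain",
                         .mode           = 0555,
                 },
                 {}
         };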
     
  • sd can't be NULL in init_sched_groups_power(), so checking it for NULL isn't
    useful. Even if such a check were required, the code would need to be
    rearranged a bit, as we have already dereferenced the (potentially invalid)
    pointer sd to get sg: sg = sd->groups.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/2bbe633cd74b431c05253a8ce61fdfd5066a531b.1370948150.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     
  • In build_sched_groups() we don't need to call get_group() for cpus
    which are already covered in previous iterations. Calling get_group()
    would mark the group used and eventually leak it, since we would neither
    connect it nor find it again to free it.

    This can happen only in cases where sg->cpumask contains more than
    one cpu (for any topology level). With this patch, sg's memory is freed
    for all cpus except the group leader, as the group is no longer marked
    used.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/7a61e955abdcbb1dfa9fe493f11a5ec53a11ddd3.1370948150.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
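
    A simplified sketch of the reordering in build_sched_groups() (editor's
    illustration of the description above):

        for_each_cpu(i, span) {
                struct sched_group *sg;
                int group;

                /* Check "covered" before calling get_group(): otherwise the
                 * group is marked used, never connected, and ends up leaked. */
                if (cpumask_test_cpu(i, covered))
                        continue;

                group = get_group(i, sdd, &sg);
                /* ... build and connect sg as before ... */
        }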
     
  • At the beginning of build_sched_groups() we call sched_domain_span() and
    cache its return value in span. A few statements later we call it again to
    get the same pointer.

    Let's use the cached value instead, as it hasn't changed in between.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/834ecd507071ad88aff039352dbc7e063dd996a7.1370948150.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     
  • The for loop traversing sched_domain_topology was open-coded in multiple
    places in core.c. This patch removes that redundancy by introducing
    for_each_sd_topology().

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/e0e04542f54e9464bd9da54f5ccfe62ec6c4c0bc.1370861520.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
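
    A sketch of the helper (editor's illustration, assuming the 3.11-era
    topology table, which is terminated by a level without an ->init method):

        #define for_each_sd_topology(tl)                        \
                for (tl = sched_domain_topology; tl->init; tl++)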
     
  • Memory for sd is allocated with kzalloc_node(), which initializes its fields
    to zero. In build_sched_domain() we are setting sd->child to child even if
    child is NULL, which isn't required.

    Let's do it only if child isn't NULL.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/f4753a1730051341003ad2ad29a3229c7356678e.1370861520.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     
  • alloc_state will be overwritten by __visit_domain_allocation_hell() and so we
    don't actually need to initialize alloc_state.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/df57734a075cc5ad130e1ae498702e24f2529ab8.1370861520.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     
  • We are saving the first scheduling domain for a cpu in build_sched_domains()
    by iterating over the nested sd->child list. We don't actually need to do it
    this way.

    tl is equal to sched_domain_topology on the first iteration, so we can set
    *per_cpu_ptr(d.sd, i) based on that. So, save the pointer to the first SD
    while running the iteration loop over the tl's.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/fc473527cbc4dfa0b8eeef2a59db74684eb59a83.1370436120.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
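
    A simplified sketch of the per-cpu loop in build_sched_domains() after the
    change (editor's illustration):

        for_each_cpu(i, cpu_map) {
                struct sched_domain_topology_level *tl;

                sd = NULL;
                for_each_sd_topology(tl) {
                        sd = build_sched_domain(tl, cpu_map, attr, sd, i);
                        /* The first topology level is the lowest (per-cpu)
                         * domain: record it here instead of walking the
                         * ->child chain afterwards. */
                        if (tl == sched_domain_topology)
                                *per_cpu_ptr(d.sd, i) = sd;
                }
        }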
     
  • build_sched_domain() never uses its struct s_data *d parameter, so
    passing it is useless.

    Remove it.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/545e0b4536166a15b4475abcafe5ed0db4ad4a2c.1370436120.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     
  • Merge in fixes before applying ongoing new work.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • I have faced a sequence where the Idle Load Balance was sometimes not
    triggered for a while on my platform, in the following scenario:

    CPU 0 and CPU 1 are running tasks and CPU 2 is idle

    CPU 1 kicks the Idle Load Balance
    CPU 1 selects CPU 2 as the new Idle Load Balancer
    CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
    CPU 1 sends a reschedule IPI to CPU 2

    While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking-up task A onto CPU 2

    CPU 2 finally wakes up, runs task A and discards the Idle Load Balance;
    task A quickly goes back to sleep (before a tick occurs on CPU 2)
    and CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

    Whenever CPU 2 is later selected as the ILB, no reschedule IPI will be sent
    because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance will be
    performed.

    We must wait for SCHED_SOFTIRQ to be raised on CPU 2 by some other part
    of the kernel before NOHZ_BALANCE_KICK gets cleared again.

    The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
    we can't raise SCHED_SOFTIRQ for the Idle Load Balance.

    Change since V1:

    - move the clearing of NOHZ_BALANCE_KICK into got_nohz_idle_kick() if the
    ILB can't run on this CPU (as suggested by Peter)

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
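
    The resulting check looks roughly like this (editor's sketch of
    got_nohz_idle_kick() after the change):

        static inline bool got_nohz_idle_kick(void)
        {
                int cpu = smp_processor_id();

                if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
                        return false;

                if (idle_cpu(cpu) && !need_resched())
                        return true;

                /* We can't run the Idle Load Balance on this CPU this time,
                 * so cancel it and clear NOHZ_BALANCE_KICK. */
                clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
                return false;
        }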
     

31 May, 2013

1 commit

  • While computing the cputime delta of dynticks CPUs,
    we are mixing up clocks of different natures:

    * local_clock(), which takes care of unstable clock
    sources and fixes these if needed.

    * sched_clock(), which is the weaker version of
    local_clock(). It doesn't compute any fixup in case
    of an unstable source.

    If the clock source is stable, those two clocks are the
    same and we can safely compute the difference between
    any two points.

    Otherwise it results in random deltas as sched_clock()
    can randomly drift away, back or forward, from local_clock().

    As a consequence, some strange behaviour has been observed with an
    unstable TSC, such as non-progressing, constant-zero cputime
    (the 'top' command showing no load).

    Fix this by only using local_clock(), or its irq safe/remote
    equivalent, in vtime code.

    Reported-by: Mike Galbraith
    Suggested-by: Mike Galbraith
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
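
    A conceptual illustration of the pitfall (editor's sketch, not the patch
    itself): a delta is only meaningful when both endpoints come from the
    same clock.

        u64 start, delta;

        /* Inconsistent: on an unstable TSC, sched_clock() may have drifted
         * away from local_clock(), so this delta is essentially random. */
        start = local_clock();
        delta = sched_clock() - start;

        /* Consistent: both endpoints come from local_clock(), which
         * compensates for unstable clock sources. */
        start = local_clock();
        delta = local_clock() - start;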
     

28 May, 2013

4 commits

  • Read the runqueue clock through an accessor. This
    prepares for adding a debugging infrastructure to
    detect missing or redundant calls to update_rq_clock()
    between a scheduler's entry and exit point.

    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Steven Rostedt
    Cc: Paul Turner
    Cc: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1365724262-20142-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
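
    The accessors are trivial wrappers; a sketch as described
    (kernel/sched/sched.h):

        static inline u64 rq_clock(struct rq *rq)
        {
                return rq->clock;
        }

        static inline u64 rq_clock_task(struct rq *rq)
        {
                return rq->clock_task;
        }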
     
  • The fair class's check_preempt_curr() needs an up-to-date sched clock
    value to update the runtime stats of the target rq's current task.

    When a task is woken up, activate_task() is usually called right before
    ttwu_do_wakeup() unless the task is still in the runqueue. In the latter
    case we need to update the rq clock explicitly because activate_task()
    isn't here to do the job for us.

    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Steven Rostedt
    Cc: Paul Turner
    Cc: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1365724262-20142-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
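
    A sketch of the affected path (editor's illustration of ttwu_remote(),
    the "task still on the runqueue" case described above):

        static int ttwu_remote(struct task_struct *p, int wake_flags)
        {
                struct rq *rq;
                int ret = 0;

                rq = __task_rq_lock(p);
                if (p->on_rq) {
                        /* check_preempt_curr() may use the rq clock */
                        update_rq_clock(rq);
                        ttwu_do_wakeup(rq, p, wake_flags);
                        ret = 1;
                }
                __task_rq_unlock(rq);

                return ret;
        }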
     
  • The sched_class::put_prev_task() callbacks of the rt and fair
    classes refer to the rq clock to update their runtime statistics,
    but there is a missing rq clock update in the CPU hotplug
    notifier's entry point into the scheduler.

    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Steven Rostedt
    Cc: Paul Turner
    Cc: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1365724262-20142-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • migration_call() will do all the things that update_runtime() does,
    so let's remove the latter.

    Furthermore, there is a potential risk that the current code will hit the
    BUG_ON at line 689 of rt.c when doing cpu hotplug while realtime threads
    are running, because runtime gets enabled twice while the rt_runtime may
    already have changed.

    Signed-off-by: Neil Zhang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1365685499-26515-1-git-send-email-zhangwm@marvell.com
    Signed-off-by: Ingo Molnar

    Neil Zhang
     

07 May, 2013

1 commit

  • This large chunk of load calculation code can be easily divorced
    from the main core.c scheduler file, with only a couple
    prototypes and externs added to a kernel/sched header.

    Some recent commits expanded the code and the documentation of
    it, making it large enough to warrant separation. For example,
    see:

    556061b, "sched/nohz: Fix rq->cpu_load[] calculations"
    5aaa0b7, "sched/nohz: Fix rq->cpu_load calculations some more"
    5167e8d, "sched/nohz: Rewrite and fix load-avg computation -- again"

    More importantly, it helps reduce the size of the main
    sched/core.c by yet another significant amount (~600 lines).

    Signed-off-by: Paul Gortmaker
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1366398650-31599-2-git-send-email-paul.gortmaker@windriver.com
    Signed-off-by: Ingo Molnar

    Paul Gortmaker
     

06 May, 2013

1 commit

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to make input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     

04 May, 2013

1 commit

  • The scheduler doesn't yet fully support environments
    with a single task running without a periodic tick.

    In order to ensure we still maintain the duties of scheduler_tick(),
    keep at least 1 tick per second.

    This makes sure that we keep the progression of various scheduler
    accounting and background maintenance even with a very low granularity.
    Examples include cpu load, sched average, CFS entity vruntime,
    avenrun and events such as load balancing, amongst other details
    handled in sched_class::task_tick().

    This limitation will be removed in the future once we get
    these individual items to work in full dynticks CPUs.

    Suggested-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
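
    A sketch of the mechanism (editor's illustration, close to the helper this
    patch introduces; treat the field and function names as indicative): the
    tick may be deferred, but by at most one second past the last scheduler
    tick.

        #ifdef CONFIG_NO_HZ_FULL
        u64 scheduler_tick_max_deferment(void)
        {
                struct rq *rq = this_rq();
                unsigned long next, now = ACCESS_ONCE(jiffies);

                next = rq->last_sched_tick + HZ;

                if (time_before_eq(next, now))
                        return 0;

                return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
        }
        #endif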
     

02 May, 2013

1 commit


01 May, 2013

1 commit

  • One of the problems that arise when converting a dedicated custom
    threadpool to workqueue is that the shared worker pool used by workqueue
    anonymizes each worker, making it more difficult to identify what the
    worker was doing on which target from the output of sysrq-t or a debug
    dump from oops, BUG() and friends.

    This patch implements set_worker_desc() which can be called from any
    workqueue work function to set its description. When the worker task is
    dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
    and so on - the description will be printed out together with the
    workqueue name and the worker function pointer.

    The printing side is implemented by print_worker_info() which is called
    from functions in task dump paths - sched_show_task() and
    dump_stack_print_info(). print_worker_info() can be safely called on
    any task in any state as long as the task struct itself is accessible.
    It uses probe_*() functions to access worker fields. It may print
    garbage if something went very wrong, but it wouldn't cause (another)
    oops.

    The description is currently limited to 24 bytes including the
    terminating \0. worker->desc_valid and worker->desc[] are added, and
    the 64-byte marker, which was already incorrect before adding the new
    fields, is moved to the correct position.

    Here's an example dump with writeback updated to set the bdi name as
    worker desc.

    Hardware name: Bochs
    Modules linked in:
    Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
    Workqueue: writeback bdi_writeback_workfn (flush-8:0)
    ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
    ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
    ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x7f/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] bdi_writeback_workfn+0x2a0/0x3b0
    ...

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Acked-by: Jan Kara
    Cc: Oleg Nesterov
    Cc: Jens Axboe
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
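
    Usage is a single call from the work function; the writeback conversion
    mentioned above looks roughly like this (editor's sketch):

        static void bdi_writeback_workfn(struct work_struct *work)
        {
                struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                                struct bdi_writeback, dwork);
                struct backing_dev_info *bdi = wb->bdi;

                /* Shows up as e.g. "(flush-8:0)" in sysrq-t and oops dumps. */
                set_worker_desc("flush-%s", dev_name(bdi->dev));

                /* ... do the actual writeback ... */
        }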
     

30 Apr, 2013

3 commits

  • Pull SMP/hotplug changes from Ingo Molnar:
    "This is a pretty large, multi-arch series unifying and generalizing
    the various disjunct pieces of idle routines that architectures have
    historically copied from each other and have grown in random, wildly
    inconsistent and sometimes buggy directions:

    101 files changed, 455 insertions(+), 1328 deletions(-)

    this went through a number of review and test iterations before it was
    committed, it was tested on various architectures, was exposed to
    linux-next for quite some time - nevertheless it might cause problems
    on architectures that don't read the mailing lists and don't regularly
    test linux-next.

    This cat herding exercise was motivated by the -rt kernel, and was
    brought to you by Thomas "the Whip" Gleixner."

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    idle: Remove GENERIC_IDLE_LOOP config switch
    um: Use generic idle loop
    ia64: Make sure interrupts enabled when we "safe_halt()"
    sparc: Use generic idle loop
    idle: Remove unused ARCH_HAS_DEFAULT_IDLE
    bfin: Fix typo in arch_cpu_idle()
    xtensa: Use generic idle loop
    x86: Use generic idle loop
    unicore: Use generic idle loop
    tile: Use generic idle loop
    tile: Enter idle with preemption disabled
    sh: Use generic idle loop
    score: Use generic idle loop
    s390: Use generic idle loop
    powerpc: Use generic idle loop
    parisc: Use generic idle loop
    openrisc: Use generic idle loop
    mn10300: Use generic idle loop
    mips: Use generic idle loop
    microblaze: Use generic idle loop
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this development cycle were:

    - full dynticks preparatory work by Frederic Weisbecker

    - factor out the cpu time accounting code better, by Li Zefan

    - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

    - various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched: Fix init NOHZ_IDLE flag
    sched: Prevent to re-select dst-cpu in load_balance()
    sched: Rename load_balance_tmpmask to load_balance_mask
    sched: Move up affinity check to mitigate useless redoing overhead
    sched: Don't consider other cpus in our group in case of NEWLY_IDLE
    sched: Explicitly cpu_idle_type checking in rebalance_domains()
    sched: Change position of resched_cpu() in load_balance()
    sched: Fix wrong rq's runnable_avg update with rt tasks
    sched: Document task_struct::personality field
    sched/cpuacct/UML: Fix header file dependency bug on the UML build
    cgroup: Kill subsys.active flag
    sched/cpuacct: No need to check subsys active state
    sched/cpuacct: Initialize cpuacct subsystem earlier
    sched/cpuacct: Initialize root cpuacct earlier
    sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
    sched/cpuacct: Clean up cpuacct.h
    sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
    sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
    sched/cpuacct: Add cpuacct_acount_field()
    sched/cpuacct: Add cpuacct_init()
    ...

    Linus Torvalds
     
  • Pull workqueue updates from Tejun Heo:
    "A lot of activities on workqueue side this time. The changes achieve
    the followings.

    - WQ_UNBOUND workqueues - the workqueues which are per-cpu - are
    updated to be able to interface with multiple backend worker pools.
    This involved a lot of churning but the end result seems actually
    neater as unbound workqueues are now a lot closer to per-cpu ones.

    - The ability to interface with multiple backend worker pools is
    used to implement unbound workqueues with custom attributes.
    Currently the supported attributes are the nice level and CPU
    affinity. It may be expanded to include cgroup association in
    future. The attributes can be specified either by calling
    apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
    the workqueue in question is exported through sysfs.

    The backend worker pools are keyed by the actual attributes and
    shared by any workqueues which share the same attributes. When
    attributes of a workqueue are changed, the workqueue binds to the
    worker pool with the specified attributes while leaving the work
    items which are already executing in its previous worker pools
    alone.

    This allows converting custom worker pool implementations which
    want worker attribute tuning to use workqueues. The writeback pool
    is already converted in the block tree, and a couple of others are
    likely to follow, including the btrfs io workers.

    - WQ_UNBOUND's ability to bind to multiple worker pools is also used
    to make it NUMA-aware. Because there's no association between work
    item issuer and the specific worker assigned to execute it, before
    this change, using unbound workqueue led to unnecessary cross-node
    bouncing and it couldn't be helped by autonuma as it requires tasks
    to have implicit node affinity and workers are assigned randomly.

    After these changes, an unbound workqueue now binds to multiple
    NUMA-affine worker pools so that queued work items are executed in
    the same node. This is turned on by default but can be disabled
    system-wide or for individual workqueues.

    Crypto was requesting NUMA affinity as encrypting data across
    different nodes can contribute noticeable overhead and doing it
    per-cpu was too limiting for certain cases and IO throughput could
    be bottlenecked by one CPU being fully occupied while others have
    idle cycles.

    While the new features required a lot of changes including
    restructuring locking, it didn't complicate the execution paths much.
    The unbound workqueue handling is now closer to per-cpu ones and the
    new features are implemented by simply associating a workqueue with
    different sets of backend worker pools without changing queue,
    execution or flush paths.

    As such, even though the amount of change is very high, I feel
    relatively safe in that it isn't likely to cause subtle issues with
    basic correctness of work item execution and handling. If something
    is wrong, it's likely to show up as being associated with worker pools
    with the wrong attributes or OOPS while workqueue attributes are being
    changed or during CPU hotplug.

    While this creates more backend worker pools, it doesn't add too many
    more workers unless, of course, there are many workqueues with unique
    combinations of attributes. Assuming everything else is the same,
    NUMA awareness costs an extra worker pool per NUMA node with online
    CPUs.

    There are also a couple things which are being routed outside the
    workqueue tree.

    - block tree pulled in workqueue for-3.10 so that writeback worker
    pool can be converted to unbound workqueue with sysfs control
    exposed. This simplifies the code, makes writeback workers
    NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

    - The conversion to workqueue means that there's no longer a 1:1
    association between a specific worker and the bdi it is working on,
    which makes writeback folks unhappy as they want to be able to tell
    which filesystem caused a problem from a backtrace on systems with
    many filesystems mounted. This is
    resolved by allowing work items to set debug info string which is
    printed when the task is dumped. As this change involves unifying
    implementations of dump_stack() and friends in arch codes, it's
    being routed through Andrew's -mm tree."

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
    workqueue: use kmem_cache_free() instead of kfree()
    workqueue: avoid false negative WARN_ON() in destroy_workqueue()
    workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
    workqueue: implement NUMA affinity for unbound workqueues
    workqueue: introduce put_pwq_unlocked()
    workqueue: introduce numa_pwq_tbl_install()
    workqueue: use NUMA-aware allocation for pool_workqueues
    workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
    workqueue: map an unbound workqueues to multiple per-node pool_workqueues
    workqueue: move hot fields of workqueue_struct to the end
    workqueue: make workqueue->name[] fixed len
    workqueue: add workqueue->unbound_attrs
    workqueue: determine NUMA node of workers accourding to the allowed cpumask
    workqueue: drop 'H' from kworker names of unbound worker pools
    workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
    workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
    workqueue: fix memory leak in apply_workqueue_attrs()
    workqueue: fix unbound workqueue attrs hashing / comparison
    workqueue: fix race condition in unbound workqueue free path
    workqueue: remove pwq_lock which is no longer used
    ...

    Linus Torvalds
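
    A usage sketch of the attribute interface described above (editor's
    illustration assuming the 3.10-era prototypes; error handling trimmed):

        struct workqueue_struct *wq;
        struct workqueue_attrs *attrs;

        wq = alloc_workqueue("my_unbound", WQ_UNBOUND | WQ_SYSFS, 0);
        attrs = alloc_workqueue_attrs(GFP_KERNEL);
        if (wq && attrs) {
                attrs->nice = -5;                    /* nicer workers        */
                cpumask_copy(attrs->cpumask,         /* restrict to node 0   */
                             cpumask_of_node(0));
                apply_workqueue_attrs(wq, attrs);    /* rebind worker pools  */
        }
        free_workqueue_attrs(attrs);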
     

29 Apr, 2013

1 commit

  • Pull locking changes from Ingo Molnar:
    "The most noticeable change are mutex speedups from Waiman Long, for
    higher loads. These scalability changes should be most noticeable on
    larger server systems.

    There are also cleanups, fixes and debuggability improvements."

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    lockdep: Consolidate bug messages into a single print_lockdep_off() function
    lockdep: Print out additional debugging advice when we hit lockdep BUGs
    mutex: Back out architecture specific check for negative mutex count
    mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
    mutex: Make more scalable by doing less atomic operations
    mutex: Move mutex spinning code from sched/core.c back to mutex.c
    locking/rtmutex/tester: Set correct permissions on sysfs files
    lockdep: Remove unnecessary 'hlock_next' variable

    Linus Torvalds
     

24 Apr, 2013

1 commit

  • This name doesn't convey any specific meaning,
    so rename it to reflect its purpose.

    Signed-off-by: Joonsoo Kim
    Acked-by: Peter Zijlstra
    Tested-by: Jason Low
    Cc: Srivatsa Vaddagiri
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366705662-3587-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Ingo Molnar

    Joonsoo Kim
     

23 Apr, 2013

3 commits

  • When a task is scheduled in, it may have some properties
    of its own that could make the CPU reconsider the need for
    the tick: posix cpu timers, perf events, ...

    So notify the full dynticks subsystem when a task gets
    scheduled in and re-check the tick dependency at this
    stage. This is done through a self IPI to avoid messing
    up with any current lock scenario.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • The scheduler IPI is used by the scheduler to kick
    full dynticks CPUs asynchronously when more than one
    task is running or when a new timer list timer is
    enqueued. This way the destination CPU can decide
    to restart the tick to handle this new situation.

    Now let's call that kick in the scheduler IPI.

    (Reusing the scheduler IPI rather than implementing
    a new IPI was suggested by Peter Zijlstra a while ago)

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • Provide a new helper to be called from the full dynticks engine
    before stopping the tick in order to make sure we don't stop
    it when there is more than one task running on the CPU.

    This way we make sure that the tick stays alive to maintain
    fairness.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
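
    A sketch of the helper as described (editor's illustration):

        #ifdef CONFIG_NO_HZ_FULL
        bool sched_can_stop_tick(void)
        {
                struct rq *rq = this_rq();

                /* Make sure the rq->nr_running update is visible after the
                 * IPI that kicked us. */
                smp_rmb();

                /* More than one runnable task: keep the tick for preemption. */
                if (rq->nr_running > 1)
                        return false;

                return true;
        }
        #endif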
     

19 Apr, 2013

1 commit

  • As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
    feature bit was really just an early hack to make with/without
    mutex-spinning testable, so it is no longer necessary.

    This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
    moves the mutex spinning code from kernel/sched/core.c back to
    kernel/mutex.c, which is where it belongs.

    Signed-off-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Chandramouleeswaran Aswin
    Cc: Davidlohr Bueso
    Cc: Norton Scott J
    Cc: Rik van Riel
    Cc: Paul E. McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Clark Williams
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long