02 Oct, 2013

2 commits

  • commit 6c9a27f5da9609fca46cb2b183724531b48f71ad upstream.

    There is a small race between copy_process() and cgroup_attach_task()
    where child->se.parent,cfs_rq point to invalid (old) ones.

    parent doing fork()            | someone moving the parent to another cgroup
    -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                 -> se.parent,cfs_rq of parent
                                                    are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

    In the worst case, this bug can lead to a use-after-free and cause a panic,
    because it is the new cgroup's refcount that is incremented at (*1),
    so the old cgroup (and its related data) can be freed before (*2).

    In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
    [] place_entity+0x75/0xa0
    [] task_fork_fair+0xaa/0x160
    [] sched_fork+0x6b/0x140
    [] copy_process+0x5b2/0x1450
    [] ? wake_up_new_task+0xd9/0x130
    [] do_fork+0x94/0x460
    [] ? sys_wait4+0xae/0x100
    [] sys_clone+0x28/0x30
    [] stub_clone+0x13/0x20
    [] ? system_call_fastpath+0x16/0x1b

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Daisuke Nishimura
     
  • commit 5a8e01f8fa51f5cbce8f37acc050eb2319d12956 upstream.

    scale_stime() silently assumes that stime < rtime. Otherwise,
    when stime == rtime and both values are big enough (operations
    on them do not fit in 32 bits), the resulting scaled stime can
    be bigger than rtime. In consequence, utime = rtime - stime
    results in a negative value.

    User space visible symptoms of the bug are overflowed TIME
    values on ps/top, for example:

    $ ps aux | grep rcu
    root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0]
    root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0]
    root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
    root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0]
    root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1]
    root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]

    or overflowed utime values read directly from /proc/$PID/stat

    Reference:

    https://lkml.org/lkml/2013/8/20/259

    Reported-and-tested-by: Sergey Senozhatsky
    Signed-off-by: Stanislaw Gruszka
    Cc: stable@vger.kernel.org
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Stanislaw Gruszka
     

20 Aug, 2013

1 commit

  • commit bf0bd948d1682e3996adc093b43021ed391983e6 upstream.

    We typically update a task_group's shares within the dequeue/enqueue
    path. However, continuously running tasks sharing a CPU are not
    subject to these updates as they are only put/picked. Unfortunately,
    when we reverted f269ae046 (in 17bc14b7), we lost the augmenting
    periodic update that was supposed to account for this; resulting in a
    potential loss of fairness.

    To fix this, re-introduce the explicit update in
    update_cfs_rq_blocked_load() [called via entity_tick()].

    Reported-by: Max Hailperin
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Paul Turner
    Link: http://lkml.kernel.org/n/tip-9545m3apw5d93ubyrotrj31y@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

21 Jun, 2013

1 commit


19 Jun, 2013

1 commit

  • I have faced a sequence where the Idle Load Balance was sometime not
    triggered for a while on my platform, in the following scenario:

    CPU 0 and CPU 1 are running tasks and CPU 2 is idle

    CPU 1 kicks the Idle Load Balance
    CPU 1 selects CPU 2 as the new Idle Load Balancer
    CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
    CPU 1 sends a reschedule IPI to CPU 2

    While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking-up task A onto CPU 2

    CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
    task A quickly goes back to sleep (before a tick occurs on CPU 2)
    CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

    Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent
    because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
    performed.

    Until another part of the kernel happens to raise the sched softirq
    on CPU 2, NOHZ_BALANCE_KICK is never cleared.

    The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
    we can't raise the sched softirq for the Idle Load Balance.

    Change since V1:

    - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
    can't run on this CPU (as suggested by Peter)

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

31 May, 2013

1 commit

    While computing the cputime delta of dynticks CPUs,
    we are mixing up clocks of different natures:

    * local_clock(), which takes care of unstable clock
    sources and fixes them up if needed.

    * sched_clock(), which is the weaker version of
    local_clock(). It doesn't compute any fixup in case
    of an unstable source.

    If the clock source is stable, those two clocks are the
    same and we can safely compute the difference against
    two random points.

    Otherwise it results in random deltas as sched_clock()
    can randomly drift away, back or forward, from local_clock().

    As a consequence, strange behaviour has been observed with an
    unstable TSC, such as cputime stuck at a constant zero.
    (The 'top' command showing no load.)

    Fix this by only using local_clock(), or its irq safe/remote
    equivalent, in vtime code.

    Reported-by: Mike Galbraith
    Suggested-by: Mike Galbraith
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

06 May, 2013

1 commit

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.
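    Concretely, the three modes form a mutually exclusive kconfig choice (option names from the text above; the nohz_full= boot-parameter syntax is described in the NO_HZ documentation mentioned below):

```
# Pick exactly one of the three tick modes:
# CONFIG_HZ_PERIODIC=y     # tick always runs (former !CONFIG_NO_HZ)
# CONFIG_NO_HZ_IDLE=y      # tick stops in idle (former CONFIG_NO_HZ=y)
CONFIG_NO_HZ_FULL=y        # also slow the tick to 1 Hz on single-task CPUs

# CPUs to run in full-dynticks mode are listed on the boot command line:
#   nohz_full=1-7
```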

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to give input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to a crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     

04 May, 2013

1 commit

  • The scheduler doesn't yet fully support environments
    with a single task running without a periodic tick.

    In order to ensure we still maintain the duties of scheduler_tick(),
    keep at least 1 tick per second.

    This makes sure that we keep the progression of various scheduler
    accounting and background maintenance even with a very low granularity.
    Examples include cpu load, sched average, CFS entity vruntime,
    avenrun and events such as load balancing, amongst other details
    handled in sched_class::task_tick().

    This limitation will be removed in the future once we get
    these individual items to work in full dynticks CPUs.

    Suggested-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

03 May, 2013

1 commit


02 May, 2013

2 commits

  • The full dynticks tree needs the latest RCU and sched
    upstream updates in order to fix some dependencies.

    Merge a common upstream merge point that has these
    updates.

    Conflicts:
    include/linux/perf_event.h
    kernel/rcutree.h
    kernel/rcutree_plugin.h

    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     

01 May, 2013

4 commits

    One of the problems that arise when converting a dedicated custom
    threadpool to workqueue is that the shared worker pool used by workqueue
    anonymizes each worker, making it more difficult to identify what the
    worker was doing on which target from the output of sysrq-t or a debug
    dump from oops, BUG() and friends.

    This patch implements set_worker_desc() which can be called from any
    workqueue work function to set its description. When the worker task is
    dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
    and so on - the description will be printed out together with the
    workqueue name and the worker function pointer.

    The printing side is implemented by print_worker_info() which is called
    from functions in task dump paths - sched_show_task() and
    dump_stack_print_info(). print_worker_info() can be safely called on
    any task in any state as long as the task struct itself is accessible.
    It uses probe_*() functions to access worker fields. It may print
    garbage if something went very wrong, but it wouldn't cause (another)
    oops.

    The description is currently limited to 24 bytes including the
    terminating \0. worker->desc_valid and worker->desc[] are added, and
    the 64-byte marker, which was already incorrect before adding the new
    fields, is moved to the correct position.

    Here's an example dump with writeback updated to set the bdi name as
    worker desc.

    Hardware name: Bochs
    Modules linked in:
    Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
    Workqueue: writeback bdi_writeback_workfn (flush-8:0)
    ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
    ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
    ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x7f/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] bdi_writeback_workfn+0x2a0/0x3b0
    ...

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Acked-by: Jan Kara
    Cc: Oleg Nesterov
    Cc: Jens Axboe
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Dave Hansen reported strange utime/stime values on his system:
    https://lkml.org/lkml/2013/4/4/435

    This happens because the prev->stime value is bigger than the rtime
    value. The root of the problem is non-monotonic rtime values (i.e.
    the current rtime is smaller than the previous rtime), and that should
    be debugged and fixed.

    But since the problem did not manifest itself before commit
    62188451f0d63add7ad0cd2a1ae269d600c1663d ("cputime: Avoid
    multiplication overflow on utime scaling"), it should be treated
    as a regression, which we can easily fix in the cputime_adjust()
    function.

    For now, let's apply this fix, but further work is needed to fix
    root of the problem.

    Reported-and-tested-by: Dave Hansen
    Cc: # 3.9+
    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1367314507-9728-3-git-send-email-sgruszka@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
    Due to rounding in scale_stime(), for big numbers, scaled stime
    values will grow in chunks. Since rtime grows in jiffies and we
    calculate utime like below:

    prev->stime = max(prev->stime, stime);
    prev->utime = max(prev->utime, rtime - prev->stime);

    we could erroneously account stime values as utime. To prevent
    that, only update prev->{u,s}time values when they are smaller
    than the current rtime.

    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1367314507-9728-2-git-send-email-sgruszka@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
    Here is a patch which adds Linus's cputime scaling algorithm to the
    kernel.

    This is a follow-up (well, a fix) to commit
    d9a3c9823a2e6a543eb7807fb3d15d8233817ec5 ("sched: Lower chances
    of cputime scaling overflow"), which tried to avoid
    multiplication overflow but did not guarantee that the overflow
    would not happen.

    Linus created a different algorithm, which completely avoids the
    multiplication overflow by dropping precision when numbers are
    big.

    I tested it, and it gives a good relative error for the
    scaled numbers. The testing method is described here:
    http://marc.info/?l=linux-kernel&m=136733059505406&w=2

    Originally-From: Linus Torvalds
    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130430151441.GC10465@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

30 Apr, 2013

4 commits

  • Pull SMP/hotplug changes from Ingo Molnar:
    "This is a pretty large, multi-arch series unifying and generalizing
    the various disjunct pieces of idle routines that architectures have
    historically copied from each other and have grown in random, wildly
    inconsistent and sometimes buggy directions:

    101 files changed, 455 insertions(+), 1328 deletions(-)

    this went through a number of review and test iterations before it was
    committed, it was tested on various architectures, was exposed to
    linux-next for quite some time - nevertheless it might cause problems
    on architectures that don't read the mailing lists and don't regularly
    test linux-next.

    This cat-herding exercise was motivated by the -rt kernel, and was
    brought to you by Thomas "the Whip" Gleixner."

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    idle: Remove GENERIC_IDLE_LOOP config switch
    um: Use generic idle loop
    ia64: Make sure interrupts enabled when we "safe_halt()"
    sparc: Use generic idle loop
    idle: Remove unused ARCH_HAS_DEFAULT_IDLE
    bfin: Fix typo in arch_cpu_idle()
    xtensa: Use generic idle loop
    x86: Use generic idle loop
    unicore: Use generic idle loop
    tile: Use generic idle loop
    tile: Enter idle with preemption disabled
    sh: Use generic idle loop
    score: Use generic idle loop
    s390: Use generic idle loop
    powerpc: Use generic idle loop
    parisc: Use generic idle loop
    openrisc: Use generic idle loop
    mn10300: Use generic idle loop
    mips: Use generic idle loop
    microblaze: Use generic idle loop
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this development cycle were:

    - full dynticks preparatory work by Frederic Weisbecker

    - factor out the cpu time accounting code better, by Li Zefan

    - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

    - various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched: Fix init NOHZ_IDLE flag
    sched: Prevent to re-select dst-cpu in load_balance()
    sched: Rename load_balance_tmpmask to load_balance_mask
    sched: Move up affinity check to mitigate useless redoing overhead
    sched: Don't consider other cpus in our group in case of NEWLY_IDLE
    sched: Explicitly cpu_idle_type checking in rebalance_domains()
    sched: Change position of resched_cpu() in load_balance()
    sched: Fix wrong rq's runnable_avg update with rt tasks
    sched: Document task_struct::personality field
    sched/cpuacct/UML: Fix header file dependency bug on the UML build
    cgroup: Kill subsys.active flag
    sched/cpuacct: No need to check subsys active state
    sched/cpuacct: Initialize cpuacct subsystem earlier
    sched/cpuacct: Initialize root cpuacct earlier
    sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
    sched/cpuacct: Clean up cpuacct.h
    sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
    sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
    sched/cpuacct: Add cpuacct_acount_field()
    sched/cpuacct: Add cpuacct_init()
    ...

    Linus Torvalds
     
  • Pull workqueue updates from Tejun Heo:
    "A lot of activities on workqueue side this time. The changes achieve
    the followings.

    - WQ_UNBOUND workqueues - the workqueues which are not bound to any
    specific CPU - are updated to be able to interface with multiple
    backend worker pools. This involved a lot of churning, but the end
    result seems actually neater as unbound workqueues are now a lot
    closer to per-cpu ones.

    - The ability to interface with multiple backend worker pools are
    used to implement unbound workqueues with custom attributes.
    Currently the supported attributes are the nice level and CPU
    affinity. It may be expanded to include cgroup association in
    future. The attributes can be specified either by calling
    apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
    the workqueue in question is exported through sysfs.

    The backend worker pools are keyed by the actual attributes and
    shared by any workqueues which share the same attributes. When
    attributes of a workqueue are changed, the workqueue binds to the
    worker pool with the specified attributes while leaving the work
    items which are already executing in its previous worker pools
    alone.

    This allows converting custom worker pool implementations which
    want worker attribute tuning to use workqueues. The writeback pool
    is already converted in block tree and there are a couple others
    are likely to follow including btrfs io workers.

    - WQ_UNBOUND's ability to bind to multiple worker pools is also used
    to make it NUMA-aware. Because there's no association between work
    item issuer and the specific worker assigned to execute it, before
    this change, using unbound workqueue led to unnecessary cross-node
    bouncing and it couldn't be helped by autonuma as it requires tasks
    to have implicit node affinity and workers are assigned randomly.

    After these changes, an unbound workqueue now binds to multiple
    NUMA-affine worker pools so that queued work items are executed in
    the same node. This is turned on by default but can be disabled
    system-wide or for individual workqueues.

    Crypto was requesting NUMA affinity as encrypting data across
    different nodes can contribute noticeable overhead and doing it
    per-cpu was too limiting for certain cases and IO throughput could
    be bottlenecked by one CPU being fully occupied while others have
    idle cycles.

    While the new features required a lot of changes including
    restructuring locking, it didn't complicate the execution paths much.
    The unbound workqueue handling is now closer to per-cpu ones and the
    new features are implemented by simply associating a workqueue with
    different sets of backend worker pools without changing queue,
    execution or flush paths.

    As such, even though the amount of change is very high, I feel
    relatively safe in that it isn't likely to cause subtle issues with
    basic correctness of work item execution and handling. If something
    is wrong, it's likely to show up as being associated with worker pools
    with the wrong attributes or OOPS while workqueue attributes are being
    changed or during CPU hotplug.

    While this creates more backend worker pools, it doesn't add too many
    more workers unless, of course, there are many workqueues with unique
    combinations of attributes. Assuming everything else is the same,
    NUMA awareness costs an extra worker pool per NUMA node with online
    CPUs.

    There are also a couple things which are being routed outside the
    workqueue tree.

    - block tree pulled in workqueue for-3.10 so that writeback worker
    pool can be converted to unbound workqueue with sysfs control
    exposed. This simplifies the code, makes writeback workers
    NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

    - The conversion to workqueue means that there's no longer a 1:1
    association between a specific worker and its work, which makes
    writeback folks unhappy as they want to be able to tell which
    filesystem caused a problem from a backtrace on systems with many
    filesystems mounted. This is resolved by allowing work items to set
    a debug info string which is printed when the task is dumped. As this
    change involves unifying implementations of dump_stack() and friends
    in arch codes, it's being routed through Andrew's -mm tree."

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
    workqueue: use kmem_cache_free() instead of kfree()
    workqueue: avoid false negative WARN_ON() in destroy_workqueue()
    workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
    workqueue: implement NUMA affinity for unbound workqueues
    workqueue: introduce put_pwq_unlocked()
    workqueue: introduce numa_pwq_tbl_install()
    workqueue: use NUMA-aware allocation for pool_workqueues
    workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
    workqueue: map an unbound workqueues to multiple per-node pool_workqueues
    workqueue: move hot fields of workqueue_struct to the end
    workqueue: make workqueue->name[] fixed len
    workqueue: add workqueue->unbound_attrs
    workqueue: determine NUMA node of workers accourding to the allowed cpumask
    workqueue: drop 'H' from kworker names of unbound worker pools
    workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
    workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
    workqueue: fix memory leak in apply_workqueue_attrs()
    workqueue: fix unbound workqueue attrs hashing / comparison
    workqueue: fix race condition in unbound workqueue free path
    workqueue: remove pwq_lock which is no longer used
    ...

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

29 Apr, 2013

1 commit

  • Pull locking changes from Ingo Molnar:
    "The most noticeable change are mutex speedups from Waiman Long, for
    higher loads. These scalability changes should be most noticeable on
    larger server systems.

    There are also cleanups, fixes and debuggability improvements."

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    lockdep: Consolidate bug messages into a single print_lockdep_off() function
    lockdep: Print out additional debugging advice when we hit lockdep BUGs
    mutex: Back out architecture specific check for negative mutex count
    mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
    mutex: Make more scalable by doing less atomic operations
    mutex: Move mutex spinning code from sched/core.c back to mutex.c
    locking/rtmutex/tester: Set correct permissions on sysfs files
    lockdep: Remove unnecessary 'hlock_next' variable

    Linus Torvalds
     

26 Apr, 2013

1 commit

  • On my SMP platform which is made of 5 cores in 2 clusters, I
    have the nr_busy_cpu field of sched_group_power struct that is
    not null when the platform is fully idle - which makes the
    scheduler unhappy.

    The root cause is:

    During the boot sequence, some CPUs reach the idle loop and set
    their NOHZ_IDLE flag while waiting for other CPUs to boot. But
    the nr_busy_cpus field is initialized later with the assumption
    that all CPUs are in the busy state, whereas some CPUs have
    already set their NOHZ_IDLE flag.

    More generally, the NOHZ_IDLE flag must be initialized when new
    sched_domains are created in order to ensure that NOHZ_IDLE and
    nr_busy_cpus are aligned.

    This condition can be ensured by adding a synchronize_rcu()
    between the destruction of old sched_domains and the creation of
    new ones, so the NOHZ_IDLE flag will not be updated with an old
    sched_domain once it has been initialized. But this solution
    introduces an additional latency in the rebuild sequence that is
    called during cpu hotplug.

    As suggested by Frederic Weisbecker, another solution is to have
    the same RCU lifecycle for both the NOHZ_IDLE flag and the
    sched_domain struct. A new nohz_idle field is added to
    sched_domain so that both the status and the sched_domain share
    the same RCU lifecycle and are always synchronized. In addition,
    there is no more need to protect nohz_idle against concurrent
    access, as it is only modified by two exclusive functions called
    by the local cpu.

    This solution has been preferred to the creation of a new struct
    with an extra pointer indirection for sched_domain.

    The synchronization is done at the cost of:

    - An additional indirection and an rcu_dereference for accessing nohz_idle.
    - We use only the nohz_idle field of the top sched_domain.

    Signed-off-by: Vincent Guittot
    Acked-by: Peter Zijlstra
    Cc: linaro-kernel@lists.linaro.org
    Cc: peterz@infradead.org
    Cc: fweisbec@gmail.com
    Cc: pjt@google.com
    Cc: rostedt@goodmis.org
    Cc: efault@gmx.de
    Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org
    [ Fixed !NO_HZ build bug. ]
    Signed-off-by: Ingo Molnar

    Vincent Guittot
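    The idea behind the fix can be sketched in user-space C. This is a
    simplified, illustrative model only - the struct layouts are pared down
    and there is no RCU here; the helper names mirror the kernel's
    set_cpu_sd_state_{idle,busy} pattern but everything else is an
    assumption for illustration:

```c
#include <assert.h>

/* Toy model: the NOHZ_IDLE state lives inside sched_domain itself, so
 * the flag and the domain share one lifecycle; a stale domain's flag
 * can never desynchronize nr_busy_cpus in a freshly built domain. */
struct sched_domain {
    int nohz_idle;              /* 0 = busy, 1 = idle */
};

struct sched_group_power {
    int nr_busy_cpus;
};

/* Called only by the local CPU, so no locking is needed; the guard
 * makes the update idempotent. */
static void set_cpu_sd_state_idle(struct sched_domain *sd,
                                  struct sched_group_power *sgp)
{
    if (sd->nohz_idle)          /* already accounted as idle */
        return;
    sd->nohz_idle = 1;
    sgp->nr_busy_cpus--;
}

static void set_cpu_sd_state_busy(struct sched_domain *sd,
                                  struct sched_group_power *sgp)
{
    if (!sd->nohz_idle)
        return;
    sd->nohz_idle = 0;
    sgp->nr_busy_cpus++;
}
```

    Because a new sched_domain starts with nohz_idle == 0, nr_busy_cpus and
    the flag can never disagree the way they could during boot before the fix.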
     

24 Apr, 2013

6 commits

  • Commit 88b8dac0 makes load_balance() consider other cpus in its
    group, but there is no code to prevent re-selecting the same
    dst-cpu, so the same dst-cpu can be selected over and over.

    This patch adds functionality to load_balance() to exclude a cpu
    once it has been selected. We prevent re-selection of dst_cpu via
    env's cpus, so env's cpus is now a candidate set not only for
    src_cpus but also for dst_cpus.

    With this patch, we can remove lb_iterations and
    max_lb_iterations, because whether we can go ahead or not is now
    decided via env's cpus.

    Signed-off-by: Joonsoo Kim
    Acked-by: Peter Zijlstra
    Tested-by: Jason Low
    Cc: Srivatsa Vaddagiri
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366705662-3587-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Ingo Molnar

    Joonsoo Kim
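    The mechanism can be sketched with a plain bitmask standing in for the
    kernel's cpumask; the struct and helper names below are illustrative
    assumptions, not the kernel implementation:

```c
#include <assert.h>

#define NR_CPUS 8

/* env->cpus is the candidate mask: once a dst cpu proves unusable it
 * is cleared from the mask and can never be re-selected, so the loop
 * terminates when the mask runs out - no lb_iterations counter needed. */
struct lb_env {
    unsigned int cpus;          /* bitmask of candidate CPUs */
    int dst_cpu;
};

/* Pick the lowest set bit as the next dst candidate, or -1 if none. */
static int pick_dst_cpu(struct lb_env *env)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (env->cpus & (1u << cpu))
            return cpu;
    return -1;
}

/* Called when the current dst_cpu cannot take any task: exclude it. */
static void exclude_dst_cpu(struct lb_env *env)
{
    env->cpus &= ~(1u << env->dst_cpu);
}
```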
     
  • This name doesn't convey any specific meaning, so rename it to
    reflect its purpose.

    Signed-off-by: Joonsoo Kim
    Acked-by: Peter Zijlstra
    Tested-by: Jason Low
    Cc: Srivatsa Vaddagiri
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366705662-3587-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Ingo Molnar

    Joonsoo Kim
     
  • Currently, LBF_ALL_PINNED is cleared only after the affinity
    check has passed. So if task migration is skipped in move_tasks()
    because of a small load value or a small imbalance value, we
    don't clear LBF_ALL_PINNED, and we end up triggering a 'redo' in
    load_balance().

    The imbalance value is often so small that no task can be moved
    to other cpus, and of course this situation may persist after we
    change the target cpu. So this patch moves the affinity check up
    and clears LBF_ALL_PINNED before evaluating the load value, in
    order to mitigate the overhead of useless redos.

    In addition, some comments are re-ordered correctly.

    Signed-off-by: Joonsoo Kim
    Acked-by: Peter Zijlstra
    Tested-by: Jason Low
    Cc: Srivatsa Vaddagiri
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366705662-3587-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Ingo Molnar

    Joonsoo Kim
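    The reordering can be sketched as follows; names are simplified from
    the kernel's can_migrate_task() and the load test stands in for the
    real load/imbalance logic (an assumption for illustration):

```c
#include <assert.h>

#define LBF_ALL_PINNED 0x01

struct lb_env { unsigned int flags; int dst_cpu; };

struct task {
    unsigned int cpus_allowed;  /* bitmask of CPUs the task may run on */
    int load;
};

/* The affinity test now runs first and clears LBF_ALL_PINNED *before*
 * the load test can skip the task, so a too-small imbalance no longer
 * leaves the flag set and no longer triggers a pointless 'redo'. */
static int can_migrate_task(struct task *p, struct lb_env *env, int imbalance)
{
    /* 1) affinity check first */
    if (!(p->cpus_allowed & (1u << env->dst_cpu)))
        return 0;               /* truly pinned: flag stays set */
    env->flags &= ~LBF_ALL_PINNED;

    /* 2) load/imbalance check afterwards */
    if (p->load > imbalance)
        return 0;               /* skipped, but not "all pinned" */
    return 1;
}
```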
     
  • Commit 88b8dac0 makes load_balance() consider other cpus in its
    group regardless of idle type. For NEWLY_IDLE balancing we should
    not do this, because the motivation of NEWLY_IDLE balancing is to
    bring this cpu out of the idle state if needed, which is not the
    case for other cpus. So change the code not to consider other
    cpus for NEWLY_IDLE balancing.

    With this patch, the assignment 'if (pulled_task)
    this_rq->idle_stamp = 0' in idle_balance() becomes correct,
    because NEWLY_IDLE balancing no longer considers other cpus, so
    assigning to 'this_rq->idle_stamp' is now valid.

    Signed-off-by: Joonsoo Kim
    Tested-by: Jason Low
    Acked-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366705662-3587-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Ingo Molnar

    Joonsoo Kim
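    The guard amounts to a single condition; the helper below is purely
    illustrative (only the cpu_idle_type enum values follow the kernel):

```c
#include <assert.h>

enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

/* Alternative destination CPUs in the group are considered only when
 * balancing was NOT triggered by this cpu becoming newly idle, since
 * NEWLY_IDLE balancing exists solely to pull work onto *this* cpu. */
static int may_consider_other_dst_cpus(enum cpu_idle_type idle)
{
    return idle != CPU_NEWLY_IDLE;
}
```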
     
  • After commit 88b8dac0, dst-cpu can be changed in load_balance(),
    so we can no longer know the cpu_idle_type of dst-cpu when
    load_balance() returns a positive value. Therefore, add an
    explicit cpu_idle_type check.

    Signed-off-by: Joonsoo Kim
    Tested-by: Jason Low
    Acked-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366705662-3587-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Ingo Molnar

    Joonsoo Kim
     
  • cur_ld_moved is reset if env.flags hits LBF_NEED_BREAK, so there
    is a possibility that we miss calling resched_cpu(). Correct this
    by moving resched_cpu() before the LBF_NEED_BREAK check.

    Signed-off-by: Joonsoo Kim
    Tested-by: Jason Low
    Acked-by: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366705662-3587-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Ingo Molnar

    Joonsoo Kim
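    A toy model of the ordering problem (not the kernel code - the function
    and struct names are invented for illustration; resched_cpu() is stood
    in for by a flag):

```c
#include <assert.h>

#define LBF_NEED_BREAK 0x02

struct result { int resched_called; };

/* Act on cur_ld_moved (the resched of the destination cpu) *before*
 * the LBF_NEED_BREAK path resets it for the next iteration, so the
 * kick is not lost when move_tasks() asks for a break. */
static void after_move_tasks(unsigned int *flags, int *cur_ld_moved,
                             struct result *res)
{
    if (*cur_ld_moved)
        res->resched_called = 1;    /* stands in for resched_cpu() */

    if (*flags & LBF_NEED_BREAK) {
        *flags &= ~LBF_NEED_BREAK;
        *cur_ld_moved = 0;          /* reset happens only afterwards */
    }
}
```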
     

23 Apr, 2013

4 commits

  • When a task is scheduled in, it may have some properties
    of its own that could make the CPU reconsider the need for
    the tick: posix cpu timers, perf events, ...

    So notify the full dynticks subsystem when a task gets
    scheduled in and re-check the tick dependency at this
    stage. This is done through a self-IPI to avoid interfering
    with any locks currently held.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • The scheduler IPI is used by the scheduler to kick
    full dynticks CPUs asynchronously when more than one
    task is running or when a new timer list timer is
    enqueued. This way the destination CPU can decide
    to restart the tick to handle this new situation.

    Now let's call that kick in the scheduler IPI.

    (Reusing the scheduler IPI rather than implementing
    a new IPI was suggested by Peter Zijlstra a while ago)

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • Provide a new helper to be called from the full dynticks engine
    before stopping the tick in order to make sure we don't stop
    it when there is more than one task running on the CPU.

    This way we make sure that the tick stays alive to maintain
    fairness.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
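    The helper's contract boils down to a single check; this is a minimal
    sketch with rq reduced to a runnable-task counter (the real kernel
    helper of this era is more involved):

```c
#include <assert.h>

struct rq { int nr_running; };

/* The tick may be stopped only while at most one task is runnable,
 * so the tick keeps enforcing fairness whenever tasks must share
 * the CPU. */
static int sched_can_stop_tick(struct rq *rq)
{
    return rq->nr_running <= 1;
}
```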
     
  • Kick the tick on full dynticks CPUs when they get more
    than one task running on their queue. This makes sure that
    local fairness is maintained by the tick on the destination.

    This is done regardless of these tasks' class. We should be
    able to be more clever in the future depending on the class,
    e.g. a CPU that runs a SCHED_FIFO task doesn't need to maintain
    fairness against local pending tasks of the fair class.

    But keep things simple for now.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

21 Apr, 2013

1 commit

  • The current update of the rq's load can be erroneous when RT
    tasks are involved.

    The update of the load of a rq that becomes idle, is done only
    if the avg_idle is less than sysctl_sched_migration_cost. If RT
    tasks and short idle duration alternate, the runnable_avg will
    not be updated correctly and the time will be accounted as idle
    time when a CFS task wakes up.

    A new idle_enter function is called when the next task is the
    idle function so the elapsed time will be accounted as run time
    in the load of the rq, whatever the average idle time is. The
    function update_rq_runnable_avg is removed from idle_balance.

    When a RT task is scheduled on an idle CPU, the update of the
    rq's load is not done when the rq exits the idle state, because
    CFS's functions are not called. Then idle_balance, which is
    called just before entering the idle function, updates the rq's
    load and assumes that the elapsed time since the last update
    was only running time.

    As a consequence, the rq's load of a CPU that only runs a
    periodic RT task, is close to LOAD_AVG_MAX whatever the running
    duration of the RT task is.

    A new idle_exit function is called when the prev task is the
    idle function so the elapsed time will be accounted as idle time
    in the rq's load.

    Signed-off-by: Vincent Guittot
    Acked-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    Cc: linaro-kernel@lists.linaro.org
    Cc: peterz@infradead.org
    Cc: pjt@google.com
    Cc: fweisbec@gmail.com
    Cc: efault@gmx.de
    Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
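    The accounting fix can be modeled with a pair of hooks; this is a toy
    model in arbitrary time units (the real code updates a decaying
    runnable_avg, and the struct/field names here are invented):

```c
#include <assert.h>

/* idle_enter() charges the time since the last update as *run* time
 * just before the CPU goes idle; idle_exit() charges it as *idle*
 * time when the CPU leaves idle - even if only RT tasks ran in
 * between, so the CFS load no longer drifts toward LOAD_AVG_MAX on a
 * CPU running only a periodic RT task. */
struct rq_load {
    long last_update;
    long run_time;
    long idle_time;
};

static void idle_enter(struct rq_load *l, long now)
{
    l->run_time += now - l->last_update;    /* elapsed time was running */
    l->last_update = now;
}

static void idle_exit(struct rq_load *l, long now)
{
    l->idle_time += now - l->last_update;   /* elapsed time was idle */
    l->last_update = now;
}
```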
     

19 Apr, 2013

1 commit

  • As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
    feature bit was really just an early hack to make mutex-spinning
    testable with and without it, so it is no longer necessary.

    This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
    moves the mutex spinning code from kernel/sched/core.c back to
    kernel/mutex.c, which is where it belongs.

    Signed-off-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Chandramouleeswaran Aswin
    Cc: Davidlohr Bueso
    Cc: Norton Scott J
    Cc: Rik van Riel
    Cc: Paul E. McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Clark Williams
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

16 Apr, 2013

1 commit

  • "Extended nohz" was used as a naming base for the full dynticks
    API and Kconfig symbols. It reflects the fact the system tries
    to stop the tick in more places than just idle.

    But that "extended" name is a bit opaque and vague. Rename it to
    "full" makes it clearer what the system tries to do under this
    config: try to shutdown the tick anytime it can. The various
    constraints that prevent that to happen shouldn't be considered
    as fundamental properties of this feature but rather technical
    issues that may be solved in the future.

    Reported-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

15 Apr, 2013

1 commit


10 Apr, 2013

5 commits