15 Jan, 2012

1 commit

  • * 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
    capabilities: remove __cap_full_set definition
    security: remove the security_netlink_recv hook as it is equivalent to capable()
    ptrace: do not audit capability check when outputting /proc/pid/stat
    capabilities: remove task_ns_* functions
    capabilities: ns_capable can use the cap helpers rather than lsm call
    capabilities: style only - move capable below ns_capable
    capabilities: introduce new has_ns_capabilities_noaudit
    capabilities: call has_ns_capability from has_capability
    capabilities: remove all _real_ interfaces
    capabilities: introduce security_capable_noaudit
    capabilities: reverse arguments to security_capable
    capabilities: remove the task from capable LSM hook entirely
    selinux: sparse fix: fix several warnings in the security server code
    selinux: sparse fix: fix warnings in netlink code
    selinux: sparse fix: eliminate warnings for selinuxfs
    selinux: sparse fix: declare selinux_disable() in security.h
    selinux: sparse fix: move selinux_complete_init
    selinux: sparse fix: make selinux_secmark_refcount static
    SELinux: Fix RCU deref check warning in sel_netport_insert()

    Manually fix up a semantic mis-merge wrt security_netlink_recv():

    - the interface was removed in commit fd7784615248 ("security: remove
    the security_netlink_recv hook as it is equivalent to capable()")

    - a new user of it appeared in commit a38f7907b926 ("crypto: Add
    userspace configuration API")

    causing no automatic merge conflict, but Eric Paris pointed out the
    issue.

    Linus Torvalds
     

12 Jan, 2012

2 commits

  • * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix lockup by limiting load-balance retries on lock-break
    sched: Fix CONFIG_CGROUP_SCHED dependency
    sched: Remove empty #ifdefs

    Linus Torvalds
     
    Eric and David reported dead machines and traced it to commit
    a195f004 ("sched: Fix load-balance lock-breaking"); it turns out
    there's still a scenario where we can end up re-trying forever.

    Since there is no strict forward-progress guarantee in the
    load-balance iteration, we can get stuck retrying the same
    task-set over and over.

    Creating a forward-progress guarantee with the existing
    structure is somewhat non-trivial; for now, simply terminate the
    retry loop after a few tries.
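    A minimal sketch of the shape of such a bound (the helper and the
    retry limit are illustrative, not the actual kernel code):

        /* Give up after a fixed number of passes when the loop has no
         * forward-progress guarantee. */
        #define LB_MAX_RETRIES 3

        static int balance_pass_incomplete(void) { return 1; /* stub */ }

        static void load_balance_retry(void)
        {
                int tries = 0;

                while (balance_pass_incomplete() && ++tries < LB_MAX_RETRIES)
                        ;       /* retry the load-balance pass */
        }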

    Reported-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Reported-by: David Ahern
    [ logic cleanup as suggested by Eric ]
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Frederic Weisbecker
    Cc: Suresh Siddha
    Link: http://lkml.kernel.org/r/1326297936.2442.157.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Jan, 2012

2 commits

  • Signed-off-by: Hiroshi Shimamoto
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4F0B8525.8070901@ct.jp.nec.com
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
     
  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignment out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
    reiserfs: Properly display mount options in /proc/mounts
    vfs: prevent remount read-only if pending removes
    vfs: count unlinked inodes
    vfs: protect remounting superblock read-only
    vfs: keep list of mounts for each superblock
    vfs: switch ->show_options() to struct dentry *
    vfs: switch ->show_path() to struct dentry *
    vfs: switch ->show_devname() to struct dentry *
    vfs: switch ->show_stats to struct dentry *
    switch security_path_chmod() to struct path *
    vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
    vfs: trim includes a bit
    switch mnt_namespace ->root to struct mount
    vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
    vfs: opencode mntget() mnt_set_mountpoint()
    vfs: spread struct mount - remaining argument of next_mnt()
    vfs: move fsnotify junk to struct mount
    vfs: move mnt_devname
    vfs: move mnt_list to struct mount
    vfs: switch pnode.h macros to struct mount *
    ...

    Linus Torvalds
     

08 Jan, 2012

1 commit

  • * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (73 commits)
    arm: fix up some samsung merge sysdev conversion problems
    firmware: Fix an oops on reading fw_priv->fw in sysfs loading file
    Drivers:hv: Fix a bug in vmbus_driver_unregister()
    driver core: remove __must_check from device_create_file
    debugfs: add missing #ifdef HAS_IOMEM
    arm: time.h: remove device.h #include
    driver-core: remove sysdev.h usage.
    clockevents: remove sysdev.h
    arm: convert sysdev_class to a regular subsystem
    arm: leds: convert sysdev_class to a regular subsystem
    kobject: remove kset_find_obj_hinted()
    m68k: gpio - convert sysdev_class to a regular subsystem
    mips: txx9_sram - convert sysdev_class to a regular subsystem
    mips: 7segled - convert sysdev_class to a regular subsystem
    sh: dma - convert sysdev_class to a regular subsystem
    sh: intc - convert sysdev_class to a regular subsystem
    power: suspend - convert sysdev_class to a regular subsystem
    power: qe_ic - convert sysdev_class to a regular subsystem
    power: cmm - convert sysdev_class to a regular subsystem
    s390: time - convert sysdev_class to a regular subsystem
    ...

    Fix up conflicts with 'struct sysdev' removal from various platform
    drivers that got changed:
    - arch/arm/mach-exynos/cpu.c
    - arch/arm/mach-exynos/irq-eint.c
    - arch/arm/mach-s3c64xx/common.c
    - arch/arm/mach-s3c64xx/cpu.c
    - arch/arm/mach-s5p64x0/cpu.c
    - arch/arm/mach-s5pv210/common.c
    - arch/arm/plat-samsung/include/plat/cpu.h
    - arch/powerpc/kernel/sysfs.c
    and fix up cpu_is_hotpluggable() as per Greg in include/linux/cpu.h

    Linus Torvalds
     

24 Dec, 2011

1 commit

    If CONFIG_SCHEDSTATS is defined, the kernel maintains
    information about how long the task was sleeping or, in the
    case of iowait, blocking in the kernel before getting woken
    up.

    This will be useful for sleep time profiling.

    Note: this information is only provided for sched_fair.
    Other scheduling classes may choose to provide this in
    the future.

    Note: the delay includes the time spent on the runqueue
    as well.
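    Conceptually the bookkeeping looks like the sketch below (field and
    helper names are illustrative, not the kernel's):

        /* Record a timestamp when the task blocks; fold the delta into
         * a running total at wakeup. */
        struct sleep_stats {
                unsigned long long sleep_start; /* 0 when not sleeping */
                unsigned long long sum_sleep;   /* accumulated sleep time */
        };

        static unsigned long long clock_ns(void) { return 0; /* stub clock */ }

        static void account_wakeup(struct sleep_stats *st)
        {
                if (st->sleep_start) {
                        st->sum_sleep += clock_ns() - st->sleep_start;
                        st->sleep_start = 0;
                }
        }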

    Signed-off-by: Arun Sharma
    Acked-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Arnaldo Carvalho de Melo
    Cc: Andrew Vagin
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1324512940-32060-2-git-send-email-asharma@fb.com
    Signed-off-by: Ingo Molnar

    Arun Sharma
     

23 Dec, 2011

1 commit

  • The panic-on-framebuffer code seems to cause a schedule
    to occur during an oops. This causes a bunch of extra
    spew as can be seen in:

    https://bugzilla.redhat.com/attachment.cgi?id=549230

    Don't do scheduler debug checks when we are oopsing already.
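    The shape of the fix, roughly (oops_in_progress is a real kernel
    global; the surrounding function here is illustrative):

        extern int oops_in_progress;

        static void sched_debug_checks(void)
        {
                if (oops_in_progress)
                        return; /* don't pile debug warnings onto an oops */

                /* ...the usual scheduler debug checks run here... */
        }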

    Signed-off-by: Dave Jones
    Link: http://lkml.kernel.org/r/20111222213929.GA4722@redhat.com
    Signed-off-by: Ingo Molnar

    Dave Jones
     

21 Dec, 2011

7 commits

  • There is a small race between try_to_wake_up() and sched_move_task(),
    which is trying to move the process being woken up.

    try_to_wake_up() on CPU0         sched_move_task() on CPU1
    --------------------------------+---------------------------------
    raw_spin_lock_irqsave(p->pi_lock)
    task_waking_fair()
      -> p.se.vruntime -= cfs_rq->min_vruntime
    ttwu_queue()
      -> send reschedule IPI to CPU1
    raw_spin_unlock_irqsave(p->pi_lock)
                                     task_rq_lock()
                                       -> trying to acquire both p->pi_lock
                                          and rq->lock with IRQs disabled
                                     task_move_group_fair()
                                       -> p.se.vruntime
                                            -= (old)cfs_rq->min_vruntime
                                            += (new)cfs_rq->min_vruntime
                                     task_rq_unlock()
    (via IPI)
    sched_ttwu_pending()
      raw_spin_lock(rq->lock)
      ttwu_do_activate()
        ...
        enqueue_entity()
          child.se->vruntime += cfs_rq->min_vruntime
      raw_spin_unlock(rq->lock)

    As a result, vruntime of the process becomes far bigger than min_vruntime,
    if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

    This patch fixes the problem by simply ignoring such a process in
    task_move_group_fair(), because its vruntime has already been normalized in
    task_waking_fair().
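    A minimal sketch of the idea (types, names and the state value are
    illustrative):

        #define TASK_WAKING 0x100       /* illustrative state value */

        struct task_sketch {
                int state;
                unsigned long long vruntime;
        };

        static void move_group_vruntime(struct task_sketch *p, int queued,
                                        unsigned long long old_min,
                                        unsigned long long new_min)
        {
                /* A waking task was already normalized in task_waking_fair();
                 * adjusting it again would apply the delta twice. */
                if (!queued && p->state == TASK_WAKING)
                        return;

                if (!queued) {
                        p->vruntime -= old_min;
                        p->vruntime += new_min;
                }
        }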

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20111215143741.df82dd50.nishimura@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     
  • There is a small race between do_fork() and sched_move_task(), which is
    trying to move the child.

    do_fork()                        sched_move_task()
    --------------------------------+---------------------------------
    copy_process()
      sched_fork()
        task_fork_fair()
          -> vruntime of the child is initialized
             based on that of the parent.
    -> we can see the child in "tasks" file now.
                                     task_rq_lock()
                                     task_move_group_fair()
                                       -> child.se.vruntime
                                            -= (old)cfs_rq->min_vruntime
                                            += (new)cfs_rq->min_vruntime
                                     task_rq_unlock()
    wake_up_new_task()
      ...
      enqueue_entity()
        child.se.vruntime += cfs_rq->min_vruntime

    As a result, vruntime of the child becomes far bigger than min_vruntime,
    if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

    This patch fixes the problem by simply ignoring such a process in
    task_move_group_fair(), because its vruntime has already been normalized in
    task_fork_fair().

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20111215143607.2ee12c5d.nishimura@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     
  • There is a small race between task_fork_fair() and sched_move_task(),
    which is trying to move the parent.

    task_fork_fair()                 sched_move_task()
    --------------------------------+---------------------------------
    cfs_rq = task_cfs_rq(current)
      -> cfs_rq is the "old" one.
    curr = cfs_rq->curr
      -> curr is set to the parent.
                                     task_rq_lock()
                                     dequeue_task()
                                       -> parent.se.vruntime -=
                                            (old)cfs_rq->min_vruntime
                                     enqueue_task()
                                       -> parent.se.vruntime +=
                                            (new)cfs_rq->min_vruntime
                                     task_rq_unlock()
    raw_spin_lock_irqsave(rq->lock)
    se->vruntime = curr->vruntime
      -> vruntime of the child is set to that of the parent,
         which has already been updated by sched_move_task().
    se->vruntime -= (old)cfs_rq->min_vruntime
    raw_spin_unlock_irqrestore(rq->lock)

    As a result, vruntime of the child becomes far bigger than expected,
    if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

    This patch fixes this problem by setting "cfs_rq" and "curr" after
    holding the rq->lock.
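    The ordering fix, sketched (lock and field names are illustrative):

        struct rq_sketch {
                int lock;               /* stand-in for rq->lock */
                void *curr;             /* currently running entity */
        };

        static void lock_rq(struct rq_sketch *rq)   { rq->lock = 1; }
        static void unlock_rq(struct rq_sketch *rq) { rq->lock = 0; }

        static void task_fork_sketch(struct rq_sketch *rq)
        {
                void *curr;

                lock_rq(rq);
                curr = rq->curr;        /* sample only while the lock is
                                         * held, so a concurrent cgroup
                                         * move cannot change it under us */
                (void)curr;
                unlock_rq(rq);
        }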

    Signed-off-by: Daisuke Nishimura
    Acked-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20111215143655.662676b0.nishimura@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     
    Remove the cfs bandwidth period check from tg_set_cfs_period().
    The valid bandwidth period's lower/upper limits are denoted by
    min_cfs_quota_period/max_cfs_quota_period respectively, and the period
    is already checked against them in tg_set_cfs_bandwidth().

    As pjt pointed out, negative input will result in very large unsigned
    numbers and will be caught by the max allowed period test.
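    The negative-input point in miniature (the bound here is illustrative,
    not the kernel's actual limit):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long max_period = 1000000ULL;
                long long user_input = -1;      /* negative input... */
                unsigned long long period = (unsigned long long)user_input;

                if (period > max_period)        /* ...becomes huge and is caught */
                        printf("rejected: %llu > max\n", period);
                return 0;
        }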

    Signed-off-by: Kamalesh Babulal
    Acked-by: Paul Turner
    [ammended changelog to mention negative values]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111210135925.GA14593@linux.vnet.ibm.com
    --
    kernel/sched/core.c | 3 ---
    1 file changed, 3 deletions(-)

    Signed-off-by: Ingo Molnar

    Kamalesh Babulal
     
    The current lock break relies on contention on the rq locks, contention
    which might never come because we've got IRQs disabled. Or, conversely,
    which will very likely come, because on anything with more than 2 cpus
    a synchronized load-balance pass will very likely cause contention on
    the rq locks.

    Also, the sched_nr_migrate thing fails when it gets trapped in the loops
    of either the cgroup muck in load_balance_fair() or the move_tasks()
    load condition.

    Instead, use the new lb_flags field to propagate break/abort
    conditions through all these loops and create a new loop outside the
    IRQ-disabled region for when a break is required.
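    A minimal sketch of the flag-propagation shape (flag values and names
    are illustrative):

        #define LBF_NEED_BREAK  0x01
        #define LBF_ABORT       0x02

        struct lb_env_sketch { unsigned int lb_flags; };

        static void move_tasks_sketch(struct lb_env_sketch *env)
        {
                /* inner loops set a flag instead of spinning forever: */
                env->lb_flags |= LBF_NEED_BREAK;
        }

        static void load_balance_sketch(struct lb_env_sketch *env)
        {
                do {
                        env->lb_flags &= ~LBF_NEED_BREAK;
                        /* IRQs disabled for the pass */
                        move_tasks_sketch(env);
                        /* IRQs re-enabled here, between passes */
                } while ((env->lb_flags & LBF_NEED_BREAK) &&
                         !(env->lb_flags & LBF_ABORT));
        }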

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-tsceb6w61q0gakmsccix6xxi@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Replace the all_pinned argument with a flags field so that we can add
    some extra controls throughout that entire call chain.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-33kevm71m924ok1gpxd720v3@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Mike reported a 13% drop in netperf TCP_RR performance due to the
    new remote wakeup code. Suresh too noticed some performance issues
    with it.

    Reducing the IPIs to only cross cache domains solves the observed
    performance issues.
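    The check, roughly, as a kernel-style fragment (not standalone; the
    cache-sharing helper's spelling varies across kernel versions):

        if (sched_feat(TTWU_QUEUE) &&
            !ttwu_share_cache(smp_processor_id(), cpu)) {
                ttwu_queue_remote(p, cpu);  /* IPI only across cache domains */
                return;
        }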

    Reported-by: Suresh Siddha
    Reported-by: Mike Galbraith
    Acked-by: Suresh Siddha
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Chris Mason
    Cc: Dave Kleikamp
    Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 Dec, 2011

1 commit

  • Yong Zhang reported:

    > [ INFO: suspicious RCU usage. ]
    > kernel/sched/fair.c:5091 suspicious rcu_dereference_check() usage!

    This is due to the sched_domain stuff being RCU protected and
    commit 0b005cf5 ("sched, nohz: Implement sched group, domain
    aware nohz idle load balancing") overlooking this fact.

    The sd variable only lives inside the for_each_domain() block,
    so we only need to wrap that.
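    The fix, roughly (kernel-style fragment, not standalone):

        rcu_read_lock();
        for_each_domain(cpu, sd) {
                /* 'sd' is only valid inside this RCU read-side section */
        }
        rcu_read_unlock();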

    Reported-by: Yong Zhang
    Tested-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Link: http://lkml.kernel.org/r/1323264728.32012.107.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Dec, 2011

7 commits

    The intention is to set the NOHZ_BALANCE_KICK flag for the 'ilb_cpu',
    not for 'cpu', which is the local cpu. Fix the typo.

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1323199594.1984.18.camel@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
    The cpu bit in nohz.idle_cpu_mask is reset in the first busy tick after
    exiting idle. So during nohz_idle_balance(), the intention is to double
    check whether a cpu that is part of idle_cpu_mask is indeed idle before
    going ahead and performing idle balance for that cpu.

    Fix the cpu typo in the idle_cpu() check during nohz_idle_balance().

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1323199177.1984.12.camel@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Now that we initialize jump_labels before sched_init() we can use them
    for the debug features without having to worry about a window where
    they have the wrong setting.
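    Roughly, as a kernel-style fragment (this is the 3.3-era spelling;
    the jump-label API has been renamed since, so treat the names as
    illustrative):

        /* A disabled debug feature compiles down to a patched no-op
         * branch instead of a load-and-compare on every check. */
        #define sched_feat(x) \
                (static_branch(&sched_feat_keys[__SCHED_FEAT_##x]))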

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The order of parameters is inverted. The index parameter
    should come first.

    Signed-off-by: Glauber Costa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322863119-14225-3-git-send-email-glommer@parallels.com
    Signed-off-by: Ingo Molnar

    Glauber Costa
     
    Now that we're pointing cpuacct's root cgroup to cpustat and accounting
    through task_group_account_field(), we should not access cpustat
    directly. Since the accessor function already does so, we would end up
    accounting it twice, which is wrong.

    Signed-off-by: Glauber Costa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322863119-14225-2-git-send-email-glommer@parallels.com
    Signed-off-by: Ingo Molnar

    Glauber Costa
     
  • Right now, after we collect tick statistics for user and system and store them
    in a well known location, we keep the same statistics again for cpuacct.
    Since cpuacct is hierarchical, the numbers for the root cgroup should be
    absolutely equal to the system-wide numbers.

    So it would be better to just use that: this patch changes cpuacct
    accounting so that the cpustat statistics are kept in a percpu struct
    kernel_cpustat array. In the root cgroup case, we just point it to the
    main array. The rest of the hierarchy walk can be disabled entirely
    later with a static branch - but I am not doing it here.
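    A sketch of the arrangement (types and sizes are illustrative; in the
    kernel the array is percpu):

        #define NR_SKETCH_CPUS 8

        struct kernel_cpustat_sketch { unsigned long long stat[10]; };

        static struct kernel_cpustat_sketch global_stat[NR_SKETCH_CPUS];

        struct cpuacct_sketch {
                struct kernel_cpustat_sketch *cpustat;
        };

        static void init_root_cgroup(struct cpuacct_sketch *root)
        {
                /* the root cgroup aliases the system-wide array, so its
                 * numbers are the system-wide numbers by construction */
                root->cpustat = global_stat;
        }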

    Signed-off-by: Glauber Costa
    Signed-off-by: Peter Zijlstra
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/1322498719-2255-4-git-send-email-glommer@parallels.com
    Signed-off-by: Ingo Molnar

    Glauber Costa
     
  • hrtick_start_fair() shows up in profiles even when disabled.

    v3.0.6

    taskset -c 3 pipe-test

    PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3)
    ------------------------------------------------------------------------------------------------

                     Virgin                                       Patched
      samples  pcnt function                     samples  pcnt function
      _______ _____ ___________________________  _______ _____ ___________________________

      2880.00 10.2% __schedule                   3136.00 11.3% __schedule
      1634.00  5.8% pipe_read                    1615.00  5.8% pipe_read
      1458.00  5.2% system_call                  1534.00  5.5% system_call
      1382.00  4.9% _raw_spin_lock_irqsave       1412.00  5.1% _raw_spin_lock_irqsave
      1202.00  4.3% pipe_write                   1255.00  4.5% copy_user_generic_string
      1164.00  4.1% copy_user_generic_string     1241.00  4.5% __switch_to
      1097.00  3.9% __switch_to                   929.00  3.3% mutex_lock
       872.00  3.1% mutex_lock                    846.00  3.0% mutex_unlock
       687.00  2.4% mutex_unlock                  804.00  2.9% pipe_write
       682.00  2.4% native_sched_clock            713.00  2.6% native_sched_clock
       643.00  2.3% system_call_after_swapgs      653.00  2.3% _raw_spin_unlock_irqrestore
       617.00  2.2% sched_clock_local             633.00  2.3% fsnotify
       612.00  2.2% fsnotify                      605.00  2.2% sched_clock_local
       596.00  2.1% _raw_spin_unlock_irqrestore   593.00  2.1% system_call_after_swapgs
       542.00  1.9% sysret_check                  559.00  2.0% sysret_check
       467.00  1.7% fget_light                    472.00  1.7% fget_light
       462.00  1.6% finish_task_switch            461.00  1.7% finish_task_switch
       437.00  1.5% vfs_write                     442.00  1.6% vfs_write
       431.00  1.5% do_sync_write                 428.00  1.5% do_sync_write
       413.00  1.5% select_task_rq_fair           404.00  1.5% _raw_spin_lock_irq
       386.00  1.4% update_curr                   402.00  1.4% update_curr
       385.00  1.4% rw_verify_area                389.00  1.4% do_sync_read
       377.00  1.3% _raw_spin_lock_irq            378.00  1.4% vfs_read
       369.00  1.3% do_sync_read                  340.00  1.2% pipe_iov_copy_from_user
       360.00  1.3% vfs_read                      316.00  1.1% __wake_up_sync_key
    *  342.00  1.2% hrtick_start_fair             313.00  1.1% __wake_up_common

    Signed-off-by: Mike Galbraith
    [ fixed !CONFIG_SCHED_HRTICK borkage ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1321971607.6855.17.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

06 Dec, 2011

12 commits

  • We already have a pointer to the cgroup parent (whose data is more likely
    to be in the cache than this, anyway), so there is no need to have this one
    in cpuacct.

    This patch makes the underlying cgroup be used instead.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Paul Turner
    Cc: Li Zefan
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322498719-2255-3-git-send-email-glommer@parallels.com
    Signed-off-by: Ingo Molnar

    Glauber Costa
     
    This patch changes the fields in cpustat from a structure to a u64
    array. Math gets easier, and the code is more flexible.
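    Illustrative before/after of the change (the enum names mirror the
    real ones, but the sketch is not the kernel's definition):

        enum cpu_usage_stat_sketch {
                CPUTIME_USER,
                CPUTIME_SYSTEM,
                CPUTIME_IDLE,
                NR_STATS,
        };

        struct kernel_cpustat_sketch {
                unsigned long long cpustat[NR_STATS];   /* was named fields */
        };

        static unsigned long long total_time(const struct kernel_cpustat_sketch *k)
        {
                unsigned long long sum = 0;
                int i;

                for (i = 0; i < NR_STATS; i++)
                        sum += k->cpustat[i];   /* uniform math over an index */
                return sum;
        }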

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322498719-2255-2-git-send-email-glommer@parallels.com
    Signed-off-by: Ingo Molnar

    Glauber Costa
     
    nr_busy_cpus in sched_group_power indicates whether the group is
    semi-idle or not. This helps remove is_semi_idle_group() and simplify
    find_new_ilb() in the context of finding an optimal cpu that can do
    idle load balancing.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20111202010832.656983582@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
    When there are many logical cpus that enter and exit idle often,
    members of the global nohz data structure are modified very
    frequently, causing a lot of cache-line contention.

    Make nohz idle load balancing more scalable by using the sched domain
    topology and the 'nr_busy_cpus' count in struct sched_group_power.

    Idle load balance is kicked on one of the idle cpus when there is at
    least one idle cpu and:

    - a busy rq has more than one task, or

    - a busy rq's scheduler group shares package resources (like HT/MC
    siblings) and more than one member of that group is busy, or

    - for the SD_ASYM_PACKING domain, the lower-numbered cpus in that
    domain are idle compared to the busy ones.

    This helps kick the idle load balancing request only when there is a
    potential imbalance. And once things are mostly balanced, these kicks
    will be minimized.

    These changes helped improve a context-switch-intensive workload
    running between a number of task pairs by 2x on an 8-socket NHM-EX
    based system.

    Reported-by: Tim Chen
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
    Introduce nr_busy_cpus in struct sched_group_power [not in sched_group,
    because sched groups are duplicated for the SD_OVERLAP scheduler
    domain]. For each cpu that enters or exits idle, this counter is
    updated in each scheduler group of the scheduler domain that the cpu
    belongs to.

    To avoid the frequent update of this state as the cpu enters
    and exits idle, the update of the stat during idle exit is
    delayed to the first timer tick that happens after the cpu becomes busy.
    This is done using NOHZ_IDLE flag in the struct rq's nohz_flags.
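    A sketch of the update discipline (C11 atomics stand in for the
    kernel's atomic_t; names are illustrative):

        #include <stdatomic.h>

        struct sched_group_power_sketch { atomic_int nr_busy_cpus; };

        static void on_idle_entry(struct sched_group_power_sketch *sgp)
        {
                atomic_fetch_sub(&sgp->nr_busy_cpus, 1);
        }

        static void on_first_busy_tick(struct sched_group_power_sketch *sgp)
        {
                /* deferred from idle exit to the first busy tick */
                atomic_fetch_add(&sgp->nr_busy_cpus, 1);
        }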

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20111202010832.555984323@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Introduce nohz_flags in the struct rq, which will track these two flags
    for now.

    NOHZ_TICK_STOPPED records that the tick has been stopped. It is used
    to update the nohz idle load balancer data structures during the first
    busy tick after the tick is restarted; at that first busy tick after
    tickless idle, the NOHZ_TICK_STOPPED flag is reset. This minimizes the
    nohz idle load balancer status updates that currently happen on every
    tickless exit, making it more scalable when many logical cpus enter
    and exit idle often.

    NOHZ_BALANCE_KICK will track the need for nohz idle load balance
    on this rq. This will replace the nohz_balance_kick in the rq, which was
    not being updated atomically.
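    Sketched with C11 atomics (the kernel uses atomic bit operations on
    the rq's nohz_flags word; names and bit values are illustrative):

        #include <stdatomic.h>

        enum { NOHZ_TICK_STOPPED = 1u << 0, NOHZ_BALANCE_KICK = 1u << 1 };

        static _Atomic unsigned int nohz_flags;

        /* returns the previous state, so concurrent kickers race safely */
        static int test_and_set_balance_kick(void)
        {
                unsigned int old = atomic_fetch_or(&nohz_flags, NOHZ_BALANCE_KICK);
                return !!(old & NOHZ_BALANCE_KICK);
        }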

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20111202010832.499438999@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • The second call to sched_rt_period() is redundant, because the value of the
    rt_runtime was already read and it was protected by the ->rt_runtime_lock.

    Signed-off-by: Shan Hai
    Reviewed-by: Kamalesh Babulal
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322535836-13590-2-git-send-email-haishan.bai@gmail.com
    Signed-off-by: Ingo Molnar

    Shan Hai
     
    For the SD_OVERLAP domain, the sched_groups for each CPU's sched_domain
    are privately allocated and not shared with any other cpu. So the sched
    group allocation should come from the node of the cpu for which the
    SD_OVERLAP sched domain is being set up.
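    The allocation change, roughly (kernel-style fragment, not standalone;
    kzalloc_node() and cpu_to_node() are the relevant primitives):

        sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
                          GFP_KERNEL, cpu_to_node(cpu));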

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111118230554.164910950@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
    This is another case where we are on our way to schedule(), so we can
    save a useless clock update and the resulting microscopic vruntime
    update.
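    Roughly, as a kernel-style fragment (rq->skip_clock_update is the
    existing mechanism being reused here):

        rq->skip_clock_update = 1;      /* schedule() will update the clock */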

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1321971686.6855.18.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
    rt.nr_cpus_allowed is always available; use it to bail out of
    select_task_rq() when only one cpu can be used, saving some cycles
    for pinned tasks (a sketch of the bail-out follows the profile).

    See the line marked with '*' below:

    # taskset -c 3 pipe-test

    PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3)
    ------------------------------------------------------------------------------------------------

                     Virgin                                       Patched
      samples  pcnt function                     samples  pcnt function
      _______ _____ ___________________________  _______ _____ ___________________________

      2880.00 10.2% __schedule                   3136.00 11.3% __schedule
      1634.00  5.8% pipe_read                    1615.00  5.8% pipe_read
      1458.00  5.2% system_call                  1534.00  5.5% system_call
      1382.00  4.9% _raw_spin_lock_irqsave       1412.00  5.1% _raw_spin_lock_irqsave
      1202.00  4.3% pipe_write                   1255.00  4.5% copy_user_generic_string
      1164.00  4.1% copy_user_generic_string     1241.00  4.5% __switch_to
      1097.00  3.9% __switch_to                   929.00  3.3% mutex_lock
       872.00  3.1% mutex_lock                    846.00  3.0% mutex_unlock
       687.00  2.4% mutex_unlock                  804.00  2.9% pipe_write
       682.00  2.4% native_sched_clock            713.00  2.6% native_sched_clock
       643.00  2.3% system_call_after_swapgs      653.00  2.3% _raw_spin_unlock_irqrestore
       617.00  2.2% sched_clock_local             633.00  2.3% fsnotify
       612.00  2.2% fsnotify                      605.00  2.2% sched_clock_local
       596.00  2.1% _raw_spin_unlock_irqrestore   593.00  2.1% system_call_after_swapgs
       542.00  1.9% sysret_check                  559.00  2.0% sysret_check
       467.00  1.7% fget_light                    472.00  1.7% fget_light
       462.00  1.6% finish_task_switch            461.00  1.7% finish_task_switch
       437.00  1.5% vfs_write                     442.00  1.6% vfs_write
       431.00  1.5% do_sync_write                 428.00  1.5% do_sync_write
    *  413.00  1.5% select_task_rq_fair           404.00  1.5% _raw_spin_lock_irq
       386.00  1.4% update_curr                   402.00  1.4% update_curr
       385.00  1.4% rw_verify_area                389.00  1.4% do_sync_read
       377.00  1.3% _raw_spin_lock_irq            378.00  1.4% vfs_read
       369.00  1.3% do_sync_read                  340.00  1.2% pipe_iov_copy_from_user
       360.00  1.3% vfs_read                      316.00  1.1% __wake_up_sync_key
       342.00  1.2% hrtick_start_fair             313.00  1.1% __wake_up_common
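    The bail-out, roughly, as a kernel-style fragment (not standalone):

        if (p->rt.nr_cpus_allowed == 1)
                return task_cpu(p);     /* pinned: skip the domain walk */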

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1321971504.6855.15.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
    Instead of going through the scheduler domain hierarchy multiple times
    (to give priority to an idle core over an idle SMT sibling in a busy
    core), start with the highest scheduler domain that has the
    SD_SHARE_PKG_RESOURCES flag and traverse the domain hierarchy down
    until we find an idle group.

    This cleanup also addresses an issue reported by Mike where the recent
    changes returned the busy thread even in the presence of an idle SMT
    sibling in single socket platforms.
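    The traversal shape, roughly (kernel-style fragment; treat the helper
    name as an assumption about this cleanup's internals):

        for (sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
             sd; sd = sd->child) {
                /* scan sd's groups; stop at the first fully idle group */
        }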

    Signed-off-by: Suresh Siddha
    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1321556904.15339.25.camel@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
    This tracepoint shows how long a task is sleeping in uninterruptible
    state.

    For example, it may show how long, and where, a mutex was waited for.
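    Usage, roughly, as a kernel-style fragment (the tracepoint fires at
    wakeup with the accumulated uninterruptible-sleep delay):

        trace_sched_stat_blocked(tsk, delta);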

    Signed-off-by: Andrew Vagin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322471015-107825-8-git-send-email-avagin@openvz.org
    Signed-off-by: Ingo Molnar

    Andrew Vagin
     

17 Nov, 2011

1 commit