20 Apr, 2014

1 commit


17 Apr, 2014

1 commit


11 Apr, 2014

1 commit

  • Sasha reported that lockdep claims that the following commit
    made numa_group.lock interrupt unsafe:

    156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")

    While I don't see how that could be, given the commit in question moved
    task_numa_free() from one irq enabled region to another, the below does
    make both gripes and lockups upon gripe with numa=fake=4 go away.
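
    The shape of the fix is roughly the following (a sketch, not the
    verbatim patch; the point is taking numa_group.lock with the
    irq-saving lock variant so lockdep sees an irq-safe acquisition):

        /* in task_numa_free(), sketch */
        unsigned long flags;

        spin_lock_irqsave(&grp->lock, flags);  /* was: spin_lock(&grp->lock) */
        /* fold the task's NUMA fault stats out of the group */
        grp->nr_tasks--;
        spin_unlock_irqrestore(&grp->lock, flags);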

    Reported-by: Sasha Levin
    Fixes: 156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: torvalds@linux-foundation.org
    Cc: mgorman@suse.com
    Cc: akpm@linux-foundation.org
    Cc: Dave Jones
    Link: http://lkml.kernel.org/r/1396860915.5170.5.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

08 Apr, 2014

3 commits

  • Merge second patch-bomb from Andrew Morton:
    - the rest of MM
    - zram updates
    - zswap updates
    - exit
    - procfs
    - exec
    - wait
    - crash dump
    - lib/idr
    - rapidio
    - adfs, affs, bfs, ufs
    - cris
    - Kconfig things
    - initramfs
    - small amount of IPC material
    - percpu enhancements
    - early ioremap support
    - various other misc things

    * emailed patches from Andrew Morton : (156 commits)
    MAINTAINERS: update Intel C600 SAS driver maintainers
    fs/ufs: remove unused ufs_super_block_third pointer
    fs/ufs: remove unused ufs_super_block_second pointer
    fs/ufs: remove unused ufs_super_block_first pointer
    fs/ufs/super.c: add __init to init_inodecache()
    doc/kernel-parameters.txt: add early_ioremap_debug
    arm64: add early_ioremap support
    arm64: initialize pgprot info earlier in boot
    x86: use generic early_ioremap
    mm: create generic early_ioremap() support
    x86/mm: sparse warning fix for early_memremap
    lglock: map to spinlock when !CONFIG_SMP
    percpu: add preemption checks to __this_cpu ops
    vmstat: use raw_cpu_ops to avoid false positives on preemption checks
    slub: use raw_cpu_inc for incrementing statistics
    net: replace __this_cpu_inc in route.c with raw_cpu_inc
    modules: use raw_cpu_write for initialization of per cpu refcount.
    mm: use raw_cpu ops for determining current NUMA node
    percpu: add raw_cpu_ops
    slub: fix leak of 'name' in sysfs_slab_add
    ...

    Linus Torvalds
     
  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes
    with the right macro in the kernel subsystem.
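
    For illustration, a minimal sketch of the substitution (hypothetical
    function name):

        #include <linux/compiler.h>

        /* before: void __attribute__((weak)) arch_foo_init(void); */
        void __weak arch_foo_init(void)
        {
            /* default no-op; an architecture may override this weak symbol */
        }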

    Signed-off-by: Gideon Israel Dsouza
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • This is the final piece in the puzzle, as all patches to remove the
    last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
    into the 3.15 merge window. The work was long overdue, and this
    interface in particular should not have survived the BKL removal
    that was done a couple of years ago.

    Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3:

    "[...] it was suggested that the janitors look for and fix all code
    that calls sleep_on() [...] since (1) almost all such code is
    incorrect, and (2) Linus has agreed that those functions should
    be removed in the 2.5 development series".

    We haven't quite made it for 2.5, but maybe we can merge this for 3.15.

    Signed-off-by: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

04 Apr, 2014

3 commits

  • Merge first patch-bomb from Andrew Morton:
    - Various misc bits
    - kmemleak fixes
    - small befs, codafs, cifs, efs, freevxfs, hfsplus, minixfs, reiserfs things
    - fanotify
    - I appear to have become SuperH maintainer
    - ocfs2 updates
    - direct-io tweaks
    - a bit of the MM queue
    - printk updates
    - MAINTAINERS maintenance
    - some backlight things
    - lib/ updates
    - checkpatch updates
    - the rtc queue
    - nilfs2 updates
    - Small Documentation/ updates

    * emailed patches from Andrew Morton : (237 commits)
    Documentation/SubmittingPatches: remove references to patch-scripts
    Documentation/SubmittingPatches: update some dead URLs
    Documentation/filesystems/ntfs.txt: remove changelog reference
    Documentation/kmemleak.txt: updates
    fs/reiserfs/super.c: add __init to init_inodecache
    fs/reiserfs: move prototype declaration to header file
    fs/hfsplus/attributes.c: add __init to hfsplus_create_attr_tree_cache()
    fs/hfsplus/extents.c: fix concurrent acess of alloc_blocks
    fs/hfsplus/extents.c: remove unused variable in hfsplus_get_block
    nilfs2: update project's web site in nilfs2.txt
    nilfs2: update MAINTAINERS file entries fix
    nilfs2: verify metadata sizes read from disk
    nilfs2: add FITRIM ioctl support for nilfs2
    nilfs2: add nilfs_sufile_trim_fs to trim clean segs
    nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl
    nilfs2: add nilfs_sufile_set_suinfo to update segment usage
    nilfs2: add struct nilfs_suinfo_update and flags
    nilfs2: update MAINTAINERS file entries
    fs/coda/inode.c: add __init to init_inodecache()
    BEFS: logging cleanup
    ...

    Linus Torvalds
     
  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    kernel/user.c obj-y
    kernel/kexec.c bool KEXEC (one instance per arch)
    kernel/profile.c bool PROFILING
    kernel/hung_task.c bool DETECT_HUNG_TASK
    kernel/sched/stats.c bool SCHEDSTATS
    kernel/user_namespace.c bool USER_NS

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier). However no observable impact of that
    difference has been observed during testing.

    Also, two instances of missing ";" at EOL are fixed in kexec.
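
    As a sketch of the mechanical change in each audited file (hypothetical
    init function):

        #include <linux/init.h>

        static int __init foo_init(void)
        {
            return 0;
        }
        /* before: module_init(foo_init); -- misleading for obj-y code */
        subsys_initcall(foo_init);  /* level 4-subsys, slightly earlier */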

    Signed-off-by: Paul Gortmaker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h includes to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h; cgroup.h is
    indirectly included by a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

03 Apr, 2014

1 commit

  • Pull sched/idle changes from Ingo Molnar:
    "More idle code reorganization, to prepare for more integration.

    (Sent separately because it depended on pending timer work, which is
    now upstream)"

    * 'sched-idle-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/idle: Add more comments to the code
    sched/idle: Move idle conditions in cpuidle_idle main function
    sched/idle: Reorganize the idle loop
    cpuidle/idle: Move the cpuidle_idle_call function to idle.c
    idle/cpuidle: Split cpuidle_idle_call main function into smaller functions

    Linus Torvalds
     

02 Apr, 2014

3 commits

  • Pull core block layer updates from Jens Axboe:
    "This is the pull request for the core block IO bits for the 3.15
    kernel. It's a smaller round this time, it contains:

    - Various little blk-mq fixes and additions from Christoph and
    myself.

    - Cleanup of the IPI usage from the block layer, and associated
    helper code. From Frederic Weisbecker and Jan Kara.

    - Duplicate code cleanup in bio-integrity from Gu Zheng. This will
    give you a merge conflict, but that should be easy to resolve.

    - blk-mq notify spinlock fix for RT from Mike Galbraith.

    - A blktrace partial accounting bug fix from Roman Pen.

    - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"

    * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
    blk-mq: add REQ_SYNC early
    rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
    blk-mq: support partial I/O completions
    blk-mq: merge blk_mq_insert_request and blk_mq_run_request
    blk-mq: remove blk_mq_alloc_rq
    blk-mq: don't dump CPU -> hw queue map on driver load
    blk-mq: fix wrong usage of hctx->state vs hctx->flags
    blk-mq: allow blk_mq_init_commands() to return failure
    block: remove old blk_iopoll_enabled variable
    blktrace: fix accounting of partially completed requests
    smp: Rename __smp_call_function_single() to smp_call_function_single_async()
    smp: Remove wait argument from __smp_call_function_single()
    watchdog: Simplify a little the IPI call
    smp: Move __smp_call_function_single() below its safe version
    smp: Consolidate the various smp_call_function_single() declensions
    smp: Teach __smp_call_function_single() to check for offline cpus
    smp: Remove unused list_head from csd
    smp: Iterate functions through llist_for_each_entry_safe()
    block: Stop abusing rq->csd.list in blk-softirq
    block: Remove useless IPI struct initialization
    ...

    Linus Torvalds
     
  • Pull timer changes from Thomas Gleixner:
    "This assorted collection provides:

    - A new timer based timer broadcast feature for systems which do not
    provide a global accessible timer device. That allows those
    systems to put CPUs into deep idle states where the per cpu timer
    device stops.

    - A few NOHZ_FULL related improvements to the timer wheel

    - The usual updates to timer devices found in ARM SoCs

    - Small improvements and updates all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    tick: Remove code duplication in tick_handle_periodic()
    tick: Fix spelling mistake in tick_handle_periodic()
    x86: hpet: Use proper destructor for delayed work
    workqueue: Provide destroy_delayed_work_on_stack()
    clocksource: CMT, MTU2, TMU and STI should depend on GENERIC_CLOCKEVENTS
    timer: Remove code redundancy while calling get_nohz_timer_target()
    hrtimer: Rearrange comments in the order struct members are declared
    timer: Use variable head instead of &work_list in __run_timers()
    clocksource: exynos_mct: silence a static checker warning
    arm: zynq: Add support for cpufreq
    arm: zynq: Don't use arm_global_timer with cpufreq
    clocksource/cadence_ttc: Overhaul clocksource frequency adjustment
    clocksource/cadence_ttc: Call clockevents_update_freq() with IRQs enabled
    clocksource: Add Kconfig entries for CMT, MTU2, TMU and STI
    sh: Remove Kconfig entries for TMU, CMT and MTU2
    ARM: shmobile: Remove CMT, TMU and STI Kconfig entries
    clocksource: armada-370-xp: Use atomic access for shared registers
    clocksource: orion: Use atomic access for shared registers
    clocksource: timer-keystone: Delete unnecessary variable
    clocksource: timer-keystone: introduce clocksource driver for Keystone
    ...

    Linus Torvalds
     
  • Pull timer updates from Ingo Molnar:
    "The main purpose is to fix a full dynticks bug related to
    virtualization, where steal time accounting appears to be zero in
    /proc/stat even after a few seconds of competing guests running busy
    loops in a same host CPU. It's not a regression though as it was
    there since the beginning.

    The other commits are preparatory work to fix the bug and various
    cleanups"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arch: Remove stub cputime.h headers
    sched: Remove needless round trip nsecs tick conversion of steal time
    cputime: Fix jiffies based cputime assumption on steal accounting
    cputime: Bring cputime -> nsecs conversion
    cputime: Default implementation of nsecs -> cputime conversion
    cputime: Fix nsecs_to_cputime() return type cast

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "There are two memory management related changes, the CMMA support for
    KVM to avoid swap-in of freed pages and the split page table lock for
    the PMD level. These two come with common code changes in mm/.

    A fix for the long standing theoretical TLB flush problem, this one
    comes with a common code change in kernel/sched/.

    Another set of changes is Heiko's uaccess work, included is the initial
    set of patches with more to come.

    And fixes and cleanups as usual"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (36 commits)
    s390/con3270: optionally disable auto update
    s390/mm: remove unecessary parameter from pgste_ipte_notify
    s390/mm: remove unnecessary parameter from gmap_do_ipte_notify
    s390/mm: fixing comment so that parameter name match
    s390/smp: limit number of cpus in possible cpu mask
    hypfs: Add clarification for "weight_min" attribute
    s390: update defconfigs
    s390/ptrace: add support for PTRACE_SINGLEBLOCK
    s390/perf: make print_debug_cf() static
    s390/topology: Remove call to update_cpu_masks()
    s390/compat: remove compat exec domain
    s390: select CONFIG_TTY for use of tty in unconditional keyboard driver
    s390/appldata_os: fix cpu array size calculation
    s390/checksum: remove memset() within csum_partial_copy_from_user()
    s390/uaccess: remove copy_from_user_real()
    s390/sclp_early: Return correct HSA block count also for zero
    s390: add some drivers/subsystems to the MAINTAINERS file
    s390: improve debug feature usage
    s390/airq: add support for irq ranges
    s390/mm: enable split page table lock for PMD level
    ...

    Linus Torvalds
     

20 Mar, 2014

1 commit

  • There are only two users of get_nohz_timer_target(): timer and hrtimer. Both
    call it under the same circumstances, i.e.

    #ifdef CONFIG_NO_HZ_COMMON
        if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
            return get_nohz_timer_target();
    #endif

    So it makes more sense to fold all of this into get_nohz_timer_target()
    instead of duplicating the code in two places. For this, another
    parameter, 'pinned', is required to be passed to this routine.
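
    A sketch of the folded helper (illustrative, not the exact upstream
    body; the busy-CPU search is abbreviated to a hypothetical helper):

        static int get_nohz_timer_target(int pinned)
        {
            int cpu = smp_processor_id();

            /* pinned timers, migration disabled, or a busy CPU: stay put */
            if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
                return cpu;

            return find_busy_cpu_nearby(cpu);  /* hypothetical name */
        }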

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: fweisbec@gmail.com
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     

13 Mar, 2014

2 commits

  • When update_rq_clock_task() accounts the pending steal time for a task,
    it converts the steal delta from nsecs to ticks and then from ticks
    back to nsecs.

    There is no apparent good reason for doing that, because both
    the task clock and the prev steal delta are u64 and store values
    in nsecs.

    So let's remove the needless conversion.
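
    A standalone illustration of what the round trip did (plain C, HZ
    assumed to be 100; both sides of the real code are u64 nsecs, so the
    conversion buys nothing and merely truncates to tick granularity):

        #include <stdio.h>
        #include <stdint.h>

        #define HZ 100
        #define NSEC_PER_SEC 1000000000ULL
        #define TICK_NSEC (NSEC_PER_SEC / HZ)

        int main(void)
        {
            uint64_t steal = 5000000;               /* 5 ms pending steal, in nsecs */
            uint64_t ticks = steal / TICK_NSEC;     /* nsecs -> ticks: 0 */
            uint64_t back  = ticks * TICK_NSEC;     /* ticks -> nsecs: 0 */

            /* everything below one tick (10 ms here) is deferred to a
             * later update instead of being accounted right away */
            printf("steal=%llu ns, after round trip=%llu ns\n",
                   (unsigned long long)steal, (unsigned long long)back);
            return 0;
        }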

    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Acked-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • The steal guest time accounting code assumes that cputime_t is based on
    jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
    is based on nsecs, steal_account_process_tick() passes the delta in
    jiffies to account_steal_time() which then accounts it as if it's a
    value in nsecs.

    As a result, accounting 1 second of steal time (with HZ=100 that would
    be 100 jiffies) is spuriously accounted as 100 nsecs.

    As such /proc/stat may report 0 values of steal time even when two
    guests have run concurrently for a few seconds on the same host and
    same CPU.

    In order to fix this, let's convert the nsecs based steal delta to
    cputime instead of jiffies by using the right conversion API.

    Given that the steal time is stored in cputime_t and this type can have
    a smaller granularity than nsecs, we only account the rounded converted
    value and leave the remaining nsecs for the next deltas.
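
    A sketch of the fixed logic (illustrative, not the verbatim patch;
    nsecs_to_cputime()/cputime_to_nsecs() are the conversion APIs the text
    refers to):

        /* steal_account_process_tick(), sketch */
        u64 steal;
        cputime_t steal_ct;

        steal = paravirt_steal_clock(smp_processor_id());
        steal -= this_rq()->prev_steal_time;

        /* account only the rounded cputime value; the sub-granularity
         * remainder stays in the delta for the next pass */
        steal_ct = nsecs_to_cputime(steal);
        this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct);

        account_steal_time(steal_ct);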

    Reported-by: Huiqingding
    Reported-by: Marcelo Tosatti
    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Acked-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

12 Mar, 2014

3 commits

  • task_hot() doesn't need the 'sched_domain' parameter, so remove it.

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1394607111-1904-1-git-send-email-alex.shi@linaro.org
    Signed-off-by: Ingo Molnar

    Alex Shi
     
  • The tmp value has already been calculated in:

        scaled_busy_load_per_task =
            (busiest->load_per_task * SCHED_POWER_SCALE) /
            busiest->group_power;

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1394555166-22894-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • I decided to run my tests on linux-next, and my wakeup_rt tracer was
    broken. After running a bisect, I found that the problem commit was:

    linux-next commit c365c292d059
    "sched: Consider pi boosting in setscheduler()"

    And the reason the wakeup_rt tracer test was failing was because it had
    no RT task to trace. I first noticed this when running with the
    sched_switch event and saw that my RT task still had the normal
    SCHED_OTHER priority. Looking at the problem commit, I found:

    - p->normal_prio = normal_prio(p);
    - p->prio = rt_mutex_getprio(p);

    With no

    + p->normal_prio = normal_prio(p);
    + p->prio = rt_mutex_getprio(p);

    Reading what the commit is supposed to do, I realize that the p->prio
    can't be set if the task is boosted with a higher prio, but the
    p->normal_prio still needs to be set regardless; otherwise, when the
    task is deboosted, it won't get the new priority.

    The p->prio has to be set before check_class_changed() is called,
    otherwise the class won't be changed.

    Also added a fix to newprio to include a missing check for the deadline
    policy. This change was suggested by Juri Lelli.
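
    The resulting order, roughly (a sketch; the real patch funnels the
    common part through __setscheduler_params()):

        static void __setscheduler(struct rq *rq, struct task_struct *p,
                       const struct sched_attr *attr)
        {
            __setscheduler_params(p, attr); /* always sets p->normal_prio */

            /*
             * No PI waiters are boosting the task here, so the normal
             * prio can be used; it must be set before
             * check_class_changed() runs, or the class switch is missed.
             */
            p->prio = normal_prio(p);
        }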

    Signed-off-by: Steven Rostedt
    Cc: Sebastian Andrzej Siewior
    Cc: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

11 Mar, 2014

12 commits

  • Bad idea on -rt:

    [ 908.026136] [] rt_spin_lock_slowlock+0xaa/0x2c0
    [ 908.026145] [] task_numa_free+0x31/0x130
    [ 908.026151] [] finish_task_switch+0xce/0x100
    [ 908.026156] [] thread_return+0x48/0x4ae
    [ 908.026160] [] schedule+0x25/0xa0
    [ 908.026163] [] rt_spin_lock_slowlock+0xd5/0x2c0
    [ 908.026170] [] get_signal_to_deliver+0xaf/0x680
    [ 908.026175] [] do_signal+0x3d/0x5b0
    [ 908.026179] [] do_notify_resume+0x90/0xe0
    [ 908.026186] [] int_signal+0x12/0x17
    [ 908.026193] [] 0x7ff2a388b1cf

    and since upstream does not mind where we do this, be a bit nicer ...
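
    The move, in sketch form (other teardown in __put_task_struct()
    elided):

        void __put_task_struct(struct task_struct *tsk)
        {
            /*
             * Free NUMA state when the last reference is dropped rather
             * than in finish_task_switch(), where -rt may not sleep.
             */
            task_numa_free(tsk);

            security_task_free(tsk);
            /* ... */
            free_task(tsk);
        }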

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Check the number of fair tasks to decide that we've pulled a task;
    rq's nr_running may contain throttled RT tasks.
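
    I.e., roughly (a sketch of the changed check in idle_balance()):

        /* count only runnable fair tasks; rq->nr_running may still
         * include throttled RT tasks that cannot actually run */
        if (this_rq->cfs.h_nr_running && !pulled_task)
            pulled_task = 1;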

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1394118975.19290.104.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • 1) Single cpu machine case.

    When the rq has only RT tasks, but none of them can be picked
    because of throttling, we enter an endless loop.

    pick_next_task_{dl,rt} return NULL.

    In pick_next_task_fair() we permanently go to retry:

        if (rq->nr_running != rq->cfs.h_nr_running)
            return RETRY_TASK;

    (rq->nr_running is not decremented when rt_rq becomes
    throttled).

    There is no chance to unthrottle any rt_rq or to wake a fair task
    here, because the rq is locked permanently and interrupts are
    disabled.

    2) In case of SMP this can cause a hang too. Although we unlock
    the rq in idle_balance(), interrupts are still disabled.

    The solution is to check for available tasks in the DL and RT
    classes instead of checking the sum.
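
    In sketch form (per-class counters instead of comparing sums; treat
    the exact placement as illustrative):

        /* retry only when a higher class actually has runnable tasks;
         * throttled RT tasks inflate rq->nr_running but cannot run */
        if (rq->dl.dl_nr_running || rq->rt.rt_nr_running)
            return RETRY_TASK;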

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1394098321.19290.11.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Close the idle_exit_fair() bracket in case we've pulled something or
    we've received a task of a higher priority class.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/1394098315.19290.10.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • The problems:

    1) We check rt_nr_running before the call of put_prev_task().
    If the previous task is RT, its rt_rq may become throttled
    and dequeued after this call.

    In case p is from rt->rq this just causes picking a task
    from the throttled queue, but in case its rt_rq is a child
    we are guaranteed to catch a BUG_ON.

    2) The same with the deadline class. The only difference is we
    operate only on the dl_rq.

    This patch fixes all the above problems and adds a small skip in the
    DL update like we've already done for the RT class:

        if (unlikely((s64)delta_exec <= 0))
            return;

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Link: http://lkml.kernel.org/r/1393946746.3643.3.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • The idle main function is a complex and critical function. Added more
    comments to the code.

    Signed-off-by: Daniel Lezcano
    Acked-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: tglx@linutronix.de
    Cc: rjw@rjwysocki.net
    Cc: preeti@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1393832934-11625-5-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     
  • This patch moves the conditions checked before entering idle into the
    cpuidle main function located in idle.c. That simplifies the idle main
    loop functions and increases the readability of the conditions for
    truly entering idle.

    This patch is code reorganization and does not change the behavior of the
    function.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Peter Zijlstra
    Cc: tglx@linutronix.de
    Cc: rjw@rjwysocki.net
    Cc: nicolas.pitre@linaro.org
    Cc: preeti@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1393832934-11625-4-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     
  • Now that we have the main cpuidle function in idle.c, move some code from
    the idle main loop to this function for the sake of clarity.

    That removes if-then-else indentation that was difficult to follow when
    looking at the code. This patch does not change the current behavior.

    Signed-off-by: Daniel Lezcano
    Acked-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: tglx@linutronix.de
    Cc: rjw@rjwysocki.net
    Cc: preeti@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1393832934-11625-3-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     
  • cpuidle_idle_call() does nothing more than call the three individual
    functions and is no longer used by any arch specific code, only by the
    cpuidle framework code.

    We can move this function into the idle task code to ensure better
    proximity to the scheduler code.

    Signed-off-by: Daniel Lezcano
    Acked-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: rjw@rjwysocki.net
    Cc: preeti@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1393832934-11625-2-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     
  • Pick up fixes before queueing up new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Prevent tracing of preempt_disable/enable() in sched_clock_cpu().
    When CONFIG_DEBUG_PREEMPT is enabled, preempt_disable/enable() are
    traced and this causes trace_clock() users (and probably others) to
    go into an infinite recursion. Systems with a stable sched_clock()
    are not affected.

    This problem is similar to that fixed by upstream commit 95ef1e52922
    ("KVM guest: prevent tracing recursion with kvmclock").

    Signed-off-by: Fernando Luis Vazquez Cao
    Signed-off-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1394083528.4524.3.camel@nexus
    Signed-off-by: Ingo Molnar

    Fernando Luis Vazquez Cao
     
  • Deny the use of SCHED_DEADLINE policy to unprivileged users.
    Even if root users can set the policy for normal users, we
    don't want the latter to be able to change their parameters
    (safest behavior).
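
    The check, in sketch form (it sits among the other unprivileged-user
    tests in __sched_setscheduler()):

        if (user && !capable(CAP_SYS_NICE)) {
            /* can't set/change SCHED_DEADLINE policy at all for now */
            if (dl_policy(policy))
                return -EPERM;
            /* other permission checks elided */
        }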

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

27 Feb, 2014

7 commits

  • Michael spotted that the idle_balance() push down created a task
    priority problem.

    Previously, when we called idle_balance() before pick_next_task() it
    wasn't a problem when -- because of the rq->lock droppage -- an rt/dl
    task slipped in.

    Similarly for pre_schedule(), rt pre-schedule could have a dl task
    slip in.

    But by pulling it into the pick_next_task() loop, we'll not try a
    higher task priority again.

    Cure this by creating a re-start condition in pick_next_task(); and
    triggering this from pick_next_task_{rt,fair}().

    It also fixes a live-lock where we get stuck in pick_next_task_fair()
    due to idle_balance() seeing !0 nr_running but there not actually
    being any fair tasks about.
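
    The re-start condition amounts to a retry loop in pick_next_task();
    roughly (RETRY_TASK is a sentinel pointer the pick functions return
    after dropping rq->lock):

        again:
            for_each_class(class) {
                p = class->pick_next_task(rq, prev);
                if (p) {
                    if (unlikely(p == RETRY_TASK))
                        goto again; /* rq->lock was dropped; a higher
                                     * class task may have appeared */
                    return p;
                }
            }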

    Reported-by: Michael Wang
    Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()")
    Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit cf37b6b48428d ("sched/idle: Move cpu/idle.c to sched/idle.c")
    said to simply move a file; somehow it got mangled and created an old
    version of the file and forgot to remove the old file.

    Fix this fail; add the lost change and remove the now identical old
    file.

    Signed-off-by: Peter Zijlstra
    Cc: rjw@rjwysocki.net
    Cc: nicolas.pitre@linaro.org
    Cc: preeti@linux.vnet.ibm.com
    Cc: Daniel Lezcano
    Link: http://lkml.kernel.org/r/20140224172207.GC9987@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The struct sched_avg of struct rq is only used in case group
    scheduling is enabled, inside __update_tg_runnable_avg(), to update the
    per-cpu representation of a task group. I.e. there is no need to
    maintain the runnable avg of a rq in the !CONFIG_FAIR_GROUP_SCHED case.

    This patch guards struct sched_avg of struct rq and
    update_rq_runnable_avg() with CONFIG_FAIR_GROUP_SCHED.

    There is an extra empty definition for update_rq_runnable_avg()
    necessary for the !CONFIG_FAIR_GROUP_SCHED && CONFIG_SMP case.

    The function print_cfs_group_stats() which prints out struct sched_avg
    of struct rq is already guarded with CONFIG_FAIR_GROUP_SCHED.
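
    In sketch form (illustrative placement):

        /* struct rq, kernel/sched/sched.h */
        #ifdef CONFIG_FAIR_GROUP_SCHED
            struct sched_avg avg;
        #endif

        /* kernel/sched/fair.c: empty stub for the remaining SMP case */
        #if !defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
        static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
        #endif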

    Reviewed-by: Ben Segall
    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/530DCDC5.1060406@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • Kirill Tkhai noted:

    Since deadline tasks share rt bandwidth, we must care about
    bandwidth timer set. Otherwise rt_time may grow up to infinity
    in update_curr_dl(), if there are no other available RT tasks
    on top level bandwidth.

    RT tasks were in fact throttled right after they got enqueued,
    and never executed again (rt_time never again went below rt_runtime).

    Peter then proposed to accrue DL execution on rt_time only when
    the rt timer is active, and proposed a patch (this patch is a slight
    modification of that) to implement that behavior. While this
    solves Kirill's problem, it has a drawback.

    Indeed, Kirill noted again:

    It looks like we may get into a situation where all CPU time is shared
    between RT and DL tasks:

    rt_runtime = n
    rt_period = 2n

    | RT working, DL sleeping | DL working, RT sleeping |
    -----------------------------------------------------------
    | (1) duration = n | (2) duration = n | (repeat)
    |--------------------------|------------------------------|
    | (rt_bw timer is running) | (rt_bw timer is not running) |

    No time for fair tasks at all.

    While this can happen during the first period, if rq is always backlogged,
    RT tasks won't have the opportunity to execute anymore: rt_time reached
    rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
    throttled since rt timer didn't fire, replenishment is from now on eaten up
    by DL tasks that accrue their execution on rt_time (while rt timer is
    active - we have an RT task waiting for replenishment). FAIR tasks are
    not touched after this first period. Ok, this is not ideal, and the situation
    is even worse!

    The above (the nice case) practically never happens in reality, where
    your rt timer is not aligned to task periods, tasks are in general not
    periodic, etc. Long story short, you always risk overloading your system.

    This patch is based on Peter's idea, but exploits an additional fact:
    if you don't have RT tasks enqueued, it makes little sense to continue
    incrementing rt_time once you reached the upper limit (DL tasks have their
    own mechanism for throttling).

    This cures both problems:

    - no matter how many DL instances in the past, you'll have an rt_time
    slightly above rt_runtime when an RT task is enqueued, and from that
    point on (after the first replenishment), the task will normally execute;

    - you can still eat up all bandwidth during the first period, but not
    anymore after that, remember that DL execution will increment rt_time
    till the upper limit is reached.

    The situation is still not perfect! But, we have a simple solution for now,
    that limits how much you can jeopardize your system, as we keep working
    towards the right answer: RT groups scheduled using deadline servers.
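
    The resulting guard in update_curr_dl() looks roughly like this
    (sched_rt_bandwidth_account() being the helper that encodes "timer
    active or RT tasks queued"):

        if (rt_bandwidth_enabled()) {
            struct rt_rq *rt_rq = &rq->rt;

            raw_spin_lock(&rt_rq->rt_runtime_lock);
            /* don't let DL execution grow rt_time to infinity when
             * nothing will ever replenish it */
            if (sched_rt_bandwidth_account(rt_rq))
                rt_rq->rt_time += delta_exec;
            raw_spin_unlock(&rt_rq->rt_runtime_lock);
        }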

    Reported-by: Kirill Tkhai
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Commit 82b9580 ("sched/deadline: Test for CPU's presence explicitly")
    changed how we check whether a CPU returned by the cpudeadline machinery
    is valid. But we don't want to call cpu_present() if best_cpu is
    equal to -1. So, switch the order of the tests inside WARN_ON().
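
    I.e., the short-circuit order now protects the cpu_present() call:

        /* before: WARN_ON(!cpu_present(best_cpu) && best_cpu != -1); */
        WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));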

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: boris.ostrovsky@oracle.com
    Cc: konrad.wilk@oracle.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1393238832-9100-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deadline class we do not have group scheduling.

    So, let's remove the unnecessary

        X = X;

    assignments.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dequeue_entity() is called when p->on_rq is set and it sets
    se->on_rq = 0, which appears to guarantee that the !se->on_rq condition
    is met. If the task has done set_current_state(TASK_INTERRUPTIBLE)
    without schedule(), the second condition will be met and vruntime will
    be incorrectly adjusted twice.

    In certain cases this can result in the task's vruntime never increasing
    past the vruntime of other tasks on the CFS' run queue, starving them of
    CPU time.

    This patch changes switched_from_fair() to use !p->on_rq instead of
    !se->on_rq.

    I'm able to cause a task with a priority of 120 to starve all other
    tasks with the same priority on an ARM platform running 3.2.51-rt72
    PREEMPT RT by writing one character at time to a serial tty (16550 UART)
    in a tight loop. I'm also able to verify making this change corrects the
    problem on that platform and kernel version.
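
    The change itself, in sketch form (body abbreviated):

        static void switched_from_fair(struct rq *rq, struct task_struct *p)
        {
            struct sched_entity *se = &p->se;
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /*
             * Use !p->on_rq: dequeue_entity() has already cleared
             * se->on_rq, so the old !se->on_rq test also matched tasks
             * still on the rq and adjusted vruntime twice.
             */
            if (!p->on_rq && p->state != TASK_RUNNING) {
                place_entity(cfs_rq, se, 0);
                se->vruntime -= cfs_rq->min_vruntime;
            }
        }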

    Signed-off-by: George McCollister
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com
    Signed-off-by: Ingo Molnar

    George McCollister
     

25 Feb, 2014

1 commit

  • The name __smp_call_function_single() doesn't tell much about the
    properties of this function, especially when compared to
    smp_call_function_single().

    The comments above the implementation are also misleading. The main
    point of this function is actually not to be able to embed the csd
    in an object. This is actually a requirement that results from the
    purpose of this function, which is to raise an IPI asynchronously.

    As such it can be called with interrupts disabled. And this feature
    comes at the cost of the caller, who then needs to serialize the
    IPIs on this csd.

    Let's rename the function and enhance the comments so that they reflect
    these properties.
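
    A usage sketch of the renamed API (hypothetical callback; note that
    the caller owns serializing reuse of the csd):

        #include <linux/smp.h>

        static void my_ipi_handler(void *info)
        {
            /* runs on the target CPU in hardirq context */
        }

        static struct call_single_data my_csd = {
            .func = my_ipi_handler,
        };

        static void kick_cpu(int cpu)
        {
            /* asynchronous: returns immediately and may be called with
             * interrupts disabled; my_csd must not be reused until the
             * previous IPI has run */
            smp_call_function_single_async(cpu, &my_csd);
        }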

    Suggested-by: Christoph Hellwig
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker