08 Apr, 2014

3 commits

  • Merge second patch-bomb from Andrew Morton:
    - the rest of MM
    - zram updates
    - zswap updates
    - exit
    - procfs
    - exec
    - wait
    - crash dump
    - lib/idr
    - rapidio
    - adfs, affs, bfs, ufs
    - cris
    - Kconfig things
    - initramfs
    - small amount of IPC material
    - percpu enhancements
    - early ioremap support
    - various other misc things

    * emailed patches from Andrew Morton: (156 commits)
    MAINTAINERS: update Intel C600 SAS driver maintainers
    fs/ufs: remove unused ufs_super_block_third pointer
    fs/ufs: remove unused ufs_super_block_second pointer
    fs/ufs: remove unused ufs_super_block_first pointer
    fs/ufs/super.c: add __init to init_inodecache()
    doc/kernel-parameters.txt: add early_ioremap_debug
    arm64: add early_ioremap support
    arm64: initialize pgprot info earlier in boot
    x86: use generic early_ioremap
    mm: create generic early_ioremap() support
    x86/mm: sparse warning fix for early_memremap
    lglock: map to spinlock when !CONFIG_SMP
    percpu: add preemption checks to __this_cpu ops
    vmstat: use raw_cpu_ops to avoid false positives on preemption checks
    slub: use raw_cpu_inc for incrementing statistics
    net: replace __this_cpu_inc in route.c with raw_cpu_inc
    modules: use raw_cpu_write for initialization of per cpu refcount.
    mm: use raw_cpu ops for determining current NUMA node
    percpu: add raw_cpu_ops
    slub: fix leak of 'name' in sysfs_slab_add
    ...

    Linus Torvalds
     
  • To increase compiler portability there is linux/compiler.h, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes
    with the right macro in the kernel subsystem.

    Signed-off-by: Gideon Israel Dsouza
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
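
    A minimal illustration of the macro mapping this change applies; the
    function names here are made up, but __weak really is defined as
    __attribute__((weak)) by the compiler.h headers:

    #include <linux/compiler.h>

    /* Before: open-coded gcc attribute. */
    void legacy_hook(void) __attribute__((weak));

    /* After: the portable convenience macro from compiler.h. */
    void __weak default_hook(void)
    {
            /* Weak default; an architecture can supply a strong override. */
    }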
     
  • This is the final piece in the puzzle, as all patches to remove the
    last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
    into the 3.15 merge window. The work was long overdue, and this
    interface in particular should not have survived the BKL removal
    that was done a couple of years ago.

    Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3:

    "[...] it was suggested that the janitors look for and fix all code
    that calls sleep_on() [...] since (1) almost all such code is
    incorrect, and (2) Linus has agreed that those functions should
    be removed in the 2.5 development series".

    We haven't quite made it for 2.5, but maybe we can merge this for 3.15.

    Signed-off-by: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
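
    A hedged sketch of the conversion pattern this series applied to the
    remaining callers (the wait queue, flag and function here are invented
    for illustration):

    static DECLARE_WAIT_QUEUE_HEAD(foo_wq);
    static int foo_ready;

    static int foo_wait(void)
    {
            /* Old, racy pattern being removed:
             *         interruptible_sleep_on(&foo_wq);
             * A wakeup arriving between the caller's own condition check
             * and the call was simply lost.
             */

            /* Replacement: the condition is re-checked under the wait
             * queue machinery, so the wakeup cannot be missed. */
            return wait_event_interruptible(foo_wq, foo_ready);
    }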
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included into a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

02 Apr, 2014

3 commits

  • Pull core block layer updates from Jens Axboe:
    "This is the pull request for the core block IO bits for the 3.15
    kernel. It's a smaller round this time, it contains:

    - Various little blk-mq fixes and additions from Christoph and
    myself.

    - Cleanup of the IPI usage from the block layer, and associated
    helper code. From Frederic Weisbecker and Jan Kara.

    - Duplicate code cleanup in bio-integrity from Gu Zheng. This will
    give you a merge conflict, but that should be easy to resolve.

    - blk-mq notify spinlock fix for RT from Mike Galbraith.

    - A blktrace partial accounting bug fix from Roman Pen.

    - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"

    * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
    blk-mq: add REQ_SYNC early
    rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
    blk-mq: support partial I/O completions
    blk-mq: merge blk_mq_insert_request and blk_mq_run_request
    blk-mq: remove blk_mq_alloc_rq
    blk-mq: don't dump CPU -> hw queue map on driver load
    blk-mq: fix wrong usage of hctx->state vs hctx->flags
    blk-mq: allow blk_mq_init_commands() to return failure
    block: remove old blk_iopoll_enabled variable
    blktrace: fix accounting of partially completed requests
    smp: Rename __smp_call_function_single() to smp_call_function_single_async()
    smp: Remove wait argument from __smp_call_function_single()
    watchdog: Simplify a little the IPI call
    smp: Move __smp_call_function_single() below its safe version
    smp: Consolidate the various smp_call_function_single() declensions
    smp: Teach __smp_call_function_single() to check for offline cpus
    smp: Remove unused list_head from csd
    smp: Iterate functions through llist_for_each_entry_safe()
    block: Stop abusing rq->csd.list in blk-softirq
    block: Remove useless IPI struct initialization
    ...

    Linus Torvalds
     
  • Pull timer changes from Thomas Gleixner:
    "This assorted collection provides:

    - A new timer based timer broadcast feature for systems which do not
    provide a globally accessible timer device. That allows those
    systems to put CPUs into deep idle states where the per cpu timer
    device stops.

    - A few NOHZ_FULL related improvements to the timer wheel

    - The usual updates to timer devices found in ARM SoCs

    - Small improvements and updates all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    tick: Remove code duplication in tick_handle_periodic()
    tick: Fix spelling mistake in tick_handle_periodic()
    x86: hpet: Use proper destructor for delayed work
    workqueue: Provide destroy_delayed_work_on_stack()
    clocksource: CMT, MTU2, TMU and STI should depend on GENERIC_CLOCKEVENTS
    timer: Remove code redundancy while calling get_nohz_timer_target()
    hrtimer: Rearrange comments in the order struct members are declared
    timer: Use variable head instead of &work_list in __run_timers()
    clocksource: exynos_mct: silence a static checker warning
    arm: zynq: Add support for cpufreq
    arm: zynq: Don't use arm_global_timer with cpufreq
    clocksource/cadence_ttc: Overhaul clocksource frequency adjustment
    clocksource/cadence_ttc: Call clockevents_update_freq() with IRQs enabled
    clocksource: Add Kconfig entries for CMT, MTU2, TMU and STI
    sh: Remove Kconfig entries for TMU, CMT and MTU2
    ARM: shmobile: Remove CMT, TMU and STI Kconfig entries
    clocksource: armada-370-xp: Use atomic access for shared registers
    clocksource: orion: Use atomic access for shared registers
    clocksource: timer-keystone: Delete unnecessary variable
    clocksource: timer-keystone: introduce clocksource driver for Keystone
    ...

    Linus Torvalds
     
  • Pull timer updates from Ingo Molnar:
    "The main purpose is to fix a full dynticks bug related to
    virtualization, where steal time accounting appears to be zero in
    /proc/stat even after a few seconds of competing guests running busy
    loops on the same host CPU. It's not a regression though as it was
    there since the beginning.

    The other commits are preparatory work to fix the bug and various
    cleanups"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arch: Remove stub cputime.h headers
    sched: Remove needless round trip nsecs tick conversion of steal time
    cputime: Fix jiffies based cputime assumption on steal accounting
    cputime: Bring cputime -> nsecs conversion
    cputime: Default implementation of nsecs -> cputime conversion
    cputime: Fix nsecs_to_cputime() return type cast

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "There are two memory management related changes, the CMMA support for
    KVM to avoid swap-in of freed pages and the split page table lock for
    the PMD level. These two come with common code changes in mm/.

    A fix for the long standing theoretical TLB flush problem, this one
    comes with a common code change in kernel/sched/.

    Another set of changes is Heiko's uaccess work; included is the initial
    set of patches with more to come.

    And fixes and cleanups as usual"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (36 commits)
    s390/con3270: optionally disable auto update
    s390/mm: remove unecessary parameter from pgste_ipte_notify
    s390/mm: remove unnecessary parameter from gmap_do_ipte_notify
    s390/mm: fixing comment so that parameter name match
    s390/smp: limit number of cpus in possible cpu mask
    hypfs: Add clarification for "weight_min" attribute
    s390: update defconfigs
    s390/ptrace: add support for PTRACE_SINGLEBLOCK
    s390/perf: make print_debug_cf() static
    s390/topology: Remove call to update_cpu_masks()
    s390/compat: remove compat exec domain
    s390: select CONFIG_TTY for use of tty in unconditional keyboard driver
    s390/appldata_os: fix cpu array size calculation
    s390/checksum: remove memset() within csum_partial_copy_from_user()
    s390/uaccess: remove copy_from_user_real()
    s390/sclp_early: Return correct HSA block count also for zero
    s390: add some drivers/subsystems to the MAINTAINERS file
    s390: improve debug feature usage
    s390/airq: add support for irq ranges
    s390/mm: enable split page table lock for PMD level
    ...

    Linus Torvalds
     

20 Mar, 2014

1 commit

  • There are only two users of get_nohz_timer_target(): timer and hrtimer. Both
    call it under the same circumstances, i.e.

    #ifdef CONFIG_NO_HZ_COMMON
            if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
                    return get_nohz_timer_target();
    #endif

    So it makes more sense to do all of this inside get_nohz_timer_target()
    instead of duplicating the code in two places. For this, another parameter
    is required to be passed to this routine: pinned.

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: fweisbec@gmail.com
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
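
    A sketch of what the consolidated helper looks like once the pinned
    parameter is pushed into it (paraphrased from kernel/sched/core.c of
    that era, not a verbatim quote):

    int get_nohz_timer_target(int pinned)
    {
            int cpu = smp_processor_id();
            int i;
            struct sched_domain *sd;

            /* The old open-coded check at both call sites lives here now. */
            if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
                    return cpu;

            rcu_read_lock();
            for_each_domain(cpu, sd) {
                    for_each_cpu(i, sched_domain_span(sd)) {
                            if (!idle_cpu(i)) {
                                    cpu = i;
                                    goto unlock;
                            }
                    }
            }
    unlock:
            rcu_read_unlock();
            return cpu;
    }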
     

13 Mar, 2014

1 commit

  • When update_rq_clock_task() accounts the pending steal time for a task,
    it converts the steal delta from nsecs to tick then from tick to nsecs.

    There is no apparent good reason for doing that though because both
    the task clock and the prev steal delta are u64 and store values
    in nsecs.

    So let's remove the needless conversion.

    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Acked-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

12 Mar, 2014

1 commit

  • I decided to run my tests on linux-next, and my wakeup_rt tracer was
    broken. After running a bisect, I found that the problem commit was:

    linux-next commit c365c292d059
    "sched: Consider pi boosting in setscheduler()"

    And the reason the wakeup_rt tracer test was failing was because it had
    no RT task to trace. I first noticed this when running with
    sched_switch event and saw that my RT task still had normal SCHED_OTHER
    priority. Looking at the problem commit, I found:

    - p->normal_prio = normal_prio(p);
    - p->prio = rt_mutex_getprio(p);

    With no

    + p->normal_prio = normal_prio(p);
    + p->prio = rt_mutex_getprio(p);

    Reading what the commit is supposed to do, I realize that the p->prio
    can't be set if the task is boosted with a higher prio, but the
    p->normal_prio still needs to be set regardless, otherwise, when the
    task is deboosted, it won't get the new priority.

    The p->prio has to be set before "check_class_changed()" is called,
    otherwise the class won't be changed.

    Also added fix to newprio to include a check for deadline policy that
    was missing. This change was suggested by Juri Lelli.

    Signed-off-by: Steven Rostedt
    Cc: Sebastian Andrzej Siewior
    Cc: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

11 Mar, 2014

3 commits

  • Bad idea on -rt:

    [ 908.026136] [] rt_spin_lock_slowlock+0xaa/0x2c0
    [ 908.026145] [] task_numa_free+0x31/0x130
    [ 908.026151] [] finish_task_switch+0xce/0x100
    [ 908.026156] [] thread_return+0x48/0x4ae
    [ 908.026160] [] schedule+0x25/0xa0
    [ 908.026163] [] rt_spin_lock_slowlock+0xd5/0x2c0
    [ 908.026170] [] get_signal_to_deliver+0xaf/0x680
    [ 908.026175] [] do_signal+0x3d/0x5b0
    [ 908.026179] [] do_notify_resume+0x90/0xe0
    [ 908.026186] [] int_signal+0x12/0x17
    [ 908.026193] [] 0x7ff2a388b1cf

    and since upstream does not mind where we do this, be a bit nicer ...

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Pick up fixes before queueing up new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Deny the use of SCHED_DEADLINE policy to unprivileged users.
    Even if root users can set the policy for normal users, we
    don't want the latter to be able to change their parameters
    (safest behavior).

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
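
    A hedged sketch of the kind of check this adds to the unprivileged
    path of __sched_setscheduler() (not a verbatim quote of the patch):

    if (user && !capable(CAP_SYS_NICE)) {
            /* SCHED_DEADLINE is root-only for now. */
            if (dl_policy(policy))
                    return -EPERM;

            /* existing rt priority / nice checks follow ... */
    }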
     

27 Feb, 2014

1 commit

  • Michael spotted that the idle_balance() push down created a task
    priority problem.

    Previously, when we called idle_balance() before pick_next_task() it
    wasn't a problem when -- because of the rq->lock droppage -- an rt/dl
    task slipped in.

    Similarly for pre_schedule(), rt pre-schedule could have a dl task
    slip in.

    But by pulling it into the pick_next_task() loop, we'll not try a
    higher task priority again.

    Cure this by creating a re-start condition in pick_next_task(); and
    triggering this from pick_next_task_{rt,fair}().

    It also fixes a live-lock where we get stuck in pick_next_task_fair()
    due to idle_balance() seeing !0 nr_running but there not actually
    being any fair tasks about.

    Reported-by: Michael Wang
    Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()")
    Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
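
    Roughly, the re-start condition described above gives pick_next_task()
    this shape (a sketch, with RETRY_TASK being the sentinel a class
    returns after its balancing dropped rq->lock):

    again:
            for_each_class(class) {
                    p = class->pick_next_task(rq, prev);
                    if (p) {
                            if (unlikely(p == RETRY_TASK))
                                    goto again;
                            return p;
                    }
            }

            BUG(); /* the idle class should always have a runnable task */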
     

25 Feb, 2014

2 commits

  • The name __smp_call_function_single() doesn't tell much about the
    properties of this function, especially when compared to
    smp_call_function_single().

    The comments above the implementation are also misleading. The main
    point of this function is actually not to be able to embed the csd
    in an object. This is actually a requirement that results from the
    purpose of this function, which is to raise an IPI asynchronously.

    As such it can be called with interrupts disabled. And this feature
    comes at the cost of the caller who then needs to serialize the
    IPIs on this csd.

    Let's rename the function and enhance the comments so that they reflect
    these properties.

    Suggested-by: Christoph Hellwig
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker
     
  • The main point of calling __smp_call_function_single() is to send
    an IPI in a pure asynchronous way. By embedding a csd in an object,
    a caller can send the IPI without waiting for a previous one to complete
    as is required by smp_call_function_single() for example. As such,
    sending this kind of IPI can be safe even when irqs are disabled.

    This flexibility comes at the expense of the caller who then needs to
    synchronize the csd lifecycle by himself and make sure that IPIs on a
    single csd are serialized.

    This is how __smp_call_function_single() works when wait = 0 and this
    use case is relevant.

    Now there don't seem to be any use cases with wait = 1 that can't be
    covered by smp_call_function_single() instead, which is safer. Let's look
    at the two possible scenarios:

    1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
    in an object. It looks like a nice and convenient pattern at first
    sight because we can then retrieve the object from the IPI handler easily.

    But actually it is a waste of memory space in the object since the csd
    can be allocated from the stack by smp_call_function_single(wait = 1)
    and the object can be passed as the IPI argument.

    Besides that, embedding the csd in an object is more error prone
    because the caller must take care of the serialization of the IPIs
    for this csd.

    2) The user calls __smp_call_function_single(wait = 1) on a csd that
    is allocated on the stack. It's ok but smp_call_function_single()
    can do it as well and it already takes care of the allocation on the
    stack. Again it's simpler and less error prone.

    Therefore, using the underscore-prefixed API version with wait = 1
    is a bad pattern and a sign that the caller can do something safer and
    simpler.

    There was a single user of that which has just been converted.
    So let's remove this option to discourage further users.

    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker
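
    A hedged sketch of the asynchronous usage pattern the two entries above
    describe; the structure, handler and helper names are invented:

    struct foo {
            struct call_single_data csd;
            int payload;
    };

    static void foo_ipi(void *info)
    {
            struct foo *f = info;

            /* Runs in IPI context on the target cpu. */
            (void)f->payload;
    }

    /* May be called with irqs disabled; does not wait for completion.
     * The caller must not reuse f->csd until foo_ipi() has run. */
    static void foo_kick(struct foo *f, int cpu)
    {
            f->csd.func = foo_ipi;
            f->csd.info = f;
            smp_call_function_single_async(cpu, &f->csd);
    }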
     

23 Feb, 2014

8 commits

  • Signed-off-by: Dongsheng Yang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/bd80780f19b4f9b4a765acc353c8dbc130274dd6.1392103744.git.yangds.fnst@cn.fujitsu.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     
  • This is a leftover from commit e23ee74777f389369431d77390c4b09332ce026a
    ("sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list").

    Signed-off-by: Li Zefan
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/52F5CBF6.4060901@huawei.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • If a PI boosted task policy/priority is modified by a setscheduler()
    call we unconditionally dequeue and requeue the task if it is on the
    runqueue even if the new priority is lower than the current effective
    boosted priority. This can result in undesired reordering of the
    priority bucket list.

    If the new priority is less than or equal to the current effective
    priority, we just store the new parameters in the task struct and leave
    the scheduler class and the runqueue untouched. This is handled when the
    task deboosts itself. Only if the new priority is higher than the
    effective boosted priority do we apply the change immediately.

    Signed-off-by: Thomas Gleixner
    [ Rebase ontop of v3.14-rc1. ]
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Dario Faggioli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The following scenario does not work correctly:

    Runqueue of CPUx contains two runnable and pinned tasks:

    T1: SCHED_FIFO, prio 80
    T2: SCHED_FIFO, prio 80

    T1 is on the cpu and executes the following syscalls (classic priority
    ceiling scenario):

    sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
    ...
    sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
    ...

    Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
    to sleep the scheduler picks T2. Surprise!

    The same happens w/o actual preemption when T1 is forced into the
    scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
    pick_next_task() which returns T2. So T1 gets preempted and scheduled
    out.

    This happens because sched_setscheduler() dequeues T1 from the prio 90
    list and then enqueues it on the tail of the prio 80 list behind T2.
    This violates the POSIX spec and surprises user space which relies on
    the guarantee that SCHED_FIFO tasks are not scheduled out unless they
    give the CPU up voluntarily or are preempted by a higher priority
    task. In the latter case the preempted task must get back on the CPU
    after the preempting task schedules out again.

    We fixed a similar issue already in commit 60db48c (sched: Queue a
    deboosted task to the head of the RT prio queue). The same treatment
    is necessary for sched_setscheduler(). So enqueue to head of the prio
    bucket list if the priority of the task is lowered.

    It might be possible that existing user space relies on the current
    behaviour, but it can be considered highly unlikely due to the corner
    case nature of the application scenario.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-6-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
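
    A sketch of the queueing decision described above, built on the existing
    ENQUEUE_HEAD flag (paraphrased, not a verbatim quote of the patch):

    /* In __sched_setscheduler(), after the new priority has been set:
     * a numerically larger (or equal) prio means the task was not raised,
     * so requeue it at the head of its bucket; otherwise queue at the tail. */
    if (on_rq)
            enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);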
     
    If the policy and priority remain unchanged, a possible modification of
    p->sched_reset_on_fork gets lost in the early exit path.

    Signed-off-by: Thomas Gleixner
    [ Rebase ontop of v3.14-rc1. ]
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-5-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • might_sleep() can tell us where interrupts have been disabled, but we
    have no idea what disabled preemption. Add some debug infrastructure.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-4-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Idle is not allowed to call sleeping functions ever!

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-3-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • We stumbled in RT over a SMP bringup issue on ARM where the
    idle->on_rq == 0 was causing try_to_wake_up() on the other cpu to run
    into nada land.

    After adding that idle->on_rq = 1; I was able to find the root cause
    of the lockup: the idle task on the newly woken up cpu was fiddling
    with a sleeping spinlock, which is a nono.

    I kept the init of idle->on_rq to keep the state consistent and to
    avoid another long lasting debug session.

    As a side note, the whole debug mess could have been avoided if
    might_sleep() would have yelled when called from the idle task. That's
    fixed with patch 2/6 - and that one actually has a changelog :)

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-2-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

22 Feb, 2014

6 commits

  • Dan Carpenter reported:

    > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338)
    > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005)

    Kirill also spotted that migrate_tasks() will have an instant NULL
    deref because pick_next_task() will immediately deref prev.

    Instead of fixing all the corner cases because migrate_tasks() can
    pass in a NULL prev task in the unlikely case of hot-un-plug, provide
    a fake task such that we can remove all the NULL checks from the far
    more common paths.

    A further problem; not previously spotted; is that because we pushed
    pre_schedule() and idle_balance() into pick_next_task() we now need to
    avoid those getting called and pulling more tasks on our dying CPU.

    We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1.
    We also note that since we call pick_next_task() exactly the number of
    times we have runnable tasks present, we should never land in
    idle_balance().

    Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()")
    Cc: Juri Lelli
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Reported-by: Kirill Tkhai
    Reported-by: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
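
    The fake task idea sketched out (abridged; field selection and naming
    follow my reading of the change, not a verbatim quote):

    static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
    {
    }

    static const struct sched_class fake_sched_class = {
            .put_prev_task = put_prev_task_fake,
    };

    static struct task_struct fake_task = {
            /* MAX_PRIO + 1 keeps the rt/dl pull logic from triggering. */
            .prio           = MAX_PRIO + 1,
            .sched_class    = &fake_sched_class,
    };

    /* migrate_tasks() then passes &fake_task as prev, so pick_next_task()
     * and the class methods never see a NULL prev. */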
     
  • Because of a recent syscall design debate, it's deemed appropriate for
    each syscall to have a flags argument for future extension, without
    immediately requiring new syscalls.

    Cc: juri.lelli@gmail.com
    Cc: Ingo Molnar
    Suggested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • We're copying the on-stack structure to userspace, but forgot to give
    the right number of bytes to copy. This allows the calling process to
    obtain up to PAGE_SIZE bytes from the stack (and possibly adjacent
    kernel memory).

    This fix copies only as much as we actually have on the stack
    (attr->size defaults to the size of the struct) and leaves the rest of
    the userspace-provided buffer untouched.

    Found using kmemcheck + trinity.

    Fixes: d50dde5a10f30 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
    Cc: Dario Faggioli
    Cc: Juri Lelli
    Cc: Ingo Molnar
    Signed-off-by: Vegard Nossum
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392585857-10725-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Thomas Gleixner

    Vegard Nossum
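
    The shape of the fix, as I read it (the diff is illustrative, not a
    verbatim quote): bound the copy by what the kernel actually filled in
    on the stack rather than by the caller-supplied size.

    -	ret = copy_to_user(uattr, attr, usize);
    +	ret = copy_to_user(uattr, attr, attr->size);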
     
  • Fix this lockdep warning:

    [ 44.804600] =========================================================
    [ 44.805746] [ INFO: possible irq lock inversion dependency detected ]
    [ 44.805746] 3.14.0-rc2-test+ #14 Not tainted
    [ 44.805746] ---------------------------------------------------------
    [ 44.805746] bash/3674 just changed the state of lock:
    [ 44.805746] (&dl_b->lock){+.....}, at: [] sched_rt_handler+0x132/0x248
    [ 44.805746] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [ 44.805746] (&rq->lock){-.-.-.}

    and interrupts could create inverse lock ordering between them.

    [ 44.805746]
    [ 44.805746] other info that might help us debug this:
    [ 44.805746] Possible interrupt unsafe locking scenario:
    [ 44.805746]
    [ 44.805746]        CPU0                    CPU1
    [ 44.805746]        ----                    ----
    [ 44.805746]   lock(&dl_b->lock);
    [ 44.805746]                                local_irq_disable();
    [ 44.805746]                                lock(&rq->lock);
    [ 44.805746]                                lock(&dl_b->lock);
    [ 44.805746]   <Interrupt>
    [ 44.805746]     lock(&rq->lock);

    by making dl_b->lock acquisition always IRQ safe.

    Cc: Ingo Molnar
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392107067-19907-3-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
     
  • Don't compare sysctl_sched_rt_runtime against sysctl_sched_rt_period if
    the former is equal to RUNTIME_INF, otherwise disabling -rt bandwidth
    management (with CONFIG_RT_GROUP_SCHED=n) fails.

    Cc: Ingo Molnar
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392107067-19907-2-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
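
    A sketch of the corrected validation (paraphrased; RUNTIME_INF means
    "no limit" and so must bypass the range check):

    if (sysctl_sched_rt_period <= 0)
            return -EINVAL;

    if (sysctl_sched_rt_runtime != RUNTIME_INF &&
        sysctl_sched_rt_runtime > sysctl_sched_rt_period)
            return -EINVAL;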
     
  • While debugging the crash with the bad nr_running accounting, I hit
    another bug where, after running my sched deadline test, I was getting
    failures to take a CPU offline. It was giving me a -EBUSY error.

    Adding a bunch of trace_printk()s around, I found that the cpu
    notifier that called sched_cpu_inactive() was returning a failure. The
    overflow value was coming up negative?

    Talking this over with Juri, the problem is that the total_bw update was
    supposed to be made by dl_overflow() which, during my tests, seemed to
    not be called. Adding more trace_printk()s, it wasn't that it wasn't
    called, but it exited out right away with the check of new_bw being
    equal to p->dl.dl_bw. The new_bw calculates the ratio between period and
    runtime. The bug is that if you set a deadline, you do not need to set
    a period if you plan on the period being equal to the deadline. That
    is, if period is zero and deadline is not, then the system call should
    set the period to be equal to the deadline. This is done elsewhere in
    the code.

    The fix is easy, check if period is set, and if it is not, then use the
    deadline.

    Cc: Juri Lelli
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140219135335.7e74abd4@gandalf.local.home
    Signed-off-by: Thomas Gleixner

    Steven Rostedt
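
    The fix boils down to defaulting the period to the deadline before the
    bandwidth ratio is computed in dl_overflow(); a hedged sketch:

    u64 period = attr->sched_period ?: attr->sched_deadline;
    u64 runtime = attr->sched_runtime;
    u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;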
     

21 Feb, 2014

1 commit

  • The finish_arch_post_lock_switch is called at the end of the task
    switch after all locks have been released. In concept it is paired
    with the switch_mm function, but the current code only does the
    call in finish_task_switch. Add the call to idle_task_exit and
    use_mm. One use case for the additional calls is s390 which will
    use finish_arch_post_lock_switch to wait for the completion of
    TLB flush operations.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

13 Feb, 2014

1 commit

  • If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
    css. The intention of the interface is to make it easy to skip css's
    (cgroup_subsys_states) which already match the migration target;
    however, this is entirely unnecessary as migration taskset doesn't
    include tasks which are already in the target cgroup. Drop @skip_css
    from cgroup_taskset_for_each().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann

    Tejun Heo
     

11 Feb, 2014

2 commits

  • Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost.
    It's useful to know these values in debug mode.

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/52E0F3BF.5020904@linaro.org
    Signed-off-by: Ingo Molnar

    Alex Shi
     
  • This patch merges idle_balance() and pre_schedule() and pushes
    both of them into pick_next_task().

    Conceptually pre_schedule() and idle_balance() are rather similar,
    both are used to pull more work onto the current CPU.

    We cannot however first move idle_balance() into pre_schedule_fair()
    since there is no guarantee the last runnable task is a fair task, and
    thus we would miss newidle balances.

    Similarly, the dl and rt pre_schedule calls must be run before
    idle_balance() since their respective tasks have higher priority and
    it would not do to delay their execution searching for less important
    tasks first.

    However, by noticing that pick_next_task() already traverses the
    sched_class hierarchy in the right order, we can get the right
    behaviour and do away with both calls.

    We must however change the special case optimization to also require
    that prev is of sched_class_fair, otherwise we can miss doing a dl or
    rt pull where we needed one.

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Feb, 2014

3 commits

  • In order to avoid having to do put/set on a whole cgroup hierarchy
    when we context switch, push the put into pick_next_task() so that
    both operations are in the same function. Further changes then allow
    us to possibly optimize away redundant work.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • idle_balance() modifies the rq->idle_stamp field, making this information
    shared across core.c and fair.c.

    As we know if the cpu is going to idle or not with the previous patch, let's
    encapsulate the rq->idle_stamp information in core.c by moving it up to the
    caller.

    The idle_balance() function returns true in case a balancing occurred and the
    cpu won't be idle, false if no balance happened and the cpu is going idle.

    Signed-off-by: Daniel Lezcano
    Cc: alex.shi@linaro.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     
  • The cpu parameter passed to idle_balance() is not needed as it could
    be retrieved from 'struct rq'.

    Signed-off-by: Daniel Lezcano
    Cc: alex.shi@linaro.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1389949444-14821-1-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     

09 Feb, 2014

1 commit

  • As the patch "sched: Move the priority specific bits into a new header file" exposes
    the priority related macros in linux/sched/prio.h, we don't have to implement
    task_nice() in kernel/sched/core.c any more.

    This patch implements it in linux/sched.h as a static inline function,
    saving the kernel stack and enhancing performance a bit.

    Signed-off-by: Dongsheng Yang
    Cc: clark.williams@gmail.com
    Cc: rostedt@goodmis.org
    Cc: raistlin@linux.it
    Cc: juri.lelli@gmail.com
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
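
    The resulting helper, roughly as it lands in linux/sched.h:

    static inline int task_nice(const struct task_struct *p)
    {
            return PRIO_TO_NICE((p)->static_prio);
    }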
     

08 Feb, 2014

1 commit

  • cgroup_subsys is a bit messier than it needs to be.

    * The name of a subsys can be different from its internal identifier
    defined in cgroup_subsys.h. Most subsystems use the matching name
    but three - cpu, memory and perf_event - use different ones.

    * cgroup_subsys_id enums are postfixed with _subsys_id and each
    cgroup_subsys is postfixed with _subsys. cgroup.h is widely
    included throughout various subsystems; it doesn't and shouldn't
    have a claim on such generic names which don't have any qualifier
    indicating that they belong to cgroup.

    * cgroup_subsys->subsys_id should always equal the matching
    cgroup_subsys_id enum; however, we require each controller to
    initialize it and then BUG if they don't match, which is a bit
    silly.

    This patch cleans up cgroup_subsys names and initialization by doing
    the following:

    * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
    cgroup_subsys with _cgrp_subsys.

    * With the above, renaming subsys identifiers to match the userland
    visible names doesn't cause any naming conflicts. All non-matching
    identifiers are renamed to match the official names.

    cpu_cgroup -> cpu
    mem_cgroup -> memory
    perf -> perf_event

    * controllers no longer need to initialize ->subsys_id and ->name.
    They're generated in cgroup core and set automatically during boot.

    * Redundant cgroup_subsys declarations removed.

    * While updating BUG_ON()s in cgroup_init_early(), convert them to
    WARN()s. BUGging that early during boot is stupid - the kernel
    can't print anything, even through serial console and the trap
    handler doesn't even link stack frame properly for back-tracing.

    This patch doesn't introduce any behavior changes.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Acked-by: Aristeu Rozanski
    Acked-by: Ingo Molnar
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Serge E. Hallyn
    Cc: Vivek Goyal
    Cc: Thomas Graf

    Tejun Heo