02 Apr, 2014

1 commit

  • Pull timer changes from Thomas Gleixner:
    "This assorted collection provides:

    - A new timer based timer broadcast feature for systems which do not
    provide a globally accessible timer device. That allows those
    systems to put CPUs into deep idle states where the per cpu timer
    device stops.

    - A few NOHZ_FULL related improvements to the timer wheel

    - The usual updates to timer devices found in ARM SoCs

    - Small improvements and updates all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    tick: Remove code duplication in tick_handle_periodic()
    tick: Fix spelling mistake in tick_handle_periodic()
    x86: hpet: Use proper destructor for delayed work
    workqueue: Provide destroy_delayed_work_on_stack()
    clocksource: CMT, MTU2, TMU and STI should depend on GENERIC_CLOCKEVENTS
    timer: Remove code redundancy while calling get_nohz_timer_target()
    hrtimer: Rearrange comments in the order struct members are declared
    timer: Use variable head instead of &work_list in __run_timers()
    clocksource: exynos_mct: silence a static checker warning
    arm: zynq: Add support for cpufreq
    arm: zynq: Don't use arm_global_timer with cpufreq
    clocksource/cadence_ttc: Overhaul clocksource frequency adjustment
    clocksource/cadence_ttc: Call clockevents_update_freq() with IRQs enabled
    clocksource: Add Kconfig entries for CMT, MTU2, TMU and STI
    sh: Remove Kconfig entries for TMU, CMT and MTU2
    ARM: shmobile: Remove CMT, TMU and STI Kconfig entries
    clocksource: armada-370-xp: Use atomic access for shared registers
    clocksource: orion: Use atomic access for shared registers
    clocksource: timer-keystone: Delete unnecessary variable
    clocksource: timer-keystone: introduce clocksource driver for Keystone
    ...

    Linus Torvalds
     

20 Mar, 2014

2 commits

  • There are only two users of get_nohz_timer_target(): timer and hrtimer. Both
    call it under the same circumstances, i.e.

    #ifdef CONFIG_NO_HZ_COMMON
            if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
                    return get_nohz_timer_target();
    #endif

    So it makes more sense to do all of this inside get_nohz_timer_target()
    instead of duplicating the code in two places. This requires passing
    another parameter, pinned, to this routine.
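
    A minimal sketch of the refactoring described above, as a user-space
    model rather than the kernel code. The struct cpu_state_sketch type, its
    fields and nohz_timer_target_sketch() are hypothetical names standing in
    for get_sysctl_timer_migration(), idle_cpu() and the real helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel state consulted by the helper. */
struct cpu_state_sketch {
    bool timer_migration;  /* models get_sysctl_timer_migration() */
    bool this_cpu_idle;    /* models idle_cpu(this_cpu) */
    int this_cpu;          /* CPU the caller is running on */
    int busy_cpu;          /* some non-idle CPU that could take the timer */
};

/*
 * After the change, the "should we migrate at all?" test lives inside the
 * helper, so both callers only have to pass `pinned`.
 */
static int nohz_timer_target_sketch(const struct cpu_state_sketch *s,
                                    bool pinned)
{
    assert(s != NULL);

    if (pinned || !s->timer_migration || !s->this_cpu_idle)
        return s->this_cpu;  /* keep the timer on the local CPU */

    return s->busy_cpu;      /* hand the timer to a busy CPU */
}
```

    With this shape, the timer and hrtimer call sites would shrink to a
    single call each, and the #ifdef could move next to the helper.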

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: fweisbec@gmail.com
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     
  • We already have a variable 'head' that points to '&work_list', and so
    we should use that instead wherever possible.

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Link: http://lkml.kernel.org/r/0d8645a6efc8360c4196c9797d59343abbfdcc5e.1395129136.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     

04 Mar, 2014

2 commits

  • Currently we use the two lowest bits of base for internal purposes, and
    so they both should be zero in the allocated address. The code was
    doing the right thing before this patch came in: commit c5f66e99b
    (timer: Implement TIMER_IRQSAFE)

    Tejun probably forgot to update this piece of code, which checks whether
    the lowest 'n' bits are zero, so it was not updated according to the
    new flag. Let's use TIMER_FLAG_MASK in the calculations here.
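
    The check can be sketched as follows. The flag values are assumptions
    made for this example (two flag bits, mask 0x3), and both function names
    are hypothetical; the "old" variant only illustrates a check that looks
    at a single bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag layout for this sketch: two flag bits in the pointer. */
#define TIMER_DEFERRABLE 0x1UL
#define TIMER_IRQSAFE    0x2UL
#define TIMER_FLAG_MASK  0x3UL

/* Illustrative old-style check: only bit 0 is examined. */
static bool base_addr_ok_old_sketch(uintptr_t addr)
{
    return (addr & TIMER_DEFERRABLE) == 0;
}

/* Mask-based check: every bit reserved for flags must be clear. */
static bool base_addr_ok_sketch(uintptr_t addr)
{
    return (addr & TIMER_FLAG_MASK) == 0;
}
```

    An address such as 0x1002 passes the single-bit check but would collide
    with the second flag bit, which is the case the mask-based check catches.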

    [ tglx: Massaged changelog ]

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: fweisbec@gmail.com
    Cc: tj@kernel.org
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/9144e10d7e854a0aa8a673332adec356d81a923c.1393576981.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     
  • timer_cpu_notify() should return NOTIFY_OK and nothing else. Anything else would
    trigger a BUG_ON(). The return value of this routine is already checked correctly,
    but the check is done after calling init_timer_stats(). The right order would be
    to check the error case first and then call init_timer_stats(). Let's do it.

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: fweisbec@gmail.com
    Cc: tj@kernel.org
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/c439f5b6bbc2047e1662f4d523350531425bcf9d.1393576981.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     

28 Feb, 2014

1 commit


26 Feb, 2014

5 commits

  • The internal_add_timer() function updates base->next_timer only if
    timer->expires < base->next_timer. This is correct, but it also makes
    sense to do the same if we add the first non-deferrable timer.
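
    A rough model of that rule, assuming a reduced base with only the two
    fields involved. struct base_sketch, before_sketch() and
    add_timer_sketch() are hypothetical names; before_sketch() mimics the
    wrap-safe comparison style of the kernel's time_before().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the per-CPU timer base. */
struct base_sketch {
    unsigned long next_timer;     /* earliest non-deferrable expiry */
    unsigned long active_timers;  /* number of non-deferrable timers */
};

/* Wrap-safe "a is before b" comparison on jiffies-like counters. */
static bool before_sketch(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

static void add_timer_sketch(struct base_sketch *base,
                             unsigned long expires, bool deferrable)
{
    assert(base != NULL);

    if (deferrable)
        return;  /* deferrable timers do not drive next_timer */

    /* Update on the first non-deferrable timer or on an earlier expiry. */
    if (base->active_timers++ == 0 ||
        before_sketch(expires, base->next_timer))
        base->next_timer = expires;
}
```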

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Acked-by: Peter Zijlstra
    Tested-by: Mike Galbraith

    Oleg Nesterov
     
  • The __run_timers() function currently steps through the list one jiffy at
    a time in order to update the timer wheel. However, if the timer wheel
    is empty, no adjustment is needed other than updating ->timer_jiffies.
    Therefore, just before we add a timer to an empty timer wheel, we should
    mark the timer wheel as being up to date. This marking will reduce (and
    perhaps eliminate) the jiffy-stepping that a future __run_timers() call
    will need to do in response to some future timer posting or migration.
    This commit therefore updates ->timer_jiffies for this case.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Acked-by: Peter Zijlstra
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Steven Rostedt
    Tested-by: Mike Galbraith

    Paul E. McKenney
     
  • The __run_timers() function currently steps through the list one jiffy at
    a time in order to update the timer wheel. However, if the timer wheel
    is empty, no adjustment is needed other than updating ->timer_jiffies.
    Therefore, if we just emptied the timer wheel, for example, by deleting
    the last timer, we should mark the timer wheel as being up to date.
    This marking will reduce (and perhaps eliminate) the jiffy-stepping that
    a future __run_timers() call will need to do in response to some future
    timer posting or migration. This commit therefore catches up
    ->timer_jiffies for this case.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Acked-by: Peter Zijlstra
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Steven Rostedt
    Tested-by: Mike Galbraith

    Paul E. McKenney
     
  • The __run_timers() function currently steps through the list one jiffy at
    a time in order to update the timer wheel. However, if the timer wheel
    is empty, no adjustment is needed other than updating ->timer_jiffies.
    In this case, which is likely to be common for NO_HZ_FULL kernels, the
    kernel currently incurs a large latency for no good reason. This commit
    therefore short-circuits this case.
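
    The short-circuit can be pictured with a small user-space model. This is
    a sketch under simplifying assumptions: struct wheel_sketch and
    run_wheel_sketch() are hypothetical, and the per-jiffy expiry work is
    reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced timer wheel: just the bookkeeping that matters here. */
struct wheel_sketch {
    unsigned long timer_jiffies;  /* next jiffy the wheel has to process */
    unsigned long all_timers;     /* queued timers, deferrable or not */
};

/* Brings the wheel up to `now`; returns how many per-jiffy steps it took. */
static unsigned long run_wheel_sketch(struct wheel_sketch *w,
                                      unsigned long now)
{
    unsigned long steps = 0;

    assert(w != NULL);

    if (w->all_timers == 0) {
        /* Empty wheel: nothing can expire, so jump straight to now. */
        w->timer_jiffies = now;
        return 0;
    }

    /* Otherwise step one jiffy at a time (wrap-safe comparison). */
    while ((long)(now - w->timer_jiffies) >= 0) {
        /* expiring the timers for this jiffy would happen here */
        w->timer_jiffies++;
        steps++;
    }
    return steps;
}
```

    A wheel that is 1000 jiffies behind costs about 1000 loop iterations
    when it holds a timer, and none when it is empty.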

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Acked-by: Peter Zijlstra
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Steven Rostedt
    Tested-by: Mike Galbraith

    Paul E. McKenney
     
  • Currently, the tvec_base structure's ->active_timers field tracks only
    the non-deferrable timers, which means that even if ->active_timers is
    zero, there might well be deferrable timers in the list. This commit
    therefore adds an ->all_timers field to track all the timers, whether
    deferrable or not.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Acked-by: Peter Zijlstra
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Steven Rostedt
    Tested-by: Mike Galbraith

    Paul E. McKenney
     

15 Feb, 2014

1 commit

  • When a timer is enqueued or modified on a remote target, the latter is
    expected to see and handle this timer on its next tick. However if the
    target is idle and CONFIG_NO_HZ_IDLE=y, the CPU may be sleeping tickless
    and the timer may be ignored.

    wake_up_nohz_cpu() takes care of that by setting TIF_NEED_RESCHED and
    sending an IPI to idle targets so that the tick is reevaluated on the
    idle loop through the tick_nohz_idle_*() APIs.

    Now this is all performed regardless of the power properties of the
    timer. If the timer is deferrable, idle targets don't need to be woken
    up. Only the next busy tick needs to care about it, and no IPI kick
    is needed for that to happen.

    So let's spare the IPI on idle targets when the timer is deferrable.

    Meanwhile we keep the current behaviour on full dynticks targets. We can
    spare IPIs on idle full dynticks targets as well, but some tricky races
    against idle_cpu() must be dealt with to make sure that the timer
    is handled properly after idle exit. We can deal with that later since
    NO_HZ_FULL already has more important powersaving issues.
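
    The resulting decision can be summarised in a simplified model.
    needs_ipi_sketch() is a hypothetical function that only encodes the
    rules stated above; it ignores details such as the target being the
    local CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: should enqueueing a timer kick the target CPU? */
static bool needs_ipi_sketch(bool target_idle, bool target_full_dynticks,
                             bool deferrable)
{
    if (target_full_dynticks)
        return true;   /* behaviour unchanged for full dynticks targets */

    if (!target_idle)
        return false;  /* a busy target sees the timer on its next tick */

    return !deferrable;  /* idle target: kick only for non-deferrable timers */
}
```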

    Reported-by: Thomas Gleixner
    Signed-off-by: Viresh Kumar
    Cc: Ingo Molnar
    Cc: Paul Gortmaker
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/CAKohpomMZ0TAN2e6N76_g4ZRzxd5vZ1XfuZfxrP7GMxfTNiLVw@mail.gmail.com
    Signed-off-by: Frederic Weisbecker

    Viresh Kumar
     

14 Feb, 2014

1 commit

  • Jiffies is referenced by the linker script, so it has to be visible.

    Handled both the generic and the x86 version.

    Signed-off-by: Andi Kleen
    Link: http://lkml.kernel.org/r/1391845930-28580-3-git-send-email-ak@linux.intel.com
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     

19 Nov, 2013

1 commit


25 Sep, 2013

1 commit

  • Replace the single preempt_count() 'function' that's an lvalue with
    two proper functions:

    preempt_count() - returns the preempt_count value as rvalue
    preempt_count_set() - Allows setting the preempt-count value

    Also provide preempt_count_ptr() as a convenience wrapper to implement
    all modifying operations.
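
    A minimal user-space sketch of that split, assuming a single global
    counter in place of the per-thread storage. pc_storage_sketch and
    preempt_count_add_sketch() are hypothetical names; the other three
    functions follow the names given above.

```c
#include <assert.h>

/* Hypothetical storage; the kernel keeps this per thread, not globally. */
static int pc_storage_sketch;

/* Read-only accessor: returns the value, so it cannot be assigned to. */
static int preempt_count(void)
{
    return pc_storage_sketch;
}

/* Convenience wrapper used to implement every modifying operation. */
static int *preempt_count_ptr(void)
{
    return &pc_storage_sketch;
}

/* Explicit setter replacing the old "preempt_count() = x" idiom. */
static void preempt_count_set(int pc)
{
    assert(pc >= 0);
    *preempt_count_ptr() = pc;
}

/* Example modifying operation built on preempt_count_ptr(). */
static void preempt_count_add_sketch(int val)
{
    *preempt_count_ptr() += val;
}
```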

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org
    [ Fixed build failure. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

28 Jun, 2013

1 commit


16 May, 2013

1 commit

  • Pull timer fixes from Thomas Gleixner:

    - Cure for not using zalloc in the first place, which leads to random
    crashes with CPUMASK_OFF_STACK.

    - Revert a user space visible change which broke udev

    - Add a missing cpu_online early return introduced by the new full
    dyntick conversions

    - Plug a long standing race in the timer wheel cpu hotplug code.
    Sigh...

    - Cleanup NOHZ per cpu data on cpu down to prevent stale data on cpu
    up.

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: Revert ALWAYS_USE_PERSISTENT_CLOCK compile time optimizaitons
    timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARE
    tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline
    tick: Cleanup NOHZ per cpu data on cpu down
    tick: Use zalloc_cpumask_var for allocating offstack cpumasks

    Linus Torvalds
     

14 May, 2013

1 commit

  • An inactive timer's base can refer to an offline cpu's base.

    In the current code, cpu_base's lock is blindly reinitialized each
    time a CPU is brought up. If a CPU is brought online while another
    thread is trying to modify an inactive timer on that CPU while
    holding its timer base lock, then the lock will be reinitialized
    under its feet. This leads to the following SPIN_BUG().

    BUG: spinlock already unlocked on CPU#3, kworker/u:3/1466
    lock: 0xe3ebe000, .magic: dead4ead, .owner: kworker/u:3/1466, .owner_cpu: 1
    [] (unwind_backtrace+0x0/0x11c) from [] (do_raw_spin_unlock+0x40/0xcc)
    [] (do_raw_spin_unlock+0x40/0xcc) from [] (_raw_spin_unlock+0x8/0x30)
    [] (_raw_spin_unlock+0x8/0x30) from [] (mod_timer+0x294/0x310)
    [] (mod_timer+0x294/0x310) from [] (queue_delayed_work_on+0x104/0x120)
    [] (queue_delayed_work_on+0x104/0x120) from [] (sdhci_msm_bus_voting+0x88/0x9c)
    [] (sdhci_msm_bus_voting+0x88/0x9c) from [] (sdhci_disable+0x40/0x48)
    [] (sdhci_disable+0x40/0x48) from [] (mmc_release_host+0x4c/0xb0)
    [] (mmc_release_host+0x4c/0xb0) from [] (mmc_sd_detect+0x90/0xfc)
    [] (mmc_sd_detect+0x90/0xfc) from [] (mmc_rescan+0x7c/0x2c4)
    [] (mmc_rescan+0x7c/0x2c4) from [] (process_one_work+0x27c/0x484)
    [] (process_one_work+0x27c/0x484) from [] (worker_thread+0x210/0x3b0)
    [] (worker_thread+0x210/0x3b0) from [] (kthread+0x80/0x8c)
    [] (kthread+0x80/0x8c) from [] (kernel_thread_exit+0x0/0x8)

    As an example, this particular crash occurred when CPU #3 was executing
    mod_timer() on an inactive timer whose base referred to the offlined CPU
    #2. The code locked the timer_base corresponding to CPU #2. Before it
    could proceed, CPU #2 came online and reinitialized the spinlock
    corresponding to its base. Thus CPU #3 now held a lock which had been
    reinitialized. When CPU #3 finally ended up unlocking the old cpu_base
    corresponding to CPU #2, we hit the above SPIN_BUG().

    CPU #0       CPU #3                                  CPU #2
    ------       -------                                 -------
    .....        ......
                 mod_timer()
                  lock_timer_base
                    spin_lock_irqsave(&base->lock)

    cpu_up(2)    .....                                   ......
                                                         init_timers_cpu()
    ....         .....                                   spin_lock_init(&base->lock)
    .....        spin_unlock_irqrestore(&base->lock)     ......
    Allocation of per_cpu timer vector bases is done only once, under the
    "tvec_base_done[]" check. In the current code, the spinlock
    initialization of base->lock isn't under this check, so the base lock
    is reinitialized each time a CPU is brought up. Move the base spinlock
    initialization under the check.
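
    The fix amounts to an initialise-once pattern, sketched here in user
    space. Everything in this block is a hypothetical model: the lock is
    reduced to a flag plus an initialisation counter, and base_done_sketch[]
    plays the role of tvec_base_done[].

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical model of a per-CPU base whose lock must survive hotplug. */
struct hotplug_base_sketch {
    bool lock_held;  /* stands in for the spinlock state */
    int lock_inits;  /* how often the lock was (re)initialised */
};

static struct hotplug_base_sketch bases_sketch[NR_CPUS_SKETCH];
static bool base_done_sketch[NR_CPUS_SKETCH];

/* Called on every bring-up of `cpu`. */
static void init_timers_cpu_sketch(int cpu)
{
    struct hotplug_base_sketch *base;

    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    base = &bases_sketch[cpu];

    if (!base_done_sketch[cpu]) {
        /* One-time setup: the lock is initialised here and only here. */
        base->lock_held = false;
        base->lock_inits++;
        base_done_sketch[cpu] = true;
    }

    /* Work that is safe to repeat on every bring-up would go here. */
}
```

    A second bring-up of the same CPU then leaves a lock held by another
    CPU untouched.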

    Signed-off-by: Tirupathi Reddy
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1368520142-4136-1-git-send-email-tirupath@codeaurora.org
    Signed-off-by: Thomas Gleixner

    Tirupathi Reddy
     

06 May, 2013

1 commit

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to make input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     

01 May, 2013

3 commits

  • Andrew Morton noted:

    akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
    SYSCALL_DEFINE1(alarm, unsigned int, seconds)
    SYSCALL_DEFINE0(getpid)
    SYSCALL_DEFINE0(getppid)
    SYSCALL_DEFINE0(getuid)
    SYSCALL_DEFINE0(geteuid)
    SYSCALL_DEFINE0(getgid)
    SYSCALL_DEFINE0(getegid)
    SYSCALL_DEFINE0(gettid)
    SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
    COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

    Only one of those should be in kernel/timer.c. Who wrote this thing?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Rothwell
    Acked-by: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Signed-off-by: Stephen Rothwell
    Cc: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • The only use outside of kernel/timer.c was in kernel/compat.c, so move
    compat_sys_sysinfo() next to sys_sysinfo() in kernel/timer.c.

    Signed-off-by: Stephen Rothwell
    Cc: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Acked-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     

03 Apr, 2013

1 commit

  • We are planning to convert the dynticks Kconfig options layout
    into a choice menu. The user must be able to easily pick
    any of the following implementations: constant periodic tick,
    idle dynticks, full dynticks.

    As this implies a mutual exclusion, the two dynticks implementations
    need to converge on the selection of a common Kconfig option in order
    to ease the sharing of a common infrastructure.

    It would thus seem pretty natural to reuse CONFIG_NO_HZ to
    that end. It already implements all the idle dynticks code
    and the full dynticks depends on all that code for now.
    So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
    CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ.

    On the other hand we want to stay backward compatible: if
    CONFIG_NO_HZ is set in an older config file, we want to
    enable CONFIG_NO_HZ_IDLE by default.

    But we can't afford both at the same time or we run into
    a circular dependency:

    1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
    CONFIG_NO_HZ
    2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE

    We might be able to support that from Kconfig/Kbuild but it
    may not be wise to introduce such a confusing behaviour.

    So to solve this, create a new CONFIG_NO_HZ_COMMON option
    which gathers the common code between idle and full dynticks
    (that common code for now is simply the idle dynticks code)
    and select it from their referring Kconfig.

    Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
    to it for backward compatibility.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

21 Mar, 2013

1 commit

  • Wake up a CPU when a timer list timer is enqueued there and
    the target is part of the full dynticks range. Sending an IPI
    to it makes it reconsider the next timer to program on top
    of recent updates.

    This may later be improved by checking if the tick is really
    stopped on the target. This would need some careful
    synchronization though. So deal with such optimization later
    and start simple.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

20 Feb, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     

08 Feb, 2013

1 commit

  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

18 Nov, 2012

1 commit

  • klogd is woken up asynchronously from the tick in order
    to do it safely.

    However if printk is called when the tick is stopped, the reader
    won't be woken up until the next interrupt, which might not fire
    for a while. As a result, the user may miss some message.

    To fix this, let's implement the printk tick using a lazy irq work.
    This subsystem takes care of the timer tick state and can
    fix up accordingly.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Paul Gortmaker

    Frederic Weisbecker
     

10 Oct, 2012

2 commits


21 Aug, 2012

3 commits

  • Timer internals are protected with irq-safe locks but timer execution
    isn't, so a timer being dequeued for execution and its execution
    aren't atomic against IRQs. This makes it impossible to wait for its
    completion from IRQ handlers and difficult to shoot down a timer from
    IRQ handlers.

    This caused some problems for the delayed_work interface. Because
    there's no way to reliably shoot down delayed_work->timer from IRQ
    handlers, __cancel_delayed_work() can't share the logic to steal the
    target delayed_work with cancel_delayed_work_sync(), and can only
    steal delayed_works which are queued on a timer. Similarly, the
    pending mod_delayed_work() can't be used from IRQ handlers.

    This patch adds a new timer flag TIMER_IRQSAFE, which causes the timer
    to be executed without enabling IRQs after dequeueing, so that its
    dequeueing and execution are atomic against IRQ handlers.

    This makes it safe to wait for the timer's completion from IRQ
    handlers, for example, using del_timer_sync(). It can never be
    executing on the local CPU and if executing on other CPUs it won't be
    interrupted until done.

    This will enable simplifying delayed_work cancel/mod interface.

    Signed-off-by: Tejun Heo
    Cc: torvalds@linux-foundation.org
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1344449428-24962-5-git-send-email-tj@kernel.org
    Signed-off-by: Thomas Gleixner

    Tejun Heo
     
  • Over time, timer initializers became messy, with unnecessarily
    duplicated code which is inconsistently spread across timer.h and
    timer.c.

    This patch cleans up timer initializers.

    * timer.c::__init_timer() is renamed to do_init_timer().

    * __TIMER_INITIALIZER() added. It takes @flags and all initializers
    are wrappers around it.

    * init_timer[_on_stack]_key() now take @flags.

    * __init_timer[_on_stack]() added. They take @flags and all init
    macros are wrappers around them.

    * __setup_timer[_on_stack]() added. They use __init_timer[_on_stack]()
    and take @flags. All setup macros are wrappers around the two.

    Note that this patch doesn't add missing init/setup combinations -
    e.g. init_timer_deferrable_on_stack(). Adding missing ones is
    trivial.

    Signed-off-by: Tejun Heo
    Cc: torvalds@linux-foundation.org
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1344449428-24962-4-git-send-email-tj@kernel.org
    Signed-off-by: Thomas Gleixner

    Tejun Heo
     
  • To prepare for addition of another flag, generalize timer->base flags
    handling.

    * Rename from TBASE_*_FLAG to TIMER_* and make them LU constants.

    * Define and use TIMER_FLAG_MASK for flags masking so that multiple
    flags can be handled correctly.

    * Don't dereference timer->base directly even if
    !tbase_get_deferrable(). Both such places are already passed
    @base, so use it instead.

    * Make sure tvec_base's alignment is large enough for timer->base
    flags using BUILD_BUG_ON().
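
    The flag handling can be illustrated with a user-space pointer-tagging
    sketch. The flag values are assumptions (two bits, mask 0x3), struct
    tbase_sketch and the three helpers are hypothetical, and C11
    _Static_assert stands in for BUILD_BUG_ON().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag layout for this sketch. */
#define TIMER_DEFERRABLE 0x1UL
#define TIMER_IRQSAFE    0x2UL
#define TIMER_FLAG_MASK  0x3UL

/* Hypothetical stand-in for the timer base structure. */
struct tbase_sketch {
    unsigned long timer_jiffies;
};

/* Compile-time check that the low flag bits of a base pointer are free. */
_Static_assert(_Alignof(struct tbase_sketch) > TIMER_FLAG_MASK,
               "base alignment too small for the flag bits");

/* Combine a base pointer and flags into one word. */
static uintptr_t pack_base_sketch(struct tbase_sketch *base,
                                  unsigned long flags)
{
    assert(((uintptr_t)base & TIMER_FLAG_MASK) == 0);
    return (uintptr_t)base | (flags & TIMER_FLAG_MASK);
}

/* Recover the pointer by masking every flag bit, not just one. */
static struct tbase_sketch *unpack_base_sketch(uintptr_t word)
{
    return (struct tbase_sketch *)(word & ~(uintptr_t)TIMER_FLAG_MASK);
}

/* Recover the flags. */
static unsigned long unpack_flags_sketch(uintptr_t word)
{
    return (unsigned long)(word & TIMER_FLAG_MASK);
}
```

    Masking with TIMER_FLAG_MASK rather than a single flag is what lets
    several flags coexist in the same word.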

    Signed-off-by: Tejun Heo
    Cc: torvalds@linux-foundation.org
    Cc: peterz@infradead.org
    Link: http://lkml.kernel.org/r/1344449428-24962-2-git-send-email-tj@kernel.org
    Signed-off-by: Thomas Gleixner

    Tejun Heo
     

19 Aug, 2012

1 commit

  • New helper: current_thread_info(). It allows a bunch of odd syscalls to be done
    in C. While we are at it, there had never been a reason to do
    osf_getpriority() in assembler. We also get "namespace"-aware (read:
    consistent with getuid(2), etc.) behaviour from getx?id() syscalls now.

    Signed-off-by: Al Viro
    Signed-off-by: Michael Cree
    Acked-by: Matt Turner
    Signed-off-by: Linus Torvalds

    Al Viro
     

06 Jun, 2012

4 commits

  • Gilad reported at

    http://lkml.kernel.org/r/1336056962-10465-2-git-send-email-gilad@benyossef.com

    "Current timer code fails to correctly return a value meaning that
    there is no future timer event, with the result that the timer keeps
    getting re-armed in HZ one shot mode even when we could turn it off,
    generating unneeded interrupts.

    What is happening is that when __next_timer_interrupt() wishes
    to return a value that signifies "there is no future timer
    event", it returns (base->timer_jiffies + NEXT_TIMER_MAX_DELTA).

    However, the code in tick_nohz_stop_sched_tick(), which called
    __next_timer_interrupt() via get_next_timer_interrupt(),
    compares the return value to (last_jiffies + NEXT_TIMER_MAX_DELTA)
    to see if the timer needs to be re-armed.

    base->timer_jiffies != last_jiffies and so tick_nohz_stop_sched_tick()
    interprets the return value as an indication that there is a distant
    future event 12 days from now and programs the timer to fire next
    after KTIME_MAX nsecs instead of not arming it. This ends up
    causing a needless interrupt once every KTIME_MAX nsecs."

    Fix this by using the new active timer accounting. This avoids scans
    when no active timer is enqueued completely, so we don't have to rely
    on base->timer_next and base->timer_jiffies anymore.
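
    The mismatch in the report, and why a count of active timers avoids
    it, can be reproduced in a few lines. This is a simplified model:
    the sk_/SK_ names, the delta value and the function signatures are
    assumptions for illustration, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for NEXT_TIMER_MAX_DELTA; the value is illustrative only. */
#define SK_MAX_DELTA ((1UL << 30) - 1)

/* Old scheme: "nothing pending" is encoded relative to the base's clock. */
static inline unsigned long sk_next_event_old(unsigned long base_timer_jiffies)
{
    return base_timer_jiffies + SK_MAX_DELTA;
}

/* The caller decodes "nothing pending" relative to its own reference. */
static inline bool sk_caller_sees_no_event(unsigned long next,
                                           unsigned long last_jiffies)
{
    return next == last_jiffies + SK_MAX_DELTA;
}

/* Count-based scheme: with no active timer, answer in the caller's terms. */
static inline unsigned long sk_next_event_new(unsigned long now,
                                              unsigned long active_timers,
                                              unsigned long earliest_expiry)
{
    if (active_timers == 0)
        return now + SK_MAX_DELTA;
    return earliest_expiry;
}

static inline void sk_sentinel_example(void)
{
    unsigned long base_timer_jiffies = 1000, last_jiffies = 1005;

    /* The two clocks differ, so the old sentinel is misread as an event. */
    assert(!sk_caller_sees_no_event(sk_next_event_old(base_timer_jiffies),
                                    last_jiffies));
    /* With zero active timers the new scheme matches the caller's check. */
    assert(sk_caller_sees_no_event(sk_next_event_new(last_jiffies, 0, 0),
                                   last_jiffies));
}
```

    The point of the sketch is only that the sentinel must be computed
    from the same reference time the caller compares against.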

    Reported-by: Gilad Ben-Yossef
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20120525214819.317535385@linutronix.de

    Thomas Gleixner
     
  • The code in get_next_timer_interrupt() is suboptimal as it has to run
    through the cascade to find the next expiring timer. On a completely
    idle core we should only do that when there is an active timer
    enqueued and base->next_timer does not give us a fast answer.

    Add accounting of the active timers to the now consolidated
    attach/detach code. I deliberately avoided sanity checks because the
    code is fully symmetric and any fiddling with timers w/o using the API
    functions will lead to cute explosions anyway. ulong is big enough
    even on 32bit and if we really run into the situation to have more
    than 1<
    Cc: Peter Zijlstra
    Cc: Gilad Ben-Yossef
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20120525214819.236377028@linutronix.de
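
    A rough model of the accounting described in this commit: a counter
    that is bumped on attach and dropped on detach, so the idle path can
    skip the scan when it is zero. The sk_ names are hypothetical, and
    leaving deferrable timers out of the count is an assumption of this
    sketch rather than something the message above states. In keeping
    with the message, there are no sanity checks on the counter.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_base {
    unsigned long active_timers;   /* timers that can end idle early */
};

struct sk_timer {
    bool pending;
    bool deferrable;
};

/* Attach and detach are symmetric: one increments, the other decrements. */
static inline void sk_attach(struct sk_base *base, struct sk_timer *t)
{
    t->pending = true;
    if (!t->deferrable)
        base->active_timers++;
}

static inline void sk_detach(struct sk_base *base, struct sk_timer *t)
{
    t->pending = false;
    if (!t->deferrable)
        base->active_timers--;
}

/* Fast answer for the idle path: no counted timer, no scan. */
static inline bool sk_wheel_scan_needed(const struct sk_base *base)
{
    return base->active_timers != 0;
}

static inline void sk_accounting_example(void)
{
    struct sk_base base = { 0 };
    struct sk_timer normal = { false, false };
    struct sk_timer lazy = { false, true };

    sk_attach(&base, &lazy);
    assert(!sk_wheel_scan_needed(&base));   /* deferrable is not counted */
    sk_attach(&base, &normal);
    assert(sk_wheel_scan_needed(&base));
    sk_detach(&base, &normal);
    assert(!sk_wheel_scan_needed(&base));
}
```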

    Thomas Gleixner
     
  • Another bunch of mindlessly copied code. All callers of
    internal_add_timer() except the recascading code updates
    base->next_timer.

    Move this into internal_add_timer() and let the cascading code call
    __internal_add_timer().
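
    The split between a raw enqueue and a wrapper that maintains the
    next-expiry hint can be sketched like this. The sk_ names are
    hypothetical, bucket selection is elided, and the wrap-safe
    comparison macro is a local stand-in written for this example.

```c
#include <assert.h>

/* Wrap-safe "a is earlier than b" for jiffies-style counters. */
#define sk_time_before(a, b) ((long)((a) - (b)) < 0)

struct sk_base {
    unsigned long next_timer;   /* hint: earliest known expiry */
    unsigned long nr_queued;    /* stands in for the wheel's lists */
};

struct sk_timer {
    unsigned long expires;
};

/* Raw enqueue: only places the timer. Used when recascading. */
static inline void sk_raw_add_timer(struct sk_base *base, struct sk_timer *t)
{
    (void)t;                    /* bucket selection elided */
    base->nr_queued++;
}

/* Normal enqueue: place the timer, then keep the hint up to date. */
static inline void sk_add_timer(struct sk_base *base, struct sk_timer *t)
{
    sk_raw_add_timer(base, t);
    if (sk_time_before(t->expires, base->next_timer))
        base->next_timer = t->expires;
}

static inline void sk_add_timer_example(void)
{
    struct sk_base base = { .next_timer = 100, .nr_queued = 0 };
    struct sk_timer early = { .expires = 50 };
    struct sk_timer moved = { .expires = 20 };

    sk_add_timer(&base, &early);
    assert(base.next_timer == 50);      /* wrapper updated the hint */
    sk_raw_add_timer(&base, &moved);
    assert(base.next_timer == 50);      /* raw path leaves it alone */
    assert(base.nr_queued == 2);
}
```

    Callers then have one place that updates the hint instead of a copy
    after every enqueue.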

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Gilad Ben-Yossef
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20120525214819.189946224@linutronix.de

    Thomas Gleixner
     
  • Most callers of detach_timer() have the same pattern around
    them. Check whether the timer is pending and eventually updating
    base->next_timer.

    Create detach_if_pending() and replace the duplicated code.
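
    A sketch of what such a helper might look like, with the pending
    check and the hint update folded into one function. The sk_ names
    and struct fields are hypothetical, and resetting the hint to the
    base's current position is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_base {
    unsigned long timer_jiffies;   /* current wheel position */
    unsigned long next_timer;      /* hint: earliest known expiry */
};

struct sk_timer {
    bool pending;
    unsigned long expires;
};

/* Returns 1 if the timer was pending and has been detached, else 0. */
static inline int sk_detach_if_pending(struct sk_timer *t, struct sk_base *base)
{
    if (!t->pending)
        return 0;

    t->pending = false;
    /* If this timer defined the hint, invalidate the hint so that the
     * next lookup recomputes it. */
    if (t->expires == base->next_timer)
        base->next_timer = base->timer_jiffies;
    return 1;
}

static inline void sk_detach_example(void)
{
    struct sk_base base = { .timer_jiffies = 100, .next_timer = 150 };
    struct sk_timer t = { .pending = true, .expires = 150 };

    assert(sk_detach_if_pending(&t, &base) == 1);
    assert(base.next_timer == 100);
    assert(sk_detach_if_pending(&t, &base) == 0);   /* second call is a no-op */
}
```

    Callers can branch on the return value instead of repeating the
    check-then-detach sequence themselves.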

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Gilad Ben-Yossef
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20120525214819.131246037@linutronix.de

    Thomas Gleixner
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed, removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes, in performance or
    in operation, with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these values specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."
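
    The type-safety argument in the highlights can be illustrated with
    a wrapper struct. This is a userspace sketch, not the kernel's
    uidgid.h: the sk_ names are hypothetical and the overflow value
    65534 is used here only as an illustrative placeholder.

```c
#include <assert.h>
#include <stdbool.h>

/* Wrapping the id in a struct makes it a distinct type: code that still
 * writes "a == b" or passes a raw integer no longer compiles. */
typedef struct {
    unsigned int val;
} sk_kuid_t;

#define SK_OVERFLOW_UID 65534u   /* illustrative placeholder value */

static inline sk_kuid_t sk_make_kuid(unsigned int raw)
{
    return (sk_kuid_t){ raw };
}

/* Comparison has to go through a helper. */
static inline bool sk_uid_eq(sk_kuid_t a, sk_kuid_t b)
{
    return a.val == b.val;
}

/* (unsigned)-1 is reserved as the internal "cannot be mapped" value. */
static inline bool sk_uid_valid(sk_kuid_t u)
{
    return u.val != (unsigned int)-1;
}

/* setuid-like call: fail when the id cannot be mapped. */
static inline int sk_setuid(sk_kuid_t *current_uid, sk_kuid_t new_uid)
{
    if (!sk_uid_valid(new_uid))
        return -1;
    *current_uid = new_uid;
    return 0;
}

/* stat-like call: report the overflow id instead of failing. */
static inline unsigned int sk_uid_for_stat(sk_kuid_t u)
{
    return sk_uid_valid(u) ? u.val : SK_OVERFLOW_UID;
}

static inline void sk_kuid_example(void)
{
    sk_kuid_t cur = sk_make_kuid(1000);
    sk_kuid_t unmapped = sk_make_kuid((unsigned int)-1);

    assert(sk_uid_eq(cur, sk_make_kuid(1000)));
    assert(sk_setuid(&cur, unmapped) == -1);   /* setuid fails */
    assert(sk_uid_eq(cur, sk_make_kuid(1000)));
    assert(sk_uid_for_stat(unmapped) == SK_OVERFLOW_UID);
}
```

    Because C has no == for struct types, an unconverted comparison of
    two such ids is rejected by the compiler rather than slipping
    through, which is the effect the second highlight describes.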

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

23 May, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "Nothing exciting. Most are updates to debug stuff and related fixes.
    Two not-too-critical bugs are fixed - WARN_ON() triggering spuriously
    during cpu offlining and an unlikely lockdep-related oops."

    * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    lockdep: fix oops in processing workqueue
    workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active
    workqueue: Catch more locking problems with flush_work()
    workqueue: change BUG_ON() to WARN_ON()
    trace: Remove unused workqueue tracer

    Linus Torvalds