12 Dec, 2011

1 commit

  • Earlier versions of RCU used the scheduling-clock tick to detect idleness
    by checking for the idle task, but handled idleness differently for
    CONFIG_NO_HZ=y. However, there are now a number of uses of RCU read-side
    critical sections in the idle task (for example, for tracing), so a more
    fine-grained detection of idleness is required.

    This commit presses the old dyntick-idle code into full-time service,
    so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is
    always invoked at the beginning of an idle loop iteration. Similarly,
    rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked
    at the end of an idle-loop iteration. This allows the idle task to
    use RCU everywhere except between consecutive rcu_idle_enter() and
    rcu_idle_exit() calls, in turn allowing architecture maintainers to
    specify exactly where in the idle loop RCU may be used.
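
    As a rough illustration, the pattern this enables in an architecture's
    idle loop looks like the sketch below; cpu_idle(), safe_halt() and the
    loop structure are illustrative stand-ins for arch-specific code, not
    taken from any particular architecture:

        static void cpu_idle(void)
        {
                while (1) {
                        rcu_idle_enter();        /* no RCU readers past here */
                        while (!need_resched())
                                safe_halt();     /* arch low-power wait */
                        rcu_idle_exit();         /* RCU readers legal again */
                        schedule();
                }
        }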

    Because some of the userspace upcall uses can result in what looks
    to RCU like half of an interrupt, it is not possible to expect that
    the irq_enter() and irq_exit() hooks will give exact counts. This
    patch therefore expands the ->dynticks_nesting counter to 64 bits
    and uses two separate bitfields to count process/idle transitions
    and interrupt entry/exit transitions. It is presumed that userspace
    upcalls do not happen in the idle loop or from usermode execution
    (though usermode might do a system call that results in an upcall).
    The counter is hard-reset on each process/idle transition, which
    avoids the interrupt entry/exit error from accumulating. Overflow
    is avoided by the 64-bitness of the ->dynticks_nesting counter.
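
    A toy user-space model of the split counter is sketched below; the field
    widths, names, and values are illustrative only, not the kernel's actual
    layout:

        #include <stdint.h>
        #include <stdio.h>

        #define IRQ_NEST_BITS  32   /* low field: irq_enter()/irq_exit() */
        #define IRQ_NEST_MASK  ((1ULL << IRQ_NEST_BITS) - 1)
        #define TASK_NEST_UNIT (1ULL << IRQ_NEST_BITS) /* process/idle field */

        int main(void)
        {
                uint64_t dynticks_nesting = 0;

                dynticks_nesting = TASK_NEST_UNIT;  /* hard reset on idle exit */
                dynticks_nesting++;                 /* irq_enter()  */
                dynticks_nesting--;                 /* irq_exit()   */
                /* A missed irq_exit() from a half-interrupt upcall perturbs
                 * only the low field; the next process/idle transition
                 * hard-resets it, so the error cannot accumulate. */
                printf("task nesting %llu, irq nesting %llu\n",
                       (unsigned long long)(dynticks_nesting >> IRQ_NEST_BITS),
                       (unsigned long long)(dynticks_nesting & IRQ_NEST_MASK));
                return 0;
        }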

    This commit also adds warnings if a non-idle task asks RCU to enter
    idle state (and these checks will need some adjustment before applying
    Frederic's OS-jitter patches, http://lkml.org/lkml/2011/10/7/246).
    In addition, validation of ->dynticks and ->dynticks_nesting is added.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

10 Jun, 2011

1 commit

  • Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec of the
    preempt count independently, so that the count can be updated by
    preempt_disable() and preempt_enable() even without CONFIG_PREEMPT
    being set.

    This prepares to make CONFIG_DEBUG_SPINLOCK_SLEEP work with
    !CONFIG_PREEMPT, where it currently doesn't detect code that sleeps
    inside explicitly preempt-disabled sections.
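
    A sketch of the intended structure (not the verbatim diff; under
    CONFIG_PREEMPT proper, preempt_enable() additionally folds in a
    reschedule check):

        #ifdef CONFIG_PREEMPT_COUNT
        # define preempt_disable()              \
          do {                                  \
                  inc_preempt_count();          \
                  barrier();                    \
          } while (0)
        # define preempt_enable()               \
          do {                                  \
                  barrier();                    \
                  dec_preempt_count();          \
          } while (0)
        #else
        # define preempt_disable()      barrier()
        # define preempt_enable()       barrier()
        #endif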

    Signed-off-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

05 Mar, 2011

1 commit

  • This removes the implementation of the big kernel lock,
    at last. A lot of people have worked on this in the
    past, so the credit for this patch should go to
    everyone who participated in the hunt.

    The names on the Cc list are the people that were the
    most active in this, according to the recorded git
    history, in alphabetical order.

    Signed-off-by: Arnd Bergmann
    Acked-by: Alan Cox
    Cc: Alessio Igor Bogani
    Cc: Al Viro
    Cc: Andrew Hendry
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Eric W. Biederman
    Cc: Frederic Weisbecker
    Cc: Hans Verkuil
    Acked-by: Ingo Molnar
    Cc: Jan Blunck
    Cc: John Kacur
    Cc: Jonathan Corbet
    Cc: Linus Torvalds
    Cc: Matthew Wilcox
    Cc: Oliver Neukum
    Cc: Paul Menage
    Acked-by: Thomas Gleixner
    Cc: Trond Myklebust

    Arnd Bergmann
     

19 Nov, 2010

1 commit

  • This really isn't the right thing to do, and strictly speaking we should
    have the BKL depth count in the thread info right next to the preempt
    count. The two really do go together.

    However, since that would involve a patch to all architectures, and the
    BKL is finally going away, it's simply not worth the effort to do the
    RightThing(tm). Just re-instate the include that we
    used to get accidentally from the smp_lock.h one.

    This is all fallout from the same old "BKL: remove extraneous #include
    <smp_lock.h>" commit.

    Reported-by: Ingo Molnar
    Tested-by: Randy Dunlap
    Cc: Arnd Bergmann
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Nov, 2010

3 commits

  • Commit 451a3c24b013 ("BKL: remove extraneous #include <smp_lock.h>")
    removed the #include line that was the only thing that was surrounded by
    the #ifdef/#endif.

    So now that #ifdef is guarding nothing at all. Just remove it.

    Reported-by: Byeong-ryeol Kim
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Arnd Bergmann did an automated scripting run to find left-over instances
    of <linux/smp_lock.h>, and had made it trigger on the normal BKL use
    of lock_kernel and unlock_kernel (and apparently release_kernel_lock and
    reacquire_kernel_lock too, used by the scheduler).

    That resulted in commit 451a3c24b013 ("BKL: remove extraneous #include
    <smp_lock.h>").

    However, hardirq.h was the only remaining user of the old
    'kernel_locked()' interface, and Arnd's script hadn't checked for that.
    So depending on your configuration and what header files had been
    included, you would get errors like "implicit declaration of function
    'kernel_locked'" during the build.

    The right fix is not to just re-instate the smp_lock.h include - it is
    to just remove 'kernel_locked()' entirely, since the only use was this
    one special low-level detail. Just make hardirq.h do it directly.

    In fact this simplifies and clarifies the code, because some trivial
    analysis makes it clear that hardirq.h only ever used _one_ of the two
    definitions of kernel_locked(), so we can remove the other one entirely.
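
    A sketch of "do it directly", assuming the PREEMPT_INATOMIC_BASE macro
    of that era: test the BKL depth recorded in the task structure rather
    than calling the kernel_locked() wrapper.

        #ifdef CONFIG_BKL
        /* was: kernel_locked(), pulled in via linux/smp_lock.h */
        # define PREEMPT_INATOMIC_BASE  (current->lock_depth >= 0)
        #else
        # define PREEMPT_INATOMIC_BASE  0
        #endif

    Note that current presumably needs a declaration that hardirq.h used to
    inherit indirectly, which is what the entry above re-instates.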

    Reported-by: Zimny Lech
    Reported-and-acked-by: Randy Dunlap
    Acked-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The big kernel lock has been removed from all these files at some point,
    leaving only the #include <linux/smp_lock.h>.

    Remove this too as a cleanup.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

02 Nov, 2010

1 commit

  • The preempt count logic tries to take the BKL into account, which breaks
    when CONFIG_BKL is not set.

    Use the same preempt_count offset that we use without CONFIG_PREEMPT
    when CONFIG_BKL is disabled.
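
    A sketch of the fix, assuming the PREEMPT_INATOMIC_BASE macro used by
    in_atomic() in the hardirq.h of the time:

        #ifdef CONFIG_BKL
        # define PREEMPT_INATOMIC_BASE  kernel_locked()
        #else
        # define PREEMPT_INATOMIC_BASE  0   /* same base as !CONFIG_PREEMPT */
        #endif

        #define in_atomic() \
                ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)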

    Signed-off-by: Arnd Bergmann
    Reported-and-tested-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

22 Oct, 2010

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-irqflags:
    Fix IRQ flag handling naming
    MIPS: Add missing #inclusions of
    smc91x: Add missing #inclusion of
    Drop a couple of unnecessary asm/system.h inclusions
    SH: Add missing consts to sys_execve() declaration
    Blackfin: Rename IRQ flags handling functions
    Blackfin: Add missing dep to asm/irqflags.h
    Blackfin: Rename DES PC2() symbol to avoid collision
    Blackfin: Split the BF532 BFIN_*_FIO_FLAG() functions to their own header
    Blackfin: Split PLL code from mach-specific cdef headers

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
    sched: Export account_system_vtime()
    sched: Call tick_check_idle before __irq_enter
    sched: Remove irq time from available CPU power
    sched: Do not account irq time to current task
    x86: Add IRQ_TIME_ACCOUNTING
    sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
    sched: Add a PF flag for ksoftirqd identification
    sched: Consolidate account_system_vtime extern declaration
    sched: Fix softirq time accounting
    sched: Drop group_capacity to 1 only if local group has extra capacity
    sched: Force balancing on newidle balance if local group has capacity
    sched: Set group_imb only when a task can be pulled from the busiest cpu
    sched: Do not consider SCHED_IDLE tasks to be cache hot
    sched: Drop all load weight manipulation for RT tasks
    sched: Create special class for stop/migrate work
    sched: Unindent labels
    sched: Comment updates: fix default latency and granularity numbers
    tracing/sched: Add sched_pi_setprio tracepoint
    sched: Give CPU bound RT tasks preference
    sched: Try not to migrate higher priority RT tasks
    ...

    Linus Torvalds
     

19 Oct, 2010

3 commits

  • s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does
    fine-granularity accounting of user, system, hardirq, and softirq times.
    Adding that option on archs like x86 will be challenging, however, given
    the state of TSC reliability on various platforms and also the overhead it
    will add to syscall entry/exit.

    Instead, add a lighter variant that only does finer accounting of
    hardirq and softirq times, providing precise irq times (instead of timer tick
    based samples). This accounting is added with a new config option
    CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
    interested in paying the perf penalty.

    This accounting is based on sched_clock, with the code being generic.
    So, other archs may find it useful as well.

    This patch just adds the core logic and does not enable this logic yet.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Just a minor cleanup patch that makes things easier for the following patches.
    No functionality change in this patch.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Peter Zijlstra found a bug in the way softirq time is accounted in
    VIRT_CPU_ACCOUNTING on this thread:

    http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

    The problem is, softirq processing uses local_bh_disable internally. There
    is no way, later in the flow, to differentiate between whether softirq is
    being processed or bh has merely been disabled. So a hardirq arriving while
    bh is disabled results in time being wrongly accounted as softirq.

    Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
    as well, as account_system_time() in normal tick-based accounting also uses
    softirq_count, which will be set even when not in softirq but with bh
    disabled.

    Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq
    count for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while
    processing a softirq. The patch below does that and adds the API
    in_serving_softirq(), which returns whether we are currently processing
    a softirq.

    Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
    to in_serving_softirq.

    Looks like many usages of in_softirq really want in_serving_softirq. Those
    changes can be made individually on a case-by-case basis.
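
    The resulting scheme can be expressed in two lines (consistent with the
    description above; surrounding context omitted):

        #define SOFTIRQ_DISABLE_OFFSET  (2 * SOFTIRQ_OFFSET)

        /* Bit 0 of the softirq field is set only while a softirq is
         * actually being served, never by a bare local_bh_disable(): */
        #define in_serving_softirq()    (softirq_count() & SOFTIRQ_OFFSET)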

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

07 Oct, 2010

1 commit

  • Drop inclusions of asm/system.h from linux/hardirq.h and linux/list.h as
    they're no longer required and prevent the M68K arch's IRQ flag handling macros
    from being made into inlined functions due to circular dependencies.

    Signed-off-by: David Howells
    Acked-by: Greg Ungerer
    Acked-by: Geert Uytterhoeven

    David Howells
     

20 Aug, 2010

1 commit

  • Implement a small-memory-footprint uniprocessor-only implementation of
    preemptible RCU. This implementation uses but a single blocked-tasks
    list rather than the combinatorial number used per leaf rcu_node by
    TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies
    processing. This version also takes advantage of uniprocessor execution
    to accelerate grace periods in the case where there are no readers.

    The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU.

    This implementation is a step towards having RCU implementation driven
    off of the SMP and PREEMPT kernel configuration variables, which can
    happen once this implementation has accumulated sufficient experience.

    Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as
    suggested by Steve Rostedt in order to avoid the compiler-reordering
    issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183).

    As can be seen below, CONFIG_TINY_PREEMPT_RCU represents almost 5Kbytes
    of savings compared to CONFIG_TREE_PREEMPT_RCU. Of course, for non-real-time
    workloads, CONFIG_TINY_RCU is even better.

    CONFIG_TREE_PREEMPT_RCU

    text data bss dec filename
    13 0 0 13 kernel/rcupdate.o
    6170 825 28 7023 kernel/rcutree.o
    ----
    7026 Total

    CONFIG_TINY_PREEMPT_RCU

    text data bss dec filename
    13 0 0 13 kernel/rcupdate.o
    2081 81 8 2170 kernel/rcutiny.o
    ----
    2183 Total

    CONFIG_TINY_RCU (non-preemptible)

    text data bss dec filename
    13 0 0 13 kernel/rcupdate.o
    719 25 0 744 kernel/rcutiny.o
    ---
    757 Total

    Requested-by: Loïc Minier
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

26 Oct, 2009

1 commit

  • This patch is a version of RCU designed for !SMP, provided as a
    small-footprint RCU implementation. In particular, the
    implementation of synchronize_rcu() is extremely lightweight and
    high performance. It passes rcutorture testing in each of the
    four relevant configurations (combinations of NO_HZ and PREEMPT)
    on x86. This saves about 1K bytes compared to old Classic RCU
    (which is no longer in mainline), and more than three kilobytes
    compared to Hierarchical RCU (updated to 2.6.30):

    CONFIG_TREE_RCU:

    text data bss dec filename
    183 4 0 187 kernel/rcupdate.o
    2783 520 36 3339 kernel/rcutree.o
    3526 Total (vs 4565 for v7)

    CONFIG_TREE_PREEMPT_RCU:

    text data bss dec filename
    263 4 0 267 kernel/rcupdate.o
    4594 776 52 5422 kernel/rcutree.o
    5689 Total (6155 for v7)

    CONFIG_TINY_RCU:

    text data bss dec filename
    96 4 0 100 kernel/rcupdate.o
    734 24 0 758 kernel/rcutiny.o
    858 Total (vs 848 for v7)

    The above is for x86. Your mileage may vary on other platforms.
    Further compression is possible, but is being procrastinated.

    Changes from v7 (http://lkml.org/lkml/2009/10/9/388)

    o Apply Lai Jiangshan's review comments (aside from
    might_sleep() in synchronize_sched(), which is covered by SMP builds).

    o Fix up expedited primitives.

    Changes from v6 (http://lkml.org/lkml/2009/9/23/293).

    o Forward ported to put it into the 2.6.33 stream.

    o Added lockdep support.

    o Make lightweight rcu_barrier.

    Changes from v5 (http://lkml.org/lkml/2009/6/23/12).

    o Ported to latest pre-2.6.32 merge window kernel.

    - Renamed rcu_qsctr_inc() to rcu_sched_qs().
    - Renamed rcu_bh_qsctr_inc() to rcu_bh_qs().
    - Provided trivial rcu_cpu_notify().
    - Provided trivial exit_rcu().
    - Provided trivial rcu_needs_cpu().
    - Fixed up the rcu_*_enter/exit() functions in linux/hardirq.h.

    o Removed the dependence on EMBEDDED, with a view to making
    TINY_RCU default for !SMP at some time in the future.

    o Added (trivial) support for expedited grace periods.

    Changes from v4 (http://lkml.org/lkml/2009/5/2/91) include:

    o Squeeze the size down a bit further by removing the
    ->completed field from struct rcu_ctrlblk.

    o This permits synchronize_rcu() to become the empty function.
    Previous concerns about rcutorture were unfounded, as
    rcutorture correctly handles a constant value from
    rcu_batches_completed() and rcu_batches_completed_bh().

    Changes from v3 (http://lkml.org/lkml/2009/3/29/221) include:

    o Changed rcu_batches_completed(), rcu_batches_completed_bh()
    rcu_enter_nohz(), rcu_exit_nohz(), rcu_nmi_enter(), and
    rcu_nmi_exit(), to be static inlines, as suggested by David
    Howells. Doing this saves about 100 bytes from rcutiny.o.
    (The numbers between v3 and this v4 of the patch are not directly
    comparable, since they are against different versions of Linux.)

    Changes from v2 (http://lkml.org/lkml/2009/2/3/333) include:

    o Fix whitespace issues.

    o Change short-circuit "||" operator to instead be "+" in order
    to fix performance bug noted by "kraai" on LWN.

    (http://lwn.net/Articles/324348/)

    Changes from v1 (http://lkml.org/lkml/2009/1/13/440) include:

    o This version depends on EMBEDDED as well as !SMP, as suggested
    by Ingo.

    o Updated rcu_needs_cpu() to unconditionally return zero,
    permitting the CPU to enter dynticks-idle mode at any time.
    This works because callbacks can be invoked upon entry to
    dynticks-idle mode.

    o Paul is now OK with this being included, based on a poll at
    the Kernel Miniconf at linux.conf.au, where about ten people said
    that they cared about saving 900 bytes on single-CPU systems.

    o Applies to both mainline and tip/core/rcu.

    Signed-off-by: Paul E. McKenney
    Acked-by: David Howells
    Acked-by: Josh Triplett
    Reviewed-by: Lai Jiangshan
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: avi@redhat.com
    Cc: mtosatti@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

12 Sep, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (64 commits)
    sched: Fix sched::sched_stat_wait tracepoint field
    sched: Disable NEW_FAIR_SLEEPERS for now
    sched: Keep kthreads at default priority
    sched: Re-tune the scheduler latency defaults to decrease worst-case latencies
    sched: Turn off child_runs_first
    sched: Ensure that a child can't gain time over it's parent after fork()
    sched: enable SD_WAKE_IDLE
    sched: Deal with low-load in wake_affine()
    sched: Remove short cut from select_task_rq_fair()
    sched: Turn on SD_BALANCE_NEWIDLE
    sched: Clean up topology.h
    sched: Fix dynamic power-balancing crash
    sched: Remove reciprocal for cpu_power
    sched: Try to deal with low capacity, fix update_sd_power_savings_stats()
    sched: Try to deal with low capacity
    sched: Scale down cpu_power due to RT tasks
    sched: Implement dynamic cpu_power
    sched: Add smt_gain
    sched: Update the cpu_power sum during load-balance
    sched: Add SD_PREFER_SIBLING
    ...

    Linus Torvalds
     

22 Aug, 2009

1 commit

  • A couple of references to CONFIG_CLASSIC_RCU have survived.
    Although these are harmless, it is past time for them to go.
    The one in hardirq.h is strictly a readability problem.

    The two in pagemap.h appear to disable a !SMP performance
    optimization (which this patch re-enables).

    This does raise the issue as to whether pagemap.h should really
    be referring to the RCU implementation. Long term, I intend to
    make the RCU implementation driven by CONFIG_PREEMPT, at which
    point these should change from defined(CONFIG_TREE_RCU) to
    !defined(CONFIG_PREEMPT). In the meantime, is there something
    else that could be done in pagemap.h?

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: akpm@linux-foundation.org
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josht@linux.vnet.ibm.com
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

09 Aug, 2009

1 commit

  • The PREEMPT_ACTIVE setting doesn't actually need to be
    arch-specific, so set up a sane default for all arches to
    (hopefully) migrate to.

    > if we look at linux/hardirq.h, it makes this claim:
    > * - bit 28 is the PREEMPT_ACTIVE flag
    > if that's true, then why are we letting any arch set this define ? a
    > quick survey shows that half the arches (11) are using 0x10000000 (bit
    > 28) while the other half (10) are using 0x4000000 (bit 26). and then
    > there is the ia64 oddity which uses bit 30. the exact value here
    > shouldnt really matter across arches though should it ?

    Actually alpha, arm and avr32 also use bit 30 (0x40000000);
    there are only five (or eight, depending on how you count)
    architectures (blackfin, h8300, m68k, s390 and sparc) using bit
    26.
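
    A sketch of the generic fallback this sets up (the exact shift
    arithmetic is illustrative; architectures can still override it):

        #ifndef PREEMPT_ACTIVE
        #define PREEMPT_ACTIVE_BITS     1
        #define PREEMPT_ACTIVE_SHIFT    (NMI_SHIFT + NMI_BITS)
        #define PREEMPT_ACTIVE          (__IRQ_MASK(PREEMPT_ACTIVE_BITS) << \
                                         PREEMPT_ACTIVE_SHIFT)
        #endif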

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     

13 Jul, 2009

1 commit

  • * Remove smp_lock.h from files which don't need it (including some headers!)
    * Add smp_lock.h to files which do need it
    * Make smp_lock.h include conditional in hardirq.h
    It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

    This will make hardirq.h inclusion cheaper for every PREEMPT=n config
    (which includes allmodconfig/allyesconfig, BTW)
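
    The resulting conditional in hardirq.h looks roughly like this (a
    sketch, not the verbatim diff):

        #ifdef CONFIG_PREEMPT
        # include <linux/smp_lock.h>    /* only for kernel_locked() */
        #endif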

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

06 Apr, 2009

1 commit


24 Mar, 2009

1 commit

  • Add support for threaded interrupt handlers:

    A device driver can request that its main interrupt handler run in a
    thread. To achieve this the device driver requests the interrupt with
    request_threaded_irq() and provides, in addition to the handler, a
    thread function. The handler function is called in hard interrupt
    context and needs to check whether the interrupt originated from the
    device. If the interrupt originated from the device then the handler
    can either return IRQ_HANDLED or IRQ_WAKE_THREAD. IRQ_HANDLED is
    returned when no further action is required. IRQ_WAKE_THREAD causes
    the genirq code to invoke the threaded (main) handler. When
    IRQ_WAKE_THREAD is returned, the handler must have disabled the interrupt
    at the device level. This is mandatory for shared interrupt handlers,
    but we need to do it as well for obscure x86 hardware where disabling
    an interrupt on the IO_APIC level redirects the interrupt to the
    legacy PIC interrupt lines.
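
    A minimal usage sketch follows; struct mydev and the mydev_* helpers
    are hypothetical, and error handling is omitted:

        #include <linux/interrupt.h>

        struct mydev { int irq; /* ... */ };

        static bool mydev_irq_pending(struct mydev *dev);  /* hypothetical */
        static void mydev_mask_irq(struct mydev *dev);     /* hypothetical */

        static irqreturn_t mydev_quick(int irq, void *dev_id)
        {
                struct mydev *dev = dev_id;

                if (!mydev_irq_pending(dev))
                        return IRQ_NONE;    /* not ours (shared line) */
                mydev_mask_irq(dev);        /* mandatory before waking thread */
                return IRQ_WAKE_THREAD;
        }

        static irqreturn_t mydev_thread(int irq, void *dev_id)
        {
                /* main handler, runs in sleepable thread context */
                return IRQ_HANDLED;
        }

        static int mydev_setup_irq(struct mydev *dev)
        {
                return request_threaded_irq(dev->irq, mydev_quick,
                                            mydev_thread, IRQF_SHARED,
                                            "mydev", dev);
        }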

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar

    Thomas Gleixner
     

13 Feb, 2009

2 commits

  • Impact: avoid corruption in system time accounting

    Martin Schwidefsky told me that there was an issue with NMIs and
    system accounting. The problem is that the accounting code is
    not reentrant, and if an NMI goes off after an interrupt it can
    corrupt the accounting.

    For now, the best we can do is to treat NMIs like SMIs and leave them
    unaccounted for.

    This patch changes nmi_enter to not call __irq_enter and to do
    the preempt-count and tracing calls directly.
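
    An abbreviated sketch of the result (hook ordering and the exact set of
    calls differ slightly in the actual patch):

        #define nmi_enter()                                             \
                do {                                                    \
                        ftrace_nmi_enter();                             \
                        BUG_ON(in_nmi());                               \
                        add_preempt_count(NMI_OFFSET + HARDIRQ_OFFSET); \
                        lockdep_off();                                  \
                } while (0)

    The key point is that __irq_enter(), and with it the non-reentrant
    accounting code, is no longer invoked from NMI context.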

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • To add a bit in the preempt_count to be set when in NMI context, we
    found that some archs did not have enough bits to spare. This is
    due to the hardirq_count being a mask that can hold NR_IRQS.

    Some archs allow for over 16000 IRQs, and that would require a mask
    of 14 bits. The softirq mask is 8 bits and the preempt disable mask
    is also 8 bits. The PREEMPT_ACTIVE bit is bit 30, and bit 31 would
    make the preempt_count (which is type int) a negative number.
    A negative preempt_count is a sign of failure.

    Add them up (14+8+8+1+1) and you get 32 bits. No room for the NMI bit.

    But the hardirq_count is to track the number of nested IRQs, not
    the number of total IRQs. This originally took the paranoid approach
    of setting the max nesting to NR_IRQS. But when we have archs with
    over 1000 IRQs, it is not practical to think they will ever all
    nest on a single CPU. Not to mention that this would most definitely
    cause a stack overflow.

    This patch sets a max of 10 bits to be used for IRQ nesting.
    I did a 'git grep HARDIRQ' to examine all users of HARDIRQ_BITS and
    HARDIRQ_MASK, and found that making it a max of 10 would not hurt
    anyone. I did find that the m68k expected it to be 8 bits, so
    I allow for the archs to set the number to be less than 10.

    I removed the setting of HARDIRQ_BITS from the archs that set it
    to more than 10. This includes ALPHA, ia64 and avr32.

    This will always allow room for the NMI bit, and if we need to allow
    for NMI nesting, we have 4 bits to play with.
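
    The resulting layout, per the numbers above (a sketch of the bit
    budget, not the full set of derived masks):

        #define PREEMPT_BITS    8       /* preempt_disable() nesting  */
        #define SOFTIRQ_BITS    8       /* softirq/bh-disable nesting */
        #ifndef HARDIRQ_BITS
        #define HARDIRQ_BITS    10      /* hardirq nesting, now capped */
        #endif
        #define NMI_BITS        1       /* the new NMI context bit    */

        /* 8 + 8 + 10 + 1 = 27 bits, leaving room below PREEMPT_ACTIVE
         * (bit 30) and the sign bit. */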

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

08 Feb, 2009

1 commit

  • This code adds an in_nmi() macro that uses the current task's preempt count
    to track when it is in NMI context. Other parts of the kernel can
    use this to determine if the context is in NMI context or not.

    This code was inspired by the -rt patch in_nmi version that was
    written by Peter Zijlstra, who borrowed that code from
    Mathieu Desnoyers.
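
    Combined with the 10-bit hardirq layout described in the 13 Feb entries
    above, the macro reduces to a mask test (sketch):

        #define NMI_SHIFT       (HARDIRQ_SHIFT + HARDIRQ_BITS)
        #define NMI_MASK        (__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
        #define in_nmi()        (preempt_count() & NMI_MASK)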

    Reported-by: Andrew Morton
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

31 Dec, 2008

1 commit

  • * 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
    stacktrace: provide save_stack_trace_tsk() weak alias
    rcu: provide RCU options on non-preempt architectures too
    printk: fix discarding message when recursion_bug
    futex: clean up futex_(un)lock_pi fault handling
    "Tree RCU": scalable classic RCU implementation
    futex: rename field in futex_q to clarify single waiter semantics
    x86/swiotlb: add default swiotlb_arch_range_needs_mapping
    x86/swiotlb: add default phys<->bus conversion
    x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
    x86: add swiotlb allocation functions
    swiotlb: consolidate swiotlb info message printing
    swiotlb: support bouncing of HighMem pages
    swiotlb: factor out copy to/from device
    swiotlb: add arch hook to force mapping
    swiotlb: allow architectures to override phys<->bus<->phys conversions
    swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
    rcu: fix rcutorture behavior during reboot
    resources: skip sanity check of busy resources
    swiotlb: move some definitions to header
    swiotlb: allow architectures to override swiotlb pool allocation
    ...

    Fix up trivial conflicts in
    arch/x86/kernel/Makefile
    arch/x86/mm/init_32.c
    include/linux/hardirq.h
    as per Ingo's suggestions.

    Linus Torvalds
     

19 Dec, 2008

1 commit

  • This patch fixes a long-standing performance bug in classic RCU that
    results in massive internal-to-RCU lock contention on systems with
    more than a few hundred CPUs. Although this patch creates a separate
    flavor of RCU for ease of review and patch maintenance, it is intended
    to replace classic RCU.

    This patch still handles stress better than does mainline, so I am still
    calling it ready for inclusion. This patch is against the -tip tree.
    Nevertheless, experience on an actual 1000+ CPU machine would still be
    most welcome.

    Most of the changes noted below were found while creating an rcutiny
    (which should permit ejecting the current rcuclassic) and while doing
    detailed line-by-line documentation.

    Updates from v9 (http://lkml.org/lkml/2008/12/2/334):

    o Fixes from remainder of line-by-line code walkthrough,
    including comment spelling, initialization, undesirable
    narrowing due to type conversion, removing redundant memory
    barriers, removing redundant local-variable initialization,
    and removing redundant local variables.

    I do not believe that any of these fixes address the CPU-hotplug
    issues that Andi Kleen was seeing, but please do give it a whirl
    in case the machine is smarter than I am.

    A writeup from the walkthrough may be found at the following
    URL, in case you are suffering from terminal insomnia or
    masochism:

    http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf

    o Made rcutree tracing use seq_file, as suggested some time
    ago by Lai Jiangshan.

    o Added a .csv variant of the rcudata debugfs trace file, to allow
    people having thousands of CPUs to drop the data into
    a spreadsheet. Tested with oocalc and gnumeric. Updated
    documentation to suit.

    Updates from v8 (http://lkml.org/lkml/2008/11/15/139):

    o Fix a theoretical race between grace-period initialization and
    force_quiescent_state() that could occur if more than three
    jiffies were required to carry out the grace-period
    initialization. Which it might, if you had enough CPUs.

    o Apply Ingo's printk-standardization patch.

    o Substitute local variables for repeated accesses to global
    variables.

    o Fix comment misspellings and redundant (but harmless) increments
    of ->n_rcu_pending (this latter after having explicitly added it).

    o Apply checkpatch fixes.

    Updates from v7 (http://lkml.org/lkml/2008/10/10/291):

    o Fixed a number of problems noted by Gautham Shenoy, including
    the cpu-stall-detection bug that he was having difficulty
    convincing me was real. ;-)

    o Changed cpu-stall detection to wait for ten seconds rather than
    three in order to reduce false positives, as suggested by Ingo
    Molnar.

    o Produced a design document (http://lwn.net/Articles/305782/).
    The act of writing this document uncovered a number of both
    theoretical and "here and now" bugs as noted below.

    o Fix dynticks_nesting accounting confusion, simplify WARN_ON()
    condition, fix kerneldoc comments, and add memory barriers
    in dynticks interface functions.

    o Add more data to tracing.

    o Remove unused "rcu_barrier" field from rcu_data structure.

    o Count calls to rcu_pending() from scheduling-clock interrupt
    to use as a surrogate timebase should jiffies stop counting.

    o Fix a theoretical race between force_quiescent_state() and
    grace-period initialization. Yes, initialization does have to
    go on for some jiffies for this race to occur, but given enough
    CPUs...

    Updates from v6 (http://lkml.org/lkml/2008/9/23/448):

    o Fix a number of checkpatch.pl complaints.

    o Apply review comments from Ingo Molnar and Lai Jiangshan
    on the stall-detection code.

    o Fix several bugs in !CONFIG_SMP builds.

    o Fix a misspelled config-parameter name so that RCU now announces
    at boot time if stall detection is configured.

    o Run tests on numerous combinations of configuration parameters,
    which after the fixes above, now build and run correctly.

    Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):

    o Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
    changeset some time ago, and finally got around to retesting
    this option).

    o Fix some tracing bugs in rcupreempt that caused incorrect
    totals to be printed.

    o I now test with a more brutal random-selection online/offline
    script (attached). Probably more brutal than it needs to be
    on the people reading it as well, but so it goes.

    o A number of optimizations and usability improvements:

    o Make rcu_pending() ignore the grace-period timeout when
    there is no grace period in progress.

    o Make force_quiescent_state() avoid going for a global
    lock in the case where there is no grace period in
    progress.

    o Rearrange struct fields to improve struct layout.

    o Make call_rcu() initiate a grace period if RCU was
    idle, rather than waiting for the next scheduling
    clock interrupt.

    o Invoke rcu_irq_enter() and rcu_irq_exit() only when
    idle, as suggested by Andi Kleen. I still don't
    completely trust this change, and might back it out.

    o Make CONFIG_RCU_TRACE be the single config variable
    manipulated for all forms of RCU, instead of the prior
    confusion.

    o Document tracing files and formats for both rcupreempt
    and rcutree.

    Updates from v4 for those missing v5 given its bad subject line:

    o Separated dynticks interface so that NMIs and irqs call separate
    functions, greatly simplifying it. In particular, this code
    no longer requires a proof of correctness. ;-)

    o Separated dynticks state out into its own per-CPU structure,
    avoiding the duplicated accounting.

    o The case where a dynticks-idle CPU runs an irq handler that
    invokes call_rcu() is now correctly handled, forcing that CPU
    out of dynticks-idle mode.

    o Review comments have been applied (thank you all!!!).
    For but one example, fixed the dynticks-ordering issue that
    Manfred pointed out, saving me much debugging. ;-)

    o Adjusted rcuclassic and rcupreempt to handle dynticks changes.

    Attached is an updated patch to Classic RCU that applies a hierarchy,
    greatly reducing the contention on the top-level lock for large machines.
    This passes 10-hour concurrent rcutorture and online-offline testing on
    128-CPU ppc64 without dynticks enabled, and exposes some timekeeping
    bugs in the presence of dynticks (exciting working on a system where
    "sleep 1" hangs until interrupted...), which were fixed in the
    2.6.27 kernel. It is getting more reliable than mainline by some
    measures, so the next version will be against -tip for inclusion.
    See also Manfred Spraul's recent patches (or his earlier work from
    2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
    We will converge onto a common patch in the fullness of time, but are
    currently exploring different regions of the design space. That said,
    I have already gratefully stolen quite a few of Manfred's ideas.

    This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
    of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on
    64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
    there is no hierarchy. By default, the RCU initialization code will
    adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
    architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
    this balancing, allowing the hierarchy to be exactly aligned to the
    underlying hardware. Up to two levels of hierarchy are permitted
    (in addition to the root node), allowing up to 16,384 CPUs on 32-bit
    systems and up to 262,144 CPUs on 64-bit systems. I just know that I
    am going to regret saying this, but this seems more than sufficient
    for the foreseeable future. (Some architectures might wish to set
    CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
    If this becomes a real problem, additional levels can be added, but I
    doubt that it will make a significant difference on real hardware.)

    In the common case, a given CPU will manipulate its private rcu_data
    structure and the rcu_node structure that it shares with its immediate
    neighbors. This can reduce both lock and memory contention by multiple
    orders of magnitude, which should eliminate the need for the strange
    manipulations that are reported to be required when running Linux on
    very large systems.

    Some shortcomings:

    o More bugs will probably surface as a result of an ongoing
    line-by-line code inspection.

    Patches will be provided as required.

    o There are probably hangs, rcutorture failures, &c. Seems
    quite stable on a 128-CPU machine, but that is kind of small
    compared to 4096 CPUs. However, seems to do better than
    mainline.

    Patches will be provided as required.

    o The memory footprint of this version is several KB larger
    than rcuclassic.

    A separate UP-only rcutiny patch will be provided, which will
    reduce the memory footprint significantly, even compared
    to the old rcuclassic. One such patch passes light testing,
    and has a memory footprint smaller even than rcuclassic.
    Initial reaction from various embedded guys was "it is not
    worth it", so am putting it aside.

    Credits:

    o Manfred Spraul for ideas, review comments, and bugs spotted,
    as well as some good friendly competition. ;-)

    o Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
    Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton
    for reviews and comments.

    o Thomas Gleixner for much-needed help with some timer issues
    (see patches below).

    o Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos,
    Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton
    Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines
    alive despite my heavy abuse^Wtesting.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

07 Nov, 2008

1 commit

  • Impact: moving of function prototypes into own header file

    ftrace.h is too big of a file for hardirq.h, and some archs will fail
    to build because of the include dependencies not being met.

    This patch pulls out the required prototypes for hardirq.h into a smaller
    and safer ftrace_irq.h file.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

03 Nov, 2008

1 commit

  • Impact: build fix for non-ftrace architectures

    Not all archs implement ftrace, and therefore do not have an asm/ftrace.h.
    This patch corrects the problem.

    The ftrace_nmi_enter/exit now must be defined for all archs that implement
    dynamic ftrace. Currently, only x86 does.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

31 Oct, 2008

1 commit

  • Impact: fix crashes that can occur in NMI handlers, if their code is modified

    Modifying code is something that needs special care. On SMP boxes,
    if code that is being modified is also being executed on another CPU,
    that CPU will have undefined results.

    The dynamic ftrace uses kstop_machine to make the system act like a
    uniprocessor system. But this does not address NMIs, that can still
    run on other CPUs.

    One approach to handle this is to make all code that is used by NMIs
    not be traced. But NMIs can call notifiers that spread throughout the
    kernel and this will be very hard to maintain, and the chance of missing
    a function is very high.

    The approach that this patch takes is to have the NMIs modify the code
    if the modification is taking place. The way this works is that just
    writing to code executing on another CPU is not harmful if what is
    written is the same as what exists.

    Two buffers are used: an IP buffer and a "code" buffer.

    The steps that the patcher takes are:

    1) Put the instruction pointer into the IP buffer
    and the new code into the "code" buffer.
    2) Set a flag that says we are modifying code.
    3) Wait for any running NMIs to finish.
    4) Write the code.
    5) Clear the flag.
    6) Wait for any running NMIs to finish.

    If an NMI is executed, it will also write the pending code.
    Multiple writes are OK, because what is being written is the same.
    Then the patcher must wait for all running NMIs to finish before
    going to the next line that must be patched.

    This is basically the RCU approach to code modification.
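
    A user-space sketch of the protocol (names are hypothetical; the kernel
    version uses its own memory barriers and arch-specific text-write
    primitives, and the NMI-draining wait is only stubbed here):

        #include <stdatomic.h>
        #include <string.h>

        #define INSN_SIZE 5                      /* illustrative */

        static atomic_int mod_code_write;        /* "modifying code" flag */
        static void *mod_code_ip;                /* IP buffer */
        static char mod_code_newcode[INSN_SIZE]; /* "code" buffer */

        static void do_the_write(void)
        {
                /* identical bytes every time, so duplicate writes are OK */
                memcpy(mod_code_ip, mod_code_newcode, INSN_SIZE);
        }

        void nmi_handler(void)
        {
                if (atomic_load(&mod_code_write))
                        do_the_write();          /* NMI helps finish the write */
        }

        static void wait_for_nmis(void)
        {
                /* the real code spins until a count of in-flight NMIs
                 * drains; not modeled in this sketch */
        }

        void patch_one_site(void *ip, const void *newcode)
        {
                mod_code_ip = ip;                      /* step 1 */
                memcpy(mod_code_newcode, newcode, INSN_SIZE);
                atomic_store(&mod_code_write, 1);      /* step 2 */
                wait_for_nmis();                       /* step 3 */
                do_the_write();                        /* step 4 */
                atomic_store(&mod_code_write, 0);      /* step 5 */
                wait_for_nmis();                       /* step 6 */
        }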

    Thanks to Ingo Molnar for suggesting the idea, and to Arjan van de Ven
    for his guidance on what is safe and what is not.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

11 May, 2008

1 commit

  • The generic semaphore rewrite had a huge performance regression on AIM7
    (and potentially other BKL-heavy benchmarks) because the generic
    semaphores had been rewritten to be simple to understand and fair. The
    latter, in particular, turns a semaphore-based BKL implementation into a
    mess of scheduling.

    The attempt to fix the performance regression failed miserably (see the
    previous commit 00b41ec2611dc98f87f30753ee00a53db648d662 'Revert
    "semaphore: fix"'), and so for now the simple and sane approach is to
    instead just go back to the old spinlock-based BKL implementation that
    never had any issues like this.

    This patch also has the advantage of being reported to fix the
    regression completely according to Yanmin Zhang, unlike the semaphore
    hack which still left a couple percentage point regression.

    As a spinlock, the BKL obviously has the potential to be a latency
    issue, but it's not really any different from any other spinlock in that
    respect. We do want to get rid of the BKL asap, but that has been the
    plan for several years.

    These days, the biggest users are in the tty layer (open/release in
    particular) and Alan holds out some hope:

    "tty release is probably a few months away from getting cured - I'm
    afraid it will almost certainly be the very last user of the BKL in
    tty to get fixed as it depends on everything else being sanely locked."

    so while we're not there yet, we do have a plan of action.

    Tested-by: Yanmin Zhang
    Cc: Ingo Molnar
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Mar, 2008

1 commit


01 Mar, 2008

1 commit

  • The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The
    idle CPU will not progress the RCU through its grace period and a
    synchronize_rcu may get stuck. Without this patch I have a box that will
    not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine
    with this patch.

    This patch comes from the -rt kernel where it has been tested for
    several months.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

26 Jan, 2008

1 commit


10 Jul, 2007

1 commit


17 Feb, 2007

2 commits

  • With Ingo Molnar

    Add functions to provide dynamic ticks and high resolution timers. The code
    which keeps track of jiffies and handles the long idle periods is shared
    between tick based and high resolution timer based dynticks. The dyntick
    functionality can be disabled on the kernel command line. The
    infrastructure to support high resolution timers is also provided.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: john stultz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Uninline irq_enter(). [dynticks adds more stuff to it]

    No functional changes.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Cc: john stultz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

04 Oct, 2006

1 commit

  • This patch adds support for systems that cannot receive every interrupt on a
    single cpu simultaneously, in the check to see if we have enough HARDIRQ_BITS.

    MAX_HARDIRQS_PER_CPU becomes the maximum number of hardware-generated
    interrupts per cpu.

    On architectures that support per cpu interrupt delivery this can be a
    significant space savings and scalability bonus.
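
    The check this enables looks roughly like the following (the fallback
    keeps the old behaviour for architectures without per-cpu delivery):

        #ifndef MAX_HARDIRQS_PER_CPU
        #define MAX_HARDIRQS_PER_CPU    NR_IRQS
        #endif

        #if (1 << HARDIRQ_BITS) < MAX_HARDIRQS_PER_CPU
        # error HARDIRQ_BITS is too low!
        #endif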

    Signed-off-by: Eric W. Biederman
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Rajesh Shah
    Cc: Andi Kleen
    Cc: "Protasevich, Natalie"
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

04 Jul, 2006

2 commits

  • Do 'make oldconfig' and accept all the defaults for new config options -
    reboot into the kernel and if everything goes well it should boot up fine and
    you should have /proc/lockdep and /proc/lockdep_stats files.

    Typically if the lock validator finds some problem it will print out
    voluminous debug output that begins with "BUG: ..." and which syslog output
    can be used by kernel developers to figure out the precise locking scenario.

    What does the lock validator do? It "observes" and maps all locking rules as
    they occur dynamically (as triggered by the kernel's natural use of spinlocks,
    rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
    new locking scenario, it validates this new rule against the existing set of
    rules. If this new rule is consistent with the existing set of rules then the
    new rule is added transparently and the kernel continues as normal. If the
    new rule could create a deadlock scenario then this condition is printed out.

    When determining validity of locking, all possible "deadlock scenarios" are
    considered: assuming arbitrary number of CPUs, arbitrary irq context and task
    context constellations, running arbitrary combinations of all the existing
    locking scenarios. In a typical system this means millions of separate
    scenarios. This is why we call it a "locking correctness" validator - for all
    rules that are observed the lock validator proves it with mathematical
    certainty that a deadlock could not occur (assuming that the lock validator
    implementation itself is correct and its internal data structures are not
    corrupted by some other kernel subsystem). [see more details and conditionals
    of this statement in include/linux/lockdep.h and
    Documentation/lockdep-design.txt]

    Furthermore, this "all possible scenarios" property of the validator also
    enables the finding of complex, highly unlikely multi-CPU multi-context races
    via single single-context rules, increasing the likelihood of finding bugs
    drastically. In practical terms: the lock validator already found a bug in
    the upstream kernel that could only occur on systems with 3 or more CPUs, and
    which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
    That bug was found and reported on a single-CPU system (!). So in essence a
    race will be found "piecemeal-wise", triggering all the necessary components
    for the race, without having to reproduce the race scenario itself! In its
    short existence the lock validator found and reported many bugs before they
    actually caused a real deadlock.

    To further increase the efficiency of the validator, the mapping is not per
    "lock instance", but per "lock-class". For example, all struct inode objects
    in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
    then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
    type", and all locking activities that occur against ->inotify_mutex are
    "unified" into this single lock-class. The advantage of the lock-class
    approach is that all historical ->inotify_mutex uses are mapped into a single
    (and as narrow as possible) set of locking rules - regardless of how many
    different tasks or inode structures it took to build this set of rules. The
    set of rules persist during the lifetime of the kernel.

    To see the rough magnitude of checking that the lock validator does, here's a
    portion of /proc/lockdep_stats, fresh after bootup:

    lock-classes: 694 [max: 2048]
    direct dependencies: 1598 [max: 8192]
    indirect dependencies: 17896
    all direct dependencies: 16206
    dependency chains: 1910 [max: 8192]
    in-hardirq chains: 17
    in-softirq chains: 105
    in-process chains: 1065
    stack-trace entries: 38761 [max: 131072]
    combined max dependencies: 2033928
    hardirq-safe locks: 24
    hardirq-unsafe locks: 176
    softirq-safe locks: 53
    softirq-unsafe locks: 137
    irq-safe locks: 59
    irq-unsafe locks: 176

    The lock validator has observed 1598 actual single-thread locking patterns,
    and has validated all possible 2033928 distinct locking scenarios.

    More details about the design of the lock validator can be found in
    Documentation/lockdep-design.txt, which can also found at:

    http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt

    [bunk@stusta.de: cleanups]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Accurate hard-IRQ-flags and softirq-flags state tracing.

    This allows us to attach extra functionality to IRQ flags on/off
    events (such as trace-on/off).

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar