15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. The fix in commit
    5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a
    good example of the nasty type of bugs that can be created with
    improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589
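
    As an illustrative sketch (not the exact diff), the conversion
    simply drops the annotations; e.g. in kernel/softirq.c:

        -static int __cpuinit cpu_callback(struct notifier_block *nfb,
        +static int cpu_callback(struct notifier_block *nfb,
                                 unsigned long action, void *hcpu)

        -static struct notifier_block __cpuinitdata cpu_nfb = {
        +static struct notifier_block cpu_nfb = {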

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

03 Jul, 2013

1 commit

  • Pull core irq changes from Ingo Molnar:
    "The main changes:

    - generic-irqchip driver additions, cleanups and fixes

    - 3 new irqchip drivers: ARMv7-M NVIC, TB10x and Marvell Orion SoCs

    - irq_get_trigger_type() simplification and cross-arch cleanup

    - various cleanups, simplifications

    - documentation updates"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
    softirq: Use _RET_IP_
    genirq: Add the generic chip to the genirq docbook
    genirq: generic-chip: Export some irq_gc_ functions
    genirq: Fix can_request_irq() for IRQs without an action
    irqchip: exynos-combiner: Staticize combiner_init
    irqchip: Add support for ARMv7-M NVIC
    irqchip: Add TB10x interrupt controller driver
    irqdomain: Use irq_get_trigger_type() to get IRQ flags
    MIPS: octeon: Use irq_get_trigger_type() to get IRQ flags
    arm: orion: Use irq_get_trigger_type() to get IRQ flags
    mfd: stmpe: use irq_get_trigger_type() to get IRQ flags
    mfd: twl4030-irq: Use irq_get_trigger_type() to get IRQ flags
    gpio: mvebu: Use irq_get_trigger_type() to get IRQ flags
    genirq: Add irq_get_trigger_type() to get IRQ flags
    genirq: Irqchip: document gcflags arg of irq_alloc_domain_generic_chips
    genirq: Set irq thread to RT priority on creation
    irqchip: Add support for Marvell Orion SoCs
    genirq: Add kerneldoc for irq_disable.
    genirq: irqchip: Add mask to block out invalid irqs
    genirq: Generic chip: Add linear irq domain support
    ...

    Linus Torvalds
     

28 Jun, 2013

1 commit

  • Use the already defined macro to pass the function return address.

    Signed-off-by: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1367347569.1784.3.camel@buesod1.americas.hpqcorp.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

11 Jun, 2013

1 commit

  • The stop machine logic can lock up if all but one of the migration
    threads make it through the disable-irq step and the one remaining
    thread gets stuck in __do_softirq. The reason __do_softirq can hang is
    that it has a bail-out based on jiffies timeout, but in the lockup case,
    jiffies itself is not incremented.

    To work around this, re-add the max_restart counter in __do_softirq
    and stop processing irqs after 10 restarts.

    Thanks to Tejun Heo and Rusty Russell and others for helping me track
    this down.

    This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce
    latencies").

    It may be worth looking into ath9k to see if it has issues with its irq
    handler at a later date.
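
    Roughly, the resulting bail-out logic in __do_softirq() looks like
    this (a simplified sketch, not the exact patch):

        pending = local_softirq_pending();
        if (pending) {
                if (time_before(jiffies, end) && !need_resched() &&
                    --max_restart)
                        goto restart;

                wakeup_softirqd();      /* defer the rest to ksoftirqd */
        }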

    The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11
    Call Trace:
    warn_slowpath_common+0x85/0x9f
    warn_slowpath_fmt+0x46/0x48
    watchdog_overflow_callback+0x9c/0xa7
    __perf_event_overflow+0x137/0x1cb
    perf_event_overflow+0x14/0x16
    intel_pmu_handle_irq+0x2dc/0x359
    perf_event_nmi_handler+0x19/0x1b
    nmi_handle+0x7f/0xc2
    do_nmi+0xbc/0x304
    end_repeat_nmi+0x1e/0x2e
    <>
    cpu_stopper_thread+0xae/0x162
    smpboot_thread_fn+0x258/0x260
    kthread+0xc7/0xcf
    ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:

    __do_softirq+0x117/0x257
    irq_exit+0x5f/0xbb
    smp_apic_timer_interrupt+0x8a/0x98
    apic_timer_interrupt+0x72/0x80

    printk+0x4d/0x4f
    stop_machine_cpu_stop+0x22c/0x274
    cpu_stopper_thread+0xae/0x162
    smpboot_thread_fn+0x258/0x260
    kthread+0xc7/0xcf
    ret_from_fork+0x7c/0xb0

    Signed-off-by: Ben Greear
    Acked-by: Tejun Heo
    Acked-by: Pekka Riikonen
    Cc: Eric Dumazet
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Ben Greear
     

06 May, 2013

1 commit

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to give input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to a crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     

23 Apr, 2013

1 commit

  • Eventually try to disable tick on irq exit, now that the
    fundamental infrastructure is in place.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

03 Apr, 2013

1 commit

  • We are planning to convert the dynticks Kconfig options layout
    into a choice menu. The user must be able to easily pick
    any of the following implementations: constant periodic tick,
    idle dynticks, full dynticks.

    As this implies a mutual exclusion, the two dynticks implementations
    need to converge on the selection of a common Kconfig option in order
    to ease the sharing of a common infrastructure.

    It would thus seem pretty natural to reuse CONFIG_NO_HZ to
    that end. It already implements all the idle dynticks code
    and the full dynticks depends on all that code for now.
    So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
    CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ.

    On the other hand we want to stay backward compatible: if
    CONFIG_NO_HZ is set in an older config file, we want to
    enable CONFIG_NO_HZ_IDLE by default.

    But we can't afford both at the same time or we run into
    a circular dependency:

    1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
    CONFIG_NO_HZ
    2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE

    We might be able to support that from Kconfig/Kbuild but it
    may not be wise to introduce such a confusing behaviour.

    So to solve this, create a new CONFIG_NO_HZ_COMMON option
    which gathers the common code between idle and full dynticks
    (that common code for now is simply the idle dynticks code)
    and select it from their referring Kconfig.

    Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
    to it for backward compatibility.
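
    A sketch of the intended Kconfig layout (simplified; option names
    and prompts may differ from the final tree):

        config NO_HZ_COMMON
                bool
                select TICK_ONESHOT

        config NO_HZ_IDLE
                bool "Idle dynticks system (tickless idle)"
                select NO_HZ_COMMON

        config NO_HZ_EXTENDED
                bool "Full dynticks system (tickless)"
                select NO_HZ_COMMON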

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

06 Mar, 2013

1 commit

  • Pull irq fixes and cleanups from Thomas Gleixner:
    "Commit e5ab012c3271 ("nohz: Make tick_nohz_irq_exit() irq safe") is
    the first commit in the series and the minimal necessary bugfix, which
    needs to go back into stable.

    The remaining commits enforce irq disabling in irq_exit(), sanitize
    the hardirq/softirq preempt count transition and remove a bunch of no
    longer necessary conditionals."

    I personally love getting rid of the very subtle and confusing
    IRQ_EXIT_OFFSET thing. Even apart from the whole "more lines removed
    than added" thing.

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irq: Don't re-enable interrupts at the end of irq_exit
    irq: Remove IRQ_EXIT_OFFSET workaround
    Revert "nohz: Make tick_nohz_irq_exit() irq safe"
    irq: Sanitize invoke_softirq
    irq: Ensure irq_exit() code runs with interrupts disabled
    nohz: Make tick_nohz_irq_exit() irq safe

    Linus Torvalds
     

01 Mar, 2013

1 commit

  • Commit 74eed0163d0def3fce27228d9ccf3d36e207b286
    "irq: Ensure irq_exit() code runs with interrupts disabled"
    restores the interrupt flags at the end of irq_exit() for archs
    that don't define __ARCH_IRQ_EXIT_IRQS_DISABLED.

    However always returning from irq_exit() with interrupts
    disabled should not be a problem for these archs. Prior to
    this commit this was already happening anytime we processed
    pending softirqs anyway.

    Suggested-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney

    Frederic Weisbecker
     

22 Feb, 2013

3 commits

  • The IRQ_EXIT_OFFSET trick was used to make sure the irq
    doesn't get preempted after we subtract the HARDIRQ_OFFSET
    until we are entirely done with any code in irq_exit().

    This workaround was necessary because some archs may call
    irq_exit() with irqs enabled, and there is still some code
    at the end of this function that is not covered by the
    HARDIRQ_OFFSET but wants to stay non-preemptible.

    Now that irqs are always disabled in irq_exit(), the whole code
    is guaranteed not to be preempted. We can thus remove this hack.

    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney

    Frederic Weisbecker
     
  • With the irq protection in irq_exit(), we can remove the #ifdeffery and
    the bh_disable/enable dance in invoke_softirq().
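
    The result is something like this (a simplified sketch):

        static inline void invoke_softirq(void)
        {
                if (!force_irqthreads)
                        __do_softirq();
                else
                        wakeup_softirqd();
        }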

    Signed-off-by: Thomas Gleixner
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302202155320.22263@ionos

    Thomas Gleixner
     
  • We had already a few problems with code called from irq_exit() when
    interrupted from a nesting interrupt. This can happen on architectures
    which do not define __ARCH_IRQ_EXIT_IRQS_DISABLED.

    __ARCH_IRQ_EXIT_IRQS_DISABLED should go away and we want to make it
    mandatory to call irq_exit() with interrupts disabled.

    As a temporary protection, disable interrupts for those architectures
    which do not define __ARCH_IRQ_EXIT_IRQS_DISABLED and add a WARN_ONCE
    when an architecture which defines __ARCH_IRQ_EXIT_IRQS_DISABLED calls
    irq_exit() with interrupts enabled.
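
    A minimal sketch of the temporary protection (simplified from the
    actual patch):

        void irq_exit(void)
        {
        #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
                local_irq_disable();
        #else
                WARN_ON_ONCE(!irqs_disabled());
        #endif
                /* ... softirq and nohz handling ... */
        }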

    Signed-off-by: Thomas Gleixner
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302202155320.22263@ionos

    Thomas Gleixner
     

21 Feb, 2013

1 commit

  • Pull networking update from David Miller:

    1) Checkpoint/restarted TCP sockets now can properly propagate the TCP
    timestamp offset. From Andrey Vagin.

    2) VMWARE VM VSOCK layer, from Andy King.

    3) Much improved support for virtual functions and SR-IOV in bnx2x,
    from Ariel ELior.

    4) All protocols on ipv4 and ipv6 are now network namespace aware, and
    all the compatibility checks for initial-namespace-only protocols are
    removed. Thanks to Tom Parkin for helping deal with the last major
    holdout, L2TP.

    5) IPV6 support in netpoll and network namespace support in pktgen,
    from Cong Wang.

    6) Multiple Registration Protocol (MRP) and Multiple VLAN Registration
    Protocol (MVRP) support, from David Ward.

    7) Compute packet lengths more accurately in the packet scheduler, from
    Eric Dumazet.

    8) Use per-task page fragment allocator in skb_append_datato_frags(),
    also from Eric Dumazet.

    9) Add support for connection tracking labels in netfilter, from
    Florian Westphal.

    10) Fix default multicast group joining on ipv6, and add anti-spoofing
    checks to 6to4 and 6rd. From Hannes Frederic Sowa.

    11) Make ipv4/ipv6 fragmentation memory limits more reasonable in modern
    times, rearrange inet frag datastructures for better cacheline
    locality, and move more operations outside of locking. From Jesper
    Dangaard Brouer.

    12) Instead of strict master slave relationships, allow arbitrary
    scenarios with "upper device lists". From Jiri Pirko.

    13) Improve rate limiting accuracy in TBF and act_police, also from Jiri
    Pirko.

    14) Add a BPF filter netfilter match target, from Willem de Bruijn.

    15) Orphan and delete a bunch of pre-historic networking drivers from
    Paul Gortmaker.

    16) Add TSO support for GRE tunnels, from Pravin B Shelar. Although
    this still needs some minor bug fixing before it's 100% correct in
    all cases.

    17) Handle unresolved IPSEC states like ARP, with a resolution packet
    queue. From Steffen Klassert.

    18) Remove TCP Appropriate Byte Count support (ABC), from Stephen
    Hemminger. This was long overdue.

    19) Support SO_REUSEPORT, from Tom Herbert.

    20) Allow locking a socket BPF filter, so that it cannot change after a
    process drops capabilities.

    21) Add VLAN filtering to bridge, from Vlad Yasevich.

    22) Bring ipv6 on-par with ipv4 and do not cache neighbour entries in
    the ipv6 routes, from YOSHIFUJI Hideaki.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1538 commits)
    ipv6: fix race condition regarding dst->expires and dst->from.
    net: fix a wrong assignment in skb_split()
    ip_gre: remove an extra dst_release()
    ppp: set qdisc_tx_busylock to avoid LOCKDEP splat
    atl1c: restore buffer state
    net: fix a build failure when !CONFIG_PROC_FS
    net: ipv4: fix waring -Wunused-variable
    net: proc: fix build failed when procfs is not configured
    Revert "xen: netback: remove redundant xenvif_put"
    net: move procfs code to net/core/net-procfs.c
    qmi_wwan, cdc-ether: add ADU960S
    bonding: set sysfs device_type to 'bond'
    bonding: fix bond_release_all inconsistencies
    b44: use netdev_alloc_skb_ip_align()
    xen: netback: remove redundant xenvif_put
    net: fec: Do a sanity check on the gpio number
    ip_gre: propogate target device GSO capability to the tunnel device
    ip_gre: allow CSUM capable devices to handle packets
    bonding: Fix initialize after use for 3ad machine state spinlock
    bonding: Fix race condition between bond_enslave() and bond_3ad_update_lacp_rate()
    ...

    Linus Torvalds
     

28 Jan, 2013

1 commit

  • While remotely reading the cputime of a task running in a
    full dynticks CPU, the values stored in the utime/stime fields
    of struct task_struct may be stale. These values may be those
    of the last kernel <-> user transition time snapshot, and
    we need to add the tickless time spent since this snapshot.

    To fix this, flush the cputime of the dynticks CPUs on
    kernel <-> user transition and record the time / context
    where we did this. Then on top of this snapshot and the current
    time, perform the fixup on the reader side from task_times()
    accessors.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    [fixed kvm module related build errors]
    Signed-off-by: Sedat Dilek

    Frederic Weisbecker
     

11 Jan, 2013

1 commit

  • In various network workloads, __do_softirq() latencies can be up
    to 20 ms if HZ=1000, and 200 ms if HZ=100.

    This is because we iterate 10 times in the softirq dispatcher,
    and some actions can consume a lot of cycles.

    This patch changes the fallback-to-ksoftirqd condition to:

    - A time limit of 2 ms.
    - need_resched() being set on the current task

    When one of these conditions is met, we wake up ksoftirqd for further
    softirq processing if we still have pending softirqs.

    Using need_resched() as the only condition can trigger RCU stalls,
    as we can keep BH disabled for too long.
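
    A minimal sketch of the resulting loop shape (simplified, not the
    exact patch; MAX_SOFTIRQ_TIME is the 2 ms budget):

        asmlinkage void __do_softirq(void)
        {
                unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
                u32 pending;

        restart:
                /* ... handle each pending softirq ... */

                pending = local_softirq_pending();
                if (pending) {
                        if (time_before(jiffies, end) && !need_resched())
                                goto restart;

                        wakeup_softirqd();
                }
        }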

    I ran several benchmarks and got no significant difference in
    throughput, but a very significant reduction of latencies (one order
    of magnitude) :

    In following bench, 200 antagonist "netperf -t TCP_RR" are started in
    background, using all available cpus.

    Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
    IRQ (hard+soft)

    Before patch :

    # netperf -H 7.7.7.84 -t TCP_RR -T2,2 -- -k
    RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
    MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
    to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
    RT_LATENCY=550110.424
    MIN_LATENCY=146858
    MAX_LATENCY=997109
    P50_LATENCY=305000
    P90_LATENCY=550000
    P99_LATENCY=710000
    MEAN_LATENCY=376989.12
    STDDEV_LATENCY=184046.92

    After patch :

    # netperf -H 7.7.7.84 -t TCP_RR -T2,2 -- -k
    RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
    MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
    to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
    RT_LATENCY=40545.492
    MIN_LATENCY=9834
    MAX_LATENCY=78366
    P50_LATENCY=33583
    P90_LATENCY=59000
    P99_LATENCY=69000
    MEAN_LATENCY=38364.67
    STDDEV_LATENCY=12865.26

    Signed-off-by: Eric Dumazet
    Cc: David Miller
    Cc: Tom Herbert
    Cc: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Oct, 2012

1 commit

  • With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account()
    is called in irq entry/exit, we perform a check on the
    context: if we are interrupting the idle task we
    account the pending cputime to idle, otherwise account
    to system time or its sub-areas: tsk->stime, hardirq time,
    softirq time, ...

    However this check for idle only concerns the hardirq entry
    and softirq entry:

    * Hardirq may directly interrupt the idle task, in which case
    we need to flush the pending CPU time to idle.

    * The idle task may be directly interrupted by a softirq if
    it calls local_bh_enable(). There is probably no such call
    in any idle task but we need to cover every case. Ksoftirqd
    is not concerned because the idle time is flushed on context
    switch, and softirqs at the end of a hardirq already have the
    idle time flushed by the hardirq entry.

    In the other cases we always account to system/irq time:

    * On hardirq exit we account the time to hardirq time.
    * On softirq exit we account the time to softirq time.

    To optimize this and avoid the indirect call to vtime_account()
    and the checks it performs, specialize the vtime irq APIs and
    only perform the check on irq entry. Irq exit can directly call
    vtime_account_system().

    CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change and directly
    maps to its own vtime_account() implementation. One may want
    to take advantage of the new APIs to optimize irq time accounting
    as well in the future.
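
    A sketch of the specialized APIs described above (simplified; the
    idle check is performed only on irq entry):

        void vtime_account_irq_enter(struct task_struct *tsk)
        {
                if (!in_interrupt() && is_idle_task(tsk))
                        vtime_account_idle(tsk);
                else
                        vtime_account_system(tsk);
        }

        void vtime_account_irq_exit(struct task_struct *tsk)
        {
                /* no context check needed on irq exit */
                vtime_account_system(tsk);
        }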

    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Paul Gortmaker

    Frederic Weisbecker
     

02 Oct, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

25 Sep, 2012

1 commit

  • Use a naming based on vtime as a prefix for virtual based
    cputime accounting APIs:

    - account_system_vtime() -> vtime_account()
    - account_switch_vtime() -> vtime_task_switch()

    It makes it easier to allow for further declension such
    as vtime_account_system(), vtime_account_idle(), ... if we
    want to find out the context we account to from generic code.

    This also makes it clearer which subsystem these APIs
    belong to.

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

13 Aug, 2012

1 commit

  • [ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Srivatsa S. Bhat
    Cc: Rusty Russell
    Reviewed-by: Paul E. McKenney
    Cc: Namhyung Kim
    Link: http://lkml.kernel.org/r/20120716103948.456416747@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

01 Aug, 2012

1 commit

  • This is needed to allow network softirq packet processing to make use of
    PF_MEMALLOC.

    Currently softirq context cannot use PF_MEMALLOC due to it not being
    associated with a task, and therefore not having task flags to fiddle with
    - thus the gfp-to-alloc-flags mapping ignores the task flags when in
    interrupt (hard or soft) context.

    Allowing softirqs to make use of PF_MEMALLOC therefore requires some
    trickery. This patch borrows the task flags from whatever process happens
    to be preempted by the softirq. It then modifies the gfp to alloc flags
    mapping to not exclude task flags in softirq context, and modify the
    softirq code to save, clear and restore the PF_MEMALLOC flag.

    The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't
    leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag
    cannot leak back into the preempted process. This should be safe due to
    the following reasons:

    Softirqs can run on multiple CPUs, sure, but the same task should not be
    executing the same softirq code. Neither should the softirq
    handler be preempted by any other softirq handler so the flags
    should not leak to an unrelated softirq.

    Softirqs re-enable hardware interrupts in __do_softirq() so can be
    preempted by hardware interrupts so PF_MEMALLOC is inherited
    by the hard IRQ. However, this is similar to a process in
    reclaim being preempted by a hardirq. While PF_MEMALLOC is
    set, gfp_to_alloc_flags() distinguishes between hard and
    soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
    flag.

    If the softirq is deferred to ksoftirqd then its flags may be used
    instead of a normal task's, but as the softirq cannot be preempted,
    the PF_MEMALLOC flag does not leak to other code by accident.
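
    A minimal sketch of the save/clear/restore dance in __do_softirq()
    (simplified):

        asmlinkage void __do_softirq(void)
        {
                unsigned long old_flags = current->flags;

                /* mask out PF_MEMALLOC inherited from the preempted task */
                current->flags &= ~PF_MEMALLOC;

                /* ... process pending softirqs ... */

                /* restore PF_MEMALLOC without clobbering newer flag changes */
                tsk_restore_flags(current, old_flags, PF_MEMALLOC);
        }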

    [davem@davemloft.net: Document why PF_MEMALLOC is safe]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

21 Mar, 2012

3 commits

  • Pull timer changes for v3.4 from Ingo Molnar

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
    ntp: Fix integer overflow when setting time
    math: Introduce div64_long
    cs5535-clockevt: Allow the MFGPT IRQ to be shared
    cs5535-clockevt: Don't ignore MFGPT on SMP-capable kernels
    x86/time: Eliminate unused irq0_irqs counter
    clocksource: scx200_hrt: Fix the build
    x86/tsc: Reduce the TSC sync check time for core-siblings
    timer: Fix bad idle check on irq entry
    nohz: Remove ts->inidle checks before restarting the tick
    nohz: Remove update_ts_time_stat from tick_nohz_start_idle
    clockevents: Leave the broadcast device in shutdown mode when not needed
    clocksource: Load the ACPI PM clocksource asynchronously
    clocksource: scx200_hrt: Convert scx200 to use clocksource_register_hz
    clocksource: Get rid of clocksource_calc_mult_shift()
    clocksource: dbx500: convert to clocksource_register_hz()
    clocksource: scx200_hrt: use pr_ instead of printk
    time: Move common updates to a function
    time: Reorder so the hot data is together
    time: Remove most of xtime_lock usage in timekeeping.c
    ntp: Add ntp_lock to replace xtime_locking
    ...

    Linus Torvalds
     
  • Pull scheduler changes for v3.4 from Ingo Molnar

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    printk: Make it compile with !CONFIG_PRINTK
    sched/x86: Fix overflow in cyc2ns_offset
    sched: Fix nohz load accounting -- again!
    sched: Update yield() docs
    printk/sched: Introduce special printk_sched() for those awkward moments
    sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
    sched: Cleanup cpu_active madness
    sched: Fix load-balance wreckage
    sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
    sched: Ditch per cgroup task lists for load-balancing
    sched: Rename load-balancing fields
    sched: Move load-balancing arguments into helper struct
    sched/rt: Do not submit new work when PI-blocked
    sched/rt: Prevent idle task boosting
    sched/wait: Add __wake_up_all_locked() API
    sched/rt: Document scheduler related skip-resched-check sites
    sched/rt: Use schedule_preempt_disabled()
    sched/rt: Add schedule_preempt_disabled()
    sched/rt: Do not throttle when PI boosting
    sched/rt: Keep period timer ticking when rt throttling is active
    ...

    Linus Torvalds
     
  • Pull perf events changes for v3.4 from Ingo Molnar:

    - New "hardware based branch profiling" feature both on the kernel and
    the tooling side, on CPUs that support it. (modern x86 Intel CPUs
    with the 'LBR' hardware feature currently.)

    This new feature is basically a sophisticated 'magnifying glass' for
    branch execution - something that is pretty difficult to extract from
    regular, function histogram centric profiles.

    The simplest mode is activated via 'perf record -b', and the result
    looks like this in perf report:

    $ perf record -b any_call,u -e cycles:u branchy

    $ perf report -b --sort=symbol
    52.34% [.] main [.] f1
    24.04% [.] f1 [.] f3
    23.60% [.] f1 [.] f2
    0.01% [k] _IO_new_file_xsputn [k] _IO_file_overflow
    0.01% [k] _IO_vfprintf_internal [k] _IO_new_file_xsputn
    0.01% [k] _IO_vfprintf_internal [k] strchrnul
    0.01% [k] __printf [k] _IO_vfprintf_internal
    0.01% [k] main [k] __printf

    This output shows from/to branch columns and shows the highest
    percentage (from,to) jump combinations - i.e. the most likely taken
    branches in the system. "branches" can also include function calls
    and any other synchronous and asynchronous transitions of the
    instruction pointer that are not 'next instruction' - such as system
    calls, traps, interrupts, etc.

    This feature comes with (hopefully intuitive) flat ascii and TUI
    support in perf report.

    - Various 'perf annotate' visual improvements for us assembly junkies.
    It will now recognize function calls in the TUI and by hitting enter
    you can follow the call (recursively) and back, amongst other
    improvements.

    - Multiple threads/processes recording support in perf record, perf
    stat, perf top - which is activated via a comma-list of PIDs:

    perf top -p 21483,21485
    perf stat -p 21483,21485 -ddd
    perf record -p 21483,21485

    - Support for per UID views, via the --uid parameter to perf top, perf
    report, etc. For example 'perf top --uid mingo' will only show the
    tasks that I am running, excluding other users, root, etc.

    - Jump label restructurings and improvements - this includes the
    factoring out of the (hopefully much clearer) include/linux/static_key.h
    generic facility:

    struct static_key key = STATIC_KEY_INIT_FALSE;

    ...

    if (static_key_false(&key))
            do unlikely code
    else
            do likely code

    ...
    static_key_slow_inc(&key);
    ...
    static_key_slow_dec(&key);
    ...

    The static_key_false() branch will be generated into the code with as
    little impact to the likely code path as possible. The
    static_key_slow_*() APIs flip the branch via live kernel code patching.

    This facility can now be used more widely within the kernel to
    micro-optimize hot branches whose likelihood matches the static-key
    usage and fast/slow cost patterns.

    - SW function tracer improvements: perf support and filtering support.

    - Various hardenings of the perf.data ABI, to make older perf.data's
    smoother on newer tool versions, to make new features integrate more
    smoothly, to support cross-endian recording/analyzing workflows
    better, etc.

    - Restructuring of the kprobes code, the splitting out of 'optprobes',
    and a corner case bugfix.

    - Allow the tracing of kernel console output (printk).

    - Improvements/fixes to user-space RDPMC support, allowing user-space
    self-profiling code to extract PMU counts without performing any
    system calls, while playing nice with the kernel side.

    - 'perf bench' improvements

    - ... and lots of internal restructurings, cleanups and fixes that made
    these features possible. And, as usual, this list is incomplete as
    there were also lots of other improvements.

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
    perf report: Fix annotate double quit issue in branch view mode
    perf report: Remove duplicate annotate choice in branch view mode
    perf/x86: Prettify pmu config literals
    perf report: Enable TUI in branch view mode
    perf report: Auto-detect branch stack sampling mode
    perf record: Add HEADER_BRANCH_STACK tag
    perf record: Provide default branch stack sampling mode option
    perf tools: Make perf able to read files from older ABIs
    perf tools: Fix ABI compatibility bug in print_event_desc()
    perf tools: Enable reading of perf.data files from different ABI rev
    perf: Add ABI reference sizes
    perf report: Add support for taken branch sampling
    perf record: Add support for sampling taken branch
    perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
    x86/kprobes: Split out optprobe related code to kprobes-opt.c
    x86/kprobes: Fix a bug which can modify kernel code permanently
    x86/kprobes: Fix instruction recovery on optimized path
    perf: Add callback to flush branch_stack on context switch
    perf: Disable PERF_SAMPLE_BRANCH_* when not supported
    perf/x86: Add LBR software filter support for Intel CPUs
    ...

    Linus Torvalds
     

01 Mar, 2012

2 commits

  • Create a distinction between scheduler related preempt_enable_no_resched()
    calls and the nearly one hundred other places in the kernel that do not
    want to reschedule, for one reason or another.

    This distinction matters for -rt, where the scheduler and the non-scheduler
    preempt models (and checks) are different. For upstream it's purely
    documentational.
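
    Illustratively (the scheduler-side name follows the eventual API;
    treat it as a sketch):

        /* scheduler code, deliberately skipping the resched check: */
        sched_preempt_enable_no_resched();

        /* everybody else keeps the plain form: */
        preempt_enable_no_resched();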

    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/n/tip-gs88fvx2mdv5psnzxnv575ke@git.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Coccinelle based conversion.
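
    The conversion pattern, roughly:

        -       preempt_enable_no_resched();
        -       schedule();
        -       preempt_disable();
        +       schedule_preempt_disabled();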

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-24swm5zut3h9c4a6s46x8rws@git.kernel.org
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

15 Feb, 2012

1 commit

  • idle_cpu() is called on irq entry to guess if we need to call
    tick_check_idle(). This way we can catch up with jiffies if the tick
    was stopped, stop accounting idle time during the interrupt and
    maintain the sched clock if it is unstable.

    But if we are going to exit the idle loop to schedule a new task (ie:
    if we have a task in the runqueue or a remotely enqueued ttwu to
    perform), the idle_cpu() check will return 0 such that we miss the
    call to tick_check_idle() for all interrupts happening before we
    schedule the new task.

    As a result these interrupts and the softirqs coming along may deal
    with stale jiffies values, bad sched clock values, and won't subtract
    their time from the idle time accounting.

    Fix this by using is_idle_task() instead, which strictly checks that
    we are running the idle task, without caring about the fact that we
    are going to schedule a task soon.
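
    Roughly, the fix on irq entry (an illustrative diff):

        -       if (idle_cpu(cpu) && !in_interrupt()) {
        +       if (is_idle_task(current) && !in_interrupt()) {
                        tick_check_idle(cpu);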

    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: John Stultz
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/1327427984-23282-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Thomas Gleixner

    Frederic Weisbecker
     

03 Feb, 2012

1 commit

  • The __raise_softirq_irqoff() function contains a tracepoint. As
    tracepoints in headers can cause issues, and bloat the kernel when
    they are in a static inline, it is best to move the function that
    contains the tracepoint out of the header and into softirq.c.
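
    After the move, the header keeps only a declaration while softirq.c
    holds the body with the tracepoint (a sketch):

        /* include/linux/interrupt.h */
        extern void __raise_softirq_irqoff(unsigned int nr);

        /* kernel/softirq.c */
        void __raise_softirq_irqoff(unsigned int nr)
        {
                trace_softirq_raise(nr);
                or_softirq_pending(1UL << nr);
        }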

    Link: http://lkml.kernel.org/r/20120118120711.GB14863@elte.hu

    Suggested-by: Ingo Molnar
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

12 Dec, 2011

2 commits

  • On the irq exit path, tick_nohz_irq_exit()
    may raise a softirq, which leads to the wakeup
    path and to select_task_rq_fair(), which uses RCU
    to iterate the domains.

    This is an illegal use of RCU because we may be in RCU
    extended quiescent state if we interrupted an RCU-idle
    window in the idle loop:

    [ 132.978883] ===============================
    [ 132.978883] [ INFO: suspicious RCU usage. ]
    [ 132.978883] -------------------------------
    [ 132.978883] kernel/sched_fair.c:1707 suspicious rcu_dereference_check() usage!
    [ 132.978883]
    [ 132.978883] other info that might help us debug this:
    [ 132.978883]
    [ 132.978883]
    [ 132.978883] rcu_scheduler_active = 1, debug_locks = 0
    [ 132.978883] RCU used illegally from extended quiescent state!
    [ 132.978883] 2 locks held by swapper/0:
    [ 132.978883] #0: (&p->pi_lock){-.-.-.}, at: [] try_to_wake_up+0x39/0x2f0
    [ 132.978883] #1: (rcu_read_lock){.+.+..}, at: [] select_task_rq_fair+0x6a/0xec0
    [ 132.978883]
    [ 132.978883] stack backtrace:
    [ 132.978883] Pid: 0, comm: swapper Tainted: G W 3.0.0+ #178
    [ 132.978883] Call Trace:
    [ 132.978883] [] lockdep_rcu_suspicious+0xe6/0x100
    [ 132.978883] [] select_task_rq_fair+0x749/0xec0
    [ 132.978883] [] ? select_task_rq_fair+0x6a/0xec0
    [ 132.978883] [] ? do_raw_spin_lock+0x54/0x150
    [ 132.978883] [] ? trace_hardirqs_on+0xd/0x10
    [ 132.978883] [] try_to_wake_up+0xd3/0x2f0
    [ 132.978883] [] ? ktime_get+0x68/0xf0
    [ 132.978883] [] wake_up_process+0x15/0x20
    [ 132.978883] [] raise_softirq_irqoff+0x65/0x110
    [ 132.978883] [] __hrtimer_start_range_ns+0x415/0x5a0
    [ 132.978883] [] ? do_raw_spin_unlock+0x5e/0xb0
    [ 132.978883] [] hrtimer_start+0x18/0x20
    [ 132.978883] [] tick_nohz_stop_sched_tick+0x393/0x450
    [ 132.978883] [] irq_exit+0xd2/0x100
    [ 132.978883] [] do_IRQ+0x66/0xe0
    [ 132.978883] [] common_interrupt+0x13/0x13
    [ 132.978883] [] ? native_safe_halt+0xb/0x10
    [ 132.978883] [] ? trace_hardirqs_on+0xd/0x10
    [ 132.978883] [] default_idle+0xba/0x370
    [ 132.978883] [] amd_e400_idle+0x5e/0x130
    [ 132.978883] [] cpu_idle+0xb6/0x120
    [ 132.978883] [] rest_init+0xef/0x150
    [ 132.978883] [] ? rest_init+0x52/0x150
    [ 132.978883] [] start_kernel+0x3da/0x3e5
    [ 132.978883] [] x86_64_start_reservations+0x131/0x135
    [ 132.978883] [] x86_64_start_kernel+0x103/0x112

    Fix this by calling rcu_idle_enter() after tick_nohz_irq_exit().

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • The tick_nohz_stop_sched_tick() function, which tries to delay
    the next timer tick as long as possible, can be called from two
    places:

    - From the idle loop to start the dynticks idle mode
    - From interrupt exit if we have interrupted the dyntick
    idle mode, so that we reprogram the next tick event in
    case the irq changed some internal state that requires this
    action.

    There are only a few minor differences between the two that
    are handled by that function, driven by the ts->inidle
    cpu variable and the inidle parameter. The whole guarantees
    that we only update the dyntick mode on irq exit if we actually
    interrupted the dyntick idle mode, and that we enter in RCU extended
    quiescent state from idle loop entry only.

    Split this function into:

    - tick_nohz_idle_enter(), which sets ts->inidle to 1, enters
    dynticks idle mode unconditionally if it can, and enters into RCU
    extended quiescent state.

    - tick_nohz_irq_exit() which only updates the dynticks idle mode
    when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called).

    To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed
    into tick_nohz_idle_exit().

    This simplifies the code and micro-optimizes the irq exit path (no need
    for local_irq_save there). This also prepares for the split between
    dynticks and rcu extended quiescent state logics. We'll need this split to
    further fix illegal uses of RCU in extended quiescent states in the idle
    loop.
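
    A sketch of the resulting call sites (simplified):

        /* idle loop */
        tick_nohz_idle_enter();
        while (!need_resched())
                /* ... halt ... */;
        tick_nohz_idle_exit();

        /* irq exit path */
        if (!in_interrupt())
                tick_nohz_irq_exit();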

    Signed-off-by: Frederic Weisbecker
    Cc: Mike Frysinger
    Cc: Guan Xuetao
    Cc: David Miller
    Cc: Chris Metcalf
    Cc: Hans-Christian Egtvedt
    Cc: Ralf Baechle
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Russell King
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

21 Jul, 2011

1 commit

  • The rcu_read_unlock_special() function relies on in_irq() to exclude
    scheduler activity from interrupt level. This fails because irq_exit()
    can invoke the scheduler after clearing the preempt_count() bits that
    in_irq() uses to determine that it is at interrupt level. This situation
    can result in failures as follows:

    $task                     IRQ                  SoftIRQ

    rcu_read_lock()

    /* do stuff */

    <preempt> |= UNLOCK_BLOCKED

    rcu_read_unlock()
      --t->rcu_read_lock_nesting

                              irq_enter();
                              /* do stuff, don't use RCU */
                              irq_exit();
                                sub_preempt_count(IRQ_EXIT_OFFSET);
                                invoke_softirq()

                                                   ttwu();
                                                     spin_lock_irq(&pi->lock)
                                                     rcu_read_lock();
                                                     /* do stuff */
                                                     rcu_read_unlock();
                                                       rcu_read_unlock_special()
                                                         rcu_report_exp_rnp()
                                                           ttwu()
                                                             spin_lock_irq(&pi->lock) /* deadlock */

    rcu_read_unlock_special(t);

    Ed can simply trigger this 'easy' because invoke_softirq() immediately
    does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff
    first, but even without that the above happens.

    Cure this by also excluding softirqs from the
    rcu_read_unlock_special() handler and ensuring the force_irqthreads
    ksoftirqd/# wakeup is done from full softirq context.

    [ Alternatively, delaying the ->rcu_read_lock_nesting decrement
    until after the special handling would make the thing more robust
    in the face of interrupts as well. And there is a separate patch
    for that. ]
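
    A minimal sketch of the added exclusion (simplified):

        /* in rcu_read_unlock_special(): treat softirq like hardirq
         * when deciding whether we may touch the scheduler */
        if (in_irq() || in_serving_softirq()) {
                /* defer: no ttwu()/rt_mutex_unlock() from here */
        }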

    Cc: Thomas Gleixner
    Reported-and-tested-by: Ed Tomlinson
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
     

15 Jun, 2011

1 commit

  • Commit a26ac2455ffcf3 ("rcu: move TREE_RCU from softirq to kthread")
    introduced a performance regression. In an AIM7 test, this commit degraded
    performance by about 40%.

    The commit runs rcu callbacks in a kthread instead of softirq. We observed
    a high rate of context switches caused by this. Our test system has
    64 CPUs and HZ is 1000, so we saw more than 64k context switches per second
    caused by RCU's per-CPU kthread. A trace showed that most of
    the time the RCU per-CPU kthread doesn't actually handle any callbacks,
    but instead just does a very small amount of work handling grace periods.
    This means that RCU's per-CPU kthreads are making the scheduler do quite
    a bit of work in order to allow a very small amount of RCU-related
    processing to be done.

    Alex Shi's analysis determined that this slowdown is due to lock
    contention within the scheduler. Unfortunately, as Peter Zijlstra points
    out, the scheduler's real-time semantics require global action, which
    means that this contention is inherent in real-time scheduling. (Yes,
    perhaps someone will come up with a workaround -- otherwise, -rt is not
    going to do well on large SMP systems -- but this patch will work around
    this issue in the meantime. And "the meantime" might well be forever.)

    This patch therefore re-introduces softirq processing to RCU, but only
    for core RCU work. RCU callbacks are still executed in kthread context,
    so that only a small amount of RCU work runs in softirq context in the
    common case. This should minimize ksoftirqd execution, allowing us to
    skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.
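
    The core of the change is small (a sketch): grace-period work is
    kicked via RCU_SOFTIRQ again, while callback invocation stays in
    the kthread:

        static void invoke_rcu_core(void)
        {
                raise_softirq(RCU_SOFTIRQ);
        }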

    Signed-off-by: Shaohua Li
    Tested-by: "Alex,Shi"
    Signed-off-by: Paul E. McKenney

    Shaohua Li
     

06 May, 2011

1 commit

  • If RCU priority boosting is to be meaningful, callback invocation must
    be boosted in addition to preempted RCU readers. Otherwise, in the presence
    of CPU real-time threads, the grace period ends, but the callbacks don't
    get invoked. If the callbacks don't get invoked, the associated memory
    doesn't get freed, so the system is still subject to OOM.

    But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
    moves the callback invocations to a kthread, which can be boosted easily.

    Also add comments and properly synchronize all accesses to
    rcu_cpu_kthread_task, as suggested by Lai Jiangshan.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

23 Mar, 2011

1 commit

  • ksoftirqd, kworker, migration, and pktgend kthreads can be created with
    kthread_create_on_node(), to get proper NUMA affinities for their stack and
    task_struct.
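
    Example usage (simplified from the ksoftirqd case):

        struct task_struct *p;

        p = kthread_create_on_node(run_ksoftirqd, hcpu,
                                   cpu_to_node(cpu),
                                   "ksoftirqd/%d", cpu);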

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Acked-by: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

16 Mar, 2011

1 commit

  • * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (116 commits)
    x86: Enable forced interrupt threading support
    x86: Mark low level interrupts IRQF_NO_THREAD
    x86: Use generic show_interrupts
    x86: ioapic: Avoid redundant lookup of irq_cfg
    x86: ioapic: Use new move_irq functions
    x86: Use the proper accessors in fixup_irqs()
    x86: ioapic: Use irq_data->state
    x86: ioapic: Simplify irq chip and handler setup
    x86: Cleanup the genirq name space
    genirq: Add chip flag to force mask on suspend
    genirq: Add desc->irq_data accessor
    genirq: Add comments to Kconfig switches
    genirq: Fixup fasteoi handler for oneshot mode
    genirq: Provide forced interrupt threading
    sched: Switch wait_task_inactive to schedule_hrtimeout()
    genirq: Add IRQF_NO_THREAD
    genirq: Allow shared oneshot interrupts
    genirq: Prepare the handling of shared oneshot interrupts
    genirq: Make warning in handle_percpu_event useful
    x86: ioapic: Move trigger defines to io_apic.h
    ...

    Fix up trivial(?) conflicts in arch/x86/pci/xen.c due to genirq name
    space changes clashing with the Xen cleanups. The set_irq_msi() had
    moved to xen_bind_pirq_msi_to_irq().

    Linus Torvalds
     

26 Feb, 2011

1 commit

  • Add a command line parameter "threadirqs" which forces all interrupts except
    those marked IRQF_NO_THREAD to run threaded. That's mostly a debug option to
    allow retrieving better debug data from crashing interrupt handlers. If
    "threadirqs" is not enabled on the kernel command line, then there is no
    impact in the interrupt hotpath.

    Architecture code needs to select CONFIG_IRQ_FORCED_THREADING after
    marking the interrupts which can't be threaded with IRQF_NO_THREAD. All
    interrupts which have IRQF_TIMER set are implicitly marked
    IRQF_NO_THREAD. Also all PER_CPU interrupts are excluded.

    Forced threading hard interrupts also forces all soft interrupt
    handling into thread context.
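
    For example, a driver can opt out of forced threading by marking
    its interrupt when requesting it (my_handler and dev are
    placeholders):

        ret = request_irq(irq, my_handler, IRQF_NO_THREAD,
                          "my-device", dev);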

    When enabled it might slow down things a bit, but for debugging problems in
    interrupt code it's a reasonable penalty as it does not immediately
    crash and burn the machine when an interrupt handler is buggy.

    Some test results on a Core2Duo machine:

    Cache cold run of:
    # time git grep irq_desc

                 non-threaded    threaded
    real         1m18.741s       1m19.061s
    user         0m1.874s        0m1.757s
    sys          0m5.843s        0m5.427s

    # iperf -c server
    non-threaded
    [ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec
    [ 3] 0.0-10.0 sec 1.09 GBytes 934 Mbits/sec
    [ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec
    threaded
    [ 3] 0.0-10.0 sec 1.09 GBytes 939 Mbits/sec
    [ 3] 0.0-10.0 sec 1.09 GBytes 934 Mbits/sec
    [ 3] 0.0-10.0 sec 1.09 GBytes 937 Mbits/sec

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    LKML-Reference:

    Thomas Gleixner