25 Feb, 2014
6 commits
-
The name __smp_call_function_single() doesn't tell much about the
properties of this function, especially when compared to
smp_call_function_single().

The comments above the implementation are also misleading. The main
point of this function is actually not to be able to embed the csd
in an object. This is actually a requirement that results from the
purpose of this function, which is to raise an IPI asynchronously.

As such it can be called with interrupts disabled. And this feature
comes at the cost of the caller, who then needs to serialize the
IPIs on this csd.

Let's rename the function and enhance the comments so that they reflect
these properties.

Suggested-by: Christoph Hellwig
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Frederic Weisbecker
Signed-off-by: Jens Axboe
-
The main point of calling __smp_call_function_single() is to send
an IPI in a pure asynchronous way. By embedding a csd in an object,
a caller can send the IPI without waiting for a previous one to complete
as is required by smp_call_function_single() for example. As such,
sending this kind of IPI can be safe even when irqs are disabled.

This flexibility comes at the expense of the caller, who then needs to
synchronize the csd lifecycle itself and make sure that IPIs on a
single csd are serialized.

This is how __smp_call_function_single() works when wait = 0 and this
use case is relevant.

Now there doesn't seem to be any use case with wait = 1 that can't be
covered by smp_call_function_single() instead, which is safer. Let's look
at the two possible scenarios:

1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
in an object. It looks like a nice and convenient pattern at first
sight because we can then retrieve the object from the IPI handler easily.

But actually it is a waste of memory space in the object since the csd
can be allocated from the stack by smp_call_function_single(wait = 1)
and the object can be passed as the IPI argument.

Besides that, embedding the csd in an object is more error prone
because the caller must take care of the serialization of the IPIs
for this csd.

2) The user calls __smp_call_function_single(wait = 1) on a csd that
is allocated on the stack. It's ok but smp_call_function_single()
can do it as well and it already takes care of the allocation on the
stack. Again it's simpler and less error prone.

Therefore, using the underscore-prefixed API version with wait = 1
is a bad pattern and a sign that the caller can do something safer and
simpler.

There was a single user of that which has just been converted.
So let's remove this option to discourage further users.

Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Frederic Weisbecker
Signed-off-by: Jens Axboe
-
Move this function closer to __smp_call_function_single(). These functions
have very similar behavior and should be displayed in the same block
for clarity.

Reviewed-by: Jan Kara
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Frederic Weisbecker
Signed-off-by: Jens Axboe
-
__smp_call_function_single() and smp_call_function_single() share some
code that can be factorized: execute inline when the target is local,
check if the target is online, lock the csd, call generic_exec_single().

Let's move the common parts to generic_exec_single().

Reviewed-by: Jan Kara
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Frederic Weisbecker
Signed-off-by: Jens Axboe
-
Align __smp_call_function_single() with smp_call_function_single() so
that it also checks whether the requested cpu is still online.

Signed-off-by: Jan Kara
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ingo Molnar
Cc: Jens Axboe
Signed-off-by: Frederic Weisbecker
Signed-off-by: Jens Axboe
-
The IPI function llist iteration is open coded. Let's simplify this
by using an llist iterator.

Also we want to keep the iteration safe against possible
csd.llist->next value reuse from the IPI handler. At least the block
subsystem used to do such things, so let's stay careful and use
llist_for_each_entry_safe().

Signed-off-by: Jan Kara
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ingo Molnar
Cc: Jens Axboe
Signed-off-by: Frederic Weisbecker
Signed-off-by: Jens Axboe
31 Jan, 2014
2 commits
-
After commit 9a46ad6d6df3 ("smp: make smp_call_function_many() use logic
similar to smp_call_function_single()"), cfd->cpumask is accessed only
in smp_call_function_many(). So there is no more need to copy it into
cfd->cpumask_ipi before putting csd into the list. The cpumask_ipi
field is obsolete and can be removed.

Signed-off-by: Roman Gushchin
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: Wang YanQing
Cc: Xie XiuQi
Cc: Shaohua Li
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Make smp_call_function_single and friends more efficient by using a
lockless list.

Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Nov, 2013
2 commits
-
Signed-off-by: Christoph Hellwig
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We've switched over every architecture that supports SMP to it, so
remove the now useless config variable.

Signed-off-by: Christoph Hellwig
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Nov, 2013
1 commit
-
Pull block IO core updates from Jens Axboe:
"This is the pull request for the core changes in the block layer for
3.13. It contains:

- The new blk-mq request interface.

  This is a new and more scalable queueing model that marries the
  best part of the request based interface we currently have (which
  is fully featured, but scales poorly) and the bio based "interface"
  which the new drivers for high IOPS devices end up using because
  it's much faster than the request based one.

  The bio interface has no block layer support, since it taps into
  the stack much earlier. This means that drivers end up having to
  implement a lot of functionality on their own, like tagging,
  timeout handling, requeue, etc. The blk-mq interface provides all
  these. Some drivers even provide a switch to select bio or rq and
  have code to handle both, since things like merging only work in
  the rq model, which is hence faster for some workloads. This is a
  huge mess. Conversion of these drivers nets us a substantial code
  reduction. Initial results on converting SCSI to this model even
  show an 8x improvement on single queue devices. So while the
  model was intended to work on the newer multiqueue devices, it has
  substantial improvements for "classic" hardware as well. This code
  has gone through extensive testing and development, it's now ready
  to go. A pull request to convert virtio-blk to this model will be
  coming as well, with more drivers scheduled for 3.14 conversion.

- Two blktrace fixes from Jan and Chen Gang.
- A plug merge fix from Alireza Haghdoost.
- Conversion of __get_cpu_var() from Christoph Lameter.
- Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.
- A fix for a race between request completion and the timeout
  handling from Jeff Moyer. This is what caused the merge conflict
  with blk-mq/core, in case you are looking at that.
- A dm stacking fix from Mike Snitzer.
- A code consolidation fix and duplicated code removal from Kent
  Overstreet.
- A handful of block bug fixes from Mikulas Patocka, fixing a loop
  crash and memory corruption on blk cg.
- Elevator switch bug fix from Tomoki Sekiyama.

A heads-up that I had to rebase this branch. Initially the immutable
bio_vecs had been queued up for inclusion, but a week later, it became
clear that it wasn't fully cooked yet. So the decision was made to
pull this out and postpone it until 3.14. It was a straightforward
rebase, just pruning out the immutable series and the later fixes of
problems with it. The rest of the patches applied directly and no
further changes were made"

* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: Do not call sector_div() with a 64-bit divisor
kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
block: Consolidate duplicated bio_trim() implementations
block: Use rw_copy_check_uvector()
block: Enable sysfs nomerge control for I/O requests in the plug list
block: properly stack underlying max_segment_size to DM device
elevator: acquire q->sysfs_lock in elevator_change()
elevator: Fix a race in elevator switching and md device initialization
block: Replace __get_cpu_var uses
bdi: test bdi_init failure
block: fix a probe argument to blk_register_region
loop: fix crash if blk_alloc_queue fails
blk-core: Fix memory corruption if blkcg_init_queue fails
block: fix race between request completion and timeout handling
blktrace: Send BLK_TN_PROCESS events to all running traces
blk-mq: don't disallow request merges for req->special being set
blk-mq: mq plug list breakage
blk-mq: fix for flush deadlock
...
25 Oct, 2013
2 commits
-
blk-mq reuses the request potentially immediately, since the most
cache hot is always given out first. This means that rq->csd could
be reused between csd->func() being called and csd_unlock() being
called. This isn't a problem, since we never use wait == 1 for
the smp call function. Add CSD_FLAG_WAIT to be able to tell the
difference, retaining the warning for other cases.

Cc: Ingo Molnar
Signed-off-by: Jens Axboe
-
The blk-mq core and the blk-mq null driver use it.

Reviewed-by: Christoph Hellwig
Acked-by: Ingo Molnar
Signed-off-by: Jens Axboe
01 Oct, 2013
1 commit
-
Turn it into (for example):
[ 0.073380] x86: Booting SMP configuration:
[ 0.074005] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 0.603005] .... node #1, CPUs: #8 #9 #10 #11 #12 #13 #14 #15
[ 1.200005] .... node #2, CPUs: #16 #17 #18 #19 #20 #21 #22 #23
[ 1.796005] .... node #3, CPUs: #24 #25 #26 #27 #28 #29 #30 #31
[ 2.393005] .... node #4, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
[ 2.996005] .... node #5, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
[ 3.600005] .... node #6, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
[ 4.202005] .... node #7, CPUs: #56 #57 #58 #59 #60 #61 #62 #63
[ 4.811005] .... node #8, CPUs: #64 #65 #66 #67 #68 #69 #70 #71
[ 5.421006] .... node #9, CPUs: #72 #73 #74 #75 #76 #77 #78 #79
[ 6.032005] .... node #10, CPUs: #80 #81 #82 #83 #84 #85 #86 #87
[ 6.648006] .... node #11, CPUs: #88 #89 #90 #91 #92 #93 #94 #95
[ 7.262005] .... node #12, CPUs: #96 #97 #98 #99 #100 #101 #102 #103
[ 7.865005] .... node #13, CPUs: #104 #105 #106 #107 #108 #109 #110 #111
[ 8.466005] .... node #14, CPUs: #112 #113 #114 #115 #116 #117 #118 #119
[ 9.073006] .... node #15, CPUs: #120 #121 #122 #123 #124 #125 #126 #127
[ 9.679901] x86: Booted up 16 nodes, 128 CPUs

and drop useless elements.

Change num_digits() to hpa's division-avoiding, cell-phone-typed
version which he went to great lengths and pains to submit on a
Saturday evening.

Signed-off-by: Borislav Petkov
Cc: huawei.libin@huawei.com
Cc: wangyijing@huawei.com
Cc: fenghua.yu@intel.com
Cc: guohanjun@huawei.com
Cc: paul.gortmaker@windriver.com
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20130930095624.GB16383@pd.tnic
Signed-off-by: Ingo Molnar
12 Sep, 2013
2 commits
-
As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
!SMP version of on_each_cpu()"), we don't want to enable irqs if they
are not already enabled.

I don't know of any bugs currently caused by this unconditional
local_irq_enable(), but I want to use this function in MIPS/OCTEON early
boot (when we have early_boot_irqs_disabled). This also makes this
function have similar semantics to on_each_cpu() which is good in
itself.

Signed-off-by: David Daney
Cc: Gilad Ben-Yossef
Cc: Christoph Lameter
Cc: Chris Metcalf
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When a failure occurs in hotplug_cfd(), the related resources need to be
released, or memory will leak.

Signed-off-by: Chen Gang
Acked-by: Wang YanQing
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Sep, 2013
1 commit
-
Pull scheduler changes from Ingo Molnar:
"Various optimizations, cleanups and smaller fixes - no major changes
in scheduler behavior"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix the sd_parent_degenerate() code
sched/fair: Rework and comment the group_imb code
sched/fair: Optimize find_busiest_queue()
sched/fair: Make group power more consistent
sched/fair: Remove duplicate load_per_task computations
sched/fair: Shrink sg_lb_stats and play memset games
sched: Clean-up struct sd_lb_stat
sched: Factor out code to should_we_balance()
sched: Remove one division operation in find_busiest_queue()
sched/cputime: Use this_cpu_add() in task_group_account_field()
cpumask: Fix cpumask leak in partition_sched_domains()
sched/x86: Optimize switch_mm() for multi-threaded workloads
generic-ipi: Kill unnecessary variable - csd_flags
numa: Mark __node_set() as __always_inline
sched/fair: Cleanup: remove duplicate variable declaration
sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing
19 Aug, 2013
1 commit
-
Fix locking description: after commit 8969a5ede0f9e17da4b9437
("generic-ipi: remove kmalloc()"), wait = 0 can be guaranteed
because we don't kmalloc() anymore.

Signed-off-by: Xie XiuQi
Cc: Sheng Yang
Cc: Peter Zijlstra
Cc: Jens Axboe
Cc: Rusty Russell
Link: http://lkml.kernel.org/r/51F5E6F8.1000801@huawei.com
Signed-off-by: Ingo Molnar
31 Jul, 2013
1 commit
-
After commit 8969a5ede0f9e17da4b943712429aef2c9bcd82b
("generic-ipi: remove kmalloc()"), wait = 0 can be guaranteed,
and all callsites of generic_exec_single() do an unconditional
csd_lock() now.

So csd_flags is unnecessary now. Remove it.

Signed-off-by: Xie XiuQi
Signed-off-by: Peter Zijlstra
Cc: Oleg Nesterov
Cc: Linus Torvalds
Cc: Nick Piggin
Cc: Jens Axboe
Cc: "Paul E. McKenney"
Cc: Rusty Russell
Link: http://lkml.kernel.org/r/51F72DA1.7010401@huawei.com
Signed-off-by: Ingo Molnar
15 Jul, 2013
1 commit
-
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the uses of the __cpuinit macros from C files in
the core kernel directories (kernel, init, lib, mm, and include)
that don't really have a specific maintainer.

[1] https://lkml.org/lkml/2013/5/20/589

Signed-off-by: Paul Gortmaker
01 May, 2013
2 commits
-
We sometimes use "struct call_single_data *data" and sometimes "struct
call_single_data *csd". Use "csd" consistently.

We sometimes use "struct call_function_data *data" and sometimes "struct
call_function_data *cfd". Use "cfd" consistently.

Also, avoid some 80-col layout tricks.

Cc: Ingo Molnar
Cc: Jens Axboe
Cc: Peter Zijlstra
Cc: Shaohua Li
Cc: Shaohua Li
Cc: Steven Rostedt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
csd_lock() uses assignment to data->flags rather than |=. That is not
buggy at present because only one bit (CSD_FLAG_LOCK) is defined in
call_single_data.flags.

But it will become buggy if we later add another flag, so fix it now.

Signed-off-by: liguang
Cc: Peter Zijlstra
Cc: Oleg Nesterov
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Feb, 2013
1 commit
-
I'm testing a swapout workload on a two-socket Xeon machine. The workload
has 10 threads, and each thread sequentially accesses a separate memory
region. TLB flush overhead is very big in the workload. For each page,
page reclaim needs to move it from the active lru list and then unmap it.
Both need a TLB flush. And this is a multithreaded workload, so TLB flushes
happen on 10 CPUs. On x86, TLB flush uses the generic smp_call_function. So
this workload stresses smp_call_function_many heavily.

Without patch, perf shows:
+ 24.49% [k] generic_smp_call_function_interrupt
- 21.72% [k] _raw_spin_lock
- _raw_spin_lock
+ 79.80% __page_check_address
+ 6.42% generic_smp_call_function_interrupt
+ 3.31% get_swap_page
+ 2.37% free_pcppages_bulk
+ 1.75% handle_pte_fault
+ 1.54% put_super
+ 1.41% grab_super_passive
+ 1.36% __swap_duplicate
+ 0.68% blk_flush_plug_list
+ 0.62% swap_info_get
+ 6.55% [k] flush_tlb_func
+ 6.46% [k] smp_call_function_many
+ 5.09% [k] call_function_interrupt
+ 4.75% [k] default_send_IPI_mask_sequence_phys
+ 2.18% [k] find_next_bit

swapout throughput is around 1300M/s.

With the patch, perf shows:
- 27.23% [k] _raw_spin_lock
- _raw_spin_lock
+ 80.53% __page_check_address
+ 8.39% generic_smp_call_function_single_interrupt
+ 2.44% get_swap_page
+ 1.76% free_pcppages_bulk
+ 1.40% handle_pte_fault
+ 1.15% __swap_duplicate
+ 1.05% put_super
+ 0.98% grab_super_passive
+ 0.86% blk_flush_plug_list
+ 0.57% swap_info_get
+ 8.25% [k] default_send_IPI_mask_sequence_phys
+ 7.55% [k] call_function_interrupt
+ 7.47% [k] smp_call_function_many
+ 7.25% [k] flush_tlb_func
+ 3.81% [k] _raw_spin_lock_irqsave
+ 3.78% [k] generic_smp_call_function_single_interrupt

swapout throughput is around 1400M/s. So there is around a 7%
improvement, and total cpu utilization doesn't change.

Without the patch, cfd_data is shared by all CPUs.
generic_smp_call_function_interrupt does read/write cfd_data several times
which will create a lot of cache ping-pong. With the patch, the data
becomes per-cpu. The ping-pong is avoided. And from the perf data, this
doesn't make the call_single_queue lock contended.

Next step is to remove generic_smp_call_function_interrupt() from arch
code.

Signed-off-by: Shaohua Li
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Steven Rostedt
Cc: Jens Axboe
Cc: Linus Torvalds
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Jan, 2013
1 commit
-
I get the following warning every day with v3.7, once or
twice a day:

[ 2235.186027] WARNING: at /mnt/sda7/kernel/linux/arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2f/0xb8()

As explained by Linus as well:
|
| Once we've done the "list_add_rcu()" to add it to the
| queue, we can have (another) IPI to the target CPU that can
| now see it and clear the mask.
|
| So by the time we get to actually send the IPI, the mask might
| have been cleared by another IPI.
|

This patch also fixes a system hang problem, if the data->cpumask
gets cleared after passing this point:

        if (WARN_ONCE(!mask, "empty IPI mask"))
                return;

then the problem in commit 83d349f35e1a ("x86: don't send an IPI to
the empty set of CPU's") will happen again.

Signed-off-by: Wang YanQing
Acked-by: Linus Torvalds
Acked-by: Jan Beulich
Cc: Paul E. McKenney
Cc: Andrew Morton
Cc: peterz@infradead.org
Cc: mina86@mina86.org
Cc: srivatsa.bhat@linux.vnet.ibm.com
Cc:
Link: http://lkml.kernel.org/r/20130126075357.GA3205@udknight
[ Tidied up the changelog and the comment in the code. ]
Signed-off-by: Ingo Molnar
05 Jun, 2012
1 commit
-
There are no users of those APIs anymore, so just remove them.

Signed-off-by: Yong Zhang
Cc: ralf@linux-mips.org
Cc: sshtylyov@mvista.com
Cc: david.daney@cavium.com
Cc: nikunj@linux.vnet.ibm.com
Cc: paulmck@linux.vnet.ibm.com
Cc: axboe@kernel.dk
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/1338275765-3217-11-git-send-email-yong.zhang0@gmail.com
Acked-by: Srivatsa S. Bhat
Acked-by: Peter Zijlstra
Signed-off-by: Thomas Gleixner
08 May, 2012
1 commit
-
Will replace the misnamed cpu_idle_wait() function which is copied a
gazillion times all over arch/*

Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20120507175652.049316594@linutronix.de
04 May, 2012
1 commit
-
percpu areas are already allocated during boot for each possible cpu.
percpu idle threads can be considered as an extension of the percpu areas,
and allocate them for each possible cpu during boot.

This will eliminate the need for workqueue based idle thread allocation.
In future we can move the idle thread area into the percpu area too.

[ tglx: Moved the loop into smpboot.c and added an error check when
the init code failed to allocate an idle thread for a cpu which
should be onlined ]

Signed-off-by: Suresh Siddha
Cc: Peter Zijlstra
Cc: Rusty Russell
Cc: Paul E. McKenney
Cc: Srivatsa S. Bhat
Cc: Tejun Heo
Cc: David Rientjes
Cc: venki@google.com
Link: http://lkml.kernel.org/r/1334966930.28674.245.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner
29 Mar, 2012
2 commits
-
Add the on_each_cpu_cond() function that wraps on_each_cpu_mask() and
calculates the cpumask of cpus to IPI by calling a function supplied as a
parameter in order to determine whether to IPI each specific cpu.

The function works around allocation failure of the cpumask variable in
CONFIG_CPUMASK_OFFSTACK=y by iterating over cpus, sending one IPI at a time
via smp_call_function_single().

The function is useful since it allows separating the specific code that
decides in each case whether to IPI a specific cpu for a specific request
from the common boilerplate code of handling creating the mask, handling
failures etc.

[akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
[akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
[akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
Signed-off-by: Gilad Ben-Yossef
Cc: Chris Metcalf
Cc: Christoph Lameter
Acked-by: Peter Zijlstra
Cc: Frederic Weisbecker
Cc: Russell King
Cc: Pekka Enberg
Cc: Matt Mackall
Cc: Sasha Levin
Cc: Rik van Riel
Cc: Andi Kleen
Cc: Alexander Viro
Cc: Avi Kivity
Acked-by: Michal Nazarewicz
Cc: Kosaki Motohiro
Cc: Milton Miller
Reviewed-by: "Srivatsa S. Bhat"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We have lots of infrastructure in place to partition multi-core systems
such that we have a group of CPUs that are dedicated to a specific task:
cgroups, scheduler and interrupt affinity, and the isolcpus= boot parameter.
Still, kernel code will at times interrupt all CPUs in the system via IPIs
for various needs. These IPIs are useful and cannot be avoided
altogether, but in certain cases it is possible to interrupt only specific
CPUs that have useful work to do and not the entire system.

This patch set, inspired by discussions with Peter Zijlstra and Frederic
Weisbecker when testing the nohz task patch set, is a first stab at trying
to explore doing this by locating the places where such global IPI calls
are being made and turning the global IPI into an IPI for a specific group
of CPUs. The purpose of the patch set is to get feedback if this is the
right way to go for dealing with this issue and indeed, if the issue is
even worth dealing with at all. Based on the feedback from this patch set
I plan to offer further patches that address similar issues in other code
paths.

This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
infrastructure API (the former derived from existing arch specific
versions in Tile and Arm) and uses them to turn several global IPI
invocations into per CPU group invocations.

Core kernel:

on_each_cpu_mask() calls a function on processors specified by cpumask,
which may or may not include the local processor.

You must not call this function with disabled interrupts or from a
hardware interrupt handler or from a bottom half handler.

arch/arm:

Note that the generic version is a little different than the Arm one:
1. It has the mask as first parameter
2. It calls the function on the calling CPU with interrupts disabled,
but this should be OK since the function is called on the other CPUs
with interrupts disabled anyway.

arch/tile:

The API is the same as the tile private one, but the generic version
also calls the function with interrupts disabled in the UP case.

This is OK since the function is called on the other CPUs
with interrupts disabled.

Signed-off-by: Gilad Ben-Yossef
Reviewed-by: Christoph Lameter
Acked-by: Chris Metcalf
Acked-by: Peter Zijlstra
Cc: Frederic Weisbecker
Cc: Russell King
Cc: Pekka Enberg
Cc: Matt Mackall
Cc: Rik van Riel
Cc: Andi Kleen
Cc: Sasha Levin
Cc: Mel Gorman
Cc: Alexander Viro
Cc: Avi Kivity
Acked-by: Michal Nazarewicz
Cc: Kosaki Motohiro
Cc: Milton Miller
Cc: Russell King
Acked-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Oct, 2011
1 commit
-
The changed files were only including linux/module.h for the
EXPORT_SYMBOL infrastructure, and nothing else. Revector them
onto the isolated export header for faster compile times.

Nothing to see here but a whole lot of instances of:

-#include <linux/module.h>
+#include <linux/export.h>

This commit is only changing the kernel dir; next targets
will probably be mm, fs, the arch dirs, etc.

Signed-off-by: Paul Gortmaker
17 Jun, 2011
1 commit
-
There is a problem that kdump(2nd kernel) sometimes hangs up due
to a pending IPI from 1st kernel. Kernel panic occurs because IPI
comes before call_single_queue is initialized.

To fix the crash, rename init_call_single_data() to call_function_init()
and call it in start_kernel() so that call_single_queue can be
initialized before enabling interrupts.

The details of the crash are:

(1) 2nd kernel boots up
(2) A pending IPI from 1st kernel comes when irqs are first enabled
    in start_kernel().
(3) Kernel tries to handle the interrupt, but call_single_queue
    is not initialized yet at this point. As a result, in the
    generic_smp_call_function_single_interrupt(), NULL pointer
    dereference occurs when list_replace_init() tries to access
    &q->list.next.

Therefore this patch changes the name of init_call_single_data()
to call_function_init() and calls it before local_irq_enable()
in start_kernel().

Signed-off-by: Takao Indoh
Reviewed-by: WANG Cong
Acked-by: Neil Horman
Acked-by: Vivek Goyal
Acked-by: Peter Zijlstra
Cc: Milton Miller
Cc: Jens Axboe
Cc: Paul E. McKenney
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/D6CBEE2F420741indou.takao@jp.fujitsu.com
Signed-off-by: Ingo Molnar
23 Mar, 2011
1 commit
-
Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter
setup functions from init/main.c to kernel/smp.c, which saves some #ifdef
CONFIG_SMP.

Signed-off-by: WANG Cong
Cc: Rakib Mullick
Cc: David Howells
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Arnd Bergmann
Cc: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Mar, 2011
4 commits
-
Use the newly added smp_call_func_t in smp_call_function_interrupt for
the func variable, and make the comment above the WARN more assertive
and explicit. Also, func is a function pointer and does not need an
offset, so use %pf not %pS.

Signed-off-by: Milton Miller
Signed-off-by: Linus Torvalds
-
Mike Galbraith reported finding a lockup ("perma-spin bug") where the
cpumask passed to smp_call_function_many was cleared by other cpu(s)
while a cpu was preparing its call_data block, resulting in no cpu to
clear the last ref and unlock the block.

Having cpus clear their bit asynchronously could be useful on a mask of
cpus that might have a translation context, or cpus that need a push to
complete an rcu window.

Instead of adding a BUG_ON and requiring yet another cpumask copy, just
detect the race and handle it.

Note: arch_send_call_function_ipi_mask must still handle an empty
cpumask because the data block is globally visible before that arch
callback is made. And (obviously) there are no guarantees to which cpus
are notified if the mask is changed during the call; only cpus that were
online and had their mask bit set during the whole call are guaranteed
to be called.

Reported-by: Mike Galbraith
Reported-by: Jan Beulich
Acked-by: Jan Beulich
Cc: stable@kernel.org
Signed-off-by: Milton Miller
Signed-off-by: Linus Torvalds
-
Paul McKenney's review pointed out two problems with the barriers in the
2.6.38 update to the smp call function many code.

First, a barrier that would force the func and info members of data to
be visible before their consumption in the interrupt handler was
missing. This can be solved by adding a smp_wmb between setting the
func and info members and setting the cpumask; this will pair
with the existing and required smp_rmb ordering the cpumask read before
the read of refs. This placement avoids the need for a second smp_rmb in
the interrupt handler which would be executed on each of the N cpus
executing the call request. (I was thinking this barrier was present
but it was not).

Second, the previous write to refs (establishing the zero that the
interrupt handler was testing from all cpus) was performed by a third
party cpu. This would invoke transitivity which, as a recent or
concurrent addition to memory-barriers.txt now explicitly states, would
require a full smp_mb().

However, we know the cpumask will only be set by one cpu (the data
owner) and any previous iteration of the mask would have been cleared by
the reading cpu. By redundantly writing refs to 0 on the owning cpu before
the smp_wmb, the write to refs will follow the same path as the writes
that set the cpumask, which in turn allows us to keep the barrier in the
interrupt handler a smp_rmb instead of promoting it to a smp_mb (which
will be executed by N cpus for each of the possible M elements on the
list).

I moved and expanded the comment about our (ab)use of the rcu list
primitives for the concurrent walk earlier into this function. I
considered moving the first two paragraphs to the queue list head and
lock, but felt it would have been too disconnected from the code.

Cc: Paul McKinney
Cc: stable@kernel.org (2.6.32 and later)
Signed-off-by: Milton Miller
Signed-off-by: Linus Torvalds
-
Peter pointed out there was nothing preventing the list_del_rcu in
smp_call_function_interrupt from running before the list_add_rcu in
smp_call_function_many.

Fix this by not setting refs until we have gotten the lock for the list.
Take advantage of the wmb in list_add_rcu to save an explicit additional
one.

I tried to force this race with a udelay before the lock & list_add and
by mixing all 64 online cpus with just 3 random cpus in the mask, but
was unsuccessful. Still, inspection shows a valid race, and the fix is
an extension of the existing protection window in the current code.

Cc: stable@kernel.org (v2.6.32 and later)
Reported-by: Peter Zijlstra
Signed-off-by: Milton Miller
Signed-off-by: Linus Torvalds
21 Jan, 2011
3 commits
-
…/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled
lockdep: Move early boot local IRQ enable/disable status to init/main.c
-
We have to test the cpu mask in the interrupt handler before checking the
refs, otherwise we can start to follow an entry before it's deleted and
find it partially initialized for the next trip. Presently we also clear
the cpumask bit before executing the called function, which implies
getting write access to the line. After the function is called we then
decrement refs, and if they go to zero we then unlock the structure.

However, this implies getting write access to the call function data
before and after the function is called. If we can assert that no
smp_call_function execution function is allowed to enable interrupts, then
we can move both writes to after the function is called, hopefully allowing
both writes with one cache line bounce.

On a 256 thread system with a kernel compiled for 1024 threads, the time
to execute testcase in the "smp_call_function_many race" changelog was
reduced by about 30-40ms out of about 545 ms.

I decided to keep this as WARN because it's now a buggy function, even
though the stack trace is of no value -- a simple printk would give us the
information needed.

Raw data:
Without patch:
ipi_test startup took 1219366ns complete 539819014ns total 541038380ns
ipi_test startup took 1695754ns complete 543439872ns total 545135626ns
ipi_test startup took 7513568ns complete 539606362ns total 547119930ns
ipi_test startup took 13304064ns complete 533898562ns total 547202626ns
ipi_test startup took 8668192ns complete 544264074ns total 552932266ns
ipi_test startup took 4977626ns complete 548862684ns total 553840310ns
ipi_test startup took 2144486ns complete 541292318ns total 543436804ns
ipi_test startup took 21245824ns complete 530280180ns total 551526004ns

With patch:
ipi_test startup took 5961748ns complete 500859628ns total 506821376ns
ipi_test startup took 8975996ns complete 495098924ns total 504074920ns
ipi_test startup took 19797750ns complete 492204740ns total 512002490ns
ipi_test startup took 14824796ns complete 487495878ns total 502320674ns
ipi_test startup took 11514882ns complete 494439372ns total 505954254ns
ipi_test startup took 8288084ns complete 502570774ns total 510858858ns
ipi_test startup took 6789954ns complete 493388112ns total 500178066ns

#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h> /* sched clock */

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

static void do_ipis(struct work_struct *dummy)
{
	int i;

	for (i = 0; i < ITERATIONS; i++)
		smp_call_function(do_nothing_ipi, NULL, 1);

	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
	int cpu;
	u64 start, started, done;

	start = local_clock();
	for_each_online_cpu(cpu) {
		INIT_WORK(&work[cpu], do_ipis);
		schedule_work_on(cpu, &work[cpu]);
	}
	started = local_clock();
	for_each_online_cpu(cpu)
		flush_work(&work[cpu]);
	done = local_clock();
	pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n",
		started-start, done-started, done-start);

	return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init)
module_exit(testcase_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");

Signed-off-by: Milton Miller
Cc: Anton Blanchard
Cc: Ingo Molnar
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
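The reordering described in this changelog (a read-only test of the mask before the call, both writes after it) might look roughly like the following userspace sketch. It is not the kernel handler: `struct call_data`, `handle_ipi()` and `count_call()` are hypothetical names, a single `unsigned long` stands in for the cpumask (so `cpu` must be smaller than the number of bits in a long), and C11 atomics replace the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the call function data. */
struct call_data {
	atomic_ulong cpumask;     /* one bit per cpu that must run func */
	atomic_int refs;          /* cpus that have not finished yet */
	void (*func)(void *);
	void *info;
};

/*
 * Handler for one cpu. The mask is only read before the call; both
 * writes (clear our bit, drop our reference) happen after func returns,
 * so they can hit the cache line back to back.
 * Returns 1 if this cpu dropped the last reference, 0 otherwise.
 */
static int handle_ipi(struct call_data *d, unsigned int cpu)
{
	unsigned long bit = 1UL << cpu;
	int refs;

	/* Read-only test: entry is not for us, or we already handled it. */
	if (!(atomic_load(&d->cpumask) & bit))
		return 0;

	d->func(d->info);

	/* First write: clear our bit now that the function has run. */
	atomic_fetch_and(&d->cpumask, ~bit);

	/* Second write: fetch_sub returns the old value, so subtract 1. */
	refs = atomic_fetch_sub(&d->refs, 1) - 1;
	assert(refs >= 0);        /* analogous to a WARN_ON(refs < 0) check */

	return refs == 0;
}

/* Example callback: counts how often it was invoked. */
static void count_call(void *info)
{
	++*(int *)info;
}
```

With a mask of 0x5 and refs set to 2, calling `handle_ipi()` for cpu 0 and then cpu 2 runs `count_call()` twice, and only the second call reports that the last reference was dropped. The sketch relies on the same precondition the changelog states: the called function must not let the handler run again on the same cpu before the bit is cleared.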
I noticed a failure where we hit the following WARN_ON in
generic_smp_call_function_interrupt:

	if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
		continue;

	data->csd.func(data->csd.info);

	refs = atomic_dec_return(&data->refs);
	WARN_ON(refs < 0);

	cpumask sees and
	clears bit in cpumask
	might be using old or new fn!
	decrements refs below 0

set data->refs (too late!)

The important thing to note is since the interrupt handler walks a
potentially stale call_function.queue without any locking, then another
cpu can view the percpu *data structure at any time, even when the owner
is in the process of initialising it.

The following test case hits the WARN_ON 100% of the time on my PowerPC
box (having 128 threads does help :)

#include <linux/module.h>
#include <linux/init.h>

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

static void do_ipis(struct work_struct *dummy)
{
	int i;

	for (i = 0; i < ITERATIONS; i++)
		smp_call_function(do_nothing_ipi, NULL, 1);

	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		INIT_WORK(&work[cpu], do_ipis);
		schedule_work_on(cpu, &work[cpu]);
	}

	return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init)
module_exit(testcase_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");

I tried to fix it by ordering the read and the write of ->cpumask and
->refs. In doing so I missed a critical case but Paul McKenney was able
to spot my bug thankfully :) To ensure we aren't viewing previous
iterations the interrupt handler needs to read ->refs then ->cpumask then
->refs _again_.

Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
[miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ]
[miltonm@bga.com: remove excess tests]
Signed-off-by: Anton Blanchard
Signed-off-by: Milton Miller
Cc: Ingo Molnar
Cc: "Paul E. McKenney"
Cc: [2.6.32+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
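The tail of the interleaving in this changelog (a handler with a stale view clears its bit and decrements refs before the owner has set refs for the new call) can be replayed deterministically in userspace. This is a single-threaded sketch under stated assumptions, not kernel code: `struct call_data`, `owner_set_mask()`, `owner_set_refs()`, `stale_handler()` and `replay_race()` are hypothetical names, and the steps are simply executed in the racy order instead of on separate cpus.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the reused per-cpu data. */
struct call_data {
	atomic_ulong cpumask;
	atomic_int refs;
};

/* Owner, step 1: reuse the entry and set the mask for the next call. */
static void owner_set_mask(struct call_data *d, unsigned long mask)
{
	atomic_store(&d->cpumask, mask);
}

/* Owner, step 2: set refs. In the racy ordering this comes too late. */
static void owner_set_refs(struct call_data *d, int n)
{
	atomic_store(&d->refs, n);
}

/*
 * Handler still walking a stale view of the queue: it sees its bit,
 * clears it and drops a reference. Returns the resulting refs value.
 */
static int stale_handler(struct call_data *d, unsigned int cpu)
{
	unsigned long bit = 1UL << cpu;

	/* fetch_and returns the old mask, like a test-and-clear. */
	if (!(atomic_fetch_and(&d->cpumask, ~bit) & bit))
		return atomic_load(&d->refs);
	return atomic_fetch_sub(&d->refs, 1) - 1;
}

/* Replays the interleaving; returns the refs value the handler saw. */
static int replay_race(void)
{
	struct call_data d;
	int seen;

	atomic_init(&d.cpumask, 0);
	atomic_init(&d.refs, 0);       /* previous call fully completed */

	owner_set_mask(&d, 1UL << 2);  /* owner starts reusing the entry */
	seen = stale_handler(&d, 2);   /* handler runs inside the window */
	owner_set_refs(&d, 1);         /* too late */
	return seen;
}
```

In this replay the handler ends up with a negative refs value, which is the condition the WARN_ON(refs < 0) in the changelog catches.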
20 Jan, 2011
1 commit
-
percpu may end up calling vfree() during early boot which in
turn may call on_each_cpu() for TLB flushes. The function of
on_each_cpu() can be done safely while IRQ is disabled during
early boot but it assumed that the function is always called
with local IRQ enabled which ended up enabling local IRQ
prematurely during boot and triggering a couple of warnings.

This patch updates on_each_cpu() and smp_call_function_many()
such that on_each_cpu() can be used safely while
early_boot_irqs_disabled is set.

Signed-off-by: Tejun Heo
Acked-by: Peter Zijlstra
Acked-by: Pekka Enberg
Cc: Linus Torvalds
LKML-Reference:
Signed-off-by: Ingo Molnar
Reported-by: Ingo Molnar
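The failure mode described here, a helper that unconditionally re-enables interrupts even though its caller had them disabled, can be illustrated with a small userspace model. This is a sketch only: the `irqs_enabled` flag and the `run_local_*()` helpers are hypothetical, and it assumes the remedy amounts to saving and restoring the caller's state (in the style of local_irq_save()/local_irq_restore()) rather than pairing local_irq_disable() with local_irq_enable().

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated local interrupt state; true means enabled (assumption). */
static bool irqs_enabled;

/* Placeholder for the function run on the local cpu. */
static void noop(void *info)
{
	(void)info;
}

/* Problematic pattern: assumes the caller had interrupts enabled. */
static void run_local_unconditional(void (*func)(void *), void *info)
{
	irqs_enabled = false;       /* like local_irq_disable() */
	func(info);
	irqs_enabled = true;        /* like local_irq_enable(): always on */
}

/* Save/restore pattern: leaves the state exactly as the caller had it. */
static void run_local_save_restore(void (*func)(void *), void *info)
{
	bool flags = irqs_enabled;  /* like local_irq_save(flags) */

	irqs_enabled = false;
	func(info);
	irqs_enabled = flags;       /* like local_irq_restore(flags) */
}
```

Called with the flag cleared, as during early boot, `run_local_save_restore()` leaves it cleared, while `run_local_unconditional()` turns it on prematurely.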