16 Aug, 2018

1 commit

  • commit bc2d8d262cba5736332cbc866acb11b1c5748aa9 upstream

    Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
    cpu_smt_control to CPU_SMT_NOT_SUPPORTED when 'nosmt' was supplied on the
    kernel command line, as it cannot differentiate between SMT disabled by
    the BIOS and SMT soft-disabled via 'nosmt'. That wrecks the state and
    makes the sysfs interface unusable.

    Rework this so that the availability of SMT is determined in
    cpu_smt_allowed() during bringup of the non-boot CPUs. If a newly booted
    CPU is not a 'primary' thread, then set the local cpu_smt_available marker
    and evaluate it explicitly right after the initial SMP bringup has finished.
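
    A minimal sketch of that flow, based only on the description above (the
    helper names follow the changelog and are simplified, not copied from the
    upstream diff):

        static bool cpu_smt_available;

        static bool cpu_smt_allowed(unsigned int cpu)
        {
                if (topology_is_primary_thread(cpu))
                        return true;

                /*
                 * A non-primary thread reached this point, so the hardware
                 * does support SMT; remember that for the late evaluation
                 * right after the initial SMP bringup.
                 */
                cpu_smt_available = true;

                return cpu_smt_control == CPU_SMT_ENABLED;
        }

        /* Invoked once the initial SMP bringup has finished. */
        void __init cpu_smt_check_topology(void)
        {
                if (!cpu_smt_available)
                        cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
        }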

    SMT evaluation on x86 is a trainwreck, as the firmware has all the
    information _before_ booting the kernel, but there is no interface to query
    it.

    Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
    Reported-by: Josh Poimboeuf
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

09 Sep, 2017

1 commit

  • First, the number of CPUs can't be negative.

    Second, the different signedness leads to suboptimal code in the following
    cases:

    1)
    kmalloc(nr_cpu_ids * sizeof(X));

    "int" has to be sign extended to size_t.

    2)
    while (*pos < nr_cpu_ids)    /* where pos is a loff_t pointer */

    MOVSXD is 1 byte longer than the equivalent MOV.

    Other cases exist as well. Basically, the compiler is told that nr_cpu_ids
    can't be negative, which it can't deduce if the type is "int".
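
    A userspace illustration of the same effect (not from the patch; nr_cpus
    is a hypothetical stand-in for nr_cpu_ids):

        #include <stddef.h>
        #include <stdlib.h>

        extern unsigned int nr_cpus;    /* unsigned: known non-negative */

        void *alloc_cpu_table(size_t elem_size)
        {
                /* Zero-extension (plain MOV) instead of MOVSXD, and the
                 * compiler can drop "nr_cpus < 0" style checks entirely. */
                return malloc(nr_cpus * elem_size);
        }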

    Code savings on allyesconfig kernel: -3KB

    add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
    function                                 old     new   delta
    coretemp_cpu_online                      450     512     +62
    rcu_init_one                            1234    1272     +38
    pci_device_probe                         374     399     +25

    ...

    pgdat_reclaimable_pages                  628     556     -72
    select_fallback_rq                       446     369     -77
    task_numa_find_cpu                      1923    1807    -116

    Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

29 Aug, 2017

1 commit

  • struct call_single_data is used in IPIs to transfer information between
    CPUs. Its size is bigger than sizeof(unsigned long) and smaller than the
    cache line size. Currently it is not allocated with any explicit alignment
    requirement. This makes it possible for an allocated call_single_data to
    cross two cache lines, which doubles the number of cache lines that need
    to be transferred among CPUs.

    This can be fixed by requiring call_single_data to be aligned to the
    size of call_single_data. Currently the size of call_single_data is a
    power of 2. If we add new fields to call_single_data, we may need to
    add padding to make sure the size of the new definition is a power of 2
    as well.

    Fortunately, this is enforced by GCC, which will report bad sizes.

    To set the alignment requirement of call_single_data to the size of
    call_single_data, a struct definition and a typedef are used.
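
    The resulting definition looks roughly like this (a sketch of the shape,
    with the field list abridged):

        struct __call_single_data {
                struct llist_node llist;
                smp_call_func_t func;
                void *info;
                unsigned int flags;
        };

        /* Aligning to the (power-of-2) size keeps one csd within a single
         * cache line instead of letting it straddle two. */
        typedef struct __call_single_data call_single_data_t
                __aligned(sizeof(struct __call_single_data));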

    To test the effect of the patch, I used the vm-scalability multiple
    thread swap test case (swap-w-seq-mt). The test creates multiple
    threads and each thread eats memory until all RAM and part of swap
    is used, so that a huge number of IPIs is triggered when unmapping
    memory. In the test, the throughput of memory writing improves by ~5%
    compared with misaligned call_single_data, because of faster IPIs.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Huang, Ying
    [ Add call_single_data_t and align with size of call_single_data. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Lu
    Cc: Borislav Petkov
    Cc: Eric Dumazet
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
    Signed-off-by: Ingo Molnar

    Ying Huang
     

23 May, 2017

2 commits

  • The cpumasks in smp_call_function_many() are private and not subject
    to concurrency, so atomic bitops on them are pointless and expensive.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • An Inter-Processor Interrupt (IPI) is needed when a page is unmapped and
    the process' mm_cpumask() shows the process has ever run on other CPUs.
    Page migration and page reclaim both need IPIs. The number of IPIs that
    need to be sent to different CPUs is especially large for multi-threaded
    workloads since mm_cpumask() is per process.

    For smp_call_function_many(), whenever a CPU queues a CSD to a target
    CPU, it sends an IPI to let the target CPU handle the work.
    This isn't necessary - we only need to send an IPI when queueing a CSD
    onto an empty call_single_queue.

    The reason:

    flush_smp_call_function_queue(), which is called when a CPU receives an
    IPI, will empty the queue and then handle all of the CSDs there. So if
    the target CPU's call_single_queue is not empty, we know that:
    i. an IPI for the target CPU has already been sent by 'previous queuers';
    ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet.
    Thus, it's safe for us to just queue our CSD there without sending an
    additional IPI. And for the 'previous queuers', we can limit it to the
    first queuer.
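
    A sketch of the resulting queueing step (simplified; llist_add() returns
    true only when the list was empty beforehand, i.e. we are the first
    queuer):

        if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
                arch_send_call_function_single_ipi(cpu);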

    To demonstrate the effect of this patch, a multi-threaded workload that
    spawns 80 threads to consume 100G of memory in equal shares is used. This
    is tested on a 2-node Broadwell-EP with 44 cores/88 threads and 32G of
    memory, so after the 32G of memory is used up, page reclaim starts to
    happen a lot.

    With this patch, the number of IPIs dropped by 88% and throughput
    increased by about 15% for the above workload.

    Signed-off-by: Aaron Lu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20170519075331.GE2084@aaronlu.sh.intel.com
    Signed-off-by: Ingo Molnar

    Aaron Lu
     

02 Mar, 2017

1 commit


26 Oct, 2016

3 commits

  • Currently we don't print anything before starting to bring up secondary
    CPUs. This can be confusing if it takes a long time to bring up the
    secondaries, or if the kernel crashes while doing so and produces no
    further output.

    On x86 they work around this by detecting when the first secondary CPU
    comes up and printing a message (see announce_cpu()). But doing it in
    smp_init() is simpler and works for all arches.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Borislav Petkov
    Cc: akpm@osdl.org
    Cc: jgross@suse.com
    Cc: ak@linux.intel.com
    Cc: tim.c.chen@linux.intel.com
    Cc: len.brown@intel.com
    Cc: peterz@infradead.org
    Cc: richard@nod.at
    Cc: jolsa@redhat.com
    Cc: boris.ostrovsky@oracle.com
    Cc: mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1477460275-8266-3-git-send-email-mpe@ellerman.id.au
    Signed-off-by: Thomas Gleixner

    Michael Ellerman
     
  • Currently after bringing up secondary CPUs all arches print "Brought up
    %d CPUs". On x86 they also print the number of nodes that were brought
    online.

    It would be nice to also print the number of nodes on other arches.
    Although we could override smp_announce() on the other ~10 NUMA aware
    arches, it seems simpler to just always print the number of nodes. On
    non-NUMA arches there is just always 1 node.

    Having done that, smp_announce() is no longer weak, and seems small
    enough to just pull directly into smp_init().

    Also update the printing of "%d CPUs" to be smart when an SMP kernel is
    booted on a single CPU system, or when only one CPU is available, eg:

    smp: Brought up 2 nodes, 1 CPU
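
    A sketch of the singular/plural handling behind that output:

        pr_info("Brought up %d node%s, %d CPU%s\n",
                num_nodes, (num_nodes > 1 ? "s" : ""),
                num_cpus,  (num_cpus  > 1 ? "s" : ""));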

    Signed-off-by: Michael Ellerman
    Reviewed-by: Borislav Petkov
    Cc: akpm@osdl.org
    Cc: jgross@suse.com
    Cc: ak@linux.intel.com
    Cc: tim.c.chen@linux.intel.com
    Cc: len.brown@intel.com
    Cc: peterz@infradead.org
    Cc: richard@nod.at
    Cc: jolsa@redhat.com
    Cc: boris.ostrovsky@oracle.com
    Cc: mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1477460275-8266-2-git-send-email-mpe@ellerman.id.au
    Signed-off-by: Thomas Gleixner

    Michael Ellerman
     
  • This makes all our pr_xxx()'s start with "smp: ", which helps pin down
    where they come from and generally looks nice. There is actually only
    one pr_xxx() use in smp.c at the moment, but we will add some more in
    the next commit.
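
    The prefix is added with the usual pr_fmt() idiom, defined before any
    include that uses it:

        #define pr_fmt(fmt) "smp: " fmt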

    Suggested-by: Borislav Petkov
    Signed-off-by: Michael Ellerman
    Cc: akpm@osdl.org
    Cc: jgross@suse.com
    Cc: ak@linux.intel.com
    Cc: tim.c.chen@linux.intel.com
    Cc: len.brown@intel.com
    Cc: peterz@infradead.org
    Cc: richard@nod.at
    Cc: jolsa@redhat.com
    Cc: boris.ostrovsky@oracle.com
    Cc: mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1477460275-8266-1-git-send-email-mpe@ellerman.id.au
    Signed-off-by: Thomas Gleixner

    Michael Ellerman
     

22 Sep, 2016

1 commit

  • The SMP IPI struct descriptor is allocated on the stack, except for the
    workqueue item, and lockdep complains:

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 0 PID: 110 Comm: kworker/0:1 Not tainted 4.8.0-rc5+ #14
    Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014
    Workqueue: events smp_call_on_cpu_callback
    ...
    Call Trace:
    dump_stack
    register_lock_class
    ? __lock_acquire
    __lock_acquire
    ? __lock_acquire
    lock_acquire
    ? process_one_work
    process_one_work
    ? process_one_work
    worker_thread
    ? process_one_work
    ? process_one_work
    kthread
    ? kthread_create_on_node
    ret_from_fork

    So allocate it on the stack too.
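
    A plausible shape of the fix (sscs names the on-stack descriptor here for
    illustration; smp_call_on_cpu_callback is taken from the splat above):
    initialize the on-stack work item with the ONSTACK variant so lockdep
    gets a properly registered key:

        INIT_WORK_ONSTACK(&sscs.work, smp_call_on_cpu_callback);
        queue_work_on(cpu, system_wq, &sscs.work);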

    Signed-off-by: Peter Zijlstra (Intel)
    [ Test and write commit message. ]
    Signed-off-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160911084323.jhtnpb4b37t5tlno@pd.tnic
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Sep, 2016

2 commits

  • On some hardware models (e.g. the Dell Studio 1555 laptop) some
    hardware-related functions (e.g. SMIs) must be executed on physical CPU 0
    only. Instead of open-coding such functionality multiple times in
    the kernel, add a service function for this purpose. This also enables
    taking special measures in virtualized environments like Xen, too.
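
    The resulting interface, plus a hypothetical call site (the SMI function
    and buffer names below are made up for illustration):

        int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par,
                            bool phys);

        /* e.g. run a firmware/SMI poke on physical CPU 0: */
        ret = smp_call_on_cpu(0, dell_smi_request, &smi_buf, true);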

    Signed-off-by: Juergen Gross
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Douglas_Warzecha@dell.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akataria@vmware.com
    Cc: boris.ostrovsky@oracle.com
    Cc: chrisw@sous-sol.org
    Cc: david.vrabel@citrix.com
    Cc: hpa@zytor.com
    Cc: jdelvare@suse.com
    Cc: jeremy@goop.org
    Cc: linux@roeck-us.net
    Cc: pali.rohar@gmail.com
    Cc: rusty@rustcorp.com.au
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1472453327-19050-4-git-send-email-jgross@suse.com
    Signed-off-by: Ingo Molnar

    Juergen Gross
     
  • Add generic virtualization support for pinning the current vCPU to a
    specified physical CPU. As this operation isn't performance-critical
    (only a very limited set of operations like BIOS calls and SMIs is
    expected to need this), just add a hypervisor-specific indirection.

    Signed-off-by: Juergen Gross
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Douglas_Warzecha@dell.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akataria@vmware.com
    Cc: boris.ostrovsky@oracle.com
    Cc: chrisw@sous-sol.org
    Cc: david.vrabel@citrix.com
    Cc: hpa@zytor.com
    Cc: jdelvare@suse.com
    Cc: jeremy@goop.org
    Cc: linux@roeck-us.net
    Cc: pali.rohar@gmail.com
    Cc: rusty@rustcorp.com.au
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1472453327-19050-3-git-send-email-jgross@suse.com
    Signed-off-by: Ingo Molnar

    Juergen Gross
     

30 Jul, 2016

1 commit

  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 lines of impenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

15 Jul, 2016

1 commit

  • Install the callbacks via the state machine. They are installed at runtime so
    smpcfd_prepare_cpu() needs to be invoked by the boot-CPU.

    Signed-off-by: Richard Weinberger
    [ Added the dropped CPU dying case back in. ]
    Signed-off-by: Richard Cochran
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rasmus Villemoes
    Cc: Thomas Gleixner
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153337.818376366@linutronix.de
    Signed-off-by: Ingo Molnar

    Richard Weinberger
     

14 Jun, 2016

1 commit

  • This new form allows using hardware-assisted waiting.

    Some hardware (ARM64 and x86) allows monitoring an address for changes,
    so by providing a pointer we can use this to replace the cpu_relax()
    busy loop with hardware-optimized methods in the future.
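
    Usage then looks like this (the obvious conversion of the existing
    busy-wait; VAL names the freshly loaded value inside the condition):

        /* Spin, with acquire ordering, until the csd is unlocked. */
        smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));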

    Requested-by: Will Deacon
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

16 Mar, 2016

1 commit

  • Pull cpu hotplug updates from Thomas Gleixner:
    "This is the first part of the ongoing cpu hotplug rework:

    - Initial implementation of the state machine

    - Runs all online and prepare down callbacks on the plugged cpu and
    not on some random processor

    - Replaces busy loop waiting with completions

    - Adds tracepoints so the states can be followed"

    More detailed commentary on this work from an earlier email:
    "What's wrong with the current cpu hotplug infrastructure?

    - Asymmetry

    The hotplug notifier mechanism is asymmetric versus the bringup and
    teardown. This is mostly caused by the notifier mechanism.

    - Largely undocumented dependencies

    While some notifiers use explicitly defined notifier priorities,
    we have quite a few notifiers which use numerical priorities to
    express dependencies without any documentation of why.

    - Control processor driven

    Most of the bringup/teardown of a cpu is driven by a control
    processor. While it is understandable that preparatory steps,
    like idle thread creation, memory allocation for and initialization
    of essential facilities, need to be done before a cpu can boot,
    there is no reason why everything else must run on a control
    processor. Before this patch series, bringup looks like this:

    Control CPU                          Booting CPU

    do preparatory steps
    kick cpu into life

                                         do low level init

    sync with booting cpu                sync with control cpu

    bring the rest up

    - All or nothing approach

    There is no way to do partial bringups. That's something which is
    really desired because we waste, e.g. at boot, a substantial amount of
    time just busy waiting for the cpu to come to life. That's stupid,
    as we could very well do the preparatory steps and the initial IPI for
    other cpus and then go back and do the necessary low level
    synchronization with the freshly booted cpu.

    - Minimal debuggability

    Due to the notifier based design, it's impossible to switch between
    two stages of the bringup/teardown back and forth in order to test
    the correctness. So in many hotplug notifiers the cancel
    mechanisms are either nonexistent or completely untested.

    - Notifier [un]registering is tedious

    To [un]register notifiers we need to protect against hotplug at
    every callsite. There is no mechanism to ensure that bringup/teardown
    callbacks are issued on the online cpus, so every caller needs to
    do it itself. That also includes error rollback.

    What's the new design?

    The base of the new design is a symmetric state machine, where both
    the control processor and the booting/dying cpu execute a well
    defined set of states. Each state is symmetric in the end, except
    for some well defined exceptions, and the bringup/teardown can be
    stopped and reversed at almost all states.

    So the bringup of a cpu will look like this in the future:

    Control CPU                          Booting CPU

    do preparatory steps
    kick cpu into life

                                         do low level init

    sync with booting cpu                sync with control cpu

                                         bring itself up

    The synchronization step does not require the control cpu to wait.
    That mechanism can be done asynchronously via a worker or some
    other mechanism.

    The teardown can be made very similar, so that the dying cpu cleans
    up and brings itself down. Cleanups which need to be done after
    the cpu is gone, can be scheduled asynchronously as well.

    It is a long way to get there, as we need to refactor the notion of when
    a cpu is available. Today we set the cpu online right after it comes
    out of the low level bringup, which is not really correct.

    The proper mechanism is to set it to available, i.e. cpu local
    threads, like softirqd, hotplug thread etc. can be scheduled on that
    cpu, and once it finished all booting steps, it's set to online, so
    general workloads can be scheduled on it. The reverse happens on
    teardown. First thing to do is to forbid scheduling of general
    workloads, then teardown all the per cpu resources and finally shut it
    off completely.

    This patch series implements the basic infrastructure for this at the
    core level. This includes the following:

    - Basic state machine implementation with well defined states, so
    ordering and prioritization can be expressed.

    - Interfaces to [un]register state callbacks

    This invokes the bringup/teardown callback on all online cpus with
    the proper protection in place and [un]installs the callbacks in
    the state machine array.

    For callbacks which have no particular ordering requirement we have
    a dynamic state space, so that drivers don't have to register an
    explicit hotplug state.

    If a callback fails, the code automatically does a rollback to the
    previous state.

    - Sysfs interface to drive the state machine to a particular step.

    This is only partially functional today. Full functionality and
    therefore testability will be achieved once we have converted all
    existing hotplug notifiers over to the new scheme.

    - Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
    processor:

    Control CPU                          Booting CPU

    do preparatory steps
    kick cpu into life

                                         do low level init

    sync with booting cpu                sync with control cpu
    wait for boot
                                         bring itself up

                                         Signal completion to control cpu

    In a previous step of this work we've done a full tree mechanical
    conversion of all hotplug notifiers to the new scheme. The balance
    is a net removal of about 4000 lines of code.

    This is not included in this series, as we decided to take a
    different approach. Instead of mechanically converting everything
    over, we will do a proper overhaul of the usage sites one by one so
    they nicely fit into the symmetric callback scheme.

    I decided to do that after I looked at the ugliness of some of the
    converted sites and figured out that their hotplug mechanism is
    completely buggered anyway. So there is no point to do a
    mechanical conversion first as we need to go through the usage
    sites one by one again in order to achieve a full symmetric and
    testable behaviour"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    cpu/hotplug: Document states better
    cpu/hotplug: Fix smpboot thread ordering
    cpu/hotplug: Remove redundant state check
    cpu/hotplug: Plug death reporting race
    rcu: Make CPU_DYING_IDLE an explicit call
    cpu/hotplug: Make wait for dead cpu completion based
    cpu/hotplug: Let upcoming cpu bring itself fully up
    arch/hotplug: Call into idle with a proper state
    cpu/hotplug: Move online calls to hotplugged cpu
    cpu/hotplug: Create hotplug threads
    cpu/hotplug: Split out the state walk into functions
    cpu/hotplug: Unpark smpboot threads from the state machine
    cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
    cpu/hotplug: Implement setup/removal interface
    cpu/hotplug: Make target state writeable
    cpu/hotplug: Add sysfs state interface
    cpu/hotplug: Hand in target state to _cpu_up/down
    cpu/hotplug: Convert the hotplugged cpu work to a state machine
    cpu/hotplug: Convert to a state machine for the control processor
    cpu/hotplug: Add tracepoints
    ...

    Linus Torvalds
     

10 Mar, 2016

2 commits

  • We can micro-optimize this call and mildly relax the
    barrier requirements by relying on ctrl + rmb, keeping
    the acquire semantics. In addition, this is pretty much
    the standard now for busy-waiting under such constraints.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1457574936-19065-3-git-send-email-dbueso@suse.de
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • While the compiler tends to already do it for us (except for
    csd_unlock()), make it explicit. These helpers mainly deal with
    the ->flags, are short-lived and can be called, for example,
    from smp_call_function_many().

    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1457574936-19065-2-git-send-email-dbueso@suse.de
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

02 Mar, 2016

1 commit

  • In order to let the hotplugged cpu take care of the setup/teardown, we
    need a separate hotplug thread.

    Signed-off-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Rik van Riel
    Cc: Rafael Wysocki
    Cc: "Srivatsa S. Bhat"
    Cc: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sebastian Siewior
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Paul McKenney
    Cc: Linus Torvalds
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/20160226182341.454541272@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
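
    The helper mentioned above is essentially a test for direct reclaim
    (sketch):

        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }

        /* typical call site */
        if (gfpflags_allow_blocking(gfp_mask))
                might_sleep();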

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

20 Apr, 2015

1 commit

  • Commit 8053871d0f7f ("smp: Fix smp_call_function_single_async()
    locking") fixed the locking for the asynchronous smp-call case, but in
    the process of moving the lock handling around, one of the error cases
    ended up not unlocking the call data at all.

    This went unnoticed on x86, because this is a "caller is buggy" case,
    where the caller is trying to call a non-existent CPU. But apparently
    ARM does that (at least under qemu-arm). Blindly doing cross-cpu calls
    to random CPUs that aren't even online seems a bit fishy, but the error
    handling was clearly not correct.

    Simply add the missing "csd_unlock()" to the error path.
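
    A simplified sketch of the fixed error path:

        if ((unsigned int)cpu >= nr_cpu_ids || !cpu_online(cpu)) {
                csd_unlock(csd);        /* this unlock was missing */
                return -ENXIO;
        }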

    Reported-and-tested-by: Guenter Roeck
    Analyzed-by: Rabin Vincent
    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Apr, 2015

1 commit

  • The current smp_call_function code suffers from a number of problems,
    most notably that smp_call_function_single_async() is broken.

    The problem is that flush_smp_call_function_queue() does csd_unlock()
    _after_ calling csd->func(). This means that a caller cannot properly
    synchronize the csd usage as it has to.

    Change the code to release the csd before calling ->func() for the
    async case, and put a WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK) in
    smp_call_function_single_async() to warn us of improper serialization,
    because any waiting there can result in deadlocks when called with
    IRQs disabled.

    Rename the (currently) unused WAIT flag to SYNCHRONOUS and (re)use it
    such that we know what to do in flush_smp_call_function_queue().

    Rework csd_{,un}lock() to use smp_load_acquire() / smp_store_release()
    to avoid some full barriers while more clearly providing lock
    semantics.

    Finally move the csd maintenance out of generic_exec_single() into its
    callers for clearer code.
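
    A simplified sketch of the reworked dispatch in
    flush_smp_call_function_queue(): release before the call for async
    entries, after it for synchronous ones:

        if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
                csd->func(csd->info);
                csd_unlock(csd);
        } else {
                smp_call_func_t func = csd->func;
                void *info = csd->info;

                csd_unlock(csd);
                func(info);
        }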

    Signed-off-by: Linus Torvalds
    [ Added changelog. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Jens Axboe
    Cc: Rafael David Tinoco
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/CA+55aFz492bzLFhdbKN-Hygjcreup7CjMEYk3nTSfRWjppz-OA@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Linus Torvalds
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     

19 Sep, 2014

1 commit

  • Currently kick_all_cpus_sync() can break non-polling idle cpus
    out of idle through IPIs.

    But sometimes we need to break the polling idle cpus immediately
    to reselect a suitable C-state; also, for non-idle cpus, we need
    to do nothing when we try to wake them up.

    Here, add a new function, wake_up_all_idle_cpus(), to take all cpus
    out of idle, based on wake_up_if_idle().
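
    A sketch of the new helper built on wake_up_if_idle():

        void wake_up_all_idle_cpus(void)
        {
                int cpu;

                preempt_disable();
                for_each_online_cpu(cpu) {
                        if (cpu == smp_processor_id())
                                continue;

                        wake_up_if_idle(cpu);
                }
                preempt_enable();
        }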

    Signed-off-by: Chuansheng Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: daniel.lezcano@linaro.org
    Cc: rjw@rjwysocki.net
    Cc: linux-pm@vger.kernel.org
    Cc: changcheng.liu@intel.com
    Cc: xiaoming.wang@intel.com
    Cc: souvik.k.chakravarty@intel.com
    Cc: luto@amacapital.net
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Michal Hocko
    Cc: Paul Gortmaker
    Cc: Roman Gushchin
    Cc: Srivatsa S. Bhat
    Link: http://lkml.kernel.org/r/1409815075-4180-2-git-send-email-chuansheng.liu@intel.com
    Signed-off-by: Ingo Molnar

    Chuansheng Liu
     

27 Aug, 2014

1 commit


07 Aug, 2014

1 commit

  • The rarely-executed memory-allocation-failed callback path generates a
    WARN_ON_ONCE() when smp_call_function_single() succeeds. Presumably
    it's supposed to warn on failures.
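
    In other words (illustration only): smp_call_function_single() returns 0
    on success, so the warning should fire on a non-zero return value:

        ret = smp_call_function_single(cpu, func, info, wait);
        WARN_ON_ONCE(ret);      /* warn on failure, not on success */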

    Signed-off-by: Sasha Levin
    Cc: Christoph Lameter
    Cc: Gilad Ben-Yossef
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

16 Jul, 2014

1 commit


24 Jun, 2014

1 commit

  • There is a race between the CPU offline code (within stop-machine) and
    the smp-call-function code, which can lead to getting IPIs on the
    outgoing CPU, *after* it has gone offline.

    Specifically, this can happen when using
    smp_call_function_single_async() to send the IPI, since this API allows
    sending asynchronous IPIs from IRQ disabled contexts. The exact race
    condition is described below.

    During CPU offline, in stop-machine, we don't enforce any rule in the
    _DISABLE_IRQ stage, regarding the order in which the outgoing CPU and
    the other CPUs disable their local interrupts. Due to this, we can
    encounter a situation in which an IPI is sent by one of the other CPUs
    to the outgoing CPU (while it is *still* online), but the outgoing CPU
    ends up noticing it only *after* it has gone offline.

         CPU 1                                       CPU 2
     (Online CPU)                             (CPU going offline)

    Enter _PREPARE stage                      Enter _PREPARE stage

                                              Enter _DISABLE_IRQ stage

                                            =
    Got a device interrupt, and             | Didn't notice the IPI
    the interrupt handler sent an           | since interrupts were
    IPI to CPU 2 using                      | disabled on this CPU.
    smp_call_function_single_async()        |
                                            =

    Enter _DISABLE_IRQ stage

    Enter _RUN stage                          Enter _RUN stage

                                            =
    Busy loop with interrupts               | Invoke take_cpu_down()
    disabled.                               | and take CPU 2 offline
                                            =

    Enter _EXIT stage                         Enter _EXIT stage

    Re-enable interrupts                      Re-enable interrupts

                                              The pending IPI is noted
                                              immediately, but alas,
                                              the CPU is offline at
                                              this point.

    This of course, makes the smp-call-function IPI handler code running on
    CPU 2 unhappy and it complains about "receiving an IPI on an offline
    CPU".

    One real example of the scenario on CPU 1 is the block layer's
    complete-request call-path:

    __blk_complete_request() [interrupt-handler]
    raise_blk_irq()
    smp_call_function_single_async()

    However, if we look closely, the block layer does check that the target
    CPU is online before firing the IPI. So in this case, it is actually
    the unfortunate ordering/timing of events in the stop-machine phase that
    leads to receiving IPIs after the target CPU has gone offline.

    In reality, getting a late IPI on an offline CPU is not too bad by
    itself (this can happen even due to hardware latencies in IPI
    send-receive). It is a bug only if the target CPU really went offline
    without executing all the callbacks queued on its list. (Note that a
    CPU is free to execute its pending smp-call-function callbacks in a
    batch, without waiting for the corresponding IPIs to arrive for each one
    of those callbacks).

    So, fixing this issue can be broken up into two parts:

    1. Ensure that a CPU goes offline only after executing all the
    callbacks queued on it.

    2. Modify the warning condition in the smp-call-function IPI handler
    code such that it warns only if an offline CPU got an IPI *and* that
    CPU had gone offline with callbacks still pending in its queue.

    Achieving part 1 is straight-forward - just flush (execute) all the
    queued callbacks on the outgoing CPU in the CPU_DYING stage[1],
    including those callbacks for which the source CPU's IPIs might not have
    been received on the outgoing CPU yet. Once we do this, an IPI that
    arrives late on the CPU going offline (either due to the race mentioned
    above, or due to hardware latencies) will be completely harmless, since
    the outgoing CPU would have executed all the queued callbacks before
    going offline.
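
    A sketch of part 1, assuming the existing hotplug notifier grows a
    CPU_DYING case and the flush helper takes a flag saying whether to warn
    about running on an offline CPU:

        case CPU_DYING:
        case CPU_DYING_FROZEN:
                /* Run every callback already queued on the outgoing CPU,
                 * even though it is formally marked offline by now. */
                flush_smp_call_function_queue(false);
                break;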

    Overall, this fix (parts 1 and 2 put together) additionally guarantees
    that we will see a warning only when the *IPI-sender code* is buggy -
    that is, if it queues the callback _after_ the target CPU has gone
    offline.

    [1]. The CPU_DYING part needs a little more explanation: by the time we
    execute the CPU_DYING notifier callbacks, the CPU would have already
    been marked offline. But we want to flush out the pending callbacks at
    this stage, ignoring the fact that the CPU is offline. So restructure
    the IPI handler code so that we can by-pass the "is-cpu-offline?" check
    in this particular case. (Of course, the right solution here is to fix
    CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING
    notifiers, but this requires a lot of audit to ensure that this change
    doesn't break any existing code; hence lets go with the solution
    proposed above until that is done).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Srivatsa S. Bhat
    Suggested-by: Frederic Weisbecker
    Cc: "Paul E. McKenney"
    Cc: Borislav Petkov
    Cc: Christoph Hellwig
    Cc: Frederic Weisbecker
    Cc: Gautham R Shenoy
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Tested-by: Sachin Kamat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     

16 Jun, 2014

1 commit

  • irq work currently only supports local callbacks. However, its code
    is mostly ready to run remote callbacks and we have some potential users.

    The full nohz subsystem currently open-codes its own remote irq work
    on top of the scheduler IPI when it wants a CPU to reevaluate its next
    tick. However, this ad hoc solution bloats the scheduler IPI.

    Let's just extend the irq work subsystem to support remote queueing on
    top of the generic SMP IPI to handle this kind of user. This shouldn't
    add noticeable overhead.
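
    The remote-queueing entry point added by this work looks like this (the
    per-cpu work name in the usage example is illustrative):

        bool irq_work_queue_on(struct irq_work *work, int cpu);

        /* e.g. ask a remote CPU to re-evaluate its tick: */
        irq_work_queue_on(&per_cpu(nohz_kick_work, cpu), cpu);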

    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

07 Jun, 2014

1 commit

  • There is a longstanding problem related to CPU hotplug which causes IPIs
    to be delivered to offline CPUs, and the smp-call-function IPI handler
    code prints out a warning whenever this is detected. Every once in a
    while this (usually harmless) warning gets reported on LKML, but so far
    it has not been completely fixed. Usually the solution involves finding
    out the IPI sender and fixing it by adding appropriate synchronization
    with CPU hotplug.

    However, while going through one such internal bug report, I found that
    there is a significant bug in the receiver side itself (more
    specifically, in stop-machine) that can lead to this problem even when
    the sender code is perfectly fine. This patchset fixes that
    synchronization problem in the CPU hotplug stop-machine code.

    Patch 1 adds some additional debug code to the smp-call-function
    framework, to help debug such issues easily.

    Patch 2 modifies the stop-machine code to ensure that any IPIs that were
    sent while the target CPU was online, would be noticed and handled by
    that CPU without fail before it goes offline. Thus, this avoids
    scenarios where IPIs are received on offline CPUs (as long as the sender
    uses proper hotplug synchronization).

    In fact, I debugged the problem by using Patch 1, and found that the
    payload of the IPI was always the block layer's trigger_softirq()
    function. But I was not able to find anything wrong with the block
    layer code. That's when I started looking at the stop-machine code and
    realized that there is a race-window which makes the IPI _receiver_ the
    culprit, not the sender. Patch 2 fixes that race and hence this should
    put an end to most of the hard-to-debug IPI-to-offline-CPU issues.

    This patch (of 2):

    Today the smp-call-function code just prints a warning if we get an IPI
    on an offline CPU. This info is sufficient to let us know that
    something went wrong, but often it is very hard to debug exactly who
    sent the IPI and why, from this info alone.

    In most cases, we get the warning about the IPI to an offline CPU,
    immediately after the CPU going offline comes out of the stop-machine
    phase and reenables interrupts. Since all online CPUs participate in
    stop-machine, the information regarding the sender of the IPI is already
    lost by the time we exit the stop-machine loop. So even if we dump the
    stack on each CPU at this point, we won't find anything useful since all
    of them will show the stack-trace of the stopper thread. So we need a
    better way to figure out who sent the IPI and why.

    To achieve this, when we detect an IPI targeted to an offline CPU, loop
    through the call-single-data linked list and print out the payload
    (i.e., the name of the function which was supposed to be executed by the
    target CPU). This would give us an insight as to who might have sent
    the IPI and help us debug this further.
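
    A sketch of the added diagnostic (simplified; the warning output is
    suppressed after the first occurrence, per the note below):

        static bool warned;

        if (unlikely(!cpu_online(smp_processor_id())) && !warned) {
                warned = true;
                WARN(1, "IPI received on offline CPU %d\n",
                     smp_processor_id());

                /* Dump the pending callbacks to identify the sender. */
                llist_for_each_entry(csd, entry, llist)
                        pr_warn("IPI callback %pS sent to offline CPU\n",
                                csd->func);
        }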

    [akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences]
    Signed-off-by: Srivatsa S. Bhat
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Rusty Russell
    Cc: Frederic Weisbecker
    Cc: Christoph Hellwig
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Borislav Petkov
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Gautham R Shenoy
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     

25 Feb, 2014

6 commits

  • The name __smp_call_function_single() doesn't tell much about the
    properties of this function, especially when compared to
    smp_call_function_single().

    The comments above the implementation are also misleading. The main
    point of this function is actually not to be able to embed the csd
    in an object. This is actually a requirement that results from the
    purpose of this function, which is to raise an IPI asynchronously.

    As such it can be called with interrupts disabled. And this feature
    comes at the cost of the caller who then needs to serialize the
    IPIs on this csd.

    Let's rename the function and enhance the comments so that they reflect
    these properties.

    Suggested-by: Christoph Hellwig
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker
     
  • The main point of calling __smp_call_function_single() is to send
    an IPI in a pure asynchronous way. By embedding a csd in an object,
    a caller can send the IPI without waiting for a previous one to complete
    as is required by smp_call_function_single() for example. As such,
    sending this kind of IPI can be safe even when irqs are disabled.

    This flexibility comes at the expense of the caller who then needs to
    synchronize the csd lifecycle by himself and make sure that IPIs on a
    single csd are serialized.

    This is how __smp_call_function_single() works when wait = 0 and this
    usecase is relevant.

    Now there doesn't seem to be any use case with wait = 1 that can't be
    covered by smp_call_function_single() instead, which is safer. Let's look
    at the two possible scenarios:

    1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
    in an object. It looks like a nice and convenient pattern at first
    sight because we can then retrieve the object from the IPI handler easily.

    But actually it is a waste of memory space in the object since the csd
    can be allocated from the stack by smp_call_function_single(wait = 1)
    and the object can be passed as the IPI argument.

    Besides that, embedding the csd in an object is more error prone
    because the caller must take care of the serialization of the IPIs
    for this csd.

    2) The user calls __smp_call_function_single(wait = 1) on a csd that
    is allocated on the stack. It's ok but smp_call_function_single()
    can do it as well and it already takes care of the allocation on the
    stack. Again it's more simple and less error prone.

    Therefore, using the underscore-prefixed API version with wait = 1
    is a bad pattern and a sign that the caller can do something safer and
    simpler.

    There was a single user of that, which has just been converted.
    So let's remove this option to discourage further users.

    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker
     
  • Move this function closer to __smp_call_function_single(). These functions
    have very similar behavior and should be displayed in the same block
    for clarity.

    Reviewed-by: Jan Kara
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker
     
  • __smp_call_function_single() and smp_call_function_single() share some
    code that can be factorized: execute inline when the target is local,
    check if the target is online, lock the csd, call generic_exec_single().

    Lets move the common parts to generic_exec_single().

    Reviewed-by: Jan Kara
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Frederic Weisbecker
     
  • Align __smp_call_function_single() with smp_call_function_single() so
    that it also checks whether requested cpu is still online.

    Signed-off-by: Jan Kara
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • The IPI function llist iteration is open-coded. Let's simplify this
    by using an llist iterator.

    Also we want to keep the iteration safe against possible
    csd.llist->next value reuse from the IPI handler. At least the block
    subsystem used to do such things, so let's stay careful and use
    llist_for_each_entry_safe().
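
    A sketch of the converted loop (simplified):

        struct call_single_data *csd, *csd_next;
        struct llist_node *entry;

        entry = llist_del_all(this_cpu_ptr(&call_single_queue));

        /* _safe: fetch the next node first, because once a csd's callback
         * has run and the csd is unlocked, its llist.next may be reused. */
        llist_for_each_entry_safe(csd, csd_next, entry, llist) {
                csd->func(csd->info);
                csd_unlock(csd);
        }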

    Signed-off-by: Jan Kara
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Jens Axboe

    Jan Kara
     

31 Jan, 2014

2 commits

  • After commit 9a46ad6d6df3 ("smp: make smp_call_function_many() use logic
    similar to smp_call_function_single()"), cfd->cpumask is accessed only
    in smp_call_function_many(). So there is no more need to copy it into
    cfd->cpumask_ipi before putting csd into the list. The cpumask_ipi
    field is obsolete and can be removed.

    Signed-off-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: Wang YanQing
    Cc: Xie XiuQi
    Cc: Shaohua Li
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Make smp_call_function_single and friends more efficient by using a
    lockless list.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

15 Nov, 2013

2 commits