13 Mar, 2014

1 commit

  • Commit 73f7d1ca3263 (ACPI / init: Run acpi_early_init() before
    timekeeping_init()) optimistically moved the early ACPI initialization
    before timekeeping_init(), but that didn't work, because it broke fast
    TSC calibration for Julian Wollrath on Thinkpad x121e (and most likely
    for others too). The reason is that acpi_early_init() enables the SCI
    and that interferes with the fast TSC calibration mechanism.

    Thus follow the original idea to execute acpi_early_init() before
    efi_enter_virtual_mode() to help the EFI people for now and we can
    revisit the other problem that commit 73f7d1ca3263 attempted to
    address in the future (if really necessary).

    Fixes: 73f7d1ca3263 (ACPI / init: Run acpi_early_init() before timekeeping_init())
    Reported-by: Julian Wollrath
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

06 Feb, 2014

1 commit

  • This changes 'do_execve()' to get the executable name as a 'struct
    filename', and to free it when it is done. This is what the normal
    users want, and it simplifies and streamlines their error handling.

    The controlled lifetime of the executable name also fixes a
    use-after-free problem with the trace_sched_process_exec tracepoint: the
    lifetime of the passed-in string for kernel users was not at all
    obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
    the pathname allocation lifetime with the execve() having finished,
    which in turn meant that the trace point that happened after
    mm_release() of the old process VM ended up using already free'd memory.

    To solve the kernel string lifetime issue, this simply introduces
    "getname_kernel()" that works like the normal user-space getname()
    function, except with the source coming from kernel memory.

    As Oleg points out, this also means that we could drop the tcomm[] array
    from 'struct linux_binprm', since the pathname lifetime now covers
    setup_new_exec(). That would be a separate cleanup.

    Reported-by: Igor Zhbanov
    Tested-by: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Jan, 2014

1 commit


25 Jan, 2014

1 commit

  • Pull ACPI and power management updates from Rafael Wysocki:
    "As far as the number of commits goes, the top spot belongs to ACPI
    this time with cpufreq in the second position and a handful of PM
    core, PNP and cpuidle updates. They are fixes and cleanups mostly, as
    usual, with a couple of new features in the mix.

    The most visible change is probably that we will create struct
    acpi_device objects (visible in sysfs) for all devices represented in
    the ACPI tables regardless of their status and there will be a new
    sysfs attribute under those objects allowing user space to check that
    status via _STA.

    Consequently, ACPI device eject or generally hot-removal will not
    delete those objects, unless the table containing the corresponding
    namespace nodes is unloaded, which is extremely rare. Also ACPI
    container hotplug will be handled quite a bit differently and cpufreq
    will support CPU boost ("turbo") generically and not only in the
    acpi-cpufreq driver.

    Specifics:

    - ACPI core changes to make it create a struct acpi_device object for
    every device represented in the ACPI tables during all namespace
    scans regardless of the current status of that device. In
    accordance with this, ACPI hotplug operations will not delete those
    objects, unless the underlying ACPI tables go away.

    - On top of the above, new sysfs attribute for ACPI device objects
    allowing user space to check device status by triggering the
    execution of _STA for its ACPI object. From Srinivas Pandruvada.

    - ACPI core hotplug changes reducing code duplication, integrating
    the PCI root hotplug with the core and reworking container hotplug.

    - ACPI core simplifications making it use ACPI_COMPANION() in the
    code "glueing" ACPI device objects to "physical" devices.

    - ACPICA update to upstream version 20131218. This adds support for
    the DBG2 and PCCT tables to ACPICA, fixes some bugs and improves
    debug facilities. From Bob Moore, Lv Zheng and Betty Dall.

    - Init code change to carry out the early ACPI initialization
    earlier. That should allow us to use ACPI during the timekeeping
    initialization and possibly to simplify the EFI initialization too.
    From Chun-Yi Lee.

    - Clenups of the inclusions of ACPI headers in many places all over
    from Lv Zheng and Rashika Kheria (work in progress).

    - New helper for ACPI _DSM execution and rework of the code in
    drivers that uses _DSM to execute it via the new helper. From
    Jiang Liu.

    - New Win8 OSI blacklist entries from Takashi Iwai.

    - Assorted ACPI fixes and cleanups from Al Stone, Emil Goode, Hanjun
    Guo, Lan Tianyu, Masanari Iida, Oliver Neukum, Prarit Bhargava,
    Rashika Kheria, Tang Chen, Zhang Rui.

    - intel_pstate driver updates, including proper Baytrail support,
    from Dirk Brandewie and intel_pstate documentation from Ramkumar
    Ramachandra.

    - Generic CPU boost ("turbo") support for cpufreq from Lukasz
    Majewski.

    - powernow-k6 cpufreq driver fixes from Mikulas Patocka.

    - cpufreq core fixes and cleanups from Viresh Kumar, Jane Li, Mark
    Brown.

    - Assorted cpufreq drivers fixes and cleanups from Anson Huang, John
    Tobias, Paul Bolle, Paul Walmsley, Sachin Kamat, Shawn Guo, Viresh
    Kumar.

    - cpuidle cleanups from Bartlomiej Zolnierkiewicz.

    - Support for hibernation APM events from Bin Shi.

    - Hibernation fix to avoid bringing up nonboot CPUs with ACPI EC
    disabled during thaw transitions from Bjørn Mork.

    - PM core fixes and cleanups from Ben Dooks, Leonardo Potenza, Ulf
    Hansson.

    - PNP subsystem fixes and cleanups from Dmitry Torokhov, Levente
    Kurusa, Rashika Kheria.

    - New tool for profiling system suspend from Todd E Brandt and a
    cpupower tool cleanup from One Thousand Gnomes"

    * tag 'pm+acpi-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (153 commits)
    thermal: exynos: boost: Automatic enable/disable of BOOST feature (at Exynos4412)
    cpufreq: exynos4x12: Change L0 driver data to CPUFREQ_BOOST_FREQ
    Documentation: cpufreq / boost: Update BOOST documentation
    cpufreq: exynos: Extend Exynos cpufreq driver to support boost
    cpufreq / boost: Kconfig: Support for software-managed BOOST
    acpi-cpufreq: Adjust the code to use the common boost attribute
    cpufreq: Add boost frequency support in core
    intel_pstate: Add trace point to report internal state.
    cpufreq: introduce cpufreq_generic_get() routine
    ARM: SA1100: Create dummy clk_get_rate() to avoid build failures
    cpufreq: stats: create sysfs entries when cpufreq_stats is a module
    cpufreq: stats: free table and remove sysfs entry in a single routine
    cpufreq: stats: remove hotplug notifiers
    cpufreq: stats: handle cpufreq_unregister_driver() and suspend/resume properly
    cpufreq: speedstep: remove unused speedstep_get_state
    platform: introduce OF style 'modalias' support for platform bus
    PM / tools: new tool for suspend/resume performance optimization
    ACPI: fix module autoloading for ACPI enumerated devices
    ACPI: add module autoloading support for ACPI enumerated devices
    ACPI: fix create_modalias() return value handling
    ...

    Linus Torvalds
     

24 Jan, 2014

2 commits


22 Jan, 2014

2 commits

  • Switch to memblock interfaces for early memory allocator instead of
    bootmem allocator. No functional change in beahvior than what it is in
    current code from bootmem users points of view.

    Archs already converted to NO_BOOTMEM now directly use memblock
    interfaces instead of bootmem wrappers build on top of memblock. And
    the archs which still uses bootmem, these new apis just fall back to
    exiting bootmem APIs.

    Signed-off-by: Santosh Shilimkar
    Cc: Yinghai Lu
    Cc: Tejun Heo
    Cc: "Rafael J. Wysocki"
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Greg Kroah-Hartman
    Cc: Grygorii Strashko
    Cc: H. Peter Anvin
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michal Hocko
    Cc: Paul Walmsley
    Cc: Pavel Machek
    Cc: Russell King
    Cc: Tony Lindgren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     
  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
    is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab,
    so we loose 24 on each. An average system can easily allocate few tens
    thousands of page->ptl and overhead is significant.

    Let's create a separate slab for page->ptl allocation to solve this.

    To make sure that it really works this time, some numbers from my test
    machine (just booted, no load):

    Before:
    # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
    kmalloc-96 31987 32190 128 30 1 : tunables 120 60 8 : slabdata 1073 1073 92
    After:
    # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
    page->ptl 27516 28143 72 53 1 : tunables 120 60 8 : slabdata 531 531 9
    kmalloc-96 3853 5280 128 30 1 : tunables 120 60 8 : slabdata 176 176 0

    Note that the patch is useful not only for debug case, but also for
    PREEMPT_RT, where spinlock_t is always bloated.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2014

1 commit

  • This is a variant patch from Rafael J. Wysocki's
    ACPI / init: Run acpi_early_init() before efi_enter_virtual_mode()

    According to Matt Fleming, if acpi_early_init() was executed before
    efi_enter_virtual_mode(), the EFI initialization could benefit from
    it, so Rafael's patch makes that happen.

    And, we want accessing ACPI TAD device to set system clock, so move
    acpi_early_init() before timekeeping_init(). This final position is
    also before efi_enter_virtual_mode().

    Tested-by: Toshi Kani
    Signed-off-by: Lee, Chun-Yi
    Signed-off-by: Rafael J. Wysocki

    Lee, Chun-Yi
     

21 Nov, 2013

1 commit

  • This reverts commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50.

    Al points out that while the commit *does* actually create a separate
    slab for the page->ptl allocation, that slab is never actually used, and
    the code continues to use kmalloc/kfree.

    Damien Wyart points out that the original patch did have the conversion
    to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.

    Revert the half-arsed attempt that didn't do anything. If we really do
    want the special slab (remember: this is all relevant just for debug
    builds, so it's not necessarily all that critical) we might as well redo
    the patch fully.

    Reported-by: Al Viro
    Acked-by: Andrew Morton
    Cc: Kirill A Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Nov, 2013

2 commits

  • Pull module updates from Rusty Russell:
    "Mainly boring here, too. rmmod --wait finally removed, though"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    modpost: fix bogus 'exported twice' warnings.
    init: fix in-place parameter modification regression
    asmlinkage, module: Make ksymtab and kcrctab symbols and __this_module __visible
    kernel: add support for init_array constructors
    modpost: Optionally ignore secondary errors seen if a single module build fails
    module: remove rmmod --wait option.

    Linus Torvalds
     
  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
    is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab,
    so we loose 24 on each. An average system can easily allocate few tens
    thousands of page->ptl and overhead is significant.

    Let's create a separate slab for page->ptl allocation to solve this.

    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

3 commits

  • Pull networking updates from David Miller:

    1) The addition of nftables. No longer will we need protocol aware
    firewall filtering modules, it can all live in userspace.

    At the core of nftables is a, for lack of a better term, virtual
    machine that executes byte codes to inspect packet or metadata
    (arriving interface index, etc.) and make verdict decisions.

    Besides support for loading packet contents and comparing them, the
    interpreter supports lookups in various datastructures as
    fundamental operations. For example sets are supports, and
    therefore one could create a set of whitelist IP address entries
    which have ACCEPT verdicts attached to them, and use the appropriate
    byte codes to do such lookups.

    Since the interpreted code is composed in userspace, userspace can
    do things like optimize things before giving it to the kernel.

    Another major improvement is the capability of atomically updating
    portions of the ruleset. In the existing netfilter implementation,
    one has to update the entire rule set in order to make a change and
    this is very expensive.

    Userspace tools exist to create nftables rules using existing
    netfilter rule sets, but both kernel implementations will need to
    co-exist for quite some time as we transition from the old to the
    new stuff.

    Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
    worked so hard on this.

    2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
    to our pseudo-random number generator, mostly used for things like
    UDP port randomization and netfitler, amongst other things.

    In particular the taus88 generater is updated to taus113, and test
    cases are added.

    3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
    and Yang Yingliang.

    4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
    Sujir.

    5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
    Neal Cardwell, and Yuchung Cheng.

    6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
    control message data, much like other socket option attributes.
    From Francesco Fusco.

    7) Allow applications to specify a cap on the rate computed
    automatically by the kernel for pacing flows, via a new
    SO_MAX_PACING_RATE socket option. From Eric Dumazet.

    8) Make the initial autotuned send buffer sizing in TCP more closely
    reflect actual needs, from Eric Dumazet.

    9) Currently early socket demux only happens for TCP sockets, but we
    can do it for connected UDP sockets too. Implementation from Shawn
    Bohrer.

    10) Refactor inet socket demux with the goal of improving hash demux
    performance for listening sockets. With the main goals being able
    to use RCU lookups on even request sockets, and eliminating the
    listening lock contention. From Eric Dumazet.

    11) The bonding layer has many demuxes in it's fast path, and an RCU
    conversion was started back in 3.11, several changes here extend the
    RCU usage to even more locations. From Ding Tianhong and Wang
    Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
    Falico.

    12) Allow stackability of segmentation offloads to, in particular, allow
    segmentation offloading over tunnels. From Eric Dumazet.

    13) Significantly improve the handling of secret keys we input into the
    various hash functions in the inet hashtables, TCP fast open, as
    well as syncookies. From Hannes Frederic Sowa. The key fundamental
    operation is "net_get_random_once()" which uses static keys.

    Hannes even extended this to ipv4/ipv6 fragmentation handling and
    our generic flow dissector.

    14) The generic driver layer takes care now to set the driver data to
    NULL on device removal, so it's no longer necessary for drivers to
    explicitly set it to NULL any more. Many drivers have been cleaned
    up in this way, from Jingoo Han.

    15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.

    16) Improve CRC32 interfaces and generic SKB checksum iterators so that
    SCTP's checksumming can more cleanly be handled. Also from Daniel
    Borkmann.

    17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
    using the interface MTU value. This helps avoid PMTU attacks,
    particularly on DNS servers. From Hannes Frederic Sowa.

    18) Use generic XPS for transmit queue steering rather than internal
    (re-)implementation in virtio-net. From Jason Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    random32: add test cases for taus113 implementation
    random32: upgrade taus88 generator to taus113 from errata paper
    random32: move rnd_state to linux/random.h
    random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
    random32: add periodic reseeding
    random32: fix off-by-one in seeding requirement
    PHY: Add RTL8201CP phy_driver to realtek
    xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
    macmace: add missing platform_set_drvdata() in mace_probe()
    ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
    ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
    vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
    ixgbe: add warning when max_vfs is out of range.
    igb: Update link modes display in ethtool
    netfilter: push reasm skb through instead of original frag skbs
    ip6_output: fragment outgoing reassembled skb properly
    MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
    net_sched: tbf: support of 64bit rates
    ixgbe: deleting dfwd stations out of order can cause null ptr deref
    ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
    ...

    Linus Torvalds
     
  • This patch proposes to make init failures more explicit.

    Before this, the "No init found" message didn't help much. It could
    sometimes be misleading and actually mean "No *working* init found".

    This message could hide many different issues:
    - no init program candidates found at all
    - some init program candidates exist but can't be executed (missing
    execute permissions, failed to load shared libraries, executable
    compiled for an unknown architecture...)

    This patch notifies the kernel user when a candidate init program is found
    but can't be executed. In each failure situation, the error code is
    displayed, to quickly find the root cause. "No init found" is also
    replaced by "No working init found", which is more correct.

    This will help embedded Linux developers (especially the newcomers),
    regularly making and debugging new root filesystems.

    Credits to Geert Uytterhoeven and Janne Karhunen for their improvement
    suggestions.

    Signed-off-by: Michael Opdenacker
    Cc: Geert Uytterhoeven
    Tested-by: Janne Karhunen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Opdenacker
     
  • It's already available in

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

31 Oct, 2013

1 commit

  • Before commit 026cee0086fe1df4cf74691cf273062cc769617d
    ("params: _initcall-like kernel parameters") the __setup
    parameter parsing code could modify parameter in the
    static_command_line buffer and such modifications were kept. After
    that commit such modifications are destroyed during per-initcall level
    parameter parsing because the same static_command_line buffer is used
    and only parameters for appropriate initcall level are parsed.

    That change broke at least parsing "ubd" parameter in the ubd driver
    when the COW file is used.

    Now the separate buffer is used for per-initcall parameter parsing.

    Signed-off-by: Krzysztof Mazur
    Signed-off-by: Rusty Russell

    Krzysztof Mazur
     

24 Oct, 2013

1 commit


20 Oct, 2013

1 commit

  • Usage of the static key primitives to toggle a branch must not be used
    before jump_label_init() is called from init/main.c. jump_label_init
    reorganizes and wires up the jump_entries so usage before that could
    have unforeseen consequences.

    Following primitives are now checked for correct use:
    * static_key_slow_inc
    * static_key_slow_dec
    * static_key_slow_dec_deferred
    * jump_label_rate_limit

    The x86 architecture already checks this by testing if the default_nop
    was already replaced with an optimal nop or with a branch instruction. It
    will panic then. Other architectures don't check for this.

    Because we need to relax this check for the x86 arch to allow code to
    transition from default_nop to the enabled state and other architectures
    did not check for this at all this patch introduces checking on the
    static_key primitives in a non-arch dependent manner.

    All checked functions are considered slow-path so the additional check
    does no harm to performance.

    The warnings are best observed with earlyprintk.

    Based on a patch from Andi Kleen.

    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

11 Oct, 2013

2 commits


25 Sep, 2013

1 commit

  • Replace the single preempt_count() 'function' that's an lvalue with
    two proper functions:

    preempt_count() - returns the preempt_count value as rvalue
    preempt_count_set() - Allows setting the preempt-count value

    Also provide preempt_count_ptr() as a convenience wrapper to implement
    all modifying operations.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org
    [ Fixed build failure. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Sep, 2013

1 commit


14 Aug, 2013

1 commit

  • Prepare for using a static key in the context tracking subsystem.
    This will help optimizing the off case on its many users:

    * user_enter, user_exit, exception_enter, exception_exit, guest_enter,
    guest_exit, vtime_*()

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     

07 Jul, 2013

1 commit

  • Pull timer core updates from Thomas Gleixner:
    "The timer changes contain:

    - posix timer code consolidation and fixes for odd corner cases

    - sched_clock implementation moved from ARM to core code to avoid
    duplication by other architectures

    - alarm timer updates

    - clocksource and clockevents unregistration facilities

    - clocksource/events support for new hardware

    - precise nanoseconds RTC readout (Xen feature)

    - generic support for Xen suspend/resume oddities

    - the usual lot of fixes and cleanups all over the place

    The parts which touch other areas (ARM/XEN) have been coordinated with
    the relevant maintainers. Though this results in an handful of
    trivial to solve merge conflicts, which we preferred over nasty cross
    tree merge dependencies.

    The patches which have been committed in the last few days are bug
    fixes plus the posix timer lot. The latter was in akpms queue and
    next for quite some time; they just got forgotten and Frederic
    collected them last minute."

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
    hrtimer: Remove unused variable
    hrtimers: Move SMP function call to thread context
    clocksource: Reselect clocksource when watchdog validated high-res capability
    posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting
    posix_timers: fix racy timer delta caching on task exit
    posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule()
    selftests: add basic posix timers selftests
    posix_cpu_timers: consolidate expired timers check
    posix_cpu_timers: consolidate timer list cleanups
    posix_cpu_timer: consolidate expiry time type
    tick: Sanitize broadcast control logic
    tick: Prevent uncontrolled switch to oneshot mode
    tick: Make oneshot broadcast robust vs. CPU offlining
    x86: xen: Sync the CMOS RTC as well as the Xen wallclock
    x86: xen: Sync the wallclock when the system time is set
    timekeeping: Indicate that clock was set in the pvclock gtod notifier
    timekeeping: Pass flags instead of multiple bools to timekeeping_update()
    xen: Remove clock_was_set() call in the resume path
    hrtimers: Support resuming with two or more CPUs online (but stopped)
    timer: Fix jiffies wrap behavior of round_jiffies_common()
    ...

    Linus Torvalds
     

04 Jul, 2013

1 commit

  • do_one_initcall() uses a 64 byte string buffer to save a message. This
    buffer is declared static and is only used at boot up and when a module
    is loaded. As 64 bytes is very small, and this function has very limited
    scope, there's no reason to waste permanent memory with this string and
    not just simply put it on the stack.

    Signed-off-by: Steven Rostedt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

13 Jun, 2013

1 commit


28 May, 2013

1 commit

  • The current scheme of using the timer tick was fine for per-thread
    events. However, it was causing bias issues in system-wide mode
    (including for uncore PMUs). Event groups would not get their fair
    share of runtime on the PMU. With tickless kernels, if a core is idle
    there is no timer tick, and thus no event rotation (multiplexing).
    However, there are events (especially uncore events) which do count
    even though cores are asleep.

    This patch changes the timer source for multiplexing. It introduces a
    per-PMU per-cpu hrtimer. The advantage is that even when a core goes
    idle, it will come back to service the hrtimer, thus multiplexing on
    system-wide events works much better.

    The per-PMU implementation (suggested by PeterZ) enables adjusting the
    multiplexing interval per PMU. The preferred interval is stashed into
    the struct pmu. If not set, it will be forced to the default interval
    value.

    In order to minimize the impact of the hrtimer, it is turned on and
    off on demand. When the PMU on a CPU is overcommited, the hrtimer is
    activated. It is stopped when the PMU is not overcommitted.

    In order for this to work properly, we had to change the order of
    initialization in start_kernel() such that hrtimer_init() is run
    before perf_event_init().

    The default interval in milliseconds is set to a timer tick just like
    with the old code. We will provide a sysctl to tune this in another
    patch.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

06 May, 2013

1 commit

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to make input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     

02 May, 2013

2 commits

  • The full dynticks tree needs the latest RCU and sched
    upstream updates in order to fix some dependencies.

    Merge a common upstream merge point that has these
    updates.

    Conflicts:
    include/linux/perf_event.h
    kernel/rcutree.h
    kernel/rcutree_plugin.h

    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • Commit f91eb62f71b3 ("init: scream bloody murder if interrupts are
    enabled too early") added three new warnings. The first two seemed
    reasonable, but the third included a warning when an initcall returned
    non-zero. Although, the third WARN() does include an imbalanced preempt
    disabled, or irqs disable, it shouldn't warn if it only had an initcall
    that just returns non-zero.

    In fact, according to Linus, it shouldn't print at all. As it only
    prints with initcall_debug set, and that already shows enough
    information to fix things.

    Link: http://lkml.kernel.org/r/CA+55aFzaBC5SFi7=F2mfm+KWY5qTsBmOqgbbs8E+LUS8JK-sBg@mail.gmail.com

    Suggested-by: Linus Torvalds
    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

30 Apr, 2013

5 commits

  • Pull core timer updates from Ingo Molnar:
    "The main changes in this cycle's merge are:

    - Implement shadow timekeeper to shorten in kernel reader side
    blocking, by Thomas Gleixner.

    - Posix timers enhancements by Pavel Emelyanov:

    - allocate timer ID per process, so that exact timer ID allocations
    can be re-created be checkpoint/restore code.

    - debuggability and tooling (/proc/PID/timers, etc.) improvements.

    - suspend/resume enhancements by Feng Tang: on certain new Intel Atom
    processors (Penwell and Cloverview), there is a feature that the
    TSC won't stop in S3 state, so the TSC value won't be reset to 0
    after resume. This can be taken advantage of by the generic via
    the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
    recover/approximate sleep time, the main (and precise) clocksource
    can be used.

    - Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many
    CPUs the file goes beyond 4MB of size and thus the current
    simplistic seqfile approach fails. Convert /proc/timer_list to a
    proper seq_file with its own iterator.

    - Cleanups and refactorings of the core timekeeping code by John
    Stultz.

    - International Atomic Clock time is managed by the NTP code
    internally currently but not exposed externally. Separate the TAI
    code out and add CLOCK_TAI support and TAI support to the hrtimer
    and posix-timer code, by John Stultz.

    - Add deep idle support enhacement to the broadcast clockevents core
    timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ
    clockevents feature (which will be utilized by future clockevents
    driver updates), which allows the use of IRQ affinities to avoid
    spurious wakeups of idle CPUs - the right CPU with an expiring
    timer will be woken.

    - Add new ARM bcm281xx clocksource driver, by Christian Daudt

    - ... various other fixes and cleanups"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    clockevents: Set dummy handler on CPU_DEAD shutdown
    timekeeping: Update tk->cycle_last in resume
    posix-timers: Remove unused variable
    clockevents: Switch into oneshot mode even if broadcast registered late
    timer_list: Convert timer list to be a proper seq_file
    timer_list: Split timer_list_show_tickdevices
    posix-timers: Show sigevent info in proc file
    posix-timers: Introduce /proc/PID/timers file
    posix timers: Allocate timer id per process (v2)
    timekeeping: Make sure to notify hrtimers when TAI offset changes
    hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
    hrtimer: Add expiry time overflow check in hrtimer_interrupt
    timekeeping: Shorten seq_count region
    timekeeping: Implement a shadow timekeeper
    timekeeping: Delay update of clock->cycle_last
    timekeeping: Store cycle_last value in timekeeper struct as well
    ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
    timekeeping: Simplify tai updating from do_adjtimex
    timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
    timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
    ...

    Linus Torvalds
     
  • Pull SMP/hotplug changes from Ingo Molnar:
    "This is a pretty large, multi-arch series unifying and generalizing
    the various disjunct pieces of idle routines that architectures have
    historically copied from each other and have grown in random, wildly
    inconsistent and sometimes buggy directions:

    101 files changed, 455 insertions(+), 1328 deletions(-)

    this went through a number of review and test iterations before it was
    committed, it was tested on various architectures, was exposed to
    linux-next for quite some time - nevertheless it might cause problems
    on architectures that don't read the mailing lists and don't regularly
    test linux-next.

    This cat herding excercise was motivated by the -rt kernel, and was
    brought to you by Thomas "the Whip" Gleixner."

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    idle: Remove GENERIC_IDLE_LOOP config switch
    um: Use generic idle loop
    ia64: Make sure interrupts enabled when we "safe_halt()"
    sparc: Use generic idle loop
    idle: Remove unused ARCH_HAS_DEFAULT_IDLE
    bfin: Fix typo in arch_cpu_idle()
    xtensa: Use generic idle loop
    x86: Use generic idle loop
    unicore: Use generic idle loop
    tile: Use generic idle loop
    tile: Enter idle with preemption disabled
    sh: Use generic idle loop
    score: Use generic idle loop
    s390: Use generic idle loop
    powerpc: Use generic idle loop
    parisc: Use generic idle loop
    openrisc: Use generic idle loop
    mn10300: Use generic idle loop
    mips: Use generic idle loop
    microblaze: Use generic idle loop
    ...

    Linus Torvalds
     
  • Also enables cleanup of some 80-col trickery.

    Cc: Richard Weinberger
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • If the kernel was booted with the "quiet" boot option we have currently no
    chance to see why an initrd fails. Change KERN_WARNING to KERN_ERR to see
    what is going on.

    Signed-off-by: Richard Weinberger
    Cc: "H. Peter Anvin"
    Cc: Rusty Russell
    Cc: Jim Cromie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • As I was testing a lot of my code recently, and having several
    "successes", I accidentally noticed in the dmesg this little line:

    start_kernel(): bug: interrupts were enabled *very* early, fixing it

    Sure enough, one of my patches two commits ago enabled interrupts early.
    The sad part here is that I never noticed it, and I ran several tests with
    ktest too, and ktest did not notice this line.

    What ktest looks for (and so does many other automated testing scripts) is
    a back trace produced by a WARN_ON() or BUG(). As a back trace was never
    produced, my buggy patch could have slipped into linux-next, or even
    worse, mainline.

    Adding a WARN(!irqs_disabled()) makes this bug a little more obvious:

    PID hash table entries: 4096 (order: 3, 32768 bytes)
    __ex_table already sorted, skipping sort
    Checking aperture...
    No AGP bridge found
    Calgary: detecting Calgary via BIOS EBDA area
    Calgary: Unable to locate Rio Grande table in EBDA - bailing!
    Memory: 2003252k/2054848k available (4857k kernel code, 460k absent, 51136k reserved, 6210k data, 1096k init)
    ------------[ cut here ]------------
    WARNING: at /home/rostedt/work/git/linux-trace.git/init/main.c:543 start_kernel+0x21e/0x415()
    Hardware name: To Be Filled By O.E.M.
    Interrupts were enabled *very* early, fixing it
    Modules linked in:
    Pid: 0, comm: swapper/0 Not tainted 3.8.0-test+ #286
    Call Trace:
    warn_slowpath_common+0x83/0x9b
    warn_slowpath_fmt+0x46/0x48
    start_kernel+0x21e/0x415
    x86_64_start_reservations+0x10e/0x112
    x86_64_start_kernel+0x102/0x111
    ---[ end trace 007d8b0491b4f5d8 ]---
    Preemptible hierarchical RCU implementation.
    RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
    NR_IRQS:4352 nr_irqs:712 16
    Console: colour VGA+ 80x25
    console [ttyS0] enabled, bootconsole disabled

    Do you see it?

    The original version of this patch just slapped a WARN_ON() in there and
    kept the printk(). Ard van Breemen suggested using the WARN() interface,
    which makes the code a bit cleaner.

    Also, while examining other warnings in init/main.c, I found two other
    locations that deserve a bloody murder scream if their conditions are hit,
    and updated them accordingly.

    Signed-off-by: Steven Rostedt
    Cc: Ard van Breemen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

19 Apr, 2013

1 commit

  • We need full dynticks CPU to also be RCU nocb so
    that we don't have to keep the tick to handle RCU
    callbacks.

    Make sure the range passed to nohz_full= boot
    parameter is a subset of rcu_nocbs=

    The CPUs that fail to meet this requirement will be
    excluded from the nohz_full range. This is checked
    early in boot time, before any CPU has the opportunity
    to stop its tick.

    Suggested-by: Steven Rostedt
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

08 Apr, 2013

1 commit

  • For now this calls cpu_idle(), but in the long run we want to move the
    cpu bringup code to the core and therefor we add a state argument.

    Signed-off-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Rusty Russell
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Reviewed-by: Cc: Srivatsa S. Bhat
    Cc: Magnus Damm
    Link: http://lkml.kernel.org/r/20130321215233.583190032@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

07 Mar, 2013

1 commit

  • To convert the clockevents code to cpumask_var_t we need to move the
    init call after the allocator setup.

    Clockevents are earliest registered from time_init() as they need
    interrupts being set up, so this is safe.

    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130306111537.304379448@linutronix.de
    Cc: Rusty Russell

    Thomas Gleixner
     

20 Feb, 2013

1 commit

  • Pull async changes from Tejun Heo:
    "These are followups for the earlier deadlock issue involving async
    ending up waiting for itself through block requesting module[1]. The
    following changes are made by these commits.

    - Instead of requesting default elevator on each request_queue init,
    block now requests it once early during boot.

    - Kmod triggers warning if invoked from an async worker.

    - Async synchronization implementation has been reimplemented. It's
    a lot simpler now."

    * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    async: initialise list heads to fix crash
    async: replace list of active domains with global list of pending items
    async: keep pending tasks on async_domain and remove async_pending
    async: use ULLONG_MAX for infinity cookie value
    async: bring sanity to the use of words domain and running
    async, kmod: warn on synchronous request_module() from async workers
    block: don't request module during elevator init
    init, block: try to load default elevator module early during boot

    Linus Torvalds
     

31 Jan, 2013

1 commit

  • Pull x86 fixes from Peter Anvin:
    "This is a collection of miscellaneous fixes, the most important one is
    the fix for the Samsung laptop bricking issue (auto-blacklisting the
    samsung-laptop driver); the efi_enabled() changes you see below are
    prerequisites for that fix.

    The other issues fixed are booting on OLPC XO-1.5, an UV fix, NMI
    debugging, and requiring CAP_SYS_RAWIO for MSR references, just as
    with I/O port references."

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    samsung-laptop: Disable on EFI hardware
    efi: Make 'efi_enabled' a function to query EFI facilities
    smp: Fix SMP function call empty cpu mask race
    x86/msr: Add capabilities check
    x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES
    x86/olpc: Fix olpc-xo1-sci.c build errors
    arch/x86/platform/uv: Fix incorrect tlb flush all issue
    x86-64: Fix unwind annotations in recent NMI changes
    x86-32: Start out cr0 clean, disable paging before modifying cr3/4

    Linus Torvalds