07 Jul, 2013

1 commit

  • Pull timer core updates from Thomas Gleixner:
    "The timer changes contain:

    - posix timer code consolidation and fixes for odd corner cases

    - sched_clock implementation moved from ARM to core code to avoid
    duplication by other architectures

    - alarm timer updates

    - clocksource and clockevents unregistration facilities

    - clocksource/events support for new hardware

    - precise nanoseconds RTC readout (Xen feature)

    - generic support for Xen suspend/resume oddities

    - the usual lot of fixes and cleanups all over the place

    The parts which touch other areas (ARM/XEN) have been coordinated with
    the relevant maintainers. Though this results in an handful of
    trivial to solve merge conflicts, which we preferred over nasty cross
    tree merge dependencies.

    The patches which have been committed in the last few days are bug
    fixes plus the posix timer lot. The latter was in akpms queue and
    next for quite some time; they just got forgotten and Frederic
    collected them last minute."

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
    hrtimer: Remove unused variable
    hrtimers: Move SMP function call to thread context
    clocksource: Reselect clocksource when watchdog validated high-res capability
    posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting
    posix_timers: fix racy timer delta caching on task exit
    posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule()
    selftests: add basic posix timers selftests
    posix_cpu_timers: consolidate expired timers check
    posix_cpu_timers: consolidate timer list cleanups
    posix_cpu_timer: consolidate expiry time type
    tick: Sanitize broadcast control logic
    tick: Prevent uncontrolled switch to oneshot mode
    tick: Make oneshot broadcast robust vs. CPU offlining
    x86: xen: Sync the CMOS RTC as well as the Xen wallclock
    x86: xen: Sync the wallclock when the system time is set
    timekeeping: Indicate that clock was set in the pvclock gtod notifier
    timekeeping: Pass flags instead of multiple bools to timekeeping_update()
    xen: Remove clock_was_set() call in the resume path
    hrtimers: Support resuming with two or more CPUs online (but stopped)
    timer: Fix jiffies wrap behavior of round_jiffies_common()
    ...

    Linus Torvalds
     

04 Jul, 2013

1 commit

  • do_one_initcall() uses a 64 byte string buffer to save a message. This
    buffer is declared static and is only used at boot up and when a module
    is loaded. As 64 bytes is very small, and this function has very limited
    scope, there's no reason to waste permanent memory with this string and
    not just simply put it on the stack.

    Signed-off-by: Steven Rostedt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

13 Jun, 2013

1 commit


28 May, 2013

1 commit

  • The current scheme of using the timer tick was fine for per-thread
    events. However, it was causing bias issues in system-wide mode
    (including for uncore PMUs). Event groups would not get their fair
    share of runtime on the PMU. With tickless kernels, if a core is idle
    there is no timer tick, and thus no event rotation (multiplexing).
    However, there are events (especially uncore events) which do count
    even though cores are asleep.

    This patch changes the timer source for multiplexing. It introduces a
    per-PMU per-cpu hrtimer. The advantage is that even when a core goes
    idle, it will come back to service the hrtimer, thus multiplexing on
    system-wide events works much better.

    The per-PMU implementation (suggested by PeterZ) enables adjusting the
    multiplexing interval per PMU. The preferred interval is stashed into
    the struct pmu. If not set, it will be forced to the default interval
    value.

    In order to minimize the impact of the hrtimer, it is turned on and
    off on demand. When the PMU on a CPU is overcommited, the hrtimer is
    activated. It is stopped when the PMU is not overcommitted.

    In order for this to work properly, we had to change the order of
    initialization in start_kernel() such that hrtimer_init() is run
    before perf_event_init().

    The default interval in milliseconds is set to a timer tick just like
    with the old code. We will provide a sysctl to tune this in another
    patch.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

06 May, 2013

1 commit

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to make input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     

02 May, 2013

2 commits

  • The full dynticks tree needs the latest RCU and sched
    upstream updates in order to fix some dependencies.

    Merge a common upstream merge point that has these
    updates.

    Conflicts:
    include/linux/perf_event.h
    kernel/rcutree.h
    kernel/rcutree_plugin.h

    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • Commit f91eb62f71b3 ("init: scream bloody murder if interrupts are
    enabled too early") added three new warnings. The first two seemed
    reasonable, but the third included a warning when an initcall returned
    non-zero. Although, the third WARN() does include an imbalanced preempt
    disabled, or irqs disable, it shouldn't warn if it only had an initcall
    that just returns non-zero.

    In fact, according to Linus, it shouldn't print at all. As it only
    prints with initcall_debug set, and that already shows enough
    information to fix things.

    Link: http://lkml.kernel.org/r/CA+55aFzaBC5SFi7=F2mfm+KWY5qTsBmOqgbbs8E+LUS8JK-sBg@mail.gmail.com

    Suggested-by: Linus Torvalds
    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

30 Apr, 2013

5 commits

  • Pull core timer updates from Ingo Molnar:
    "The main changes in this cycle's merge are:

    - Implement shadow timekeeper to shorten in kernel reader side
    blocking, by Thomas Gleixner.

    - Posix timers enhancements by Pavel Emelyanov:

    - allocate timer ID per process, so that exact timer ID allocations
    can be re-created be checkpoint/restore code.

    - debuggability and tooling (/proc/PID/timers, etc.) improvements.

    - suspend/resume enhancements by Feng Tang: on certain new Intel Atom
    processors (Penwell and Cloverview), there is a feature that the
    TSC won't stop in S3 state, so the TSC value won't be reset to 0
    after resume. This can be taken advantage of by the generic via
    the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
    recover/approximate sleep time, the main (and precise) clocksource
    can be used.

    - Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many
    CPUs the file goes beyond 4MB of size and thus the current
    simplistic seqfile approach fails. Convert /proc/timer_list to a
    proper seq_file with its own iterator.

    - Cleanups and refactorings of the core timekeeping code by John
    Stultz.

    - International Atomic Clock time is managed by the NTP code
    internally currently but not exposed externally. Separate the TAI
    code out and add CLOCK_TAI support and TAI support to the hrtimer
    and posix-timer code, by John Stultz.

    - Add deep idle support enhacement to the broadcast clockevents core
    timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ
    clockevents feature (which will be utilized by future clockevents
    driver updates), which allows the use of IRQ affinities to avoid
    spurious wakeups of idle CPUs - the right CPU with an expiring
    timer will be woken.

    - Add new ARM bcm281xx clocksource driver, by Christian Daudt

    - ... various other fixes and cleanups"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    clockevents: Set dummy handler on CPU_DEAD shutdown
    timekeeping: Update tk->cycle_last in resume
    posix-timers: Remove unused variable
    clockevents: Switch into oneshot mode even if broadcast registered late
    timer_list: Convert timer list to be a proper seq_file
    timer_list: Split timer_list_show_tickdevices
    posix-timers: Show sigevent info in proc file
    posix-timers: Introduce /proc/PID/timers file
    posix timers: Allocate timer id per process (v2)
    timekeeping: Make sure to notify hrtimers when TAI offset changes
    hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
    hrtimer: Add expiry time overflow check in hrtimer_interrupt
    timekeeping: Shorten seq_count region
    timekeeping: Implement a shadow timekeeper
    timekeeping: Delay update of clock->cycle_last
    timekeeping: Store cycle_last value in timekeeper struct as well
    ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
    timekeeping: Simplify tai updating from do_adjtimex
    timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
    timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
    ...

    Linus Torvalds
     
  • Pull SMP/hotplug changes from Ingo Molnar:
    "This is a pretty large, multi-arch series unifying and generalizing
    the various disjunct pieces of idle routines that architectures have
    historically copied from each other and have grown in random, wildly
    inconsistent and sometimes buggy directions:

    101 files changed, 455 insertions(+), 1328 deletions(-)

    this went through a number of review and test iterations before it was
    committed, it was tested on various architectures, was exposed to
    linux-next for quite some time - nevertheless it might cause problems
    on architectures that don't read the mailing lists and don't regularly
    test linux-next.

    This cat herding excercise was motivated by the -rt kernel, and was
    brought to you by Thomas "the Whip" Gleixner."

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    idle: Remove GENERIC_IDLE_LOOP config switch
    um: Use generic idle loop
    ia64: Make sure interrupts enabled when we "safe_halt()"
    sparc: Use generic idle loop
    idle: Remove unused ARCH_HAS_DEFAULT_IDLE
    bfin: Fix typo in arch_cpu_idle()
    xtensa: Use generic idle loop
    x86: Use generic idle loop
    unicore: Use generic idle loop
    tile: Use generic idle loop
    tile: Enter idle with preemption disabled
    sh: Use generic idle loop
    score: Use generic idle loop
    s390: Use generic idle loop
    powerpc: Use generic idle loop
    parisc: Use generic idle loop
    openrisc: Use generic idle loop
    mn10300: Use generic idle loop
    mips: Use generic idle loop
    microblaze: Use generic idle loop
    ...

    Linus Torvalds
     
  • Also enables cleanup of some 80-col trickery.

    Cc: Richard Weinberger
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • If the kernel was booted with the "quiet" boot option we have currently no
    chance to see why an initrd fails. Change KERN_WARNING to KERN_ERR to see
    what is going on.

    Signed-off-by: Richard Weinberger
    Cc: "H. Peter Anvin"
    Cc: Rusty Russell
    Cc: Jim Cromie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • As I was testing a lot of my code recently, and having several
    "successes", I accidentally noticed in the dmesg this little line:

    start_kernel(): bug: interrupts were enabled *very* early, fixing it

    Sure enough, one of my patches two commits ago enabled interrupts early.
    The sad part here is that I never noticed it, and I ran several tests with
    ktest too, and ktest did not notice this line.

    What ktest looks for (and so does many other automated testing scripts) is
    a back trace produced by a WARN_ON() or BUG(). As a back trace was never
    produced, my buggy patch could have slipped into linux-next, or even
    worse, mainline.

    Adding a WARN(!irqs_disabled()) makes this bug a little more obvious:

    PID hash table entries: 4096 (order: 3, 32768 bytes)
    __ex_table already sorted, skipping sort
    Checking aperture...
    No AGP bridge found
    Calgary: detecting Calgary via BIOS EBDA area
    Calgary: Unable to locate Rio Grande table in EBDA - bailing!
    Memory: 2003252k/2054848k available (4857k kernel code, 460k absent, 51136k reserved, 6210k data, 1096k init)
    ------------[ cut here ]------------
    WARNING: at /home/rostedt/work/git/linux-trace.git/init/main.c:543 start_kernel+0x21e/0x415()
    Hardware name: To Be Filled By O.E.M.
    Interrupts were enabled *very* early, fixing it
    Modules linked in:
    Pid: 0, comm: swapper/0 Not tainted 3.8.0-test+ #286
    Call Trace:
    warn_slowpath_common+0x83/0x9b
    warn_slowpath_fmt+0x46/0x48
    start_kernel+0x21e/0x415
    x86_64_start_reservations+0x10e/0x112
    x86_64_start_kernel+0x102/0x111
    ---[ end trace 007d8b0491b4f5d8 ]---
    Preemptible hierarchical RCU implementation.
    RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
    NR_IRQS:4352 nr_irqs:712 16
    Console: colour VGA+ 80x25
    console [ttyS0] enabled, bootconsole disabled

    Do you see it?

    The original version of this patch just slapped a WARN_ON() in there and
    kept the printk(). Ard van Breemen suggested using the WARN() interface,
    which makes the code a bit cleaner.

    Also, while examining other warnings in init/main.c, I found two other
    locations that deserve a bloody murder scream if their conditions are hit,
    and updated them accordingly.

    Signed-off-by: Steven Rostedt
    Cc: Ard van Breemen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

19 Apr, 2013

1 commit

  • We need full dynticks CPU to also be RCU nocb so
    that we don't have to keep the tick to handle RCU
    callbacks.

    Make sure the range passed to nohz_full= boot
    parameter is a subset of rcu_nocbs=

    The CPUs that fail to meet this requirement will be
    excluded from the nohz_full range. This is checked
    early in boot time, before any CPU has the opportunity
    to stop its tick.

    Suggested-by: Steven Rostedt
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

08 Apr, 2013

1 commit

  • For now this calls cpu_idle(), but in the long run we want to move the
    cpu bringup code to the core and therefor we add a state argument.

    Signed-off-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Rusty Russell
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Reviewed-by: Cc: Srivatsa S. Bhat
    Cc: Magnus Damm
    Link: http://lkml.kernel.org/r/20130321215233.583190032@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

07 Mar, 2013

1 commit

  • To convert the clockevents code to cpumask_var_t we need to move the
    init call after the allocator setup.

    Clockevents are earliest registered from time_init() as they need
    interrupts being set up, so this is safe.

    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130306111537.304379448@linutronix.de
    Cc: Rusty Russell

    Thomas Gleixner
     

20 Feb, 2013

1 commit

  • Pull async changes from Tejun Heo:
    "These are followups for the earlier deadlock issue involving async
    ending up waiting for itself through block requesting module[1]. The
    following changes are made by these commits.

    - Instead of requesting default elevator on each request_queue init,
    block now requests it once early during boot.

    - Kmod triggers warning if invoked from an async worker.

    - Async synchronization implementation has been reimplemented. It's
    a lot simpler now."

    * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    async: initialise list heads to fix crash
    async: replace list of active domains with global list of pending items
    async: keep pending tasks on async_domain and remove async_pending
    async: use ULLONG_MAX for infinity cookie value
    async: bring sanity to the use of words domain and running
    async, kmod: warn on synchronous request_module() from async workers
    block: don't request module during elevator init
    init, block: try to load default elevator module early during boot

    Linus Torvalds
     

31 Jan, 2013

2 commits

  • Pull x86 fixes from Peter Anvin:
    "This is a collection of miscellaneous fixes, the most important one is
    the fix for the Samsung laptop bricking issue (auto-blacklisting the
    samsung-laptop driver); the efi_enabled() changes you see below are
    prerequisites for that fix.

    The other issues fixed are booting on OLPC XO-1.5, an UV fix, NMI
    debugging, and requiring CAP_SYS_RAWIO for MSR references, just as
    with I/O port references."

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    samsung-laptop: Disable on EFI hardware
    efi: Make 'efi_enabled' a function to query EFI facilities
    smp: Fix SMP function call empty cpu mask race
    x86/msr: Add capabilities check
    x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES
    x86/olpc: Fix olpc-xo1-sci.c build errors
    arch/x86/platform/uv: Fix incorrect tlb flush all issue
    x86-64: Fix unwind annotations in recent NMI changes
    x86-32: Start out cr0 clean, disable paging before modifying cr3/4

    Linus Torvalds
     
  • Originally 'efi_enabled' indicated whether a kernel was booted from
    EFI firmware. Over time its semantics have changed, and it now
    indicates whether or not we are booted on an EFI machine with
    bit-native firmware, e.g. 64-bit kernel with 64-bit firmware.

    The immediate motivation for this patch is the bug report at,

    https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557

    which details how running a platform driver on an EFI machine that is
    designed to run under BIOS can cause the machine to become
    bricked. Also, the following report,

    https://bugzilla.kernel.org/show_bug.cgi?id=47121

    details how running said driver can also cause Machine Check
    Exceptions. Drivers need a new means of detecting whether they're
    running on an EFI machine, as sadly the expression,

    if (!efi_enabled)

    hasn't been a sufficient condition for quite some time.

    Users actually want to query 'efi_enabled' for different reasons -
    what they really want access to is the list of available EFI
    facilities.

    For instance, the x86 reboot code needs to know whether it can invoke
    the ResetSystem() function provided by the EFI runtime services, while
    the ACPI OSL code wants to know whether the EFI config tables were
    mapped successfully. There are also checks in some of the platform
    driver code to simply see if they're running on an EFI machine (which
    would make it a bad idea to do BIOS-y things).

    This patch is a prereq for the samsung-laptop fix patch.

    Cc: David Airlie
    Cc: Corentin Chary
    Cc: Matthew Garrett
    Cc: Dave Jiang
    Cc: Olof Johansson
    Cc: Peter Jones
    Cc: Colin Ian King
    Cc: Steve Langasek
    Cc: Tony Luck
    Cc: Konrad Rzeszutek Wilk
    Cc: Rafael J. Wysocki
    Cc:
    Signed-off-by: Matt Fleming
    Signed-off-by: H. Peter Anvin

    Matt Fleming
     

24 Jan, 2013

1 commit


19 Jan, 2013

1 commit

  • This patch adds default module loading and uses it to load the default
    block elevator. During boot, it's called right after initramfs or
    initrd is made available and right before control is passed to
    userland. This ensures that as long as the modules are available in
    the usual places in initramfs, initrd or the root filesystem, the
    default modules are loaded as soon as possible.

    This will replace the on-demand elevator module loading from elevator
    init path.

    v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test
    robot.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Cc: Linus Torvalds
    Cc: Alex Riesen
    Cc: Fengguang We

    Tejun Heo
     

26 Dec, 2012

1 commit

  • Commit d6b2123802d "make sure that we always have a return path from
    kernel_execve()" reshuffled kernel_init()/init_post() to ensure that
    kernel_execve() has a caller to return to.

    It removed __init annotation for kernel_init() and introduced/calls a
    new routine kernel_init_freeable(). Latter however is inlined by any
    reasonable compiler (ARC gcc 4.4 in this case), causing slight code
    bloat.

    This patch forces kernel_init_freeable() as noinline reducing the .text

    bloat-o-meter vmlinux vmlinux_new
    add/remove: 1/0 grow/shrink: 0/1 up/down: 374/-334 (40)
    function old new delta
    kernel_init_freeable - 374 +374 (.init.text)
    kernel_init 628 294 -334 (.text)

    Signed-off-by: Al Viro

    Vineet Gupta
     

21 Dec, 2012

1 commit

  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
     

20 Dec, 2012

1 commit

  • All architectures have
    CONFIG_GENERIC_KERNEL_THREAD
    CONFIG_GENERIC_KERNEL_EXECVE
    __ARCH_WANT_SYS_EXECVE
    None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
    of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
    Kill the conditionals and make both callers use do_execve().

    Signed-off-by: Al Viro

    Al Viro
     

18 Dec, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc//ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_mayaccess permission
    checks are always applied.

    The files under /proc//ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through the David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were commited here adn
    in David Miller's -net tree so that I could complete the work on the
    /proc//ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing that user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     

16 Dec, 2012

1 commit

  • This reverts commit bd52276fa1d4 ("x86-64/efi: Use EFI to deal with
    platform wall clock (again)"), and the two supporting commits:

    da5a108d05b4: "x86/kernel: remove tboot 1:1 page table creation code"

    185034e72d59: "x86, efi: 1:1 pagetable mapping for virtual EFI calls")

    as they all depend semantically on commit 53b87cf088e2 ("x86, mm:
    Include the entire kernel memory map in trampoline_pgd") that got
    reverted earlier due to the problems it caused.

    This was pointed out by Yinghai Lu, and verified by me on my Macbook Air
    that uses EFI.

    Pointed-out-by: Yinghai Lu
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Dec, 2012

1 commit

  • Pull x86 EFI update from Peter Anvin:
    "EFI tree, from Matt Fleming. Most of the patches are the new efivarfs
    filesystem by Matt Garrett & co. The balance are support for EFI
    wallclock in the absence of a hardware-specific driver, and various
    fixes and cleanups."

    * 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    efivarfs: Make efivarfs_fill_super() static
    x86, efi: Check table header length in efi_bgrt_init()
    efivarfs: Use query_variable_info() to limit kmalloc()
    efivarfs: Fix return value of efivarfs_file_write()
    efivarfs: Return a consistent error when efivarfs_get_inode() fails
    efivarfs: Make 'datasize' unsigned long
    efivarfs: Add unique magic number
    efivarfs: Replace magic number with sizeof(attributes)
    efivarfs: Return an error if we fail to read a variable
    efi: Clarify GUID length calculations
    efivarfs: Implement exclusive access for {get,set}_variable
    efivarfs: efivarfs_fill_super() ensure we clean up correctly on error
    efivarfs: efivarfs_fill_super() ensure we free our temporary name
    efivarfs: efivarfs_fill_super() fix inode reference counts
    efivarfs: efivarfs_create() ensure we drop our reference on inode on error
    efivarfs: efivarfs_file_read ensure we free data in error paths
    x86-64/efi: Use EFI to deal with platform wall clock (again)
    x86/kernel: remove tboot 1:1 page table creation code
    x86, efi: 1:1 pagetable mapping for virtual EFI calls
    x86, mm: Include the entire kernel memory map in trampoline_pgd
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that has normal or high memory.
    N_MEMORY stands for the nodes that has any memory.

    The code here need to handle with the nodes which have memory, we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

19 Nov, 2012

1 commit

  • Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
    for the system init process, and another way for pid namespace
    init processes test pid->nr == 1 and use the same code for both.

    For the global init this results in SIGNAL_UNKILLABLE being set
    much earlier in the initialization process.

    This is a small cleanup and it paves the way for allowing unshare and
    enter of the pid namespace as that path like our global init also will
    not set CLONE_NEWPID.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

02 Nov, 2012

1 commit


30 Oct, 2012

1 commit

  • Other than ix86, x86-64 on EFI so far didn't set the
    {g,s}et_wallclock accessors to the EFI routines, thus
    incorrectly using raw RTC accesses instead.

    Simply removing the #ifdef around the respective code isn't
    enough, however: While so far early get-time calls were done in
    physical mode, this doesn't work properly for x86-64, as virtual
    addresses would still need to be set up for all runtime regions
    (which wasn't the case on the system I have access to), so
    instead the patch moves the call to efi_enter_virtual_mode()
    ahead (which in turn allows to drop all code related to calling
    efi-get-time in physical mode).

    Additionally the earlier calling of efi_set_executable()
    requires the CPA code to cope, i.e. during early boot it must be
    avoided to call cpa_flush_array(), as the first thing this
    function does is a BUG_ON(irqs_disabled()).

    Also make the two EFI functions in question here static -
    they're not being referenced elsewhere.

    History:

    This commit was originally merged as bacef661acdb ("x86-64/efi:
    Use EFI to deal with platform wall clock") but it resulted in some
    ASUS machines no longer booting due to a firmware bug, and so was
    reverted in f026cfa82f62. A pre-emptive fix for the buggy ASUS
    firmware was merged in 03a1c254975e ("x86, efi: 1:1 pagetable
    mapping for virtual EFI calls") so now this patch can be
    reapplied.

    Signed-off-by: Jan Beulich
    Tested-by: Matt Fleming
    Acked-by: Matthew Garrett
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: H. Peter Anvin
    Signed-off-by: Matt Fleming [added commit history]

    Jan Beulich
     

13 Oct, 2012

2 commits

  • Pull third pile of kernel_execve() patches from Al Viro:
    "The last bits of infrastructure for kernel_thread() et.al., with
    alpha/arm/x86 use of those. Plus sanitizing the asm glue and
    do_notify_resume() on alpha, fixing the "disabled irq while running
    task_work stuff" breakage there.

    At that point the rest of kernel_thread/kernel_execve/sys_execve work
    can be done independently for different architectures. The only
    pending bits that do depend on having all architectures converted are
    restrictred to fs/* and kernel/* - that'll obviously have to wait for
    the next cycle.

    I thought we'd have to wait for all of them done before we start
    eliminating the longjump-style insanity in kernel_execve(), but it
    turned out there's a very simple way to do that without flagday-style
    changes."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to saner kernel_execve() semantics
    arm: switch to saner kernel_execve() semantics
    x86, um: convert to saner kernel_execve() semantics
    infrastructure for saner ret_from_kernel_thread semantics
    make sure that kernel_thread() callbacks call do_exit() themselves
    make sure that we always have a return path from kernel_execve()
    ppc: eeh_event should just use kthread_run()
    don't bother with kernel_thread/kernel_execve for launching linuxrc
    alpha: get rid of switch_stack argument of do_work_pending()
    alpha: don't bother passing switch_stack separately from regs
    alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
    alpha: simplify TIF_NEED_RESCHED handling

    Linus Torvalds
     
  • * allow kernel_execve() leave the actual return to userland to
    caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers
    updated accordingly.
    * architecture that does select GENERIC_KERNEL_EXECVE in its
    Kconfig should have its ret_from_kernel_thread() do this:
    call schedule_tail
    call the callback left for it by copy_thread(); if it ever
    returns, that's because it has just done successful kernel_execve()
    jump to return from syscall
    IOW, its only difference from ret_from_fork() is that it does call the
    callback.
    * such an architecture should also get rid of ret_from_kernel_execve()
    and __ARCH_WANT_KERNEL_EXECVE

    This is the last part of infrastructure patches in that area - from
    that point on work on different architectures can live independently.

    Signed-off-by: Al Viro

    Al Viro
     

12 Oct, 2012

1 commit

  • The only place where kernel_execve() is called without a way to
    return to the caller of kernel_thread() callback is kernel_post().
    Reorganize kernel_init()/kernel_post() - instead of the former
    calling the latter in the end (and getting freed by it), have the
    latter *begin* with calling the former (and turn the latter into
    kernel_thread() callback, of course).

    Signed-off-by: Al Viro

    Al Viro
     

09 Oct, 2012

1 commit

  • After both prio_tree users have been converted to use red-black trees,
    there is no need to keep around the prio tree library anymore.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

30 Sep, 2012

2 commits

  • The ACPI BGRT driver accesses the BIOS logo image when it initializes.
    However, ACPI 5.0 (which introduces the BGRT) recommends putting the
    logo image in EFI boot services memory, so that the OS can reclaim that
    memory. Production systems follow this recommendation, breaking the
    ACPI BGRT driver.

    Move the bulk of the BGRT code to run during a new EFI late
    initialization phase, which occurs after switching EFI to virtual mode,
    and after initializing ACPI, but before freeing boot services memory.
    Copy the BIOS logo image to kernel memory at that point, and make it
    accessible to the BGRT driver. Rework the existing ACPI BGRT driver to
    act as a simple wrapper exposing that image (and the properties from the
    BGRT) via sysfs.

    Signed-off-by: Josh Triplett
    Link: http://lkml.kernel.org/r/93ce9f823f1c1f3bb88bdd662cce08eee7a17f5d.1348876882.git.josh@joshtriplett.org
    Signed-off-by: H. Peter Anvin

    Josh Triplett
     
  • Some new ACPI 5.0 tables reference resources stored in boot services
    memory, so keep that memory around until we have ACPI and can extract
    data from it.

    Signed-off-by: Josh Triplett
    Link: http://lkml.kernel.org/r/baaa6d44bdc4eb0c58e5d1b4ccd2c729f854ac55.1348876882.git.josh@joshtriplett.org
    Signed-off-by: H. Peter Anvin

    Josh Triplett
     

15 Aug, 2012

1 commit

  • This reverts commit bacef661acdb634170a8faddbc1cf28e8f8b9eee.

    This commit has been found to cause serious regressions on a number of
    ASUS machines at the least. We probably need to provide a 1:1 map in
    addition to the EFI virtual memory map in order for this to work.

    Signed-off-by: H. Peter Anvin
    Reported-and-bisected-by: Jérôme Carretero
    Cc: Jan Beulich
    Cc: Matt Fleming
    Cc: Matthew Garrett
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120805172903.5f8bb24c@zougloub.eu

    H. Peter Anvin
     

01 Aug, 2012

1 commit

  • When hotadd_new_pgdat() is called to create new pgdat for a new node, a
    fallback zonelist should be created for the new node. There's code to try
    to achieve that in hotadd_new_pgdat() as below:

    /*
    * The node we allocated has no zone fallback lists. For avoiding
    * to access not-initialized zonelist, build here.
    */
    mutex_lock(&zonelists_mutex);
    build_all_zonelists(pgdat, NULL);
    mutex_unlock(&zonelists_mutex);

    But it doesn't work as expected. When hotadd_new_pgdat() is called, the
    new node is still in offline state because node_set_online(nid) hasn't
    been called yet. And build_all_zonelists() only builds zonelists for
    online nodes as:

    for_each_online_node(nid) {
    pg_data_t *pgdat = NODE_DATA(nid);

    build_zonelists(pgdat);
    build_zonelist_cache(pgdat);
    }

    Though we hope to create zonelist for the new pgdat, but it doesn't. So
    add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
    the new pgdat too.

    Signed-off-by: Jiang Liu
    Signed-off-by: Xishi Qiu
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rusty Russell
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

27 Jul, 2012

2 commits

  • …usty/linux-2.6-for-linus

    Pull cpumask changes from Rusty Russell:
    "Trivial comment changes to cpumask code. I guess it's getting boring."

    Boring is good.

    * tag 'cpumask-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    cpumask: cpulist_parse() comments correction
    init: add comments to keep initcall-names in sync with initcall levels
    cpumask: add a few comments of cpumask functions

    Linus Torvalds
     
  • main.c has initcall_level_names[] for parse_args to print in debug messages,
    add comments to keep them in sync with initcalls defined in init.h.

    Also add "loadable" into comment re not using *_initcall macros in
    modules, to disambiguate from kernel/params.c and other builtins.

    Signed-off-by: Jim Cromie
    Acked-by: Borislav Petkov
    Signed-off-by: Rusty Russell

    Jim Cromie