26 Feb, 2009

1 commit

  • This patch fixes a bug located by Vegard Nossum with the aid of
    kmemcheck, updated based on review comments from Nick Piggin,
    Ingo Molnar, and Andrew Morton. It also cleans up the variable-name
    and function-name language. ;-)

    The boot CPU runs in the context of its idle thread during boot-up.
    During this time, idle_cpu(0) will always return nonzero, which will
    fool Classic and Hierarchical RCU into deciding that a large chunk of
    the boot-up sequence is a big long quiescent state. This in turn causes
    RCU to prematurely end grace periods during this time.

    This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
    function to ignore the idle task as a quiescent state until the
    system has started up the scheduler in rest_init(), introducing a
    new non-API function rcu_idle_now_means_idle() to inform RCU of this
    transition. RCU maintains an internal rcu_idle_cpu_truthful variable
    to track this state, which is then used by rcu_check_callbacks() to
    determine if it should believe idle_cpu().

    Because this patch has the effect of disallowing RCU grace periods
    during long stretches of the boot-up sequence, this patch also introduces
    Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
    no-op if num_online_cpus() returns 1. This allows boot-time code that
    calls synchronize_rcu() to proceed normally. Note, however, that RCU
    callbacks registered by call_rcu() will likely queue up until later in
    the boot sequence. Although rcuclassic and rcutree can also use this
    same optimization after boot completes, rcupreempt must restrict its
    use of this optimization to the portion of the boot sequence before the
    scheduler starts up, given that an rcupreempt RCU read-side critical
    section may be preempted.
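
    The decisions described above can be modelled in user space roughly as
    follows. This is only a sketch under stated assumptions: the struct, its
    fields, and the *_model function names are hypothetical stand-ins for
    kernel state (num_online_cpus(), the early-boot flag, idle_cpu()), not
    the code of the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the kernel state the commit message refers to. */
struct boot_state {
	int online_cpus;        /* stands in for num_online_cpus() */
	bool scheduler_running; /* true once rest_init() has started the scheduler */
	bool cpu_is_idle_task;  /* stands in for idle_cpu() on this CPU */
};

/* rcuclassic/rcutree flavour: with one CPU online, blocking is a grace period. */
bool blocking_is_gp_classic_model(const struct boot_state *s)
{
	assert(s->online_cpus >= 1);
	return s->online_cpus == 1;
}

/* rcupreempt flavour: readers can be preempted, so also require early boot. */
bool blocking_is_gp_preempt_model(const struct boot_state *s)
{
	assert(s->online_cpus >= 1);
	return s->online_cpus == 1 && !s->scheduler_running;
}

/* Idle counts as a quiescent state only once "idle" really means idle. */
bool idle_is_quiescent_model(const struct boot_state *s)
{
	return s->scheduler_running && s->cpu_is_idle_task;
}
```

    With one CPU online before the scheduler starts, both flavours treat a
    blocking call as a grace period while the idle task is not yet believed.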

    In addition, this patch takes Nick Piggin's suggestion to make the
    system_state global variable be __read_mostly.

    Changes since v4:

    o Changes the name of the introduced function and variable to
    be less emotional. ;-)

    Changes since v3:

    o Added WARN_ON(nr_context_switches() > 0) to verify that RCU
    switches out of boot-time mode before the first context
    switch, as suggested by Nick Piggin.

    Changes since v2:

    o Created rcu_blocking_is_gp() internal-to-RCU API that
    determines whether a call to synchronize_rcu() is itself
    a grace period.

    o The definition of rcu_blocking_is_gp() for rcuclassic and
    rcutree checks to see if but a single CPU is online.

    o The definition of rcu_blocking_is_gp() for rcupreempt
    checks to see both if but a single CPU is online and if
    the system is still in early boot.

    This allows rcupreempt to again work correctly if running
    on a single CPU after booting is complete.

    o Added check to rcupreempt's synchronize_sched() for there
    being but one online CPU.

    Tested all three variants both SMP and !SMP, booted fine, passed a short
    rcutorture test on both x86 and Power.

    Located-by: Vegard Nossum
    Tested-by: Vegard Nossum
    Tested-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

08 Jan, 2009

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async:
    async: don't do the initcall stuff post boot
    bootchart: improve output based on Dave Jones' feedback
    async: make the final inode deletion an asynchronous event
    fastboot: Make libata initialization even more async
    fastboot: make the libata port scan asynchronous
    fastboot: make scsi probes asynchronous
    async: Asynchronous function calls to speed up kernel boot

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
    trivial: chack -> check typo fix in main Makefile
    trivial: Add a space (and a comma) to a printk in 8250 driver
    trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx
    trivial: Fix misspelling of "firmware" in powerpc Makefile
    trivial: Fix misspelling of "firmware" in usb.c
    trivial: Fix misspelling of "firmware" in qla1280.c
    trivial: Fix misspelling of "firmware" in a100u2w.c
    trivial: Fix misspelling of "firmware" in megaraid.c
    trivial: Fix misspelling of "firmware" in ql4_mbx.c
    trivial: Fix misspelling of "firmware" in acpi_memhotplug.c
    trivial: Fix misspelling of "firmware" in ipw2100.c
    trivial: Fix misspelling of "firmware" in atmel.c
    trivial: Fix misspelled firmware in Kconfig
    trivial: fix an -> a typos in documentation and comments
    trivial: fix then -> than typos in comments and documentation
    trivial: update Jesper Juhl CREDITS entry with new email
    trivial: fix singal -> signal typo
    trivial: Fix incorrect use of "loose" in event.c
    trivial: printk: fix indentation of new_text_line declaration
    trivial: rtc-stk17ta8: fix sparse warning
    ...

    Linus Torvalds
     
  • Right now, most of the kernel boot is strictly synchronous, such that
    various hardware delays are done sequentially.

    In order to make the kernel boot faster, this patch introduces
    infrastructure to allow doing some of the initialization steps
    asynchronously, which will hide significant portions of the hardware delays
    in practice.

    In order to not change device order and other similar observables, this
    patch does NOT do full parallel initialization.

    Rather, it operates more in the way an out of order CPU does; the work may
    be done out of order and asynchronous, but the observable effects
    (instruction retiring for the CPU) are still done in the original sequence.
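
    The out-of-order analogy can be illustrated with a small single-threaded
    model: work may finish in any order, but its effects become visible
    strictly in submission order. Everything here (struct async_model,
    model_submit(), model_finish()) is hypothetical and is not the
    kernel's async API.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_JOBS 8

/* Hypothetical model: jobs finish in any order but retire in submit order. */
struct async_model {
	bool done[MAX_JOBS];   /* has the job's work phase finished? */
	int retired[MAX_JOBS]; /* job ids in the order their effects became visible */
	int nr_submitted;
	int nr_retired;
};

/* Submit a job; the returned id doubles as its position in the order. */
int model_submit(struct async_model *m)
{
	assert(m->nr_submitted < MAX_JOBS);
	m->done[m->nr_submitted] = false;
	return m->nr_submitted++;
}

/* Mark a job's work as finished, then retire every job that is now unblocked. */
void model_finish(struct async_model *m, int id)
{
	assert(id >= 0 && id < m->nr_submitted);
	m->done[id] = true;
	while (m->nr_retired < m->nr_submitted && m->done[m->nr_retired]) {
		m->retired[m->nr_retired] = m->nr_retired;
		m->nr_retired++;
	}
}
```

    Finishing the last-submitted job first retires nothing; it only becomes
    visible once every earlier job has finished as well.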

    Signed-off-by: Arjan van de Ven

    Arjan van de Ven
     

07 Jan, 2009

3 commits


06 Jan, 2009

1 commit


04 Jan, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits)
    x86: setup_per_cpu_areas() cleanup
    cpumask: fix compile error when CONFIG_NR_CPUS is not defined
    cpumask: use alloc_cpumask_var_node where appropriate
    cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t
    x86: use cpumask_var_t in acpi/boot.c
    x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
    sched: put back some stack hog changes that were undone in kernel/sched.c
    x86: enable cpus display of kernel_max and offlined cpus
    ia64: cpumask fix for is_affinity_mask_valid()
    cpumask: convert RCU implementations, fix
    xtensa: define __fls
    mn10300: define __fls
    m32r: define __fls
    h8300: define __fls
    frv: define __fls
    cris: define __fls
    cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
    cpumask: zero extra bits in alloc_cpumask_var_node
    cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
    cpumask: convert mm/
    ...

    Linus Torvalds
     

03 Jan, 2009

1 commit

  • Impact: cleanup

    We now have a cleaner check for gcc 4.1.0/4.1.1 trouble in
    include/linux/compiler-gcc4.h, so remove the 4.1.0 quirk from
    init/main.c.

    Reported-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Ingo Molnar
    Acked-by: Sam Ravnborg
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jan, 2009

3 commits

  • Impact: cleanup

    There's one obvious place to use it: to find the highest possible cpu.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Impact: use new API

    cpu_*_map are going away in favour of cpu_*_mask, but const pointers.
    So we have accessors where we really do want to frob them. Archs
    will also need the (trivial) conversion before we can finally remove
    cpu_*_map.

    Signed-off-by: Rusty Russell
    Signed-off-by: Mike Travis

    Rusty Russell
     
  • …l/git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus-4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sparseirq: move __weak symbols into separate compilation unit
    sparseirq: work around __weak alias bug
    sparseirq: fix hang with !SPARSE_IRQ
    sparseirq: set lock_class for legacy irq when sparse_irq is selected
    sparseirq: work around compiler optimizing away __weak functions
    sparseirq: fix desc->lock init
    sparseirq: do not printk when migrating IRQ descriptors
    sparseirq: remove duplicated arch_early_irq_init()
    irq: simplify for_each_irq_desc() usage
    proc: remove ifdef CONFIG_SPARSE_IRQ from stat.c
    irq: for_each_irq_desc() move to irqnr.h
    hrtimer: remove #include <linux/irq.h>

    Linus Torvalds
     

31 Dec, 2008

1 commit

  • * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, sparseirq: clean up Kconfig entry
    x86: turn CONFIG_SPARSE_IRQ off by default
    sparseirq: fix numa_migrate_irq_desc dependency and comments
    sparseirq: add kernel-doc notation for new member in irq_desc, -v2
    locking, irq: enclose irq_desc_lock_class in CONFIG_LOCKDEP
    sparseirq, xen: make sure irq_desc is allocated for interrupts
    sparseirq: fix !SMP building, #2
    x86, sparseirq: move irq_desc according to smp_affinity, v7
    proc: enclose desc variable of show_stat() in CONFIG_SPARSE_IRQ
    sparse irqs: add irqnr.h to the user headers list
    sparse irqs: handle !GENIRQ platforms
    sparseirq: fix !SMP && !PCI_MSI && !HT_IRQ build
    sparseirq: fix Alpha build failure
    sparseirq: fix typo in !CONFIG_IO_APIC case
    x86, MSI: pass irq_cfg and irq_desc
    x86: MSI start irq numbering from nr_irqs_gsi
    x86: use NR_IRQS_LEGACY
    sparse irq_desc[] array: core kernel and x86 changes
    genirq: record IRQ_LEVEL in irq_desc[]
    irq.h: remove padding from irq_desc on 64bits

    Linus Torvalds
     

29 Dec, 2008

2 commits

  • GCC has a bug with __weak alias functions: if the functions are in
    the same compilation unit as their call site, GCC can decide to
    inline them - and thus rob the linker of the opportunity to override
    the weak alias with the real thing.

    So move all the IRQ handling related __weak symbols to kernel/irq/chip.c.
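
    For illustration, the risky layout looks roughly like this: a weak
    default and its call site in one file. The function names are
    hypothetical, and the attribute spelling assumes GCC or Clang.

```c
#include <assert.h>

/*
 * Hypothetical hook. The weak attribute lets a strong definition in
 * another object file replace this default at link time.
 */
__attribute__((weak)) int arch_irq_hook_model(void)
{
	return 0; /* generic default */
}

/*
 * A call site in the SAME file as the weak default. This is the layout
 * the commit moves away from, because the compiler may inline the
 * default here before the linker gets a chance to override it.
 */
int early_irq_setup_model(void)
{
	int ret = arch_irq_hook_model();

	assert(ret >= 0);
	return ret;
}
```

    Keeping the weak definitions in a different compilation unit from their
    callers forces a real call that the linker can redirect.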

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     
  • …el/git/tip/linux-2.6-tip

    * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
    sched, trace: update trace_sched_wakeup()
    tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
    Revert "x86: disable X86_PTRACE_BTS"
    ring-buffer: prevent false positive warning
    ring-buffer: fix dangling commit race
    ftrace: enable format arguments checking
    x86, bts: memory accounting
    x86, bts: add fork and exit handling
    ftrace: introduce tracing_reset_online_cpus() helper
    tracing: fix warnings in kernel/trace/trace_sched_switch.c
    tracing: fix warning in kernel/trace/trace.c
    tracing/ring-buffer: remove unused ring_buffer size
    trace: fix task state printout
    ftrace: add not to regex on filtering functions
    trace: better use of stack_trace_enabled for boot up code
    trace: add a way to enable or disable the stack tracer
    x86: entry_64 - introduce FTRACE_ frame macro v2
    tracing/ftrace: add the printk-msg-only option
    tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
    x86, bts: correctly report invalid bts records
    ...

    Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
    being already partly merged by the SH merge.

    Linus Torvalds
     

27 Dec, 2008

1 commit


08 Dec, 2008

1 commit

  • Impact: new feature

    Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
    NR_CPUS set to large values. The goal is to be able to scale up to much
    larger NR_IRQS value without impacting the (important) common case.

    To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
    irq_desc pointers.

    When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
    this also makes the IRQ descriptors NUMA-local (to the site that calls
    request_irq()).

    This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
    uses desc->chip_data for x86 to store irq_cfg.
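
    A user-space sketch of the idea, assuming a much-reduced descriptor:
    the table holds one pointer per IRQ number and descriptors are allocated
    on first use. The *_model names and MODEL_NR_IRQS are hypothetical, and
    calloc() merely stands in for kzalloc_node().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_IRQS 4096

/* Hypothetical, much-reduced descriptor; the real irq_desc has more fields. */
struct irq_desc_model {
	unsigned int irq;
	void *chip_data; /* per the commit, x86 stores its irq_cfg here */
};

/* Sparse table: one pointer per IRQ number instead of one full descriptor. */
static struct irq_desc_model *desc_ptrs[MODEL_NR_IRQS];

/* Look up a descriptor without allocating; NULL if the IRQ was never set up. */
struct irq_desc_model *irq_to_desc_model(unsigned int irq)
{
	return irq < MODEL_NR_IRQS ? desc_ptrs[irq] : NULL;
}

/* Allocate on first use; calloc() stands in for kzalloc_node(). */
struct irq_desc_model *irq_to_desc_alloc_model(unsigned int irq)
{
	if (irq >= MODEL_NR_IRQS)
		return NULL;
	if (!desc_ptrs[irq]) {
		desc_ptrs[irq] = calloc(1, sizeof(*desc_ptrs[irq]));
		if (desc_ptrs[irq])
			desc_ptrs[irq]->irq = irq;
	}
	if (desc_ptrs[irq])
		assert(desc_ptrs[irq]->irq == irq); /* slot matches its number */
	return desc_ptrs[irq];
}
```

    Unused IRQ numbers cost one pointer each rather than a full descriptor.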

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     

23 Nov, 2008

1 commit

  • Impact: fix initcall debug output on non-scalar ktime platforms (32-bit embedded)

    The initcall_debug code accesses the tv64 member of ktime. This won't work
    correctly for large deltas on platforms that don't use the scalar ktime
    implementation.
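
    The problem can be sketched with a hypothetical split time type: when
    seconds and nanoseconds are stored separately, the raw 64-bit view is
    not a nanosecond count, so deltas must be computed after a proper
    conversion (in the spirit of the kernel's ktime_to_ns()). The struct
    layout and function names below are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_MODEL 1000000000LL

/* Hypothetical split representation: seconds and nanoseconds kept apart. */
struct split_time {
	int32_t sec;
	int32_t nsec; /* always in [0, NSEC_PER_SEC_MODEL) */
};

/* Raw 64-bit view with seconds in the high half: NOT a nanosecond count. */
int64_t split_time_raw(struct split_time t)
{
	return (int64_t)(((uint64_t)(uint32_t)t.sec << 32) | (uint32_t)t.nsec);
}

/* Proper conversion to nanoseconds. */
int64_t split_time_to_ns(struct split_time t)
{
	assert(t.nsec >= 0 && t.nsec < NSEC_PER_SEC_MODEL);
	return (int64_t)t.sec * NSEC_PER_SEC_MODEL + t.nsec;
}

/* Elapsed nanoseconds between two timestamps, computed after conversion. */
int64_t split_time_delta_ns(struct split_time start, struct split_time end)
{
	return split_time_to_ns(end) - split_time_to_ns(start);
}
```

    Subtracting the raw views goes wrong as soon as the delta crosses a
    second boundary, which is why the converted values are used instead.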

    Signed-off-by: Will Newton
    Acked-by: Tim Bird
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Will Newton
     

14 Nov, 2008

1 commit

  • Inaugurate copy-on-write credentials management. This uses RCU to manage the
    credentials pointer in the task_struct with respect to accesses by other tasks.
    A process may only modify its own credentials, and so does not need locking to
    access or modify its own credentials.

    A mutex (cred_replace_mutex) is added to the task_struct to control the effect
    of PTRACE_ATTACHED on credential calculations, particularly with respect to
    execve().

    With this patch, the contents of an active credentials struct may not be
    changed directly; rather a new set of credentials must be prepared, modified
    and committed using something like the following sequence of events:

        struct cred *new = prepare_creds();
        int ret = blah(new);
        if (ret < 0) {
                abort_creds(new);
                return ret;
        }
        return commit_creds(new);

    There are some exceptions to this rule: the keyrings pointed to by the active
    credentials may be instantiated - keyrings violate the COW rule as managing
    COW keyrings is tricky, given that it is possible for a task to directly alter
    the keys in a keyring in use by another task.

    To help enforce this, various pointers to sets of credentials, such as those in
    the task_struct, are declared const. The purpose of this is compile-time
    discouragement of altering credentials through those pointers. Once a set of
    credentials has been made public through one of these pointers, it may not be
    modified, except under special circumstances:

    (1) Its reference count may be incremented and decremented.

    (2) The keyrings to which it points may be modified, but not replaced.

    The only safe way to modify anything else is to create a replacement and commit
    using the functions described in Documentation/credentials.txt (which will be
    added by a later patch).
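
    The prepare/modify/commit discipline can be modelled in user space as
    below. This is a sketch only: the *_model types and functions are
    hypothetical, the credential set is reduced to a uid, and the RCU-deferred
    freeing that the real code relies on is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, much-reduced credential set. */
struct cred_model {
	int usage;        /* reference count */
	unsigned int uid;
};

/* Hypothetical task holding a const pointer to its published credentials. */
struct task_model {
	const struct cred_model *cred;
};

/* Copy the task's current credentials into a private, modifiable set. */
struct cred_model *prepare_creds_model(const struct task_model *t)
{
	struct cred_model *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	*new = *t->cred;
	new->usage = 1;
	return new;
}

/* Drop one reference and free the set when the last one goes away. */
void put_cred_model(struct cred_model *c)
{
	assert(c->usage > 0);
	if (--c->usage == 0)
		free(c);
}

/* Discard a prepared set that was never published. */
void abort_creds_model(struct cred_model *new)
{
	put_cred_model(new);
}

/* Publish the new set and release the task's reference to the old one. */
int commit_creds_model(struct task_model *t, struct cred_model *new)
{
	/* Only the reference count of a published set may be changed. */
	struct cred_model *old = (struct cred_model *)t->cred;

	t->cred = new;
	put_cred_model(old);
	return 0;
}
```

    The published set is never edited in place: a change is either committed
    as a whole new set or aborted without any visible effect.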

    This patch and the preceding patches have been tested with the LTP SELinux
    testsuite.

    This patch makes several logical sets of alteration:

    (1) execve().

    This now prepares and commits credentials in various places in the
    security code rather than altering the current creds directly.

    (2) Temporary credential overrides.

    do_coredump() and sys_faccessat() now prepare their own credentials and
    temporarily override the ones currently on the acting thread, whilst
    preventing interference from other threads by holding cred_replace_mutex
    on the thread being dumped.

    This will be replaced in a future patch by something that hands down the
    credentials directly to the functions being called, rather than altering
    the task's objective credentials.

    (3) LSM interface.

    A number of functions have been changed, added or removed:

    (*) security_capset_check(), ->capset_check()
    (*) security_capset_set(), ->capset_set()

    Removed in favour of security_capset().

    (*) security_capset(), ->capset()

    New. This is passed a pointer to the new creds, a pointer to the old
    creds and the proposed capability sets. It should fill in the new
    creds or return an error. All pointers, barring the pointer to the
    new creds, are now const.

    (*) security_bprm_apply_creds(), ->bprm_apply_creds()

    Changed; now returns a value, which will cause the process to be
    killed if it's an error.

    (*) security_task_alloc(), ->task_alloc_security()

    Removed in favour of security_prepare_creds().

    (*) security_cred_free(), ->cred_free()

    New. Free security data attached to cred->security.

    (*) security_prepare_creds(), ->cred_prepare()

    New. Duplicate any security data attached to cred->security.

    (*) security_commit_creds(), ->cred_commit()

    New. Apply any security effects for the upcoming installation of new
    security by commit_creds().

    (*) security_task_post_setuid(), ->task_post_setuid()

    Removed in favour of security_task_fix_setuid().

    (*) security_task_fix_setuid(), ->task_fix_setuid()

    Fix up the proposed new credentials for setuid(). This is used by
    cap_set_fix_setuid() to implicitly adjust capabilities in line with
    setuid() changes. Changes are made to the new credentials, rather
    than the task itself as in security_task_post_setuid().

    (*) security_task_reparent_to_init(), ->task_reparent_to_init()

    Removed. Instead the task being reparented to init is referred
    directly to init's credentials.

    NOTE! This results in the loss of some state: SELinux's osid no
    longer records the sid of the thread that forked it.

    (*) security_key_alloc(), ->key_alloc()
    (*) security_key_permission(), ->key_permission()

    Changed. These now take cred pointers rather than task pointers to
    refer to the security context.

    (4) sys_capset().

    This has been simplified and uses less locking. The LSM functions it
    calls have been merged.

    (5) reparent_to_kthreadd().

    This gives the current thread the same credentials as init by simply using
    commit_thread() to point that way.

    (6) __sigqueue_alloc() and switch_uid()

    __sigqueue_alloc() can't stop the target task from changing its creds
    beneath it, so this function gets a reference to the currently applicable
    user_struct which it then passes into the sigqueue struct it returns if
    successful.

    switch_uid() is now called from commit_creds(), and possibly should be
    folded into that. commit_creds() should take care of protecting
    __sigqueue_alloc().

    (7) [sg]et[ug]id() and co and [sg]et_current_groups.

    The set functions now all use prepare_creds(), commit_creds() and
    abort_creds() to build and check a new set of credentials before applying
    it.

    security_task_set[ug]id() is called inside the prepared section. This
    guarantees that nothing else will affect the creds until we've finished.

    The calling of set_dumpable() has been moved into commit_creds().

    Much of the functionality of set_user() has been moved into
    commit_creds().

    The get functions all simply access the data directly.

    (8) security_task_prctl() and cap_task_prctl().

    security_task_prctl() has been modified to return -ENOSYS if it doesn't
    want to handle a function, or otherwise return the return value directly
    rather than through an argument.

    Additionally, cap_task_prctl() now prepares a new set of credentials, even
    if it doesn't end up using it.
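
    The -ENOSYS convention described in (8) can be sketched as follows. The
    option numbers, the *_model functions, and the fallback behaviour are
    hypothetical and only illustrate the "return directly, or say not
    handled" calling pattern.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_OPT_HANDLED 1 /* hypothetical option the security hook handles */
#define MODEL_OPT_GENERIC 2 /* hypothetical option only the generic code knows */

/* Hypothetical hook: result is returned directly, -ENOSYS means "not mine". */
long security_task_prctl_model(int option, unsigned long arg)
{
	if (option == MODEL_OPT_HANDLED)
		return (long)arg + 1; /* arbitrary demonstration result */
	return -ENOSYS;
}

/* Hypothetical generic handling for options the hook declined. */
long generic_prctl_model(int option)
{
	assert(option >= 0);
	return option == MODEL_OPT_GENERIC ? 0 : -EINVAL;
}

/* Caller: trust any answer from the hook except -ENOSYS. */
long prctl_model(int option, unsigned long arg)
{
	long ret = security_task_prctl_model(option, arg);

	if (ret != -ENOSYS)
		return ret;
	return generic_prctl_model(option);
}
```

    Returning the value directly removes the need for a separate result
    argument, at the cost of reserving -ENOSYS as the "not handled" signal.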

    (9) Keyrings.

    A number of changes have been made to the keyrings code:

    (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
    all been dropped and built in to the credentials functions directly.
    They may want separating out again later.

    (b) key_alloc() and search_process_keyrings() now take a cred pointer
    rather than a task pointer to specify the security context.

    (c) copy_creds() gives a new thread within the same thread group a new
    thread keyring if its parent had one, otherwise it discards the thread
    keyring.

    (d) The authorisation key now points directly to the credentials to extend
    the search into rather than pointing to the task that carries them.

    (e) Installing thread, process or session keyrings causes a new set of
    credentials to be created, even though it's not strictly necessary for
    process or session keyrings (they're shared).

    (10) Usermode helper.

    The usermode helper code now carries a cred struct pointer in its
    subprocess_info struct instead of a new session keyring pointer. This set
    of credentials is derived from init_cred and installed on the new process
    after it has been cloned.

    call_usermodehelper_setup() allocates the new credentials and
    call_usermodehelper_freeinfo() discards them if they haven't been used. A
    special cred function (prepare_usermodeinfo_creds()) is provided
    specifically for call_usermodehelper_setup() to call.

    call_usermodehelper_setkeys() adjusts the credentials to sport the
    supplied keyring as the new session keyring.

    (11) SELinux.

    SELinux has a number of changes, in addition to those to support the LSM
    interface changes mentioned above:

    (a) selinux_setprocattr() no longer does its check for whether the
    current ptracer can access processes with the new SID inside the lock
    that covers getting the ptracer's SID. Whilst this lock ensures that
    the check is done with the ptracer pinned, the result is only valid
    until the lock is released, so there's no point doing it inside the
    lock.

    (12) is_single_threaded().

    This function has been extracted from selinux_setprocattr() and put into
    a file of its own in the lib/ directory as join_session_keyring() now
    wants to use it too.

    The code in SELinux just checked to see whether a task shared mm_structs
    with other tasks (CLONE_VM), but that isn't good enough. We really want
    to know if they're part of the same thread group (CLONE_THREAD).

    (13) nfsd.

    The NFS server daemon now has to use the COW credentials to set the
    credentials it is going to use. It really needs to pass the credentials
    down to the functions it calls, but it can't do that until other patches
    in this series have been applied.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Signed-off-by: James Morris

    David Howells
     

12 Nov, 2008

2 commits

  • Impact: Split the boot tracer entries in two parts: call and return

    Now that we are using the sched tracer from the boot tracer, we want
    to use the same timestamp as the ring buffer to have consistent time
    captures between sched events and initcall events.

    So we get rid of the old time capture by the boot tracer and split the
    initcall events in two parts: call and return. This way we have the
    ring buffer timestamp of both.
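
    A rough user-space model of the split, assuming a single buffer clock
    shared by all event sources. The types and the trace_boot_*_model()
    helpers are hypothetical and do not mirror the tracer's real structures.

```c
#include <assert.h>
#include <stdint.h>

enum boot_event_type { BOOT_EVENT_CALL, BOOT_EVENT_RET };

/* Hypothetical entry; the timestamp is supplied by the shared buffer. */
struct boot_event {
	enum boot_event_type type;
	const char *func;
	int result;      /* meaningful for BOOT_EVENT_RET only */
	uint64_t ts_ns;
};

#define MODEL_BUF_SIZE 16

struct trace_buf_model {
	struct boot_event ev[MODEL_BUF_SIZE];
	int len;
	uint64_t clock_ns; /* single clock shared by every event source */
};

/* Append an event, stamping it with the buffer's own clock. */
static void model_log(struct trace_buf_model *b, enum boot_event_type type,
		      const char *func, int result)
{
	assert(b->len < MODEL_BUF_SIZE);
	b->ev[b->len].type = type;
	b->ev[b->len].func = func;
	b->ev[b->len].result = result;
	b->ev[b->len].ts_ns = b->clock_ns;
	b->len++;
}

/* Record the moment an initcall is entered. */
void trace_boot_call_model(struct trace_buf_model *b, const char *func)
{
	model_log(b, BOOT_EVENT_CALL, func, 0);
}

/* Record the moment an initcall returns, together with its result. */
void trace_boot_ret_model(struct trace_buf_model *b, const char *func, int result)
{
	model_log(b, BOOT_EVENT_RET, func, result);
}
```

    Because both halves are stamped by the same clock as any other event in
    the buffer, they can be interleaved with scheduler events consistently.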

    An example trace:

    [ 27.904149584] calling net_ns_init+0x0/0x1c0 @ 1
    [ 27.904429624] initcall net_ns_init+0x0/0x1c0 returned 0 after 0 msecs
    [ 27.904575926] calling reboot_init+0x0/0x20 @ 1
    [ 27.904655399] initcall reboot_init+0x0/0x20 returned 0 after 0 msecs
    [ 27.904800228] calling sysctl_init+0x0/0x30 @ 1
    [ 27.905142914] initcall sysctl_init+0x0/0x30 returned 0 after 0 msecs
    [ 27.905287211] calling ksysfs_init+0x0/0xb0 @ 1
    ##### CPU 0 buffer started ####
    init-1 [000] 27.905395: 1:120:R + [001] 11:115:S
    ##### CPU 1 buffer started ####
    -0 [001] 27.905425: 0:140:R ==> [001] 11:115:R
    init-1 [000] 27.905426: 1:120:D ==> [000] 0:140:R
    -0 [000] 27.905431: 0:140:R + [000] 4:115:S
    -0 [000] 27.905451: 0:140:R ==> [000] 4:115:R
    ksoftirqd/0-4 [000] 27.905456: 4:115:S ==> [000] 0:140:R
    udevd-11 [001] 27.905458: 11:115:R + [001] 14:115:R
    -0 [000] 27.905459: 0:140:R + [000] 4:115:S
    -0 [000] 27.905462: 0:140:R ==> [000] 4:115:R
    udevd-11 [001] 27.905462: 11:115:R ==> [001] 14:115:R
    ksoftirqd/0-4 [000] 27.905467: 4:115:S ==> [000] 0:140:R
    -0 [000] 27.905470: 0:140:R + [000] 4:115:S
    -0 [000] 27.905473: 0:140:R ==> [000] 4:115:R
    ksoftirqd/0-4 [000] 27.905476: 4:115:S ==> [000] 0:140:R
    -0 [000] 27.905479: 0:140:R + [000] 4:115:S
    -0 [000] 27.905482: 0:140:R ==> [000] 4:115:R
    ksoftirqd/0-4 [000] 27.905486: 4:115:S ==> [000] 0:140:R
    udevd-14 [001] 27.905499: 14:120:X ==> [001] 11:115:R
    udevd-11 [001] 27.905506: 11:115:R + [000] 1:120:D
    -0 [000] 27.905515: 0:140:R ==> [000] 1:120:R
    udevd-11 [001] 27.905517: 11:115:S ==> [001] 0:140:R
    [ 27.905557107] initcall ksysfs_init+0x0/0xb0 returned 0 after 3906 msecs
    [ 27.905705736] calling init_jiffies_clocksource+0x0/0x10 @ 1
    [ 27.905779239] initcall init_jiffies_clocksource+0x0/0x10 returned 0 after 0 msecs
    [ 27.906769814] calling pm_init+0x0/0x30 @ 1
    [ 27.906853627] initcall pm_init+0x0/0x30 returned 0 after 0 msecs
    [ 27.906997803] calling pm_disk_init+0x0/0x20 @ 1
    [ 27.907076946] initcall pm_disk_init+0x0/0x20 returned 0 after 0 msecs
    [ 27.907222556] calling swsusp_header_init+0x0/0x30 @ 1
    [ 27.907294325] initcall swsusp_header_init+0x0/0x30 returned 0 after 0 msecs
    [ 27.907439620] calling stop_machine_init+0x0/0x50 @ 1
    init-1 [000] 27.907485: 1:120:R + [000] 2:115:S
    init-1 [000] 27.907490: 1:120:D ==> [000] 2:115:R
    kthreadd-2 [000] 27.907507: 2:115:R + [001] 15:115:R
    -0 [001] 27.907517: 0:140:R ==> [001] 15:115:R
    kthreadd-2 [000] 27.907517: 2:115:D ==> [000] 0:140:R
    -0 [000] 27.907521: 0:140:R + [000] 4:115:S
    -0 [000] 27.907524: 0:140:R ==> [000] 4:115:R
    udevd-15 [001] 27.907527: 15:115:D + [000] 2:115:D
    ksoftirqd/0-4 [000] 27.907537: 4:115:S ==> [000] 2:115:R
    udevd-15 [001] 27.907537: 15:115:D ==> [001] 0:140:R
    kthreadd-2 [000] 27.907546: 2:115:R + [000] 1:120:D
    kthreadd-2 [000] 27.907550: 2:115:S ==> [000] 1:120:R
    init-1 [000] 27.907584: 1:120:R + [000] 15: 0:D
    init-1 [000] 27.907589: 1:120:R + [000] 2:115:S
    init-1 [000] 27.907593: 1:120:D ==> [000] 15: 0:R
    udevd-15 [000] 27.907601: 15: 0:S ==> [000] 2:115:R
    ##### CPU 0 buffer started ####
    kthreadd-2 [000] 27.907616: 2:115:R + [001] 16:115:R
    ##### CPU 1 buffer started ####
    -0 [001] 27.907620: 0:140:R ==> [001] 16:115:R
    kthreadd-2 [000] 27.907621: 2:115:D ==> [000] 0:140:R
    udevd-16 [001] 27.907625: 16:115:D + [000] 2:115:D
    -0 [000] 27.907628: 0:140:R + [000] 4:115:S
    udevd-16 [001] 27.907629: 16:115:D ==> [001] 0:140:R
    -0 [000] 27.907631: 0:140:R ==> [000] 4:115:R
    ksoftirqd/0-4 [000] 27.907636: 4:115:S ==> [000] 2:115:R
    kthreadd-2 [000] 27.907644: 2:115:R + [000] 1:120:D
    kthreadd-2 [000] 27.907647: 2:115:S ==> [000] 1:120:R
    init-1 [000] 27.907657: 1:120:R + [001] 16: 0:D
    -0 [001] 27.907666: 0:140:R ==> [001] 16: 0:R
    [ 27.907703862] initcall stop_machine_init+0x0/0x50 returned 0 after 0 msecs
    [ 27.907850704] calling filelock_init+0x0/0x30 @ 1
    [ 27.907926573] initcall filelock_init+0x0/0x30 returned 0 after 0 msecs
    [ 27.908071327] calling init_script_binfmt+0x0/0x10 @ 1
    [ 27.908165195] initcall init_script_binfmt+0x0/0x10 returned 0 after 0 msecs
    [ 27.908309461] calling init_elf_binfmt+0x0/0x10 @ 1

    Signed-off-by: Frederic Weisbecker
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Impact: Cleanups on the boot tracer and ftrace

    This patch brings some cleanups to the boot tracer headers. The
    functions and structures of this tracer are unrelated to ftrace
    and so should have their own header file.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

05 Nov, 2008

1 commit

  • Impact: modify boot tracer

    We used to disable the initcall tracing at a specified time (i.e. the
    end of the builtin initcalls). But we don't need that anymore: tracing
    will be stopped when the initcalls are finished.

    However we want two things:

    - Start this tracing only after the pre-smp initcalls are finished.

    - Since we are planning to trace sched_switches at the same time, we
    want to enable them only during the initcall execution.

    For this purpose, this patch introduces two functions to enable/disable
    the sched_switch tracing during boot.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

26 Oct, 2008

1 commit

  • This reverts commit a802dd0eb5fc97a50cf1abb1f788a8f6cc5db635 by moving
    the call to init_workqueues() back where it belongs - after SMP has been
    initialized.

    It also moves stop_machine_init() - which needs workqueues - to a later
    phase using a core_initcall() instead of early_initcall(). That should
    satisfy all ordering requirements, and was apparently the reason why
    init_workqueues() was moved to be too early.
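
    The ordering argument can be checked with a small model in which early
    initcalls run before workqueues exist and core initcalls run after. The
    table type, the level names, and the *_model functions are hypothetical
    simplifications of the real boot sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of initcall levels, earliest first. */
enum level_model { LEVEL_EARLY, LEVEL_CORE, LEVEL_COUNT };

typedef int (*initcall_model_t)(void);

#define MAX_PER_LEVEL 4

struct initcall_table_model {
	initcall_model_t calls[LEVEL_COUNT][MAX_PER_LEVEL];
	int count[LEVEL_COUNT];
};

static int workqueues_ready;
static int stop_machine_saw_workqueues;

int init_workqueues_model(void)
{
	workqueues_ready = 1;
	return 0;
}

/* Records whether workqueues were available when it ran. */
int stop_machine_init_model(void)
{
	stop_machine_saw_workqueues = workqueues_ready;
	return 0;
}

void register_initcall_model(struct initcall_table_model *t,
			     enum level_model level, initcall_model_t fn)
{
	assert(t->count[level] < MAX_PER_LEVEL);
	t->calls[level][t->count[level]++] = fn;
}

void run_level_model(const struct initcall_table_model *t, enum level_model level)
{
	for (int i = 0; i < t->count[level]; i++)
		t->calls[level][i]();
}

/* Order after the revert: early calls, then workqueues (post-SMP), then core. */
void boot_model(const struct initcall_table_model *t)
{
	run_level_model(t, LEVEL_EARLY);
	init_workqueues_model();
	run_level_model(t, LEVEL_CORE);
}
```

    Registered at the early level, the stop_machine stand-in runs before
    workqueues are ready; registered at the core level, it runs after them.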

    Cc: Heiko Carstens
    Cc: Rusty Russell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Oct, 2008

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits)
    [PATCH] fs: add a sanity check in d_free
    [PATCH] i_version: remount support
    [patch] vfs: make security_inode_setattr() calling consistent
    [patch 1/3] FS_MBCACHE: don't needlessly make it built-in
    [PATCH] move executable checking into ->permission()
    [PATCH] fs/dcache.c: update comment of d_validate()
    [RFC PATCH] touch_mnt_namespace when the mount flags change
    [PATCH] reiserfs: add missing llseek method
    [PATCH] fix ->llseek for more directories
    [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent
    [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup
    [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate()
    [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper
    [PATCH vfs-2.6 2/6] vfs: add d_ancestor()
    [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT()
    [PATCH] get rid of on-stack dentry in udf
    [PATCH 2/2] anondev: switch to IDA
    [PATCH 1/2] anondev: init IDR statically
    [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup()
    [PATCH] Optimise NFS readdir hack slightly.
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    stop_machine: fix error code handling on multiple cpus
    stop_machine: use workqueues instead of kernel threads
    workqueue: introduce create_rt_workqueue
    Call init_workqueues before pre smp initcalls.
    Make panic= and panic_on_oops into core_params
    Make initcall_debug a core_param
    core_param() for genuinely core kernel parameters
    param: Fix duplicate module prefixes
    module: check kernel param length at compile time, not runtime
    Remove stop_machine during module load v2
    module: simplify load_module.

    Linus Torvalds
     

23 Oct, 2008

2 commits

  • page_cgroup_init() is called from mem_cgroup_init(). But at this
    point, we cannot call alloc_bootmem().
    (and this caused panic at boot.)

    This patch moves page_cgroup_init() to init/main.c.

    The timeline is as follows:
    ==
    parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this.
    ....
    cgroup_init_early() # "early" init of cgroup.
    ....
    setup_arch() # memmap is allocated.
    ...
    page_cgroup_init();
    mem_init(); # we cannot call alloc_bootmem after this.
    ....
    cgroup_init() # mem_cgroup is initialized.
    ==

    Before page_cgroup_init(), mem_map must be initialized. So,
    I added page_cgroup_init() to init/main.c directly.

    (*) Maybe this is not very clean, but
    - cgroup_init_early() is too early
    - in cgroup_init(), we would have to use vmalloc() instead of alloc_bootmem().
    The vmalloc area is a scarce resource on x86-32, so we should avoid very large
    vmalloc() calls there. So we want to use alloc_bootmem(), and therefore added
    page_cgroup_init() directly to init/main.c.

    [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration]
    [akpm@linux-foundation.org: fix build]
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Signed-off-by: Alexey Dobriyan

    Alexey Dobriyan
     

22 Oct, 2008

2 commits


21 Oct, 2008

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (131 commits)
    tracing/fastboot: improve help text
    tracing/stacktrace: improve help text
    tracing/fastboot: fix initcalls disposition in bootgraph.pl
    tracing/fastboot: fix bootgraph.pl initcall name regexp
    tracing/fastboot: fix issues and improve output of bootgraph.pl
    tracepoints: synchronize unregister static inline
    tracepoints: tracepoint_synchronize_unregister()
    ftrace: make ftrace_test_p6nop disassembler-friendly
    markers: fix synchronize marker unregister static inline
    tracing/fastboot: add better resolution to initcall debug/tracing
    trace: add build-time check to avoid overrunning hex buffer
    ftrace: fix hex output mode of ftrace
    tracing/fastboot: fix initcalls disposition in bootgraph.pl
    tracing/fastboot: fix printk format typo in boot tracer
    ftrace: return an error when setting a nonexistent tracer
    ftrace: make some tracers reentrant
    ring-buffer: make reentrant
    ring-buffer: move page indexes into page headers
    tracing/fastboot: only trace non-module initcalls
    ftrace: move pc counter in irqtrace
    ...

    Manually fix conflicts:
    - init/main.c: initcall tracing
    - kernel/module.c: verbose level vs tracepoints
    - scripts/bootgraph.pl: fallout from cherry-picking commits.

    Linus Torvalds
     

20 Oct, 2008

1 commit

  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping then touching then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4 core, 2 socket Opteron. Results are
    in nanoseconds per map+touch+unmap.

    threads    vanilla    vmap rewrite
    1          14700      2900
    2          33600      3000
    4          49500      2800
    8          70631      2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
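
The lazy flushing idea described above can be sketched as a user-space counter model. It is not the kernel implementation: `struct vmap_model`, the function names, and the `LAZY_MAX_PAGES` threshold are all hypothetical, and a "flush" here is just an incremented counter. `purge_lazy()` stands in for an explicit purge such as the `vm_unmap_aliases()` call mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>

#define LAZY_MAX_PAGES 32 /* hypothetical purge threshold */

struct vmap_model {
    size_t lazy_pages;     /* pages unmapped but not yet flushed */
    size_t global_flushes; /* simulated global TLB flushes */
};

/* Old scheme: every unmap pays for one global flush. */
void eager_unmap(struct vmap_model *m, size_t pages)
{
    assert(m != NULL && pages > 0);
    m->global_flushes++;
}

/* One flush releases every lazily unmapped range at once. */
void purge_lazy(struct vmap_model *m)
{
    assert(m != NULL);
    if (m->lazy_pages == 0)
        return;
    m->global_flushes++;
    m->lazy_pages = 0;
}

/* Lazy scheme: defer the flush until enough pages have piled up. */
void lazy_unmap(struct vmap_model *m, size_t pages)
{
    assert(m != NULL && pages > 0);
    m->lazy_pages += pages;
    if (m->lazy_pages > LAZY_MAX_PAGES)
        purge_lazy(m);
}
```

Running sixteen 4-page unmaps through both paths costs sixteen flushes in the eager model and a single flush in the lazy one, with the remainder still pending until the next purge. The real allocator also has to keep deferred addresses out of circulation until that purge happens, which this counter model does not represent.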
     

14 Oct, 2008

7 commits

  • Change the time resolution for initcall_debug to microseconds, from
    milliseconds. This is handy to determine which initcalls you want to work
    on for faster booting.

    On one of my test machines, over 90% of the initcalls are less than a
    millisecond and (without this patch) these are all reported as 0 msecs.
    Working on the 900 us ones is more important than the 4 us ones.

    With 'quiet' on the kernel command line, this adds no significant overhead
    to kernel boot time.

    Signed-off-by: Tim Bird
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Tim Bird
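
The effect of the resolution change can be shown with plain integer arithmetic. The sketch below is a user-space illustration, not the code from init/main.c: the helper names are hypothetical, and the report string is only assumed to resemble the initcall_debug output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Whole milliseconds in a nanosecond duration (truncating). */
uint64_t duration_msecs(uint64_t ns)
{
    return ns / 1000000u;
}

/* Whole microseconds in a nanosecond duration (truncating). */
uint64_t duration_usecs(uint64_t ns)
{
    return ns / 1000u;
}

/* Format a report line with microsecond resolution (assumed wording). */
int format_initcall_report(char *buf, size_t len, const char *name,
                           int ret, uint64_t ns)
{
    assert(buf != NULL && name != NULL);
    return snprintf(buf, len, "initcall %s returned %d after %llu usecs",
                    name, ret, (unsigned long long)duration_usecs(ns));
}
```

A 900 us initcall and a 4 us initcall both truncate to 0 in milliseconds, while the microsecond values (900 and 4) make the difference visible.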
     
  • At this time, only built-in initcalls interest us.
    We can't really produce a relevant graph if we include
    module initcalls too.

    I had good results after this patch (see svg in attachment).

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • After some initcall traces, some initcall names may be inconsistent.
    That's because these functions are discarded along with the .init
    section, and their names disappear from the symbol table.

    So we have to copy the name of the function into a sufficiently large
    buffer when appending the trace entry. This is not costly for the
    ring_buffer because the number of initcall entries is usually not very large.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
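
Copying the name at record time, rather than keeping a pointer that may go stale, can be sketched as follows. The struct layout, `NAME_LEN`, and `record_initcall()` are hypothetical and do not reproduce the boot tracer's actual entry format.

```c
#include <assert.h>
#include <stdio.h>

#define NAME_LEN 64 /* hypothetical buffer size */

/* Trace entry that owns a copy of the function name. */
struct boot_trace_entry {
    char func[NAME_LEN];
    int result;
};

/* Store the name by value so the entry outlives the original string. */
void record_initcall(struct boot_trace_entry *e, const char *name, int result)
{
    assert(e != NULL && name != NULL);
    /* snprintf truncates if needed and always NUL-terminates. */
    snprintf(e->func, sizeof(e->func), "%s", name);
    e->result = result;
}
```

Once recorded, the entry stays readable even if the source string is later overwritten or freed, at the cost of a fixed number of bytes per entry.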
     
  • Change the boot tracer printing to make it parsable by
    the scripts/bootgraph.pl script.

    We now have to output two lines for each initcall, matching the
    printk calls in do_one_initcall() in init/main.c.
    We now need both the call's time and the return's time.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Launch the boot tracing inside the initcall_debug area. The old printk
    calls have not been removed, to keep the old way of initcall tracing for
    backward compatibility.

    [ mingo@elte.hu: resolved conflicts ]
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frédéric Weisbecker
     
  • When optimizing the kernel boot time, it's very valuable to visualize
    what is going on at which time. In addition, with the fastboot asynchronous
    initcall level, it's very valuable to see which initcall gets run where
    and when.

    This patch adds a script to turn a dmesg into a SVG graph (that can be
    shown with tools such as InkScape, Gimp or Firefox) and a small change
    to the initcall code to print the PID of the thread calling the initcall
    (so that the script can work out the parallelism).

    Signed-off-by: Arjan van de Ven

    Arjan van de Ven
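
To show why printing the PID helps a graphing script, here is a minimal C sketch that extracts the function name and PID from one dmesg line. The line format is an assumption (`calling  <symbol> @ <pid>`, optionally preceded by a timestamp), and `parse_calling_line()` is a hypothetical helper; the actual script is written in Perl and is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One "initcall started" event recovered from the log. */
struct initcall_start {
    char func[64];
    int pid;
};

/* Returns 1 if the line matched the assumed format, 0 otherwise. */
int parse_calling_line(const char *line, struct initcall_start *out)
{
    const char *p;

    assert(line != NULL && out != NULL);
    /* Skip any leading timestamp by searching for the keyword. */
    p = strstr(line, "calling ");
    if (p == NULL)
        return 0;
    return sscanf(p, "calling %63s @ %d", out->func, &out->pid) == 2;
}
```

Grouping the parsed events by `pid` is what would let a tool draw initcalls that ran in parallel on separate rows.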
     
  • This is the infrastructure for converting the mcount call sites
    recorded by the __mcount_loc section into nops on boot. It also allows
    for using these sites to enable tracing as normal. When the __mcount_loc
    section is used, the "ftraced" kernel thread is disabled.

    This uses the current infrastructure to record the mcount call sites
    as well as convert them to nops. The mcount function is kept as a stub
    on boot up and not converted to the ftrace_record_ip function. We use the
    ftrace_record_ip to only record from the table.

    This patch does not handle modules. That comes with a later patch.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
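
The idea of walking a table of recorded call sites and overwriting each with a nop can be sketched in user space on a plain byte buffer. This is an illustration under stated assumptions, not the ftrace code: it assumes x86 5-byte near calls (opcode `0xe8`), uses one common 5-byte nop encoding, and `nop_out_call_sites()` is a hypothetical name. Real kernel text patching needs far more care than a `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CALL_LEN 5 /* x86 near call: opcode 0xe8 plus a 32-bit offset */

/* One 5-byte x86 nop encoding, used here purely for illustration. */
static const unsigned char nop5[CALL_LEN] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Overwrite each recorded call site with a nop of the same length.
 * Sites that fall outside the buffer or do not hold a call are skipped.
 * Returns the number of sites patched.
 */
size_t nop_out_call_sites(unsigned char *text, size_t text_len,
                          const size_t *sites, size_t nsites)
{
    size_t patched = 0;

    assert(text != NULL && sites != NULL);
    for (size_t i = 0; i < nsites; i++) {
        size_t off = sites[i];

        if (off > text_len || text_len - off < CALL_LEN)
            continue; /* would run past the end of the buffer */
        if (text[off] != 0xe8)
            continue; /* not a near call, leave it alone */
        memcpy(text + off, nop5, CALL_LEN);
        patched++;
    }
    return patched;
}
```

Because the table lists every site up front, the conversion is a single pass at boot, which fits the message's note that the "ftraced" kernel thread is disabled when the `__mcount_loc` section is used.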
     

12 Oct, 2008

1 commit

  • When optimizing the kernel boot time, it's very valuable to visualize
    what is going on at which time. In addition, with some of the initialization
    going asynchronous soon, it's valuable to track/print which worker thread
    is executing the initialization.

    This patch adds a script to turn a dmesg into a SVG graph (that can be
    shown with tools such as InkScape, Gimp or Firefox) and a small change
    to the initcall code to print the PID of the thread calling the initcall
    (so that the script can work out the parallelism).

    Signed-off-by: Arjan van de Ven

    Arjan van de Ven