04 Jan, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits)
    x86: setup_per_cpu_areas() cleanup
    cpumask: fix compile error when CONFIG_NR_CPUS is not defined
    cpumask: use alloc_cpumask_var_node where appropriate
    cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t
    x86: use cpumask_var_t in acpi/boot.c
    x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
    sched: put back some stack hog changes that were undone in kernel/sched.c
    x86: enable cpus display of kernel_max and offlined cpus
    ia64: cpumask fix for is_affinity_mask_valid()
    cpumask: convert RCU implementations, fix
    xtensa: define __fls
    mn10300: define __fls
    m32r: define __fls
    h8300: define __fls
    frv: define __fls
    cris: define __fls
    cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
    cpumask: zero extra bits in alloc_cpumask_var_node
    cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
    cpumask: convert mm/
    ...

    Linus Torvalds
     

03 Jan, 2009

2 commits

  • …/git/tip/linux-2.6-tip

    * 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
    x86: export vector_used_by_percpu_irq
    x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
    sched: nominate preferred wakeup cpu, fix
    x86: fix lguest used_vectors breakage, -v2
    x86: fix warning in arch/x86/kernel/io_apic.c
    sched: fix warning in kernel/sched.c
    sched: move test_sd_parent() to an SMP section of sched.h
    sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
    sched: activate active load balancing in new idle cpus
    sched: bias task wakeups to preferred semi-idle packages
    sched: nominate preferred wakeup cpu
    sched: favour lower logical cpu number for sched_mc balance
    sched: framework for sched_mc/smt_power_savings=N
    sched: convert BALANCE_FOR_xx_POWER to inline functions
    x86: use possible_cpus=NUM to extend the possible cpus allowed
    x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
    x86: update io_apic.c to the new cpumask code
    x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
    x86: xen: use smp_call_function_many()
    x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
    ...

    Fixed up trivial conflict in kernel/time/tick-sched.c manually

    Linus Torvalds
     
  • Impact: cleanup

    We now have a cleaner check for gcc 4.1.0/4.1.1 trouble in
    include/linux/compiler-gcc4.h, so remove the 4.1.0 quirk from
    init/main.c.

    Reported-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Ingo Molnar
    Acked-by: Sam Ravnborg
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jan, 2009

3 commits

  • Impact: cleanup

    There's one obvious place to use it: to find the highest possible cpu.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Impact: use new API

    The cpu_*_map variables are going away in favour of cpu_*_mask, which
    are const pointers. So we have accessors for the places where we really
    do want to frob them. Archs will also need the (trivial) conversion
    before we can finally remove cpu_*_map.
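
    As a rough userspace illustration of the idea (not the kernel code
    itself), the sketch below exposes a mask only through a const pointer
    and routes every modification through an accessor. The demo_* names and
    the single-word mask are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical single-word mask; valid only for cpu < bits in a long. */
struct demo_cpumask {
	unsigned long bits;
};

/* Writable storage that only the accessor below touches. */
static struct demo_cpumask demo_possible_storage;

/* Readers only ever see a const pointer, so they cannot modify the mask. */
const struct demo_cpumask *const demo_cpu_possible_mask = &demo_possible_storage;

/* Accessor for the few callers that really need to change the mask. */
void demo_set_cpu_possible(unsigned int cpu, bool possible)
{
	if (possible)
		demo_possible_storage.bits |= 1UL << cpu;
	else
		demo_possible_storage.bits &= ~(1UL << cpu);
}

/* Read-side helper working purely through the const pointer. */
bool demo_cpu_possible(unsigned int cpu)
{
	return (demo_cpu_possible_mask->bits >> cpu) & 1UL;
}
```

    Code that tries to write through demo_cpu_possible_mask fails to
    compile, which is the kind of compile-time discouragement a const
    pointer gives.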

    Signed-off-by: Rusty Russell
    Signed-off-by: Mike Travis

    Rusty Russell
     
  • …l/git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus-4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sparseirq: move __weak symbols into separate compilation unit
    sparseirq: work around __weak alias bug
    sparseirq: fix hang with !SPARSE_IRQ
    sparseirq: set lock_class for legacy irq when sparse_irq is selected
    sparseirq: work around compiler optimizing away __weak functions
    sparseirq: fix desc->lock init
    sparseirq: do not printk when migrating IRQ descriptors
    sparseirq: remove duplicated arch_early_irq_init()
    irq: simplify for_each_irq_desc() usage
    proc: remove ifdef CONFIG_SPARSE_IRQ from stat.c
    irq: for_each_irq_desc() move to irqnr.h
    hrtimer: remove #include <linux/irq.h>

    Linus Torvalds
     

31 Dec, 2008

3 commits

  • Conflicts:

    arch/x86/kernel/io_apic.c

    Rusty Russell
     
  • * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, sparseirq: clean up Kconfig entry
    x86: turn CONFIG_SPARSE_IRQ off by default
    sparseirq: fix numa_migrate_irq_desc dependency and comments
    sparseirq: add kernel-doc notation for new member in irq_desc, -v2
    locking, irq: enclose irq_desc_lock_class in CONFIG_LOCKDEP
    sparseirq, xen: make sure irq_desc is allocated for interrupts
    sparseirq: fix !SMP building, #2
    x86, sparseirq: move irq_desc according to smp_affinity, v7
    proc: enclose desc variable of show_stat() in CONFIG_SPARSE_IRQ
    sparse irqs: add irqnr.h to the user headers list
    sparse irqs: handle !GENIRQ platforms
    sparseirq: fix !SMP && !PCI_MSI && !HT_IRQ build
    sparseirq: fix Alpha build failure
    sparseirq: fix typo in !CONFIG_IO_APIC case
    x86, MSI: pass irq_cfg and irq_desc
    x86: MSI start irq numbering from nr_irqs_gsi
    x86: use NR_IRQS_LEGACY
    sparse irq_desc[] array: core kernel and x86 changes
    genirq: record IRQ_LEVEL in irq_desc[]
    irq.h: remove padding from irq_desc on 64bits

    Linus Torvalds
     
  • * 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
    stacktrace: provide save_stack_trace_tsk() weak alias
    rcu: provide RCU options on non-preempt architectures too
    printk: fix discarding message when recursion_bug
    futex: clean up futex_(un)lock_pi fault handling
    "Tree RCU": scalable classic RCU implementation
    futex: rename field in futex_q to clarify single waiter semantics
    x86/swiotlb: add default swiotlb_arch_range_needs_mapping
    x86/swiotlb: add default phys<->bus conversion
    x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
    x86: add swiotlb allocation functions
    swiotlb: consolidate swiotlb info message printing
    swiotlb: support bouncing of HighMem pages
    swiotlb: factor out copy to/from device
    swiotlb: add arch hook to force mapping
    swiotlb: allow architectures to override phys<->bus<->phys conversions
    swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
    rcu: fix rcutorture behavior during reboot
    resources: skip sanity check of busy resources
    swiotlb: move some definitions to header
    swiotlb: allow architectures to override swiotlb pool allocation
    ...

    Fix up trivial conflicts in
    arch/x86/kernel/Makefile
    arch/x86/mm/init_32.c
    include/linux/hardirq.h
    as per Ingo's suggestions.

    Linus Torvalds
     

30 Dec, 2008

1 commit


29 Dec, 2008

3 commits

  • GCC has a bug with __weak alias functions: if the functions are in
    the same compilation unit as their call site, GCC can decide to
    inline them - and thus rob the linker of the opportunity to override
    the weak alias with the real thing.

    So move all the IRQ handling related __weak symbols to kernel/irq/chip.c.
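
    A minimal sketch of a weak default and its caller, assuming a
    GCC-compatible compiler. The demo_* names are hypothetical. In the
    layout described above the weak definition would sit in a different
    source file from its callers; this single-file sketch uses noinline
    as a stand-in so the call is never folded into the caller.

```c
/*
 * Weak default: the linker replaces it if another object file provides a
 * non-weak function with the same name. noinline is used here only
 * because this sketch keeps definition and caller in one file.
 */
__attribute__((weak, noinline)) int demo_arch_early_irq_init(void)
{
	return 0;	/* generic fallback: nothing arch-specific to do */
}

/* Caller: must reach the symbol through a real call, not an inlined copy. */
int demo_early_irq_init(void)
{
	return demo_arch_early_irq_init();
}
```

    Linking in a second file that defines a non-weak
    demo_arch_early_irq_init() would make demo_early_irq_init() call that
    version instead.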

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next: (25 commits)
    allow stripping of generated symbols under CONFIG_KALLSYMS_ALL
    kbuild: strip generated symbols from *.ko
    kbuild: simplify use of genksyms
    kernel-doc: check for extra kernel-doc notations
    kbuild: add headerdep used to detect inclusion cycles in header files
    kbuild: fix string equality testing in tags.sh
    kbuild: fix make tags/cscope
    kbuild: fix make incompatibility
    kbuild: remove TAR_IGNORE
    setlocalversion: add git-svn support
    setlocalversion: print correct subversion revision
    scripts: improve the decodecode script
    scripts/package: allow custom options to rpm
    genksyms: allow to ignore symbol checksum changes
    genksyms: track symbol checksum changes
    tags and cscope support really belongs in a shell script
    kconfig: fix options to check-lxdialog.sh
    kbuild: gen_init_cpio expands shell variables in file names
    remove bashisms from scripts/extract-ikconfig
    kbuild: teach mkmakfile to be silent
    ...

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
    sched, trace: update trace_sched_wakeup()
    tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
    Revert "x86: disable X86_PTRACE_BTS"
    ring-buffer: prevent false positive warning
    ring-buffer: fix dangling commit race
    ftrace: enable format arguments checking
    x86, bts: memory accounting
    x86, bts: add fork and exit handling
    ftrace: introduce tracing_reset_online_cpus() helper
    tracing: fix warnings in kernel/trace/trace_sched_switch.c
    tracing: fix warning in kernel/trace/trace.c
    tracing/ring-buffer: remove unused ring_buffer size
    trace: fix task state printout
    ftrace: add not to regex on filtering functions
    trace: better use of stack_trace_enabled for boot up code
    trace: add a way to enable or disable the stack tracer
    x86: entry_64 - introduce FTRACE_ frame macro v2
    tracing/ftrace: add the printk-msg-only option
    tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
    x86, bts: correctly report invalid bts records
    ...

    Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
    being already partly merged by the SH merge.

    Linus Torvalds
     

27 Dec, 2008

1 commit


25 Dec, 2008

1 commit

  • Impact: build fix

    Some old architectures still do not use kernel/Kconfig.preempt, so the
    moving of the RCU options there broke their build:

    In file included from /home/mingo/tip/include/linux/sem.h:81,
    from /home/mingo/tip/include/linux/sched.h:69,
    from /home/mingo/tip/arch/alpha/kernel/asm-offsets.c:9:
    /home/mingo/tip/include/linux/rcupdate.h:62:2: error: #error "Unknown RCU implementation specified to kernel configuration"

    Move these options back to init/Kconfig, which every architecture
    includes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

20 Dec, 2008

1 commit

  • Building upon parts of the module stripping patch, this patch
    introduces similar stripping for vmlinux when CONFIG_KALLSYMS_ALL=y.
    Using CONFIG_KALLSYMS_STRIP_GENERATED reduces the overhead of
    CONFIG_KALLSYMS_ALL from 245k/310k to 65k/80k for the (i386/x86-64)
    kernels I tested with.

    The patch also does away with the need to special case the kallsyms-
    internal symbols by making them available even in the first linking
    stage.

    While it is a generated file, the patch includes the changes to
    scripts/genksyms/keywords.c_shipped, as I'm unsure what the procedure
    here is.

    Signed-off-by: Jan Beulich
    Signed-off-by: Sam Ravnborg

    Jan Beulich
     

19 Dec, 2008

1 commit

  • This patch fixes a long-standing performance bug in classic RCU that
    results in massive internal-to-RCU lock contention on systems with
    more than a few hundred CPUs. Although this patch creates a separate
    flavor of RCU for ease of review and patch maintenance, it is intended
    to replace classic RCU.

    This patch still handles stress better than does mainline, so I am still
    calling it ready for inclusion. This patch is against the -tip tree.
    Nevertheless, experience on an actual 1000+ CPU machine would still be
    most welcome.

    Most of the changes noted below were found while creating an rcutiny
    (which should permit ejecting the current rcuclassic) and while doing
    detailed line-by-line documentation.

    Updates from v9 (http://lkml.org/lkml/2008/12/2/334):

    o Fixes from remainder of line-by-line code walkthrough,
    including comment spelling, initialization, undesirable
    narrowing due to type conversion, removing redundant memory
    barriers, removing redundant local-variable initialization,
    and removing redundant local variables.

    I do not believe that any of these fixes address the CPU-hotplug
    issues that Andi Kleen was seeing, but please do give it a whirl
    in case the machine is smarter than I am.

    A writeup from the walkthrough may be found at the following
    URL, in case you are suffering from terminal insomnia or
    masochism:

    http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf

    o Made rcutree tracing use seq_file, as suggested some time
    ago by Lai Jiangshan.

    o Added a .csv variant of the rcudata debugfs trace file, to allow
    people having thousands of CPUs to drop the data into
    a spreadsheet. Tested with oocalc and gnumeric. Updated
    documentation to suit.

    Updates from v8 (http://lkml.org/lkml/2008/11/15/139):

    o Fix a theoretical race between grace-period initialization and
    force_quiescent_state() that could occur if more than three
    jiffies were required to carry out the grace-period
    initialization. Which it might, if you had enough CPUs.

    o Apply Ingo's printk-standardization patch.

    o Substitute local variables for repeated accesses to global
    variables.

    o Fix comment misspellings and redundant (but harmless) increments
    of ->n_rcu_pending (this latter after having explicitly added it).

    o Apply checkpatch fixes.

    Updates from v7 (http://lkml.org/lkml/2008/10/10/291):

    o Fixed a number of problems noted by Gautham Shenoy, including
    the cpu-stall-detection bug that he was having difficulty
    convincing me was real. ;-)

    o Changed cpu-stall detection to wait for ten seconds rather than
    three in order to reduce false positives, as suggested by Ingo
    Molnar.

    o Produced a design document (http://lwn.net/Articles/305782/).
    The act of writing this document uncovered a number of both
    theoretical and "here and now" bugs as noted below.

    o Fix dynticks_nesting accounting confusion, simplify WARN_ON()
    condition, fix kerneldoc comments, and add memory barriers
    in dynticks interface functions.

    o Add more data to tracing.

    o Remove unused "rcu_barrier" field from rcu_data structure.

    o Count calls to rcu_pending() from scheduling-clock interrupt
    to use as a surrogate timebase should jiffies stop counting.

    o Fix a theoretical race between force_quiescent_state() and
    grace-period initialization. Yes, initialization does have to
    go on for some jiffies for this race to occur, but given enough
    CPUs...

    Updates from v6 (http://lkml.org/lkml/2008/9/23/448):

    o Fix a number of checkpatch.pl complaints.

    o Apply review comments from Ingo Molnar and Lai Jiangshan
    on the stall-detection code.

    o Fix several bugs in !CONFIG_SMP builds.

    o Fix a misspelled config-parameter name so that RCU now announces
    at boot time if stall detection is configured.

    o Run tests on numerous combinations of configuration parameters,
    which, after the fixes above, now build and run correctly.

    Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):

    o Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
    changeset some time ago, and finally got around to retesting
    this option).

    o Fix some tracing bugs in rcupreempt that caused incorrect
    totals to be printed.

    o I now test with a more brutal random-selection online/offline
    script (attached). Probably more brutal than it needs to be
    on the people reading it as well, but so it goes.

    o A number of optimizations and usability improvements:

    o Make rcu_pending() ignore the grace-period timeout when
    there is no grace period in progress.

    o Make force_quiescent_state() avoid going for a global
    lock in the case where there is no grace period in
    progress.

    o Rearrange struct fields to improve struct layout.

    o Make call_rcu() initiate a grace period if RCU was
    idle, rather than waiting for the next scheduling
    clock interrupt.

    o Invoke rcu_irq_enter() and rcu_irq_exit() only when
    idle, as suggested by Andi Kleen. I still don't
    completely trust this change, and might back it out.

    o Make CONFIG_RCU_TRACE be the single config variable
    manipulated for all forms of RCU, instead of the prior
    confusion.

    o Document tracing files and formats for both rcupreempt
    and rcutree.

    Updates from v4 for those missing v5 given its bad subject line:

    o Separated dynticks interface so that NMIs and irqs call separate
    functions, greatly simplifying it. In particular, this code
    no longer requires a proof of correctness. ;-)

    o Separated dynticks state out into its own per-CPU structure,
    avoiding the duplicated accounting.

    o The case where a dynticks-idle CPU runs an irq handler that
    invokes call_rcu() is now correctly handled, forcing that CPU
    out of dynticks-idle mode.

    o Review comments have been applied (thank you all!!!).
    For but one example, fixed the dynticks-ordering issue that
    Manfred pointed out, saving me much debugging. ;-)

    o Adjusted rcuclassic and rcupreempt to handle dynticks changes.

    Attached is an updated patch to Classic RCU that applies a hierarchy,
    greatly reducing the contention on the top-level lock for large machines.
    This passes 10-hour concurrent rcutorture and online-offline testing on
    128-CPU ppc64 without dynticks enabled, and exposes some timekeeping
    bugs in presence of dynticks (exciting working on a system where
    "sleep 1" hangs until interrupted...), which were fixed in the
    2.6.27 kernel. It is getting more reliable than mainline by some
    measures, so the next version will be against -tip for inclusion.
    See also Manfred Spraul's recent patches (or his earlier work from
    2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
    We will converge onto a common patch in the fullness of time, but are
    currently exploring different regions of the design space. That said,
    I have already gratefully stolen quite a few of Manfred's ideas.

    This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
    of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on
    64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
    there is no hierarchy. By default, the RCU initialization code will
    adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
    architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
    this balancing, allowing the hierarchy to be exactly aligned to the
    underlying hardware. Up to two levels of hierarchy are permitted
    (in addition to the root node), allowing up to 32,768 CPUs on 32-bit
    systems and up to 262,144 CPUs on 64-bit systems. I just know that I
    am going to regret saying this, but this seems more than sufficient
    for the foreseeable future. (Some architectures might wish to set
    CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
    If this becomes a real problem, additional levels can be added, but I
    doubt that it will make a significant difference on real hardware.)
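
    The capacity figures follow from multiplying the fanout once per
    level. A small worked sketch, assuming three levels in total (the root
    plus two more); the function and macro names are made up for the
    example:

```c
/* Levels of rcu_node structures assumed here: the root plus two more. */
#define DEMO_RCU_LEVELS 3

/* Largest CPU count a DEMO_RCU_LEVELS-deep tree with this fanout covers. */
unsigned long demo_rcu_max_cpus(unsigned long fanout)
{
	unsigned long cpus = 1;
	int level;

	for (level = 0; level < DEMO_RCU_LEVELS; level++)
		cpus *= fanout;
	return cpus;
}
```

    With a fanout of 64 this gives 64 * 64 * 64 = 262,144 CPUs, and with
    a fanout of 4 it gives 64 CPUs, matching the figures quoted above.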

    In the common case, a given CPU will manipulate its private rcu_data
    structure and the rcu_node structure that it shares with its immediate
    neighbors. This can reduce both lock and memory contention by multiple
    orders of magnitude, which should eliminate the need for the strange
    manipulations that are reported to be required when running Linux on
    very large systems.
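
    To illustrate why most CPUs only touch the node they share with their
    neighbors, here is a simplified two-level userspace sketch. It is not
    the patch's code: locking is omitted entirely, and the demo_* names,
    the fanout of 4, and the fixed leaf count are assumptions. Each CPU
    clears its bit in its leaf, and only the last CPU of a leaf goes on
    to touch the root.

```c
#include <stddef.h>

#define DEMO_FANOUT  4	/* hypothetical: CPUs per leaf node */
#define DEMO_NLEAVES 4	/* hypothetical: leaves under the root */

struct demo_rcu_node {
	unsigned long qsmask;		/* children still owing a report */
	struct demo_rcu_node *parent;	/* NULL for the root */
	unsigned int grpnum;		/* this node's bit in parent->qsmask */
};

struct demo_rcu_tree {
	struct demo_rcu_node root;
	struct demo_rcu_node leaf[DEMO_NLEAVES];
};

/* Start a grace period: mark every participating CPU as not yet reported. */
void demo_start_gp(struct demo_rcu_tree *t, unsigned int ncpus)
{
	unsigned int i, cpu;

	t->root.qsmask = 0;
	t->root.parent = NULL;
	t->root.grpnum = 0;
	for (i = 0; i < DEMO_NLEAVES; i++) {
		t->leaf[i].qsmask = 0;
		t->leaf[i].parent = &t->root;
		t->leaf[i].grpnum = i;
	}
	for (cpu = 0; cpu < ncpus && cpu < DEMO_FANOUT * DEMO_NLEAVES; cpu++) {
		t->leaf[cpu / DEMO_FANOUT].qsmask |= 1UL << (cpu % DEMO_FANOUT);
		t->root.qsmask |= 1UL << (cpu / DEMO_FANOUT);
	}
}

/* Report a quiescent state; returns 1 once the whole tree has reported. */
int demo_report_qs(struct demo_rcu_tree *t, unsigned int cpu)
{
	struct demo_rcu_node *np = &t->leaf[cpu / DEMO_FANOUT];
	unsigned long bit = 1UL << (cpu % DEMO_FANOUT);

	while (np) {
		np->qsmask &= ~bit;
		if (np->qsmask)
			return 0;	/* siblings still pending: stop here */
		bit = 1UL << np->grpnum;
		np = np->parent;	/* last one out reports one level up */
	}
	return 1;
}
```

    With six CPUs, the reports from CPUs 0 to 2 and from CPU 4 stop at
    their leaf. Only CPUs 3 and 5, the last reporters of their respective
    leaves, modify the root.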

    Some shortcomings:

    o More bugs will probably surface as a result of an ongoing
    line-by-line code inspection.

    Patches will be provided as required.

    o There are probably hangs, rcutorture failures, &c. Seems
    quite stable on a 128-CPU machine, but that is kind of small
    compared to 4096 CPUs. However, seems to do better than
    mainline.

    Patches will be provided as required.

    o The memory footprint of this version is several KB larger
    than rcuclassic.

    A separate UP-only rcutiny patch will be provided, which will
    reduce the memory footprint significantly, even compared
    to the old rcuclassic. One such patch passes light testing,
    and has a memory footprint smaller even than rcuclassic.
    Initial reaction from various embedded guys was "it is not
    worth it", so am putting it aside.

    Credits:

    o Manfred Spraul for ideas, review comments, and bugs spotted,
    as well as some good friendly competition. ;-)

    o Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
    Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton
    for reviews and comments.

    o Thomas Gleixner for much-needed help with some timer issues
    (see patches below).

    o Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos,
    Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton
    Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines
    alive despite my heavy abuse^Wtesting.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

13 Dec, 2008

2 commits

  • Conflicts:

    arch/x86/kernel/io_apic.c
    kernel/sched.c
    kernel/sched_stats.h

    Rusty Russell
     
  • Impact: cleanup

    Each SMP arch defines these themselves. Move them to a central
    location.

    Twists:
    1) Some archs (m32r, parisc, s390) set possible_map to all 1, so we add a
    CONFIG_INIT_ALL_POSSIBLE for this rather than break them.

    2) mips and sparc32 '#define cpu_possible_map phys_cpu_present_map'.
    Those archs simply have phys_cpu_present_map replaced everywhere.

    3) Alpha defined cpu_possible_map to cpu_present_map; this is tricky
    so I just manipulate them both in sync.

    4) IA64, cris and m32r have gratuitous 'extern cpumask_t cpu_possible_map'
    declarations.

    Signed-off-by: Rusty Russell
    Reviewed-by: Grant Grundler
    Tested-by: Tony Luck
    Acked-by: Ingo Molnar
    Cc: Mike Travis
    Cc: ink@jurassic.park.msu.ru
    Cc: rmk@arm.linux.org.uk
    Cc: starvik@axis.com
    Cc: tony.luck@intel.com
    Cc: takata@linux-m32r.org
    Cc: ralf@linux-mips.org
    Cc: grundler@parisc-linux.org
    Cc: paulus@samba.org
    Cc: schwidefsky@de.ibm.com
    Cc: lethal@linux-sh.org
    Cc: wli@holomorphy.com
    Cc: davem@davemloft.net
    Cc: jdike@addtoit.com
    Cc: mingo@redhat.com

    Rusty Russell
     

12 Dec, 2008

1 commit


08 Dec, 2008

1 commit

  • Impact: new feature

    Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
    NR_CPUS set to large values. The goal is to be able to scale up to much
    larger NR_IRQS value without impacting the (important) common case.

    To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
    irq_desc pointers.

    When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
    this also makes the IRQ descriptors NUMA-local (to the site that calls
    request_irq()).

    This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
    uses desc->chip_data for x86 to store irq_cfg.
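
    The shape of the change can be sketched in userspace as a static
    array of pointers whose descriptors are allocated on first use. This
    is only an illustration: calloc() stands in for kzalloc_node(), there
    is no locking, and the demo_* names and the DEMO_NR_IRQS bound are
    assumptions.

```c
#include <stdlib.h>

#define DEMO_NR_IRQS 4096	/* hypothetical upper bound on IRQ numbers */

struct demo_irq_desc {
	unsigned int irq;
	void *chip_data;	/* slot for per-arch data such as an irq_cfg */
};

/* Only the pointer array is static; descriptors are allocated on demand. */
static struct demo_irq_desc *demo_irq_desc_ptrs[DEMO_NR_IRQS];

/* Lookup only: returns NULL if the descriptor was never allocated. */
struct demo_irq_desc *demo_irq_to_desc(unsigned int irq)
{
	return irq < DEMO_NR_IRQS ? demo_irq_desc_ptrs[irq] : NULL;
}

/* Lookup, allocating a zeroed descriptor the first time an IRQ is used. */
struct demo_irq_desc *demo_irq_to_desc_alloc(unsigned int irq)
{
	struct demo_irq_desc *desc;

	if (irq >= DEMO_NR_IRQS)
		return NULL;
	desc = demo_irq_desc_ptrs[irq];
	if (desc)
		return desc;
	desc = calloc(1, sizeof(*desc));
	if (!desc)
		return NULL;
	desc->irq = irq;
	demo_irq_desc_ptrs[irq] = desc;
	return desc;
}
```

    Unused IRQ numbers cost one pointer each instead of a full descriptor,
    which is where the memory saving comes from.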

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     

04 Dec, 2008

1 commit


23 Nov, 2008

1 commit

  • Impact: fix initcall debug output on non-scalar ktime platforms (32-bit embedded)

    The initcall_debug code accesses the tv64 member of ktime. This won't work
    correctly for large deltas on platforms that don't use the scalar ktime
    implementation.
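
    A sketch of the difference, using a hypothetical non-scalar layout in
    which seconds and nanoseconds are separate fields sharing storage
    with tv64. The union, the demo_* helpers, and the ">> 10" shortcut
    (a cheap approximation of dividing by 1000) are assumptions made for
    illustration. The point is to convert through the fields rather than
    reading tv64 directly.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000LL

/* Hypothetical non-scalar layout: sec/nsec fields overlay tv64. */
union demo_ktime {
	int64_t tv64;
	struct {
		int32_t sec;
		int32_t nsec;
	} tv;
};

/* Convert through the fields instead of reading tv64 directly. */
int64_t demo_ktime_to_ns(union demo_ktime kt)
{
	return (int64_t)kt.tv.sec * DEMO_NSEC_PER_SEC + kt.tv.nsec;
}

/* Approximate microseconds by shifting right by 10 (divides by 1024). */
uint64_t demo_duration_usecs(union demo_ktime delta)
{
	return (uint64_t)demo_ktime_to_ns(delta) >> 10;
}
```

    On such a layout the raw tv64 value is just the two fields packed
    side by side, so shifting it directly gives a meaningless duration
    once the seconds field is non-zero.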

    Signed-off-by: Will Newton
    Acked-by: Tim Bird
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Will Newton
     

19 Nov, 2008

2 commits


16 Nov, 2008

1 commit

  • Impact: new API

    Add a new API trace_mark_tp(), which declares a marker within a
    tracepoint probe. When the marker is activated, the tracepoint is
    automatically enabled.

    No branch test is used at the marker site, because it would be a
    duplicate of the branch already present in the tracepoint.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     

14 Nov, 2008

3 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed conflicts above by using the non 'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Inaugurate copy-on-write credentials management. This uses RCU to manage the
    credentials pointer in the task_struct with respect to accesses by other tasks.
    A process may only modify its own credentials, and so does not need locking to
    access or modify its own credentials.

    A mutex (cred_replace_mutex) is added to the task_struct to control the effect
    of PTRACE_ATTACHED on credential calculations, particularly with respect to
    execve().

    With this patch, the contents of an active credentials struct may not be
    changed directly; rather a new set of credentials must be prepared, modified
    and committed using something like the following sequence of events:

    struct cred *new = prepare_creds();
    int ret = blah(new);
    if (ret < 0) {
            abort_creds(new);
            return ret;
    }
    return commit_creds(new);

    There are some exceptions to this rule: the keyrings pointed to by the active
    credentials may be instantiated - keyrings violate the COW rule as managing
    COW keyrings is tricky, given that it is possible for a task to directly alter
    the keys in a keyring in use by another task.

    To help enforce this, various pointers to sets of credentials, such as those in
    the task_struct, are declared const. The purpose of this is compile-time
    discouragement of altering credentials through those pointers. Once a set of
    credentials has been made public through one of these pointers, it may not be
    modified, except under special circumstances:

    (1) Its reference count may be incremented and decremented.

    (2) The keyrings to which it points may be modified, but not replaced.

    The only safe way to modify anything else is to create a replacement and commit
    using the functions described in Documentation/credentials.txt (which will be
    added by a later patch).
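
    The prepare/modify/commit cycle can be mimicked in userspace as
    below. This is a sketch under loud assumptions: it has no RCU, no
    locking and no security hooks, and every demo_* name is invented for
    the example. It only shows that a published set of credentials is
    never edited in place, apart from its reference count.

```c
#include <stdlib.h>

struct demo_cred {
	int usage;		/* reference count */
	unsigned int uid;
	unsigned int gid;
};

struct demo_task {
	const struct demo_cred *cred;	/* const: no in-place edits */
};

/* Allocate a fresh set of credentials holding one reference. */
struct demo_cred *demo_alloc_cred(unsigned int uid, unsigned int gid)
{
	struct demo_cred *c = malloc(sizeof(*c));

	if (c) {
		c->usage = 1;
		c->uid = uid;
		c->gid = gid;
	}
	return c;
}

/* Only the reference count may change through a const pointer. */
const struct demo_cred *demo_get_cred(const struct demo_cred *c)
{
	((struct demo_cred *)c)->usage++;
	return c;
}

void demo_put_cred(const struct demo_cred *c)
{
	struct demo_cred *m = (struct demo_cred *)c;

	if (--m->usage == 0)
		free(m);
}

/* Copy the task's current credentials into a private, writable set. */
struct demo_cred *demo_prepare_creds(const struct demo_task *t)
{
	struct demo_cred *new = demo_alloc_cred(t->cred->uid, t->cred->gid);

	return new;
}

/* Publish the new set and drop the task's reference to the old one. */
int demo_commit_creds(struct demo_task *t, struct demo_cred *new)
{
	const struct demo_cred *old = t->cred;

	t->cred = new;
	demo_put_cred(old);
	return 0;
}

/* Throw away a prepared set that was never published. */
void demo_abort_creds(struct demo_cred *new)
{
	demo_put_cred(new);
}

/* Example user: change the uid, rejecting (unsigned int)-1 as invalid. */
int demo_setuid(struct demo_task *t, unsigned int uid)
{
	struct demo_cred *new = demo_prepare_creds(t);

	if (!new)
		return -1;
	if (uid == (unsigned int)-1) {
		demo_abort_creds(new);
		return -1;
	}
	new->uid = uid;
	return demo_commit_creds(t, new);
}
```

    Anyone still holding a reference to the old set keeps seeing the old
    uid, while the task itself switches to the new set in a single
    pointer assignment.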

    This patch and the preceding patches have been tested with the LTP SELinux
    testsuite.

    This patch makes several logical sets of alteration:

    (1) execve().

    This now prepares and commits credentials in various places in the
    security code rather than altering the current creds directly.

    (2) Temporary credential overrides.

    do_coredump() and sys_faccessat() now prepare their own credentials and
    temporarily override the ones currently on the acting thread, whilst
    preventing interference from other threads by holding cred_replace_mutex
    on the thread being dumped.

    This will be replaced in a future patch by something that hands down the
    credentials directly to the functions being called, rather than altering
    the task's objective credentials.

    (3) LSM interface.

    A number of functions have been changed, added or removed:

    (*) security_capset_check(), ->capset_check()
    (*) security_capset_set(), ->capset_set()

    Removed in favour of security_capset().

    (*) security_capset(), ->capset()

    New. This is passed a pointer to the new creds, a pointer to the old
    creds and the proposed capability sets. It should fill in the new
    creds or return an error. All pointers, barring the pointer to the
    new creds, are now const.

    (*) security_bprm_apply_creds(), ->bprm_apply_creds()

    Changed; now returns a value, which will cause the process to be
    killed if it's an error.

    (*) security_task_alloc(), ->task_alloc_security()

    Removed in favour of security_prepare_creds().

    (*) security_cred_free(), ->cred_free()

    New. Free security data attached to cred->security.

    (*) security_prepare_creds(), ->cred_prepare()

    New. Duplicate any security data attached to cred->security.

    (*) security_commit_creds(), ->cred_commit()

    New. Apply any security effects for the upcoming installation of new
    security by commit_creds().

    (*) security_task_post_setuid(), ->task_post_setuid()

    Removed in favour of security_task_fix_setuid().

    (*) security_task_fix_setuid(), ->task_fix_setuid()

    Fix up the proposed new credentials for setuid(). This is used by
    cap_set_fix_setuid() to implicitly adjust capabilities in line with
    setuid() changes. Changes are made to the new credentials, rather
    than the task itself as in security_task_post_setuid().

    (*) security_task_reparent_to_init(), ->task_reparent_to_init()

    Removed. Instead the task being reparented to init is referred
    directly to init's credentials.

    NOTE! This results in the loss of some state: SELinux's osid no
    longer records the sid of the thread that forked it.

    (*) security_key_alloc(), ->key_alloc()
    (*) security_key_permission(), ->key_permission()

    Changed. These now take cred pointers rather than task pointers to
    refer to the security context.

    (4) sys_capset().

    This has been simplified and uses less locking. The LSM functions it
    calls have been merged.

    (5) reparent_to_kthreadd().

    This gives the current thread the same credentials as init by simply using
    commit_creds() to point that way.

    (6) __sigqueue_alloc() and switch_uid()

    __sigqueue_alloc() can't stop the target task from changing its creds
    beneath it, so this function gets a reference to the currently applicable
    user_struct which it then passes into the sigqueue struct it returns if
    successful.

    switch_uid() is now called from commit_creds(), and possibly should be
    folded into that. commit_creds() should take care of protecting
    __sigqueue_alloc().

    (7) [sg]et[ug]id() and co and [sg]et_current_groups.

    The set functions now all use prepare_creds(), commit_creds() and
    abort_creds() to build and check a new set of credentials before applying
    it.

    security_task_set[ug]id() is called inside the prepared section. This
    guarantees that nothing else will affect the creds until we've finished.

    The calling of set_dumpable() has been moved into commit_creds().

    Much of the functionality of set_user() has been moved into
    commit_creds().

    The get functions all simply access the data directly.

    (8) security_task_prctl() and cap_task_prctl().

    security_task_prctl() has been modified to return -ENOSYS if it doesn't
    want to handle a function, or otherwise return the return value directly
    rather than through an argument.

    Additionally, cap_task_prctl() now prepares a new set of credentials, even
    if it doesn't end up using it.
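
    The return convention can be sketched as follows. The option numbers,
    return values and demo_* function names are invented for the example;
    only the use of -ENOSYS to mean "not handled here" comes from the
    description above.

```c
#include <errno.h>

#define DEMO_PR_GET_SECURE 1	/* hypothetical option the hook understands */
#define DEMO_PR_GET_NAME   2	/* hypothetical option left to generic code */

/* Hook: returns its result directly, or -ENOSYS if it declines the option. */
long demo_security_task_prctl(int option)
{
	switch (option) {
	case DEMO_PR_GET_SECURE:
		return 7;	/* arbitrary value standing in for a real result */
	default:
		return -ENOSYS;
	}
}

/* Caller: falls back to its own handling only when the hook declines. */
long demo_sys_prctl(int option)
{
	long ret = demo_security_task_prctl(option);

	if (ret != -ENOSYS)
		return ret;

	switch (option) {
	case DEMO_PR_GET_NAME:
		return 0;
	default:
		return -EINVAL;
	}
}
```

    Returning the value directly avoids the extra output argument, at the
    cost of reserving -ENOSYS as a value the hook can never report as a
    genuine result.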

    (9) Keyrings.

    A number of changes have been made to the keyrings code:

    (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
    all been dropped and built in to the credentials functions directly.
    They may want separating out again later.

    (b) key_alloc() and search_process_keyrings() now take a cred pointer
    rather than a task pointer to specify the security context.

    (c) copy_creds() gives a new thread within the same thread group a new
    thread keyring if its parent had one, otherwise it discards the thread
    keyring.

    (d) The authorisation key now points directly to the credentials to extend
    the search into rather than pointing to the task that carries them.

    (e) Installing thread, process or session keyrings causes a new set of
    credentials to be created, even though it's not strictly necessary for
    process or session keyrings (they're shared).

    (10) Usermode helper.

    The usermode helper code now carries a cred struct pointer in its
    subprocess_info struct instead of a new session keyring pointer. This set
    of credentials is derived from init_cred and installed on the new process
    after it has been cloned.

    call_usermodehelper_setup() allocates the new credentials and
    call_usermodehelper_freeinfo() discards them if they haven't been used. A
    special cred function (prepare_usermodeinfo_creds()) is provided
    specifically for call_usermodehelper_setup() to call.

    call_usermodehelper_setkeys() adjusts the credentials to sport the
    supplied keyring as the new session keyring.

    (11) SELinux.

    SELinux has a number of changes, in addition to those to support the LSM
    interface changes mentioned above:

    (a) selinux_setprocattr() no longer does its check for whether the
    current ptracer can access processes with the new SID inside the lock
    that covers getting the ptracer's SID. Whilst this lock ensures that
    the check is done with the ptracer pinned, the result is only valid
    until the lock is released, so there's no point doing it inside the
    lock.

    (12) is_single_threaded().

    This function has been extracted from selinux_setprocattr() and put into
    a file of its own in the lib/ directory as join_session_keyring() now
    wants to use it too.

    The code in SELinux just checked to see whether a task shared mm_structs
    with other tasks (CLONE_VM), but that isn't good enough. We really want
    to know if they're part of the same thread group (CLONE_THREAD).
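
The difference between the two checks can be shown with a toy task table. This is only an illustration of the distinction drawn above, not the code in lib/: the `sketch_task` structure, its integer `mm_id` (standing in for an mm_struct pointer) and both helper names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_task {
    int pid;
    int tgid;   /* thread-group id, shared by CLONE_THREAD siblings */
    int mm_id;  /* stand-in for the mm_struct pointer, shared by CLONE_VM */
};

/* Old-style check: does any other task share this task's address space? */
bool sketch_shares_mm(const struct sketch_task *all, size_t n, size_t idx)
{
    for (size_t i = 0; i < n; i++)
        if (i != idx && all[i].mm_id == all[idx].mm_id)
            return true;
    return false;
}

/* Thread-group check: is this task the only member of its thread group? */
bool sketch_single_threaded(const struct sketch_task *all, size_t n, size_t idx)
{
    for (size_t i = 0; i < n; i++)
        if (i != idx && all[i].tgid == all[idx].tgid)
            return false;
    return true;
}

/* Usage: tasks 0 and 1 share an mm without being threads of each other. */
void sketch_threads_demo(void)
{
    const struct sketch_task tasks[] = {
        { 1, 1, 10 },   /* own thread group, mm shared with pid 2 */
        { 2, 2, 10 },   /* CLONE_VM without CLONE_THREAD */
        { 3, 3, 30 },   /* thread-group leader */
        { 4, 3, 30 },   /* CLONE_THREAD sibling of pid 3 */
    };
    size_t n = sizeof(tasks) / sizeof(tasks[0]);

    assert(sketch_shares_mm(tasks, n, 0));        /* mm test says "shared" */
    assert(sketch_single_threaded(tasks, n, 0));  /* yet only one thread */
    assert(!sketch_single_threaded(tasks, n, 2));
    assert(!sketch_single_threaded(tasks, n, 3));
}
```

The first task is where the two tests disagree: it shares an address space, but it is still the only thread in its group.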

    (13) nfsd.

    The NFS server daemon now has to use the COW credentials to set the
    credentials it is going to use. It really needs to pass the credentials
    down to the functions it calls, but it can't do that until other patches
    in this series have been applied.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Signed-off-by: James Morris

    David Howells
     
  • In 2007, a0acd820807680d2ccc4ef3448387fcdbf152c73 changed the default
    slab allocator to SLUB, but the SLAB help text still says SLAB is the
    default. This change fixes that.

    Signed-off-by: Simon Arlott
    Signed-off-by: Pekka Enberg

    Simon Arlott
     

13 Nov, 2008

1 commit


12 Nov, 2008

2 commits

  • Impact: Split the boot tracer entries in two parts: call and return

    Now that we are using the sched tracer from the boot tracer, we want
    to use the same timestamp as the ring buffer to have consistent time
    captures between sched events and initcall events.

    So we get rid of the old time capture by the boot tracer and split the
    initcall events into two parts: call and return. This way we have the
    ring buffer timestamp of both.

    An example trace:

    [ 27.904149584] calling net_ns_init+0x0/0x1c0 @ 1
    [ 27.904429624] initcall net_ns_init+0x0/0x1c0 returned 0 after 0 msecs
    [ 27.904575926] calling reboot_init+0x0/0x20 @ 1
    [ 27.904655399] initcall reboot_init+0x0/0x20 returned 0 after 0 msecs
    [ 27.904800228] calling sysctl_init+0x0/0x30 @ 1
    [ 27.905142914] initcall sysctl_init+0x0/0x30 returned 0 after 0 msecs
    [ 27.905287211] calling ksysfs_init+0x0/0xb0 @ 1
    ##### CPU 0 buffer started ####
    init-1 [000] 27.905395: 1:120:R + [001] 11:115:S
    ##### CPU 1 buffer started ####
    -0 [001] 27.905425: 0:140:R ==> [001] 11:115:R
    init-1 [000] 27.905426: 1:120:D ==> [000] 0:140:R
    -0 [000] 27.905431: 0:140:R + [000] 4:115:S
    -0 [000] 27.905451: 0:140:R ==> [000] 4:115:R
    ksoftirqd/0-4 [000] 27.905456: 4:115:S ==> [000] 0:140:R
    udevd-11 [001] 27.905458: 11:115:R + [001] 14:115:R
    -0 [000] 27.905459: 0:140:R + [000] 4:115:S
    -0 [000] 27.905462: 0:140:R ==> [000] 4:115:R
    udevd-11 [001] 27.905462: 11:115:R ==> [001] 14:115:R
    ksoftirqd/0-4 [000] 27.905467: 4:115:S ==> [000] 0:140:R
    -0 [000] 27.905470: 0:140:R + [000] 4:115:S
    -0 [000] 27.905473: 0:140:R ==> [000] 4:115:R
    ksoftirqd/0-4 [000] 27.905476: 4:115:S ==> [000] 0:140:R
    -0 [000] 27.905479: 0:140:R + [000] 4:115:S
    -0 [000] 27.905482: 0:140:R ==> [000] 4:115:R
    ksoftirqd/0-4 [000] 27.905486: 4:115:S ==> [000] 0:140:R
    udevd-14 [001] 27.905499: 14:120:X ==> [001] 11:115:R
    udevd-11 [001] 27.905506: 11:115:R + [000] 1:120:D
    -0 [000] 27.905515: 0:140:R ==> [000] 1:120:R
    udevd-11 [001] 27.905517: 11:115:S ==> [001] 0:140:R
    [ 27.905557107] initcall ksysfs_init+0x0/0xb0 returned 0 after 3906 msecs
    [ 27.905705736] calling init_jiffies_clocksource+0x0/0x10 @ 1
    [ 27.905779239] initcall init_jiffies_clocksource+0x0/0x10 returned 0 after 0 msecs
    [ 27.906769814] calling pm_init+0x0/0x30 @ 1
    [ 27.906853627] initcall pm_init+0x0/0x30 returned 0 after 0 msecs
    [ 27.906997803] calling pm_disk_init+0x0/0x20 @ 1
    [ 27.907076946] initcall pm_disk_init+0x0/0x20 returned 0 after 0 msecs
    [ 27.907222556] calling swsusp_header_init+0x0/0x30 @ 1
    [ 27.907294325] initcall swsusp_header_init+0x0/0x30 returned 0 after 0 msecs
    [ 27.907439620] calling stop_machine_init+0x0/0x50 @ 1
    init-1 [000] 27.907485: 1:120:R + [000] 2:115:S
    init-1 [000] 27.907490: 1:120:D ==> [000] 2:115:R
    kthreadd-2 [000] 27.907507: 2:115:R + [001] 15:115:R
    -0 [001] 27.907517: 0:140:R ==> [001] 15:115:R
    kthreadd-2 [000] 27.907517: 2:115:D ==> [000] 0:140:R
    -0 [000] 27.907521: 0:140:R + [000] 4:115:S
    -0 [000] 27.907524: 0:140:R ==> [000] 4:115:R
    udevd-15 [001] 27.907527: 15:115:D + [000] 2:115:D
    ksoftirqd/0-4 [000] 27.907537: 4:115:S ==> [000] 2:115:R
    udevd-15 [001] 27.907537: 15:115:D ==> [001] 0:140:R
    kthreadd-2 [000] 27.907546: 2:115:R + [000] 1:120:D
    kthreadd-2 [000] 27.907550: 2:115:S ==> [000] 1:120:R
    init-1 [000] 27.907584: 1:120:R + [000] 15: 0:D
    init-1 [000] 27.907589: 1:120:R + [000] 2:115:S
    init-1 [000] 27.907593: 1:120:D ==> [000] 15: 0:R
    udevd-15 [000] 27.907601: 15: 0:S ==> [000] 2:115:R
    ##### CPU 0 buffer started ####
    kthreadd-2 [000] 27.907616: 2:115:R + [001] 16:115:R
    ##### CPU 1 buffer started ####
    -0 [001] 27.907620: 0:140:R ==> [001] 16:115:R
    kthreadd-2 [000] 27.907621: 2:115:D ==> [000] 0:140:R
    udevd-16 [001] 27.907625: 16:115:D + [000] 2:115:D
    -0 [000] 27.907628: 0:140:R + [000] 4:115:S
    udevd-16 [001] 27.907629: 16:115:D ==> [001] 0:140:R
    -0 [000] 27.907631: 0:140:R ==> [000] 4:115:R
    ksoftirqd/0-4 [000] 27.907636: 4:115:S ==> [000] 2:115:R
    kthreadd-2 [000] 27.907644: 2:115:R + [000] 1:120:D
    kthreadd-2 [000] 27.907647: 2:115:S ==> [000] 1:120:R
    init-1 [000] 27.907657: 1:120:R + [001] 16: 0:D
    -0 [001] 27.907666: 0:140:R ==> [001] 16: 0:R
    [ 27.907703862] initcall stop_machine_init+0x0/0x50 returned 0 after 0 msecs
    [ 27.907850704] calling filelock_init+0x0/0x30 @ 1
    [ 27.907926573] initcall filelock_init+0x0/0x30 returned 0 after 0 msecs
    [ 27.908071327] calling init_script_binfmt+0x0/0x10 @ 1
    [ 27.908165195] initcall init_script_binfmt+0x0/0x10 returned 0 after 0 msecs
    [ 27.908309461] calling init_elf_binfmt+0x0/0x10 @ 1
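
The interleaving seen in the trace above follows from recording the call and the return as two separate events in the same buffer as the sched events. A minimal user-space sketch of that idea, assuming a counter in place of the ring-buffer clock and with all `sketch_*` names invented for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_event_type { SKETCH_CALL, SKETCH_RET, SKETCH_SCHED };

struct sketch_event {
    uint64_t ts;                  /* timestamp taken when the event is stored */
    enum sketch_event_type type;
    const char *name;
    int result;                   /* only meaningful for SKETCH_RET */
};

#define SKETCH_BUF_SIZE 16
static struct sketch_event sketch_buf[SKETCH_BUF_SIZE];
static size_t sketch_len;
static uint64_t sketch_clock;     /* stand-in for the ring-buffer clock */

/* Every event type goes through the same path, hence the same clock. */
void sketch_record(enum sketch_event_type type, const char *name, int result)
{
    assert(sketch_len < SKETCH_BUF_SIZE);
    sketch_buf[sketch_len].ts = ++sketch_clock;
    sketch_buf[sketch_len].type = type;
    sketch_buf[sketch_len].name = name;
    sketch_buf[sketch_len].result = result;
    sketch_len++;
}

void sketch_trace_sched_switch(const char *comm)
{
    sketch_record(SKETCH_SCHED, comm, 0);
}

/* One event before the initcall runs and one after it returns. */
int sketch_do_one_initcall(const char *name, int (*fn)(void))
{
    int ret;

    sketch_record(SKETCH_CALL, name, 0);
    ret = fn();
    sketch_record(SKETCH_RET, name, ret);
    return ret;
}

/* Hypothetical initcall that causes one context switch while it runs. */
int sketch_demo_init(void)
{
    sketch_trace_sched_switch("udevd");
    return 0;
}

/* Usage: the sched event lands between the call and return events. */
void sketch_boot_trace_demo(void)
{
    sketch_len = 0;
    sketch_do_one_initcall("sketch_demo_init", sketch_demo_init);
    assert(sketch_len == 3);
    assert(sketch_buf[0].type == SKETCH_CALL);
    assert(sketch_buf[1].type == SKETCH_SCHED);
    assert(sketch_buf[2].type == SKETCH_RET);
    assert(sketch_buf[0].ts < sketch_buf[1].ts);
    assert(sketch_buf[1].ts < sketch_buf[2].ts);
}
```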

    Signed-off-by: Frederic Weisbecker
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Impact: Cleanups on the boot tracer and ftrace

    This patch cleans up the boot tracer headers. The functions and
    structures of this tracer are unrelated to ftrace and should therefore
    have their own header file.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

05 Nov, 2008

1 commit

  • Impact: modify boot tracer

    We used to disable the initcall tracing at a specified time (i.e. the
    end of the built-in initcalls), but we no longer need to: it will be
    stopped when the initcalls are finished.

    However, we want two things:

    - Start this tracing only after the pre-SMP initcalls are finished.

    - Since we are planning to trace sched_switches at the same time,
    enable them only during initcall execution.

    For this purpose, this patch introduces two functions to enable/disable
    the sched_switch tracing during boot.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

02 Nov, 2008

1 commit

  • Removed duplicated #include in init/do_mounts_md.c.

    The same compile error ("error: implicit declaration of function
    'msleep'") got fixed twice:

    - f8b77d39397e1510b1a3bcfd385ebd1a45aae77f ("init/do_mounts_md.c:
    msleep compile fix")

    - 73b4a24f5ff09389ba6277c53a266b142f655ed2 ("init/do_mounts_md.c must
    #include ")

    by people adding the include in two slightly different
    places. Andrew's quilt scripts happily ignore the fuzz, and will
    re-apply the patch even though they had conflicts.

    Signed-off-by: Huang Weiyi
    Signed-off-by: Linus Torvalds

    Huang Weiyi
     

31 Oct, 2008

2 commits


26 Oct, 2008

1 commit

  • This reverts commit a802dd0eb5fc97a50cf1abb1f788a8f6cc5db635 by moving
    the call to init_workqueues() back where it belongs - after SMP has been
    initialized.

    It also moves stop_machine_init() - which needs workqueues - to a later
    phase using a core_initcall() instead of early_initcall(). That should
    satisfy all ordering requirements, and was apparently the reason why
    init_workqueues() was moved to be too early.
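
The ordering constraint can be modelled with a toy boot sequence. This is a simplified sketch, assuming only two initcall levels and a flag in place of real workqueue setup; the `sketch_*` names are hypothetical and the real initcall machinery uses linker sections rather than an array.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_level { SKETCH_EARLY, SKETCH_CORE };

struct sketch_initcall {
    enum sketch_level level;
    int (*fn)(void);
};

static int sketch_workqueues_ready;

/* Stand-in for stop_machine_init(): it fails if workqueues do not exist yet. */
int sketch_stop_machine_init(void)
{
    return sketch_workqueues_ready ? 0 : -1;
}

/*
 * Toy boot: early initcalls run before SMP bring-up, workqueues are
 * initialised after it, and core initcalls run last.
 * Returns the number of initcalls that failed.
 */
int sketch_boot(const struct sketch_initcall *calls, size_t n)
{
    int failures = 0;
    size_t i;

    sketch_workqueues_ready = 0;

    for (i = 0; i < n; i++)
        if (calls[i].level == SKETCH_EARLY && calls[i].fn() != 0)
            failures++;

    /* Stands in for SMP initialisation followed by init_workqueues(). */
    sketch_workqueues_ready = 1;

    for (i = 0; i < n; i++)
        if (calls[i].level == SKETCH_CORE && calls[i].fn() != 0)
            failures++;

    return failures;
}

/* Usage: the same function fails as an early initcall, works as a core one. */
void sketch_initcall_demo(void)
{
    const struct sketch_initcall too_early = { SKETCH_EARLY, sketch_stop_machine_init };
    const struct sketch_initcall late_enough = { SKETCH_CORE, sketch_stop_machine_init };

    assert(sketch_boot(&too_early, 1) == 1);
    assert(sketch_boot(&late_enough, 1) == 0);
}
```

Moving the dependent initcall to a later level satisfies the dependency without pulling workqueue initialisation ahead of SMP setup.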

    Cc: Heiko Carstens
    Cc: Rusty Russell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Oct, 2008

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits)
    [PATCH] fs: add a sanity check in d_free
    [PATCH] i_version: remount support
    [patch] vfs: make security_inode_setattr() calling consistent
    [patch 1/3] FS_MBCACHE: don't needlessly make it built-in
    [PATCH] move executable checking into ->permission()
    [PATCH] fs/dcache.c: update comment of d_validate()
    [RFC PATCH] touch_mnt_namespace when the mount flags change
    [PATCH] reiserfs: add missing llseek method
    [PATCH] fix ->llseek for more directories
    [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent
    [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup
    [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate()
    [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper
    [PATCH vfs-2.6 2/6] vfs: add d_ancestor()
    [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT()
    [PATCH] get rid of on-stack dentry in udf
    [PATCH 2/2] anondev: switch to IDA
    [PATCH 1/2] anondev: init IDR statically
    [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup()
    [PATCH] Optimise NFS readdir hack slightly.
    ...

    Linus Torvalds
     
  • * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (32 commits)
    PCI hotplug: fix logic in Compaq hotplug controller bus speed setup
    PCI: don't export linux/io.h from pci.h
    PCI: PCI_QUIRKS depends on PCI
    PCI hotplug: pciehp: poll data link layer link active
    PCI hotplug: pciehp: fix possible memory leak in pcie_init
    PCI: Workaround invalid P2P bridge bus numbers
    PCI Hotplug: fakephp: add duplicate slot name debugging
    PCI: Hotplug core: remove 'name'
    PCI: shcphp: remove 'name' parameter
    PCI: SGI Hotplug: stop managing bss_hotplug_slot->name
    PCI: rpaphp: kmalloc/kfree slot->name directly
    PCI: pciehp: remove 'name' parameter
    PCI: ibmphp: stop managing hotplug_slot->name
    PCI: fakephp: remove 'name' parameter
    PCI, PCI Hotplug: introduce slot_name helpers
    PCI: cpqphp: stop managing hotplug_slot->name
    PCI: cpci_hotplug: stop managing hotplug_slot->name
    PCI: acpiphp: remove 'name' parameter
    PCI: prevent duplicate slot names
    PCI Hotplug: serialize pci_hp_register and pci_hp_deregister
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    stop_machine: fix error code handling on multiple cpus
    stop_machine: use workqueues instead of kernel threads
    workqueue: introduce create_rt_workqueue
    Call init_workqueues before pre smp initcalls.
    Make panic= and panic_on_oops into core_params
    Make initcall_debug a core_param
    core_param() for genuinely core kernel parameters
    param: Fix duplicate module prefixes
    module: check kernel param length at compile time, not runtime
    Remove stop_machine during module load v2
    module: simplify load_module.

    Linus Torvalds