27 Sep, 2006

2 commits

  • * 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
    [PATCH] Don't set calgary iommu as default y
    [PATCH] i386/x86-64: New Intel feature flags
    [PATCH] x86: Add a cumulative thermal throttle event counter.
    [PATCH] i386: Make the jiffies compares use the 64bit safe macros.
    [PATCH] x86: Refactor thermal throttle processing
    [PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
    [PATCH] Fix unwinder warning in traps.c
    [PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
    [PATCH] x86: Move direct PCI scanning functions out of line
    [PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
    [PATCH] Don't leak NT bit into next task
    [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
    [PATCH] Fix some broken white space in ia32_signal.c
    [PATCH] Initialize argument registers for 32bit signal handlers.
    [PATCH] Remove all traces of signal number conversion
    [PATCH] Don't synchronize time reading on single core AMD systems
    [PATCH] Remove outdated comment in x86-64 mmconfig code
    [PATCH] Use string instructions for Core2 copy/clear
    [PATCH] x86: - restore i8259A eoi status on resume
    [PATCH] i386: Split multi-line printk in oops output.
    ...

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (47 commits)
    Driver core: Don't call put methods while holding a spinlock
    Driver core: Remove unneeded routines from driver core
    Driver core: Fix potential deadlock in driver core
    PCI: enable driver multi-threaded probe
    Driver Core: add ability for drivers to do a threaded probe
    sysfs: add proper sysfs_init() prototype
    drivers/base: check errors
    drivers/base: Platform notify needs to occur before drivers attach to the device
    v4l-dev2: handle __must_check
    add CONFIG_ENABLE_MUST_CHECK
    add __must_check to device management code
    Driver core: fixed add_bind_files() definition
    Driver core: fix comments in drivers/base/power/resume.c
    sysfs_remove_bin_file: no return value, dump_stack on error
    kobject: must_check fixes
    Driver core: add ability for devices to create and remove bin files
    Class: add support for class interfaces for devices
    Driver core: create devices/virtual/ tree
    Driver core: add device_rename function
    Driver core: add ability for classes to handle devices properly
    ...

    Linus Torvalds
     

26 Sep, 2006

38 commits

  • Add the pm_trace attribute in /sys/power which has to be explicitly set to
    one to really enable the "PM tracing" code compiled in when CONFIG_PM_TRACE
    is set (which modifies the machine's CMOS clock in unpredictable ways).

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Change suspend_console() so that it waits for all consoles to flush the
    remaining messages and make it possible to switch the console suspending off
    with the help of a Kconfig option.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Stefan Seyfried
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make swsusp use memory bitmaps to store its internal information during the
    resume phase of the suspend-resume cycle.

    If the pfns of saveable pages are saved during the suspend phase instead of
    the kernel virtual addresses of these pages, we can use them during the resume
    phase directly to set the corresponding bits in a memory bitmap. Then, this
    bitmap is used to mark the page frames corresponding to the pages that were
    saveable before the suspend (aka "unsafe" page frames).

    Next, we allocate as many page frames as needed to store the entire suspend
    image and make sure that there will be some extra free "safe" page frames for
    the list of PBEs constructed later. Subsequently, the image is loaded and, if
    possible, the data loaded from it are written into their "original" page
    frames (ie. the ones they had occupied before the suspend).

    The image data that cannot be written into their "original" page frames are
    loaded into "safe" page frames and their "original" kernel virtual addresses,
    as well as the addresses of the "safe" pages containing their copies, are
    stored in a list of PBEs. Finally, the list of PBEs is used to copy the
    remaining image data into their "original" page frames (this is done
    atomically, by the architecture-dependent parts of swsusp).

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
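
The final copy-back step described above can be pictured as a walk over a linked list of (safe copy, original location) pairs. The following is only a userspace sketch under stated assumptions: `struct pbe_sketch`, `restore_from_pbes()` and `PAGE_BYTES` are hypothetical names, the page size is assumed to be 4096 bytes, and the real copy is done atomically by architecture-dependent code rather than by a plain loop.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096u   /* assumed page size, for illustration only */

/* Hypothetical stand-in for one list entry ("PBE"). */
struct pbe_sketch {
    void *safe_copy;            /* "safe" page holding the loaded image data */
    void *orig_page;            /* where that data lived before the suspend  */
    struct pbe_sketch *next;
};

/* Copy every relocated page back to its original location.
 * Returns the number of pages copied. */
size_t restore_from_pbes(const struct pbe_sketch *list)
{
    size_t copied = 0;

    for (; list; list = list->next) {
        memcpy(list->orig_page, list->safe_copy, PAGE_BYTES);
        copied++;
    }
    return copied;
}
```

Pages that could be loaded straight into their original frames never appear in the list, so the list only covers the remainder.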
     
  • Introduce the memory bitmap data structure and make swsusp use it in the
    suspend phase.

    The current swsusp's internal data structure is not very efficient from the
    memory usage point of view, so it seems reasonable to replace it with a data
    structure that will require less memory, such as a pair of bitmaps.

    The idea is to use bitmaps that may be allocated as sets of individual pages,
    so that we can avoid making allocations of order greater than 0. For this
    reason the memory bitmap structure consists of several linked lists of objects
    that contain pointers to memory pages with the actual bitmap data. Still, for
    a typical system all of these lists fit in a single page, so it's reasonable
    to introduce an additional mechanism allowing us to allocate all of them
    efficiently without sacrificing the generality of the design. This is done
    with the help of the chain_allocator structure and associated functions.

    We need to use two memory bitmaps during the suspend phase of the
    suspend-resume cycle. One of them is necessary for marking the saveable
    pages, and the second is used to mark the pages in which to store the copies
    of them (aka image pages).

    First, the bitmaps are created and we allocate as many image pages as needed
    (the corresponding bits in the second bitmap are set as soon as the pages are
    allocated). Second, the bits corresponding to the saveable pages are set in
    the first bitmap and the saveable pages are copied to the image pages.
    Finally, the first bitmap is used to save the kernel virtual addresses of the
    saveable pages and the second one is used to save the contents of the image
    pages.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
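
The idea of a bitmap built only from individually allocated pages can be sketched in userspace C as follows. This is a simplified illustration, not the kernel code: it uses one flat linked list instead of several lists, uses malloc() instead of the chain_allocator, and all names (`struct bm_block`, `bm_create()`, and so on) and the 4096-byte block size are assumptions.

```c
#include <stdlib.h>

#define BLOCK_BYTES    4096UL              /* one order-0 page per block (assumed) */
#define BITS_PER_BLOCK (BLOCK_BYTES * 8)

/* One node per page of bitmap data; nodes are chained, never contiguous. */
struct bm_block {
    struct bm_block *next;
    unsigned long start_pfn;               /* first pfn covered by this block */
    unsigned char *data;                   /* BLOCK_BYTES of bitmap bits */
};

struct mem_bitmap {
    struct bm_block *blocks;
};

void bm_free(struct mem_bitmap *bm)
{
    while (bm->blocks) {
        struct bm_block *bb = bm->blocks;

        bm->blocks = bb->next;
        free(bb->data);
        free(bb);
    }
}

/* Build a bitmap covering pfns [0, nr_pfns). Returns 0, or -1 if out of memory. */
int bm_create(struct mem_bitmap *bm, unsigned long nr_pfns)
{
    struct bm_block **tail = &bm->blocks;
    unsigned long pfn;

    bm->blocks = NULL;
    for (pfn = 0; pfn < nr_pfns; pfn += BITS_PER_BLOCK) {
        struct bm_block *bb = malloc(sizeof(*bb));

        if (!bb) {
            bm_free(bm);
            return -1;
        }
        bb->data = calloc(1, BLOCK_BYTES);
        if (!bb->data) {
            free(bb);
            bm_free(bm);
            return -1;
        }
        bb->start_pfn = pfn;
        bb->next = NULL;
        *tail = bb;
        tail = &bb->next;
    }
    return 0;
}

static struct bm_block *bm_find(const struct mem_bitmap *bm, unsigned long pfn)
{
    struct bm_block *bb;

    for (bb = bm->blocks; bb; bb = bb->next)
        if (pfn >= bb->start_pfn && pfn - bb->start_pfn < BITS_PER_BLOCK)
            return bb;
    return NULL;
}

/* Mark a pfn. Returns 0, or -1 if the pfn is outside the bitmap. */
int bm_set_bit(struct mem_bitmap *bm, unsigned long pfn)
{
    struct bm_block *bb = bm_find(bm, pfn);
    unsigned long off;

    if (!bb)
        return -1;
    off = pfn - bb->start_pfn;
    bb->data[off / 8] |= (unsigned char)(1u << (off % 8));
    return 0;
}

/* Returns 1 if the pfn is marked, 0 otherwise (including out of range). */
int bm_test_bit(const struct mem_bitmap *bm, unsigned long pfn)
{
    struct bm_block *bb = bm_find(bm, pfn);
    unsigned long off;

    if (!bb)
        return 0;
    off = pfn - bb->start_pfn;
    return (bb->data[off / 8] >> (off % 8)) & 1;
}
```

In the scheme described above, two such bitmaps would be used during suspend: one marking the saveable pages and one marking the image pages that hold their copies.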
     
  • Introduce some constants that hopefully will help improve the readability of
    code in kernel/power/snapshot.c.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The name of the pagedir_nosave variable does not make sense any more, so it
    seems reasonable to change it to something more meaningful.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Get rid of the FIXME in kernel/power/snapshot.c#alloc_pagedir() and
    simplify the functions called by it.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Move some functions in kernel/power/snapshot.c to a better place (in the
    same file) and introduce free_image_page() (will be necessary in the
    future).

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
    PageNosaveFree for PageNosave pages. This allows us to get rid of an ugly
    hack in kernel/power/snapshot.c#copy_data_pages().

    Additionally, the page-copying loop in copy_data_pages() is moved to an
    inline function.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The current suspend code has to be run on one CPU, so we use the CPU
    hotplug to take the non-boot CPUs offline on SMP machines. However, we
    should also make sure that these CPUs will not be enabled by someone else
    after we have disabled them.

    The functions disable_nonboot_cpus() and enable_nonboot_cpus() are moved to
    kernel/cpu.c, because they now refer to symbols there that are better kept
    static. Also, it is better for disable_nonboot_cpus() to return an error
    instead of panicking if something goes wrong, and enable_nonboot_cpus() has
    no reason to panic(), because the CPUs may have been enabled by userland
    before it tries to bring them online.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Add comments describing struct snapshot_handle and its members, change the
    confusing name of its member 'page' to 'cur'.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Clean up some loops over pfns for each zone in snapshot.c: reduce the
    number of additions to perform, rework detection of saveable pages and make
    the code a bit less difficult to understand, hopefully.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Implement async reads for swsusp resuming.

    Crufty old PIII testbox:
    15.7 MB/s -> 20.3 MB/s

    Sony Vaio:
    14.6 MB/s -> 33.3 MB/s

    I didn't implement the post-resume bio_set_pages_dirty(). I don't really
    understand why resume needs to run set_page_dirty() against these pages.

    It might be a worry that this code modifies PG_Uptodate, PG_Error and
    PG_Locked against the image pages. Can this possibly affect the resumed-into
    kernel? Hopefully not, if we're atomically restoring its mem_map?

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Jens Axboe
    Cc: Laurent Riffard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Add some instrumentation to the swsusp readin code to show what bandwidth
    we're achieving.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

    Crufty old PIII testbox:
    12.9 MB/s -> 20.9 MB/s

    Sony Vaio:
    14.7 MB/s -> 26.5 MB/s

    The implementation is crude. A better one would use larger BIOs, but wouldn't
    gain any performance.

    The memcpys will be mostly pipelined with the IO and basically come for free.

    The ENOMEM path has not been tested. It should be.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
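
One way to picture the move from 4k-at-a-time to 4MB-at-a-time writeout is a staging buffer that collects pages with memcpy() and hands them to the I/O layer only when a full batch is ready. The sketch below is an illustration under that assumption, not the patch itself: `struct batch_writer`, `batch_write_page()`, `batch_flush()` and the counting sink `count_flush()` are hypothetical names, and the 4096-byte page size is assumed.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES  4096u                   /* assumed page size */
#define BATCH_BYTES (4u * 1024u * 1024u)    /* 4MB batch, as in the description */

/* Sink that receives one full (or final partial) batch. */
typedef int (*flush_fn)(const void *buf, size_t len, void *ctx);

struct batch_writer {
    unsigned char *buf;     /* BATCH_BYTES of staging space, owned by the caller */
    size_t used;            /* bytes currently staged */
    flush_fn flush;
    void *ctx;
};

/* Push whatever is staged to the sink. */
int batch_flush(struct batch_writer *w)
{
    int rc;

    if (w->used == 0)
        return 0;
    rc = w->flush(w->buf, w->used, w->ctx);
    w->used = 0;
    return rc;
}

/* Stage one page; the sink is only called when the batch is full. */
int batch_write_page(struct batch_writer *w, const void *page)
{
    memcpy(w->buf + w->used, page, PAGE_BYTES);
    w->used += PAGE_BYTES;
    return w->used == BATCH_BYTES ? batch_flush(w) : 0;
}

/* Example sink for experiments: counts calls and bytes instead of doing I/O. */
struct flush_stats {
    size_t calls;
    size_t bytes;
};

int count_flush(const void *buf, size_t len, void *ctx)
{
    struct flush_stats *st = ctx;

    (void)buf;
    st->calls++;
    st->bytes += len;
    return 0;
}
```

With these sizes a batch holds 1024 pages, so writing 1025 pages triggers one automatic flush and leaves one page for the final batch_flush() call.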
     
  • Add some instrumentation to the swsusp writeout code to show what bandwidth
    we're achieving.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Permit __do_IRQ() to be dispensed with based on a configuration option.

    Signed-off-by: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Rename selinux_ctxid_to_string to selinux_sid_to_string to be
    consistent with other interfaces.

    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Eliminate selinux_task_ctxid since it duplicates selinux_task_get_sid.

    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult-to-read sequence of pointer dereferences.
    Put that into an inline function and use it throughout the VM. Maybe we
    can find a way to optimize the lookup in the future.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
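
The kind of helper described above might look like the sketch below. The commit text does not name the function or the fields, so `zone_to_nid_sketch()` and the two cut-down structs are assumptions that only mirror the shape of a zone pointing at its per-node data.

```c
/* Cut-down, hypothetical stand-ins for the per-node and per-zone structures. */
struct pglist_data_sketch {
    int node_id;
};

struct zone_sketch {
    struct pglist_data_sketch *zone_pgdat;   /* node this zone belongs to */
};

/* One readable call instead of an open-coded dereference chain. */
static inline int zone_to_nid_sketch(const struct zone_sketch *zone)
{
    return zone->zone_pgdat->node_id;
}
```

Keeping the lookup behind one function also leaves a single place to change if a faster lookup is found later.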
     
  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if the freeing of unmapped file backed pages is not enough to free
    enough pages to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab pages
    (this patch depends on an earlier one that implements that counter)
    is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per node slab use minus the number of pages that need to be
    allocated) and then repeatedly run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone based
    slab reclaim at some point which will make that easier.

    The default for the min_slab_ratio is 5%.

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
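
The two threshold checks and the slab target described above can be written down as plain arithmetic. This is a sketch under assumptions, not the kernel implementation: the struct and function names are hypothetical, both ratios are treated as percentages of the pages in the zone, and the repeated global reclaim loop itself is left out.

```c
/* Hypothetical per-zone counters, in pages. */
struct zone_counts {
    unsigned long present_pages;        /* pages in the zone */
    unsigned long unmapped_pagecache;   /* unmapped file backed pages */
    unsigned long reclaimable_slab;     /* reclaimable slab pages */
};

/* Tunables, both in percent of the pages in the zone (assumed). */
struct reclaim_tunables {
    unsigned int min_unmapped_ratio;
    unsigned int min_slab_ratio;        /* the description gives a default of 5 */
};

enum {
    RECLAIM_PAGECACHE = 1,
    RECLAIM_SLAB      = 2,
};

/* Decide which of the two shrink steps zone reclaim should attempt. */
unsigned int zone_reclaim_plan(const struct zone_counts *z,
                               const struct reclaim_tunables *t)
{
    unsigned int plan = 0;

    if (z->unmapped_pagecache > z->present_pages * t->min_unmapped_ratio / 100)
        plan |= RECLAIM_PAGECACHE;
    if (z->reclaimable_slab > z->present_pages * t->min_slab_ratio / 100)
        plan |= RECLAIM_SLAB;
    return plan;
}

/* Slab level to aim for: current per node use minus the pages needed. */
unsigned long slab_reclaim_target(unsigned long cur_slab, unsigned long nr_pages)
{
    return cur_slab > nr_pages ? cur_slab - nr_pages : 0;
}
```

For a 1000-page zone with min_slab_ratio set to 5, the slab step is only planned once more than 50 reclaimable slab pages are present.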
     
  • Profiling really suffers with off node buffers. Fail if no memory is
    available on the nodes. The profiling code can deal with these failures
    should they occur.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • …mory policy restrictions

    Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
    flag is essential if a kernel component requires memory to be located on a
    certain node. It will be needed for alloc_pages_node() to force allocation
    on the indicated node and for alloc_pages() to force allocation on the
    current node.

    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

    Christoph Lameter
     
  • The scheduler will stop load balancing if the most busy processor contains
    processes pinned via processor affinity.

    The scheduler currently only does one search for busiest cpu. If it cannot
    pull any tasks away from the busiest cpu because they were pinned then the
    scheduler goes into a corner and sulks leaving the idle processors idle.

    For example, if processor 0 is busy running four tasks pinned via taskset,
    processor 1 has none, and someone has just started two processes on
    processor 2, then the scheduler will not move one of the two processes away
    from processor 2.

    This patch fixes that issue by forcing the scheduler to come out of its
    corner and retrying the load balancing by considering other processors for
    load balancing.

    This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

    I have removed extraneous material and gone back to equipping struct rq
    with the cpu the queue is associated with since this makes the patch much
    easier and it is likely that others in the future will have the same
    difficulty of figuring out which processor owns which runqueue.

    The overhead added by these patches is a single word on the stack if
    the kernel is configured to support 32 cpus or fewer (32 bit). For 32 bit
    environments the maximum number of cpus that can be configured is 255, which
    would result in the use of 32 additional bytes on the stack. On IA64 up to
    1k cpus can be configured, which will result in the use of 128 additional
    bytes on the stack. The maximum additional cache footprint is one
    cacheline. Typically memory use will be much less than a cacheline and the
    additional cpumask will be placed on the stack in a cacheline that already
    contains other local variables.

    Signed-off-by: Christoph Lameter
    Cc: John Hawkes
    Cc: "Siddha, Suresh B"
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Peter Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
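
The retry idea can be sketched as a loop over a shrinking candidate mask: pick the busiest cpu, and if everything on it is pinned, drop it from the mask and look again. This is a toy model under assumptions, not the scheduler code: `struct rq_sketch`, `pick_source_cpu()` and `cpumask_bytes()` are hypothetical names, load is just a task count, and the mask is a single 32-bit word (so at most 32 cpus).

```c
#include <stdint.h>

/* Toy run queue; it records its own cpu, as the description suggests. */
struct rq_sketch {
    int cpu;
    unsigned int nr_running;   /* tasks on this queue */
    unsigned int nr_pinned;    /* of those, tasks that cannot be moved */
};

/* Busiest candidate that is busier than this_cpu, or -1 if none. */
static int find_busiest(const struct rq_sketch *rqs, int nr_cpus,
                        uint32_t candidates, int this_cpu)
{
    unsigned int best_load = rqs[this_cpu].nr_running;
    int best = -1;

    for (int i = 0; i < nr_cpus; i++) {
        if (i == this_cpu || !(candidates & (UINT32_C(1) << i)))
            continue;
        if (rqs[i].nr_running > best_load) {
            best = i;
            best_load = rqs[i].nr_running;
        }
    }
    return best;
}

/* Cpu to pull a task from, or -1. Retries when the busiest cpu is all pinned. */
int pick_source_cpu(const struct rq_sketch *rqs, int nr_cpus, int this_cpu)
{
    uint32_t candidates = nr_cpus >= 32 ? UINT32_MAX
                                        : (UINT32_C(1) << nr_cpus) - 1;

    for (;;) {
        int busiest = find_busiest(rqs, nr_cpus, candidates, this_cpu);

        if (busiest < 0)
            return -1;
        if (rqs[busiest].nr_pinned < rqs[busiest].nr_running)
            return rqs[busiest].cpu;
        candidates &= ~(UINT32_C(1) << busiest);   /* exclude it and retry */
    }
}

/* Stack cost of one cpumask: whole machine words, rounded up. */
unsigned int cpumask_bytes(unsigned int nr_cpus, unsigned int long_bytes)
{
    unsigned int bits_per_long = long_bytes * 8;

    return (nr_cpus + bits_per_long - 1) / bits_per_long * long_bytes;
}
```

With the example from the description (four pinned tasks on processor 0, none on processor 1, two movable tasks on processor 2), balancing on behalf of processor 1 first rejects processor 0 and then settles on processor 2. The size helper reproduces the quoted stack costs: 4 bytes for 32 cpus and 32 bytes for 255 cpus with 4-byte words, 128 bytes for 1024 cpus with 8-byte words.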
     
  • Current gcc generates calls, not jumps, to noreturn functions. When that
    happens the return address can point to the next function, which confuses
    the unwinder.

    This patch works around it by marking asynchronous exception frames, in
    contrast to normal call frames, in the unwind information, and then
    teaching the unwinder to decode this.

    For normal call frames the unwinder now subtracts one from the address,
    which avoids this problem. The standard libgcc unwinder uses the same trick.

    It doesn't include adjustment of the printed address (i.e. for the original
    example, it'd still be kernel_math_error+0 that gets displayed), but the
    unwinder wouldn't get confused anymore.

    This only works with binutils 2.17+ and some versions of H.J. Lu's 2.16
    releases, unfortunately, because earlier binutils don't support
    .cfi_signal_frame.

    [AK: added automatic detection of the new binutils and wrote description]

    Signed-off-by: Jan Beulich
    Signed-off-by: Andi Kleen

    Jan Beulich
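
The subtract-one trick can be shown with a tiny symbol lookup. In this sketch the table, the addresses and the names `struct sym_sketch`, `lookup_symbol()` and `frame_function()` are all hypothetical; it only models the choice of which address to look up, not real unwind data.

```c
#include <stddef.h>

/* Hypothetical symbol table entry covering addresses [start, end). */
struct sym_sketch {
    unsigned long start;
    unsigned long end;
    const char *name;
};

const char *lookup_symbol(const struct sym_sketch *tab, size_t n,
                          unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tab[i].start && addr < tab[i].end)
            return tab[i].name;
    return NULL;
}

/*
 * Function a frame belongs to. For a normal call frame the return address
 * is the instruction after the call, which may already lie in the next
 * function when the callee is noreturn, so look up (address - 1) instead.
 * For a signal or exception frame the address is the interrupted
 * instruction itself and is used unchanged.
 */
const char *frame_function(const struct sym_sketch *tab, size_t n,
                           unsigned long ret_addr, int is_signal_frame)
{
    unsigned long pc = is_signal_frame ? ret_addr : ret_addr - 1;

    return lookup_symbol(tab, n, pc);
}
```

If a function ends with a call to a noreturn function, its return address equals the start of the following function; the adjusted lookup still attributes the frame to the caller.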
     
  • GCC emits a call to a __stack_chk_fail() function when the stack canary is
    not matching the expected value.

    Since this is a serious security issue, let's panic the kernel rather than
    limping along; the kernel really can't be trusted anymore when this happens.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andi Kleen
    CC: Andi Kleen

    Arjan van de Ven
     
  • This patch adds the per thread cookie field to the task struct and the PDA.
    Also it makes sure that the PDA value gets the new cookie value at context
    switch, and that a new task gets a new cookie at task creation time.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andi Kleen
    CC: Andi Kleen

    Arjan van de Ven
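
The per thread cookie handling in the two stack protector commits above can be modelled in a few lines. This is a toy model with hypothetical names (`struct task_sketch`, `struct pda_sketch`, `task_set_cookie()`, `context_switch_sketch()`, `canary_intact()`); the real code picks the cookie randomly at task creation and the check is emitted by the compiler, neither of which is reproduced here.

```c
/* Toy task and per-cpu data, each holding one cookie. */
struct task_sketch {
    unsigned long stack_canary;
};

struct pda_sketch {
    unsigned long stack_canary;   /* cookie of the task currently running */
};

/* At task creation: give the new task its own cookie (value from the caller). */
void task_set_cookie(struct task_sketch *t, unsigned long cookie)
{
    t->stack_canary = cookie;
}

/* At context switch: the per-cpu copy must follow the incoming task. */
void context_switch_sketch(struct pda_sketch *pda, const struct task_sketch *next)
{
    pda->stack_canary = next->stack_canary;
}

/*
 * At function exit: compare the value found on the stack with the current
 * cookie. A mismatch is the case in which __stack_chk_fail() is called and,
 * per the description above, the kernel panics.
 */
int canary_intact(const struct pda_sketch *pda, unsigned long value_on_stack)
{
    return pda->stack_canary == value_on_stack;
}
```

Because each task has its own cookie, a value leaked from one task does not help to forge a valid canary in another.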
     
  • The new dwarf2 unwinder needs to take locks to do backtraces
    inside modules. This patch makes sure lockdep which calls
    stacktrace is not reentered.

    Thanks to Ingo for suggesting this simpler approach.

    Cc: mingo@elte.hu
    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • - Remove unused all_contexts parameter
    No caller used it
    - Move skip argument into the structure (needed for
    followon patches)

    Cc: mingo@elte.hu

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • This ports the algorithm from x86-64 (with improvements) to i386.
    Previously this only worked for frame pointer enabled kernels.
    But spinlocks have a very simple stack frame that can be manually
    analyzed. Do this.

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • For NUMA optimization and some other algorithms it is useful to have a fast
    way to get the current CPU and node numbers in user space.

    x86-64 added a fast way to do this in a vsyscall. This adds a generic
    syscall for other architectures to make it a generic portable facility.

    I expect some of them will also implement it as a faster vsyscall.

    The cache is an optimization for the x86-64 vsyscall. Since what the
    syscall returns is an approximation anyway and user space often wants
    very fast results, it can be cached for some time. The normal methods to
    get this information in user space are relatively slow.

    The vsyscall is in a better position to manage the cache because it has direct
    access to a fast time stamp (jiffies). For the generic syscall the cache
    doesn't help much, but a valid argument is still enforced to keep programs
    portable.

    I only added an i386 syscall entry for now. Other architectures can follow
    as needed.

    AK: Also added some cleanups from Andrew Morton

    Signed-off-by: Andi Kleen

    Andi Kleen
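
From user space, the generic syscall can be reached through the syscall(2) wrapper on Linux systems whose headers define SYS_getcpu. The helper below is a minimal sketch under that assumption; the name `current_cpu_and_node()` is hypothetical, and NULL is passed for the cache argument so that no cache is used.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Ask the kernel which CPU and NUMA node the caller is running on.
 * Returns 0 on success, -1 on failure (errno set by syscall()).
 * The answer is only a snapshot: the thread may migrate right afterwards.
 */
int current_cpu_and_node(unsigned int *cpu, unsigned int *node)
{
    return (int)syscall(SYS_getcpu, cpu, node, NULL);
}
```

Since the result is an approximation by nature, callers would typically use it as a hint, for example to pick a per-node memory pool, rather than as a guarantee.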
     
  • To quote Alan Cox:

    The default Linux behaviour on an NMI of either memory or unknown is to
    continue operation. For many environments such as scientific computing
    it is preferable that the box is taken out and the error dealt with than
    an uncorrected parity/ECC error get propagated.

    A small number of systems do generate NMI's for bizarre random reasons
    such as power management so the default is unchanged. In other respects
    the new /proc/sys entry works like the existing panic controls already in
    that directory.

    This is separate to the edac support - EDAC allows supported chipsets to
    handle ECC errors well, this change allows unsupported cases to at least
    panic rather than cause problems further down the line.

    Signed-off-by: Don Zickus
    Signed-off-by: Andi Kleen

    Don Zickus
     
  • Adds a new /proc/sys/kernel/nmi call that will enable/disable the nmi
    watchdog.

    Signed-off-by: Don Zickus
    Signed-off-by: Andi Kleen

    Don Zickus
     
  • Removes the un/set_nmi_callback and reserve/release_lapic_nmi functions as
    they are no longer needed. The various subsystems are modified to register
    with the die_notifier instead.

    Also includes compile fixes by Andrew Morton.

    Signed-off-by: Don Zickus
    Signed-off-by: Andi Kleen

    Don Zickus
     
  • Remove the new suspend_prepare() phase. It doesn't seem very usable,
    has never been tested, doesn't address fault cleanup, and would need
    a sibling resume_complete(); plus there are no real use cases. It
    could be restored later if those issues get resolved.

    Signed-off-by: David Brownell
    Cc: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Brownell
     
  • Add a new PM_SYSFS_DEPRECATED config option to control whether or
    not the /sys/devices/.../power/state files are provided. This will
    make it easier to get rid of that mechanism when the time comes,
    and to verify that userspace tools work right without it.

    Signed-off-by: David Brownell
    Acked-by: Pavel Machek
    Signed-off-by: Greg Kroah-Hartman

    David Brownell
     
  • This patch is the first of this series that should actually change any
    behavior ... by issuing the new event, now that the rest of the kernel is
    prepared to receive it.

    This converts the PM core to issue the new PRETHAW message, which the rest of
    the kernel is now ready to receive.

    Signed-off-by: David Brownell
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    David Brownell
     
  • Allow devices to participate in the suspend process more intimately,
    in particular, allow the final phase (with interrupts disabled) to
    also be open to normal devices, not just system devices.

    Also, allow classes to participate in device suspend.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds