27 Feb, 2010

7 commits

  • There are some dependencies between devices (in particular, between
    EHCI USB controllers and their OHCI/UHCI siblings) which are not
    reflected by the structure of the device tree. With synchronous
    suspend and resume these dependencies are taken into account
    automatically, because the devices in question are always registered
    in the right order, but to meet these constraints with asynchronous
    suspend and resume the drivers of these devices will need to use
    dpm_wait() in their suspend/resume routines, so introduce a helper
    function allowing them to do that.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • It has been shown by testing that total device resume time can be
    reduced significantly (by as much as 50% or more) if the async
    threads executing some devices' resume routines are all started
    before the main resume thread starts to handle the "synchronous"
    devices.

    This is a consequence of the fact that the slowest devices tend to be
    located at the end of dpm_list, so their resume routines are started
    very late. Consequently, they have to wait for all the preceding
    "synchronous" devices before their resume routines can be started
    by the main resume thread, even if they are "asynchronous". By
    starting their async threads upfront we effectively move those
    devices towards the beginning of dpm_list, without breaking their
    ordering with respect to their parents and children. As a result,
    their resume routines are started much earlier and we are able to
    save much more device resume time this way.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
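
The effect described in this commit can be illustrated with a small timing model. This is a sketch, not kernel code. It assumes the devices are independent of each other (no parent/child waits), that async threads never compete for CPUs, and that every resume routine has a fixed duration. The function name `total_resume_time` and the sample durations are made up for illustration.

```python
def total_resume_time(devices, start_async_upfront):
    """Return the modelled wall-clock time needed to resume all devices.

    devices is a list of (duration, is_async) tuples in dpm_list order.
    """
    main_clock = 0.0      # time consumed by the main resume thread so far
    last_async_end = 0.0  # latest finish time among the async threads
    for duration, is_async in devices:
        if is_async:
            # Upfront: every async thread starts at time 0.
            # Otherwise: it starts when the main thread reaches the device.
            start = 0.0 if start_async_upfront else main_clock
            last_async_end = max(last_async_end, start + duration)
        else:
            main_clock += duration
    return max(main_clock, last_async_end)


# Three 1-second "synchronous" devices followed by one slow async device
# located at the end of the list (hypothetical numbers).
DPM_LIST = [(1.0, False), (1.0, False), (1.0, False), (3.0, True)]
```

With these sample numbers the model yields 6.0 when the async thread is started only once the main thread reaches the device, and 3.0 when it is started upfront, which corresponds to the 50% case mentioned in the commit message.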
     
  • Add configuration switch CONFIG_PM_ADVANCED_DEBUG for compiling in
    extra PM debugging/testing code allowing one to access some
    PM-related attributes of devices from the user space via sysfs.

    If CONFIG_PM_ADVANCED_DEBUG is set, add sysfs attribute power/async
    for every device allowing the user space to access the device's
    power.async_suspend flag and modify it, if desired.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Add sysfs attribute /sys/power/pm_async allowing the user space to
    disable/enable asynchronous suspend/resume of devices.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Theoretically, the total time of system sleep transitions (suspend
    to RAM, hibernation) can be reduced by running suspend and resume
    callbacks of device drivers in parallel with each other. However,
    there are dependencies between devices such that we're not allowed
    to suspend the parent of a device before suspending the device
    itself. Analogously, we're not allowed to resume a device before
    resuming its parent.

    The most straightforward way to take these dependencies into account
    is to start the async threads used for suspending and resuming
    devices at the core level, so that async_schedule() is called for
    each suspend and resume callback supposed to be executed
    asynchronously.

    For this purpose, introduce a new device flag, power.async_suspend,
    used to mark the devices whose suspend and resume callbacks are to be
    executed asynchronously (i.e. in parallel with the main suspend/resume
    thread and possibly in parallel with each other) and helper function
    device_enable_async_suspend() allowing one to set power.async_suspend
    for given device (power.async_suspend is unset by default for all
    devices). For each device with the power.async_suspend flag set the
    PM core will use async_schedule() to execute its suspend and resume
    callbacks.

    The async threads started for different devices as a result of
    calling async_schedule() are synchronized with each other and with
    the main suspend/resume thread with the help of completions, in the
    following way:
    (1) There is a completion, power.completion, for each device object.
    (2) Each device's completion is reset before calling async_schedule()
    for the device or, in the case of devices with the
    power.async_suspend flags unset, before executing the device's
    suspend and resume callbacks.
    (3) During suspend, right before running the bus type, device type
    and device class suspend callbacks for the device, the PM core
    waits for the completions of all the device's children to be
    completed.
    (4) During resume, right before running the bus type, device type and
    device class resume callbacks for the device, the PM core waits
    for the completion of the device's parent to be completed.
    (5) The PM core completes power.completion for each device right
    after the bus type, device type and device class suspend (or
    resume) callbacks executed for the device have returned.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
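
The synchronization rules (1) to (5) above can be modelled in user space. The sketch below is only an analogy: `threading.Event` stands in for power.completion, a Python thread stands in for an async_schedule() call, and appending the device name to a log stands in for running the callbacks. The class and function names are hypothetical, and the model assumes that the list passed in places every parent before its children.

```python
import threading


class Device:
    def __init__(self, name, parent=None, async_suspend=False):
        self.name = name
        self.parent = parent
        self.children = []
        self.async_suspend = async_suspend   # models power.async_suspend
        self.completion = threading.Event()  # models power.completion, rule (1)
        if parent is not None:
            parent.children.append(self)


def _run(dev, wait_for, log, lock):
    for other in wait_for:       # rules (3) and (4): wait for dependencies
        other.completion.wait()
    with lock:
        log.append(dev.name)     # stands in for the suspend/resume callbacks
    dev.completion.set()         # rule (5): complete after the callbacks


def _transition(order, deps_of):
    log, lock, threads = [], threading.Lock(), []
    for dev in order:
        dev.completion.clear()   # rule (2): reset before scheduling/running
        args = (dev, deps_of(dev), log, lock)
        if dev.async_suspend:
            thread = threading.Thread(target=_run, args=args)
            thread.start()
            threads.append(thread)
        else:
            _run(*args)          # "synchronous" device: main thread runs it
    for thread in threads:
        thread.join()
    return log


def suspend_all(dpm_list):
    # Children go first, so walk the list backwards and wait for children.
    return _transition(list(reversed(dpm_list)), lambda d: d.children)


def resume_all(dpm_list):
    # Parents go first; each device waits only for its own parent.
    return _transition(dpm_list, lambda d: [d.parent] if d.parent else [])
```

The order in which unrelated devices appear in the returned log may vary between runs, but a child always appears before its parent after `suspend_all()` and after its parent after `resume_all()`.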
     
  • Add parent information to the messages printed by the suspend/resume
    core when initcall_debug is set.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Add new device sysfs attribute, power/control, allowing the user
    space to block the run-time power management of the devices. If this
    attribute is set to "on", the driver of the device won't be able to power
    manage it at run time (without breaking the rules) and the device will
    always be in the full power state (except when the entire system goes
    into a sleep state).

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Alan Stern

    Rafael J. Wysocki
     

17 Feb, 2010

1 commit


21 Jan, 2010

2 commits

  • This reverts commit 8ff410daa009c4b44be445ded5b0cec00abc0426

    It should not have been sent to Linus's tree yet, as it depends
    on changes that are queued up in my driver-core for the .34 kernel
    merge.

    Cc: Wu Fengguang
    Cc: Andi Kleen
    Cc: "Zheng, Shaohui"
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • On Mon, Jan 18, 2010 at 05:26:20PM +0530, Sachin Sant wrote:
    > Hello Heiko,
    >
    > Today while trying to boot next-20100118 i came across
    > the following Oops :
    >
    > Brought up 4 CPUs
    > Unable to handle kernel pointer dereference at virtual kernel address 0000000000543000
    > Oops: 0004 #1 SMP
    > Modules linked in:
    > CPU: 0 Not tainted 2.6.33-rc4-autotest-next-20100118-5-default #1
    > Process swapper (pid: 1, task: 00000000fd792038, ksp: 00000000fd797a30)
    > Krnl PSW : 0704200180000000 00000000001eb0b8 (shmem_parse_options+0xc0/0x328)
    > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
    > Krnl GPRS: 000000000054388a 000000000000003d 0000000000543836 000000000000003d
    > 0000000000000000 0000000000483f28 0000000000536112 00000000fd797d00
    > 00000000fd4ba100 0000000000000100 0000000000483978 0000000000543832
    > 0000000000000000 0000000000465958 00000000001eb0b0 00000000fd797c58
    > Krnl Code: 00000000001eb0aa: c0e5000994f1 brasl %r14,31da8c
    > 00000000001eb0b0: b9020022 ltgr %r2,%r2
    > 00000000001eb0b4: a784010b brc 8,1eb2ca
    > >00000000001eb0b8: 92002000 mvi 0(%r2),0
    > 00000000001eb0bc: a7080000 lhi %r0,0
    > 00000000001eb0c0: 41902001 la %r9,1(%r2)
    > 00000000001eb0c4: b9040016 lgr %r1,%r6
    > 00000000001eb0c8: b904002b lgr %r2,%r11
    > Call Trace:
    > ( 0xfd797c50)
    > shmem_fill_super+0x13a/0x25c
    > get_sb_single+0xbe/0xdc
    > dev_get_sb+0x2c/0x38
    > devtmpfs_init+0x46/0xc0
    > driver_init+0x22/0x60
    > kernel_init+0x24e/0x3d0
    > kernel_thread_starter+0x6/0xc
    > kernel_thread_starter+0x0/0xc
    >
    > I never tried to boot a kernel with DEVTMPFS enabled on a s390 box.
    > So am wondering if this is supported or not ? If you think this
    > is supported i will send a mail to community on this.

    There is nothing arch specific to devtmpfs. This part crashes because the
    kernel tries to modify the read-only data section, which is write protected
    on s390.

    Signed-off-by: Heiko Carstens
    Acked-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Heiko Carstens
     

17 Jan, 2010

2 commits

  • The function prototypes mismatch in this call stack:

    [] print_block_size+0x58/0x60
    [] sysdev_class_show+0x1f/0x30
    [] sysfs_read_file+0xcb/0x1f0
    [] vfs_read+0xc8/0x180

    Due to prototype mismatch, print_block_size() will sprintf() into
    *attribute instead of *buf, hence user space will read the initial
    zeros from *buf:
    $ hexdump /sys/devices/system/memory/block_size_bytes
    0000000 0000 0000 0000 0000
    0000008

    After patch:
    $ cat /sys/devices/system/memory/block_size_bytes
    0x8000000

    This complements commits c29af9636 and 4a0b2b4dbe.

    Signed-off-by: Wu Fengguang
    Cc: Andi Kleen
    Cc: Greg Kroah-Hartman
    Cc: "Zheng, Shaohui"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
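
The failure mode can be imitated in a few lines. Python cannot reproduce C's silent argument misbinding, so in this sketch the old callback takes a catch-all `*_unused` parameter to emulate a two-argument prototype being called with three arguments. The `bytearray` objects stand in for raw memory, the `"memory"` string is a placeholder for the class pointer, and the `_old`/`_new` suffixes are added here for illustration only.

```python
BLOCK_SIZE = 0x8000000


def sysdev_class_show(show, class_attr, buf):
    # The caller always passes three arguments: class, attribute, buffer.
    return show("memory", class_attr, buf)


def _format_into(target):
    text = ("%#x\n" % BLOCK_SIZE).encode()
    target[:len(text)] = text
    return len(text)


def print_block_size_old(cls, buf, *_unused):
    # Old two-argument prototype: the second argument is taken to be the
    # output buffer, but the caller actually put the attribute there.
    return _format_into(buf)


def print_block_size_new(cls, class_attr, buf):
    # Matching prototype: the third argument is the real output buffer.
    return _format_into(buf)
```

Calling `sysdev_class_show()` with the old callback leaves the real buffer zero-filled and overwrites the attribute instead, which is the model's counterpart of the hexdump output shown in the commit message.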
     
  • Signed-off-by: Wu Fengguang
    Cc: Andi Kleen
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

12 Jan, 2010

1 commit

  • Warning(drivers/base/power/main.c:453): No description found for parameter 'dev'
    Warning(drivers/base/power/main.c:453): No description found for parameter 'cb'
    Warning(drivers/base/power/main.c:719): No description found for parameter 'dev'
    Warning(drivers/base/power/main.c:719): No description found for parameter 'state'
    Warning(drivers/base/power/main.c:719): No description found for parameter 'cb'

    Signed-off-by: Randy Dunlap
    Cc: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

24 Dec, 2009

8 commits


23 Dec, 2009

2 commits


21 Dec, 2009

1 commit

  • This patch (as1317) fixes a bug in the PM core. When a device is
    resumed following a system sleep, the core decrements the device's
    runtime PM usage counter but doesn't issue an idle notification if the
    counter reaches 0. This could prevent an otherwise unused device from
    being runtime-suspended again after the system sleep.

    The fix is to call pm_runtime_put_sync() instead of
    pm_runtime_put_noidle().

    Signed-off-by: Alan Stern
    Signed-off-by: Rafael J. Wysocki

    Alan Stern
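
The difference between the two calls can be sketched with a toy usage counter. This is a simplified model under stated assumptions, not the kernel implementation: the class name `RuntimePmDevice` and the `idle_notifications` field are hypothetical, and an idle notification is reduced to incrementing a counter.

```python
class RuntimePmDevice:
    def __init__(self, usage_count=0):
        self.usage_count = usage_count
        self.idle_notifications = 0  # how many idle notifications were issued

    def pm_runtime_put_noidle(self):
        # Drops the reference but never tells anyone the device became idle.
        self.usage_count -= 1

    def pm_runtime_put_sync(self):
        # Drops the reference and issues an idle notification at zero.
        self.usage_count -= 1
        if self.usage_count == 0:
            self.idle_notifications += 1
```

In this model a device whose counter drops to 0 through `pm_runtime_put_noidle()` never receives an idle notification, so nothing would trigger a later runtime suspend, while `pm_runtime_put_sync()` issues one.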
     

18 Dec, 2009

3 commits

  • Memory balloon drivers can allocate a large amount of memory which is not
    movable but could be freed to accommodate memory hotplug remove.

    Prior to calling the memory hotplug notifier chain the memory in the
    pageblock is isolated. Currently, if the migrate type is not
    MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
    for that page range to fail.

    Rather than failing pageblock isolation if the migrate type is not
    MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock,
    and not on the LRU, are owned by a registered balloon driver (or other
    entity) using a notifier chain. If all of the non-movable pages are owned
    by a balloon, they can be freed later through the memory notifier chain
    and the range can still be isolated in set_migratetype_isolate().

    Signed-off-by: Robert Jennings
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: Brian King
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Robert Jennings
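
The decision described above can be sketched as a counting check. This is a loose model, not the code in set_migratetype_isolate(): pages are plain dictionaries, every page in the model is assumed to be in use (free pages are not modelled), and each entry in the notifier chain is a hypothetical callable that returns how many pages of the block it owns.

```python
def balloon_notifier(pages):
    # A registered balloon driver reports the pages it owns in this block.
    return sum(1 for page in pages if page.get("owner") == "balloon")


def can_isolate_pageblock(pages, migratetype, isolate_notifiers):
    if migratetype == "MIGRATE_MOVABLE":
        return True
    # Count the pages that are not on the LRU ...
    not_on_lru = sum(1 for page in pages if not page["on_lru"])
    # ... and ask the notifier chain how many of them have a known owner.
    claimed = sum(notifier(pages) for notifier in isolate_notifiers)
    # Isolation may proceed only if every such page is accounted for.
    return claimed == not_on_lru
```

A block containing one LRU page and two balloon-owned pages passes the check, while the same block with one additional page owned by something else does not.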
     
  • Measure and print the time of suspending and resuming all devices.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Commit f2511774863487e61b56a97da07ebf8dd61d7836
    (PM: Add initcall_debug style timing for suspend/resume) introduced
    basic timing instrumentation, needed for a scripts/bootgraph.pl
    equivalent or humans, but it missed the fact that bus types and
    device classes which haven't been switched to using struct dev_pm_ops
    objects yet need special handling. As a result, the suspend/resume
    timing information is only available for devices whose bus types or
    device classes use struct dev_pm_ops objects, so the majority of
    devices is not covered.

    Fix this by adding basic suspend/resume timing instrumentation for
    devices whose bus types and device classes still don't use struct
    dev_pm_ops objects for power management. To reduce code duplication
    move the timing code to helper functions.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

17 Dec, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds
     

16 Dec, 2009

12 commits

  • This is a simpler, gentler variant of memory_failure() for soft page
    offlining controlled from user space. It doesn't kill anything, just
    tries to invalidate and if that doesn't work migrate the
    page away.

    This is useful for predictive failure analysis, where a page has
    a high rate of corrected errors, but hasn't gone bad yet. Instead
    it can be offlined early and avoided.

    The offlining is controlled from sysfs, including a new generic
    entry point for hard page offlining for symmetry too.

    We use the page isolate facility to prevent re-allocation
    race. Normally this is only used by memory hotplug. To avoid
    races with memory allocation I am using lock_system_sleep().
    This avoids the situation where memory hotplug is about
    to isolate a page range and then hwpoison undoes that work.
    This is a big hammer, but it is currently the simplest
    solution.

    When the page is not free or LRU we try to free pages
    from slab and other caches. The slab freeing is currently
    quite dumb and does not try to focus on the specific slab
    cache which might own the page. This could be potentially
    improved later.

    Thanks to Fengguang Wu and Haicheng Li for some fixes.

    [Added fix from Andrew Morton to adapt to new migrate_pages prototype]
    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • It is not necessary to include into
    drivers/base/power/main.c, so don't do that.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • In device_resume_noirq() there is the 'End' label and the associated
    goto statement that aren't strictly necessary, so rework the code to
    get rid of them. Also modify device_suspend_noirq() so that it looks
    completely analogous to device_resume_noirq().

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • In order to diagnose overall suspend/resume times, we need
    basic instrumentation to break down the total time into per
    device timing, similar to initcall_debug.

    This patch adds the basic timing instrumentation, needed
    for a scripts/bootgraph.pl equivalent or humans.
    The bootgraph.pl program is still a work in progress, but
    is far enough along to know that this patch is sufficient.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Rafael J. Wysocki

    Arjan van de Ven
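
The idea of per-device timing can be sketched as a wrapper that records how long each callback takes. This is an illustration only: `timed_call`, the record layout, and the microsecond unit are assumptions made for the example and do not describe the format the kernel actually prints.

```python
import time


def timed_call(name, callback, records):
    """Run callback() and append (name, result, elapsed_usecs) to records."""
    start = time.monotonic()
    result = callback()
    elapsed_usecs = int((time.monotonic() - start) * 1_000_000)
    records.append((name, result, elapsed_usecs))
    return result
```

Wrapping every device callback this way yields one record per device, which is the kind of breakdown a graphing script or a human reader needs in order to see which device dominates the total time.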
     
  • This patch (as1308c) fixes __pm_runtime_get(). Currently the routine
    will resume a device if the prior usage count was 0. But this isn't
    right; thanks to pm_runtime_get_noresume() the usage count can be
    positive even while the device is suspended.

    Signed-off-by: Alan Stern
    Signed-off-by: Rafael J. Wysocki

    Alan Stern
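
The problem can be shown with a toy model. The commit message does not spell out the replacement logic, so `get_fixed` below is an assumption: it bases the decision on the device's state rather than on the prior counter value. The class name `ModelDevice` and the `suspended` field are likewise hypothetical.

```python
class ModelDevice:
    def __init__(self):
        self.usage_count = 0
        self.suspended = True

    def pm_runtime_get_noresume(self):
        # Takes a reference without touching the device's power state.
        self.usage_count += 1

    def get_old(self):
        # Old behaviour: resume only if the prior usage count was 0.
        prior = self.usage_count
        self.usage_count += 1
        if prior == 0:
            self.suspended = False

    def get_fixed(self):
        # Assumed fix: resume whenever the device is actually suspended.
        self.usage_count += 1
        if self.suspended:
            self.suspended = False
```

After `pm_runtime_get_noresume()` the counter is positive while the device is still suspended, so `get_old()` leaves it suspended whereas `get_fixed()` resumes it.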
     
  • Nodemasks should not be allocated on the stack for large systems (when they
    are larger than 256 bytes) since there is a threat of overflow.

    This patch causes the unregister_mem_sect_under_nodes() nodemask to be
    allocated on the stack for smaller systems and be allocated by slab for
    larger systems.

    GFP_KERNEL is used since remove_memory_block() can block.

    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Alex Chiang
    Signed-off-by: David Rientjes
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • You can discover which CPUs belong to a NUMA node by examining
    /sys/devices/system/node/node#/

    However, it's not convenient to go in the other direction, when looking at
    /sys/devices/system/cpu/cpu#/

    Yes, you can muck about in sysfs, but adding these symlinks makes life a
    lot more convenient.

    Signed-off-by: Alex Chiang
    Acked-by: David Rientjes
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • By returning early if the node is not online, we can unindent the
    interesting code by two levels.

    No functional change.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • By returning early if the node is not online, we can unindent the
    interesting code by one level.

    No functional change.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • Commit c04fc586c (mm: show node to memory section relationship with
    symlinks in sysfs) created symlinks from nodes to memory sections, e.g.

    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135

    If you're examining the memory section though and are wondering what node
    it might belong to, you can find it by grovelling around in sysfs, but
    it's a little cumbersome.

    Add a reverse symlink for each memory section that points back to the
    node to which it belongs.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Acked-by: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
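
The resulting layout can be reproduced in a scratch directory. The sketch below builds the forward link quoted above plus a reverse link, using ordinary symlinks instead of sysfs. The name of the reverse link (`node1` inside the memory section directory) is an assumption, since the commit message does not state it, and `build_links` is a hypothetical helper.

```python
import os
import tempfile


def build_links(root, node, section):
    """Create node -> memory section and memory section -> node symlinks."""
    node_dir = os.path.join(root, "node", node)
    mem_dir = os.path.join(root, "memory", section)
    os.makedirs(node_dir)
    os.makedirs(mem_dir)
    # Forward link: node/node1/memory135 -> ../../memory/memory135
    forward = os.path.join(node_dir, section)
    os.symlink(os.path.join("..", "..", "memory", section), forward)
    # Reverse link (assumed name): memory/memory135/node1 -> ../../node/node1
    reverse = os.path.join(mem_dir, node)
    os.symlink(os.path.join("..", "..", "node", node), reverse)
    return forward, reverse


scratch = tempfile.mkdtemp()
forward_link, reverse_link = build_links(scratch, "node1", "memory135")
```

With both links in place, reading the reverse link from the memory section directory answers the "which node does this belong to" question in one step.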
     
  • Offload the registration and unregistration of per node hstate sysfs
    attributes to a worker thread rather than attempt the
    allocation/attachment or detachment/freeing of the attributes in the
    context of the memory hotplug handler.

    I don't know that this is absolutely required, but the registration can
    sleep in allocations and other mem hot plug handlers do it this way. If
    it turns out this is NOT required, we can drop this patch.

    N.B., Only tested build, boot, libhugetlbfs regression.
    i.e., no memory hotplug testing.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Register per node hstate attributes only for nodes with memory. As
    suggested by David Rientjes.

    With Memory Hotplug, memory can be added to a memoryless node and a node
    with memory can become memoryless. Therefore, add a memory on/off-line
    notifier callback to [un]register a node's attributes on transition
    to/from memoryless state.

    N.B., Only tested build, boot, libhugetlbfs regression.
    i.e., no memory hotplug testing.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn