Commit b79995700e25dd6b0b0aff7edd0c102d1b6281f7

Authored by Rafael J. Wysocki
Committed by Jesse Barnes
1 parent 5030718ee4

PM/PCI: Update PCI power management documentation

The PCI power management document, Documentation/power/pci.txt, is
outdated and partially inaccurate.  It also is missing some important
information about the power management of PCI devices.  Rewrite it to
make it more up to date and more complete.

Reviewed-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

Showing 1 changed file with 992 additions and 266 deletions

Documentation/power/pci.txt
1   -
2 1 PCI Power Management
3   -~~~~~~~~~~~~~~~~~~~~
4 2  
5   -An overview of the concepts and the related functions in the Linux kernel
  3 +Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
6 4  
7   -Patrick Mochel <mochel@transmeta.com>
8   -(and others)
  5 +An overview of concepts and the Linux kernel's interfaces related to PCI power
  6 +management. Based on previous work by Patrick Mochel <mochel@transmeta.com>
  7 +(and others).
9 8  
  9 +This document only covers the aspects of power management specific to PCI
  10 +devices. For general description of the kernel's interfaces related to device
  11 +power management refer to Documentation/power/devices.txt and
  12 +Documentation/power/runtime_pm.txt.
  13 +
10 14 ---------------------------------------------------------------------------
11 15  
12   -1. Overview
13   -2. How the PCI Subsystem Does Power Management
14   -3. PCI Utility Functions
15   -4. PCI Device Drivers
16   -5. Resources
  16 +1. Hardware and Platform Support for PCI Power Management
  17 +2. PCI Subsystem and Device Power Management
  18 +3. PCI Device Drivers and Power Management
  19 +4. Resources
17 20  
18   -1. Overview
19   -~~~~~~~~~~~
20 21  
21   -The PCI Power Management Specification was introduced between the PCI 2.1 and
22   -PCI 2.2 Specifications. It a standard interface for controlling various
23   -power management operations.
  22 +1. Hardware and Platform Support for PCI Power Management
  23 +=========================================================
24 24  
25   -Implementation of the PCI PM Spec is optional, as are several sub-components of
26   -it. If a device supports the PCI PM Spec, the device will have an 8 byte
27   -capability field in its PCI configuration space. This field is used to describe
28   -and control the standard PCI power management features.
  25 +1.1. Native and Platform-Based Power Management
  26 +-----------------------------------------------
  27 +In general, power management is a feature allowing one to save energy by putting
  28 +devices into states in which they draw less power (low-power states) at the
  29 +price of reduced functionality or performance.
29 30  
30   -The PCI PM spec defines 4 operating states for devices (D0 - D3) and for buses
31   -(B0 - B3). The higher the number, the less power the device consumes. However,
32   -the higher the number, the longer the latency is for the device to return to
33   -an operational state (D0).
  31 +Usually, a device is put into a low-power state when it is underutilized or
  32 +completely inactive. However, when it is necessary to use the device once
  33 +again, it has to be put back into the "fully functional" state (full-power
  34 +state). This may happen when there are some data for the device to handle or
  35 +as a result of an external event requiring the device to be active, which may
  36 +be signaled by the device itself.
34 37  
35   -There are actually two D3 states. When someone talks about D3, they usually
36   -mean D3hot, which corresponds to an ACPI D2 state (power is reduced, the
37   -device may lose some context). But they may also mean D3cold, which is an
38   -ACPI D3 state (power is fully off, all state was discarded); or both.
  38 +PCI devices may be put into low-power states in two ways, by using the device
  39 +capabilities introduced by the PCI Bus Power Management Interface Specification,
  40 +or with the help of platform firmware, such as an ACPI BIOS. In the first
  41 +approach, that is referred to as the native PCI power management (native PCI PM)
  42 +in what follows, the device power state is changed as a result of writing a
  43 +specific value into one of its standard configuration registers. The second
  44 +approach requires the platform firmware to provide special methods that may be
  45 +used by the kernel to change the device's power state.
39 46  
40   -Bus power management is not covered in this version of this document.
  47 +Devices supporting the native PCI PM usually can generate wakeup signals called
  48 +Power Management Events (PMEs) to let the kernel know about external events
  49 +requiring the device to be active. After receiving a PME the kernel is supposed
  50 +to put the device that sent it into the full-power state. However, the PCI Bus
  51 +Power Management Interface Specification doesn't define any standard method of
  52 +delivering the PME from the device to the CPU and the operating system kernel.
  53 +It is assumed that the platform firmware will perform this task and therefore,
  54 +even though a PCI device is set up to generate PMEs, it also may be necessary to
  55 +prepare the platform firmware for notifying the CPU of the PMEs coming from the
  56 +device (e.g. by generating interrupts).
41 57  
42   -Note that all PCI devices support D0 and D3cold by default, regardless of
43   -whether or not they implement any of the PCI PM spec.
  58 +In turn, if the methods provided by the platform firmware are used for changing
  59 +the power state of a device, usually the platform also provides a method for
  60 +preparing the device to generate wakeup signals. In that case, however, it
  61 +often also is necessary to prepare the device for generating PMEs using the
  62 +native PCI PM mechanism, because the method provided by the platform depends on
  63 +that.
44 64  
45   -The possible state transitions that a device can undergo are:
  65 +Thus in many situations both the native and the platform-based power management
  66 +mechanisms have to be used simultaneously to obtain the desired result.
46 67  
47   -+---------------------------+
48   -| Current State | New State |
49   -+---------------------------+
50   -| D0            | D1, D2, D3|
51   -+---------------------------+
52   -| D1            | D2, D3    |
53   -+---------------------------+
54   -| D2            | D3        |
55   -+---------------------------+
56   -| D1, D2, D3    | D0        |
57   -+---------------------------+
  68 +1.2. Native PCI Power Management
  69 +--------------------------------
  70 +The PCI Bus Power Management Interface Specification (PCI PM Spec) was
  71 +introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a
  72 +standard interface for performing various operations related to power
  73 +management.
58 74  
59   -Note that when the system is entering a global suspend state, all devices will
60   -be placed into D3 and when resuming, all devices will be placed into D0.
61   -However, when the system is running, other state transitions are possible.
  75 +The implementation of the PCI PM Spec is optional for conventional PCI devices,
  76 +but it is mandatory for PCI Express devices. If a device supports the PCI PM
  77 +Spec, it has an 8 byte power management capability field in its PCI
  78 +configuration space. This field is used to describe and control the standard
  79 +features related to the native PCI power management.
62 80  
63   -2. How The PCI Subsystem Handles Power Management
64   -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  81 +The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
  82 +(B0-B3). The higher the number, the less power is drawn by the device or bus
  83 +in that state. However, the higher the number, the longer the latency for
  84 +the device or bus to return to the full-power state (D0 or B0, respectively).
65 85  
66   -The PCI suspend/resume functionality is accessed indirectly via the Power
67   -Management subsystem. At boot, the PCI driver registers a power management
68   -callback with that layer. Upon entering a suspend state, the PM layer iterates
69   -through all of its registered callbacks. This currently takes place only during
70   -APM state transitions.
  86 +There are two variants of the D3 state defined by the specification. The first
  87 +one is D3hot, referred to as the software accessible D3, because devices can be
  88 +programmed to go into it. The second one, D3cold, is the state that PCI devices
  89 +are in when the supply voltage (Vcc) is removed from them. It is not possible
  90 +to program a PCI device to go into D3cold, although there may be a programmable
  91 +interface for putting the bus the device is on into a state in which Vcc is
  92 +removed from all devices on the bus.
71 93  
72   -Upon going to sleep, the PCI subsystem walks its device tree twice. Both times,
73   -it does a depth first walk of the device tree. The first walk saves each of the
74   -device's state and checks for devices that will prevent the system from entering
75   -a global power state. The next walk then places the devices in a low power
76   -state.
  94 +PCI bus power management, however, is not supported by the Linux kernel at the
  95 +time of this writing and therefore it is not covered by this document.
77 96  
78   -The first walk allows a graceful recovery in the event of a failure, since none
79   -of the devices have actually been powered down.
  97 +Note that every PCI device can be in the full-power state (D0) or in D3cold,
  98 +regardless of whether or not it implements the PCI PM Spec. In addition to
  99 +that, if the PCI PM Spec is implemented by the device, it must support D3hot
  100 +as well as D0. The support for the D1 and D2 power states is optional.
80 101  
81   -In both walks, in particular the second, all children of a bridge are touched
82   -before the actual bridge itself. This allows the bridge to retain power while
83   -its children are being accessed.
  102 +PCI devices supporting the PCI PM Spec can be programmed to go to any of the
  103 +supported low-power states (except for D3cold). While in D1-D3hot the
  104 +standard configuration registers of the device must be accessible to software
  105 +(i.e. the device is required to respond to PCI configuration accesses), although
  106 +its I/O and memory spaces are then disabled. This allows the device to be
  107 +programmatically put into D0. Thus the kernel can switch the device back and
  108 +forth between D0 and the supported low-power states (except for D3cold) and the
  109 +possible power state transitions the device can undergo are the following:
84 110  
85   -Upon resuming from sleep, just the opposite must be true: all bridges must be
86   -powered on and restored before their children are powered on. This is easily
87   -accomplished with a breadth-first walk of the PCI device tree.
  111 ++----------------------------+
  112 +| Current State | New State  |
  113 ++----------------------------+
  114 +| D0            | D1, D2, D3 |
  115 ++----------------------------+
  116 +| D1            | D2, D3     |
  117 ++----------------------------+
  118 +| D2            | D3         |
  119 ++----------------------------+
  120 +| D1, D2, D3    | D0         |
  121 ++----------------------------+
88 122  
  123 +The transition from D3cold to D0 occurs when the supply voltage is provided to
  124 +the device (i.e. power is restored). In that case the device returns to D0 with
  125 +a full power-on reset sequence and the power-on defaults are restored to the
  126 +device by hardware just as at initial power up.
89 127  
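The transition rules in the new table, together with the remark about D3cold, can be summarized as a small predicate. This is a user-space sketch for illustration only: the enum and the function name are hypothetical and are not the kernel's pci_power_t or any kernel interface, and "D3" in the table is taken to mean D3hot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state labels for this sketch; not the kernel's pci_power_t. */
enum pm_state { PM_D0, PM_D1, PM_D2, PM_D3HOT, PM_D3COLD };

/*
 * Return true if software can program the transition from -> to.
 * D3cold is neither entered nor left by writing to the device: it depends
 * on the supply voltage being removed from or restored to the device.
 */
bool pm_transition_programmable(enum pm_state from, enum pm_state to)
{
	assert(from <= PM_D3COLD && to <= PM_D3COLD);

	if (from == PM_D3COLD || to == PM_D3COLD)
		return false;		/* power removal/restore, not a register write */
	if (to == PM_D0)
		return from != PM_D0;	/* D1, D2, D3 -> D0 */
	return to > from;		/* otherwise only deeper states are reachable */
}
```

For example, pm_transition_programmable(PM_D2, PM_D1) is false: per the table, the device has to go back to D0 before it can be put into D1.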
90   -3. PCI Utility Functions
91   -~~~~~~~~~~~~~~~~~~~~~~~~
  128 +PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
  129 +while in a low-power state (D1-D3), but they are not required to be capable
  130 +of generating PMEs from all supported low-power states. In particular, the
  131 +capability of generating PMEs from D3cold is optional and depends on the
  132 +presence of additional voltage (3.3Vaux) allowing the device to remain
  133 +sufficiently active to generate a wakeup signal.
92 134  
93   -These are helper functions designed to be called by individual device drivers.
94   -Assuming that a device behaves as advertised, these should be applicable in most
95   -cases. However, results may vary.
  135 +1.3. ACPI Device Power Management
  136 +---------------------------------
  137 +The platform firmware support for the power management of PCI devices is
  138 +system-specific. However, if the system in question is compliant with the
  139 +Advanced Configuration and Power Interface (ACPI) Specification, like the
  140 +majority of x86-based systems, it is supposed to implement device power
  141 +management interfaces defined by the ACPI standard.
96 142  
97   -Note that these functions are never implicitly called for the driver. The driver
98   -is always responsible for deciding when and if to call these.
  143 +For this purpose the ACPI BIOS provides special functions called "control
  144 +methods" that may be executed by the kernel to perform specific tasks, such as
  145 +putting a device into a low-power state. These control methods are encoded
  146 +using special byte-code language called the ACPI Machine Language (AML) and
  147 +stored in the machine's BIOS. The kernel loads them from the BIOS and executes
  148 +them as needed using an AML interpreter that translates the AML byte code into
  149 +computations and memory or I/O space accesses. This way, in theory, a BIOS
  150 +writer can provide the kernel with a means to perform actions depending
  151 +on the system design in a system-specific fashion.
99 152  
  153 +ACPI control methods may be divided into global control methods, that are not
  154 +associated with any particular devices, and device control methods, that have
  155 +to be defined separately for each device supposed to be handled with the help of
  156 +the platform. This means, in particular, that ACPI device control methods can
  157 +only be used to handle devices that the BIOS writer knew about in advance. The
  158 +ACPI methods used for device power management fall into that category.
100 159  
101   -pci_save_state
102   ---------------
  160 +The ACPI specification assumes that devices can be in one of four power states
  161 +labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
  162 +D0-D3 states (although the difference between D3hot and D3cold is not taken
  163 +into account by ACPI). Moreover, for each power state of a device there is a
  164 +set of power resources that have to be enabled for the device to be put into
  165 +that state. These power resources are controlled (i.e. enabled or disabled)
  166 +with the help of their own control methods, _ON and _OFF, that have to be
  167 +defined individually for each of them.
103 168  
104   -Usage:
105   - pci_save_state(struct pci_dev *dev);
  169 +To put a device into the ACPI power state Dx (where x is a number between 0 and
  170 +3 inclusive) the kernel is supposed to (1) enable the power resources required
  171 +by the device in this state using their _ON control methods and (2) execute the
  172 +_PSx control method defined for the device. In addition to that, if the device
  173 +is going to be put into a low-power state (D1-D3) and is supposed to generate
  174 +wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
  175 +3.0) control method defined for it has to be executed before _PSx. Power
  176 +resources that are not required by the device in the target power state and are
  177 +not required any more by any other device should be disabled (by executing their
  178 +_OFF control methods). If the current power state of the device is D3, it can
  179 +only be put into D0 this way.
106 180  
107   -Description:
108   - Save first 64 bytes of PCI config space, along with any additional
109   - PCI-Express or PCI-X information.
  181 +However, quite often the power states of devices are changed during a
  182 +system-wide transition into a sleep state or back into the working state. ACPI
  183 +defines four system sleep states, S1, S2, S3, and S4, and denotes the system
  184 +working state as S0. In general, the target system sleep (or working) state
  185 +determines the highest power (lowest number) state the device can be put
  186 +into and the kernel is supposed to obtain this information by executing the
  187 +device's _SxD control method (where x is a number between 0 and 4 inclusive).
  188 +If the device is required to wake up the system from the target sleep state, the
  189 +lowest power (highest number) state it can be put into is also determined by the
  190 +target state of the system. The kernel is then supposed to use the device's
  191 +_SxW control method to obtain the number of that state. It also is supposed to
  192 +use the device's _PRW control method to learn which power resources need to be
  193 +enabled for the device to be able to generate wakeup signals.
110 194  
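The constraints described in the new Section 1.3 text (_SxD gives the shallowest state allowed, _SxW gives the deepest state from which wakeup still works) can be sketched as a range check. The function below is hypothetical: it assumes the two values have already been obtained from the firmware, and the choice of the deepest allowed state is a policy assumption of this sketch, not something the text mandates.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: sxd and sxw stand for the values returned by the
 * device's _SxD and _SxW control methods for the target system state
 * (0 = D0 ... 3 = D3). Returns the chosen D-state number, or -1 if the
 * reported constraints cannot be satisfied together.
 */
int acpi_choose_sleep_dstate(int sxd, int sxw, bool wakeup_needed)
{
	int shallowest = sxd;			/* highest-power state allowed */
	int deepest = wakeup_needed ? sxw : 3;	/* lowest-power state allowed */

	assert(sxd >= 0 && sxd <= 3 && sxw >= 0 && sxw <= 3);

	if (deepest < shallowest)
		return -1;
	return deepest;	/* assumption: prefer the state that saves the most power */
}
```

With sxd = 1 and sxw = 2, a device that has to wake up the system ends up in D2, while the same device without the wakeup requirement may go all the way to D3.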
  195 +1.4. Wakeup Signaling
  196 +---------------------
  197 +Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
  198 +a result of the execution of the _DSW (or _PSW) ACPI control method before
  199 +putting the device into a low-power state, have to be caught and handled as
  200 +appropriate. If they are sent while the system is in the working state
  201 +(ACPI S0), they should be translated into interrupts so that the kernel can
  202 +put the devices generating them into the full-power state and take care of the
  203 +events that triggered them. In turn, if they are sent while the system is
  204 +sleeping, they should cause the system's core logic to trigger wakeup.
111 205  
112   -pci_restore_state
113   ------------------
  206 +On ACPI-based systems wakeup signals sent by conventional PCI devices are
  207 +converted into ACPI General-Purpose Events (GPEs) which are hardware signals
  208 +from the system core logic generated in response to various events that need to
  209 +be acted upon. Every GPE is associated with one or more sources of potentially
  210 +interesting events. In particular, a GPE may be associated with a PCI device
  211 +capable of signaling wakeup. The information on the connections between GPEs
  212 +and event sources is recorded in the system's ACPI BIOS from where it can be
  213 +read by the kernel.
114 214  
115   -Usage:
116   - pci_restore_state(struct pci_dev *dev);
  215 +If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
  216 +associated with it (if there is one) is triggered. The GPEs associated with PCI
  217 +bridges may also be triggered in response to a wakeup signal from one of the
  218 +devices below the bridge (this also is the case for root bridges) and, for
  219 +example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
  220 +handled this way.
117 221  
118   -Description:
119   - Restore previously saved config space.
  222 +A GPE may be triggered when the system is sleeping (i.e. when it is in one of
  223 +the ACPI S1-S4 states), in which case system wakeup is started by its core logic
  224 +(the device that was the source of the signal causing the system wakeup to occur
  225 +may be identified later). The GPEs used in such situations are referred to as
  226 +wakeup GPEs.
120 227  
  228 +Usually, however, GPEs are also triggered when the system is in the working
  229 +state (ACPI S0) and in that case the system's core logic generates a System
  230 +Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI
  231 +handler identifies the GPE that caused the interrupt to be generated which,
  232 +in turn, allows the kernel to identify the source of the event (that may be
  233 +a PCI device signaling wakeup). The GPEs used for notifying the kernel of
  234 +events occurring while the system is in the working state are referred to as
  235 +runtime GPEs.
121 236  
122   -pci_set_power_state
123   --------------------
  237 +Unfortunately, there is no standard way of handling wakeup signals sent by
  238 +conventional PCI devices on systems that are not ACPI-based, but there is one
  239 +for PCI Express devices. Namely, the PCI Express Base Specification introduced
  240 +a native mechanism for converting native PCI PMEs into interrupts generated by
  241 +root ports. For conventional PCI devices native PMEs are out-of-band, so they
  242 +are routed separately and they need not pass through bridges (in principle they
  243 +may be routed directly to the system's core logic), but for PCI Express devices
  244 +they are in-band messages that have to pass through the PCI Express hierarchy,
  245 +including the root port on the path from the device to the Root Complex. Thus
  246 +it was possible to introduce a mechanism by which a root port generates an
  247 +interrupt whenever it receives a PME message from one of the devices below it.
  248 +The PCI Express Requester ID of the device that sent the PME message is then
  249 +recorded in one of the root port's configuration registers from where it may be
  250 +read by the interrupt handler allowing the device to be identified. [PME
  251 +messages sent by PCI Express endpoints integrated with the Root Complex don't
  252 +pass through root ports, but instead they cause a Root Complex Event Collector
  253 +(if there is one) to generate interrupts.]
124 254  
125   -Usage:
126   - pci_set_power_state(struct pci_dev *dev, pci_power_t state);
  255 +In principle the native PCI Express PME signaling may also be used on ACPI-based
  256 +systems along with the GPEs, but to use it the kernel has to ask the system's
  257 +ACPI BIOS to release control of root port configuration registers. The ACPI
  258 +BIOS, however, is not required to allow the kernel to control these registers
  259 +and if it doesn't do that, the kernel must not modify their contents. Of course
  260 +the native PCI Express PME signaling cannot be used by the kernel in that case.
127 261  
128   -Description:
129   - Transition device to low power state using PCI PM Capabilities
130   - registers.
131 262  
132   - Will fail under one of the following conditions:
133   - - If state is less than current state, but not D0 (illegal transition)
134   - - Device doesn't support PM Capabilities
135   - - Device does not support requested state
  263 +2. PCI Subsystem and Device Power Management
  264 +============================================
136 265  
  266 +2.1. Device Power Management Callbacks
  267 +--------------------------------------
  268 +The PCI Subsystem participates in the power management of PCI devices in a
  269 +number of ways. First of all, it provides an intermediate code layer between
  270 +the device power management core (PM core) and PCI device drivers.
  271 +Specifically, the pm field of the PCI subsystem's struct bus_type object,
  272 +pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
  273 +pointers to several device power management callbacks:
137 274  
138   -pci_enable_wake
139   ----------------
  275 +const struct dev_pm_ops pci_dev_pm_ops = {
  276 + .prepare = pci_pm_prepare,
  277 + .complete = pci_pm_complete,
  278 + .suspend = pci_pm_suspend,
  279 + .resume = pci_pm_resume,
  280 + .freeze = pci_pm_freeze,
  281 + .thaw = pci_pm_thaw,
  282 + .poweroff = pci_pm_poweroff,
  283 + .restore = pci_pm_restore,
  284 + .suspend_noirq = pci_pm_suspend_noirq,
  285 + .resume_noirq = pci_pm_resume_noirq,
  286 + .freeze_noirq = pci_pm_freeze_noirq,
  287 + .thaw_noirq = pci_pm_thaw_noirq,
  288 + .poweroff_noirq = pci_pm_poweroff_noirq,
  289 + .restore_noirq = pci_pm_restore_noirq,
  290 + .runtime_suspend = pci_pm_runtime_suspend,
  291 + .runtime_resume = pci_pm_runtime_resume,
  292 + .runtime_idle = pci_pm_runtime_idle,
  293 +};
140 294  
141   -Usage:
142   - pci_enable_wake(struct pci_dev *dev, pci_power_t state, int enable);
  295 +These callbacks are executed by the PM core in various situations related to
  296 +device power management and they, in turn, execute power management callbacks
  297 +provided by PCI device drivers. They also perform power management operations
  298 +involving some standard configuration registers of PCI devices that device
  299 +drivers need not know or care about.
143 300  
144   -Description:
145   - Enable device to generate PME# during low power state using PCI PM
146   - Capabilities.
  301 +The structure representing a PCI device, struct pci_dev, contains several fields
  302 +that these callbacks operate on:
147 303  
148   - Checks whether if device supports generating PME# from requested state
149   - and fail if it does not, unless enable == 0 (request is to disable wake
150   - events, which is implicit if it doesn't even support it in the first
151   - place).
  304 +struct pci_dev {
  305 + ...
  306 + pci_power_t current_state; /* Current operating state. */
  307 + int pm_cap; /* PM capability offset in the
  308 + configuration space */
  309 + unsigned int pme_support:5; /* Bitmask of states from which PME#
  310 + can be generated */
  311 + unsigned int pme_interrupt:1;/* Is native PCIe PME signaling used? */
  312 + unsigned int d1_support:1; /* Low power state D1 is supported */
  313 + unsigned int d2_support:1; /* Low power state D2 is supported */
  314 + unsigned int no_d1d2:1; /* D1 and D2 are forbidden */
  315 + unsigned int wakeup_prepared:1; /* Device prepared for wake up */
  316 + unsigned int d3_delay; /* D3->D0 transition time in ms */
  317 + ...
  318 +};
152 319  
153   - Note that the PMC Register in the device's PM Capabilities has a bitmask
154   - of the states it supports generating PME# from. D3hot is bit 3 and
155   - D3cold is bit 4. So, while a value of 4 as the state may not seem
156   - semantically correct, it is.
  320 +They also indirectly use some fields of the struct device that is embedded in
  321 +struct pci_dev.
157 322  
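As an illustration of how the pme_support bitmask shown above can be read, here is a minimal user-space sketch. The struct is a cut-down, hypothetical stand-in for struct pci_dev, and the bit numbering (bit n corresponds to state Dn, with D3hot as bit 3 and D3cold as bit 4) follows the note in the removed pci_enable_wake() description.

```c
#include <assert.h>
#include <stdbool.h>

/* State numbers as used for the bitmask: D0 = 0 ... D3hot = 3, D3cold = 4. */
enum { ST_D0, ST_D1, ST_D2, ST_D3HOT, ST_D3COLD };

/* Cut-down, hypothetical stand-in for the struct pci_dev field shown above. */
struct pm_fields {
	unsigned int pme_support:5;	/* bit n set: PME# can be generated from state n */
};

/* Return true if the device can generate PME# from the given state. */
bool pme_capable(const struct pm_fields *f, int state)
{
	assert(state >= ST_D0 && state <= ST_D3COLD);
	return (f->pme_support & (1u << state)) != 0;
}
```

A device with pme_support equal to 0x18 (bits 3 and 4 set) can signal wakeup from D3hot and D3cold, but not from D1 or D2.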
  323 +2.2. Device Initialization
  324 +--------------------------
  325 +The PCI subsystem's first task related to device power management is to
  326 +prepare the device for power management and initialize the fields of struct
  327 +pci_dev used for this purpose. This happens in two functions defined in
  328 +drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().
158 329  
159   -4. PCI Device Drivers
160   -~~~~~~~~~~~~~~~~~~~~~
  330 +The first of these functions checks if the device supports native PCI PM
  331 +and if that's the case the offset of its power management capability structure
  332 +in the configuration space is stored in the pm_cap field of the device's struct
  333 +pci_dev object. Next, the function checks which PCI low-power states are
  334 +supported by the device and from which low-power states the device can generate
  335 +native PCI PMEs. The power management fields of the device's struct pci_dev and
  336 +the struct device embedded in it are updated accordingly and the generation of
  337 +PMEs by the device is disabled.
161 338  
162   -These functions are intended for use by individual drivers, and are defined in
163   -struct pci_driver:
  339 +The second function checks if the device can be prepared to signal wakeup with
  340 +the help of the platform firmware, such as the ACPI BIOS. If that is the case,
  341 +the function updates the wakeup fields in struct device embedded in the
  342 +device's struct pci_dev and uses the firmware-provided method to prevent the
  343 +device from signaling wakeup.
164 344  
165   - int (*suspend) (struct pci_dev *dev, pm_message_t state);
166   - int (*resume) (struct pci_dev *dev);
  345 +At this point the device is ready for power management. For driverless devices,
  346 +however, this functionality is limited to a few basic operations carried out
  347 +during system-wide transitions to a sleep state and back to the working state.
167 348  
  349 +2.3. Runtime Device Power Management
  350 +------------------------------------
  351 +The PCI subsystem plays a vital role in the runtime power management of PCI
  352 +devices. For this purpose it uses the general runtime power management
  353 +(runtime PM) framework described in Documentation/power/runtime_pm.txt.
  354 +Namely, it provides subsystem-level callbacks:
168 355  
169   -suspend
170   --------
  356 + pci_pm_runtime_suspend()
  357 + pci_pm_runtime_resume()
  358 + pci_pm_runtime_idle()
171 359  
172   -Usage:
  360 +that are executed by the core runtime PM routines. It also implements the
  361 +entire mechanics necessary for handling runtime wakeup signals from PCI devices
  362 +in low-power states, which at the time of this writing works for both the native
  363 +PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
  364 +Section 1.
173 365  
174   -if (dev->driver && dev->driver->suspend)
175   - dev->driver->suspend(dev,state);
  366 +First, a PCI device is put into a low-power state, or suspended, with the help
  367 +of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
  368 +pci_pm_runtime_suspend() to do the actual job. For this to work, the device's
  369 +driver has to provide a pm->runtime_suspend() callback (see below), which is
  370 +run by pci_pm_runtime_suspend() as the first action. If the driver's callback
  371 +returns successfully, the device's standard configuration registers are saved,
  372 +the device is prepared to generate wakeup signals and, finally, it is put into
  373 +the target low-power state.
176 374  
177   -A driver uses this function to actually transition the device into a low power
178   -state. This should include disabling I/O, IRQs, and bus-mastering, as well as
179   -physically transitioning the device to a lower power state; it may also include
180   -calls to pci_enable_wake().
  375 +The low-power state to put the device into is the lowest-power (highest number)
  376 +state from which it can signal wakeup. The exact method of signaling wakeup is
  377 +system-dependent and is determined by the PCI subsystem on the basis of the
  378 +reported capabilities of the device and the platform firmware. To prepare the
  379 +device for signaling wakeup and put it into the selected low-power state, the
  380 +PCI subsystem can use the platform firmware as well as the device's native PCI
  381 +PM capabilities, if supported.
181 382  
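The rule that the runtime target is the lowest-power state from which the device can signal wakeup can be sketched as a scan over two bitmasks. Both masks and the function name are assumptions of this sketch; it only models the native PCI PM side and ignores whatever the platform firmware may add or restrict.

```c
#include <assert.h>

/* State numbers: D0 = 0, D1 = 1, D2 = 2, D3hot = 3. */
enum { ST_D0, ST_D1, ST_D2, ST_D3HOT };

/*
 * Hypothetical helper: bit n of pme_mask is set if wakeup can be signaled
 * from state n, bit n of avail_mask is set if the device can be programmed
 * into state n. Returns the deepest state satisfying both, or ST_D0 if the
 * device cannot signal wakeup from any low-power state.
 */
int deepest_wakeup_state(unsigned int pme_mask, unsigned int avail_mask)
{
	int state;

	assert(pme_mask < 32 && avail_mask < 32);

	/* Scan from the lowest-power (highest number) state downwards. */
	for (state = ST_D3HOT; state > ST_D0; state--)
		if (pme_mask & avail_mask & (1u << state))
			return state;
	return ST_D0;
}
```

For a device that supports D0 through D3hot (avail_mask 0x0f) but can only generate PMEs from D1 and D2 (pme_mask 0x06), the sketch selects D2.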
182   -Bus mastering may be disabled by doing:
  383 +It is expected that the device driver's pm->runtime_suspend() callback will
  384 +not attempt to prepare the device for signaling wakeup or to put it into a
  385 +low-power state. The driver ought to leave these tasks to the PCI subsystem
  386 +that has all of the information necessary to perform them.
183 387  
184   -pci_disable_device(dev);
  388 +A suspended device is brought back into the "active" state, or resumed,
  389 +with the help of pm_request_resume() or pm_runtime_resume() which both call
  390 +pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's
  391 +driver provides a pm->runtime_resume() callback (see below). However, before
  392 +the driver's callback is executed, pci_pm_runtime_resume() brings the device
  393 +back into the full-power state, prevents it from signaling wakeup while in that
  394 +state and restores its standard configuration registers. Thus the driver's
  395 +callback need not worry about the PCI-specific aspects of the device resume.
185 396  
186   -For devices that support the PCI PM Spec, this may be used to set the device's
187   -power state to match the suspend() parameter:
  397 +Note that generally pci_pm_runtime_resume() may be called in two different
  398 +situations. First, it may be called at the request of the device's driver, for
  399 +example if there are some data for it to process. Second, it may be called
  400 +as a result of a wakeup signal from the device itself (this sometimes is
  401 +referred to as "remote wakeup"). Of course, for this purpose the wakeup signal
  402 +is handled in one of the ways described in Section 1 and finally converted into
  403 +a notification for the PCI subsystem after the source device has been
  404 +identified.
188 405  
189   -pci_set_power_state(dev,state);
  406 +The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
  407 +and pm_request_idle(), executes the device driver's pm->runtime_idle()
  408 +callback, if defined, and if that callback doesn't return an error code (or is not
  409 +present at all), suspends the device with the help of pm_runtime_suspend().
  410 +Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
  411 +example, it is called right after the device has just been resumed), in which
  412 +case it is expected to suspend the device if that makes sense. Usually,
  413 +however, the PCI subsystem doesn't really know if the device really can be
  414 +suspended, so it lets the device's driver decide by running its
  415 +pm->runtime_idle() callback.
190 416  
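The idle handling described above can be sketched as a small userspace model.
This is not kernel code: struct model_dev, struct model_pm_ops, and every
model_* function are hypothetical stand-ins, and the error value returned by
idle_busy() is an arbitrary illustration. Only the decision rule (run the
driver's runtime_idle() if defined, suspend unless it returned an error) is
taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct model_dev;

struct model_pm_ops {
	int (*runtime_idle)(struct model_dev *dev);
};

struct model_dev {
	const struct model_pm_ops *pm;	/* NULL if the driver has no PM ops */
	int suspended;			/* set by model_runtime_suspend() */
};

/* Stand-in for pm_runtime_suspend(): it only records the request. */
void model_runtime_suspend(struct model_dev *dev)
{
	dev->suspended = 1;
}

/*
 * Run the driver's runtime_idle() if it is defined, then suspend the
 * device unless that callback returned an error code.
 */
int model_pci_pm_runtime_idle(struct model_dev *dev)
{
	assert(dev != NULL);

	if (dev->pm && dev->pm->runtime_idle) {
		int ret = dev->pm->runtime_idle(dev);

		if (ret)
			return ret;	/* the driver vetoed the suspend */
	}

	model_runtime_suspend(dev);
	return 0;
}

/* Example driver callbacks used to exercise the model. */
int idle_ok(struct model_dev *dev)
{
	(void)dev;
	return 0;
}

int idle_busy(struct model_dev *dev)
{
	(void)dev;
	return -16;	/* arbitrary nonzero error value for illustration */
}
```

In this model a device whose driver supplies idle_busy() stays active, while a
device with idle_ok() or with no callback at all ends up suspended.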
191   -The driver is also responsible for disabling any other device-specific features
192   -(e.g blanking screen, turning off on-card memory, etc).
  417 +2.4. System-Wide Power Transitions
  418 +----------------------------------
  419 +There are a few different types of system-wide power transitions, described in
  420 +Documentation/power/devices.txt. Each of them requires devices to be handled
  421 +in a specific way and the PM core executes subsystem-level power management
  422 +callbacks for this purpose. They are executed in phases such that each phase
  423 +involves executing the same subsystem-level callback for every device belonging
  424 +to the given subsystem before the next phase begins. These phases always run
  425 +after tasks have been frozen.
193 426  
194   -The driver should be sure to track the current state of the device, as it may
195   -obviate the need for some operations.
  427 +2.4.1. System Suspend
196 428  
197   -The driver should update the current_state field in its pci_dev structure in
198   -this function, except for PM-capable devices when pci_set_power_state is used.
  429 +When the system is going into a sleep state in which the contents of memory will
  430 +be preserved, such as one of the ACPI sleep states S1-S3, the phases are:
199 431  
200   -resume
201   -------
  432 + prepare, suspend, suspend_noirq.
202 433  
203   -Usage:
  434 +The following PCI bus type's callbacks, respectively, are used in these phases:
204 435  
205   -if (dev->driver && dev->driver->resume)
206   - dev->driver->resume(dev)
  436 + pci_pm_prepare()
  437 + pci_pm_suspend()
  438 + pci_pm_suspend_noirq()
207 439  
208   -The resume callback may be called from any power state, and is always meant to
209   -transition the device to the D0 state.
  440 +The pci_pm_prepare() routine first puts the device into the "fully functional"
  441 +state with the help of pm_runtime_resume(). Then, it executes the device
  442 +driver's pm->prepare() callback if defined (i.e. if the driver's struct
  443 +dev_pm_ops object is present and the prepare pointer in that object is valid).
210 444  
211   -The driver is responsible for reenabling any features of the device that had
212   -been disabled during previous suspend calls, such as IRQs and bus mastering,
213   -as well as calling pci_restore_state().
  445 +The pci_pm_suspend() routine first checks if the device's driver implements
  446 +legacy PCI suspend routines (see Section 3), in which case the driver's legacy
  447 +suspend callback is executed, if present, and its result is returned. Next, if
  448 +the device's driver doesn't provide a struct dev_pm_ops object (containing
  449 +pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
  450 +simply turns off the device's bus master capability and runs
  451 +pcibios_disable_device() to disable it, unless the device is a bridge (PCI
  452 +bridges are ignored by this routine). Next, the device driver's pm->suspend()
  453 +callback is executed, if defined, and its result is returned if it fails.
  454 +Finally, pci_fixup_device() is called to apply hardware suspend quirks related
  455 +to the device if necessary.
214 456  
215   -If the device is currently in D3, it may need to be reinitialized in resume().
  457 +Note that the suspend phase is carried out asynchronously for PCI devices, so
  458 +the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
  459 +devices that don't depend on each other in a known way (i.e. none of the paths
  460 +in the device tree from the root bridge to a leaf device contains both of them).
216 461  
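The dependency rule for asynchronous suspend can be illustrated with a
userspace sketch. It assumes a hypothetical struct model_node that records only
a parent pointer. The kernel's actual asynchronous machinery is not modeled
here; the sketch only tests whether two devices share a root-to-leaf path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a node in the device tree. */
struct model_node {
	const struct model_node *parent;	/* NULL for the root bridge */
};

/* True if 'a' lies on the path from the root to 'b' (or a == b). */
bool model_is_ancestor_or_self(const struct model_node *a,
			       const struct model_node *b)
{
	assert(a != NULL && b != NULL);

	for (; b != NULL; b = b->parent)
		if (a == b)
			return true;
	return false;
}

/*
 * Two devices may be handled in parallel only if no root-to-leaf path
 * contains both of them, i.e. neither is an ancestor of the other.
 */
bool model_may_run_in_parallel(const struct model_node *a,
			       const struct model_node *b)
{
	return !model_is_ancestor_or_self(a, b) &&
	       !model_is_ancestor_or_self(b, a);
}
```

With a tree of root -> bridge -> nic and root -> gpu, the model reports that
nic and gpu may be suspended in parallel, while bridge and nic may not.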
217   - * Some types of devices, like bus controllers, will preserve context in D3hot
218   - (using Vcc power). Their drivers will often want to avoid re-initializing
219   - them after re-entering D0 (perhaps to avoid resetting downstream devices).
  462 +The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
  463 +been called, which means that the device driver's interrupt handler won't be
  464 +invoked while this routine is running. It first checks if the device's driver
  465 +implements legacy PCI suspend routines (Section 3), in which case the legacy
  466 +late suspend routine is called and its result is returned (the standard
  467 +configuration registers of the device are saved if the driver's callback hasn't
  468 +done that). Second, if the device driver's struct dev_pm_ops object is not
  469 +present, the device's standard configuration registers are saved and the routine
  470 +returns success. Otherwise the device driver's pm->suspend_noirq() callback is
  471 +executed, if present, and its result is returned if it fails. Next, if the
  472 +device's standard configuration registers haven't been saved yet (one of the
  473 +device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
  474 +saves them, prepares the device to signal wakeup (if necessary) and puts it into
  475 +a low-power state.
220 476  
221   - * Other kinds of devices in D3hot will discard device context as part of a
222   - soft reset when re-entering the D0 state.
223   -
224   - * Devices resuming from D3cold always go through a power-on reset. Some
225   - device context can also be preserved using Vaux power.
  477 +The low-power state to put the device into is the lowest-power (highest number)
  478 +state from which it can signal wakeup while the system is in the target sleep
  479 +state. Just like in the runtime PM case described above, the mechanism of
  480 +signaling wakeup is system-dependent and determined by the PCI subsystem, which
  481 +is also responsible for preparing the device to signal wakeup from the system's
  482 +target sleep state as appropriate.
226 483  
227   - * Some systems hide D3cold resume paths from drivers. For example, on PCs
228   - the resume path for suspend-to-disk often runs BIOS powerup code, which
229   - will sometimes re-initialize the device.
  484 +PCI device drivers (that don't implement legacy power management callbacks) are
  485 +generally not expected to prepare devices for signaling wakeup or to put them
  486 +into low-power states. However, if one of the driver's suspend callbacks
  487 +(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
  488 +registers, pci_pm_suspend_noirq() will assume that the device has been prepared
  489 +to signal wakeup and put into a low-power state by the driver (the driver is
  490 +then assumed to have used the helper functions provided by the PCI subsystem for
  491 +this purpose). PCI device drivers are not encouraged to do that, but in some
  492 +rare cases doing that in the driver may be the optimum approach.
230 493  
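The interplay between the driver's callbacks and pci_pm_suspend_noirq() can be
sketched as a userspace model. All names below (struct model_dev, the model_*
functions, driver_saves_state()) are hypothetical, the legacy-callback branch is
left out, and the boolean flags merely stand in for real hardware operations.
The control flow follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev;

struct model_pm_ops {
	int (*suspend_noirq)(struct model_dev *dev);
};

struct model_dev {
	const struct model_pm_ops *pm;	/* NULL: driver has no dev_pm_ops */
	bool state_saved;	/* standard config registers were saved */
	bool wakeup_prepared;	/* the subsystem model prepared wakeup */
	bool low_power;		/* the subsystem model changed the state */
};

/* Stand-in for the bookkeeping effect of pci_save_state(). */
void model_save_state(struct model_dev *dev)
{
	dev->state_saved = true;
}

int model_pci_pm_suspend_noirq(struct model_dev *dev)
{
	assert(dev != NULL);

	/* No dev_pm_ops object: save the registers and report success. */
	if (!dev->pm) {
		model_save_state(dev);
		return 0;
	}

	if (dev->pm->suspend_noirq) {
		int ret = dev->pm->suspend_noirq(dev);

		if (ret)
			return ret;
	}

	/*
	 * If a driver callback already saved the registers, assume the
	 * driver also handled wakeup and the power state itself.
	 */
	if (!dev->state_saved) {
		model_save_state(dev);
		dev->wakeup_prepared = true;
		dev->low_power = true;
	}
	return 0;
}

/* Example of the rare driver that saves the registers on its own. */
int driver_saves_state(struct model_dev *dev)
{
	model_save_state(dev);
	return 0;
}
```

A driver using driver_saves_state() leaves the model's low_power flag unset,
which mirrors the statement that such a driver becomes responsible for the
power state transition.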
231   -To handle resets during D3 to D0 transitions, it may be convenient to share
232   -device initialization code between probe() and resume(). Device parameters
233   -can also be saved before the driver suspends into D3, avoiding re-probe.
  494 +2.4.2. System Resume
234 495  
235   -If the device supports the PCI PM Spec, it can use this to physically transition
236   -the device to D0:
  496 +When the system is undergoing a transition from a sleep state in which the
  497 +contents of memory have been preserved, such as one of the ACPI sleep states
  498 +S1-S3, into the working state (ACPI S0), the phases are:
237 499  
238   -pci_set_power_state(dev,0);
  500 + resume_noirq, resume, complete.
239 501  
240   -Note that if the entire system is transitioning out of a global sleep state, all
241   -devices will be placed in the D0 state, so this is not necessary. However, in
242   -the event that the device is placed in the D3 state during normal operation,
243   -this call is necessary. It is impossible to determine which of the two events is
244   -taking place in the driver, so it is always a good idea to make that call.
  502 +The following PCI bus type's callbacks, respectively, are executed in these
  503 +phases:
245 504  
246   -The driver should take note of the state that it is resuming from in order to
247   -ensure correct (and speedy) operation.
  505 + pci_pm_resume_noirq()
  506 + pci_pm_resume()
  507 + pci_pm_complete()
248 508  
249   -The driver should update the current_state field in its pci_dev structure in
250   -this function, except for PM-capable devices when pci_set_power_state is used.
  509 +The pci_pm_resume_noirq() routine first puts the device into the full-power
  510 +state, restores its standard configuration registers and applies early resume
  511 +hardware quirks related to the device, if necessary. This is done
  512 +unconditionally, regardless of whether or not the device's driver implements
  513 +legacy PCI power management callbacks (this way all PCI devices are in the
  514 +full-power state and their standard configuration registers have been restored
  515 +when their interrupt handlers are invoked for the first time during resume,
  516 +which allows the kernel to avoid problems with the handling of shared interrupts
  517 +by drivers whose devices are still suspended). If legacy PCI power management
  518 +callbacks (see Section 3) are implemented by the device's driver, the legacy
  519 +early resume callback is executed and its result is returned. Otherwise, the
  520 +device driver's pm->resume_noirq() callback is executed, if defined, and its
  521 +result is returned.
251 522  
  523 +The pci_pm_resume() routine first checks if the device's standard configuration
  524 +registers have been restored and restores them if that's not the case (this
  525 +is only necessary in the error path during a failing suspend). Next, resume
  526 +hardware quirks related to the device are applied, if necessary, and if the
  527 +device's driver implements legacy PCI power management callbacks (see
  528 +Section 3), the driver's legacy resume callback is executed and its result is
  529 +returned. Otherwise, the device's wakeup signaling mechanisms are blocked and
  530 +its driver's pm->resume() callback is executed, if defined (the callback's
  531 +result is then returned).
252 532  
  533 +The resume phase is carried out asynchronously for PCI devices, like the
  534 +suspend phase described above, which means that if two PCI devices don't depend
  535 +on each other in a known way, the pci_pm_resume() routine may be executed for
  536 +both of them in parallel.
253 537  
254   -A reference implementation
255   --------------------------
256   -.suspend()
257   -{
258   - /* driver specific operations */
  538 +The pci_pm_complete() routine only executes the device driver's pm->complete()
  539 +callback, if defined.
259 540  
260   - /* Disable IRQ */
261   - free_irq();
262   - /* If using MSI */
263   - pci_disable_msi();
  541 +2.4.3. System Hibernation
264 542  
265   - pci_save_state();
266   - pci_enable_wake();
267   - /* Disable IO/bus master/irq router */
268   - pci_disable_device();
269   - pci_set_power_state(pci_choose_state());
270   -}
  543 +System hibernation is more complicated than system suspend, because it requires
  544 +a system image to be created and written into a persistent storage medium. The
  545 +image is created atomically and all devices are quiesced, or frozen, before that
  546 +happens.
271 547  
272   -.resume()
273   -{
274   - pci_set_power_state(PCI_D0);
275   - pci_restore_state();
276   - /* device's irq possibly is changed, driver should take care */
277   - pci_enable_device();
278   - pci_set_master();
  548 +The freezing of devices is carried out after enough memory has been freed (at
  549 +the time of this writing the image creation requires at least 50% of system RAM
  550 +to be free) in the following three phases:
279 551  
280   - /* if using MSI, device's vector possibly is changed */
281   - pci_enable_msi();
  552 + prepare, freeze, freeze_noirq
282 553  
283   - request_irq();
284   - /* driver specific operations; */
285   -}
  554 +that correspond to the PCI bus type's callbacks:
286 555  
287   -This is a typical implementation. Drivers can slightly change the order
288   -of the operations in the implementation, ignore some operations or add
289   -more driver specific operations in it, but drivers should do something like
290   -this on the whole.
  556 + pci_pm_prepare()
  557 + pci_pm_freeze()
  558 + pci_pm_freeze_noirq()
291 559  
292   -5. Resources
293   -~~~~~~~~~~~~
  560 +This means that the prepare phase is exactly the same as for system suspend.
  561 +The other two phases, however, are different.
294 562  
295   -PCI Local Bus Specification
296   -PCI Bus Power Management Interface Specification
  563 +The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
  564 +the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
  565 +and it doesn't apply the suspend-related hardware quirks. It is executed
  566 +asynchronously for different PCI devices that don't depend on each other in a
  567 +known way.
297 568  
298   - http://www.pcisig.com
  569 +The pci_pm_freeze_noirq() routine, in turn, is similar to
  570 +pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
  571 +routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the
  572 +device for signaling wakeup and put it into a low-power state. Still, it saves
  573 +the device's standard configuration registers if they haven't been saved by one
  574 +of the driver's callbacks.
  575 +
  576 +Once the image has been created, it has to be saved. However, at this point all
  577 +devices are frozen and they cannot handle I/O, while their ability to handle
  578 +I/O is obviously necessary for the image saving. Thus they have to be brought
  579 +back to the fully functional state and this is done in the following phases:
  580 +
  581 + thaw_noirq, thaw, complete
  582 +
  583 +using the following PCI bus type's callbacks:
  584 +
  585 + pci_pm_thaw_noirq()
  586 + pci_pm_thaw()
  587 + pci_pm_complete()
  588 +
  589 +respectively.
  590 +
  591 +The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(),
  592 +but it doesn't put the device into the full power state and doesn't attempt to
  593 +restore its standard configuration registers. It also executes the device
  594 +driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq().
  595 +
  596 +The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
  597 +driver's pm->thaw() callback instead of pm->resume(). It is executed
  598 +asynchronously for different PCI devices that don't depend on each other in a
  599 +known way.
  600 +
  601 +The complete phase is the same as for system resume.
  602 +
  603 +After saving the image, devices need to be powered down before the system can
  604 +enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in
  605 +three phases:
  606 +
  607 + prepare, poweroff, poweroff_noirq
  608 +
  609 +where the prepare phase is exactly the same as for system suspend. The other
  610 +two phases are analogous to the suspend and suspend_noirq phases, respectively.
  611 +The PCI subsystem-level callbacks they correspond to
  612 +
  613 + pci_pm_poweroff()
  614 + pci_pm_poweroff_noirq()
  615 +
  616 +work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
  617 +although they don't attempt to save the device's standard configuration
  618 +registers.
  619 +
  620 +2.4.4. System Restore
  621 +
  622 +System restore requires a hibernation image to be loaded into memory and the
  623 +pre-hibernation memory contents to be restored before the pre-hibernation system
  624 +activity can be resumed.
  625 +
  626 +As described in Documentation/power/devices.txt, the hibernation image is loaded
  627 +into memory by a fresh instance of the kernel, called the boot kernel, which in
  628 +turn is loaded and run by a boot loader in the usual way. After the boot kernel
  629 +has loaded the image, it needs to replace its own code and data with the code
  630 +and data of the "hibernated" kernel stored within the image, called the image
  631 +kernel. For this purpose all devices are frozen just like before creating
  632 +the image during hibernation, in the
  633 +
  634 + prepare, freeze, freeze_noirq
  635 +
  636 +phases described above. However, the devices affected by these phases are only
  637 +those having drivers in the boot kernel; other devices will still be in whatever
  638 +state the boot loader left them.
  639 +
  640 +Should the restoration of the pre-hibernation memory contents fail, the boot
  641 +kernel would go through the "thawing" procedure described above, using the
  642 +thaw_noirq, thaw, and complete phases (that will only affect the devices having
  643 +drivers in the boot kernel), and then continue running normally.
  644 +
  645 +If the pre-hibernation memory contents are restored successfully, which is the
  646 +usual situation, control is passed to the image kernel, which then becomes
  647 +responsible for bringing the system back to the working state. To achieve this,
  648 +it must restore the devices' pre-hibernation functionality, which is done much
  649 +like waking up from the memory sleep state, although it involves different
  650 +phases:
  651 +
  652 + restore_noirq, restore, complete
  653 +
  654 +The first two of these are analogous to the resume_noirq and resume phases
  655 +described above, respectively, and correspond to the following PCI subsystem
  656 +callbacks:
  657 +
  658 + pci_pm_restore_noirq()
  659 + pci_pm_restore()
  660 +
  661 +These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
  662 +respectively, but they execute the device driver's pm->restore_noirq() and
  663 +pm->restore() callbacks, if available.
  664 +
  665 +The complete phase is carried out in exactly the same way as during system
  666 +resume.
  667 +
  668 +
  669 +3. PCI Device Drivers and Power Management
  670 +==========================================
  671 +
  672 +3.1. Power Management Callbacks
  673 +-------------------------------
  674 +PCI device drivers participate in power management by providing callbacks to be
  675 +executed by the PCI subsystem's power management routines described above and by
  676 +controlling the runtime power management of their devices.
  677 +
  678 +At the time of this writing there are two ways to define power management
  679 +callbacks for a PCI device driver, the recommended one, based on using a
  680 +dev_pm_ops structure described in Documentation/power/devices.txt, and the
  681 +"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and
  682 +.resume() callbacks from struct pci_driver are used. The legacy approach,
  683 +however, doesn't allow one to define runtime power management callbacks and is
  684 +not really suitable for any new drivers. Therefore it is not covered by this
  685 +document (refer to the source code to learn more about it).
  686 +
  687 +It is recommended that all PCI device drivers define a struct dev_pm_ops object
  688 +containing pointers to power management (PM) callbacks that will be executed by
  689 +the PCI subsystem's PM routines in various circumstances. A pointer to the
  690 +driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
  691 +its struct pci_driver object. Once that has happened, the "legacy" PM callbacks
  692 +in struct pci_driver are ignored (even if they are not NULL).
  693 +
  694 +The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
  695 +defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
  696 +subsystem will handle the device in a simplified default manner. If they are
  697 +defined, though, they are expected to behave as described in the following
  698 +subsections.
  699 +
  700 +3.1.1. prepare()
  701 +
  702 +The prepare() callback is executed during system suspend, during hibernation
  703 +(when a hibernation image is about to be created), during power-off after
  704 +saving a hibernation image and during system restore, when a hibernation image
  705 +has just been loaded into memory.
  706 +
  707 +This callback is only necessary if the driver's device has children that in
  708 +general may be registered at any time. In that case the role of the prepare()
  709 +callback is to prevent new children of the device from being registered until
  710 +one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.
  711 +
  712 +In addition to that the prepare() callback may carry out some operations
  713 +preparing the device to be suspended, although it should not allocate memory
  714 +(if additional memory is required to suspend the device, it has to be
  715 +preallocated earlier, for example in a suspend/hibernate notifier as described
  716 +in Documentation/power/notifiers.txt).
  717 +
  718 +3.1.2. suspend()
  719 +
  720 +The suspend() callback is only executed during system suspend, after prepare()
  721 +callbacks have been executed for all devices in the system.
  722 +
  723 +This callback is expected to quiesce the device and prepare it to be put into a
  724 +low-power state by the PCI subsystem. It is not required (in fact it is not
  725 +even recommended) that a PCI driver's suspend() callback save the standard
  726 +configuration registers of the device, prepare it for waking up the system, or
  727 +put it into a low-power state. All of these operations can very well be taken
  728 +care of by the PCI subsystem, without the driver's participation.
  729 +
  730 +However, in some rare cases it is convenient to carry out these operations in
  731 +a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and
  732 +pci_set_power_state() should be used to save the device's standard configuration
  733 +registers, to prepare it for system wakeup (if necessary), and to put it into a
  734 +low-power state, respectively. Moreover, if the driver calls pci_save_state(),
  735 +the PCI subsystem will not execute either pci_prepare_to_sleep(), or
  736 +pci_set_power_state() for its device, so the driver is then responsible for
  737 +handling the device as appropriate.
  738 +
  739 +While the suspend() callback is being executed, the driver's interrupt handler
  740 +can be invoked to handle an interrupt from the device, so all suspend-related
  741 +operations relying on the driver's ability to handle interrupts should be
  742 +carried out in this callback.
  743 +
  744 +3.1.3. suspend_noirq()
  745 +
  746 +The suspend_noirq() callback is only executed during system suspend, after
  747 +suspend() callbacks have been executed for all devices in the system and
  748 +after device interrupts have been disabled by the PM core.
  749 +
  750 +The difference between suspend_noirq() and suspend() is that the driver's
  751 +interrupt handler will not be invoked while suspend_noirq() is running. Thus
  752 +suspend_noirq() can carry out operations that would cause race conditions to
  753 +arise if they were performed in suspend().
  754 +
  755 +3.1.4. freeze()
  756 +
  757 +The freeze() callback is hibernation-specific and is executed in two situations,
  758 +during hibernation, after prepare() callbacks have been executed for all devices
  759 +in preparation for the creation of a system image, and during restore,
  760 +after a system image has been loaded into memory from persistent storage and the
  761 +prepare() callbacks have been executed for all devices.
  762 +
  763 +The role of this callback is analogous to the role of the suspend() callback
  764 +described above. In fact, they only need to be different in the rare cases when
  765 +the driver takes the responsibility for putting the device into a low-power
  766 +state.
  767 +
  768 +In those cases the freeze() callback should not prepare the device for system wakeup
  769 +or put it into a low-power state. Still, either it or freeze_noirq() should
  770 +save the device's standard configuration registers using pci_save_state().
  771 +
  772 +3.1.5. freeze_noirq()
  773 +
  774 +The freeze_noirq() callback is hibernation-specific. It is executed during
  775 +hibernation, after prepare() and freeze() callbacks have been executed for all
  776 +devices in preparation for the creation of a system image, and during restore,
  777 +after a system image has been loaded into memory and after prepare() and
  778 +freeze() callbacks have been executed for all devices. It is always executed
  779 +after device interrupts have been disabled by the PM core.
  780 +
  781 +The role of this callback is analogous to the role of the suspend_noirq()
  782 +callback described above and it very rarely is necessary to define
  783 +freeze_noirq().
  784 +
  785 +The difference between freeze_noirq() and freeze() is analogous to the
  786 +difference between suspend_noirq() and suspend().
  787 +
  788 +3.1.6. poweroff()
  789 +
  790 +The poweroff() callback is hibernation-specific. It is executed when the system
  791 +is about to be powered off after saving a hibernation image to persistent
  792 +storage. prepare() callbacks are executed for all devices before poweroff() is
  793 +called.
  794 +
  795 +The role of this callback is analogous to the role of the suspend() and freeze()
  796 +callbacks described above, although it does not need to save the contents of
  797 +the device's registers. In particular, if the driver wants to put the device
  798 +into a low-power state itself instead of allowing the PCI subsystem to do that,
  799 +the poweroff() callback should use pci_prepare_to_sleep() and
  800 +pci_set_power_state() to prepare the device for system wakeup and to put it
  801 +into a low-power state, respectively, but it need not save the device's standard
  802 +configuration registers.
  803 +
  804 +3.1.7. poweroff_noirq()
  805 +
  806 +The poweroff_noirq() callback is hibernation-specific. It is executed after
  807 +poweroff() callbacks have been executed for all devices in the system.
  808 +
  809 +The role of this callback is analogous to the role of the suspend_noirq() and
  810 +freeze_noirq() callbacks described above, but it does not need to save the
  811 +contents of the device's registers.
  812 +
  813 +The difference between poweroff_noirq() and poweroff() is analogous to the
  814 +difference between suspend_noirq() and suspend().
  815 +
  816 +3.1.8. resume_noirq()
  817 +
  818 +The resume_noirq() callback is only executed during system resume, after the
  819 +PM core has enabled the non-boot CPUs. The driver's interrupt handler will not
  820 +be invoked while resume_noirq() is running, so this callback can carry out
  821 +operations that might race with the interrupt handler.
  822 +
  823 +Since the PCI subsystem unconditionally puts all devices into the full power
  824 +state in the resume_noirq phase of system resume and restores their standard
  825 +configuration registers, resume_noirq() is usually not necessary. In general
  826 +it should only be used for performing operations that would lead to race
  827 +conditions if carried out by resume().
  828 +
  829 +3.1.9. resume()
  830 +
  831 +The resume() callback is only executed during system resume, after
  832 +resume_noirq() callbacks have been executed for all devices in the system and
  833 +device interrupts have been enabled by the PM core.
  834 +
  835 +This callback is responsible for restoring the pre-suspend configuration of the
  836 +device and bringing it back to the fully functional state. The device should be
  837 +able to process I/O in a usual way after resume() has returned.
  838 +
  839 +3.1.10. thaw_noirq()
  840 +
  841 +The thaw_noirq() callback is hibernation-specific. It is executed after a
  842 +system image has been created and the non-boot CPUs have been enabled by the PM
  843 +core, in the thaw_noirq phase of hibernation. It also may be executed if the
  844 +loading of a hibernation image fails during system restore (it is then executed
  845 +after enabling the non-boot CPUs). The driver's interrupt handler will not be
  846 +invoked while thaw_noirq() is running.
  847 +
  848 +The role of this callback is analogous to the role of resume_noirq(). The
  849 +difference between these two callbacks is that thaw_noirq() is executed after
  850 +freeze() and freeze_noirq(), so in general it does not need to modify the
  851 +contents of the device's registers.
  852 +
  853 +3.1.11. thaw()
  854 +
  855 +The thaw() callback is hibernation-specific. It is executed after thaw_noirq()
  856 +callbacks have been executed for all devices in the system and after device
  857 +interrupts have been enabled by the PM core.
  858 +
  859 +This callback is responsible for restoring the pre-freeze configuration of
  860 +the device, so that it will work in a usual way after thaw() has returned.
  861 +
  862 +3.1.12. restore_noirq()
  863 +
  864 +The restore_noirq() callback is hibernation-specific. It is executed in the
  865 +restore_noirq phase of hibernation, when the boot kernel has passed control to
  866 +the image kernel and the non-boot CPUs have been enabled by the image kernel's
  867 +PM core.
  868 +
  869 +This callback is analogous to resume_noirq() with the exception that it cannot
  870 +make any assumption on the previous state of the device, even if the BIOS (or
  871 +generally the platform firmware) is known to preserve that state over a
  872 +suspend-resume cycle.
  873 +
  874 +For the vast majority of PCI device drivers there is no difference between
  875 +resume_noirq() and restore_noirq().
  876 +
  877 +3.1.13. restore()
  878 +
  879 +The restore() callback is hibernation-specific. It is executed after
  880 +restore_noirq() callbacks have been executed for all devices in the system and
  881 +after the PM core has enabled device drivers' interrupt handlers to be invoked.
  882 +
  883 +This callback is analogous to resume(), just like restore_noirq() is analogous
  884 +to resume_noirq(). Consequently, the difference between restore_noirq() and
  885 +restore() is analogous to the difference between resume_noirq() and resume().
  886 +
  887 +For the vast majority of PCI device drivers there is no difference between
  888 +resume() and restore().
  889 +
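Since suspend(), freeze(), and poweroff() play analogous roles, and resume(),
thaw(), and restore() likewise, a driver can often point several fields at the
same routine. The sketch below shows this with a cut-down userspace stand-in,
struct model_dev_pm_ops, that contains only six of the fields discussed here.
The foo_* names are hypothetical and the real struct dev_pm_ops has more
members.

```c
#include <stddef.h>

struct device;	/* opaque here; the kernel defines the real structure */

/* Cut-down stand-in for struct dev_pm_ops (system sleep fields only). */
struct model_dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*freeze)(struct device *dev);
	int (*thaw)(struct device *dev);
	int (*poweroff)(struct device *dev);
	int (*restore)(struct device *dev);
};

/* Hypothetical driver routine: quiesce the device, leave power to PCI. */
int foo_quiesce(struct device *dev)
{
	(void)dev;
	return 0;
}

/* Hypothetical driver routine: bring the device back to working order. */
int foo_reinit(struct device *dev)
{
	(void)dev;
	return 0;
}

/* One routine per direction, shared by the analogous callbacks. */
const struct model_dev_pm_ops foo_pm_ops = {
	.suspend  = foo_quiesce,
	.freeze   = foo_quiesce,
	.poweroff = foo_quiesce,
	.resume   = foo_reinit,
	.thaw     = foo_reinit,
	.restore  = foo_reinit,
};
```

Sharing routines this way only fits drivers that leave register saving, wakeup
preparation, and power state changes to the PCI subsystem. A driver that
handles those itself needs distinct suspend() and freeze() callbacks, as noted
in Section 3.1.4.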
  890 +3.1.14. complete()
  891 +
  892 +The complete() callback is executed in the following situations:
  893 + - during system resume, after resume() callbacks have been executed for all
  894 + devices,
  895 + - during hibernation, before saving the system image, after thaw() callbacks
  896 + have been executed for all devices,
  897 + - during system restore, when the system is going back to its pre-hibernation
  898 + state, after restore() callbacks have been executed for all devices.
  899 +It also may be executed if the loading of a hibernation image into memory fails
  900 +(in that case it is run after thaw() callbacks have been executed for all
  901 +devices that have drivers in the boot kernel).
  902 +
  903 +This callback is entirely optional, although it may be necessary if the
  904 +prepare() callback performs operations that need to be reversed.
  905 +
  906 +3.1.15. runtime_suspend()
  907 +
  908 +The runtime_suspend() callback is specific to device runtime power management
  909 +(runtime PM). It is executed by the PM core's runtime PM framework when the
  910 +device is about to be suspended (i.e. quiesced and put into a low-power state)
  911 +at run time.
  912 +
  913 +This callback is responsible for freezing the device and preparing it to be
  914 +put into a low-power state, but it must allow the PCI subsystem to perform all
  915 +of the PCI-specific actions necessary for suspending the device.
  916 +
  917 +3.1.16. runtime_resume()
  918 +
  919 +The runtime_resume() callback is specific to device runtime PM. It is executed
  920 +by the PM core's runtime PM framework when the device is about to be resumed
  921 +(i.e. put into the full-power state and programmed to process I/O normally) at
  922 +run time.
  923 +
  924 +This callback is responsible for restoring the normal functionality of the
  925 +device after it has been put into the full-power state by the PCI subsystem.
  926 +The device is expected to be able to process I/O in the usual way after
  927 +runtime_resume() has returned.
  928 +
  929 +3.1.17. runtime_idle()
  930 +
  931 +The runtime_idle() callback is specific to device runtime PM. It is executed
  932 +by the PM core's runtime PM framework whenever it may be desirable to suspend
  933 +the device according to the PM core's information. In particular, it is
  934 +automatically executed right after runtime_resume() has returned in case the
  935 +resume of the device has happened as a result of a spurious event.
  936 +
  937 +This callback is optional, but if it is not implemented or if it returns 0, the
  938 +PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
  939 +cause the driver's runtime_suspend() callback to be executed.
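
The decision described above can be modelled in user space. The following is
only a minimal sketch and not the kernel's code: struct device is a reduced
stand-in, the cable_attached and suspended fields are hypothetical, and the
model_*() helpers are invented for illustration. Only the runtime_idle()
callback name and the meaning of its return value are taken from the text.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct device. */
struct device {
	bool cable_attached;	/* hypothetical "device is in use" condition */
	bool suspended;		/* set by the modelled pm_runtime_suspend() */
	int (*runtime_idle)(struct device *dev);	/* may be NULL */
};

/* Driver side: veto the suspend while the device is still in use. */
static int foo_runtime_idle(struct device *dev)
{
	if (dev->cable_attached)
		return -EBUSY;	/* nonzero: do not suspend now */
	return 0;		/* zero: suspending is fine */
}

/* Stand-in for pm_runtime_suspend(); the real one runs runtime_suspend(). */
static void model_pm_runtime_suspend(struct device *dev)
{
	dev->suspended = true;
}

/* Model of the bus-level idle handling described in the text. */
static int model_bus_runtime_idle(struct device *dev)
{
	if (dev->runtime_idle) {
		int ret = dev->runtime_idle(dev);

		if (ret)
			return ret;	/* driver vetoed the suspend */
	}
	model_pm_runtime_suspend(dev);	/* callback missing or returned 0 */
	return 0;
}
```

In this model a driver that leaves runtime_idle() unset is treated exactly
like one whose callback returns 0, so the device gets suspended either way.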
  940 +
  941 +3.1.18. Pointing Multiple Callback Pointers to One Routine
  942 +
  943 +Although in principle each of the callbacks described in the previous
  944 +subsections can be defined as a separate function, it often is convenient to
  945 +point two or more members of struct dev_pm_ops to the same routine. There are
  946 +a few convenience macros that can be used for this purpose.
  947 +
  948 +The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
  949 +suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
  950 +members and one resume routine pointed to by the .resume(), .thaw(), and
  951 +.restore() members. The other function pointers in this struct dev_pm_ops are
  952 +unset.
  953 +
  954 +The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
  955 +additionally sets the .runtime_resume() pointer to the same value as
  956 +.resume() (and .thaw() and .restore()) and the .runtime_suspend() pointer to
  957 +the same value as .suspend() (and .freeze() and .poweroff()).
  958 +
  959 +The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside a declaration of struct
  960 +dev_pm_ops to indicate that one suspend routine is to be pointed to by the
  961 +.suspend(), .freeze(), and .poweroff() members and one resume routine is to
  962 +be pointed to by the .resume(), .thaw(), and .restore() members.
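
The effect of these macros can be illustrated with a self-contained user-space
sketch. The struct dev_pm_ops below is a reduced stand-in and the two macro
definitions are simplified re-creations that follow the description above;
they are assumptions made for illustration, not copies of the kernel headers.
The foo_*() routines and the foo_pm_ops and bar_pm_ops objects are
hypothetical.

```c
#include <stddef.h>

struct device;				/* opaque stand-in */
typedef int (*pm_cb)(struct device *dev);

/* Reduced stand-in holding only the members discussed here. */
struct dev_pm_ops {
	pm_cb suspend, resume, freeze, thaw, poweroff, restore;
	pm_cb runtime_suspend, runtime_resume, runtime_idle;
};

/* Simplified re-creations of the macros, following the text above. */
#define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, .resume = resume_fn, \
	.freeze = suspend_fn, .thaw = resume_fn, \
	.poweroff = suspend_fn, .restore = resume_fn,

#define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct dev_pm_ops name = { \
		SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	}

/* Hypothetical driver routines. */
static int foo_suspend(struct device *dev) { (void)dev; return 0; }
static int foo_resume(struct device *dev) { (void)dev; return 0; }
static int foo_idle(struct device *dev) { (void)dev; return 0; }

/* One suspend and one resume routine, every other pointer left unset. */
static SIMPLE_DEV_PM_OPS(foo_pm_ops, foo_suspend, foo_resume);

/* The same system sleep setup, plus a separately chosen runtime callback. */
static const struct dev_pm_ops bar_pm_ops = {
	SET_SYSTEM_SLEEP_PM_OPS(foo_suspend, foo_resume)
	.runtime_idle = foo_idle,
};
```

SET_SYSTEM_SLEEP_PM_OPS is the form to reach for when the remaining members
of the structure need to be filled in individually, as in bar_pm_ops.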
  963 +
  964 +3.2. Device Runtime Power Management
  965 +------------------------------------
  966 +In addition to providing device power management callbacks, PCI device drivers
  967 +are responsible for controlling the runtime power management (runtime PM) of
  968 +their devices.
  969 +
  970 +The PCI device runtime PM is optional, but it is recommended that PCI device
  971 +drivers implement it at least in the cases where there is a reliable way of
  972 +verifying that the device is not used (like when the network cable is detached
  973 +from an Ethernet adapter or there are no devices attached to a USB controller).
  974 +
  975 +To support the PCI runtime PM, the driver first needs to implement the
  976 +runtime_suspend() and runtime_resume() callbacks. It also may need to implement
  977 +the runtime_idle() callback to prevent the device from being suspended again
  978 +every time right after the runtime_resume() callback has returned
  979 +(alternatively, the runtime_suspend() callback will have to check if the
  980 +device should really be suspended and return -EAGAIN if that is not the case).
  981 +
  982 +The runtime PM of PCI devices is disabled by default. It is also blocked by
  983 +pci_pm_init(), which runs the pm_runtime_forbid() helper function. If a PCI
  984 +driver implements the runtime PM callbacks and intends to use the runtime PM
  985 +framework provided by the PM core and the PCI subsystem, it should enable this
  986 +feature by executing the pm_runtime_enable() helper function. However, the
  987 +driver should not call the pm_runtime_allow() helper function unblocking
  988 +the runtime PM of the device. Instead, it should allow user space or some
  989 +platform-specific code to do that (user space can do it via sysfs), although
  990 +once it has called pm_runtime_enable(), it must be prepared to handle the
  991 +runtime PM of the device correctly as soon as pm_runtime_allow() is called
  992 +(which may happen at any time). [It also is possible that user space causes
  993 +pm_runtime_allow() to be called via sysfs before the driver is loaded, so in
  994 +fact the driver has to be prepared to handle the runtime PM of the device as
  995 +soon as it calls pm_runtime_enable().]
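
The division of labor described above can be sketched in user space as
follows. This is an illustration under stated assumptions, not driver code:
struct device is a hypothetical stand-in, pm_runtime_enable() and
pm_runtime_allow() are local stubs that merely record the call (they share
only their names with the kernel helpers), and foo_probe() and
model_user_space_allows() are invented for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct device. */
struct device {
	bool runtime_pm_enabled;
	bool runtime_pm_allowed;
};

/* User-space stubs named after the kernel helpers; they only record calls. */
static void pm_runtime_enable(struct device *dev)
{
	dev->runtime_pm_enabled = true;
}

static void pm_runtime_allow(struct device *dev)
{
	dev->runtime_pm_allowed = true;
}

/* Hypothetical probe routine of a driver that supports runtime PM. */
static int foo_probe(struct device *dev)
{
	/*
	 * Device-specific setup would go here. The runtime PM callbacks
	 * must be ready for use before the next call, because runtime PM
	 * may already have been unblocked from user space.
	 */
	pm_runtime_enable(dev);

	/* No pm_runtime_allow() here: that is left to user space. */
	return 0;
}

/* Models user space (or platform code) unblocking runtime PM later on. */
static void model_user_space_allows(struct device *dev)
{
	pm_runtime_allow(dev);
}
```

The driver only enables the feature; whether and when runtime PM is actually
unblocked stays outside its control.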
  996 +
  997 +The runtime PM framework works by processing requests to suspend or resume
  998 +devices, or to check if they are idle (in which case it is reasonable to
  999 +subsequently request that they be suspended). These requests are represented
  1000 +by work items put into the power management workqueue, pm_wq. Although there
  1001 +are a few situations in which power management requests are automatically
  1002 +queued by the PM core (for example, after processing a request to resume a
  1003 +device the PM core automatically queues a request to check if the device is
  1004 +idle), device drivers are generally responsible for queuing power management
  1005 +requests for their devices. For this purpose they should use the runtime PM
  1006 +helper functions provided by the PM core, discussed in
  1007 +Documentation/power/runtime_pm.txt.
  1008 +
  1009 +Devices can also be suspended and resumed synchronously, without placing a
  1010 +request into pm_wq. In the majority of cases this is also done by their
  1011 +drivers, which use helper functions provided by the PM core for this purpose.
  1012 +
  1013 +For more information on the runtime PM of devices refer to
  1014 +Documentation/power/runtime_pm.txt.
  1015 +
  1016 +
  1017 +4. Resources
  1018 +============
  1019 +
  1020 +PCI Local Bus Specification, Rev. 3.0
  1021 +PCI Bus Power Management Interface Specification, Rev. 1.2
  1022 +Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
  1023 +PCI Express Base Specification, Rev. 2.0
  1024 +Documentation/power/devices.txt
  1025 +Documentation/power/runtime_pm.txt