01 Nov, 2017

1 commit

  • When the cooling device is cpufreq and heat dissipation is not efficient
    enough, the temperature can increase little by little until it reaches the
    critical threshold, leading to a SoC reset.

    The behavior is reproducible on a hikey6220 with bad heat dissipation (eg.
    stacked with other boards).

    Running a simple C program doing while(1); on each CPU of the SoC makes the
    temperature reach the passive regulation trip point and then climb to the
    maximum allowed temperature, followed by a reset.

    This issue has also been reported when running the libhugetlbfs test suite.

    What is observed is a ping pong between two cpu frequencies, 1.2GHz and 900MHz
    while the temperature continues to grow.

    It appears the step wise governor calls get_target_state() the first time with
    the throttle set to true and the trend to 'raising'. The code logically selects
    the next state, so the cpu frequency decreases from 1.2GHz to 900MHz. So far so
    good. The temperature decreases immediately but still stays above the trip
    point, so get_target_state() is called again, this time with the throttle set
    to true *and* the trend to 'dropping'. From there the algorithm assumes we have
    to step down the state, and the cpu frequency jumps back to 1.2GHz. But the
    temperature is still higher than the trip point, so get_target_state() is
    called with throttle=1 and trend='raising' again and we jump to 900MHz, then
    get_target_state() is called with throttle=1 and trend='dropping' and we jump
    to 1.2GHz, etc. The temperature does not stabilize and continues to increase.

    [ 237.922654] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
    [ 237.922678] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=1
    [ 237.922690] thermal cooling_device0: cur_state=0
    [ 237.922701] thermal cooling_device0: old_target=0, target=1
    [ 238.026656] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=2,throttle=1
    [ 238.026680] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=2,throttle=1
    [ 238.026694] thermal cooling_device0: cur_state=1
    [ 238.026707] thermal cooling_device0: old_target=1, target=0
    [ 238.134647] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
    [ 238.134667] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=1
    [ 238.134679] thermal cooling_device0: cur_state=0
    [ 238.134690] thermal cooling_device0: old_target=0, target=1

    In this situation the temperature continues to increase while the trend is
    oscillating between 'dropping' and 'raising'. We need to keep the current state
    untouched if the throttle is set, so the temperature can decrease or a higher
    state could be selected, thus preventing this oscillation.

    Keeping the next_target untouched when 'throttle' is true at 'dropping' time
    fixes the issue.
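
    The dropping-trend rule described above can be sketched in user-space C as
    follows. This is a simplified illustration, not the governor's actual code:
    the function name sketch_dropping_target() and the bare lower/upper
    parameters are assumptions made for the example, and the special "no target"
    value that the real governor may return is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: pick the next cooling state when the temperature
 * trend is 'dropping'. States are bounded by [lower, upper]; a higher
 * state means more cooling (a lower CPU frequency).
 */
static unsigned long sketch_dropping_target(unsigned long cur_state,
                                            unsigned long lower,
                                            unsigned long upper,
                                            bool throttle)
{
    assert(lower <= upper); /* sketch-only sanity check */

    /* Still above the trip point: keep the current state untouched. */
    if (throttle)
        return cur_state;

    /* Below the trip point: relax by one step, staying within bounds. */
    if (cur_state <= lower)
        return lower;
    if (cur_state - 1 > upper)
        return upper;
    return cur_state - 1;
}
```

    With throttle set, a call such as sketch_dropping_target(1, 0, 2, true)
    keeps state 1, which corresponds to the "old_target=1, target=1" lines in
    the traces that follow.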

    The following traces show the governor does not change the next state if
    trend==2 (dropping) and throttle==1.

    [ 2306.127987] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
    [ 2306.128009] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=1
    [ 2306.128021] thermal cooling_device0: cur_state=0
    [ 2306.128031] thermal cooling_device0: old_target=0, target=1
    [ 2306.231991] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=2,throttle=1
    [ 2306.232016] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=2,throttle=1
    [ 2306.232030] thermal cooling_device0: cur_state=1
    [ 2306.232042] thermal cooling_device0: old_target=1, target=1
    [ 2306.335982] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=0,throttle=1
    [ 2306.336006] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=0,throttle=1
    [ 2306.336021] thermal cooling_device0: cur_state=1
    [ 2306.336034] thermal cooling_device0: old_target=1, target=1
    [ 2306.439984] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=2,throttle=1
    [ 2306.440008] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=2,throttle=0
    [ 2306.440022] thermal cooling_device0: cur_state=1
    [ 2306.440034] thermal cooling_device0: old_target=1, target=0

    [ ... ]

    After a while, if the temperature continues to increase, the next state becomes
    2 which is 720MHz on the hikey. That results in the temperature stabilizing
    around the trip point.

    [ 2455.831982] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
    [ 2455.832006] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=0
    [ 2455.832019] thermal cooling_device0: cur_state=1
    [ 2455.832032] thermal cooling_device0: old_target=1, target=1
    [ 2455.935985] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=0,throttle=1
    [ 2455.936013] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=0,throttle=0
    [ 2455.936027] thermal cooling_device0: cur_state=1
    [ 2455.936040] thermal cooling_device0: old_target=1, target=1
    [ 2456.043984] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=0,throttle=1
    [ 2456.044009] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=0,throttle=0
    [ 2456.044023] thermal cooling_device0: cur_state=1
    [ 2456.044036] thermal cooling_device0: old_target=1, target=1
    [ 2456.148001] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
    [ 2456.148028] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=1
    [ 2456.148042] thermal cooling_device0: cur_state=1
    [ 2456.148055] thermal cooling_device0: old_target=1, target=2
    [ 2456.252009] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=2,throttle=1
    [ 2456.252041] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=2,throttle=0
    [ 2456.252058] thermal cooling_device0: cur_state=2
    [ 2456.252075] thermal cooling_device0: old_target=2, target=1

    IOW, this change is needed to keep the state for a cooling device if the
    temperature trend is oscillating while the temperature increases slightly.

    Without this change, the situation above leads to a catastrophic crash through
    a hardware reset on hikey. This issue has also been reported on an OMAP
    dra7xx.

    Signed-off-by: Daniel Lezcano
    Cc: Keerthy
    Cc: John Stultz
    Cc: Leo Yan
    Tested-by: Keerthy
    Reviewed-by: Keerthy
    Signed-off-by: Eduardo Valentin

    Daniel Lezcano
     

29 Jun, 2017

1 commit


08 Aug, 2016

1 commit

  • When multiple thermal zones are bound to the same cooling device, multiple
    kernel threads may want to update the cooling device state by calling
    thermal_cdev_update(). Having cdev not protected by a mutex can lead to a race
    condition. Consider the following situation with two kernel threads k1 and k2:

    Thread k1                           Thread k2
                                  ||
                                  ||  call thermal_cdev_update()
                                  ||      ...
                                  ||      set_cur_state(cdev, target);
    call power_actor_set_power()  ||
        ...                       ||
        instance->target = state; ||
        cdev->updated = false;    ||
                                  ||      cdev->updated = true;
                                  ||      // completes execution
    call thermal_cdev_update()    ||
        // cdev->updated == true  ||
        return;                   ||
                                  \/
                                 time

    k2 has already looped through the thermal instances looking for the deepest
    cooling device state and is preempted right before setting cdev->updated to
    true. Now, k1 runs, modifies the thermal instance state and sets cdev->updated
    to false. Then, k1 is preempted and k2 continues the execution by setting
    cdev->updated to true, therefore preventing k1 from performing the update.
    Notice that this is not an issue if k2 looks at the instance->target modified by
    k1 "after" it is assigned by k1. In fact, in this case the update will happen
    anyway and k1 can safely return immediately from thermal_cdev_update().

    This may lead to a situation where a thermal governor never updates the cooling
    device. For example, this is the case for the step_wise governor: when calling
    the function thermal_zone_trip_update(), the governor may always get a new state
    equal to the old one (which, however, wasn't notified to the cooling device) and
    will therefore skip the update.
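
    The idea of protecting the cooling device with a mutex can be sketched in
    user-space C as below. This is only an illustration under stated
    assumptions: pthread mutexes stand in for the kernel's locking primitives,
    and struct sketch_cdev, sketch_set_target() and sketch_cdev_update() are
    hypothetical names that collapse the per-instance bookkeeping into a single
    target field.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, simplified cooling device with a single requested target. */
struct sketch_cdev {
    pthread_mutex_t lock;    /* protects target, cur_state and updated */
    unsigned long target;    /* state requested by a governor */
    unsigned long cur_state; /* state last applied to the device */
    bool updated;            /* true when cur_state reflects target */
};

/* Governor side: record a new target and mark the device as stale. */
static void sketch_set_target(struct sketch_cdev *cdev, unsigned long state)
{
    assert(cdev);
    pthread_mutex_lock(&cdev->lock);
    cdev->target = state;
    cdev->updated = false;
    pthread_mutex_unlock(&cdev->lock);
}

/* Update side: apply the target unless the device is already up to date. */
static void sketch_cdev_update(struct sketch_cdev *cdev)
{
    assert(cdev);
    pthread_mutex_lock(&cdev->lock);
    if (!cdev->updated) {
        cdev->cur_state = cdev->target;
        cdev->updated = true;
    }
    pthread_mutex_unlock(&cdev->lock);
}
```

    Because the read of the target and the write of the updated flag happen
    under the same lock, a concurrent sketch_set_target() can no longer slip in
    between them and have its request marked as already applied.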

    CC: Zhang Rui
    CC: Eduardo Valentin
    CC: Peter Feuerer
    Reported-by: Toby Huang
    Signed-off-by: Michele Di Giorgio
    Reviewed-by: Javi Merino
    Signed-off-by: Zhang Rui

    Michele Di Giorgio
     

29 Dec, 2015

1 commit

  • After a thermal zone device is registered, no temperature has been read
    yet, so tz->temperature should not be 0, which actually means 0C,
    and the thermal trend is not available.
    In this case, the first thermal_zone_device_update() needs
    special handling.

    Both the thermal core framework and the step_wise governor are
    enhanced to handle this. Since the step_wise governor
    is the only one that uses trends, it is the only thermal
    governor that needs to be updated.
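
    One way to picture the "no temperature read yet" case is a sentinel value
    that cannot be a real reading. The sketch below is an assumption-laden
    illustration, not the code from this commit: SKETCH_TEMP_INVALID, struct
    sketch_zone and the helper functions are hypothetical names, and the
    sentinel is simply chosen below absolute zero in millidegrees Celsius.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed sentinel: below absolute zero, so never a real reading. */
#define SKETCH_TEMP_INVALID (-274000)

enum sketch_trend {
    SKETCH_TREND_STABLE,
    SKETCH_TREND_RAISING,
    SKETCH_TREND_DROPPING
};

/* Hypothetical, simplified thermal zone state (millidegrees Celsius). */
struct sketch_zone {
    int temperature;      /* latest reading */
    int last_temperature; /* previous reading */
};

/* Registration: nothing has been read yet, so do not pretend it is 0C. */
static void sketch_zone_init(struct sketch_zone *tz)
{
    tz->temperature = SKETCH_TEMP_INVALID;
    tz->last_temperature = SKETCH_TEMP_INVALID;
}

static void sketch_update_temperature(struct sketch_zone *tz, int temp)
{
    tz->last_temperature = tz->temperature;
    tz->temperature = temp;
}

/* A trend needs two valid readings; the first update has only one. */
static bool sketch_trend_available(const struct sketch_zone *tz)
{
    return tz->temperature != SKETCH_TEMP_INVALID &&
           tz->last_temperature != SKETCH_TEMP_INVALID;
}

static enum sketch_trend sketch_get_trend(const struct sketch_zone *tz)
{
    if (!sketch_trend_available(tz))
        return SKETCH_TREND_STABLE; /* first update: no trend to act on */
    if (tz->temperature > tz->last_temperature)
        return SKETCH_TREND_RAISING;
    if (tz->temperature < tz->last_temperature)
        return SKETCH_TREND_DROPPING;
    return SKETCH_TREND_STABLE;
}
```

    A governor that relies on trends can then check sketch_trend_available()
    on the first update and decide from the throttle condition alone.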

    CC: #3.18+
    Tested-by: Manuel Krause
    Tested-by: szegad
    Tested-by: prash
    Tested-by: amish
    Tested-by: Matthias
    Reviewed-by: Javi Merino
    Signed-off-by: Zhang Rui
    Signed-off-by: Chen Yu

    Zhang Rui
     

03 Aug, 2015

1 commit

  • The thermal code uses int, long and unsigned long for temperatures
    in different places.

    Using an unsigned type needlessly limits the thermal framework to positive
    temperatures. In addition, several drivers currently report temperatures
    near UINT_MAX for temperatures below 0°C. A machine started below 0°C will
    therefore probably shut down immediately due to a false overtemperature
    reading.

    'long' is 64bit on several architectures. This is not needed since INT_MAX
    millidegrees Celsius is above the melting point of all known materials.

    Consistently use a plain 'int' for temperatures throughout the thermal code and
    the drivers. This only changes the places in the drivers where the temperature
    is passed around as a pointer. Where drivers internally use another type, that
    type is not changed.
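
    The overtemperature effect of an unsigned type can be reproduced with a
    small user-space sketch. The 90°C critical trip value and the function
    names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical critical trip point: 90 degrees C in millidegrees. */
#define SKETCH_CRIT_MILLI_C 90000

/* Old style: the reading travels through an unsigned type. */
static bool sketch_is_critical_unsigned(unsigned long temp)
{
    return temp >= SKETCH_CRIT_MILLI_C;
}

/* New style: a plain int keeps sub-zero readings negative. */
static bool sketch_is_critical_int(int temp)
{
    return temp >= SKETCH_CRIT_MILLI_C;
}
```

    Passing a reading of -5000 (that is, -5°C) through the unsigned variant
    converts it to a huge positive value that compares above the critical trip,
    while the int variant correctly reports that the zone is not critical.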

    Signed-off-by: Sascha Hauer
    Acked-by: Geert Uytterhoeven
    Reviewed-by: Jean Delvare
    Reviewed-by: Lukasz Majewski
    Reviewed-by: Darren Hart
    Reviewed-by: Heiko Stuebner
    Reviewed-by: Peter Feuerer
    Cc: Punit Agrawal
    Cc: Zhang Rui
    Cc: Eduardo Valentin
    Cc: linux-pm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: Jean Delvare
    Cc: Peter Feuerer
    Cc: Heiko Stuebner
    Cc: Lukasz Majewski
    Cc: Stephen Warren
    Cc: Thierry Reding
    Cc: linux-acpi@vger.kernel.org
    Cc: platform-driver-x86@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-omap@vger.kernel.org
    Cc: linux-samsung-soc@vger.kernel.org
    Cc: Guenter Roeck
    Cc: Rafael J. Wysocki
    Cc: Maxime Ripard
    Cc: Darren Hart
    Cc: lm-sensors@lm-sensors.org
    Signed-off-by: Zhang Rui

    Sascha Hauer
     

06 Feb, 2015

1 commit


11 Oct, 2014

1 commit


09 Oct, 2014

1 commit

  • It turns out that some boards can have instance->lower greater than 0, and
    when the thermal trend is dropping this results in next_target being equal
    to -1.

    Since next_target is defined as unsigned long, it is interpreted as
    0xFFFFFFFF, which is larger than instance->upper.
    As a result, next_target is set to instance->upper, which ramps the cooling
    device up to its maximal target while the temperature is steadily decreasing.
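
    The wrap-around can be illustrated with the following user-space sketch.
    Both functions are hypothetical reconstructions written for this example:
    the first mimics the behaviour described above, and the second shows one
    possible guard (treating every state at or below the lower bound as the
    floor), which is not necessarily the exact change made by this commit.
    SKETCH_NO_TARGET is an assumed "no request" marker.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_NO_TARGET ULONG_MAX /* assumed "no request" marker */

/* Problem case: only cur_state == lower is treated as the floor. */
static unsigned long sketch_drop_unguarded(unsigned long cur_state,
                                           unsigned long lower,
                                           unsigned long upper)
{
    unsigned long next_target;

    if (cur_state == lower)
        return SKETCH_NO_TARGET;

    next_target = cur_state - 1; /* wraps to ULONG_MAX when cur_state is 0 */
    if (next_target > upper)
        next_target = upper;     /* wrapped value becomes maximum cooling */
    return next_target;
}

/* Possible guard: any state at or below 'lower' is already at the floor. */
static unsigned long sketch_drop_guarded(unsigned long cur_state,
                                         unsigned long lower,
                                         unsigned long upper)
{
    unsigned long next_target;

    if (cur_state <= lower)
        return SKETCH_NO_TARGET;

    next_target = cur_state - 1; /* cannot wrap: cur_state > lower >= 0 */
    if (next_target > upper)
        next_target = upper;
    return next_target;
}
```

    With lower=2, upper=5 and the device at state 0, the unguarded version
    returns 5 (maximum cooling) although the temperature is dropping, whereas
    the guarded version returns the "no request" marker.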

    Signed-off-by: Lukasz Majewski
    Signed-off-by: Zhang Rui

    Lukasz Majewski
     

29 Jul, 2014

1 commit


02 Jan, 2014

1 commit

  • To ease debugging thermal problems, add these dynamic debug statements
    so that users do not need to rebuild the kernel to see this information.

    Based on a patch from Zhang Rui for debugging on bugzilla:
    https://bugzilla.kernel.org/attachment.cgi?id=98671

    Sample output after turning on dynamic debug with the following command:
    # echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control

    [ 355.147627] update_temperature: thermal thermal_zone0: last_temperature=52000, current_temperature=55000
    [ 355.147636] thermal_zone_trip_update: thermal thermal_zone0: Trip1[type=1,temp=79000]:trend=2,throttle=0
    [ 355.147644] get_target_state: thermal cooling_device8: cur_state=0
    [ 355.147647] thermal_zone_trip_update: thermal cooling_device8: old_target=-1, target=-1
    [ 355.147652] get_target_state: thermal cooling_device7: cur_state=0
    [ 355.147655] thermal_zone_trip_update: thermal cooling_device7: old_target=-1, target=-1
    [ 355.147660] get_target_state: thermal cooling_device6: cur_state=0
    [ 355.147663] thermal_zone_trip_update: thermal cooling_device6: old_target=-1, target=-1
    [ 355.147668] get_target_state: thermal cooling_device5: cur_state=0
    [ 355.147671] thermal_zone_trip_update: thermal cooling_device5: old_target=-1, target=-1
    [ 355.147678] thermal_zone_trip_update: thermal thermal_zone0: Trip2[type=0,temp=90000]:trend=1,throttle=0
    [ 355.147776] get_target_state: thermal cooling_device0: cur_state=0
    [ 355.147783] thermal_zone_trip_update: thermal cooling_device0: old_target=-1, target=-1
    [ 355.147792] thermal_zone_trip_update: thermal thermal_zone0: Trip3[type=0,temp=80000]:trend=1,throttle=0
    [ 355.147845] get_target_state: thermal cooling_device1: cur_state=0
    [ 355.147849] thermal_zone_trip_update: thermal cooling_device1: old_target=-1, target=-1
    [ 355.147856] thermal_zone_trip_update: thermal thermal_zone0: Trip4[type=0,temp=70000]:trend=1,throttle=0
    [ 355.147904] get_target_state: thermal cooling_device2: cur_state=0
    [ 355.147908] thermal_zone_trip_update: thermal cooling_device2: old_target=-1, target=-1
    [ 355.147915] thermal_zone_trip_update: thermal thermal_zone0: Trip5[type=0,temp=60000]:trend=1,throttle=0
    [ 355.147963] get_target_state: thermal cooling_device3: cur_state=0
    [ 355.147967] thermal_zone_trip_update: thermal cooling_device3: old_target=-1, target=-1
    [ 355.147973] thermal_zone_trip_update: thermal thermal_zone0: Trip6[type=0,temp=55000]:trend=1,throttle=1
    [ 355.148022] get_target_state: thermal cooling_device4: cur_state=0
    [ 355.148025] thermal_zone_trip_update: thermal cooling_device4: old_target=-1, target=1
    [ 355.148036] thermal_cdev_update: thermal cooling_device4: zone0->target=1
    [ 355.169279] thermal_cdev_update: thermal cooling_device4: set to state 1

    Signed-off-by: Aaron Lu
    Acked-by: Eduardo Valentin
    Signed-off-by: Zhang Rui

    Aaron Lu
     

15 Aug, 2013

2 commits

  • When the trend is not changing or when there is no
    request for throttling, the instance is expected
    not to change its requested target. This patch improves
    the implementation to cover this expected behavior.

    With the current implementation, the instance always
    resets to cdev.cur_state, even in unexpected cases
    such as those mentioned above.

    This patch changes the step_wise governor implementation
    of get_target so that:
    (a) - the default value is the current instance->target, so
    we do not change the thermal instance target unnecessarily.
    (b) - the code is now clear about its intention, with
    a clear statement of the expected outcomes.
    (c) - hardcoded constants are removed in favor of
    the THERMAL_NO_TARGET macro.
    (d) - variable names are improved so that the reader can
    clearly tell apart the instance's current target, the
    next target and the cdev cur_state.

    Cc: Zhang Rui
    Cc: Durgadoss R
    Cc: linux-pm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Reported-by: Ruslan Ruslichenko
    Signed-off-by: Eduardo Valentin
    Signed-off-by: Zhang Rui

    Eduardo Valentin
     
  • The cooling device only needs an update on a new target state. Since we
    already check the old target in thermal_zone_trip_update(), we can do one
    more check to see if it's a new target state. If not, we can reasonably
    save some unnecessary code execution.
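
    The extra check amounts to comparing the old and new targets before
    flagging the cooling device for an update. A minimal user-space sketch,
    with hypothetical struct and function names, could look like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified thermal instance and cooling device. */
struct sketch_instance {
    unsigned long target; /* state currently requested by this instance */
};

struct sketch_device {
    bool updated;         /* false means the device needs a refresh */
};

/*
 * Record a new target only when it differs from the old one.
 * Returns true when the cooling device has been flagged for an update.
 */
static bool sketch_request_target(struct sketch_instance *inst,
                                  struct sketch_device *dev,
                                  unsigned long new_target)
{
    assert(inst && dev);

    if (inst->target == new_target)
        return false;     /* same target: skip the update path */

    inst->target = new_target;
    dev->updated = false;
    return true;
}
```

    The shortcut is only safe if the previous target actually reached the
    device, which is the assumption the later race-condition fix in this
    history is concerned with.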

    Signed-off-by: Shawn Guo
    Acked-by: Eduardo Valentin
    Signed-off-by: Zhang Rui

    Shawn Guo
     

14 Apr, 2013

1 commit

  • The thermal governors are part of the thermal framework
    rather than a separate feature/module, because the
    generic thermal layer cannot work without thermal
    governors and must load them during its initialization.

    Build them into one module in this patch.

    This also fixes a problem where the generic thermal layer does not
    work when CONFIG_THERMAL=m and CONFIG_THERMAL_GOV_XXX=y.

    Signed-off-by: Zhang Rui
    Acked-by: Eduardo Valentin
    Acked-by: Durgadoss R

    Zhang Rui
     

12 Apr, 2013

1 commit

  • When selecting a target cooling state in get_target_state(), make sure
    that the state is at least as high as the minimum when the temperature
    is rising and at least as low as the maximum when the temperature is
    falling. This is necessary because, in the THERMAL_TREND_RAISING and
    THERMAL_TREND_DROPPING cases, the current state may only be incremented
    or decremented by one even if it is outside the bounds of the thermal
    instance. This might occur, for example, if the CPU is heating up
    and hits a thermal trip point for the first time when its frequency
    is much higher than the range specified by the thermal instance
    corresponding to the trip point.
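
    The bounding rule can be sketched in user-space C as follows. The function
    names are hypothetical and the code is a simplified illustration of the
    rule as stated above, not the governor's implementation.

```c
#include <assert.h>

/* Rising trend: step up by one, but never end up below 'lower'. */
static unsigned long sketch_step_up(unsigned long cur_state,
                                    unsigned long lower,
                                    unsigned long upper)
{
    unsigned long next;

    assert(lower <= upper); /* sketch-only sanity check */

    next = cur_state < upper ? cur_state + 1 : upper;
    if (next < lower)
        next = lower;       /* jump straight into the instance's range */
    return next;
}

/* Falling trend: step down by one, but never end up above 'upper'. */
static unsigned long sketch_step_down(unsigned long cur_state,
                                      unsigned long lower,
                                      unsigned long upper)
{
    unsigned long next;

    assert(lower <= upper); /* sketch-only sanity check */

    next = cur_state > lower ? cur_state - 1 : lower;
    if (next > upper)
        next = upper;       /* jump straight into the instance's range */
    return next;
}
```

    For an instance with lower=3 and upper=6, a device at state 0 on a rising
    trend goes directly to state 3 instead of state 1, and a device at state 9
    on a falling trend goes directly to state 6 instead of state 8.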

    Signed-off-by: Andrew Bresticker
    Acked-by: Eduardo Valentin
    Signed-off-by: Zhang Rui

    Andrew Bresticker
     

04 Jan, 2013

2 commits


12 Dec, 2012

1 commit


05 Nov, 2012

2 commits

  • Fixes the following sparse warnings:
    drivers/thermal/step_wise.c:153:5: warning:
    symbol 'step_wise_throttle' was not declared. Should it be static?
    drivers/thermal/step_wise.c:172:25: warning:
    symbol 'thermal_gov_step_wise' was not declared. Should it be static?

    Signed-off-by: Sachin Kamat
    Acked-by: Durgadoss R
    Signed-off-by: Zhang Rui

    Sachin Kamat
     
  • This patch adds a simple step_wise governor to the
    generic thermal layer. This algorithm throttles the
    cooling devices in a linear fashion. If the 'trend'
    is heating, it throttles by one step. And if the
    thermal trend is cooling it de-throttles by one step.

    This actually moves the throttling logic from thermal_sys.c
    and puts it inside step_wise.c, without any change.

    Signed-off-by: Durgadoss R
    Signed-off-by: Zhang Rui

    Durgadoss R