Overheating causing system shutdowns

06-17-2010


Hello,

I have a Toshiba Satellite A505-S6965. My hardware details are in the attached HardInfo report (PDF).

Here is the most recent kernel error. It never says it is shutting down. It just "dies" even after saying the temp/speed is normal.

Code:

Jun 16 19:28:00 localhost kernel: [  845.153389] CPU0: Temperature above threshold, cpu clock throttled (total events = 1)
Jun 16 19:28:00 localhost kernel: [  845.153394] Disabling lock debugging due to kernel taint
Jun 16 19:28:00 localhost kernel: [  845.155063] CPU0: Temperature/speed normal
Jun 16 19:28:55 localhost kernel: [  900.040028] Machine check events logged
Jun 16 19:29:37 localhost kernel: [  942.068464] CPU1: Temperature above threshold, cpu clock throttled (total events = 1)
Jun 16 19:29:37 localhost kernel: [  942.070516] CPU1: Temperature/speed normal
Jun 16 19:31:25 localhost kernel: [ 1050.040028] Machine check events logged
Jun 16 19:36:33 localhost kernel: [ 1358.494903] CPU0: Temperature above threshold, cpu clock throttled (total events = 12)
Jun 16 19:36:33 localhost kernel: [ 1358.494926] CPU1: Temperature above threshold, cpu clock throttled (total events = 67)
Jun 16 19:36:33 localhost kernel: [ 1358.498717] CPU0: Temperature/speed normal
Jun 16 19:36:33 localhost kernel: [ 1358.498740] CPU1: Temperature/speed normal
Jun 16 22:41:54 localhost kernel: imklog 4.2.0, log source = /var/run/rsyslog/kmsg started.

From earlier in the day, I get this:

Code:

Jun 16 12:17:22 localhost kernel: [177322.466688] CPU1: Temperature above threshold, cpu clock throttled (total events = 37741)
Jun 16 12:17:22 localhost kernel: [177322.466719] CPU0: Temperature above threshold, cpu clock throttled (total events = 37635)
Jun 16 12:17:22 localhost kernel: [177322.470516] CPU1: Temperature/speed normal
Jun 16 12:17:22 localhost kernel: [177322.470525] CPU0: Temperature/speed normal
Jun 16 12:18:11 localhost kernel: [177370.900315] iwlagn 0000:03:00.0: No space for Tx
Jun 16 12:18:11 localhost kernel: [177370.900321] iwlagn 0000:03:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
Jun 16 12:18:11 localhost kernel: [177370.900342] iwlagn 0000:03:00.0: Error setting new RXON (-28)
Jun 16 12:18:11 localhost kernel: [177370.900353] iwlagn 0000:03:00.0: No space for Tx
Jun 16 12:18:11 localhost kernel: [177370.900374] iwlagn 0000:03:00.0: Error sending REPLY_SCAN_CMD: enqueue_hcmd failed: -28
Jun 16 12:18:11 localhost kernel: [177370.900410] iwlagn 0000:03:00.0: No space for Tx
Jun 16 12:18:11 localhost kernel: [177370.900413] iwlagn 0000:03:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
Jun 16 12:18:11 localhost kernel: [177370.900415] iwlagn 0000:03:00.0: Error setting new RXON (-28)
Jun 16 12:18:11 localhost kernel: [177370.900417] iwlagn 0000:03:00.0: No space for Tx
Jun 16 12:18:11 localhost kernel: [177370.900437] iwlagn 0000:03:00.0: Error sending REPLY_RXON: enqueue_hcmd failed: -28
Jun 16 12:18:11 localhost kernel: [177370.900439] iwlagn 0000:03:00.0: Error setting new RXON (-28)
Jun 16 12:18:11 localhost kernel: [177370.900444] iwlagn 0000:03:00.0: No space for Tx
Jun 16 12:18:11 localhost kernel: [177370.900447] iwlagn 0000:03:00.0: Error sending REPLY_TX_POWER_DBM_CMD: enqueue_hcmd failed: -28
Jun 16 19:15:11 localhost kernel: imklog 4.2.0, log source = /var/run/rsyslog/kmsg started.

Checking sensors I get:

Code:

$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:       +66.0°C  (crit = +104.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +52.0°C  (high = +85.0°C, crit = +85.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:      +51.0°C  (high = +85.0°C, crit = +85.0°C)

I've never seen Core 0 or Core 1 reach 85 or above while working on the system.

The above readings are what I'm used to seeing (sometimes around 60 if I'm working the cores hard).

Any idea what is going on and how to stop it?

I don't hear the fan revving up as much as I used to before upgrading from Linux Mint Gloria (Ubuntu 9.04) to Linux Mint Helena (Ubuntu 9.10).

Not quite ready to do an upgrade yet as I need more time to set things up.

With thanks,
Narnie

HardInfo__0_5c__System_Report.pdf (105.0 KB)


06-18-2010


Quote:

Originally Posted by Narnie

Here is the most recent kernel error.

Quote:

It never says it is shutting down. It just "dies" even after saying the temp/speed is normal.

It's pretty hard for the kernel to fake thermal throttling; I think it's really overheating. If Linux has taken control of the fans away from your BIOS, glitched readings from lm_sensors could keep your fans at low speed even under heavy load. Bad calibration in lm_sensors.conf could also prevent the fans from speeding up as fast as they need to. Try disabling lm_sensors so it doesn't seize control of the sensors, leaving the fan under BIOS control. Try cat-ing the temperature from something like /proc/acpi/processor/CPU0/THRM (if available).

Last edited by Corona688; 06-18-2010 at 06:58 PM..


06-19-2010


Quote:

Originally Posted by Corona688

Where are you getting these errors? If you're reading them from the logfile, understand that severe things like kernel panics won't be written to it -- a crashed system writes no files. To get really bad messages you have to watch the messages from a raw system console.

Where it says "machine check events logged" you can get more info by running the "mcelog" command. Though I think that buffer gets cleared by a reboot...

It's pretty hard for the kernel to fake thermal throttling; I think it's really overheating. If Linux has taken control of the fans away from your BIOS, glitched readings from lm_sensors could keep your fans at low speed even under heavy load. Bad calibration in lm_sensors.conf could also prevent the fans from speeding up as fast as they need to. Try disabling lm_sensors so it doesn't seize control of the sensors, leaving the fan under BIOS control. Try cat-ing the temperature from something like /proc/acpi/processor/CPU0/THRM (if available).

Thank you so much for this info.

I'll look into disabling lm_sensors (either uninstall or modprobe -r coretemp). Crossing my fingers.

Is there any way to monitor temps without taking control of the fans? I guess I need to look into more of what the lm_sensors package actually does.

I wrote a script that logs the temps every 10 seconds, with alerts for "warm", "hot", and "critical". So far it has only hit warm (65 deg C; 84 is critical), on no more than 4 non-consecutive occasions. I added a line that also logs the top CPU consumers to see what was causing it. Total processor utilization was about 80 percent at each of those 4 times.
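
For anyone wanting to try something similar, here is a rough sketch of that kind of logger, not the actual script. It assumes the kernel exposes a reading in millidegrees at /sys/class/thermal/thermal_zone0/temp (the path varies by machine and may not exist on yours), and it only reads that file, so it should not touch fan control. The warm (65) and critical (84) thresholds are the ones mentioned above; the "hot" threshold of 75 and all function names are made up for illustration.

```shell
#!/bin/sh
# Thresholds in whole degrees C. WARM and CRIT come from the post;
# HOT is an assumed value picked for illustration.
WARM=65
HOT=75
CRIT=84

# Map a whole-degree temperature to a level name.
classify_temp() {
    t=$1
    if [ "$t" -ge "$CRIT" ]; then echo critical
    elif [ "$t" -ge "$HOT" ]; then echo hot
    elif [ "$t" -ge "$WARM" ]; then echo warm
    else echo ok
    fi
}

# Print the temperature in whole degrees C from a sysfs-style file that
# holds millidegrees. The default path is an assumption.
read_temp_c() {
    zone=${1:-/sys/class/thermal/thermal_zone0/temp}
    [ -r "$zone" ] || return 1
    echo $(( $(cat "$zone") / 1000 ))
}

# Write one log line; when not "ok", also list the top CPU consumers
# (ps header plus three processes, procps syntax).
log_once() {
    temp=$(read_temp_c "$1") || return 1
    level=$(classify_temp "$temp")
    echo "$(date '+%Y-%m-%d %H:%M:%S') temp=${temp}C level=${level}"
    if [ "$level" != ok ]; then
        ps -eo pcpu,comm --sort=-pcpu | head -n 4
    fi
}
```

To log every 10 seconds, something like `while log_once >> "$HOME/temp.log"; do sleep 10; done` would do; the loop stops if the sensor file cannot be read.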

Thanks again for these tips.

I'll post back here with the results (but I won't be able to monitor the temps anymore).

Yours,
Narnie

PS

The logging was from /var/log/kern.log
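
In case it helps anyone digging through the same log, here is a small sketch for summarising the throttle messages per CPU. It assumes the message format shown in the kernel log excerpts earlier in this thread; the function name is made up. For each CPU it prints how many throttle lines were logged and the last "total events" value seen. That counter appears to restart at boot (it went from 37741 back to 1 in those excerpts), so the last value only covers the most recent boot.

```shell
#!/bin/sh
# Summarise "Temperature above threshold" lines per CPU.
# Reads the named files, or standard input when no file is given.
# Output columns: CPU name, number of throttle lines, last "total events".
summarize_throttle() {
    awk '/Temperature above threshold/ {
        cpu = ""
        # Find the field that looks like "CPU0:" and drop the colon.
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^CPU[0-9]+:$/) cpu = substr($i, 1, length($i) - 1)
        }
        # The last field looks like "12)"; strip the closing parenthesis.
        n = $NF
        sub(/\)$/, "", n)
        if (cpu != "") { count[cpu]++; last[cpu] = n }
    }
    END { for (c in count) print c, count[c], last[c] }' "$@" | sort
}
```

Usage would be along the lines of `summarize_throttle /var/log/kern.log`.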

---------- Post updated 06-19-10 at 08:26 AM ---------- Previous update was 06-18-10 at 11:35 PM ----------

Quote:

Originally Posted by Narnie

For anyone reading this thread with the same problem as me: lm_sensors is a service and can be disabled with:

Code:

sudo /etc/init.d/lm-sensors stop

I'm not sure if this is sufficient, or if the module it loads also has to be removed (in my case, the module was coretemp).

---------- Post updated at 08:35 AM ---------- Previous update was at 08:26 AM ----------

OK, this is strange.

It shut down last night with the last temp reading of 49 deg C.

Strange.

After booting back up, I was running nothing windowed but a terminal. I was checking some things when I just happened to check my temps manually.

The temps were at 82 deg C. Huh??? I did a quick

Code:

top

and found that gnome-do was using 200% of my CPU (double-huh???).

I quickly pkilled it and watched as the temp came back down to where I'm used to seeing it. It seems, gnome-do on my system does some run-away things. Bad gnome-do.

Then, after I uninstalled lm-sensors and removed the coretemp module, I stressed the CPU. I am using the generic ATI drivers as the proprietary ones are too buggy for me, so I fired up 5 instances of glxgears and watched the temp go up. With coretemp removed, only one other temp sensor worked (still called CPU, but labeled virtual, another huh?). Its critical max is listed at 104 deg C.

It normally is in the upper 60s/low 70s. With this CPU rendering stress, it went up to 89 deg C, but no higher.

Unless there is an answer to this posting to tell me otherwise, I'm going to try to reenable coretemp and see if all is well under this CPU stress and see if just the lm-sensors service is the culprit.

Narnie


06-20-2010


Quote:

Originally Posted by Narnie

The temps were at 82 deg C. Huh??? I did a quick

Code:

top

and found that gnome-do was using 200% of my CPU (double-huh???).

Each core can go up to 100%. On a two-core system full utilization would be 200%, quad core 400%, etc.
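
A quick sketch of the arithmetic, in case it's useful. The function names are made up; `getconf _NPROCESSORS_ONLN` is used to count online CPUs when no count is given (it works on Linux with glibc, but treat it as an assumption elsewhere).

```shell
#!/bin/sh
# Maximum combined CPU percentage for a given number of cores.
# Falls back to the number of online CPUs when no argument is given.
max_cpu_percent() {
    cores=${1:-$(getconf _NPROCESSORS_ONLN)}
    echo $(( cores * 100 ))
}

# Convert a per-process figure (whole number, e.g. 200) into a share of
# the whole machine, using integer division.
share_of_machine() {
    pct=$1
    cores=${2:-$(getconf _NPROCESSORS_ONLN)}
    echo $(( pct / cores ))
}
```

So on a two-core laptop, `max_cpu_percent 2` gives 200, and a process showing 200% is using the entire machine (`share_of_machine 200 2` gives 100).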

Corona688


06-21-2010


Quote:

Originally Posted by Corona688

Each core can go up to 100%. On a two-core system full utilization would be 200%, quad core 400%, etc.

IC, didn't know that. Thanks. I've never seen something use that much CPU. Don't know what it was doing, but whatever it was, it was literally toasting my system.

Narnie

