Special Forums Hardware Overheating causing system shutdowns Post 302430816 by Narnie on Saturday 19th of June 2010 09:35:08 AM
Quote:
Originally Posted by Corona688
Where are you getting these errors? If you're reading them from the logfile, understand that severe things like kernel panics won't be written to it -- a crashed system writes no files. To get really bad messages you have to watch the messages from a raw system console.

Where it says "machine check events logged" you can get more info by running the "mcelog" command. Though I think that buffer gets cleared by a reboot...

It's pretty hard for the kernel to fake thermal throttling, so I think it's really overheating. If Linux has taken control of the fans away from your BIOS, glitched readings from lm_sensors could keep your fans at low speed even under heavy load. Bad calibration in lm_sensors.conf could prevent the fans from speeding up as fast as they need to. Try disabling lm_sensors so it doesn't seize control of the sensors, leaving the fan under BIOS control. Try cat-ing the temperature from something like /proc/acpi/processor/CPU0/THRM (if available).
Thank you so much for this info.

I'll look into disabling lm_sensors (either uninstall or modprobe -r coretemp). Crossing my fingers.

Is there any way to monitor temps without taking control of the fans? I guess I need to look into more of what the lm_sensors package actually does.

I wrote a script that logs the temps every 10 seconds, with alerts for "warm," "hot," and "critical." So far it has only hit warm (65 deg C -- 84 is critical), and on no more than 4 non-consecutive occasions. I added a line that also logs the top CPU consumers so I could see what was causing it. Total processor utilization was about 80 percent at each of those 4 times.
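
A minimal sketch of a logger along those lines, not the actual script. The 65 and 84 thresholds come from the post; the "hot" threshold of 75, the sysfs path, and the function names are assumptions for illustration. It reads the temperature straight from the kernel's thermal zone file (millidegrees C), which may differ or be absent on a given machine, and it only reads values, so it should not touch fan control.

```shell
#!/bin/sh
# Sketch only. WARM and CRITICAL are from the post; HOT is an assumed value.
WARM=65
HOT=75        # assumption: the post gives no "hot" threshold
CRITICAL=84

# classify_temp DEGREES_C: print ok, warm, hot, or critical.
classify_temp() {
    if [ "$1" -ge "$CRITICAL" ]; then echo critical
    elif [ "$1" -ge "$HOT" ]; then echo hot
    elif [ "$1" -ge "$WARM" ]; then echo warm
    else echo ok
    fi
}

# read_temp_c: print whole degrees C from a sysfs thermal zone file.
# Assumption: the file holds millidegrees; set TEMP_FILE if your path differs.
read_temp_c() {
    zone=${TEMP_FILE:-/sys/class/thermal/thermal_zone0/temp}
    milli=$(cat "$zone") || return 1
    echo $((milli / 1000))
}

# log_once: print one timestamped reading; when the level is not ok,
# also list the three processes with the highest CPU share.
log_once() {
    t=$(read_temp_c) || return 1
    level=$(classify_temp "$t")
    echo "$(date '+%F %T') temp=${t}C level=${level}"
    if [ "$level" != ok ]; then
        ps -eo pcpu,pid,comm --sort=-pcpu | head -n 4
    fi
}
```

Calling log_once in a loop with a 10-second sleep, and appending its output to a file, would give the same cadence as described above.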

Thanks again for these tips.

I'll post back here with the results (but I won't be able to monitor the temps anymore).

Yours,
Narnie

PS

The logging was from /var/log/kern.log

---------- Post updated 06-19-10 at 08:26 AM ---------- Previous update was 06-18-10 at 11:35 PM ----------

For anyone reading this thread with the same problem: lm_sensors runs as a service and can be disabled with:
Code:
sudo /etc/init.d/lm-sensors stop

I'm not sure whether this is sufficient, or whether the module it loads also has to be removed (in my case, the module was coretemp).
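
One way to see whether the module is still loaded after stopping the service is to look at the kernel's module list. The helper below is a sketch; the function name and the MODULES_FILE override are made up for illustration, and it only checks, it does not unload anything.

```shell
# module_loaded NAME: succeed if NAME is listed in the kernel module table.
# Assumption: /proc/modules is available; MODULES_FILE overrides it for testing.
module_loaded() {
    list=${MODULES_FILE:-/proc/modules}
    # The first field of each line in /proc/modules is the module name.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$list"
}

# Possible use (commented out because it changes system state):
# if module_loaded coretemp; then sudo modprobe -r coretemp; fi
```

If module_loaded coretemp still succeeds after the service is stopped, the module is still in the kernel and would need modprobe -r to go away.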

---------- Post updated at 08:35 AM ---------- Previous update was at 08:26 AM ----------

OK, this is strange.

It shut down last night with the last temp reading of 49 deg C.

Strange.

After booting back up, I was running nothing windowed but a terminal. I was checking some things when I just happened to check my temps manually.

The temps were at 82 deg C. Huh??? I did a quick
Code:
top

and found that gnome-do was using 200% of my CPU (double-huh???).

I quickly pkilled it and watched the temp come back down to where I'm used to seeing it. It seems gnome-do sometimes runs away on my system. Bad gnome-do.
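
For catching a runaway like that without watching top, here is a rough sketch that filters ps output for anything at or above a CPU limit. The function name and the limit of 150 are made up for illustration. Note that ps reports CPU use averaged over the life of the process, so it will not match top's live figure exactly. On a multi-core machine, a multi-threaded process can show more than 100% (200% is two cores fully busy).

```shell
# filter_hogs LIMIT: read "pcpu pid comm" lines (with a header) on stdin
# and print "pid command cpu%" for every process at or above LIMIT.
filter_hogs() {
    awk -v limit="$1" 'NR > 1 && $1 + 0 >= limit + 0 { print $2, $3, $1 "%" }'
}

# Possible use:
# ps -eo pcpu,pid,comm | filter_hogs 150
```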

Then, after I uninstalled lm-sensors and removed the coretemp module, I stressed the CPU. I am using the generic ATI drivers because the proprietary one is too buggy for me, so I fired up 5 instances of glxgears and watched the temp go up. With coretemp removed, only one other temp sensor worked (still called CPU, but labeled virtual -- another huh?). Its critical max is listed at 104 deg C.

It normally is in the upper 60s/low 70s. With this CPU rendering stress, it went up to 89 deg C, but no higher.

Unless a reply here tells me otherwise, I'm going to re-enable coretemp, repeat this CPU stress test, and see whether the lm-sensors service alone is the culprit.

Narnie