Intermittent connectivity issues with ROCKS on a compute cluster


 
Posted 09-13-2010

I have a cluster set up with a head node and compute nodes running TORQUE and MOAB. The distro is ROCKS 5.3. For the past couple of weeks I've been having connectivity problems. Every couple of hours, network connectivity just stops working: sometimes it comes back in 10-15 minutes, sometimes I have to reboot the machine. During an outage, the Samba share I have mounted as a network drive on my Windows PC stops responding (often causing Windows Explorer to crash), and I can't open a new PuTTY session. If I already have a PuTTY window open, basic commands like "ls" and "cd" work, but qstat and pbsnodes don't. If I'm already logged into the head node through PuTTY, I can ssh from there into one of the compute nodes. Eventually the PuTTY window crashes, though. Throughout all of this, I can still ping the server just fine.
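Since ping keeps working while ssh and the Samba share do not, one way to pin down exactly when an outage starts and ends would be to leave a TCP probe running on another machine. The following is only a minimal sketch under assumptions: the function names are made up for illustration, the host name `wantsh01` is taken from the log excerpts in this post, and ports 22 (ssh) and 139/445 (SMB) are the standard defaults, which may not match this setup.

```python
import socket
import time
from datetime import datetime


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe(host, ports, timeout=3.0):
    """Check each port once and return a {port: reachable} mapping."""
    return {port: port_open(host, port, timeout) for port in ports}


def format_result(host, result, now=None):
    """Build one timestamped log line summarising a probe result."""
    stamp = (now or datetime.now()).strftime("%b %d %H:%M:%S")
    status = " ".join(
        f"{port}={'up' if ok else 'DOWN'}" for port, ok in sorted(result.items())
    )
    return f"{stamp} {host} {status}"


def monitor(host, ports, interval=60):
    """Probe forever, printing one line per round (hypothetical helper)."""
    while True:
        print(format_result(host, probe(host, ports)), flush=True)
        time.sleep(interval)
```

Calling `monitor("wantsh01", [22, 139, 445])` and redirecting the output to a file would give timestamps that can be lined up against /var/log/messages on the head node.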

The SAMBA logs were reporting all sorts of problems:

[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)
INTERNAL ERROR: Signal 7 in pid 9816 (3.0.33-3.15.el5_4)
[2010/09/10 03:51:29, 0] smbd/close.c:close_directory(430)
close_directory: Could not get share mode lock for Pao
Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(44)

From:
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(41)
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(45)
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)
[2010/09/10 03:51:29, 0] lib/util.c:smb_panic(1655)
INTERNAL ERROR: Signal 7 in pid 8475 (3.0.33-3.15.el5_4)
PANIC (pid 9816): internal error
Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/09/10 03:51:30, 0] lib/util.c:log_stack_trace(1759)
[2010/09/10 03:51:30, 0] lib/fault.c:fault_report(44)

I turned off Samba and still have the same problems. /var/log/messages contained this:

Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1

bnx2i is some sort of driver for the Broadcom network card. I updated the Broadcom multi-function drivers and the firmware, but I still have problems. One thing I couldn't get working was the bnx2i iSCSI offload driver; I ran into version issues with the RPMs. I've run MEMTEST and a couple of hardware diagnostic checks and can't find any problems. Here's /var/log/messages from when I rebooted the machine. Note that I hosed the X server somehow, and I'm not really worried about fixing that.

Sep 13 04:49:25 wantsh01 gdm[3930]: Failed to start X server several times in a short time period; disabling display :0
Sep 13 04:49:29 wantsh01 mountd[3527]: Caught signal 15, un-registering and exiting.
Sep 13 04:52:12 wantsh01 kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
Sep 13 04:52:12 wantsh01 kernel: PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
Sep 13 04:52:12 wantsh01 kernel: PCI: Not using MMCONFIG.
Sep 13 04:52:13 wantsh01 kernel: intel_rng: FWH not detected
Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 13 04:52:13 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi
Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 13 04:52:13 wantsh01 kernel: bnx2i: dev eth1 does not support iscsi
Sep 13 04:52:13 wantsh01 named[3028]: the working directory is not writable
Sep 13 04:52:19 wantsh01 sshd[3428]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Sep 13 04:52:19 wantsh01 xinetd[3445]: /etc/xinetd.d/RCS is not a regular file. It is being skipped.
Sep 13 04:52:24 wantsh01 smartd[3926]: Problem creating device name scan list
Sep 13 04:52:24 wantsh01 smartd[3926]: Problem creating device name scan list
Sep 13 04:52:24 wantsh01 smartd[3926]: In the system's table of devices NO devices found to scan
Sep 13 04:52:31 wantsh01 gdm[4042]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:52:40 wantsh01 gdm[4188]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:52:49 wantsh01 gdm[4210]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:53:19 wantsh01 gdm[3940]: Failed to start X server several times in a short time period; disabling display :0
Sep 13 04:53:32 wantsh01 dhcpd: receive_packet failed on eth0: Network is down
Sep 13 04:53:33 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi
Sep 13 04:53:33 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 13 04:53:37 wantsh01 kernel: bnx2i: dev eth1 does not support iscsi
Sep 13 04:53:37 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 13 04:54:50 wantsh01 snmpd[3379]: c64 32 bit check failed
Sep 13 04:55:20 wantsh01 snmpd[3379]: looks like a 64bit wrap, but prev!=new
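Most of the lines above are noise from gdm, smartd and the repeated bnx2i notices; the only one that directly reports an interface problem is the dhcpd "Network is down" message. A small filter like the one below could pull out such lines from a longer log. It is a rough sketch, not a diagnosis: the function name and the keyword pattern are assumptions chosen for illustration, and the pattern would likely need adjusting to whatever the NIC driver actually logs on this machine.

```python
import re

# Assumed keywords for interface state changes; adjust to the real driver output.
NETWORK_PATTERN = re.compile(r"network is down|link is (?:up|down)", re.IGNORECASE)


def network_events(lines, pattern=NETWORK_PATTERN):
    """Return the log lines that match the pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

For example, `network_events(open("/var/log/messages"))` would list the matching lines with their syslog timestamps, which can then be compared against the times the outages were noticed.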

Thanks for any help, I'd really appreciate some advice.