I work for one of my professors and we are trying to run SU2 in parallel on a university-owned cluster that uses Slurm as its workload manager. The problem we are running into is that when we ssh into the cluster and submit the run command to a node assigned by Slurm (using sbatch), the code hangs and won't run. The strange thing is that the same command works just fine on the login node. Does anyone know what the problem could be?
Here is some additional information:
- We talked with the IT administrator in charge of the cluster, but he doesn't have enough background to know what is going on.
- Some of our output files contained the escape sequence [!0134h; after we changed the terminal settings to suppress it, the code behaved the same as described above.
- We can run SU2_CFD "config file" (the code in serial) just fine on both the login node and the compute nodes.
- We have tried running an interactive session on a node (using srun); the behavior is unchanged.
Any thoughts would be appreciated! We really want to be able to run the code in-house instead of outsourcing it.
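For context, our submission script looks roughly like the following sketch. The partition name, resource counts, module name, and config file name here are placeholders, not our exact settings; the general shape is what matters:

```shell
#!/bin/bash
# Sketch of an sbatch script for a parallel SU2 run.
# Partition, counts, and module names are placeholders --
# adjust them for the actual cluster configuration.
#SBATCH --job-name=su2_test
#SBATCH --partition=compute        # placeholder partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00
#SBATCH --output=su2_%j.out

# Load the same MPI stack SU2 was built against; a mismatch
# between build-time and run-time MPI is a common cause of hangs.
module load openmpi

# With a Slurm-aware MPI, mpirun picks up the allocation
# (nodes and task counts) from the job environment.
mpirun SU2_CFD config.cfg
```

One thing worth checking with a script like this is whether the MPI library used at run time matches the one SU2 was compiled with, since that mismatch often shows up exactly as a hang on compute nodes while the login node still works.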
Moderator's Comments:
Please use CODE tags as required by forum rules!
Last edited by RudiC; 11-09-2016 at 04:07 AM..
Reason: Added CODE tags.
From the Debian slurm-llnl package man page:
SLURM(1)                        Slurm system                        SLURM(1)

NAME
slurm - SLURM system overview.
DESCRIPTION
The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. SLURM requires no kernel modifications for its operation and is relatively self-contained. As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
SLURM has a centralized manager, slurmctld, to monitor resources and work. There may also be a backup manager to assume those responsibilities in the event of failure. Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work. An optional slurmDBD (SLURM DataBase Daemon) can be used for accounting purposes and to maintain resource limit information.
Basic user tools include srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, and squeue to report the status of jobs. There is also an administrative tool, scontrol, available to monitor and/or modify configuration and state information. APIs are available for all functions.
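In practice, the basic user tools described above map to commands like the following (the script name and job ID are illustrative examples, and the commands require a working Slurm installation):

```shell
# Illustrative use of the basic SLURM user tools described above.
# The script name and job ID are examples, not real values.
sinfo                     # report partition and node status
sbatch job.sh             # queue a batch script; prints the job ID
squeue -u "$USER"         # show this user's pending/running jobs
scontrol show job 12345   # detailed state of one job (example ID)
scancel 12345             # cancel that job
```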
SLURM configuration is maintained in the slurm.conf file.
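As an illustration only, a slurm.conf fragment might look like the following; the hostnames, CPU counts, and partition name are invented for the example, and slurm.conf(5) is the authoritative reference for the format:

```shell
# Hypothetical slurm.conf fragment -- all names and counts are invented.
ClusterName=example
ControlMachine=head01            # node running slurmctld
NodeName=node[01-04] CPUs=16 State=UNKNOWN
PartitionName=compute Nodes=node[01-04] Default=YES MaxTime=24:00:00 State=UP
```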
Man pages are available for all SLURM commands, daemons, and APIs, plus the slurm.conf file. Extensive documentation is also available on the internet at <http://www.schedmd.com/slurmdocs/>.
COPYING
Copyright (C) 2005-2007 The Regents of the University of California. Copyright (C) 2008-2009 Lawrence Livermore National Security. Produced at Lawrence Livermore National Laboratory (cf. DISCLAIMER). CODE-OCEC-09-009. All rights reserved.
This file is part of SLURM, a resource management program. For details, see <http://www.schedmd.com/slurmdocs/>.
SLURM is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option) any later version.
SLURM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
SEE ALSO
sacct(1), sacctmgr(1), salloc(1), sattach(1), sbatch(1), sbcast(1), scancel(1), scontrol(1), sinfo(1), smap(1), squeue(1), sreport(1), srun(1), sshare(1), sstate(1), strigger(1), sview(1), bluegene.conf(5), slurm.conf(5), slurmdbd.conf(5), wiki.conf(5), slurmctld(8), slurmd(8), slurmdbd(8), slurmstepd(8), spank(8)

slurm 2.0                        March 2009                        SLURM(1)