PPE/POE problems on AIX 5.2 on p690 systems
We have PPE/POE problems on a 32 PE p690 system. After upgrading a p690 to the latest AIX 5.2 (ML 05) POE/PPE environment, we noticed that MPI jobs no longer start. I've traced the problem to the communication between the poe client routines and the pmdv4 (/etc/pmdv4) partition manager on our system. Specifically:

    % lslpp -l '*poe*'
      Fileset                      Level  State      Description
      ----------------------------------------------------------------------------
    Path: /usr/lib/objrepos
      ppe.poe                    4.1.1.6  APPLIED    poe Parallel Operating Environment
    Path: /etc/objrepos
      ppe.poe                    4.1.1.6  APPLIED    poe Parallel Operating Environment

with AIX patched to the latest levels as of 03/29/2005.

Trying poe on a simple 'hello world' MPI program:

    % poe -procs 4 ./mpi -ilevel 6
    INFO: DEBUG_LEVEL changed from 0 to 4
    D1<L4>: Open of file /home/miket/SC/LSF/host.list successful
    D1<L4>: mp_euilib = ip
    D1<L4>: task 0 agave.tamu.edu 165.91.16.6 10
    D1<L4>: task 1 agave.tamu.edu 165.91.16.6 10
    D1<L4>: task 2 agave.tamu.edu 165.91.16.6 10
    D1<L4>: task 3 agave.tamu.edu 165.91.16.6 10
    D1<L4>: node allocation strategy = 0
    D1<L4>: Entering pm_contact, jobid is 0
    D1<L4>: Jobid = 1113711767
    D1<L4>: POE security method is COMPAT
    D1<L4>: Requesting service pmv4
    D1<L4>: 1 master nodes
    D1<L4>: Socket file descriptor for master 0 (agave.tamu.edu) is 4
    ERROR: 0031-024 agave.tamu.edu: no response; rc = -1
    D1<L4>: Non-zero status -1 returned from pm_mgr_init
    D2<L4>: In pm_exit... About to call pm_remote_shutdown
    D2<L4>: Elapsed time for pm_remote_shutdown: 0 seconds
    D2<L4>: In pm_exit... Calling exit with status = -1 at Wed Mar 30 13:53:42 2005

poe contacts pmdv4 via inetd, and each time pmdv4 leaves the following log entries in /tmp/mplog.PID:

    % cat /tmp/mplog.1306788
    AIX Parallel Environment pmd4 version @(#) 2003/06/11 13:19:38
    The ID of this process is 1306788
    The version of this pmd for version checking is 4100
    The hostname of this node is agave
    The short hostname of this node is agave
    Wed Mar 30 13:53:37 2005 ERROR: 0031-203 malformed from address: <Error 0>
    pmd_exit reached!, exit code is 1
    No collective communication shared memory segments to clean up.

After tracing (with truss) both poe (or the MPI executables) and /etc/pmdv4, I've noticed that pmdv4 exchanges some data with the poe client and then closes the socket with the `malformed from address: <Error 0>' error. A socket read on the poe side returns 0, and the poe client simply quits.

I do have a ~/.rhosts file in my home directory. However, the only services handled by inetd are SSH and pmv4; rsh/exec are not allowed. Even allowing rsh did not change the behavior of pmdv4.

Should I be doing something different for POE/PPE to work now? Note that we are not using LoadLeveler or any other resource manager.

Any hint would be greatly appreciated!

Thanks,
Michael
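A note on the symptom for readers following along: on a stream socket, a read that returns 0 bytes means the peer closed the connection in an orderly way. That is consistent with pmdv4 logging 0031-203 and exiting while poe is still waiting for a reply. The Python sketch below only illustrates that generic socket behaviour. It assumes nothing about POE internals, and the "daemon" and "client" names are hypothetical stand-ins built on a local socketpair, not the real pmdv4 or poe.

```python
import socket


def read_until_closed(sock):
    """Collect bytes until recv() returns b"", i.e. the peer closed its end."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # zero-length read: orderly shutdown by the peer
            break
        chunks.append(data)
    return b"".join(chunks)


def demo():
    # socketpair() returns two already-connected local sockets.
    # "daemon" and "client" are hypothetical stand-ins for pmdv4 and poe.
    daemon, client = socket.socketpair()
    try:
        daemon.sendall(b"partial handshake")  # the daemon talks for a while...
        daemon.close()                        # ...then closes the connection
        received = read_until_closed(client)
        # Any further read on the client side keeps returning b"" (EOF).
        saw_eof = client.recv(4096) == b""
        return received, saw_eof
    finally:
        client.close()


if __name__ == "__main__":
    print(demo())
```

In other words, the zero-length read seen in truss is a consequence of pmdv4 giving up, so the place to look is why pmdv4 rejects the connection rather than the poe client itself.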