Description
The convenience scripts do not parse the srun rank count robustly.
srun -n 4 works, but -n4 (no space) does not, e.g. osspcsamp "srun -n4 ./nbody".
srun with no rank specifier does not work either: osspcsamp "srun ./nbody"
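To make the cases above concrete, here is a minimal Python sketch of a more tolerant rank-count parser. It is not the convenience scripts' actual code. The function name srun_ntasks and the VALUE_OPTS table are hypothetical, and the table lists only a few srun options that take a separate value, so it is incomplete. The sketch accepts "-n 4", "-n4" and "--ntasks=4", and stops scanning at the first word that looks like the executable so that application arguments are never read as launcher options.

```python
import shlex

# Assumption: a small, incomplete set of srun options that consume the
# following word as their value. A real script would need the full list.
VALUE_OPTS = {"-N", "--nodes", "-c", "--cpus-per-task", "-p", "--partition",
              "-t", "--time", "-w", "--nodelist", "-J", "--job-name"}


def _as_count(text):
    """Return text as a positive int, or None if it is not one."""
    return int(text) if text.isdigit() and int(text) > 0 else None


def srun_ntasks(command):
    """Return the task count given to srun, or None if none was given.

    Hypothetical helper: scans only the launcher's own options and stops
    at the first non-option word, which is taken to be the executable.
    """
    words = shlex.split(command)
    i = 1  # words[0] is the launcher itself (srun)
    while i < len(words):
        w = words[i]
        if w in ("-n", "--ntasks"):          # "-n 4" / "--ntasks 4"
            return _as_count(words[i + 1]) if i + 1 < len(words) else None
        if w.startswith("--ntasks="):        # "--ntasks=4"
            return _as_count(w.split("=", 1)[1])
        if w.startswith("-n") and not w.startswith("--"):
            return _as_count(w[2:])          # "-n4"
        if w in VALUE_OPTS:                  # skip the option and its value
            i += 2
            continue
        if w.startswith("-"):                # some other flag without value
            i += 1
            continue
        return None                          # executable reached, stop here
    return None
```

With this logic, srun_ntasks("srun -n4 ./nbody") and srun_ntasks("srun -n 4 ./nbody") give the same count, and a command with no rank specifier yields None so the caller can choose a fallback.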
From Martin S:
It seems to be very specific, though, since also “-n4” didn’t work (it needed the space). It would be good if we could make that a bit more general (space/no space, no -n argument, etc.).
However, when I run O|SS (the CBTF version), something breaks - for one, the scripts seem to grab the wrong “-n” from the command line and launch too many backends:
g23/schulz/prgs/smg2000/test> osspcsamp "srun ./smg2000 -n 50 50 50"
[openss]: pcsamp experiment using the default sampling rate: "100".
Creating topology file for slurm frontend node cab5 for SLURM_JOB_ID 2264568
Generated topology file: ./cbtfAutoTopology
Running pcsamp collector.
Program: srun ./smg2000 -n 50 50 50
Number of mrnet backends: 50
Topology file used: ./cbtfAutoTopology
executing mpi program: srun cbtfrun --mpi --mrnet -c pcsamp ./smg2000 -n 50 50 50
^Csrun: interrupt (one more within 1 sec to abort)
srun: tasks 0-3: running
174940133.251075: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:55994:3
174940133.251013: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:47034:1
174940133.251037: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:47034:1
174940133.251041: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:47563:2
^Csrun: sending Ctrl-C to job 2264568.3
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cab5]: *** STEP 2264568.3 KILLED AT 2016-12-30T15:28:54 WITH SIGNAL 9 ***
(The -n 50 is an argument to the code, not to srun, which takes its node count from a prior allocation.)
When I remove the -n and run plain, things still get stuck:
g23/schulz/prgs/smg2000/test> osspcsamp "srun smg2000"
[openss]: pcsamp experiment using the default sampling rate: "100".
Creating topology file for slurm frontend node cab5 for SLURM_JOB_ID 2264568
Generated topology file: ./cbtfAutoTopology
Running pcsamp collector.
Program: srun smg2000
Number of mrnet backends: 1
Topology file used: ./cbtfAutoTopology
executing mpi program: srun cbtfrun --mpi --mrnet -c pcsamp smg2000
CBTF_MRNet_LW_connect: Failed to parse connections file /g/g23/schulz/prgs/smg2000/test/attachBE_connections
CBTF_MRNet_LW_connect: Failed for myRank 10001, mrank 10001, con_rank 1
CBTF_MRNet_LW_connect: Failed to parse connections file /g/g23/schulz/prgs/smg2000/test/attachBE_connections
CBTF_MRNet_LW_connect: Failed for myRank 10002, mrank 10002, con_rank 2
CBTF_MRNet_LW_connect: Failed to parse connections file /g/g23/schulz/prgs/smg2000/test/attachBE_connections
CBTF_MRNet_LW_connect: Failed for myRank 10003, mrank 10003, con_rank 3
Running with these driver parameters:
(nx, ny, nz) = (10, 10, 10)
(Px, Py, Pz) = (4, 1, 1)
(bx, by, bz) = (1, 1, 1)
(cx, cy, cz) = (1.000000, 1.000000, 1.000000)
(n_pre, n_post) = (1, 1)
dim = 3
solver ID = 0
^Csrun: interrupt (one more within 1 sec to abort)
srun: tasks 0-3: running
^Csrun: sending Ctrl-C to job 2264568.6
174940208.775166: Message.c[305] Message_send - MRN_send failed
174940208.776290: PeerNode.c[176] PeerNode_sendDirectly - Message_send() failed
174940208.776293: Network.c[839] Network_send_PacketToParent - upstream.send() failed
174940208.776295: Network.c[842] Network_send_PacketToParent - assume parent failure, try one more time
174940208.776332: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:46397:1
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cab5]: *** STEP 2264568.6 KILLED AT 2016-12-30T15:30:08 WITH SIGNAL 9 ***
^C
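In the run above the script reported one mrnet backend while srun launched tasks 0-3, so a fallback is needed when the command line carries no rank count. The Python sketch below shows one possible fallback and is not the scripts' actual behavior. It assumes the enclosing allocation exported SLURM_NTASKS (or the older SLURM_NPROCS), which Slurm sets only when a task count was requested for the allocation. The function name resolve_ntasks is hypothetical.

```python
import os


def resolve_ntasks(parsed, env=None):
    """Pick the number of tasks to size the backend topology for.

    parsed is the count found on the srun command line, or None.
    env is a mapping of environment variables (defaults to os.environ).
    Hypothetical helper; falls back to 1, the value the script reported
    in the log when no count was found.
    """
    if env is None:
        env = os.environ
    if parsed is not None:
        return parsed
    # Assumption: the allocation exported one of these variables.
    for name in ("SLURM_NTASKS", "SLURM_NPROCS"):
        value = env.get(name, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return 1
```

If neither variable is set the result is still 1, so this only helps when the allocation itself was created with an explicit task count.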
The files mentioned above look OK:
g23/schulz/prgs/smg2000/test> cat attachBE_connections
cab5.llnl.gov 42880 0 0
g23/schulz/prgs/smg2000/test> u
Linux cab5 2.6.32-642.6.2.1chaos.ch5.5.x86_64 #1 SMP Mon Oct 24 10:49:01 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
15:30:19 up 25 days, 21:33, 0 users, load average: 0.60, 5.00, 9.29
g23/schulz/prgs/smg2000/test> cat cbtfAutoTopology
cab5:0 =>
cab5:1;