Bug 2940 - rescheduling jobs

Status CLOSED FIXED
Reported 2015-07-31 14:57:00 +0200
Modified 2016-06-14 16:14:55 +0200
Product: FieldTrip
Component: qsub
Version: unspecified
Hardware: PC
Operating System: Windows
Importance: P5 normal
Assigned to: Robert Oostenveld

Marcel Zwiers - 2015-07-31 14:57:12 +0200

When the MATLAB session on an execution host accepts and reads in a job, it deletes the input.mat file immediately, i.e. before the job has completed successfully. However, if the MATLAB session crashes, torque/maui/moab will reschedule and rerun the job on a different host. The MATLAB session on that host then fails because it cannot find the (deleted) input.mat file. Proposed solution: make 'rerunable' an option in qsubcellfun, and if rerunable==true, only delete the input.mat file at the very end of the job.
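
To illustrate the proposal, here is a minimal sketch of the deletion logic. It is written in Python for illustration only and is not FieldTrip code. The function name run_job, the JSON input file, and passing the job function as an argument are all assumptions made for this example; only the 'rerunable' flag comes from the proposal above. A raised exception stands in for a crashed session.

```python
import json
import os


def run_job(input_path, func, rerunable=False):
    """Read the job arguments from input_path, run func on them, and clean up.

    Hypothetical helper: with rerunable=False the input file is deleted
    right after it has been read (the behaviour described in this report);
    with rerunable=True it is deleted only after func returned successfully.
    """
    with open(input_path) as f:
        args = json.load(f)

    if not rerunable:
        # Current behaviour: the input is gone before the job has run,
        # so a rescheduled job cannot find it anymore.
        os.remove(input_path)

    # If this call fails, nothing below is executed.
    result = func(*args)

    if rerunable:
        # Proposed behaviour: delete the input only at the very end.
        os.remove(input_path)

    return result
```

With rerunable=True, a job that fails halfway leaves its input file in place, so a rescheduled run on another host can still read it. The trade-off is that the input files of crashed jobs stay on disk until a later run completes or they are cleaned up by hand.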


Marcel Zwiers - 2015-07-31 15:14:58 +0200

Just to be clear, I come across this problem all the time because (massive multi-core) nodes keep crashing, and after the node reboots, torque reschedules the job to another node (and MATLAB then gives the missing input.mat file error).


Robert Oostenveld - 2015-08-19 15:52:43 +0200

done!

mac011> svn commit
Sending qsubcellfun.m
Sending qsubexec.m
Sending qsubfeval.m
Transmitting file data ...
Committed revision 10607.


Robert Oostenveld - 2016-06-14 16:14:55 +0200

Hereby I am closing multiple bugs that have been resolved for some time now. If you don't agree with the resolution, please reopen.