Adding retries to sacct/scontrol calls and other improvements (Fixes BUGZ-3966)
The main changes handle the situation where sacct fails, e.g. because slurmdbd is overloaded. In that case the jobinfostring is not filled in successfully, which causes wrong information in the diag file.
The changes attempt to handle this situation in two ways:
- 3 retries are attempted in the handle_diag_file function
- If all 3 retries fail, the handle_exitcode function is not called. This prevents the lrms_done file from being written, so the job is picked up again the next time scan-SLURM-job runs.
- A retry loop is also included inside handle_exitcode and handle_exitcode_cancelled, and the lrms_done file is only updated if the sacct/scontrol output is retrieved correctly, as in handle_diag_file.
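
The retry pattern described above can be sketched roughly as follows. This is an illustration only, not the code from this change: the helper names `retry_cmd` and `write_done_if_ok`, the `RETRY_DELAY` variable and the file handling are assumptions made for the example, and the real functions in scan-SLURM-job may be structured differently.

```shell
# Hypothetical helper (not part of scan-SLURM-job): run a command up to
# max_tries times and print its output once it succeeds.
retry_cmd() {
    local max_tries="$1" delay="$2" attempt=1 output
    shift 2
    while [ "$attempt" -le "$max_tries" ]; do
        # Accept the result only if the command exits 0 and prints something.
        if output=$("$@" 2>/dev/null) && [ -n "$output" ]; then
            printf '%s\n' "$output"
            return 0
        fi
        # Wait before the next attempt, but not after the last one.
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Hypothetical wrapper: write the "done" file only if the query succeeded.
# On failure nothing is written, so a later scan can pick the job up again.
write_done_if_ok() {
    local done_file="$1" jobinfo
    shift
    if jobinfo=$(retry_cmd 3 "${RETRY_DELAY:-5}" "$@"); then
        printf '%s\n' "$jobinfo" > "$done_file"
    else
        return 1
    fi
}

# Example call (assumed sacct options, shown for illustration only):
#   write_done_if_ok "$done_file" sacct -j "$jobid" --parsable2 --noheader --format=JobID,State,ExitCode
```

Treating empty output as a failure is a choice made for this sketch, on the assumption that an empty result can be handled the same way as a failed call.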
Edited by Maiken