Commit e95f2a1b authored by Maiken's avatar Maiken

First attempt to add text from comment 3 in bugzilla ticket 3921 to separate document as requested by Andrii. Surely will need some formatting fixes and other adjustments.
.. _benchmark_issues:

About accounting benchmarks
===========================
After the changes in the accounting system in ARC 6.4.0 there have been some issues related to missing benchmark values in accounting records, some of them caused by a bug that unfortunately crept into 6.4.1. However, sites can have issues with benchmarks for other reasons too. This page is an attempt to clarify how benchmarks work, in which situations problems can occur, and how to fix them.

Missing benchmarks fix
----------------------
If you discover that some records use the default benchmark of 1.0 instead of the benchmark value actually configured in arc.conf, and/or you see "HEPSPEC 1.0 is being used" in jura.log, your arc.conf benchmark values are not properly applied to the records.
You will need to fix these benchmark values manually, by issuing an SQLite query that adds the correct benchmark value to the affected records. The example below uses the default controldir ``/var/spool/arc/jobstatus``; change this to the value defined in your arc.conf, and substitute your correct benchmark string for the ``HEPSPEC:1.0`` placeholder.

.. code-block:: console

   sqlite3 /var/spool/arc/jobstatus/accounting/accounting.db "insert into JobExtraInfo (RecordID, InfoKey, InfoValue) select distinct RecordID, 'benchmark', 'HEPSPEC:1.0' from JobExtraInfo where RecordID not in (select RecordID from JobExtraInfo where InfoKey='benchmark');"
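
If you want to see the effect of this query before touching the production database, it can be tried on a throwaway database first. The sketch below recreates only the simplified ``JobExtraInfo`` layout the query touches; the record IDs and values are invented for illustration:

.. code-block:: shell

   # throwaway database with a simplified JobExtraInfo table (toy data)
   db=$(mktemp)
   sqlite3 "$db" "CREATE TABLE JobExtraInfo (RecordID INTEGER, InfoKey TEXT, InfoValue TEXT);"
   # record 1 already has a benchmark row, record 2 does not
   sqlite3 "$db" "INSERT INTO JobExtraInfo VALUES (1,'benchmark','HEPSPEC:12.5'), (2,'nodename','wn42');"
   # the same insert as above: adds a benchmark row only where none exists
   sqlite3 "$db" "insert into JobExtraInfo (RecordID, InfoKey, InfoValue) select distinct RecordID, 'benchmark', 'HEPSPEC:1.0' from JobExtraInfo where RecordID not in (select RecordID from JobExtraInfo where InfoKey='benchmark');"
   benchmarks=$(sqlite3 "$db" "select RecordID || '=' || InfoValue from JobExtraInfo where InfoKey='benchmark' order by RecordID;")
   echo "$benchmarks"
   rm -f "$db"

Only record 2 receives the new benchmark row; the existing value of record 1 is left untouched.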

Understanding benchmark handling
--------------------------------
To understand how the "HEPSPEC 1.0 is being used" warning appears in jura.log, there are 3 points to keep in mind:

1. JURA is only the publisher: it sends the data about the jobs stored in the local ARC accounting database. NO values from arc.conf are used, apart from where and how to publish records.

2. Info about benchmarks is part of the job accounting data stored in the local ARC accounting database when the job is in the finishing state. Any update to arc.conf AFTER the data about a job has been recorded HAS NO EFFECT on already stored records.

3. In case of publishing to APEL, the default method is "APEL summaries". This means that jura sends (updates) the total counters for the last 2 months of data, aggregated per VO, DN, Endpoint (including queue) and Benchmark (yes, this is part of the APEL format). CONSEQUENTLY, if any single job within the 2-month timeframe is missing benchmark data, the warning about using HEPSPEC 1.0 will be there!
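
For example, to check which distinct benchmark values are currently stored locally (each distinct value becomes a separate aggregation key in the APEL summaries), a query along the following lines can be used; it assumes the same default controldir as the examples above:

.. code-block:: console

   sqlite3 /var/spool/arc/jobstatus/accounting/accounting.db "select InfoValue, count(*) from JobExtraInfo where InfoKey='benchmark' group by InfoValue;"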
PLEASE NOTE for ARC releases before 6.8.0! Since the APEL summary schema includes grouping by benchmark, which was out of scope of the initial ARC accounting database design, the extra table join is harmful to performance on heavily loaded sites!
The recommended mitigation to save ARC CE CPU cycles is to go back to publishing individual usage records (as was done before 6.4.0) with the ``apel_messages = urs`` option in arc.conf.
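
A minimal arc.conf sketch of this setting (the target block name ``apel_target`` is a placeholder for your actual APEL target block):

.. code-block:: ini

   [arex/jura/apel:apel_target]
   # publish individual usage records instead of APEL summaries
   apel_messages = urs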
We are considering a schema change that moves the benchmark from the extra to the mandatory attributes, but as this is a backward-incompatible change it is unlikely to happen before ARC 7.

Why is a job missing benchmark?
-------------------------------
There are at least 4 reasons why a job can be missing benchmark data:

1. The job was started when ARC was at version < 6.5
2. The job was started when the [queue] block in arc.conf had no proper benchmark defined (HEPSPEC or Si2k types are allowed)
3. Permission or other filesystem issues prevent writing of the .diag files on the worker nodes
4. The job failed in the LRMS before the initial part of the job script wrapper even executed (node failure, etc.)
Apart from these cases, the data should be there. Here is a code snippet from the job script wrapper:

.. code-block:: shell

   echo "echo \"Benchmark=$joboption_benchmark\" >> \"\$RUNTIME_JOB_DIAG\"" >> $LRMS_JOB_SCRIPT
   # add queue benchmark to frontend diag (for jobs that failed to reach/start in LRMS)
   echo "Benchmark=$joboption_benchmark" >> "${joboption_controldir}/job.${joboption_gridid}.diag"
The value of the benchmark defined in the [queue] block in arc.conf is written to the diag file as-is, both on the frontend (where, in case of a failed submission to the LRMS, it will stay) and on the worker node (the worker-node diag replaces the frontend one after completion).
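
As an illustration, a hypothetical [queue] block with an invented queue name and HEPSPEC value could look like the sketch below (check the arc.conf reference of your ARC version for the exact option syntax):

.. code-block:: ini

   [queue:grid]
   # benchmark type and value written to the diag files; HEPSPEC or Si2k types are allowed
   benchmark = HEPSPEC 12.5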

Should I do something about it?
-------------------------------
If this is a rare single job that failed in the LRMS before writing the accounting data, this is quite normal and you should not worry about the "HEPSPEC 1.0 is being used" warning.
To identify how many jobs are missing benchmark data in the database, run the following query:

.. code-block:: console

   sqlite3 /var/spool/arc/jobstatus/accounting/accounting.db "select JobID from AAR where RecordID not in ( select RecordID from JobExtraInfo where InfoKey='benchmark');"
This returns the list of job IDs that are missing benchmark data. Then you can use

.. code-block:: console

   arcctl accounting job info <JobID>
to find out what those jobs are.
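
The same detection query can also be tried out on a throwaway database first. The sketch below uses a simplified two-table layout with invented jobs, keeping only the columns the query touches:

.. code-block:: shell

   # throwaway database with simplified AAR and JobExtraInfo tables (toy data)
   db=$(mktemp)
   sqlite3 "$db" "CREATE TABLE AAR (RecordID INTEGER, JobID TEXT);"
   sqlite3 "$db" "CREATE TABLE JobExtraInfo (RecordID INTEGER, InfoKey TEXT, InfoValue TEXT);"
   sqlite3 "$db" "INSERT INTO AAR VALUES (1,'job-with-benchmark'), (2,'job-without-benchmark');"
   sqlite3 "$db" "INSERT INTO JobExtraInfo VALUES (1,'benchmark','HEPSPEC:12.5');"
   # same query as above: list jobs that have no stored benchmark
   missing=$(sqlite3 "$db" "select JobID from AAR where RecordID not in (select RecordID from JobExtraInfo where InfoKey='benchmark');")
   echo "$missing"
   rm -f "$db"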
If there are many, something is definitely going wrong and you should:
1. check the arc.conf syntax with respect to benchmarks
2. check that the .diag file in the sessiondir is written correctly