Commit dc41daab authored by Andrii Salnikov

DTR documents from the wiki reworked to rST (formatting but not content)

parent 8df1eb96
.. _dtr_priority:

DTR priority and shares system
==============================
Here we describe the priority and shares system for the new data staging
framework. The design takes several ideas from other research in the
field and from the first implementation of the transfer shares model in
ARC.

Ideas behind priorities and fair-share in data staging
------------------------------------------------------

The initial idea was to give every DTR that comes into the system a
fixed priority, sort the queue according to priority, and launch the
first N DTRs with the highest priorities. This scheme also allows easy
incorporation of pre-emption: if a job with higher priority appears, its
DTRs push others out of the front of the queue, and during the next
scheduler loop the new DTRs can be started and the pushed-out ones
suspended.
However, this idea can potentially lead to the situation that prompted
the implementation of transfer shares in ARC: if the user or VO with the
highest priority submits a bunch of jobs at once, all the others are
blocked, because DTRs from this bunch occupy the front of the queue for
a long time.
The idea of transfer shares comes in handy here. The available transfer
slots should be shared among different VOs/users so that nobody is
blocked, with VOs/users of higher priority getting more transfer slots
than the others. However, strict limits on the number of slots per share
are not flexible enough: if the transfer pattern changes, strict limits
can cause problems, squeezing lots of users/jobs into one share with a
few slots while blocking others. The user or VO must also be able to
decide the relative priority of jobs within its own share.

Current Implementation
----------------------

The ideas above led to the creation of two configurable properties:
user-defined job priority and server-defined share priority. Users may
define a priority for their jobs in the job description ("priority"
attribute in xrsl), and this is a measure of the priority given to this
job within the share it gets assigned to when submitted. On the
server-side, it is possible to define a share type, and priorities of
certain shares. The share priority is used to determine the number of
transfer slots to assign to the share, taking into account which shares
are currently active (active meaning the share has at least one DTR in
the system).
When the Scheduler receives a new DTR, it is placed into a transfer
share defined by a user DN, VO, group inside a VO or role inside a VO,
as in previous versions of ARC. Currently only one sharing criterion can
be used in the configuration, i.e. it is not possible to share
simultaneously by both user and VO.
Priority is defined as a number between 1 and 100 inclusive - a higher
number means a higher priority. In the A-REX configuration it is
possible to specify a base priority for certain shares. If a DTR doesn't
belong to any of these specified shares, it is placed in a "_default"
share with the default base priority (50). The scheduler sets the
priority of the DTR to the base priority of the share, multiplied by the
user priority from the job description (default 50) and divided by 100;
the default priority of a DTR is therefore 25 (50 x 50 / 100). In this
system the priority set in the job description effectively defines a
percentage of the base priority. Thus service administrators can set
maximum priority limits for certain shares, while users or VOs have full
control of their jobs' priority within the share.
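The arithmetic above can be sketched as follows (the function name is
illustrative, not taken from the ARC source):

```python
def dtr_priority(share_base_priority=50, job_priority=50):
    """A DTR's priority: the job priority from the job description
    acts as a percentage of the base priority of the share."""
    if not 1 <= job_priority <= 100:
        raise ValueError("job priority must be between 1 and 100")
    return share_base_priority * job_priority // 100

# Defaults (base 50, job priority 50) give the default DTR priority of 25.
print(dtr_priority())        # 25
print(dtr_priority(80, 80))  # 64, e.g. a share with base priority 80
```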
When revising the Delivery and Processor queues, the scheduler separates
DTRs according to the shares they belong to. Inside every share, DTRs
are sorted according to their priorities. The scheduler then determines
the number of transfer slots that every active share can grab. This
number is determined dynamically depending on the priorities of the
active shares: each share receives the number of slots corresponding to
the weight of its priority in the summed priority of all active shares.
After the number of slots for each share is determined, the scheduler
launches the N[i] highest priority DTRs in each share, where N[i] is the
number of transfer slots for the i-th share.
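A minimal sketch of this slot assignment, using proportional weights
(the largest-remainder rounding here is an illustrative choice; the real
scheduler may round differently):

```python
def slots_per_share(base_priorities, total_slots):
    """Split transfer slots among active shares in proportion to their
    base priorities (illustrative largest-remainder rounding)."""
    total = sum(base_priorities.values())
    raw = {s: total_slots * p / total for s, p in base_priorities.items()}
    slots = {s: int(r) for s, r in raw.items()}
    leftover = total_slots - sum(slots.values())
    # Hand remaining slots to the shares with the largest fractional parts.
    for s in sorted(raw, key=lambda name: raw[name] - slots[name], reverse=True):
        if leftover == 0:
            break
        slots[s] += 1
        leftover -= 1
    return slots

# The example from the text: base priorities 60 and 40, 5 Delivery slots.
print(slots_per_share({"share-a": 60, "share-b": 40}, 5))
# {'share-a': 3, 'share-b': 2}
```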
The reason for weighting the DTR priority by the share priority is for
occasions when the Scheduler considers the entire queue of DTRs, for
example when allowing highest priority DTRs to pass certain limits.
**Example:** there are two active shares, one with base priority 60, the
other 40. The total priority is 100 (60 + 40). The first share has a
weight of 60%, the second 40%, so the first grabs 60% of the configured
transfer slots and the second 40%. If the system is configured with 5
Delivery slots, the first share takes 3 slots and the second 2. The 3
highest priority DTRs from the first share and the 2 highest priority
DTRs from the second share will be assigned to those slots.

Emergency Shares
----------------

To avoid the situation where a fixed limit of slots is used up by slow
transfers and a new high priority transfer has to wait for a slot, there
are "emergency" transfer slots. If there are transfers in the queue from
a particular share but all slots are filled with transfers from other
shares, one emergency slot can be assigned to that share to allow
transfers to start immediately. The share may use the emergency slot
until any other transfer finishes, at which point the emergency slot
becomes a regular slot and no new transfer is started from the queue.
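The condition for granting an emergency slot can be sketched as a simple
predicate (all names and the exact bookkeeping are illustrative; the
real scheduler tracks more state):

```python
def may_use_emergency_slot(share, queued, running, total_slots,
                           emergency_used, emergency_limit):
    """True if 'share' has queued DTRs, holds no regular slot, all
    regular slots are busy with other shares' transfers, and an
    emergency slot is still free (illustrative logic only)."""
    all_regular_busy = sum(running.values()) >= total_slots
    share_starved = queued.get(share, 0) > 0 and running.get(share, 0) == 0
    return all_regular_busy and share_starved and emergency_used < emergency_limit

# All 5 regular slots busy with downloads, uploads queued -> emergency slot.
print(may_use_emergency_slot("uploads", {"uploads": 3},
                             {"downloads": 5}, 5, 0, 2))  # True
```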
The Generator can assign DTRs to "sub-shares" to give a higher
granularity than the standard criteria when assigning transfer slots.
Sub-shares are treated as separate shares. In A-REX, different
sub-shares are assigned to downloads and uploads, and in this case
emergency transfer slots prevent jobs from being unable to finish
because all transfer slots are taken by downloaders: if this happens,
emergency slots can be used for uploads.

Potential Problems
------------------

- Within a share, high priority jobs block low priority jobs. Thus if
  there is a constant stream of high priority jobs in a share, then
  some low priority jobs in the same share may never run. Possible
  solutions:

  - Increasing the priority as the time spent in the queue increases
    (returning to the previous priority after leaving the queue). This
    is currently implemented as increasing the priority by 1 every 5
    minutes after the DTR's timeout has passed.
  - Changing simple highest-priority-first to a random algorithm where
    higher priorities are weighted higher.
  - Making a higher granularity of shares by splitting each priority
    or priority range into its own share - this is probably too
    fine-grained.
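The aging rule mentioned above (raise the priority by 1 for every 5
minutes past the DTR's timeout) can be sketched as follows; the cap at
100 is an assumption, matching the documented priority range:

```python
def aged_priority(base_priority, seconds_past_timeout):
    """Raise a queued DTR's priority by 1 for every full 5 minutes
    (300 s) it has waited beyond its timeout (cap at 100 assumed)."""
    if seconds_past_timeout <= 0:
        return base_priority
    return min(100, base_priority + seconds_past_timeout // 300)

print(aged_priority(25, 0))     # 25
print(aged_priority(25, 1500))  # 30, i.e. 25 minutes past timeout
```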

A-REX Configuration
-------------------

The configuration varies depending on the ARC version. In the examples
below VO roles are used to assign shares: the atlas slow-prod role is
assigned a low priority share and the atlas validation role a higher
priority share.

ARC 1.x
~~~~~~~

Shares are defined using the same options in the [grid-manager] section
of arc.conf as in the old framework, but rather than a fixed number of
slots per share, a priority is specified.

.. code-block:: ini

   maxloadshare="1 voms:role"
   share_limit="atlas:slow-prod 20"
   share_limit="atlas:validation 80"


ARC 2.x
~~~~~~~

The new options in the [data-staging] section should be used:

.. code-block:: ini

   definedshare="atlas:slow-prod 20"
   definedshare="atlas:validation 80"

If both shares are active and there are 10 slots available, then DTRs in
the slow-prod share will get 2 slots and those in the validation share
get 8 slots, and so the jobs in the validation share will have a higher
throughput (assuming similar numbers of files and file sizes in each
type of job).
A user wants their job to have high (but not top) priority and specifies
("priority" = "80") in the job description. The user has a VOMS proxy
with no role defined and submits the job to a site with the above
configuration. The job is assigned to the default share and its DTRs
have priority 40 (50 x 80 / 100). The user then creates a VOMS proxy
with the ATLAS validation role and submits another job with the same
priority to the same site. This time the job goes to the configured
atlas:validation share and the DTRs have priority 64 (80 x 80 / 100).
Note that the priority of a DTR only affects its position within a share
and does not influence the distribution of slots between shares.
.. _dtr_protocols:

DTR Supported Protocols Overview
================================
What follows is a list of protocols which may or may not be supported by
our implementation. Protocols are ordered from pure data transfer to
pure data indexing.

HTTP(S)
-------

- Request/response exchange is synchronous. Communication is initiated
  and controlled by the client.
- Can redirect.
- Can report temporary unavailability, but has no support for
  availability estimation or request.
- Can transfer partially (chunks) if the server allows.
- Transfer can't be put on hold.
- Transfer can be cancelled at any time.
- Transfer can start from any position if the server allows.
- No integrated bandwidth control.
- No way to ensure content returned by different requests is the same
  (dynamic content).
- Communication may include various metadata, both standard and
  non-standard - creation time, file size and content type are examples
  of standard and common metadata.
- Can report metadata without doing actual transfer.
- Can compress data being transferred.
- Can transfer in both directions (upload and download).
- Third party transfers are not supported.
- It is commonly wrapped in a TLS/SSL layer (HTTPS).

FTP/GridFTP
-----------

- Request/response exchange is synchronous, but the server may send
  multiple responses per single request.
- Control and data channels are separated.
- Communication is initiated and controlled by the client.
- Can't redirect.
- Can transfer partially (chunks) - if GridFTP.
- Transfer can't be put on hold.
- Transfer can be cancelled at any time.
- Transfer can start from any position - if GridFTP.
- Integrated bandwidth control - if GridFTP (needs checking).
- No way to ensure content returned by different requests is the same
  (dynamic content is possible but not common).
- Reports metadata without doing actual transfer.
- Allows searching/listing of available data entities.
- Can transfer in both directions (upload and download).
- Third party transfers are supported.
- GridFTP adds wrapping of the control and data channels into secure
  communication channels.

RFIO
----

- Access protocol for data stored on the CASTOR system developed by
  CERN.
- To be decided whether rfio is supported.

AFS
---

- Networked file system based on Kerberos authentication.
- Location transparent.
- All files stored in AFS are available through an AFS client
  connecting to an AFS server.
- To be decided whether afs is supported.

dcap
----

- The native random access I/O protocol for files within dCache.
- A version with GSI authentication is also available (gsidcap).
- To be decided whether dcap is supported.

Chelonia
--------

- Allows a file system-like view of a logical filename space, with
  operations like mkdir, stat, mv.
- Each file can have multiple replicas in different locations.
- Each file has a unique GUID.
- Files have Logical Names (LNs, paths in the namespace).
- Multiple LNs can point to the same GUID (hardlinks).
- Each file has metadata (including size, checksum and location of
  replicas).
- Communication is synchronous, and initiated by the client.
- Negotiation is separate from file transfer and is via SOAP/HTTPS.
- Can report metadata without doing actual transfer.
- File transfer has a pluggable architecture; currently HTTP(S) is
  supported.
- After negotiation a one-time transfer URL (TURL) is generated, which
  currently has no lifetime:

  - if via the negotiation interface the client gets the TURL, then it
    can store that TURL and use it later;
  - however, the TURL can break (that replica could disappear, the
    storage node could go offline), in which case a new negotiation is
    needed for a new TURL;
  - (in the current Hopi implementation) the TURL can be used only
    once: no hold or resume - but chunks are supported (if requests for
    chunks are close in time).

- It is possible that a file has no valid replicas at a given point in
  time, but the same file may have valid replicas later.
- The negotiation is between a Bartender and a client. There should be
  multiple Bartenders, and if a Bartender cannot be accessed, another
  Bartender should be tried. If none of the Bartenders can be accessed,
  it should be tried again later.
- If the Bartender replies that there is no such file, then there is
  not much point in trying again later (except if the user uploads the
  file later).
- For SOAP/HTTPS and HTTPS file transfer, an X.509 certificate is
  needed - access control is based on the DN of the user or the name of
  the VO (from the extension of the user's X.509 certificate).
- No space management.
- The current arc:// URL implementation has the LN in the URL (e.g.
  arc:///niif/zsombor/myjobs refers to the logical name
  '/niif/zsombor/myjobs'), but the LN in itself is not enough to access
  Chelonia: the client needs a Bartender URL.

  - If the client knows some ISIS URLs, then it can ask for Bartender
    URLs.
  - If the client knows some Bartender URLs, then it can use those.
  - If the client knows only one Bartender URL, that could be a
    problem: if that Bartender is offline the client cannot access
    Chelonia even if there are other Bartenders in the system.
  - The arc:// URL can contain one Bartender URL in the form of
    arc:///niif/zsombor/myjobs?BartenderURL= which has the same
    problem of having only one URL as the previous point, and it is

- see and

xrootd
------

- To be decided whether xrootd is supported.

SRB
---

- Storage Resource Broker.
- To be decided whether SRB is supported.

SRM
---

- Specifications for v2.2 of the protocol are at
- Protocol for negotiation of file transfer, not the transfer itself.
- Negotiation of transfer URLs for read and write is asynchronous - the
  client must make a request and poll until the request is processed.
- Other operations are synchronous (but listing operations can also be
  asynchronous).
- It is acceptable for requests to take an arbitrary length of time,
  due to physical file replication or staging from high-latency media
  behind the scenes.
- Supported physical transfer protocols can include GridFTP, HTTP(S),
  dcap, xrootd and rfio.
- If a request is satisfied, the transfer URL is "pinned" (guaranteed
  to be non-exclusively available) for a set length of time.
- Pins should be released on completion of the physical transfer,
  whether successful or not.
- Requests can't be put on hold.
- Requests can be cancelled at any time (this does not cancel the
  physical transfer).
- Provides space management by applying "space tokens" to files,
  representing areas of space with limited capacity.
- Reports metadata without doing actual transfer.
- Allows the user to set various metadata such as access latency and
  lifetime of files (some features are not implemented in any known
  implementation).
- Allows searching/listing of available data entities.
- Can transfer in both directions (upload and download).
- Third party transfers are supported.
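The asynchronous negotiation described above boils down to a
request-then-poll loop. A protocol-neutral sketch follows, where the
client object and its method names (prepare_to_get,
status_of_get_request) are hypothetical stand-ins for a real SRM client
library:

```python
import time

def negotiate_turl(client, surl, poll_interval=2.0, timeout=300.0):
    """Request a transfer URL for 'surl' and poll until the request is
    processed ('client' and its methods are hypothetical stand-ins)."""
    token = client.prepare_to_get(surl)      # asynchronous request
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.status_of_get_request(token)
        if status.state == "done":
            return status.turl               # pinned transfer URL
        if status.state == "failed":
            raise RuntimeError(status.reason)
        time.sleep(poll_interval)            # request still being processed
    raise TimeoutError("SRM request not processed in time")
```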

LFC
---

- File catalog developed by LCG.
- Allows a file system-like view of a logical filename space.
- Each file can have multiple replicas in different locations.
- Each file has a unique GUID.
- Metadata similar to a file system, with the addition of checksum and
  flags for fileclass and migration status (these last two are
  hangovers from the CASTOR code on which LFC is based and have no
  practical use in LFC).
- Many file system operations are supported, e.g. mkdir, ln, stat,
  readdir.
- Supports ACLs and unix-style permissions on files and directories.
- Communication is synchronous and initiated and controlled by the
  client.

RLS
---

- Replica Location Service, originally developed by Globus and EDG.
- Maps file names in a flat logical name space to physical replicas.
- Logical file names have attributes such as creation date and size, as
  well as user-defined attributes.
- Coarse-grained security policy - read/write access is granted per DN
  to the entire catalog.
- To be decided whether RLS is supported.