Gridengine tweaks
This page is a repository of recipes for administrators to tweak Sun Grid Engine.
Known external guides
These links have likely gone stale, since sunsource is dead and the Rocks wiki has been moved or deleted.
- Sun: Tight Integration of the MPICH2 library into SGE
- Pretty pictures explain Functional vs Sharetree scheduling
- gridengine.info: resource quota set tips
- https://wiki.rocksclusters.org/wiki/index.php/Sun_GridEngine
- http://wiki.gridengine.info/wiki/index.php/Disabling_direct_ssh_connection_to_the_nodes
- http://wiki.gridengine.info/wiki/index.php/Using_Ganglia_As_Load_Sensor
Share tree and functional scheduling
A basic share tree that distributes shares evenly among all users: qconf -mstree
id=0
name=Root
type=0
shares=1
childnodes=1
id=1
name=default
type=0
shares=1000
childnodes=NONE
- qconf -msconf
- set weight_tickets_share to 100000
- set weight_tickets_functional to 1000 (or something)
- set weight_ticket to 10.000000 (default 0.01) to make scheduling actually change
You may also want to tune the following (also via qconf -msconf); a combined excerpt is shown after this list:
- halftime (hours) (default 168)
- policy_hierarchy OSF (default OFS)
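Taken together, the relevant lines of the scheduler configuration (qconf -msconf) might look roughly like this after editing; the weights are the example values above, and halftime is shown at its default of 168 hours:
weight_tickets_functional   1000
weight_tickets_share        100000
weight_ticket               10.000000
halftime                    168
policy_hierarchy            OSF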
Raw numerical sharetree data can be retrieved with $SGE_ROOT/utilbin/<arch>/sge_share_mon. For a more readable summary of the sharetree data:
/opt/gridengine/utilbin/lx26-amd64/sge_share_mon -f actual_share,node_name -c 1
Restrict jobs to a single node
- Copy the parallel environment of choice, renamed to *-one
- Change the allocation rule:
allocation_rule $pe_slots
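One way to do the copy from the command line (a sketch; mpich and mpich-one are example PE names):
qconf -sp mpich > /tmp/mpich-one      # dump the existing PE to a file
# edit /tmp/mpich-one: set pe_name to mpich-one and allocation_rule to $pe_slots
qconf -Ap /tmp/mpich-one              # add the new PE from the file
# then add mpich-one to the pe_list of the relevant queues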
Adjust mixing of single and multi-CPU jobs
Prevent round-robin node allocation of single-CPU jobs
- qconf -msconf and change queue_sort_method from load to seqno
- For each queue you wish to fill sequentially instead of by load, run qconf -mq qname, copy the slots list to seq_no, and change the values (after =) to be monotonically increasing with no duplicates (see the sketch below)
Untested: what if some nodes are not given a sequence number? Do they inherit the queue default? Are they allocated first? Probably load will still be used within the unsequenced queues.
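A sketch of what the edited queue might contain (hostnames and numbers are placeholders):
# in qconf -mq all.q
seq_no   10,[compute-0-0.local=11],[compute-0-1.local=12],[compute-0-2.local=13]
slots    1,[compute-0-0.local=4],[compute-0-1.local=4],[compute-0-2.local=4]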
Prevent small jobs and large jobs from mixing
- clone the primary queue and make a duplicate for each job size desired
- edit the seq_no for each queue so that queues are ordered from most restrictive to least restrictive and jobs fall into the correct queue for their size
- set limits to define differences between queues. Examples below.
- add sequence numbers and change sort method (as above), but reverse the instance seqno order for queues that should not mix
- add a resource quota set (RQS) to prevent oversubscription of slots between queues: qconf -mrqs
{
   name         overslots
   description  prevent slot oversubscription
   enabled      TRUE
   limit        hosts {*} to slots=$num_proc
}
Sample queue restrictions
- pe_list: some queues might use different (or no) parallel environments
- qtype: set to NONE to allow only parallel jobs
- access restrictions: user, project, department, etc.
Other resource limits are listed in the queue_conf man page; a sketch of two such queues follows.
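As a sketch, a serial-only queue and a parallel-only queue might differ like this (the queue names, PE names, and userset are hypothetical):
# small.q: serial jobs only
qtype        BATCH INTERACTIVE
pe_list      NONE

# big.q: parallel jobs only, restricted to one userset
qtype        NONE
pe_list      mpich orte
user_lists   hpcusers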
Prevent jobs from spanning nodes
Create a special parallel environment (start by cloning an existing one, shown with qconf -sp mpich) and change allocation_rule; don't forget to add the new PE to the pe_list of each relevant queue. Possible values for allocation_rule include (see the sge_pe man page for the complete list):
- $fill_up: fill up nodes until the slot request is satisfied
- $pe_slots: jobs may not span nodes (may also use $pe_slots with an integer option; see the usage sketch after this list)
- an integer: for example, half the available CPUs to prevent hyperthreading, or twice the available CPUs to intentionally oversubscribe
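Usage sketch: with allocation_rule $pe_slots, a request such as the following can only be satisfied by a single host with 8 free slots (mpich-one and job.sh are placeholders):
qsub -pe mpich-one 8 job.sh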
Bug: a job split across multiple top level queues won't honor the pe_slots directive.
Bug: these mostly work; they weight the scheduler toward the desired placement, but it is not always honored.
Short queue
Job turn-around time may be reduced by reserving slots for short jobs (where the definition of "short" depends on local usage). There are two ways to set up a short queue: you can use projects or resources.
The short complex resource has the advantage that you can tweak priority using urgency (in the complex line), and that jobs tagged with -l short=1 will be forced into the short queue.
The other alternative below uses projects, which has the advantage that you can use share trees or functional priorities to adjust job priorities, and jobs tagged with -P short can optionally roll over into short.q or stay in all.q depending on resource quota set and seqno tweaks.
The resource quota sets otherwise apply equally to either method.
Creating the short queue
(used for both methods)
- create a new queue (clone or otherwise), change the following entries:
s_rt   23:58:00
h_rt   24:0:0
This limits jobs in the short queue to one day, after which they will be killed if they don't finish first.
- Optionally reserve slots by restricting slot counts in other queues with a resource quota set: qconf -arqs shortq
{
   name         shortq
   description  Reserve a few systems for the short queue
   enabled      TRUE
   limit        queues {short.q} to slots=300
   limit        queues * to slots=224
}
The first limit line makes the second line apply only to queues other than the short queue, and can also cap how many jobs go into the short queue. The second line should be a few slots fewer than the total shown by qstat -g c, so the other queues leave those slots free for the short queue.
Short queue using complex resources
To use resources, see [1]; but essentially, use qconf -mc and add
short short BOOL == FORCED NO 0 0
and then qconf -mq short.q and add
complex_values short=1
Jobs tagged with -l short=1 will be forced into the short queue, with higher urgency (if specified as non-zero in the complex), but will be limited by the total set in the resource quota and the time set in short.q.
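Usage sketch (job.sh is a placeholder):
qsub -l short=1 job.sh     # forced into short.q by the FORCED complex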
Short queue using a project
Users can tag a job for the short queue by adding -P short to the command line of either qsub or qalter.
- set up a project to tag jobs as short:
% qconf -aprj
name     short
oticket  0
fshare   0
acl      NONE
xacl     NONE
- modify the queue to add the project: qconf -mq short.q
projects short
- you may also want to adjust seqno so that jobs tagged as short fall into either the default queue or the short queue first (assuming queues are ordered by seqno instead of load). If all.q is first, then jobs will only go into short.q when all.q is already full. If short.q is first, then jobs will go into short.q until the resource quota is full, after which jobs tagged as short will go into all.q. Jobs that end up in all.q will still get the short project prioritization and share tree adjustments, but won't be time limited as specified by all.q.
- Optionally, priority for the short queue can be adjusted by either changing the project ticket numbers (qconf -mprj short) or by adding a project node to the share tree: qconf -mstree
id=0
name=Root
type=0
shares=1
childnodes=1,2
id=1
name=default
type=0
shares=1000
childnodes=NONE
id=2
name=short
type=1
shares=1000
childnodes=3
id=3
name=default
type=0
shares=1000
childnodes=NONE
Note that this gives a 50/50 share split between normal jobs and short jobs.
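Usage sketches for the project method (the script name and job id are placeholders):
qsub -P short job.sh       # submit under the short project
qalter -P short 4242       # retag a pending job as short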
The number of jobs in each queue can be limited with quotas, as follows:
% qstat -g c
CLUSTER QUEUE   CQLOAD  USED  AVAIL  TOTAL  aoACDS  cdsuE
-------------------------------------------------------------------------------
all.q                0     0    100    100       0      0
single.q             0     0    100    100       0      0
% qconf -mrqs
{
   name         shortq
   description  Reserve a few systems for the short queue
   enabled      TRUE
   limit        queues short.q to slots=200
   limit        queues * to slots=96
}
{
   name         leavesome
   description  Leave some slots for other users
   enabled      TRUE
   limit        users {*} queues all.q to slots=50
   limit        users {*} to slots=75
}
This will not limit jobs in the short queue, but will limit non-short-queue jobs to all nodes but one (with 4 CPUs). Also, a single user can't use more than half the cluster (50 CPUs) for normal jobs, or 3/4 of the cluster for short jobs.
Add head node as a compute node
(useful on head nodes with many cpus)
- Follow the directions at [2] to make the current node an exec node
- cd /opt/gridengine
- ./install_execd
- accept all defaults
- add slots in appropriate queues
qconf -mq all.q
on slots line, add
[$HOSTNAME=14]
where the number is a few fewer CPUs than are actually available; reserve some for I/O.
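The resulting slots line might look like this (hostnames are placeholders):
slots    1,[compute-0-0.local=4],[compute-0-1.local=4],[headnode.local=14]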
- Verify:
qstat -g c
qstat -f
Automatically renice jobs
SGE will renice jobs automatically depending on ticket value. This is useful when queues are set up to intentionally overcommit machines, since Unix nice values can then be exploited to give fine-grained control of job scheduling even after a job has started executing.
- qconf -msconf: set reprioritize_interval to the desired time granularity (0:2:0 is recommended [3])
- qconf -mconf: set reprioritize to 1 (true). (This setting is adjusted automatically.)
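The resulting lines, assuming the recommended values above:
# scheduler configuration (qconf -msconf)
reprioritize_interval   0:2:0
# global cluster configuration (qconf -mconf)
reprioritize            1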
Fluent in parallel with checkpointing
Note: fix path names
See mmae:Help:Fluent#use_with_SGE for detailed use. To set this up:
- Install fluent_pe :
qconf -ap fluent_pe
pe_name            fluent_pe
slots              30
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /share/apps/Fluent.Inc/addons/sge1.0/kill-fluent
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
- Install fluent_ckpt:
qconf -ackpt fluent_ckpt
ckpt_name          fluent_ckpt
interface          APPLICATION-LEVEL
ckpt_command       /share/apps/Fluent.Inc/addons/sge1.0/ckpt_command.fluent
migr_command       /share/apps/Fluent.Inc/addons/sge1.0/migr_command.fluent
restart_command    NONE
clean_command      NONE
ckpt_dir           NONE
signal             USR1
when               xsm
- modify the following lines in the queue definition:
qconf -mq all.q
min_cpu_interval   02:00:00
ckpt_list          fluent_ckpt
pe_list            make fluent_pe mpi mpich orte
- original templates for the above can be found in $FLUENT/addons/sge1.0/sample_*
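A submit sketch using the objects above (the script name and slot count are placeholders; see the Fluent help page linked above for the real invocation):
qsub -pe fluent_pe 8 -ckpt fluent_ckpt fluent-job.sh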
License management
The -l fluent resource is achieved with:
- qconf -mc
#name      shortcut  type  relop  requestable  consumable  default  urgency
fluent     flu       INT   <=     YES          YES         0        10
fluent-r   flur      INT   <=     YES          YES         0        16
fluent-t   flut      INT   <=     YES          YES         0        40
Alternatively (experimental):
fluent-r   flur      INT   <=     YES          JOB         0        10
fluent-t   flut      INT   <=     YES          JOB         0        16
fluent-p   flup      INT   <=     YES          YES         0        40
flur=5 flut=25 flup=20+4*5
- qconf -me global
complex_values fluent=20
(limits this cluster to 20 licenses). A dynamic load sensor also helps, but has a race condition.
- qconf -mconf compute-0-0
load_sensor /usr/local/bin/fluent-load-sensor
fluent-load-sensor is in the cluster config set, and in ssd's roll
- add a resource quota so current usage is easy to check with qquota; edit it with qconf -mrqs
{
   name         fluent
   description  Count fluent licenses in use within SGE
   enabled      TRUE
   limit        queues * to fluent=12
}
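Usage sketch (the license counts and script name are placeholders):
qsub -pe fluent_pe 4 -l fluent=1 fluent-job.sh   # request one fluent license
qquota -l fluent                                 # show current usage against the quota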
Scratch load sensor
- qconf -mc
scratchavail   scr    MEMORY  <=  YES  YES  0  0
scratchused    scru   MEMORY  >=  NO   NO   0  0
- qconf -mconf global (the script is in svn under ~/bin/sge/)
load_sensor /share/apps/local/bin/load-sensor
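A minimal sketch of what such a load sensor could look like, assuming the standard SGE load sensor protocol (read a line from stdin, exit on "quit", otherwise print a begin/end-delimited report) and a /scratch mount; the complex names match the entries above, everything else is illustrative:
#!/bin/sh
# report scratch space on this host to SGE
HOST=`hostname`
while :; do
    read input                      # SGE writes a line when it wants a report
    if [ "$input" = "quit" ]; then
        exit 0
    fi
    # df -Pk: portable output in 1K blocks; column 3 = used, column 4 = available
    used=`df -Pk /scratch | awk 'NR==2 {print $3}'`
    avail=`df -Pk /scratch | awk 'NR==2 {print $4}'`
    echo "begin"
    echo "$HOST:scratchused:${used}K"
    echo "$HOST:scratchavail:${avail}K"
    echo "end"
done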
Global scratch directory
The default scratch directory is /state/partition1, which is not available on the head node.
- Add the following to extend-compute.xml in the post section:
# create scratch directory
chmod 755 /state/partition1
mkdir /state/partition1/scratch
chmod gou+rwx,+t /state/partition1/scratch
ln -s /state/partition1/scratch /scratch
- Run selected lines from above to create scratch on the head node by hand
- Note: on Rocks 6, /state/partition1 exists only on compute nodes (it is /export on the head node)
Exclusive job access
If a job needs exclusive access to a node (e.g., parallel MATLAB jobs), you can set up a complex resource for it. Setup:
- Create complex with qconf -mc
exclusive excl BOOL EXCL YES YES 0 1000
- Add the complex to each execution host (or edit hosts individually with qconf -me compute-0-0, etc.):
for i in `ganglia | grep compute`; do
    qconf -aattr exechost complex_values exclusive=true $i
done
- Submit job with -l exclusive
Exclusive GPU access
A preliminary attempt to assign resources to the GPU; this may need tuning.
- Create complex with qconf -mc
gpu gpu INT <= FORCED JOB 0 30
- assign complex resources to relevant nodes
qconf -aattr exechost complex_values gpu=6 compute-0-0
- Submit jobs with -l gpu=1
- FORCED prevents jobs from being scheduled on gpu nodes unless they request a gpu
- default of 0; online forums suggest this should be NONE (why?), which the Rocks SGE build does not accept
- JOB counts resources by the job, not by cpu
- nvidia-persistenced is also a good idea, or use nvidia-smi -c 3 -pm 1
Limit each user to two GPUs:
{
   name         gpulimit
   description  limit gpus per user
   enabled      TRUE
   limit        users {*} to gpu=2
}
Set the GPU mode to exclusive and persistent:
nvidia-smi -c 3
nvidia-smi -pm 1
Prohibit job notification via email
qconf -mconf
jsv_url          script:/export/apps/local/sbin/nomail.jsv
jsv_allowed_mod  ac,h,i,e,o,j,N,p,w
The default was:
jsv_allowed_mod  ac,h,i,e,o,j,M,N,p,w
The only change is dropping M, the mail switch.
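The contents of nomail.jsv are not shown on this page; a minimal sketch of what such a script could look like, using the shell JSV helpers shipped with SGE (the m/M parameter names for the mail switches are an assumption; adjust as needed):
#!/bin/sh
# strip mail requests from submitted jobs
jsv_on_start()
{
    return
}

jsv_on_verify()
{
    # drop any -m / -M mail settings supplied by the user
    jsv_del_param m
    jsv_del_param M
    jsv_correct "mail notification removed"
    return
}

. ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
jsv_main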