Gridengine tweaks

This page is a repository of recipes for administrators to tweak Sun Grid Engine.

Known external guides

These links have likely gone stale, since sunsource is dead and the Rocks wiki has been moved or deleted.

Implement share trees

Basic share tree to distribute evenly between all users: qconf -mstree

id=0
name=Root
type=0
shares=1
childnodes=1
id=1
name=default
type=0
shares=1000
childnodes=NONE
  • qconf -msconf
    • set weight_tickets_share to 100000
    • set weight_tickets_functional to 1000 (or a similar value)
    • set weight_ticket to 10.000000 (default 0.01) to make scheduling actually change
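
For reference, a sketch of how the relevant lines might look after the changes above:

 # relevant scheduler configuration lines (qconf -ssconf)
 weight_tickets_share              100000
 weight_tickets_functional         1000
 weight_ticket                     10.000000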

You may also want to change the following for tuning (qconf -msconf):

  • halftime (hours) (default 168)
  • policy_hierarchy OSF (default OFS)

Really ugly numerical sharetree data can be retrieved with $SGE_ROOT/utilbin/<arch>/sge_share_mon. A more readable summary of sharetree data:

 /opt/gridengine/utilbin/lx26-amd64/sge_share_mon -f actual_share,node_name -c 1

restrict jobs to a single node

  • Copy the parallel environment of choice, renamed to *-one (see the sketch below)
  • change the allocation rule:
allocation_rule    $pe_slots
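
For example, a minimal sketch of cloning an existing PE (assuming one named mpich; the new name mpich-one and the temporary file path are arbitrary):

 qconf -sp mpich > /tmp/mpich-one             # dump the existing PE definition
 # edit /tmp/mpich-one: set pe_name to mpich-one and allocation_rule to $pe_slots
 qconf -Ap /tmp/mpich-one                     # add the new PE from the file
 qconf -aattr queue pe_list mpich-one all.q   # append the new PE to a queue's pe_list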

Adjust mixing of single and multi-cpu jobs

Prevent round robin node allocation of single cpu jobs

  1. qconf -msconf and change queue_sort_method from load to seqno
  2. For each queue you wish to be filled sequentially instead of by load, run qconf -mq qname, copy the slots list to seq_no, and change the values (after each =) to be monotonically increasing with no duplicates (see the example below)
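
For example, the resulting lines in the queue configuration might look like this (hostnames and counts are placeholders):

 slots                 1,[compute-0-0.local=4],[compute-0-1.local=4]
 seq_no                10,[compute-0-0.local=11],[compute-0-1.local=12]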

Untested: what if some nodes are not given a sequence number? Do they inherit from the queue? Are they allocated first? Probably, load will still be used within the unsequenced queues.

Prevent small jobs and large jobs from mixing

  1. clone the primary queue and make a duplicate for each job size desired
  2. edit the seq_no for each queue so that the queues are ordered from most restrictive to least restrictive; jobs then fall into the correct queue for their size
  3. set limits to define the differences between queues (examples below)
  4. add sequence numbers and change the sort method (as above), but reverse the per-instance seq_no order for queues that should not mix
  5. add an RQS to prevent oversubscription of slots between queues: qconf -mrqs
{
  name         overslots
  description  prevent slot oversubscription
  enabled      TRUE
  limit        hosts {*} to slots=$num_proc
}


Sample queue restrictions:

pe_list
    some queues might use different (or no) parallel environments
qtype
    set to NONE to only allow parallel jobs
access restrictions
    user, project, department, etc.

Other resource limits are listed in queue_conf(5).
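
For illustration, a hypothetical qconf -mq fragment for a parallel-only queue combining such restrictions (the PE names and the ACL name parallel_users are placeholders):

 qtype                 NONE
 pe_list               mpich orte
 user_lists            parallel_users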

Prevent jobs from spanning nodes

Create a special parallel environment (clone an existing one, e.g. qconf -sp mpich) and change allocation_rule. Possible values are (see the sge_pe man page for the complete list):

$fill_up
    fill up nodes until the slot request is satisfied
$pe_slots
    jobs may not span nodes (may also use $pe_slots with an int option)
integer
    for example, half the available CPUs to prevent hyperthreading, or twice the available CPUs to intentionally oversubscribe

Don't forget to modify each queue to add the new PE to its pe_list.

Bug: a job split across multiple top-level queues won't honor the pe_slots directive.

Bug: these settings only sort of work; they seem to weight scheduling in the right direction, but sometimes jobs are still placed incorrectly.

Short queue

Job turn-around time may be reduced by reserving slots for short jobs (where the definition of "short" depends on local usage). There are two ways to set up a short queue: you can use projects or complex resources.

The short complex resource has the advantage that you can tweak priority using urgency (in the complex line), and that jobs tagged with -l short=1 will be forced into the short queue.

The alternative below uses projects instead, which has the advantage that you can use share trees or functional priorities to adjust job priorities, and that jobs tagged with -P short can optionally roll over into short.q or stay in all.q, depending on resource quota sets and seq_no tweaks.

The resource quota sets otherwise apply equally to either method.

creating the short queue

(used for both methods)

  • create a new queue (clone or otherwise) and change the following entries:
s_rt                  23:58:00
h_rt                  24:0:0

This limits jobs in the short queue to one day, after which they will be killed if they don't finish first.

  • Optionally reserve slots by restricting slot counts in other queues with a resource quota set: qconf -arqs shortq
{
  name         shortq
  description  Reserve a few systems for the short queue
  enabled      TRUE
  limit        queues {short.q} to slots=300
  limit        queues * to slots=224
}

The first limit line exempts the short queue from the second (and may also cap how many slots can go to the short queue). The second line should be a few slots less than the total shown by qstat -g c, so that the other queues leave those slots unused for the short queue.

short queue using complex resources

To use a complex resource, see [1]; essentially, run qconf -mc and add

short    short  BOOL     ==    FORCED  NO   0   0

and then run qconf -mq short.q and add

 complex_values short=1

Jobs tagged with -l short=1 will be forced into the short queue, with higher urgency (if specified as non-zero in the complex), but will be limited by the total set in the resource quota and the time set in short.q.
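
For example, a submission forced into the short queue (the script name is a placeholder):

 qsub -l short=1 myjob.sh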

short queue using project

Users can tag a job for the short queue by adding -P short to the command line of either qsub or qalter.

  • set up a project to tag jobs as short:
% qconf -aprj 
name short
oticket 0
fshare 0
acl NONE
xacl NONE
  • modify the queue to add the project: qconf -mq short.q
projects              short
  • you may also want to adjust seq_no so that jobs tagged as short fall into either the default queue or the short queue first (assuming queues are ordered by seq_no instead of load). If all.q comes first, jobs will only go into short.q when all.q is already full. If short.q comes first, jobs will go into short.q until the resource quota is full, after which jobs tagged as short will go into all.q. Jobs that end up in all.q still get the short project's prioritization and share tree adjustments, but are subject to all.q's time limits (if any) rather than short.q's.
  • Optionally, priority for the short queue can be adjusted by either changing the project ticket numbers (qconf -mprj short) or by adding a project node to the share tree: qconf -mstree
id=0
name=Root
type=0
shares=1
childnodes=1,2
id=1
name=default
type=0
shares=1000
childnodes=NONE
id=2
name=short
type=1
shares=1000
childnodes=3
id=3
name=default
type=0
shares=1000
childnodes=NONE

Note that this gives a 50/50 share split between normal jobs and short jobs.
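
To verify the setup and tag a job (the script name is a placeholder):

 qconf -sprj short        # show the project definition
 qconf -sstree            # show the share tree
 qsub -P short myjob.sh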

The number of slots used in each queue can be limited with resource quotas, as follows:

% qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
all.q                             0         0    100    100      0    0
single.q                          0         0    100    100      0    0
% qconf -mrqs
{
  name         shortq
  description  Reserve a few systems for the short queue
  enabled      TRUE
  limit        queues short.q to slots=200
  limit        queues * to slots=96
}
{
  name         leavesome
  description  Leave some slots for other users
  enabled      TRUE
  limit        users {*} queues all.q to slots=50
  limit        users {*} to slots=75
}

This will not limit jobs in the short queue, but will limit non-shortq jobs to all nodes but one (with 4 cpus). Also, a single user can't use more than half the cluster (50 cpus) for normal jobs, or 3/4 of the cluster for short jobs.

Add head node as a compute node

(useful on head nodes with many cpus)

  • Follow the directions at [2] to make the current node an exec node
    1. cd /opt/gridengine
    2. ./install_execd
    3. accept all defaults
  • add slots in appropriate queues
qconf -mq all.q

on slots line, add

 [$HOSTNAME=14]

where the number is a few fewer CPUs than are actually available; reserve some for I/O.
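
For example, a hypothetical resulting slots line (the head node's hostname and slot count will differ per cluster):

 slots                 4,[headnode.local=14]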

  • Verify:
qstat -g c
qstat -f 

Automatically renice jobs

SGE will renice jobs automatically depending on ticket value. This is useful when queues are set up to intentionally overcommit machines: Unix nice values then give very fine-grained control of job scheduling, even after a job has already started execution.

  • qconf -msconf set reprioritize_interval to the desired time granularity (recommend 0:2:0 [3])
  • qconf -mconf set reprioritize to 1 (true) (This setting is automatically adjusted.)
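
For reference, a sketch of the relevant lines after these changes:

 # scheduler configuration (qconf -ssconf)
 reprioritize_interval        0:2:0
 # global cluster configuration (qconf -sconf)
 reprioritize                 1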

Fluent in parallel with checkpointing

Note: fix path names

See mmae:Help:Fluent#use_with_SGE for detailed use. To set this up:

  • Install fluent_pe :

qconf -ap fluent_pe

pe_name           fluent_pe
slots             30
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /share/apps/Fluent.Inc/addons/sge1.0/kill-fluent
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
  • Install fluent_ckpt:

qconf -ackpt fluent_ckpt

ckpt_name          fluent_ckpt
interface          APPLICATION-LEVEL
ckpt_command       /share/apps/Fluent.Inc/addons/sge1.0/ckpt_command.fluent
migr_command       /share/apps/Fluent.Inc/addons/sge1.0/migr_command.fluent
restart_command    NONE
clean_command      NONE
ckpt_dir           NONE
signal             USR1
when               xsm
  • modify the following lines in the queue definition:

qconf -mq all.q

min_cpu_interval      02:00:00
ckpt_list             fluent_ckpt
pe_list               make fluent_pe mpi mpich orte
  • original templates for the above can be found in $FLUENT/addons/sge1.0/sample_*

license management

The -l fluent resource is achieved with:

  • qconf -mc
#name    shortcut  type relop requestable consumable default urgency
fluent   flu       INT   <=    YES         YES        0        10
fluent-r flur      INT   <=    YES         YES        0        16
fluent-t flut      INT   <=    YES         YES        0        40

Alternatively (experimental):

fluent-r flur      INT   <=    YES         JOB        0        10
fluent-t flut      INT   <=    YES         JOB        0        16
fluent-p flup      INT   <=    YES         YES        0        40

flur=5 flut=25 flup=20+4*5

  • qconf -me global
complex_values        fluent=20

This limits the cluster to 20 fluent licenses. A dynamic load sensor also helps, but has a race condition.

  • qconf -mconf compute-0-0
load_sensor                  /usr/local/bin/fluent-load-sensor

fluent-load-sensor is in the cluster config set, and in ssd's roll

  • add a resource quota so that current usage is easy to check with qquota; edit with qconf -mrqs
{
  name         fluent
  description  Count fluent licenses in use within SGE
  enabled      TRUE
  limit        queues * to fluent=12
}
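
To check usage against the quota and request licenses from a job (the script name is a placeholder):

 qquota                           # shows current consumption against the fluent quota
 qsub -l fluent=4 run_fluent.sh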

scratch load sensor

  • qconf -mc
scratchavail        scr        MEMORY      <=    YES         YES        0        0
scratchused         scru       MEMORY      >=    NO          NO         0        0
  • qconf -mconf global (svn in ~/bin/sge/ )
load_sensor                  /share/apps/local/bin/load-sensor
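
Jobs can then request scratch space at submission time, for example (the size and script name are placeholders):

 qsub -l scratchavail=10G myjob.sh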

global scratch directory

Default scratch directory is /state/partition1 which is not available on the head node.

  • Add the following to extend-compute.xml in the post section:
# create scratch directory
chmod 755 /state/partition1
mkdir /state/partition1/scratch
chmod gou+rwx,+t /state/partition1/scratch
ln -s /state/partition1/scratch /scratch
  • Run selected lines from above to create scratch on the head node by hand
  • Note: on Rocks 6, /state/partition1 only exists on compute nodes (it is now /export on the head node)

exclusive job access

If a job needs exclusive access to a node (e.g., parallel MATLAB jobs), you can set up a complex resource for it. Setup:

  • Create complex with qconf -mc
exclusive           excl       BOOL        EXCL    YES         YES        0        1000
  • Add complex to each execution host (or edit with qconf -me compute-0-0 etc.)
for i in `ganglia|grep compute` ; do
  qconf -aattr exechost complex_values exclusive=true $i
done
  • Submit job with -l exclusive
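
For example, submitting an exclusive job and checking that a host picked up the complex (the script name is a placeholder):

 qsub -l exclusive=true matlab_job.sh
 qconf -se compute-0-0         # complex_values should include exclusive=true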

exclusive gpu access

Preliminary attempt to assign resources to the GPU; this may need tuning.

  • Create complex with qconf -mc
gpu                 gpu        INT         <=    FORCED         JOB        0        30
  • assign complex resources to relevant nodes
qconf -aattr exechost complex_values gpu=6 compute-0-0

submit jobs with -l gpu=1

  • FORCED prevents jobs from being scheduled on gpu nodes unless they request a gpu
  • default of 0; online forums suggest this should be NONE (why?), which the Rocks SGE does not accept
  • JOB counts resources by the job, not by cpu
  • running nvidia-persistenced is also a good idea, or use nvidia-smi -c 3 -pm 1

Limit each user to two GPUs with a resource quota set:

{
  name         gpulimit
  description  limit gpus per user
  enabled      TRUE
  limit        users {*} to gpu=2
}

Set the GPU compute mode to exclusive and enable persistence:

nvidia-smi -c 3
nvidia-smi -pm 1

Prohibit job notification via email

qconf -mconf
jsv_url                      script:/export/apps/local/sbin/nomail.jsv
jsv_allowed_mod              ac,h,i,e,o,j,N,p,w

default was

jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
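
The contents of nomail.jsv are not shown here; the following is a minimal sketch of what such a JSV could look like, using the JSV shell library shipped with SGE (the real script may differ):

#!/bin/sh
# minimal sketch of a JSV that strips mail notification from submitted jobs
# (not the site's actual nomail.jsv)
. "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   # drop any mail recipients (-M) and mail events (-m) from the job
   if [ "$(jsv_is_param M)" = "true" ]; then
      jsv_del_param M
   fi
   if [ "$(jsv_is_param m)" = "true" ]; then
      jsv_del_param m
   fi
   jsv_correct "mail notification removed"
   return
}

jsv_main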