SGE idioms

Here is a cookbook of things you can do with SGE.

Explain why a queue is in the error state

qstat -f -explain E

Investigate the error: is the disk full, or does the machine need a reboot? Once the cause is fixed, clear the error on all queues on that machine

qmod -c '*@<machine-name>*'
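
To see at a glance which queue instances are affected, before or after clearing them, the output of qstat -f can be filtered on its states column. This is only a sketch: the function name error_queues is made up for this page, and it assumes the usual qstat -f layout in which a queue line has six whitespace-separated columns with the states column last (and missing when the instance has no state flags).

```shell
# Reads `qstat -f` output on stdin and prints each queue instance whose
# states column contains E (error).
# Assumption: queue lines look like
#   queuename  qtype  resv/used/tot.  load_avg  arch  states
# and the states column is simply absent when no flag is set.
error_queues() {
    awk 'NF == 6 && $1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Usage on a live cluster (not run here):
#   qstat -f | error_queues
```

Job lines and the explanation lines printed by -explain E do not start with a queue@host name, so they are skipped.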

Who is running jobs on the GPUs?

qstat -q gpu.q -f -u '*'
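
If the listing is long, a per-user summary can be easier to read. The following is a rough sketch, not an official SGE feature: gpu_job_counts is a hypothetical helper name, and it assumes that job lines start with a numeric job ID and carry the user name in the fourth column, as in the default qstat layout.

```shell
# Reads qstat output on stdin and prints "<count> <user>", busiest user first.
# Assumption: job lines begin with a numeric job ID and the owner is column 4.
# Each output row of qstat is counted once, so array tasks listed on separate
# rows are counted separately.
gpu_job_counts() {
    awk '$1 ~ /^[0-9]+$/ { n[$4]++ } END { for (u in n) print n[u], u }' | sort -rn
}

# Usage on a live cluster (not run here):
#   qstat -q gpu.q -f -u '*' | gpu_job_counts
```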

Find jobs in the Eqw state.

qstat -u '*' | grep Eqw
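
The grep above also matches any job whose name happens to contain "Eqw". A slightly stricter variant compares the state column itself. This is a sketch under the assumption that qstat prints its default columns (job-ID, prior, name, user, state, ...); the function name eqw_jobs is invented for illustration.

```shell
# Reads qstat output on stdin and prints "<job-ID> <user>" for every job
# whose state column is exactly Eqw.
# Assumption: default qstat layout, with the state in the fifth column.
eqw_jobs() {
    awk '$5 == "Eqw" { print $1, $4 }'
}

# Usage on a live cluster (not run here):
#   qstat -u '*' | eqw_jobs
```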

Investigate the job. Was its working directory removed, or is there an authentication problem? We know about this issue. For now, just clear the error on a job in the Eqw state

qmod -cj <jobid>
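
When many jobs are stuck at once, the same command can be applied to each of them in a loop. The sketch below is one possible way to do that, not an established procedure: clear_eqw_jobs and the DRY_RUN variable are hypothetical names, and the state is again assumed to be the fifth qstat column. Setting DRY_RUN to any non-empty value prints the qmod commands instead of running them, which is worth doing first.

```shell
# Reads qstat output on stdin and runs `qmod -cj <job-ID>` once per job in
# the Eqw state. With DRY_RUN set to a non-empty value, the commands are
# only printed.
# Assumption: default qstat layout, with the state in the fifth column.
clear_eqw_jobs() {
    awk '$5 == "Eqw" { print $1 }' | sort -un | while read -r jobid; do
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty.
        ${DRY_RUN:+echo} qmod -cj "$jobid" </dev/null
    done
}

# Usage on a live cluster (not run here):
#   qstat -u '*' | DRY_RUN=1 clear_eqw_jobs   # show what would be cleared
#   qstat -u '*' | clear_eqw_jobs             # actually clear the errors
```

The sort -un step removes duplicate job IDs, so an array job listed on several rows is cleared only once.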