Dear colleagues,
Since not everyone was able to make it to last week's meeting with RC, we
wanted to write up a short summary of what was discussed.
1. The primary issue is that our group is the heaviest user of the Odyssey
cluster (we are using ~14% of it today). This was causing the fairshare
number for every member of the group to be set to zero, so our partition
was effectively acting as a first-in, first-out queue, even though the
group contains both 'power users' and 'light users'.
2. Some users in the group had accidentally been assigned the status of
'parent' for the fairshare calculation, which was further skewing the
fairshare numbers.
To resolve points 1 & 2, John Brunelle bumped the Aspuru-Guzik lab's 'raw
shares' number to 1000 and set each AAG lab member's raw share number to
100, whereas other labs have a 'raw share' of 100 and a user status of
'parent'. This effectively gives us enough decimal places in the fairshare
calculation to resolve differences within our group, while our fairshare
number remains low enough that our treatment in the general queue stays
fair.
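If you want to see where you stand after these changes, Slurm's sshare
command reports the fairshare tree. A minimal sketch (the account name
below is a placeholder, not something confirmed at the meeting; check your
actual association first):

```shell
# List your own Slurm associations to find the real account name.
sacctmgr show assoc user=$USER format=Account,User,FairShare

# Show fairshare standings for all users under an account.
# "aspuru_guzik_lab" is a placeholder -- substitute the account name
# reported by the command above.
sshare -a -A aspuru_guzik_lab

# Or show just your own fairshare entry:
sshare -u $USER
```

The 'RawShares' and 'FairShare' columns in the output correspond to the
numbers John adjusted.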
3. Since the 'aspuru-guzik' partition is an owned resource, there is no
default time-limit as there is with the 'general' partition. Failing to
specify a time limit makes life difficult for the scheduler. We proposed
having a default time limit of 7 days for our partition. John is
investigating whether it is possible to implement a default time limit.
This would be a default only, not a cap: if you specify a -t limit greater
than 7 days, that will be allowed. It is just meant to safeguard against
lazy usage where someone forgets to specify -t. In the meantime, please
continue to do your best to specify time and memory limits for your jobs,
as this maximizes our usage of this shared resource. At present, even
setting a time limit of six months is better for the scheduler than not
setting any time limit at all.
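For reference, a minimal job script that sets both limits explicitly might
look like the sketch below (the partition name is from this email; the
time and memory values, job name, and commands are only illustrative):

```shell
#!/bin/bash
#SBATCH -p aspuru-guzik       # our owned partition
#SBATCH -t 7-00:00:00         # time limit, D-HH:MM:SS format (here: 7 days)
#SBATCH --mem=4000            # memory per node in MB -- adjust to your job
#SBATCH -J example_job        # illustrative job name

# Your actual commands go here.
echo "Running on $(hostname)"
```

Even a generous estimate in the -t field is better than leaving it blank.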
4. We don't have a testing queue, but it is possible to use any partition
in interactive mode, not just the -p 'interact' nodes. To do this on the
aspuru-guzik partition, you can run 'srun -p aspuru-guzik ...' just as you
would launch interactive jobs on the interactive partition nodes. It is
usually easy enough to get time on the interactive nodes that those at the
meeting felt there was no need, for now, to set up a separate aspuru-guzik
test queue.
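As a concrete sketch of the above (the time and memory values are
illustrative, not recommendations), an interactive shell on our partition
can be requested like this:

```shell
# Request an interactive shell on the aspuru-guzik partition.
# --pty attaches a pseudo-terminal so you get a usable shell;
# -t and --mem values below are illustrative -- size them to your test.
srun -p aspuru-guzik -t 0-02:00 --mem=2000 --pty bash
```

When the allocation is granted you land in a shell on a compute node, and
exiting the shell releases the allocation.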
John and others at RC are being very responsive to our concerns about the
cluster even though RC is very understaffed at the moment. We owe them a
lot of thanks.
Best wishes,
-Martin & Sam