Dear colleagues,
Since not everyone was able to make it to last week's meeting with RC, we wanted to write up a short summary of what was discussed.

1. The primary issue is that our group is the heaviest user of the Odyssey cluster (we are currently using ~14% of it).  This was causing the fairshare number for every member of the group to be set to zero, so our partition was effectively acting as a 'first in, first out' queue, even though the group has both 'power users' and 'light users'.

2. Some users in the group had accidentally been assigned a status of 'parent' for the purposes of the fairshare calculation, which was skewing the fairshare numbers.

To resolve points 1 & 2, John Brunelle bumped the Aspuru-Guzik lab's 'raw shares' number to 1000 and set each AAG lab member's raw shares to 100, whereas other labs have a 'raw share' of 100 and a user status of 'parent'.  This effectively gives the fairshare calculation enough decimal places to resolve differences within our group, while our fairshare stays low enough that we are not treated too unfairly in the general queue.
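
If you want to check the numbers for yourself, SLURM's 'sshare' command reports the raw shares and the resulting fairshare factor.  A quick sketch (the exact columns shown on Odyssey may differ):

    # show raw shares and the effective fairshare factor for your own user
    sshare -U -l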

3. Since the 'aspuru-guzik' partition is an owned resource, there is no default time limit as there is on the 'general' partition.  Failing to specify a time limit makes life difficult for the scheduler.  We proposed having a default time limit of 7 days for our partition, and John is investigating whether it is possible to implement one.  This would be a default only, not a cap; i.e. if you specify a -t limit greater than 7 days, that will be allowed.  It is simply meant to safeguard against lazy usage where someone forgets to specify -t.  In the meantime, please continue to do your best to specify time and memory limits for your jobs, as doing so maximizes our usage of this shared resource.  At present, even setting a time limit of six months is better for the scheduler than not setting any time limit at all.
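
For example, a batch script header might look something like the following (the values and program name below are only placeholders, adjust them to your job):

    #!/bin/bash
    #SBATCH -p aspuru-guzik      # our lab's partition
    #SBATCH -t 7-00:00:00        # time limit of 7 days (D-HH:MM:SS)
    #SBATCH --mem=4000           # memory per node, in MB
    #SBATCH -n 1                 # number of tasks
    ./my_program                 # placeholder for your actual executable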

4. We don't have a testing queue, but it is possible to use any partition in interactive mode, not just the -p 'interact' nodes.  To do this on the aspuru-guzik partition you can run 'srun -p aspuru-guzik ...', exactly as you would launch an interactive job on the 'interact' partition nodes.  It is usually fairly easy to get time on the interactive nodes, so those at the meeting felt there was no need at this time to set up a separate aspuru-guzik test queue.
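
As a concrete sketch (the resource values here are only illustrative), an interactive session on our partition could be requested with:

    # ask for a 2-hour interactive shell with ~2 GB of memory on our partition
    srun -p aspuru-guzik -n 1 -t 2:00:00 --mem=2000 --pty bash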

John and others at RC have been very responsive to our concerns about the cluster, even though RC is very understaffed at the moment.  We owe them a lot of thanks.

Best wishes,
-Martin & Sam