Dear all,

This is fantastic news. I am glad to hear that we are making progress on the scheduling issues and finding out what was going on. I am so glad you guys are working with RC on this!

Alan

On Tuesday, July 29, 2014, Martin Blood-Forsythe <martin.bloodforsythe@gmail.com> wrote:
John has now implemented the DefaultTime of 7 days on our partition. MaxTime is still unlimited, so this is not an enforced cap. 

$ scontrol show partition aspuru-guzik | grep Time
   DefaultTime=7-00:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED

Hopefully these changes will improve the non-super-user experience on our partition.
Best wishes,
-Martin

Martin A. Blood-Forsythe



On Tue, Jul 29, 2014 at 1:33 PM, Sam Blau <samblau1@gmail.com> wrote:
Dear colleagues,
Since not everyone was able to make it to last week's meeting with RC, we wanted to write up a short summary of what was discussed.

1. The primary issue is that our group is the heaviest user of the Odyssey cluster (we're using ~14% of it today). This was causing the fairshare number for every member of the group to be set to zero. Thus our partition was acting as a 'first in, first out' queue, even though we have some 'power users' and some 'light users' in the group.

2. Some users in the group had accidentally been assigned the status of 'parent' for the calculation of their fairshare.  This was messing up the calculation of fairshare numbers.

To resolve points 1 & 2, John Brunelle bumped the Aspuru-Guzik lab's 'raw shares' number to 1000 and set each AAG lab member's 'raw shares' number to 100, whereas other labs have a 'raw shares' number of 100 and a user status of 'parent'. This effectively gives us enough decimal places in the fairshare calculation to resolve differences within our group, but we still have a low enough fairshare number to not be given too unfair a treatment in the general queue.
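
To make the 'decimal places' point concrete, here is a rough sketch. It assumes the classic fair-share formula from the Slurm multifactor priority documentation, F = 2^(-U/S), where U is normalized usage and S is normalized shares; RC's actual configuration may differ. fairshare_factor is a hypothetical helper (not a Slurm command), and the usage and share values are made up for illustration.

```shell
# Sketch only: assumes the classic Slurm fair-share formula F = 2^(-U/S).
# fairshare_factor is a hypothetical helper, not part of Slurm.
fairshare_factor() {
    # $1 = normalized usage U, $2 = normalized shares S (both illustrative)
    awk -v u="$1" -v s="$2" 'BEGIN { printf "%.6f\n", 2 ^ (-u / s) }'
}

# Usage equal to shares gives the neutral value.
fairshare_factor 0.10 0.10

# Heavy usage against a small share: two different users both round to zero
# at six decimal places, so the scheduler cannot tell them apart.
fairshare_factor 0.14 0.005
fairshare_factor 0.12 0.005

# The same usage against a tenfold larger share: the two users now get
# clearly different factors.
fairshare_factor 0.14 0.05
fairshare_factor 0.12 0.05
```

The only point is that when usage is large relative to shares, the factor collapses toward zero for everyone, and a larger share moves it back into a range where users can be ranked.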

3. Since the 'aspuru-guzik' partition is an owned resource, there is no default time limit as there is with the 'general' partition. Failing to specify a time limit makes life difficult for the scheduler. We proposed having a default time limit of 7 days for our partition. John is investigating whether it is possible to implement a 'default' time limit. This will be a default only, not a cap: if you specify a -t limit greater than 7 days, that will be allowed. It is just meant to safeguard against lazy usage where someone forgets to specify -t. In the meantime, please continue to do your best to specify time and memory limits for your jobs, as it will maximize our usage of this shared resource. At present, even setting a time limit of six months is better for the scheduler than not setting any time limit.

4. We don't have a testing queue, but it is possible to use any partition in interactive mode, not just the 'interact' partition nodes. To do this on the aspuru-guzik partition you can run 'srun -p aspuru-guzik ...' in the same way that you launch interactive jobs on the interactive partition nodes. It is usually pretty easy to get time on the interactive nodes, so those at the meeting felt that there was no need at this time to set up a separate aspuru-guzik test queue.
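
As a rough illustration of points 3 and 4, a batch script with explicit limits might look like the sketch below. The 2-day limit, the 4000 MB memory request, the job name, and the echo payload are placeholders chosen for the example, not recommendations; -p, -t, --mem, and -J are standard sbatch options.

```shell
#!/bin/bash
# Illustrative batch script; every value below is a placeholder.

# Run on our owned partition.
#SBATCH -p aspuru-guzik
# Explicit time limit in D-HH:MM:SS (here 2 days) so the scheduler can plan.
#SBATCH -t 2-00:00:00
# Explicit memory request in MB.
#SBATCH --mem=4000
# Hypothetical job name.
#SBATCH -J example_job

# Replace this line with the real workload.
echo "job started on $(uname -n)"
```

The same -p, -t, and --mem options work with srun for an interactive session. One plausible form, assuming the usual --pty option and a one-hour, 1000 MB request picked only for illustration, would be 'srun -p aspuru-guzik --pty -t 0-01:00 --mem 1000 /bin/bash'.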

John and others at RC are being very responsive to our concerns with the cluster even though RC is very understaffed at the moment.  We owe them a lot of thanks.

Best wishes,
-Martin & Sam

_____________________________________________
Aspuru-list mailing list
Aspuru-list@lists.fas.harvard.edu
https://lists.fas.harvard.edu/mailman/listinfo/aspuru-list




--
Alán Aspuru-Guzik | Professor of Chemistry and Chemical Biology
Harvard University | 12 Oxford Street, Room M113 | Cambridge, MA 02138
(617)-384-8188 | http://aspuru.chem.harvard.edu | http://about.me/aspuru