Dear all,
This is fantastic news. I am glad to hear that we are making progress on
the scheduling issues and figuring out what was going on. I am so glad you
guys are working with RC on this!
Alan
On Tuesday, July 29, 2014, Martin Blood-Forsythe <martin.bloodforsythe(a)gmail.com> wrote:
John has now implemented a DefaultTime of 7 days on
our partition.
MaxTime is still unlimited, so this is not an enforced cap.
$ scontrol show partition aspuru-guzik | grep Time
DefaultTime=7-00:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Hopefully these changes will improve the non-superuser experience on our
partition.
Best wishes,
-Martin
Martin A. Blood-Forsythe
On Tue, Jul 29, 2014 at 1:33 PM, Sam Blau <samblau1(a)gmail.com> wrote:
Dear colleagues,
Since not everyone was able to make it to last week's meeting with RC we
wanted to write up a short summary of what was discussed.
1. The primary issue is that our group is the heaviest user of the
Odyssey cluster (we're using ~14% of it today). This was causing the
fairshare number for every member of the group to be set to zero. Thus our
partition was acting as a 'first in, first out' queue, even though we have
some 'power users' and some 'light users' in the group.
2. Some users in the group had accidentally been assigned the status of
'parent' for the calculation of their fairshare. This was messing up the
calculation of fairshare numbers.
To resolve points 1 & 2, John Brunelle bumped the Aspuru-Guzik lab's 'raw
shares' number to 1000 and set each AAG lab member's raw share number to
100, whereas other labs have a 'raw share' of 100 and user status of
'parent'. This effectively gives us enough decimal places in the fairshare
calculation to resolve differences within our group, while keeping our
fairshare number low enough that we are not treated too unfairly in the
general queue.
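As a sanity check on the new numbers, anyone can inspect their own shares and fairshare factor with Slurm's sshare utility. A minimal sketch (the account name below is a placeholder, not the real Slurm account name):

```shell
# Show raw shares and computed fairshare factors for all users in our
# account. 'aspuru_lab' is illustrative -- substitute the actual account name.
sshare -A aspuru_lab -a

# Or show just your own association:
sshare -u $USER
```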
3. Since the 'aspuru-guzik' partition is an owned resource, there is no
default time-limit as there is with the 'general' partition. Failing to
specify a time limit makes life difficult for the scheduler. We proposed
having a default time limit of 7 days for our partition. John is
investigating if it is possible to implement a 'default' time limit. This
will be a default only, not a cap; i.e., if you specify a -t limit greater
than 7 days, that will be allowed. It is just meant to safeguard against
lazy usage where someone forgets to specify -t. In the meantime, please
continue to do your best to specify time and memory limits for your jobs,
as it will maximize our usage of this shared resource. At present even
setting a time limit of six months is better for the scheduler than not
setting any time limit.
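For reference, a minimal batch script that sets both limits explicitly might look like the following (the job name, time, and memory values are only illustrative; adjust them to your actual job):

```shell
#!/bin/bash
#SBATCH -J my_calculation        # illustrative job name
#SBATCH -p aspuru-guzik          # our partition
#SBATCH -t 3-00:00:00            # wall-time limit: 3 days (adjust as needed)
#SBATCH --mem=4000               # memory in MB (adjust as needed)
#SBATCH -n 1                     # number of tasks

./run_my_job                     # placeholder for the actual executable
```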
4. We don't have a testing queue, but it is possible to use any partition
in interactive mode, not just the -p 'interact' nodes. To do this on the
aspuru-guzik partition you can run 'srun -p aspuru-guzik ...' just as you
would launch interactive jobs on the interactive partition nodes. It is
usually pretty easy to get time on the interactive nodes, so those at the
meeting felt there was no need at this time to set up a separate
aspuru-guzik test queue.
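As a concrete sketch (the time, memory, and shell choices here are only illustrative), an interactive session on our partition could be requested with:

```shell
# Request a 2-hour, 4 GB interactive shell on one core of our partition.
# Adjust -t, --mem, and -n to what your test actually needs.
srun -p aspuru-guzik -n 1 -t 0-02:00 --mem=4000 --pty bash
```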
John and others at RC are being very responsive to our concerns with the
cluster even though RC is very understaffed at the moment. We owe them a
lot of thanks.
Best wishes,
-Martin & Sam
_____________________________________________
Aspuru-list mailing list
Aspuru-list(a)lists.fas.harvard.edu
https://lists.fas.harvard.edu/mailman/listinfo/aspuru-list
--
Alán Aspuru-Guzik | Professor of Chemistry and Chemical Biology
Harvard University | 12 Oxford Street, Room M113 | Cambridge, MA 02138
(617)-384-8188 |