Okay, I know this thread has run longer than most people's interest (and I promise this is my last email), but Salvatore has done some excellent sleuthing and determined the exact formula (including our particular parameters) by which our jobs are given priority. Anyone who runs on the cluster should find this extremely relevant, and it can form the basis for in-person discussions (as suggested by Tim) on refining the cluster setup to general satisfaction:

-------- Original Message --------
Subject: Re: [Aspuru-Guzik group list] Queue in Odyssey
Date: Fri, 18 Jul 2014 12:30:18 -0400
From: Salvatore Mandrà <salvatore.mandra@gmail.com>
To: Jarrod <jarrod.mcc@gmail.com>


Backfill functionality is a separate issue from whether the primary scheduler is FIFO (basic) or fairshare (multifactor). Are you able to check that as well?

Sure!

You were right, the multifactor option is activated:

$  cat /etc/slurm/slurm.conf | grep PriorityType

PriorityType=priority/multifactor

Looking at the documentation, the ranking of a job is defined as:

Job_priority =
(PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor)

All of the factors in this formula are floating point numbers that range from 0.0 to 1.0.

In our case:

PriorityWeightAge=1000
PriorityWeightFairshare=20000000
PriorityWeightJobSize=0
PriorityWeightPartition=100000000
PriorityWeightQOS=1000000000

where (https://computing.llnl.gov/linux/slurm/priority_multifactor.html#mfjppintro)

Age: the length of time a job has been waiting in the queue, eligible to be scheduled
Fair-share: the difference between the portion of the computing resource that has been promised and the amount of resources that has been consumed
Job size: the number of nodes a job is allocated
Partition: a factor associated with each node partition
QOS: a factor associated with each Quality Of Service

I guess that the job-dependent factors are age, fair-share and job size (while the partition and QOS factors are job-independent). As you can see, age counts for very little: its weight (1000) is dwarfed by the fair-share weight (20000000), and job size does not count at all because its weight is 0.
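
To make the weighting concrete, here is a minimal Python sketch of the formula above using the weights quoted from our configuration. The function name job_priority and the factor values for the two example jobs are made up for illustration (they are not read from the cluster), and the sketch only does the arithmetic of the documented formula; it is not how SLURM itself is implemented.

```python
# Weights as quoted in this thread from our slurm.conf.
WEIGHTS = {
    "age": 1_000,
    "fairshare": 20_000_000,
    "jobsize": 0,
    "partition": 100_000_000,
    "qos": 1_000_000_000,
}


def job_priority(factors, weights=WEIGHTS):
    """Weighted sum of the priority factors; each factor must lie in [0.0, 1.0].

    Factors missing from `factors` are treated as 0.0.
    """
    for name, value in factors.items():
        if name not in weights:
            raise KeyError(f"unknown factor: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} factor must be within [0.0, 1.0], got {value}")
    return sum(weight * factors.get(name, 0.0) for name, weight in weights.items())


# Hypothetical jobs in the same partition and QOS (those terms are equal and
# therefore left out): one has waited long enough to reach the maximum age
# factor, the other was just submitted by a user with a better fair-share factor.
old_job = job_priority({"age": 1.0, "fairshare": 0.25})
new_job = job_priority({"age": 0.0, "fairshare": 0.50})
print(old_job, new_job, new_job > old_job)

# Fair-share difference that is worth as much as the entire age term.
break_even = WEIGHTS["age"] / WEIGHTS["fairshare"]
print(break_even)
```

With these weights, a fair-share difference of 1000/20000000 = 0.00005 is already worth as much as the maximum possible age contribution.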

---------------------------------------------------------

Some analysis:

Age Factor

The age factor represents the length of time a job has been sitting in the queue and eligible to run. In general, the longer a job waits in the queue, the larger its age factor grows. However, the age factor for a dependent job will not change while it waits for the job it depends on to complete. Also, the age factor will not change when scheduling is withheld for a job whose node or time limits exceed the cluster's current limits.

At some configurable length of time (PriorityMaxAge), the age factor will max out to 1.0.

In our case, PriorityMaxAge = 7-0. This means that after 7 days (am I right?), a job gets an age factor of 1.0.
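
As a rough illustration, the age factor can be sketched in Python as below. Two assumptions go beyond what is quoted above: that "7-0" means 7 days and 0 hours, and that the factor grows linearly with the eligible waiting time until it saturates (the quoted text only says it maxes out at 1.0). The name age_factor is hypothetical.

```python
from datetime import timedelta

# Assumption: "7-0" is read as 7 days, 0 hours.
PRIORITY_MAX_AGE = timedelta(days=7)


def age_factor(eligible_wait, max_age=PRIORITY_MAX_AGE):
    """Age factor in [0.0, 1.0], assuming linear growth up to max_age.

    `eligible_wait` is the time the job has spent in the queue while
    eligible to run; time spent held back (for example, waiting on a
    dependency) is not supposed to be included.
    """
    if eligible_wait <= timedelta(0):
        return 0.0
    return min(eligible_wait / max_age, 1.0)


print(age_factor(timedelta(days=3, hours=12)))  # half of the 7-day window
print(age_factor(timedelta(days=10)))           # saturated
```

Even a saturated age factor contributes only PriorityWeightAge = 1000 to the job priority.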

Fair-share Factor

The fair-share component to a job's priority influences the order in which a user's queued jobs are scheduled to run based on the portion of the computing resources they have been allocated and the resources their jobs have already consumed. The fair-share factor does not involve a fixed allotment, whereby a user's access to a machine is cut off once that allotment is reached. Instead, the fair-share factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.

SLURM's fair-share factor is a floating point number between 0.0 and 1.0 that reflects the shares of a computing resource that a user has been allocated and the amount of computing resources the user's jobs have consumed. The higher the value, the higher is the placement in the queue of jobs waiting to be scheduled.

The computing resource is currently defined to be computing cycles delivered by a machine in the units of processor*seconds. Future versions of the fair-share factor may additionally include a memory integral component.

---------------------------------------------------------

Since the age weight is really small compared to the fair-share weight, jobs with a larger fair-share factor can be served before older jobs.

Cheers!

S