Computer Science Colloquium, 2003-2004

Edi Shmueli
Department of Computer Science, University of Haifa
March 31st, 2004

Backfilling with Lookahead to Optimize the Performance of Parallel Job Scheduling

A distributed-memory parallel machine consists of a set of processors, each associated with a private memory, which are connected using a fast network. A parallel job is composed of a number of concurrently executing processes communication using messages. To execute such a job, its processes are mapped to a set of processors using a one-to-one mapping. These processors remain dedicated to running this job until it terminates.

The scheduler is the software component which manages the machine. It typically maintains a queue of waiting jobs. It scans the queue and decides which job(s) to start based on the current available resources. This decision has a crucial effect on the performance of the machine.

I will begin this lecture with an introduction to parallel job scheduling. I will give examples and focus on backfilling algorithms - a class of commonly used scheduling algorithms which attempt to increase the machine utilization.

I will then present "LOS" - an acronym for "Lookahead Optimizing Scheduler". LOS is a scheduler that I developed as part of my M.A thesis with the supervision of Dr. Dror G. Feitelson of the Hebew University and Prof. Alek Vainshtein. Unlike traditional backfilling algorithms which consider the queue one job at a time, LOS uses dynamic programming to look deeper into the queue and to select a set of jobs, which together will maximize the machine utilization. I will present the simulation results of LOS, compare them to the results of traditional backfilling algorithms, and show that LOS outperforms these algorithms with respect to a wide set of performance metrics.

Shuly Wintner

Last modified: Wed Mar 24 18:20:00 IST 2004