Load Balance and Parallel Performance
Abstract
Load balancing an application's workload among threads is critical to performance. The key objective of load balancing is to minimize thread idle time. Sharing the workload equally across all threads, with minimal work-sharing overhead, wastes fewer cycles on idle threads that are not advancing the computation and thereby improves performance. However, achieving perfect load balance is non-trivial; it depends on the parallelism within the application, the workload, the number of threads, the load-balancing policy, and the threading implementation.
This article is part of the larger series, "Intel Guide for Developing Multithreaded Applications," which provides guidelines for developing efficient multithreaded applications for Intel® platforms.
Background
A core that sits idle during computation is a wasted resource; whenever useful parallel work could be running on that core, the idleness increases the overall execution time of a threaded application. This idleness can have many different causes, such as waiting on memory fetches or I/O. While it may not be possible to keep every core busy at all times, there are measures programmers can apply to reduce idle time, such as overlapped I/O, memory prefetching, and reordering data access patterns for better cache utilization.
Similarly, idle threads are wasted resources in multithreaded executions. An unequal amount of work being assigned to threads results in a condition known as a "load imbalance." The greater the imbalance, the more threads will remain idle and the greater the time needed to complete the computation. The more equitable the distribution of computational tasks to available threads, the lower the overall execution time will be.
As an example, consider a set of twelve independent tasks with the following set of execution times: {10, 6, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1}. Assuming that four threads are available for computing this set of tasks, a simple method of task assignment would be to schedule each thread with three total tasks distributed in order. Thus, Thread 0 would be assigned work totaling 20 time units (10+6+4), Thread 1 would require 8 time units (4+2+2), Thread 2 would require 5 time units (2+2+1), and Thread 3 would be able to execute the three tasks assigned in only 3 time units (1+1+1). Figure 1(a) illustrates this distribution of work and shows that the overall execution time for these twelve tasks would be 20 time units (time runs from top to bottom).
Figure 1. Examples of task distribution among four threads.
A better distribution of work would have been Thread 0: {10}, Thread 1: {4, 2, 1, 1}, Thread 2: {6, 1, 1}, and Thread 3: {4, 2, 2, 2}, as shown in Figure 1(b). This schedule would take only 10 time units to complete, with only two of the four threads idle for 2 time units each.
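The effect of the two schedules can be checked with a few lines of code. The small program below (an illustration added here, not part of the original figure) computes each schedule's finishing time as the largest per-thread sum of task times:

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    // Task assignments from Figure 1: (a) naive in-order, (b) balanced.
    std::vector<std::vector<int>> naive = {
        {10, 6, 4}, {4, 2, 2}, {2, 2, 1}, {1, 1, 1}};
    std::vector<std::vector<int>> better = {
        {10}, {4, 2, 1, 1}, {6, 1, 1}, {4, 2, 2, 2}};

    for (const auto* schedule : {&naive, &better}) {
        int makespan = 0;  // the busiest thread's total time
        for (const auto& tasks : *schedule) {
            int total = 0;
            for (int t : tasks) total += t;
            makespan = std::max(makespan, total);
        }
        std::printf("overall execution time = %d time units\n", makespan);
    }
    return 0;
}

Running this prints 20 time units for the naive in-order assignment and 10 for the balanced one.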
Advice
For the case when all tasks are the same length, a simple static division of tasks among available threads (splitting the total number of tasks into nearly equal-sized groups, one group per thread) is an easy and equitable solution. In the general case, however, even when all task lengths are known in advance, finding an optimal, balanced assignment of tasks to threads is an intractable problem. When the lengths of individual tasks vary, a more dynamic division of tasks among threads is the better solution.
The OpenMP* iterative worksharing construct typically defaults to static scheduling of iterations onto threads (if not, static scheduling can be specified with the schedule clause). When the workload varies among iterations and the pattern is unpredictable, dynamic scheduling of iterations to threads can better balance the load. Two scheduling alternatives, dynamic and guided, are specified through the schedule clause. Under dynamic scheduling, chunks of iterations are assigned to threads; when a thread finishes its assigned chunk, it requests a new one. The optional chunk argument of the schedule clause denotes the fixed size of iteration chunks for dynamic scheduling.
#pragma omp parallel for schedule(dynamic, 5)
for (i = 0; i < n; i++)
{
    unknown_amount_of_work(i);
}
Guided scheduling initially assigns large chunks of iterations to threads; the number of iterations given to a requesting thread shrinks as the set of unassigned iterations decreases. Because of this assignment pattern, guided scheduling tends to incur less overhead than dynamic scheduling. The optional chunk argument of the schedule clause denotes the minimum number of iterations in a chunk to be assigned under guided scheduling.
#pragma omp parallel for schedule(guided, 8)
for (i = 0; i < n; i++)
{
    uneven_amount_of_work(i);
}
A special case is when the workload between iterations is monotonically increasing (or decreasing). For example, the number of elements per row in a lower triangular matrix increases in a regular pattern. For such cases, setting a relatively low chunk size (to create a large number of chunks/tasks) with static scheduling may provide an adequate amount of load balance without the overheads needed for dynamic or guided scheduling.
#pragma omp parallel for schedule(static, 4)
for (i = 0; i < n; i++)
{
    process_lower_triangular_row(i);
}
When the best choice of schedule is not apparent, the runtime schedule allows the schedule type and chunk size to be changed as desired without recompiling the program.
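For example, declaring the loop with the runtime schedule (the loop body name below is a placeholder) defers the scheduling decision to execution time:

#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; i++)
{
    varying_amount_of_work(i);
}

The schedule type and chunk size are then taken from the OMP_SCHEDULE environment variable, so setting, say, OMP_SCHEDULE="guided,8" before launching the program selects guided scheduling with a minimum chunk of 8 iterations, with no recompilation required.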
When using the parallel_for algorithm from Intel® Threading Building Blocks (Intel® TBB), the scheduler divides the iteration space into small tasks that are assigned to threads. If some iterations take longer to compute than others, the Intel TBB scheduler is able to dynamically "steal" tasks from busy threads in order to achieve a better workload balance among threads.
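As a minimal sketch of this approach (assuming C++11 and a current Intel TBB installation; the work function is a placeholder, not part of the original article):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void unknown_amount_of_work(int /*i*/) { /* per-iteration work goes here */ }

int main()
{
    const int n = 1000;
    // The scheduler splits [0, n) into subranges; idle threads steal
    // subranges from busy ones to keep the load balanced.
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i)
                unknown_amount_of_work(i);
        });
    return 0;
}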
Explicit threading models (e.g., Windows* threads, Pthreads*, and Java* threads) do not have any means to automatically schedule a set of independent tasks to threads. When needed, such capability must be programmed into the application. Static scheduling of tasks is a straightforward exercise. For dynamic scheduling, two related methods are easily implemented: Producer/Consumer and Boss/Worker. In the former, one thread (Producer) places tasks into a shared queue structure while the Consumer threads remove tasks to be processed, as needed. While not strictly necessary, the Producer/Consumer model is often used when there is some pre-processing to be done before tasks are made available to Consumer threads.
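A minimal Producer/Consumer sketch might look like the following; C++11 threads are used here for brevity rather than the native Windows or Pthreads APIs, and the task type (a plain int) and process_task function are placeholders for application-specific work:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<int> tasks;       // shared task queue
std::mutex m;
std::condition_variable cv;
bool done = false;           // set when the Producer has finished

void process_task(int) { /* real work here */ }

void producer(int n)
{
    for (int i = 0; i < n; ++i) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(i); }
        cv.notify_one();     // wake one waiting Consumer
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();         // let all Consumers observe completion
}

void consumer()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !tasks.empty() || done; });
        if (tasks.empty()) return;          // no more work is coming
        int t = tasks.front(); tasks.pop();
        lock.unlock();                      // process outside the lock
        process_task(t);
    }
}

int main()
{
    std::thread boss(producer, 100);
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(consumer);
    boss.join();
    for (auto& w : workers) w.join();
}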
Under the Boss/Worker model, Worker threads rendezvous with the Boss thread whenever more work is needed, to receive assignments directly. In situations where the delineation of a task is very simple, such as a range of indices to an array of data for processing, a global counter with proper synchronization can be used in place of a separate Boss thread. That is, Worker threads access the current value and adjust (likely increment) the counter for the next thread requesting additional work.
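A sketch of this counter-based variant follows, again using C++11 threads with an atomic as the shared counter; the chunk size and the process_element function are illustrative choices, not from the original article:

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> next_index(0);
const int CHUNK = 16;   // indices claimed per request; tune to the workload

void process_element(int) { /* real per-index work here */ }

void worker(int n)
{
    for (;;) {
        // Atomically claim the next block of CHUNK indices.
        int start = next_index.fetch_add(CHUNK);
        if (start >= n) return;             // all work has been claimed
        int end = std::min(start + CHUNK, n);
        for (int i = start; i < end; ++i)
            process_element(i);
    }
}

int main()
{
    const int n = 1000;
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(worker, n);
    for (auto& w : workers) w.join();
}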
Whatever task scheduling model is used, consideration must be given to using the correct number and mix of threads to ensure that threads tasked to perform the required computations are not left idle. For example, if Consumer threads stand idle at times, a reduction in the number of Consumers or an additional Producer thread may be needed. The appropriate solution will depend on algorithmic considerations as well as the number and length of tasks to be assigned.
Usage Guidelines
Any dynamic task scheduling method will entail some overhead as a result of parceling out tasks. Bundling small independent tasks together as a single unit of assignable work can reduce this overhead; correspondingly, if using OpenMP schedule clauses, set a non-default chunk size that will be the minimum number of iterations within a task. The best choice for how much computation constitutes a task will be based on the computation to be done as well as the number of threads and other resources available at execution time.
Additional Resources
Intel® Software Network Parallel Programming Community
Clay Breshears, The Art of Concurrency, O'Reilly Media, Inc., 2009.
Barbara Chapman, Gabriele Jost, and Ruud van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, The MIT Press, 2007.
Intel® Threading Building Blocks
Intel Threading Building Blocks for Open Source
James Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism, O'Reilly Media, Inc., 2007.
M. Ben-Ari, Principles of Concurrent and Distributed Programming, Second Edition, Addison-Wesley, 2006.