Running Stream-like Programs on Heterogeneous Multi-core Systems

The thesis is also available in PDF format.


Paul Carpenter

Abstract

All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming and error-prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole.

Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program’s structure and data layout can be safely and reliably modified by high-level compiler transformations.

One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion.

Kernel unrolling coarsens granularity by batching up work into larger chunks, reducing overheads and potentially enabling data reuse and vectorisation. Kernel fusion combines multiple kernels into a single piece of code, which has two benefits. First, it can be used to match the number of kernels to the number of processors, required for static scheduling. Second, it coarsens granularity, which amortises overhead in a dynamic scheduler. Kernel unrolling and fusion are relatively straightforward to apply when the program is represented as a stream graph, even though they imply extensive changes to memory allocation and program control.
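The effect of these two transformations can be illustrated with a minimal Python sketch. It is not taken from the thesis or its compiler: the kernel type, the toy kernels `scale` and `offset`, and the helpers `fuse` and `run` are hypothetical names chosen for illustration, and a real stream compiler would apply the transformations to generated code rather than to closures.

```python
from typing import Callable, Iterable, List, Tuple

Chunk = List[float]
Kernel = Callable[[Chunk], Chunk]  # hypothetical kernel type: chunk in, chunk out


def scale(chunk: Chunk) -> Chunk:
    """Toy kernel: double every item."""
    return [2.0 * x for x in chunk]


def offset(chunk: Chunk) -> Chunk:
    """Toy kernel: add one to every item."""
    return [x + 1.0 for x in chunk]


def fuse(first: Kernel, second: Kernel) -> Kernel:
    """Kernel fusion: a single kernel that applies `first` then `second`."""
    def fused(chunk: Chunk) -> Chunk:
        return second(first(chunk))
    return fused


def run(kernel: Kernel, stream: Iterable[float], unroll: int = 1) -> Tuple[Chunk, int]:
    """Kernel unrolling: hand the kernel `unroll` items per invocation.

    Returns the output items and the number of kernel invocations.
    """
    outputs: Chunk = []
    invocations = 0
    chunk: Chunk = []
    for item in stream:
        chunk.append(item)
        if len(chunk) == unroll:
            outputs.extend(kernel(chunk))
            invocations += 1
            chunk = []
    if chunk:  # flush a final partial chunk
        outputs.extend(kernel(chunk))
        invocations += 1
    return outputs, invocations
```

In this sketch, `run(fuse(scale, offset), range(8), unroll=4)` processes eight items in two invocations of one fused kernel, where the unfused, non-unrolled program would need sixteen invocations of two kernels.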

This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused, in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. The algorithm also takes account of downstream compiler passes—specifically software pipelining and buffer allocation—and it models the compiler’s ability to fuse kernels. The latter is important because the compiler may not be able to fuse arbitrary collections of kernels.
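To give a feel for the partitioning problem, the sketch below solves a much simplified version of it. It is not the heuristic developed in the thesis. It assumes that the stream graph is a simple pipeline, that only adjacent kernels may be fused, that each kernel has a known cost, and that communication, software pipelining and buffer allocation can be ignored. The function name `partition_pipeline` is hypothetical.

```python
from typing import List, Sequence, Tuple


def partition_pipeline(costs: Sequence[int], processors: int) -> Tuple[List[List[int]], int]:
    """Split a pipeline of kernels into `processors` contiguous groups.

    Minimises the load of the most heavily loaded group (the bottleneck)
    by dynamic programming. Assumes 1 <= processors <= len(costs).
    Returns the groups (lists of kernel indices) and the bottleneck load.
    """
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    inf = float("inf")
    # best[p][i]: smallest bottleneck when the first i kernels form p groups
    best = [[inf] * (n + 1) for _ in range(processors + 1)]
    cut = [[0] * (n + 1) for _ in range(processors + 1)]
    best[0][0] = 0
    for p in range(1, processors + 1):
        for i in range(1, n + 1):
            for j in range(p - 1, i):
                load = max(best[p - 1][j], prefix[i] - prefix[j])
                if load < best[p][i]:
                    best[p][i] = load
                    cut[p][i] = j

    # Walk the recorded cut points backwards to recover the groups.
    groups: List[List[int]] = []
    i = n
    for p in range(processors, 0, -1):
        j = cut[p][i]
        groups.append(list(range(j, i)))
        i = j
    groups.reverse()
    return groups, best[processors][n]
```

For kernel costs `[4, 2, 3, 5, 1, 3]` on three processors, this sketch fuses the kernels pairwise into `[[0, 1], [2, 3], [4, 5]]` with a bottleneck load of 8. A realistic partitioner must also account for general graph shapes, interconnect load, and restrictions on which kernels can be fused, which is where a heuristic becomes necessary.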

This thesis also introduces a static queue sizing algorithm. This algorithm is important when memory is distributed, especially when local stores are small. The algorithm takes account of latencies and variations in computation time, and is constrained by the sizes of the local memories.
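Why queue sizes interact with variable computation times can be seen in a small model. The sketch below is an illustrative assumption, not the algorithm from the thesis: it models a single producer and a single consumer joined by a queue of `slots` entries, where the producer cannot start an item until a slot is free and a slot is freed only when the consumer finishes the item held in it. The function names `makespan` and `smallest_queue` are hypothetical.

```python
from typing import Sequence


def makespan(produce: Sequence[int], consume: Sequence[int],
             slots: int, latency: int = 0) -> int:
    """Completion time of a producer/consumer pair with a bounded queue.

    produce[i] and consume[i] are the compute times for item i.
    Assumes slots >= 1.
    """
    prod_done = []  # time at which the producer finishes item i
    cons_done = []  # time at which the consumer finishes item i
    for i, (p, c) in enumerate(zip(produce, consume)):
        prev_prod = prod_done[-1] if prod_done else 0
        # The producer needs a free slot: item i - slots must be consumed.
        slot_free = cons_done[i - slots] if i >= slots else 0
        prod_done.append(max(prev_prod, slot_free) + p)
        prev_cons = cons_done[-1] if cons_done else 0
        cons_done.append(max(prev_cons, prod_done[i] + latency) + c)
    return cons_done[-1] if cons_done else 0


def smallest_queue(produce: Sequence[int], consume: Sequence[int],
                   max_slots: int, latency: int = 0) -> int:
    """Smallest queue size, up to the memory limit `max_slots`, that
    achieves the best makespan attainable within that limit."""
    times = {k: makespan(produce, consume, k, latency)
             for k in range(1, max_slots + 1)}
    best = min(times.values())
    return min(k for k, t in times.items() if t == best)
```

With producer times `[1, 1, 4, 1, 1, 4]` and consumer times of 2 per item, this model gives a makespan of 24 with one slot, 17 with two, and 14 with three or more, so three slots suffice if the local store has room for them. If the memory limit allows only two slots, the stall cannot be avoided in this model.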

The second part of this thesis is concerned with dynamic scheduling of stream programs. First, it investigates the performance of known online, non-preemptive, non-clairvoyant dynamic schedulers. Second, it proposes two dynamic schedulers for stream programs. The first is specifically for one-dimensional stream programs. The second is more general: it does not need to be told the stream graph, but it has slightly larger overhead.
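The terms online, non-preemptive and non-clairvoyant can be made concrete with a generic greedy list scheduler. The sketch below is a textbook-style baseline and is not either of the schedulers proposed in the thesis. The function name `list_schedule` and the task names are hypothetical, and the task durations are used only by the simulation to advance the clock, never to choose which task to run.

```python
import heapq
from typing import Dict, List, Set, Tuple


def list_schedule(durations: Dict[str, int], deps: Dict[str, Set[str]],
                  workers: int) -> Tuple[Dict[str, int], int]:
    """Simulate a greedy first-come-first-served scheduler.

    Online: a task is considered only once its predecessors have finished.
    Non-preemptive: a started task runs to completion.
    Non-clairvoyant: the dispatch order ignores task durations.
    Assumes workers >= 1. Returns task start times and the makespan.
    """
    remaining = {t: len(deps.get(t, ())) for t in durations}
    successors: Dict[str, List[str]] = {t: [] for t in durations}
    for task, preds in deps.items():
        for pred in preds:
            successors[pred].append(task)

    ready = sorted(t for t, n in remaining.items() if n == 0)  # FIFO queue
    running: List[Tuple[int, str]] = []  # heap of (finish time, task)
    idle = workers
    now = 0
    start_times: Dict[str, int] = {}

    while ready or running:
        # Dispatch ready tasks to idle workers in arrival order.
        while ready and idle > 0:
            task = ready.pop(0)
            start_times[task] = now
            heapq.heappush(running, (now + durations[task], task))
            idle -= 1
        # Advance the clock to the next task completion.
        now, task = heapq.heappop(running)
        idle += 1
        for succ in successors[task]:
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)
    return start_times, now
```

For four tasks with durations `a=3, b=1, c=2, d=1`, where `c` depends on `a` and `b` and `d` depends on `c`, the simulation finishes at time 6 on two workers and at time 7 on one.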

This thesis also introduces some support tools related to stream programming. StarssCheck is a debugging tool, based on Valgrind, for the StarSs task-parallel programming language. It generates a warning whenever the program’s behaviour contradicts a pragma annotation. Such behaviour could otherwise lead to exceptions or race conditions. StreamIt to OmpSs is a tool that converts a streaming program in the StreamIt language into a dynamically scheduled task-based program using StarSs.

The main contributions of this thesis are:

The Abstract Streaming Machine (ASM), a machine model and coarse-grain simulator for a statically scheduled stream compiler.
A new partitioning heuristic for stream programs, which balances the load across the target, including processors and communication links. It considers its effect on downstream passes, and it models the compiler’s ability to fuse kernels.
Two static queue sizing algorithms for stream programs, which determine the sizes of the buffers used to implement streams. The optimal buffer sizes are affected by latency and variability in computation costs, and are constrained by the sizes of local memories, which may be small.
Two new low-complexity adaptive dynamic scheduling algorithms for stream-like programs.
StarssCheck, a debugging tool for StarSs. This tool checks memory accesses performed by tasks and the main thread, and warns if the StarSs pragmas are incorrect.

