
Chapter 3  Compile-time Decisions

The stream compiler takes the human-readable source code, written in a stream programming language, and generates an executable that runs efficiently on the target machine. The compiler applies optimising transformations, including those described in this chapter.

The first high-level transformation enabled by stream programming is unrolling. Unrolling batches up work to amortise overheads in computation and communication. It also enables vectorisation [LA00] and data reuse. Automatically determining the unroll factors was not considered part of this thesis. MacroSS [HCW+10] is one technique that unrolls kernels to enable SIMD vectorisation. The unroll factors for the benchmarks in this chapter and Chapter 4 were set manually.

The next transformation is static partitioning, which decides which kernels should be fused together, and on which processors they should be executed. Clearly a good partitioning algorithm is crucial to high performance. A bad partition may be poorly balanced, loading most of the work onto one processor; or it may be well balanced but imply an excessive amount of communication back and forth between the processors, in which case the communication links could become the bottleneck.

The final step is to statically allocate the buffers. This is an important problem, because it affects performance, as explained in Section 3.1, especially when computation times and communication rates are variable. Some platforms with distributed memory provide each core with very little addressable memory. For example, the Cell B.E. has just 256KB of local store per SPE, which must contain all code and data. On such platforms, it is important to allocate memory carefully.

3.1  Motivation

The stream program is built from kernels, which communicate through one-way channels known as streams. The programmer should be encouraged to write many small kernels, knowing that kernels on the same processor will be fused together, and that performance will be as good as if the programmer had written the fused kernels by hand.

The job of the partitioning algorithm is to decide which kernels should be fused together, and on which processors they should be executed. The goal is performance, taking account of constraints from the compiler. The partitioning algorithm optimises performance by trying to balance the load equally among the processors.

The choice of partition also affects performance indirectly through its effect on software pipelining and buffer allocation. Regarding the former, the partition may imply an excessively long software pipeline. Regarding the latter, in the worst case there may not be enough memory in local stores for the buffers implied by the partition, so the partition may not be realisable. Or there may only be space for small buffers, which are not sufficient to cover latencies and short-term variation.


(a) Convex, loosely connected partition
(b) Unrestricted partition
(c) Convex, strictly connected
Figure 3.1: Example partitions of the StreamIt filterbank benchmark onto a 3-core SMP. Each node is a kernel, each colour is a task, and each edge is a stream


(a) Convex partition from heuristic (time: 1.00)

(b) Unrestricted without pipelining (time: 2.11)

(c) Unrestricted partition with pipelining (time: 1.00)

Figure 3.2: Traces for five iterations of filterbank, scheduled using SGMS. Iterations are identified using shades of grey.


Figure 3.1 shows three partitions for the StreamIt filterbank benchmark on a 3-core SMP. Each processor has a single task containing the kernels of its colour. Figure 3.1(a) is the partition generated by the heuristic in Section 3.2. Data flow is from processor p1 (black) to p2 (grey) and p3 (white), and from p2 to p3—an acyclic graph. Figure 3.2(a) shows an execution trace, with shades of grey corresponding to five iterations.

Figure 3.1(b) shows a partition that would be optimal, ignoring the cost of software pipelining. This partition requires software pipelining, since otherwise, as shown in the trace in Figure 3.2(b), there are many stalls where dependencies prevent computation from being overlapped; throughput is 53% lower than before. Figure 3.2(c) is pipelined using the stage assignment phase from the SGMS algorithm [KM08]. It has 0.2% higher throughput than the convex partition, but due to startup overhead would break even only after 8,000 iterations.

The partitioning problem, even excluding its indirect effects and communications costs, is NP-hard, so it can only be solved using heuristics. Section 3.2 describes a new heuristic for the partitioning problem. It considers the loads on the processors and buses, considers its effect on downstream passes, and models the compiler’s ability to fuse kernels.

3.1.1  Convexity

A partition is convex if the graph of dependencies between tasks is acyclic. Equivalently, every directed path between two kernels in the same task is internal to that task. The convexity constraint is intended to avoid long software pipelines. As illustrated in the previous section, a partitioning algorithm unaware of the cost of pipelining may require long pipelines for a small increase in throughput. The optimal unrestricted partition for the StreamIt 2.1.1 serpent benchmark [GTA06] on two Cell B.E.s is 10% faster than the optimal convex partition, but it requires 209 pipeline stages rather than 31. The longer pipeline translates into higher memory use, which may simply not fit, as well as startup overhead and latency. We did not obtain the CPLEX solver to evaluate StreamRoller, but since it uses ILP to solve a similar problem, its result should be similar. Table 3.1 shows that partitions from our algorithm seldom require pipelining at all, and performance is, on average, within 5% of optimum.

When the benefit from software pipelining is above some threshold, the algorithm relaxes connectedness and convexity. Section 3.2.3 shows the partition of vocoder, which benefits from software pipelining. The result is close to optimal performance using a short pipeline.

3.1.2  Connectivity

The connectedness constraint is primarily to help code generation, since it is easier to fuse adjacent kernels, whose relative frequencies are known via the stream between them. Figure 3.3(a) shows a program using the SPM (Section 1.5.2). Kernels read and write perform IO, and update manages the automaton and sends only accepting states. The macros NEW_STATE and ACCEPT_STATE manage the automaton, and their precise behaviour is irrelevant to the discussion. Consider the case where the partition merges read and write into task 1, with update in task 2. This partition is not convex, so pipelining is required. Task 1 is not connected, so the compiler requires the relative frequencies of read and write.

This can be solved using dynamic scheduling inside the task, by switching between kernels when a push or a pop starts to wait. Dynamic scheduling may not be supported by the run-time, and it adds overhead and unpredictability, which are undesirable in real-time embedded systems. Chapter 4 addresses dynamic scheduling of stream programs. In the absence of a runtime dynamic scheduler, this example requires either an extra stream carrying the condition, as in Figure 3.3(b), or duplicating the calculation of the condition, plus the state on which it is based, which would duplicate the whole update kernel.

The general case requires duplicating state or creating a dependence cycle. Figure 3.4 shows an example, not using the SPM, where each push and pop is guarded by a condition; e.g. each time k1 fires, it pushes on the stream to k3 whenever a is true and pushes on the stream to k4 whenever b is true. The relationship between the firing rates of any two kernels depends on all conditions on the path between them. If k2 and k3 are fused into one task, then the entire graph must be fused. This is because the task containing k2 and k3 must know the relationship between their firing rates. It therefore requires some function of e and f to be sent from k4, and this creates a directed cycle.

A naïve definition of connectivity, strict connectivity, considers a partition to be connected when each processor has a weakly connected subgraph. Unfortunately, wide split-joins, as in filterbank, do not usually have good partitions subject to this constraint. In Figure 3.1(a), p2 (grey) is not strictly connected, so our strict heuristic produces the partition in sub-figure (c), which has performance 28% worse than (a). In general, strict connectedness allows only the processors containing the split or the join kernel to have kernels from more than one branch.


(a) Source code for automaton
(b) Simplified compiled code
Figure 3.3: Motivation of connectivity: example programs with data dependent pushes and pops

We generalise connectivity by providing to the partitioning algorithm a set of basic connected sets [MB06], each of which specifies kernels that the compiler can pairwise merge. For strict connectivity, there is a basic connected set for each pair of communicating kernels.

This allows the partitioning algorithm to be adapted to the compiler and source language(s). If the compiler understands StreamIt [TKA02] splitters and joiners, there should be a basic connected set for each splitter (joiner) containing its successors (predecessors), which solves the problem outlined above. Similarly, there may also be a basic connected set covering each region of the program graph that is internally SDF.


Figure 3.4: If k2 and k3 are fused into one task, then the entire graph must be fused

3.1.3  Queue sizes

The second problem considered in this chapter is static queue sizing. Double buffering is a well-known technique to overlap communication and computation. There are two situations, however, when a stream ought to be allocated more than two buffers. The first is when a stream covers a long latency or, equivalently, crosses more than one pipeline stage boundary. The second is when there are short-duration load imbalances due to variable computation times or communication rates.

The chain8 benchmark illustrates the first situation, and is shown in the upper part of Figure 3.5. It has eight tasks in a pipeline, with streams between consecutive tasks, and another stream between the first and last tasks. Figure 3.5(a) shows the progress of the first and last tasks relative to the stream between them. The vertical axis is time, and the horizontal axis is the position in the stream. At any given time the producer is working on some interval of the stream, which it owns. It starts at the top left of the plot, at the beginning of both the stream and time, moving to the right when it sends data to the consumer, and continually downward through time. The figure also shows the progress of the consumer.

The periodic pattern of waiting is caused by the interaction between two dependencies. First, the consumer must wait for its data to arrive, which means that it waits for the producer, plus the latency of the pipeline. This gives a vertical dependency from producer to consumer. Second, the producer must wait for an empty consumer-side buffer in which to send its data, and this gives a horizontal dependency from consumer to producer. The interaction between these dependencies causes the periodic pattern of waiting.


Chain8:
(a) 2 buffers
(b) 6 buffers
Producer-consumer:
(c) 2 buffers
(d) 5 buffers
Figure 3.5: Effect of consumer queue length on chain8 and producer-consumer


Figure 3.5(b) is for six consumer-side buffers, which increases throughput by 73%, and is sufficient for the producer to be always busy. This shows that double buffering was not sufficient, but also that the number of buffers can be less than one plus the difference in pipeline stage, which is the number of buffers allocated by StreamRoller [KM08] and SPIR [CLC+09]; in this case eight.

The second situation is illustrated using the producer-consumer example in the lower part of Figure 3.5. If the producer and consumer both have fixed computation times and communication rates, then double buffering is sufficient. Sometimes, single buffering at one or other end will be enough, even with good load balancing. Figure 3.5(c) shows the progress of this example, using double buffering, when computation times are normally distributed. Increasing the number of consumer buffers to five, as shown in Figure 3.5(d), increases throughput by 20%.

The queue sizing algorithm is based on i) the stream program that has been mapped onto processors, ii) feedback from an earlier execution, and iii) the memory constraints. The algorithm exposes a trade-off between throughput and latency. It is general, in that it applies to stream programs with unstructured stream graphs, and it supports variable execution times and communication rates.

The inputs to the algorithm are the mapped stream program, a program trace and the machine description, giving the target topology and memory budgets. A simple model of computation times and communication rates, such as independent normal distributions and Poisson arrivals, may be misleading, so the only options are simulation and real execution. The experimental results use coarse-grain simulation, but real execution could be used instead. The output is the buffer size for the producer and consumer on each stream, which may be different.

The performance of the queue length assignment algorithm is quantified using the utilisation, which is the percentage of time that the most heavily loaded processor or bus is busy. Utilisation is proportional to throughput. If the stream graph is acyclic, at least one resource ought to be 100% busy. If any resource has utilisation less than 100%, it must be due to insufficient buffering.

The tradeoff between utilisation and the number of consumer buffers is illustrated in Figure 3.6. Chain has linearly increasing utilisation until it reaches 100%. Producer-consumer achieves 99% utilisation with 3 producer and 4 consumer buffers, and additional buffering yields diminishing returns.

The SPM and StreamIt languages eliminate deadlock, so the objective function depends only on performance and latency. The interaction between bounded memory and deadlock in process networks, though not its effect on performance, has been explored in depth [Par95, Buc93, GB03], and these techniques can determine the minimum buffer sizes.

The queue length assignment algorithm is iterative, and consists of a coarse-grain simulator, a cycle detection algorithm, a buffer size update algorithm, and an evaluation algorithm. The cycle detection algorithm analyses metrics from the simulator, and finds a bottleneck cycle. The buffer update algorithm chooses the initial buffer allocation, and adjusts buffer sizes to resolve the bottleneck. The evaluation algorithm monitors progress and decides when to stop, choosing the buffer allocation that achieved the best performance-latency tradeoff.


Figure 3.6: Memory-performance tradeoff


Figure 3.7: The mapping phase of the ACOTES compiler, showing the partitioning and queue sizing algorithms

3.2  Static partitioning

3.2.1  The partitioning problem

The target is represented as an undirected bipartite graph H=(V, E), where V = P ∪ I is the set of vertices, a disjoint union of processors, P, and interconnects, I; and E is the set of edges. Each processor, p, has weight, wp, equal to its clock speed in GHz, and each interconnect, u, has weight, wu, equal to its bandwidth in GB/s. The static route between processors p and q is represented by rpqu=1 if it uses interconnect u, and 0 otherwise. In general, rpqu ≠ rqpu; e.g. with dimension-order routing on a mesh. Figure 3.8 shows the topology of our example targets, omitting the edge and vertex weights and the routing table. This representation is a simplified form of the Abstract Streaming Machine (ASM).

The program is represented as a directed acyclic graph, G=(K, S), where K is the set of kernels, and S is the set of streams. If the program is cyclic, then each strongly connected component is contracted into a single vertex. The load of kernel i on processor p, denoted cip, is the mean number of gigacycles in some fixed time period τ. Similarly, the load of stream (i, j), denoted cij, is the mean number of gigabytes transferred in time τ.

The basic connected sets are a collection, C={Cj}, of subsets of K, where each Cj is a set of pairwise connected kernels. A subset L ⊆ K is connected if, for any pair of kernels k, k′ ∈ L, there is a sequence k=k1, k2, ⋯, kn=k′, with each ki ∈ L and each pair of consecutive kernels, ki and ki+1, connected by being members of some Cj. The whole set of kernels, K, should be connected.
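As a concrete illustration of this connectedness test, the following Python sketch checks whether a set of kernels is connected under a given collection of basic connected sets. It is a sketch only; the function and argument names are illustrative assumptions, not the compiler's actual interface.

```python
from collections import defaultdict, deque

def is_connected(subset, basic_connected_sets):
    """Return True iff every pair of kernels in `subset` is linked by a chain of
    kernels from `subset`, consecutive ones sharing some basic connected set C_j.
    (Illustrative helper; kernels can be any hashable names.)"""
    subset = set(subset)
    if len(subset) <= 1:
        return True
    # Two kernels in `subset` are adjacent if they appear in the same C_j.
    adj = defaultdict(set)
    for cj in basic_connected_sets:
        members = [k for k in cj if k in subset]
        for a in members:
            adj[a].update(m for m in members if m != a)
    # BFS from an arbitrary kernel; connected iff everything is reached.
    start = next(iter(subset))
    seen, frontier = {start}, deque([start])
    while frontier:
        k = frontier.popleft()
        for n in adj[k] - seen:
            seen.add(n)
            frontier.append(n)
    return seen == subset

# Example: strict connectivity has one basic set per communicating pair.
# is_connected({"read", "update"}, [{"read", "update"}, {"update", "write"}])  # -> True
```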

The output of the algorithm is two map functions. Firstly, T maps kernels onto tasks, and secondly, P maps tasks onto processors. The partition implied by T must be convex, so the graph of dependencies between tasks is acyclic.

Let Tp=P−1(p) be the tasks on processor p, and Kt=T−1(t), Kp = ∪t∈Tp Kt be the kernels on task t or processor p. The graph of t is the induced graph, Gt=G(Kt), containing the kernels in t and internal streams. The task dependence graph GT is the result of contracting each task in G into a single vertex.

The cost on processor p or interconnect u is

\[
C_p = \frac{\sum_{i \in K_p} c_{ip}}{w_p}
\qquad
C_u = \frac{\sum_{p,q \in P} r_{pqu} \sum_{i \in K_p,\, j \in K_q} c_{ij}}{w_u}.
\]

The goal is to find the allocation (T,P), which minimises the maximum values of all the Cp and Cu, subject to the convexity and connectedness constraints.
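The bottleneck objective can be evaluated directly from these definitions. The Python sketch below computes the processor and interconnect costs for a candidate mapping; all argument names and dictionary layouts are assumptions for illustration, not the thesis implementation.

```python
def bottleneck_cost(T, P, c_kernel, c_stream, w_proc, w_link, route, streams):
    """Maximum of processor and interconnect costs for a candidate mapping.
    Sketch using the notation of Section 3.2.1:
      T[k]            -> task of kernel k
      P[t]            -> processor of task t
      c_kernel[k, p]  -> gigacycles of kernel k on processor p in time tau
      c_stream[i, j]  -> gigabytes carried by stream (i, j) in time tau
      w_proc[p]       -> clock speed in GHz;  w_link[u] -> bandwidth in GB/s
      route[p, q, u]  -> 1 if traffic from p to q uses interconnect u
      streams         -> iterable of (i, j) kernel pairs"""
    proc_of = {k: P[T[k]] for k in T}              # kernel -> processor
    cost = {}
    for k, p in proc_of.items():                   # C_p = sum_i c_ip / w_p
        cost[p] = cost.get(p, 0.0) + c_kernel[k, p] / w_proc[p]
    for (i, j) in streams:                          # C_u: routed stream loads / w_u
        p, q = proc_of[i], proc_of[j]
        if p == q:
            continue                                # internal stream, no bus traffic
        for u in w_link:
            if route.get((p, q, u), 0):
                cost[u] = cost.get(u, 0.0) + c_stream[i, j] / w_link[u]
    return max(cost.values())
```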

Predicting memory use of tasks

When multiple kernels are fused into one task, the algorithm needs to predict the memory use of the task, given the memory use and composition of each kernel. Finding the minimum memory use is an NP-complete problem [BML96], even ignoring the possibility of overlapping the buffers for two or more streams. Not only that, but the partitioning algorithm needs to predict, or at least bound, the memory use from the actual compiler, which is unlikely to be the theoretical minimum.

This algorithm assumes that the combined code size is the sum of the code sizes of the kernels, and that the total buffer memory is one block plus the history length for each internal stream, plus the original allocation for each external stream. This calculation is orthogonal to the rest of the algorithm.
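A minimal sketch of this memory estimate, assuming dictionary-valued per-kernel and per-stream sizes (the argument names are illustrative):

```python
def predict_task_memory(kernels, code_size, stream_block, history_len,
                        external_mem, internal_streams, external_streams):
    """Estimate the memory footprint of a fused task following the rule above:
    code sizes add up; each internal stream costs one block plus its history;
    each external stream keeps its original allocation."""
    code = sum(code_size[k] for k in kernels)
    internal = sum(stream_block[s] + history_len[s] for s in internal_streams)
    external = sum(external_mem[s] for s in external_streams)
    return code + internal + external
```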


(a) SMP with three processors
(b) 2× 2 mesh
(c) IBM QS20 with two Cell processors (SPEs only)
(d) SMP 3 with accelerator (a1)
Figure 3.8: Topology of the targets used in this section (interconnects are shown as shaded rectangles)

3.2.2  The partitioning algorithm

The partitioning algorithm is split into two phases. The first phase produces an initial partition that is both convex and connected, with at most one task per processor. The second phase, refinement, improves the initial partition, and has some ability to escape from local minima; it can also create multiple tasks per processor.

The first phase could produce a trivial initial partition, which has all kernels in a single task, assuming enough memory. Our results show that the refinement phase still finds a good partition. A good initial partition, however, decreases the total time of the mapping algorithm, since it requires fewer passes of the refinement phase.

The refinement phase uses several algorithms based on Kernighan and Lin’s graph partitioning algorithm [KL70], and is repeated until there is no further improvement. The main step offloads kernels from bottleneck processors, while maintaining connectedness and convexity. If this produces no benefit, then one additional task is created, if enabled, and the new partition is kept if the improvement is larger than some threshold (currently 5%).


(a) First level partition of the target in Figure 3.8(b)
(b) Stream program and first level partition
 
(c) Branch and bound search for first level partition
Figure 3.9: First level partition in the initial partition algorithm

Initial partition

The initial partition is generated by recursively subdividing the target and program graphs into halves, mapping each half separately. This continues until there is either a single kernel, which is mapped to some processor, or a single processor, which executes all kernels.

Partitioning the target

The algorithm first divides the target into two subgraphs, P1 and P2, and an aggregate interconnect, I, balancing two objectives: the subgraphs should have roughly equal total CPU performance, and the aggregate interconnect bandwidth between them should be low. Figure 3.9(a) shows the result of dividing the mesh target from Figure 3.8(b).

The optimal target partition is found as follows. First, the communications bottleneck for uniformly random traffic between P1 and P2 is given by

\[
C = \max_{u \in I} \frac{\sum_{p \in P_1,\, q \in P_2} \left( r_{pqu} + r_{qpu} \right)}{w_u}. \qquad (1)
\]

The target is divided into halves to maximise α, the product of C with the total performance of the less powerful of P1 or P2

\[
\alpha = C \cdot \min\left( \sum_{p \in P_1} w_p,\ \sum_{q \in P_2} w_q \right). \qquad (2)
\]

An approximate solution is found using an adaptation of the Kernighan and Lin partitioning algorithm.
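The objective for a single candidate bisection can be evaluated as follows. This Python sketch mirrors Equations (1) and (2); the argument names are assumptions, and the search over cuts (the Kernighan-Lin adaptation) is not shown.

```python
def bisection_objective(P1, P2, interconnects, w_proc, w_link, route):
    """Score one candidate split of the target into halves P1, P2 (Eqs. 1-2).
    C is the bottleneck cost of uniformly random traffic across the cut;
    alpha multiplies it by the total performance of the weaker half.
    The search keeps the split with the largest alpha, so the target is cut
    along its weakest links while keeping the halves balanced."""
    C = max(
        sum(route.get((p, q, u), 0) + route.get((q, p, u), 0)
            for p in P1 for q in P2) / w_link[u]
        for u in interconnects
    )
    alpha = C * min(sum(w_proc[p] for p in P1), sum(w_proc[q] for q in P2))
    return alpha
```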

Partitioning the program

The program (sub)graph is given edge and vertex weights. The edge weight for stream ij, denoted cij is the cost in cycles in time τ, if assigned to the aggregate interconnect, rather than internal to P1 or P2. The vertex weight for kernel i is a pair (ciP1, ciP2), the cost of assigning it to P1 or P2, respectively. The goal is to find a two-way partition {K1, K2} to minimise the bottleneck given by

\[
c = \max\left( \sum_{i \in K_1} c_{iP_1},\ \sum_{j \in K_2} c_{jP_2},\ \sum_{i \in K_1,\, j \in K_2} c_{ij} \right). \qquad (3)
\]

The partitioning algorithm is a branch and bound search. Each node in the search tree inherits a partial partition (K1, K2), and unassigned vertices X; at the root K1=K2=∅ and X=K. It chooses some kernel v ∈ X, adjacent to K1 with K1 ∪ {v} convex and connected (or any v if K1 is empty), then branches on either adding v and its ancestors to K1, or v and its descendants to K2. If adding vertices to K1 would cause K2 ∪ X to become disconnected, then the subtree contains no connected partitions, so it is pruned.

Figure 3.9(b) and (c) show a program and its branch and bound search, with each node labelled by its sets K1 and K2. The minimal cost, cK1K2, for all partitions in the subtree rooted at node (K1,K2) is at least as large as the partial sum on the vertices already assigned:

\[
c_{K_1 K_2} \;\ge\; lb = \max\left( \sum_{i \in K_1} c_{iP_1},\ \sum_{i \in K_2} c_{iP_2},\ \sum_{i \in K_1,\, j \in K_2} c_{ij} \right). \qquad (4)
\]

Any valid partition in the subtree gives an upper bound on the optimal cost in that subtree. Since (K1, X ∪ K2) is always valid:

\[
c_{K_1 K_2} \;\le\; ub = \max\left( \sum_{i \in K_1} c_{iP_1},\ \sum_{i \notin K_1} c_{iP_2},\ \sum_{i \in K_1,\, j \notin K_1} c_{ij} \right). \qquad (5)
\]

In Figure 3.9(c), the node marked {x, ∅} has K1={x} and K2=∅. The known cost on P1 is 5.5, being the cost of kernel x divided by the performance of P1. The known costs on P2 and the interconnect are both zero. Hence by Equation (4), lb=max(5.5, 0, 0) = 5.5. Similarly, by Equation (5), the upper bound on the optimum is the cost of the partition ({x}, {y, z, t}), so ub=29.

The search algorithm tries to quickly find a good partition, so that more of the search tree is pruned by having its lower bound greater than some upper bound. It uses a depth-first search, and chooses vertex v adjacent to K1 (as it must) with the highest cost on whichever processor currently has the greatest load, then first considers adding it to the other processor.
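The node bounds of Equations (4) and (5) drive the pruning. A small sketch, with assumed dictionary-based weights matching the notation above:

```python
def node_bounds(K1, K2, X, cost_P1, cost_P2, c_stream):
    """Lower and upper bounds (Eqs. 4-5) for a branch-and-bound node with
    partial partition (K1, K2) and unassigned kernels X.
      cost_P1[i], cost_P2[i] -> cost of kernel i on P1 or P2
      c_stream[(i, j)]       -> cost of stream (i, j) on the aggregate interconnect"""
    def crossing(A, B):
        return sum(c_stream.get((i, j), 0.0) for i in A for j in B)
    lb = max(sum(cost_P1[i] for i in K1),
             sum(cost_P2[j] for j in K2),
             crossing(K1, K2))
    # (K1, X ∪ K2) is always a valid completion, so it bounds the optimum above.
    rest = set(X) | set(K2)
    ub = max(sum(cost_P1[i] for i in K1),
             sum(cost_P2[j] for j in rest),
             crossing(K1, rest))
    return lb, ub
```

A subtree is pruned whenever its lb exceeds the best ub seen so far, which is why finding a good partition early matters.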

Refinement of the partition

The refinement stage starts with a valid initial partition, and improves it by applying the optimisation passes described below. As shown in Figure 3.7, these steps are applied in sequence, and iterated until no further improvement is seen. The optimisation passes are:

Merge tasks: a greedy algorithm merges low-cost tasks, which frees processors and reduces communications.
Move bottlenecks: the main optimisation pass moves kernels from bottleneck processors.
Create tasks: create a new task to relax the connectedness and convexity constraints, keeping the new partition if the benefit is larger than some threshold.
Reallocate tasks: a greedy algorithm improves the allocation of tasks to processors.

The passes are described in detail below.

Merge tasks

This step uses a greedy algorithm to merge tasks whose union is convex and connected, as long as it does not cause a new bottleneck. This pass often reduces bus traffic, and frees up processors so they can accept kernels without restriction. Since there are usually far fewer tasks than kernels, define the basic connected sets of tasks: Dj = {T(k): k ∈ Cj}, where T(k) was defined earlier as the task containing kernel k, and set D={Dj: |Dj| ≥ 2}. In this case, the union of T1 and T2 is connected if {T1, T2} ⊆ Dj for some Dj ∈ D.

Section 3.2.1 defined the task dependence graph, GT, as the directed acyclic graph on the tasks. Define d(T1,T2) = 1 if there is a path from T1 to T2 of length two or more, and 0 otherwise. This can be calculated in time O(|T|2), using a topological sort. The greedy algorithm finds, using a branch and bound search, the connected pair of tasks T1 and T2 with minimum total cost on either of their current processors, such that d(T1,T2)=0. If the bottleneck cost after merging is no greater than the current bottleneck cost, then the tasks are merged and allocated to the processor on which they have the minimum total cost. The algorithm continues until no more tasks can be merged.
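The reachability test d(T1, T2) can be computed with a simple graph traversal. The sketch below uses a DFS from each task's successors-of-successors rather than the topological-sort formulation in the text; it is illustrative only, and assumes the task dependence graph is given as a dictionary of successor sets.

```python
def long_path_exists(task_dag):
    """For the task dependence graph G_T (dict: task -> set of successor tasks,
    with every task present as a key), return d with d[(t1, t2)] = 1 iff there is
    a directed path from t1 to t2 of length two or more.  Such pairs must not be
    merged, or the contracted graph would acquire a cycle."""
    d = {}
    for t1, succs in task_dag.items():
        # Nodes reachable along a path of length >= 2: everything reachable
        # from a successor of a successor of t1 (including those nodes themselves).
        frontier = set()
        for s in succs:
            frontier |= task_dag.get(s, set())
        seen, stack = set(), list(frontier)
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(task_dag.get(v, set()))
        for t2 in task_dag:
            d[(t1, t2)] = 1 if t2 in seen else 0
    return d
```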

Move bottlenecks

This pass identifies a bottleneck processor, p1, then considers moving a set M1 of kernels from some task on p1 to a task on another processor, q1, without violating convexity or connectedness. The cost metric to minimise is the maximum of the costs on p1, q1 and all interconnects, after the move:

\[
C = \max\left( C_{p_1},\ C_{q_1},\ \max_{u \in I} C_u \right). \qquad (6)
\]

This metric excludes the other processors, otherwise if some other processor had the same cost as p1, its contribution would hide the benefit of any move.

Some kernels must be moved, even if doing so has a negative benefit—hence the algorithm has some ability to escape from local minima. After tentatively moving set M1, record the bottleneck cost and identify the new bottleneck processor, p2, which may still be p1, and tentatively move a set M2 to another processor. It continues moving kernels, with the constraint that no kernel can be moved back to a processor that it has previously been allocated to. For instance, none of the kernels in M1 may be tentatively moved back to p1, but they may be moved a second time to another processor. This process continues until either there are no remaining valid moves, or a fixed limit, currently 50 moves, is reached. The final partition is that of the intermediate point in the algorithm with the maximum overall performance.

Any kernel, k, can potentially be moved to any task on a different processor if there is a kernel k′ on the new task that shares a basic connected set with k. There are three additional requirements. Firstly, if k is neither a source nor a sink in its current task, T, then T−{k} cannot be convex. It is always necessary to move either k and all its ancestors in T, or k and all its descendants in T. Secondly, it is necessary to check, using a breadth-first or depth-first search on the basic connectivity sets, whether the remainder of the old task is still connected. Thirdly, there are several ways that the move can create a cycle in the task dependency graph, and this can be checked using a topological sort.

Create tasks

This pass moves a kernel k on a bottleneck processor onto another processor, creating new tasks as necessary to become convex and connected. It then runs the Move Bottlenecks pass, with the restriction that the kernel cannot be moved from its new processor. The new partition is kept if the performance is improved by more than some threshold, currently 5%.

The most expensive kernel, on its current bottleneck processor, is considered first. This kernel may be moved to any processor in use for which the cost of the kernel is less than the current bottleneck cost. There is no advantage in moving a kernel to an unused processor, since that is the first thing that the Move Bottlenecks pass would do. If there are no valid choices, then the second most expensive kernel on the same processor is considered, and so on.

Kernel k is placed on some other processor in order to minimise the sum of the weights of the large kernels, including k, on the new processor; the large kernels are those of weight at least half that of the kernel being moved. The reason for ignoring lightweight kernels is that these are most likely to be able to be easily offloaded onto other processors.

Reallocate tasks

The reallocation pass decreases communications traffic by permuting the loads on the processors. This pass is executed even if the bottleneck is on one of the processors. It only moves tasks between similar processors attached to different buses. Similar processors are those for which all kernel computation times are identical. For instance, it can permute the four processors on the 2× 2 mesh target, or SPEs on different processors on the two-Cell QS20 target.

The algorithm is similar to Kernighan & Lin, in that it swaps similar processors in a greedy manner to minimise the maximum load on the buses, even if doing so makes the bottleneck worse. After swapping the loads on two processors, they are fixed for the rest of the pass. The algorithm continues until there are no processors left, and outputs the best partition seen.


(a) Recursive initial partition
(b) Trivial initial partition
Figure 3.10: Convergence of the refinement phase as a function of the number of iterations


 
(a) SMP with three processors
(b) 2× 2 mesh

(c) IBM QS20 (two Cell processors)
(d) SMP 3 with accelerator
Figure 3.11: Normalised execution time for the StreamIt 2.1.1 benchmarks for the three variants of the heuristic algorithm. The unrestricted partition has execution time 1.0, and larger bars are slower.


Benchmark      Num. kernels  Width   Number of pipeline stages for unrestricted (u) & heuristic (h)
                                     SMP 3        2×2 Mesh     QS20 (2 Cell)   SMP 3 + Acc.
                                     u      h     u      h     u      h        u      h
bitonic-sort        40          4    27    (5)    35    (7)    35    (19)       9    (7)
channel             55         17    11    (5)     9    (7)     9    (7)        9    (5)
dct                  8          1     9    (5)     9    (7)     9    (9)       13    (5)
des                 53          3                              77    (31)
fft                 17          1    13    (5)    27    (7)    33    (23)      17    (7)
filterbank          85         16    19    (5)    19    (7)    23    (9)       19    (5)
fm                  43         12     9    (5)    17    (5)    21    (9)       17    (5)
mpeg2               23          5    23    (5)    19    (7)    29    (19)      15    (7)
radar               57         12    13    (5)     9    (5)    19     13       19    (5)
serpent            120          2   143    (5)   163    (7)   209    (31)
tde                 29          1    39    (5)    41    (7)    57     33       23    (9)
vocoder            114         17    29     7     29    (7)    29    (7)       45     11
Average ratio                           5.9          5.0          2.4              2.8
Table 3.1: The number of pipeline stages for the optimal unrestricted partitions and the partitions generated by our heuristic (loosely connected with pipelining). Partitions that do not need pipelining are in parentheses.


3.2.3  Evaluation

This section uses the StreamIt 2.1.1 benchmarks [GTA06] to evaluate our heuristic algorithm and convex connected partitions in general. The StreamIt benchmarks are constrained to the two-terminal series-parallel structure of StreamIt, but they are the most widely used streaming benchmarks. The program graph, work estimates and data rates were taken from the StreamIt 2.1.1 compiler. The StreamIt compiler modifies the stream program graph before calculating the work estimates, so our kernel counts differ from those of the source program. The number of kernels ranges from 8 to 120, with an average of 54.

Figure 3.10 shows performance vs. iteration in the refinement phase. Each point plots the best of all partitions seen so far. This graph shows that the refinement algorithm quickly converges to a good solution, even from a trivial initial partition. It also shows that the initial partition is on average within 4% of the performance of the final partition, although the worst case is 33% slower. The sub-figures have very different scales on the vertical axes. The graphs show the number of iterations rather than time, since the algorithm was implemented in unoptimised Python. Nevertheless, the per-benchmark partitioning time on a 2GHz Intel MacBook averages 10.0 seconds, with a maximum of 58.4 seconds.

Figure 3.11 shows the normalised execution time for the partitions found by the heuristic, using strict and loose connectivity, against the optimal unrestricted partition, which has time 1.0. The strictly connected partitions for channel, filterbank, fm and radar have bad performance because of the wide split joins. The third column of Table 3.1 gives the width of each benchmark, which is the maximum size of a subset of kernels with no paths between any pair of them (an anti-chain). For example, the filterbank benchmark, illustrated in Figure 3.1, has a width of 16. The benchmarks with poor performance using strict connectivity tend to be the benchmarks with the largest width, although this is a great simplification.

Figure 3.11 also shows the bottleneck cost for the loosely connected partition generated by our heuristic when software pipelining is enabled. Software pipelining is most beneficial for the vocoder benchmark on SMP3 with accelerator, and radar on IBM QS20. Figure 3.12 shows the partition of vocoder on SMP with accelerator found using the heuristic, which uses eleven pipeline stages if scheduled using the stage assignment phase of the SGMS algorithm [KM08]: six for computation and five for DMA. The main improvement in throughput comes from splitting the workload on a1 into two tasks. This benchmark has two heavyweight kernels that should run on the accelerator, but the cost of the smallest convex task containing both of them is very large. The optimal unrestricted partition is 9% faster, but it requires 45 pipeline stages.

The only bad results are des, serpent, and tde on IBM QS20. These three suffer since our fast algorithm gets stuck in a local optimum. These benchmarks suggest some limitations of the Create Task stage, which could be addressed in future work. The des and serpent benchmarks finish with two or more tasks of identical cost. The Create Task stage offloads a kernel from one of the bottleneck tasks, but since it does not reduce the cost of the other bottleneck(s), the move is rejected as not worthwhile. The tde benchmark finishes with the bottleneck task containing a heavyweight kernel. The Create Task stage tries to move the heavyweight kernel to another processor, but it cannot find a better solution. It may have been better to move some of the lighter kernels from the bottleneck task. Since the problem is NP-hard, all heuristics will at times find suboptimal solutions.

Table 3.1 shows the pipeline lengths for the unrestricted and heuristic partitions. The pipelines were generated using SGMS [KM08], so half of the pipeline stages perform computation. The partitions that do not require software pipelines are given in parentheses. Some of the cells for the unrestricted partitions are empty, since finding the true optimum is slow (Figure 3.11 uses a lower bound). The unrestricted partitions for serpent and tde on Cell have long software pipelines because the programs themselves are long pipelines, and unless it is bus bound, there is no incentive for short pipelines. There are four data points in Figure 3.11 where our heuristic algorithm used software pipelining.


Figure 3.12: Vocoder benchmark on SMP 3 with accelerator using software pipelining.

3.3  Static buffer sizing

3.3.1  The buffer sizing problem

Queue length assignment seeks to find an optimal tradeoff, subject to memory constraints, between throughput and latency. We wish to find a close to Pareto optimal solution: that is, neither latency nor throughput can be improved without making the other one worse. Memory use is kept within the constraints, rather than being minimised.

The stream program is represented as a connected, not necessarily acyclic, digraph, P=(T,S), where T is the set of vertices (tasks), and S is the set of edges (streams). Each stream s has a producer and consumer buffer size in bytes, bp(s) and bc(s), and a minimum number of buffers, sufficient to hold the working set and avoid deadlocks. If P is acyclic, as for ACOTES, deadlock is impossible; otherwise minimum sizes can be found using algorithms in the literature [Par95, Buc93, GB03]. The algorithm determines the actual number of buffers, np(s) and nc(s).

Each task has a trace, which is an alternating sequence of computation times and primitives. There are four communications primitives, given in Section 2.3: ProducerAcquire and ConsumerAcquire obtain a buffer to write or read data. ProducerSend and ConsumerDiscard send or discard data once complete.

The traces are interpreted using the ASM coarse-grain simulator, which takes a machine description that defines the target. Queue length assignment needs only the memory constraints, which are represented using a bipartite graph, H=(R,E). The set of vertices, R = P ∪ M, is a disjoint union of processors P and memories M, and the edges, E, connect processors to their local memories. Each memory has weight equal to the amount of memory available, in bytes, for stream buffers. Figure 3.13 shows the memory constraint graph for the Cell Broadband Engine; the memory weights depend on how much memory is already being used.


Figure 3.13: Memory constraint graph for the Cell Broadband Engine

The evaluation algorithm in Figure 3.7 and experimental results in Section 3.3.3 both require an estimate of latency. Since it is orthogonal to the rest of this section, and only differences in latency matter, we propose the following scheme.

Define ft(n) to be the time of firing n, for n=0, 1, ⋯, Mt−1, of task t, taken from the fire primitive. Since each task contributes to a common amount of real-world progress, normalise n to the interval 0 ≤ x < 1 by dividing it by Mt. Then gt(x) = ft(⌊ Mt x⌋) gives the time that task t was proportion x∈ [0,1) through the calculation. The latency, L(x), is the difference between the largest gt(x) for a sink and the smallest gt(x) for a source, which can, unfortunately, be negative when multiplicities are variable. The latency is the average value of L(x).
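A sketch of this latency estimate, sampling x at evenly spaced points; the function and argument names are assumptions for illustration.

```python
def estimate_latency(fire_times, sources, sinks, samples=1000):
    """Average latency as defined above.
      fire_times[t] -> list of firing times f_t(0), ..., f_t(M_t - 1) for task t
      sources/sinks -> names of the source and sink tasks
    Averages L(x) over evenly spaced x in [0, 1)."""
    def g(t, x):
        f = fire_times[t]
        return f[int(len(f) * x)]          # g_t(x) = f_t(floor(M_t * x))
    total = 0.0
    for i in range(samples):
        x = i / samples
        total += max(g(t, x) for t in sinks) - min(g(t, x) for t in sources)
    return total / samples
```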

3.3.2  The buffer sizing algorithm

This section describes several algorithms for cycle detection and buffer size update. It first reviews the standard critical cycle detection algorithm, and explains when it is applicable. It then introduces the baseline algorithm, which finds the bottleneck cycle by analysing the time each task is blocked on each stream. This data is easy to obtain, and the algorithm is quite effective. It then gives an example that the baseline algorithm gets wrong, and proposes the token algorithm, which requires extra bookkeeping but achieves better results. Finally, it describes several variants on the buffer update algorithm, which have different tradeoffs between speed of convergence and latency.


Style     Waiting primitive (§3.3.1)
Bold      ProducerAcquire
Dashed    ConsumerAcquire
Solid     ProducerSend
Dotted    Computation
(a) Timed event graph
(b) Types of edge
Figure 3.14: Example timed event graph used by the critical cycle algorithm

Cycle detection algorithms

Critical cycle algorithm

The critical cycle algorithm [IP95, DG98, GG93] solves the cycle detection problem for homogeneous Synchronous Data Flow (SDF) [LM87] with constant computation times and communications latencies. In homogeneous SDF, every time a producer or consumer fires, it pushes or pops a single buffer on each stream. All tasks therefore fire at the same rate. The algorithm can be extended to SDF, where each producer or consumer pushes or pops any fixed number of buffers, but it requires expanding the graph, which can make it much bigger [Lee86].

Figure 3.14(a) shows how producer-consumer, assuming a single buffer at each end, is represented by this algorithm. Each vertex is the return from a communications primitive. The edges are distinguished, for the diagram but not the algorithm, using the convention in Figure 3.14(b), which refers to the primitives in Section 2.3. Each edge has weight, which is its fixed computation time or communications latency, and height, which is the fixed difference between the firing number, which counts the number of times a task has fired, at its two ends.

For example, at the producer side, the dotted line from ProducerAcquire to ProducerSend, of weight 448 and height 0, represents computation inside a single iteration. The bold line in the reverse direction, of weight 13 and height 1, is because the producer cannot reuse its single buffer in the current firing until the previous DMA has completed.

Throughput is constrained by the critical cycle, which is a cycle with maximum ratio of total weight divided by total height. There are several algorithms to find such a cycle, many based on Karp’s Theorem [Kar78], in time O(|S|2|T|) or so [DG98], using the terminology of Section 3.3.1.
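For reference, a maximum cycle ratio can also be found by parametric binary search with positive-cycle detection, rather than by Karp's theorem as in the cited algorithms. The Python sketch below takes that simpler route; it is illustrative only and assumes every cycle has total height of at least one.

```python
def max_cycle_ratio(vertices, edges, iters=60):
    """Critical-cycle ratio: max over cycles of (total weight / total height).
    edges is a list of (u, v, weight, height) tuples.
    Binary search on lambda: a cycle of ratio > lambda exists iff the graph
    with edge cost (weight - lambda * height) contains a positive cycle,
    detected here by Bellman-Ford-style longest-path relaxation."""
    lo, hi = 0.0, sum(abs(w) for (_, _, w, _) in edges) + 1.0
    for _ in range(iters):
        lam = (lo + hi) / 2.0
        dist = {v: 0.0 for v in vertices}
        improved = False
        for _ in range(len(vertices)):
            improved = False
            for (u, v, w, h) in edges:
                cand = dist[u] + (w - lam * h)
                if cand > dist[v] + 1e-12:
                    dist[v] = cand
                    improved = True
        if improved:        # still relaxing after |V| rounds => positive cycle
            lo = lam
        else:
            hi = lam
    return lo
```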

Baseline algorithm

Our baseline algorithm is more general, because it supports variable data rates, computation times, and communication latencies. It finds the bottleneck by analysing wait times in a real execution or simulation.

Figure 3.15 shows how the stream program and wait times are represented by the algorithm. Figure 3.15(a) is an example stream graph with three tasks in a triangle. Figure 3.15(b) is the wait-for graph, which has the same three edges per stream as the timed event graph. Following convention for wait-for graphs, the arrows point in the opposite direction, from the waiting task. The weight of an edge is the proportion of the total time that the task at the initial vertex, or tail, spent waiting in its communications primitive. The diagram shows three of the edge weights; the other weights will not be important in the discussion.


(a) Program
(b) Wait-for graph
(c) (t0) has zero strength
Figure 3.15: Example weighted wait-for graphs

As for the critical cycle algorithm, performance is constrained by dependence cycles in the wait-for graph. The algorithm uses two bounds, one local and one global, on the maximum increase in performance from relaxing a cycle; i.e. increasing buffering on one of the streams in the cycle that gets full.

Consider the potential benefit from relaxing cycle C1 = (t0 t2 t1). This can only be done by increasing buffering on the stream from t0 to t2. Since t1 waits for 27% of the time, during the ConsumerAcquire primitive in this cycle, we could reduce the execution time of t1 by at most 27%, before the cycle disappears. Since all tasks execute for nearly the same amount of wallclock time, any change in throughput will cause all vertices to have their total waiting time, not just on the edges of this cycle, reduced by the same amount. It is therefore likely that the edge in the cycle that disappears first is its weakest edge.

The local bound is the weight of cycle C, denoted w(C), which is the minimum weight of its edges. If there is no cycle with non-zero weight, then utilisation is already 100%. This is because every directed acyclic graph has a vertex with no outgoing edge, which corresponds to a task that never has to wait.

Figure 3.15(c) is the motivation for the global bound. The maximum weight cycle is the loop on t0, of weight 0.13, which we will call C2. A moment’s reflection, however, shows that C2 cannot really be a bottleneck since neither t1 nor t2 ever wait for t0, even indirectly. If we reduced the time t0 spent waiting on this loop, it cannot make t1 or t2 go any faster. Since throughput would be unchanged, t0 must spend the same total amount of time waiting, so the waiting time would move from ProducerAcquire to ProducerSend (see Figure 3.14(b)).

The global bound is the strength of the cycle, denoted s(C), which is the lowest value of the maximum flow through a single path to the cycle, starting from any other vertex. Since there is no path at all from t1 to C2 in Figure 3.15, the cycle has zero strength: s(C2)=0. In contrast, the cycle (t1 t2) has strength 0.77, because this is the weight of the only path from the only other vertex, t0. Increasing the performance of t1 and t2 by any means could reduce execution time of the program as a whole by 77%. This cycle is the bottleneck, and it has weight 0.05. The requirement that flow be through a single path makes little difference in practice, but it reduces considerably the algorithmic complexity.

It is possible for the wait-for graph to be disconnected; e.g. when tasks wait for each other only through bus contention. This happens rarely, but it causes all strengths to be zero. Therefore, when all strengths are zero but the utilisation is below some threshold (currently 100%), the strengths are ignored. Since it almost never happens, there is little reason to be more sophisticated.

The strength of each vertex is found by computing the all-pairs bottleneck paths [Pol60]. This finds, for every pair of vertices, the value of the maximum flow through a single path from the first vertex to the second. It is solved using a variant of Dijkstra’s algorithm, running Dijkstra for each vertex to find the maximum flow paths into it. The strength of that vertex is given by the path with the lowest flow. The total execution time is O(|S||T| + |T|2log|T|), using a Fibonacci heap [FT87b, VWY07], with the terminology of Section 3.3.1.

The algorithm finds a cycle with the maximum value of the minimum of the local and global bounds. It is straightforward to show that we can take account of both simply by replacing the weight of every edge e=(a,b) by a new weight, w′(e) = min( w(e), s(a)). A maximum weight cycle, according to w′, can be found in time O(|S|log|S|), where S is the set of streams. To find out whether there is a cycle of weight ≥ W, for some W, just check whether there is any cycle if you ignore all edges of weight <W. This can be done in time O(|S|) by attempting to perform a topological sort. To find a maximum weight cycle, first sort the edge weights, and perturb them so that no two are exactly the same. Then use bisection on the sorted edge weights.
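A sketch of this bisection step, assuming the edge weights have already been replaced by w′(e) = min(w(e), s(a)) and perturbed to be distinct; the names are illustrative.

```python
def max_weight_cycle_threshold(vertices, edges):
    """Largest W such that the graph restricted to edges of weight >= W still
    contains a cycle, i.e. the weight of a maximum-weight cycle under w'.
    edges: list of (u, v, weight).  Bisection over the sorted edge weights,
    with a topological-sort acyclicity test at each threshold."""
    def has_cycle(threshold):
        succ = {v: [] for v in vertices}
        indeg = {v: 0 for v in vertices}
        for (u, v, w) in edges:
            if w >= threshold:
                succ[u].append(v)
                indeg[v] += 1
        queue = [v for v in vertices if indeg[v] == 0]
        removed = 0
        while queue:
            u = queue.pop()
            removed += 1
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        return removed < len(vertices)      # some vertex left => cycle exists
    weights = sorted({w for (_, _, w) in edges})
    best, lo, hi = None, 0, len(weights) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if has_cycle(weights[mid]):
            best = weights[mid]
            lo = mid + 1
        else:
            hi = mid - 1
    return best        # None if even the full graph is acyclic
```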

The baseline algorithm uses data that is easy to obtain, and is usually quite effective, but it has one limitation. Since each task is represented by a single vertex, it cannot “see” what is happening inside them.

Figure 3.16(a) shows the wait-for graph for an example where the baseline algorithm makes a bad decision. The maximum weight cycle is (t1 t0 t2), which has weight 0.50. Whether or not this is a bottleneck depends on the internal behaviour of tasks t1 and t2. The order of operations per firing of task t1 is shown in Figure 3.16(b). If it is also known that task t1 always waits in step 5, then reducing the waiting time in step 1 will simply result in a longer waiting time in step 5. It can never advance the push in step 6, so the critical cycle cannot be (t1 t0 t2). The next section introduces the token algorithm, which addresses this problem, and describes the indirect wait-for graph in Figure 3.16(c).


(a) Wait-for graph
   Primitive                    Wait time
1. ConsumerAcquire(s01, 0)      0.52
2. ProducerAcquire(s13, 0)      n/a
3. ProducerAcquire(s12, 0)      n/a
4. ConsumerDiscard(s01)         n/a
5. ProducerSend(s13)            0.48
6. ProducerSend(s12)            n/a
(b) Order of primitives in t1
(c) Indirect wait-for graph
Figure 3.16: Example where baseline fails

Token algorithm

The token algorithm addresses this problem by tracking dependencies through tasks. This is somewhat similar to causal chains [BH01], except that the aim is to resolve performance bottlenecks rather than artificial deadlocks. Their algorithm fixes a deadlock after it happens, when all tasks have got stuck, but we cannot expect all tasks in a cycle to ever be waiting simultaneously.

During the simulation, or at runtime in a dynamic scheme, each task t has a current token, St, which is the stream that most recently made t wait, directly or indirectly, because it got full. It has a current waiting time, Wt, which measures how much the task has already had to wait, so that only increases in waiting times are charged to streams. It also has a waiting vector, (Vt)s, which gives the total waiting time for each stream in the whole program. Each consumer buffer c has a current token, Sc, and current waiting time, Wc, which together record the producer’s problem at the time the block in that buffer was sent.

When task p blocks for time τ because output stream s is full, it sets Sps and increases both Wp and Vp[s] by τ. When task p sends a block using buffer c on output stream s, it records a copy of its current state: ScSp and WcWp. When a task q blocks for time τ because input stream s is empty, it also, after the data arrives, reads Sc and Wc, from the consumer buffer c containing the end of the data. It then updates its current token SqSc to indicate that it had to wait, indirectly, for whichever stream the producer had to wait for, and calculates the increase in current waiting time Δ Wq ← min(τ, WcWq), which can be either positive or negative. If it is positive, then Vq[Sq] is increased by Δ Wq. In either case, the current waiting time is then updated using WqWq + Δ Wq.
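A minimal sketch of these update rules follows; the class and function names are illustrative, not the simulator's API.

```python
class TokenState:
    """Per-task bookkeeping for the token algorithm."""
    def __init__(self):
        self.token = None     # S_t: stream that most recently made this task wait
        self.wait = 0.0       # W_t: accumulated (direct or indirect) waiting time
        self.vector = {}      # V_t[s]: total waiting time charged to each stream

def producer_blocked(p, stream, tau):
    """Producer p waited tau because output `stream` was full."""
    p.token = stream
    p.wait += tau
    p.vector[stream] = p.vector.get(stream, 0.0) + tau

def producer_send(p):
    """Return the (S_c, W_c) pair stamped onto the block being sent."""
    return (p.token, p.wait)

def consumer_blocked(q, tau, stamped):
    """Consumer q waited tau for data; `stamped` is the (S_c, W_c) pair carried by
    the buffer whose arrival ended the wait."""
    token_c, wait_c = stamped
    delta = min(tau, wait_c - q.wait)       # only *increases* in waiting are charged
    if token_c is not None:                 # producer never blocked: nothing to charge
        q.token = token_c                   # we indirectly waited for that stream
        if delta > 0:
            q.vector[token_c] = q.vector.get(token_c, 0.0) + delta
    q.wait += delta                         # updated whether delta is positive or negative
```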

The waiting vectors are used to construct an indirect wait-for graph, as shown in Figure 3.16(c). If Vt[s]>0, there is an edge from task t to stream s with weight Vt[s] / L, where L is the total execution time of the run, in the same units. Each stream s also produces an edge from s to its consumer q. The weight of this edge is s(q), the strength of q, as defined for the baseline algorithm.

This is effectively viewing each stream as an actor in its own right, which is always blocked waiting for the consumer to discard its data. This is the most convenient place to take account of the strengths, which are still relevant by the same argument as before. The token algorithm finds the maximum weight cycle in the same way as the baseline algorithm.

Figure 3.17 shows a second example which clarifies the need for the cycle-based algorithm outlined above. In the stream program of Figure 3.17(a), task t0 pushes the outputs in the cyclic order (s01 s03 s04 s06), waiting only in ProducerSend for streams s03 and s06 due to their longer latency.


(a) Stream graph for bichain4
(b) Indirect wait-for graph
Figure 3.17: Token algorithm: bichain4 example

When it pushes on stream s04 of the right branch, the most recent wait was due to stream s03 being full, so it sends the token for s03. Similarly, it sends the token for stream s06 to stream s01 of the left branch. The indirect wait-for graph is shown in Figure 3.17(b), with cycle (t3 s06 t6 s03) going through both streams.

Buffer size update algorithms

The cycle detection algorithm returns a set of edges in the wait-for graph that cause a bottleneck cycle by becoming full. Relaxing the cycle involves increasing memory on one or more of these edges. The purpose of the buffer size update algorithm is to determine which edges to enlarge, and by how many buffers.

Our simplest algorithm is miserly, meaning that it starts at the minimum number of buffers, mentioned in Section 3.3.1, and each iteration increases the allocation of a single stream by one buffer. The other algorithms speculatively assign spare memory, and only take it away if it is needed elsewhere. For all these algorithms, each stream s demands some number ds of buffers, as for the miserly algorithm, and requests another rs to be granted out of unused memory, if there is any. When there is not enough memory in some store to grant all its requests, we use the following algorithm. The total request in bytes is R=∑ rs bc(s), where bc(s) is the size in bytes of a single consumer buffer for stream s. If M bytes are left after granting all demands, so R>M, then each stream is initially granted ⌊ rs M / R ⌋ extra buffers, then possibly one more, if it fits.
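A sketch of this proportional grant rule, under the same notation (the function and argument names are illustrative):

```python
import math

def grant_requests(requests, buffer_bytes, spare_bytes):
    """Distribute spare memory among streams that asked for extra buffers.
      requests[s]     -> r_s, extra buffers requested by stream s
      buffer_bytes[s] -> b_c(s), size of one consumer buffer in bytes
      spare_bytes     -> M, bytes left after all demands were granted
    Each stream first gets floor(r_s * M / R) extra buffers, then possibly one
    more if it still fits."""
    R = sum(requests[s] * buffer_bytes[s] for s in requests)
    if R <= spare_bytes:                        # everything fits: grant in full
        return dict(requests)
    granted = {s: math.floor(requests[s] * spare_bytes / R) for s in requests}
    used = sum(granted[s] * buffer_bytes[s] for s in granted)
    for s in requests:                           # hand out one more where it fits
        if granted[s] < requests[s] and used + buffer_bytes[s] <= spare_bytes:
            granted[s] += 1
            used += buffer_bytes[s]
    return granted
```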

In our first alternative, double, each edge requests an extra buffer if it is currently allocated only one. In our second alternative, exponential, the request is for some multiple, f−1, of the number of buffers demanded. It still uses a greedy update algorithm, so that when the number of buffers is increased, the edge demands, on the next iteration, one more buffer than it was given in total last time. The results use f=2, so an edge will demand 2^k − 1 buffers, and request an equal number, for k=1, 2, ⋯, until it is given fewer buffers than it wants.

The third alternative, level, uses the top level, the length of the longest path from a source node, and bottom level, the length of the longest path to a sink node. The algorithm is the same as exponential, except that the request is the maximum of a) f−1 times the number of buffers demanded, b) twice the difference in top level, and c) twice the difference in bottom level. This tries to give a high initial allocation to streams that cross a high latency.

3.3.3  Evaluation

This section uses the StreamIt 2.1.1 benchmarks [GTA06], random graphs, and sixteen examples, including chain8, producer-consumer, bad-baseline, and bichain4. For the StreamIt benchmarks, the program graph, work estimates and communications rates were generated by the StreamIt compiler. The algorithm in Section 3.2 was used to produce partitions for an IBM QS20 blade, which has two Cell BEs.

Buffer size update The first three rows of Figure 3.18 compare the buffer update algorithms from Section 3.3.2. These plots also contain results for Basten and Hoogerbrugge (B&H) [BH01] and modified StreamRoller [KM08], which will be discussed in Section 3.4. The left column shows, as a function of the iteration number, the utilisation, which is proportional to throughput, as remarked at the end of Section 3.1.3. The right column shows the tradeoff between latency and utilisation. Any points that cannot be Pareto optimal, because they are beaten on both utilisation and latency by some other point to the top-left, have been removed.

The first row is for random stochastic graphs with 32 tasks and 50 streams. The graphs are connected and acyclic, but otherwise unstructured. The computation time of each task is normally distributed with a random mean and variance (clamped above zero). Notice that B&H has poor performance and, since it increases buffering where it isn’t necessary, high latency.

The upper bound on utilisation was found using an exhaustive search over all allocations of the buffers on the processor, p, whose memory bound caused the level algorithm to terminate. All other queues on other processors were set to their maximum possible size, assuming that all other queues in the same memory had their minimum size. Since this tends to allow a task near the beginning of the stream graph to work flat out filling downstream buffers, the steady state utilisation would be known only after many firings. Instead, we took the utilisation of the task on p, and scaled by the ratio of the long-term processing times of the most heavily loaded processor and of p.

The second row shows the StreamIt 2.1.1 benchmarks, with an unroll factor of 100. The third row shows the stochastic StreamIt benchmarks, which have normally-distributed computation times, and are intended to show how the algorithms fare for realistic program graphs.

The left column shows that the level algorithm always provides the fastest convergence. The modified StreamRoller algorithm is similar to the first iteration of the level algorithm, and B&H is considerably worse. The level heuristic initial allocation is within 15% of the upper bound on optimal performance, and is increased to within 3% of optimal after four iterations.

Cycle detection This section evaluates the cycle detection algorithms only, using greedy buffer update without memory constraints. When task execution times and communications rates are constant, and bus contention is negligible, the critical cycle algorithm of Section 3.3.2 is optimal. The last row of Figure 3.18 shows the utilisation and latency for an average of six random graphs with stochastic computation times. The poor performance of the critical cycle algorithm (about 60% utilisation), is because it is unable to detect cycles that arise from execution time variability. The baseline and token algorithms achieve similar performance, although the token algorithm achieves slightly lower latency.


 
Columns: utilisation vs iteration number (left); utilisation-latency tradeoff (right).
Rows (buffer size update): stochastic random G(32,50); StreamIt on 2-Cell; stochastic StreamIt on 2-Cell.
Row (cycle detection): stochastic G(8,12).
Figure 3.18: Comparison of the buffer size update and cycle detection algorithms

We also evaluated the cycle detection algorithms when there is high bus utilisation. The critical cycle algorithm cannot model increased communication latency due to contention [HP07, §E.5]. For a benchmark with a single producer task connected to two consumers, and bus usage close to 100%, the critical cycle algorithm achieves about 70% utilisation. The baseline and token algorithms measure waiting times directly, and consistently achieve 100% utilisation.

3.4  Related work

Partitioning There has been a great deal of work on automatically mapping stream programs onto multiprocessor systems. Ptolemy II [EJL+03] is an actor-oriented software environment for modelling real-time embedded systems that supports several models of computation, including Synchronous Dataflow (SDF) and Kahn Process Networks (KPN). Related work from the Ptolemy project explores the more theoretical aspects of partitioning and scheduling data flow graphs for multiprocessors [HL91].

The Stream Graph Modulo Scheduling (SGMS) algorithm is part of StreamRoller [KM08], a StreamIt compiler for the Cell Architecture. The algorithm splits stateless kernels, partitions the graph, and statically schedules the result. The splitting and partitioning problem is translated into an Integer Linear Programming (ILP) problem, which is solved using CPLEX [ILO]. This approach uses mature technology to solve the ILP problem; it also applies kernel splitting in the same step, rather than using the iterative approach we follow.

Their partitioning algorithm considers only CPU loads, and ignores communications bandwidth. This may be sufficient for a single Cell processor, which has a high-bandwidth on-chip bus, but it is inappropriate when communication is off-chip, as in the Cell QS20 target, or when a bottleneck may appear in part of an on-chip network, such as a large mesh.

The StreamRoller ILP formulation does not attempt to find a partition that minimises the memory, latency and startup overheads introduced by software pipelining. Since it uses an ILP solver to find a (close to) optimal solution to a problem with similar objective and constraints to our unrestricted partition, the resulting pipeline length should be similar. StreamRoller does not have any concept similar to our connectivity constraint. We believe that when the program is written using an unrestricted programming language, the partitioning algorithm requires some mechanism to model which kernels can be statically scheduled by the compiler. They do not restrict the memory footprint on each processor, although it appears that their ILP formulation could be extended to do so.

Flextream [HCK+09] uses a related algorithm, adaptive stream graph modulo scheduling, to map a stream program to the Cell Architecture. It is a hybrid static-dynamic approach which is able to adapt to changes in resource availability. The static work partitioning stage uses ILP to map the program to the most powerful target; e.g. the full machine. The dynamic partition refinement stage is a heuristic that adapts the partition at run time to take account of the resources that are actually available; e.g. if another application is running.

The StreamIt compiler [GMA+02, GTA06] targets the Raw Microprocessor [WTS+97], symmetric multicore architectures, and clusters of workstations. This is a long-running project with a publicly available compiler and benchmark suite. The StreamIt source language imposes a structure on the stream program graph, where each kernel has a single input and a single output, and kernels are composed in pipelines, split-joins, and feedback loops. Since the kernels have static data rates, the compiler can fuse any set of kernels. The default partitioner uses dynamic programming. Our model of the source program is more general, since we target unstructured program graphs with variable data rates, and we use the connectedness constraint to reason about the capabilities of the compiler. Our model of the target system is also more general, since we can target a heterogeneous multiprocessor system with any communications topology.

Liao et al. [LDWL06] use affine partitioning to map regular multidimensional programs written using the Brook language [Buc03] onto a four-processor SMP. The R-Stream compiler (www.reservoir.com/r-stream.php) is a proprietary high level compiler for stream programs, which uses a polyhedral model to partition code and data to a parametric parallel machine. Gedae [LBS] is a proprietary GUI tool for mapping data flow graphs to a heterogeneous multiprocessor system. The transformations are under user control, and the partition is not automatically found by the compiler.

Decoupled Software Pipelining (DSWP) [RVVA04] is a technique to tolerate variable latency instructions in loops. It breaks a loop into strongly connected components, which execute on different threads. The threads communicate using a synchronization array, a hardware structure that provides low overhead blocking queues between threads. DSWP can also exploit fine-grained parallelism, by mapping the strongly connected components to different cores.

Queue sizes Basten and Hoogerbrugge (B&H) [BH01] is the only other work that targets unstructured graphs with variable multiplicities and computation times. Their algorithm sets each FIFO buffer size to be proportional to the amount of data streaming through it. This gives a relative size for each buffer, but it is not motivated by the underlying problems discussed in Section 3.1.3, and it performs poorly in Figure 3.18. We interpreted B&H to mean double buffering on the producer side, with all the remaining memory allocated to consumer buffers, rounding the number of buffers up to an integer. If rounding up caused the buffer allocation not to fit, we reduced the target memory use until it did fit. The chain8 example in Figure 3.5 shows the problem with this heuristic. If all data rates are the same and there is enough memory on t_n for ten buffers, Basten and Hoogerbrugge allocates five buffers to each stream for 70% utilisation, while our heuristic allocates eight to (t_1, t_n) and two to (t_{n−1}, t_n) for 100% utilisation.
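
A minimal sketch of this interpretation (our reading, with illustrative names; it splits the consumer-side buffers in one local store in proportion to traffic, rounding up and shrinking the target until the allocation fits):

    import math

    def bh_consumer_buffers(streams, mem_buffers, min_buffers=1):
        # streams: list of (name, traffic) for the streams consumed in one local
        # store. mem_buffers: how many buffers fit in the memory left over after
        # code, data and the producer-side double buffering. Buffers are split in
        # proportion to traffic, rounded up; if rounding overshoots, the target is
        # reduced until the allocation fits.
        total_traffic = sum(t for _, t in streams)
        target = mem_buffers
        while target >= len(streams) * min_buffers:
            alloc = {name: max(min_buffers, math.ceil(target * t / total_traffic))
                     for name, t in streams}
            if sum(alloc.values()) <= mem_buffers:
                return alloc
            target -= 1  # rounding up overshot the memory bound; try a smaller target
        raise ValueError("not enough memory for the minimum allocation")

    # chain8-style example from the text: two streams with equal traffic and room
    # for ten buffers gives five buffers each under this interpretation.
    print(bh_consumer_buffers([("(t_1,t_n)", 1.0), ("(t_{n-1},t_n)", 1.0)], 10))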

The SDF³ tool [SGB06] uses an exhaustive search to find all Pareto-optimal buffer allocations for an SDF graph. It requires exponentially many steps, and only supports constant computation times and data rates. For an n-way split or join where each stream needs b buffers, their algorithm requires n^b steps, while our level algorithm requires O(n log2 b) steps to find a single solution.

StreamRoller [KM08] performs buffer allocation as part of software pipelining, but it is restricted to graphs with fixed multiplicities and computation times. The algorithm is similar to the first iteration of the level algorithm, in that the number of buffers allocated to a stream is always one plus the difference in pipeline stage. The chain8 example in Section 3.1.3 shows that this is conservative, even when there is no variability. Hence the StreamRoller algorithm can require more memory than necessary; if there is insufficient memory, it fails.

Due to the unroll factor we used, StreamRoller failed on at least one benchmark in each of the benchmark sets of Figure 3.18. This is true even for the StreamIt benchmarks, for which our algorithm achieves 100% utilisation on at least one processor. We modified StreamRoller to use our arbitration scheme described in Subsection 3.3.2, and obtained the results shown in Figure 3.18. Even with this modification, however, our iterative algorithm achieves about 13% higher performance on the stochastic random graphs and the stochastic StreamIt benchmarks.

The SPIR compiler [CLC+09] extends StreamRoller to find a partition and software pipeline subject to memory and latency constraints. Unlike our approach, computation times and communication rates are assumed constant. As for StreamRoller, the number of buffers allocated to a stream is one plus the difference in pipeline stage. Since the combined problem cannot be solved exactly as a single ILP, SPIR is a heuristic that uses two passes of the commercial CPLEX ILP solver. Our algorithm could be used to improve the buffer allocation of a partition produced by SPIR.

3.5  Conclusions

This chapter introduced a new partitioning heuristic for stream programs. Unlike previous work, it takes account of the partition’s effect on software pipelining and buffer allocation, controlling the length of the pipeline using the convexity constraint. It also provides a flexible mechanism, absent from previous work, to model the compiler’s ability to fuse kernels.

This chapter also introduced a queue sizing algorithm, which allocates the memory in local stores to streams. Unlike several previous algorithms, it supports streaming programs with non-constant multiplicities and variable computation times. It also achieves higher performance and lower latency than previous algorithms.

