References

[ACD74]: T.L. Adam, K.M. Chandy, and J.R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, 17(12):690, 1974.
[ACO]: ACOTES IST-034869. http://www.hitech-projects.com/euprojects/ACOTES/. Advanced Compiler Technologies for Embedded Streaming.
[ACO08]: ACOTES. IST ACOTES Project Deliverable D2.2 Report on Streaming Programming Model and Abstract Streaming Machine Description Final Version, 2008.
[AG02]: S.V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. Computer, 29(12):66–76, 2002.
[AHSW62]: J.P. Anderson, S.A. Hoffman, J. Shifman, and R.J. Williams. D825-a multiple-computer system for command & control. In Proceedings of the December 4-6, 1962, fall joint computer conference, pages 86–96. ACM, 1962.
[AK02]: R. Allen and K. Kennedy. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers, 2002.
[Amd67]: G.M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483–485. ACM New York, NY, USA, 1967.
[App]: Apple, Inc. http://developer.apple.com/library/mac/#featuredarticles/BlocksGCD/index.html. Introducing Blocks and Grand Central Dispatch.
[ASRV07]: M. Alvarez, E. Salami, A. Ramirez, and M. Valero. HD-VideoBench. A Benchmark for Evaluating High Definition Digital Video Applications. In IISWC 2007, pages 120–125, 2007.
[ATN10]: Cédric Augonnet, Samuel Thibault, and Raymond Namyst. StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines. Research Report RR-7240, INRIA, 03 2010.
[ATNW09]: C. Augonnet, S. Thibault, R. Namyst, and P.A. Wacrenier. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Euro-Par 2009 Parallel Processing, pages 863–874, 2009.
[Bar08]: Barcelona Supercomputing Center. SMP Superscalar (SMPSs) User’s Manual Version 2.0, 2008.
[Bar09]: Barcelona Supercomputing Center. Cell Superscalar (CellSs) User’s Manual Version 2.2, 2009.
[BDG⁺04]: J. Balart, A. Duran, M. Gonzalez, X. Martorell, E. Ayguade, and J. Labarta. Nanos Mercurium: a Research Compiler for OpenMP. In Proceedings of the European Workshop on OpenMP, volume 2004, 2004.
[Ben]: Eli Bendersky. http://code.google.com/p/pycparser/. pycparser.
[BGH⁺90]: J.C. Bier, E.E. Goei, W.H. Ho, P.D. Lapsley, M.P. O’Reilly, G.C. Sih, and E.A. Lee. Gabriel: a design environment for DSP. IEEE Micro, 10(5):28–45, October 1990.
[BH01]: T. Basten and J. Hoogerbrugge. Efficient execution of process networks. Communicating Process Architectures, 2001.
[BJK⁺95]: R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. ACM SIGPLAN Notices, 30(8):207–216, 1995.
[BKSS02]: M.D. Beynon, T. Kurc, A. Sussman, and J. Saltz. Optimizing execution of component-based applications using group instances. Future Generation Computer Systems, 18(4):435–448, 2002.
[BML96]: S.S. Bhattacharyya, P.K. Murthy, and E.A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Pub, 1996.
[Bow69]: E.K. Bowdon, Sr. Priority assignment in a network of computers. IEEE Transactions on Computers, C-18(11):1021–1026, November 1969.
[BPBL06]: P. Bellens, J.M. Perez, R.M. Badia, and J. Labarta. CellSs: a programming model for the Cell BE architecture. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. ACM New York, NY, USA, 2006.
[Buc93]: J.T. Buck. Scheduling dynamic dataflow graphs with bounded memory using the token flow model. PhD thesis, University of California, 1993.
[Buc03]: I. Buck. Brook Spec v0.2, 2003.
[CAG06]: CAG MIT. StreamIt Language Specification, Version 2.1, 2006.
[CCG⁺00]: J. Chaoui, K. Cyr, J.P. Giacalone, S. Gregorio, Y. Masse, Y. Muthusamy, T. Spits, M. Budagavi, and J. Webb. OMAP: Enabling Multimedia Applications in Third Generation (3G) Wireless Terminals. SWPA001, December, 2000.
[CEP]: CEPBA. http://www.cepba.upc.edu/paraver/. Paraver performance visualization and analysis tool.
[CEP01]: CEPBA. Paraver Version 3.0 Parallel Program Visualization and Analysis tool: Tracefile Description, 2001.
[CGT04]: A. Cohen, S. Girbal, and O. Temam. A polyhedral approach to ease the composition of program transformations. Lecture notes in computer science, pages 292–303, 2004.
[CHM95]: C. Chekuri, W. Hasan, and R. Motwani. Scheduling problems in parallel query optimization. In Proceedings of the fourteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 255–265. ACM, 1995.
[CLC⁺09]: Y. Choi, Y. Lin, N. Chong, S. Mahlke, and T. Mudge. Stream Compilation for Real-Time Embedded Multicore Systems. In Proceedings of the 2009 International Symposium on Code Generation and Optimization, pages 210–220. IEEE Computer Society Washington, DC, USA, 2009.
[CRA09a]: Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade. Mapping stream programs onto heterogeneous multiprocessor systems. In CASES ’09: Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, pages 57–66, 2009.
[CRA09b]: Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade. The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors. In SAMOS Workshop, pages 12–23. Springer, 2009.
[CRA10a]: P. Carpenter, A. Ramirez, and E. Ayguade. Starsscheck: A Tool to Find Errors in Task-Based Parallel Programs. Euro-Par 2010-Parallel Processing, pages 2–13, 2010.
[CRA10b]: Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade. Buffer sizing for self-timed stream programs on heterogeneous distributed memory multiprocessors. In High Performance Embedded Architectures and Compilers, 5th International Conference, HiPEAC 2010, pages 96–110. Springer, 2010.
[CRA11]: Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade. The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors. Transactions on HiPEAC, 5(3), 2011.
[CRDI05]: T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broadband Engine Architecture and its first implementation. IBM developerWorks, 2005.
[CRM⁺07]: Paul Carpenter, David Rodenas, Xavier Martorell, Alejandro Ramirez, and Eduard Ayguadé. A streaming machine description and programming model. Proc. of the International Symposium on Systems, Architectures, Modeling and Simulation, Samos, Greece, July 16-19, 2007.
[CSB⁺11]: Hassan Chafi, Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Anand R. Atreya, and Kunle Olukotun. A Domain-Specific Approach To Heterogeneous Parallelism. In 16th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, San Antonio, TX, February 2011.
[DFA⁺09]: A. Duran, R. Ferrer, E. Ayguadé, R.M. Badia, and J. Labarta. A proposal to extend the OpenMP tasking model with dependent tasks. International Journal of Parallel Programming, 37(3):292–305, 2009.
[DG98]: A. Dasdan and R.K. Gupta. Faster maximum and minimum mean cycle algorithms for system-performance analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(10):889–899, 1998.
[DG08]: J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[DYDS⁺10]: M. Duranton, S. Yehia, B. De Sutter, K. De Bosschere, A. Cohen, B. Falsafi, G. Gaydadjiev, M. Katevenis, J. Maebe, H. Munk, et al. The HiPEAC vision. Network of Excellence of High Performance and Embedded Architecture and Compilation, Tech. Rep, 2010.
[EJL⁺03]: J. Eker, J.W. Janneck, E.A. Lee, J. Liu, X. Liu, J. Ludvig, S. Neuendorffer, S. Sachs, and Y. Xiong. Taming heterogeneity—the Ptolemy approach. Proceedings of the IEEE, 91(1):127–144, 2003.
[ERB⁺10]: Yoav Etsion, Alex Ramirez, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, and Mateo Valero. Task superscalar: Using processors as functional units. In Hot Topics in Parallelism (HotPar), Jun 2010.
[ERL90]: Hesham El-Rewini and T. G. Lewis. Scheduling parallel program tasks onto arbitrary target machines. J. Parallel Distrib. Comput., 9:138–153, June 1990.
[ESD02]: M. Ekman, P. Stenström, and F. Dahlgren. TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors. In Proceedings of the 2002 international symposium on Low power electronics and design, pages 243–246. ACM, 2002.
[FC07]: G. Fursin and A. Cohen. Building a Practical Iterative Interactive Compiler. In 1st Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART’07), 2007.
[FHK⁺06]: K. Fatahalian, D.R. Horn, T.J. Knight, L. Leem, M. Houston, J.Y. Park, M. Erez, M. Ren, A. Aiken, W.J. Dally, et al. Sequoia: Programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 83. ACM, 2006.
[FL99]: M. Feng and C.E. Leiserson. Efficient detection of determinacy races in Cilk programs. Theory of Computing Systems, 32(3):301–326, 1999.
[FLA10]: FLAME Project. http://z.cs.utexas.edu/wiki/flame.wiki/FrontPage, 2010.
[FT87a]: Michael L. Fredman and Robert Endre Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM, 34(3):596–615, 1987.
[FT87b]: M.L. Fredman and R.E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM (JACM), 34(3):596–615, 1987.
[FVPF95]: A. Fauth, J. Van Praet, and M. Freericks. Describing instruction set processors using nML. In Proceedings of the 1995 European conference on Design and Test, page 503, 1995.
[GB03]: M. Geilen and T. Basten. Requirements on the execution of Kahn process networks. Lecture Notes in Computer Science, pages 319–334, 2003.
[GG93]: R. Govindarajan and G.R. Gao. A novel framework for multi-rate scheduling in DSP applications. In International Conference on Application-Specific Array Processors, pages 77–88, 1993.
[GGR⁺10]: Christian Grothoff, Krista Grothoff, Matthew J. Rutherford, Kai Christian Bader, Harald Meier, Craig Ritzdorf, Tilo Eissler, Nathan Evans, and Chris GauthierDickey. DUP: A Distributed Stream Processing Language. In IFIP International Conference on Network and Parallel Computing, Zhengzhou, China, 2010. Springer Verlag.
[GKN06]: Emden Gansner, Eleftherios Koutsofios, and Stephen North. Drawing graphs with dot, January 2006.
[GLB00]: S. Girona, J. Labarta, and R.M. Badia. Validation of Dimemas communication model for MPI collective operations. Proc. EuroPVM/MPI, 2000.
[GMA⁺02]: M.I. Gordon, D. Maze, S. Amarasinghe, W. Thies, M. Karczmarek, J. Lin, A.S. Meli, A.A. Lamb, C. Leger, J. Wong, et al. A stream compiler for communication-exposed architectures. ASPLOS, pages 291–303, 2002.
[GMN⁺08]: Lewis Girod, Yuan Mei, Ryan Newton, Stanislav Rost, Arvind Thiagarajan, Hari Balakrishnan, and Samuel Madden. XStream: a signal-oriented data stream management system. International Conference on Data Engineering, pages 1180–1189, 2008.
[GNU]: GNU Radio. http://www.gnu.org/software/gnuradio/. GNU Software Radio Project.
[GR93]: I. Galperin and R.L. Rivest. Scapegoat trees. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, pages 165–174. Society for Industrial and Applied Mathematics Philadelphia, PA, USA, 1993.
[GR05]: Jayanth Gummaraju and Mendel Rosenblum. Stream Programming on General-Purpose Processors. In MICRO 38: Proceedings of the 38th annual ACM/IEEE international symposium on Microarchitecture, Barcelona, Spain, November 2005.
[Gra71]: R.L. Graham. Bounds on multiprocessing anomalies and related packing algorithms. In Proceedings of the November 16-18, 1971, fall joint computer conference, pages 205–217. ACM, 1971.
[GSS06]: Clemens Grelck, Sven-Bodo Scholz, and Alex Shafarenko. S-Net: A typed stream processing language. In Zoltan Horváth and Viktória Zsók, editors, Proceedings of the 18th International Symposium on Implementation and Application of Functional Languages (IFL’06), Budapest, Hungary, Technical Report 2006-S01, pages 81–97. Eötvös Loránd University, Faculty of Informatics, Budapest, Hungary, 2006.
[GTA06]: M.I. Gordon, W. Thies, and S. Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. ASPLOS, pages 151–162, 2006.
[HCAL89]: Jing-Jang Hwang, Yuan-Chieh Chow, Frank D. Anger, and Chung-Yee Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM J. Comput., 18:244–257, April 1989.
[HCK⁺09]: A.H. Hormati, Y. Choi, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. Flextream: Adaptive compilation of streaming applications for heterogeneous architectures. In 18th International Conference on Parallel Architectures and Compilation Techniques (PACT’09), pages 214–223. IEEE, 2009.
[HCW⁺10]: Amir H. Hormati, Yoonseo Choi, Mark Woh, Manjunath Kudlur, Rodric Rabbah, Trevor Mudge, and Scott Mahlke. Macross: macro-simdization of streaming applications. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ASPLOS ’10, pages 285–296, New York, NY, USA, 2010. ACM.
[HGG⁺99]: A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau. EXPRESSION: A language for architecture exploration through compiler/simulator retargetability. In Proceedings of the conference on Design, automation and test in Europe, 1999.
[HJ03]: Tarek Hagras and Jan Janecek. A simple scheduling heuristic for heterogeneous computing environments. International Symposium on Parallel and Distributed Computing, page 104, 2003.
[HL91]: S. Ha and E.A. Lee. Compile-time scheduling and assignment of data-flow program graphs with data-dependent iteration. IEEE Transactions on Computers, 40(11):1225–1238, Nov 1991.
[HP07]: John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, fourth edition, 2007.
[HPFF93]: High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.1. Scientific Programming, 2(1–2):1–170, November 1993.
[HPFF97]: High Performance Fortran Forum. High Performance Fortran Language Specification, Version 2.0. January 1997.
[HS86]: W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29:1170–1183, December 1986.
[IBM]: IBM. http://www.alphaworks.ibm.com/tech/mtrat. Multi-Thread Run-time Analysis Tool for Java.
[IBM09]: IBM. Cell Broadband Engine Programming Handbook including PowerXCell 8i Version 1.11, 2009.
[IBM11]: IBM. IBM Streams Processing Language Specification. 2011.
[iee99]: IEEE Standard for Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) - Amendment D: Additional Realtime Extensions [C Language]. IEEE Std 1003.1d-1999, 1999.
[ILO]: ILOG. http://www.ilog.com/products/cplex/. CPLEX Math Programming Engine.
[Int10]: Intel. A Quick, Easy and Reliable Way to Improve Threaded Performance: Intel Cilk Plus, 2010. http://software.intel.com/en-us/articles/intel-cilk-plus.
[IP95]: K. Ito and K.K. Parhi. Determining the minimum iteration period of an algorithm. The Journal of VLSI Signal Processing, 11(3):229–244, 1995.
[JED10]: J.C. Jenista, Y.H. Eom, and B. Demsky. OoOJava: an out-of-order approach to parallel programming. In Proceedings of the 2nd USENIX conference on Hot topics in parallelism, page 11. USENIX Association, 2010.
[Kar78]: R.M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete mathematics, 23(3):309–311, 1978.
[KET06]: C. Kyriacou, P. Evripidou, and P. Trancoso. Data-driven multithreading using conventional microprocessors. IEEE Transactions on Parallel and Distributed Systems, pages 1176–1188, 2006.
[Khr]: Khronos Group. http://www.opengl.org/.
[Khr10]: Khronos Group. The OpenCL Specification Version: 1.1 Document Revision: 36, 2010.
[Kie99]: B. Kienhuis. Design Space Exploration of Stream-based Dataflow Architectures: Methods and Tools. Delft University of Technology, The Netherlands, 1999.
[KL70]: B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2):291–307, 1970.
[KL88]: B. Kruatrachue and T. Lewis. Grain size determination for parallel processing. IEEE Software, 5(1):23–32, January 1988.
[KM⁺72]: D.J. Kuck, Y. Muraoka, et al. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Transactions on Computers, pages 1293–1310, 1972.
[KM08]: M. Kudlur and S. Mahlke. Orchestrating the execution of stream programs on multicore platforms. Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, pages 114–124, 2008.
[Koh75]: W.H. Kohler. A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems. IEEE Transactions on Computers, C-24(12):1235–1238, 1975.
[KTA03]: Michal Karczmarek, William Thies, and Saman Amarasinghe. Phased scheduling of stream programs. In Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems, LCTES ’03, pages 103–112, New York, NY, USA, 2003. ACM.
[KTJR05]: R. Kumar, D.M. Tullsen, N.P. Jouppi, and P. Ranganathan. Heterogeneous chip multiprocessors. Computer, 38(11):32–38, 2005.
[LA00]: Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation, PLDI ’00, pages 145–156, New York, NY, USA, 2000. ACM.
[Lam74]: L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83–93, 1974.
[LBS]: W.I. Lundgren, K.B. Barnes, and J.W. Steed. Gedae: Auto Coding to a Virtual Machine.
[LCM⁺05]: Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI ’05, pages 190–200, 2005.
[LDWL06]: S. Liao, Z. Du, G. Wu, and G.Y. Lueh. Data and Computation Transformations for Brook Streaming Applications on Multiprocessors. In Proceedings of the International Symposium on Code Generation and Optimization, pages 196–207. IEEE Computer Society Washington, DC, USA, 2006.
[Lee86]: E.A. Lee. A coupled hardware and software architecture for programmable digital signal processors (synchronous data flow). PhD thesis, University of California, Berkeley, 1986.
[Lee06]: E.A. Lee. The problem with threads. Computer, 39(5):33–42, 2006.
[LK78]: J.K. Lenstra and A.H.G. Rinnooy Kan. Complexity of Scheduling under Precedence Constraints. Operations Research, 26(1), 1978.
[LM87]: E.A. Lee and D.G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–1245, 1987.
[LMT⁺04]: F. Labonte, P. Mattson, W. Thies, I. Buck, C. Kozyrakis, and M. Horowitz. The stream virtual machine. 13th International Conference on Parallel Architecture and Compilation Techniques, pages 267–277, 2004.
[LSB09]: D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. ACM SIGPLAN Notices, 44(10):227–242, 2009.
[M⁺11]: Harm Munk et al. ACOTES Project: Advanced Compiler Technologies for Embedded Streaming. International Journal of Parallel Programming, 39:397–450, 2011. doi:10.1007/s10766-010-0132-7.
[MAJ⁺09]: C. Meenderinck, A. Azevedo, B. Juurlink, M. Alvarez, and A. Ramirez. Parallel scalability of video decoders. Journal of Signal Processing Systems, 57(2):173–194, 2009.
[MAS⁺02]: M. Maheswaran, S. Ali, H.J. Siegel, D. Hensgen, and R.F. Freund. Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems. In Proceedings of the Eighth Heterogeneous Computing Workshop (HCW’99), pages 30–44. IEEE, 2002.
[Mat02]: P.R. Mattson. A Programming System for the Imagine Media Processor. PhD thesis, Stanford University, 2002.
[Mat04]: P. Mattson. PCA Machine Model, 1.0, 2004.
[MB06]: Joseph Muscat and David Buhagiar. Connective Spaces. Mem. Fac. Sci. Eng. Shimane Univ. Series B: Mathematical Science, 39:1–13, 2006.
[MIT98]: MIT LCS. Cilk 5.4.6 Reference Manual, 1998.
[Moo65]: G.E. Moore. Cramming more components onto integrated circuits. Electronics, 38:114–117, 1965.
[MRC⁺07]: J. Meng, S. Rohinton, S. Che, J. Huang, J.W. Sheaffer, and K. Skadron. Programming with Relaxed Streams. Technical Report CS-2007-17, University of Virginia, 2007.
[MTHV04]: P. Mattson, W. Thies, L. Hammond, and M. Vahey. Streaming virtual machine specification 1.0. Technical report, 2004. http://www.morphware.org.
[Muc97]: Steven S. Muchnick. Advanced compiler design and implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[Mur71]: Y. Muraoka. Parallelism exposure and exploitation in programs. PhD thesis, University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1971. AAI7121189.
[Nana]: Nanos Group. http://nanos.ac.upc.edu/content/presenting-nanos.
[Nanb]: Nanos project. http://nanos.ac.upc.edu/content/mintaka-instrumentation-library. Mintaka Instrumentation Library.
[NS07]: Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI, pages 89–100, 2007.
[NVI08]: NVIDIA Corporation. http://developer.nvidia.com/cuda/, 2008. NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 2.0.
[OH96]: Hyunok Oh and Soonhoi Ha. A static scheduling heuristic for heterogeneous processors. In Luc Bougé, Pierre Fraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par’96 Parallel Processing, volume 1124 of Lecture Notes in Computer Science, pages 573–577. Springer Berlin / Heidelberg, 1996. doi:10.1007/BFb0024750.
[OH05]: Kunle Olukotun and Lance Hammond. The future of microprocessors. Queue, 3(7):26–29, 2005.
[OIS⁺06]: M. Ohara, H. Inoue, Y. Sohda, H. Komatsu, and T. Nakatani. MPI microtask for programming the Cell Broadband Engine processor. IBM Systems Journal, 45(1):85–102, 2006.
[Ope09]: OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 3.0, May 2009.
[Org08]: OpenMP Organization. OpenMP Application Program Interface, v. 3.0, May 2008.
[Par95]: T.M. Parks. Bounded scheduling of process networks. PhD thesis, University of California, 1995.
[PBL08]: J.M. Perez, R.M. Badia, and J. Labarta. A dependency-aware task-based programming environment for multi-core architectures. In 2008 IEEE International Conference on Cluster Computing, pages 142–151, 2008.
[Pol60]: M. Pollack. The maximum capacity through a network. Operations Research, pages 733–736, 1960.
[Prv06]: M. Prvulovic. CORD: cost-effective (and nearly overhead-free) order-recording and data race detection. In The Twelfth International Symposium on High-Performance Computer Architecture, pages 232–243, Feb. 2006.
[PT03]: M. Prvulovic and J. Torrellas. ReEnact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In Annual International Symposium on Computer Architecture, volume 30, pages 110–121, 2003.
[RDF98]: N. Ramsey, J.W. Davidson, and M.F. Fernandez. Design principles for machine-description languages. ACM Transactions on Programming Languages and Systems, 1998.
[Rei]: J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. 2007.
[RSL02]: M.C. Rinard, D.J. Scales, and M.S. Lam. Jade: A high-level, machine-independent language for parallel programming. Computer, 26(6):28–38, 2002.
[RVVA04]: R. Rangan, N. Vachharajani, M. Vachharajani, and D.I. August. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT 2004), pages 177–188, 2004.
[SB97]: Y. Smaragdakis and D. Batory. DiSTiL: A transformation library for data structures. In Proceedings of the Conference on Domain-Specific Languages (DSL), 1997, page 20. USENIX Association, 1997.
[SBN⁺97]: S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems (TOCS), 15(4):391–411, 1997.
[Ser09]: Konstantin Serebryany and Timur Iskhodzhanov. ThreadSanitizer – data race detection in practice. In Proceedings of the Workshop on Binary Instrumentation and Applications, pages 62–71, 2009.
[SFB⁺09]: J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan. GRAMPS: A programming model for graphics pipelines. ACM Transactions on Graphics (TOG), 28(1):1–11, 2009.
[SGB06]: S. Stuijk, M. Geilen, and T. Basten. Exploring trade-offs in buffer requirements and throughput constraints for synchronous dataflow graphs. In Proceedings of the 43rd annual conference on Design automation, pages 899–904, 2006.
[SL93]: G.C. Sih and E.A. Lee. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Transactions on Parallel and Distributed Systems, 4:175–187, 1993.
[SN05]: J. Seward and N. Nethercote. Using Valgrind to detect undefined value errors with bit-precision. In Proceedings of the USENIX Annual Technical Conference, page 2. USENIX Association, 2005.
[SP09]: Raül Sirvent Pardell. GRID Superscalar: a Programming Model for the Grid. PhD thesis, Technical University of Catalonia (UPC), 2009.
[THW99]: H. Topcuoglu, S. Hariri, and M.Y. Wu. Task Scheduling Algorithms for Heterogeneous Processors. In Proceedings of the Eighth Heterogeneous Computing Workshop, page 3. IEEE Computer Society, 1999.
[TKA02]: W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications. International Conference on Compiler Construction, 4, 2002.
[TKA⁺10]: George Tzenakis, Konstantinos Kapelonis, Michail Alvanos, Konstantinos Koukos, Dimitrios Nikolopoulos, and Angelos Bilas. Tagged procedure calls: Efficient runtime support for task-based parallelism on the cell processor. In Yale Patt, Pierfrancesco Foglia, Evelyn Duesterwald, Paolo Faraboschi, and Xavier Martorell, editors, High Performance Embedded Architectures and Compilers, volume 5952 of Lecture Notes in Computer Science, pages 307–321. Springer Berlin / Heidelberg, 2010.
[TKS⁺05]: William Thies, Michal Karczmarek, Janis Sermulins, Rodric Rabbah, and Saman Amarasinghe. Teleport messaging for distributed stream programs. In Principles and Practice of Parallel Programming, pages 224–235, 2005.
[TOP]: TOP500. http://www.top500.org/.
[Ull75]: J.D. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3):384–393, 1975.
[Uni09]: University of Tennessee. PLASMA Users’ Guide, Parallel Linear Algebra Software for Multicore Architectures, 2009.
[VDKV00]: A. Van Deursen, P. Klint, and J. Visser. Domain-specific languages: An annotated bibliography. ACM SIGPLAN Notices, 35(6):36, 2000.
[vdWdKH⁺04]: P. van der Wolf, E. de Kock, T. Henriksson, W. Kruijtzer, and G. Essink. Design and programming of embedded multiprocessors: an interface-centric approach. Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, pages 206–217, 2004.
[VWY07]: V. Vassilevska, R. Williams, and R. Yuster. All-pairs bottleneck paths for general graphs in truly sub-cubic time. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 585–589. ACM New York, 2007.
[WG90]: M.Y. Wu and D.D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1(3):330–343, 1990.
[WTS⁺97]: E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, et al. Baring It All to Software: Raw Machines. Computer, pages 86–93, 1997.
[WY02]: Namyoon Woo and Heon Young Yeom. k-depth look-ahead task scheduling in network of heterogeneous processors. In Revised Papers from the International Conference on Information Networking, Wireless Communications Technologies and Network Applications-Part II, ICOIN ’02, pages 736–745, London, UK, 2002. Springer-Verlag.
[YG94]: T. Yang and A. Gerasoulis. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. IEEE Transactions on Parallel and Distributed Systems, 5:951–967, 1994.