|
Ct: C for Throughput Computing
One of the main challenges in scaling multi-core for the future is that of migrating programming tools, build environments, and millions of lines of existing code to new parallel programming models or compilers. To help this transition, Intel researchers are developing “Ct,” or C/C++ for Throughput Computing. The Motivation for Ct
Increased software-exposed parallelism in microprocessors and co-processors (like GPUs) is driving development of applications and programming models that scale to utilize many threads and vector instructions. This need is multiplied as these “throughput” architectures will increasingly require software hiding of the exposed micro-architectural latencies through the use of additional threads. Collection-oriented (or data-oriented) parallelism will comprise a large majority of parallelized and vectorized compute cycles in future programs. In a future architecture supporting hundreds to thousands of hardware threads and thousands of software threads, thread creation will often be driven by small and large data structures representing vectors, hierarchies, hash tables, etc. Truly heterogeneous task parallelism will almost certainly exist in most applications to some extent. Having a flexible, parallel collection-oriented building block for user-defined classes will also be essential. Using Ct, developers can provide a parallel-ready version of common data types that C/C++ programmers already use.
How does it work?
In principal, Ct works with any standard C++ compiler because it is a standards-compliant C++ library (with a lot of runtime behind the scenes). When one initializes the Ct library, one loads a runtime that includes the compiler, threading runtime, memory manager — essentially all the components one needs for threaded and/or vectorized code generation. Ct code is dynamically compiled, so the runtime tries to aggregate as many smaller tasks or data parallel work quanta so that it can minimize threading overhead and control the granularity according to run-time conditions. With the Ct dynamic engine, one gets precise inter-procedural traces to compile, which is extremely useful in the highly modularized and indirect world of object-oriented programming. Collections in Ct
One of the justified criticisms of data-parallel programming models is that they focus on the very restrictive (but performance friendly) vector as the primitive parallel collection. Defining parallel operations over vectors is useful when a data type fits nicely into a flat, contiguous segment of memory. For dense linear algebra, for example, this is a convenient abstraction. What if the collection is a sparse matrix, a tree, a graph, or a set of value-key associations? With Ct, the programmer can do this using a TVEC, or a “throughput vector”.
Attributes like length and, more generally, shape are dynamically determined. The programmer that wants more control can declare a TVEC with a particular shape. The declaration here looks very much like a reference type in C++. ADoubleVec refers to an immutable value allocated on Ct’s heap. TVECs are managed (i.e. garbage-collected). If one assigns a new value to a TVEC variable, one is not overwriting the old value (though the compiler may do this). This is because another computation may need this old value, especially as the runtime compiler is aggressively reordering and parallelizing computations. In this respect, TVEC values are immutable. A flat vector: A 2D matrix: An irregular, nested vector (like segmented vectors in Nesl): An indexed vector: The base types of TVECs can be any of the usual value types, and will soon include tuples and records (structs). Ct also supports operators on a full range of scalar value types. Operators in Ct
All operators on TVECs are implicitly parallel. All operators in Ct are uniformly defined over TVECs of all shape. Though the implementations of various shapes may be more complex than others, this is a detail handled by the runtime compiler. There are several broad categories of Ct operators, three of which we will cover here: element-wise, collective communication, and permutation.
Things get a little more interesting with collective communication operations. These operators include reductions and scans (or, as they are otherwise known, prefix-sums). A reduction simply adds a binary operator over an entire collection, essentially it’s innermost dimension. For flat vectors, this means one reduces the vector to a single value. For example, the following is the result of a sumReduce:
Note that while the semantics are consistent and sensible, the way reductions are performed depends heavily on the shape of the vector. The programmer does not have to worry about this (though the runtime does). They can program generically against a consistent set of reduction and scan operators. Scans are similar to reductions, but do not reduce dimensionality, instead retaining the partial results in sequence order. This is quite useful for sorting operations.
User-defined Operators in Ct
Ct supports a complete scalar set of types for a couple of reasons. First, operations involving both TVECs and scalars are useful. Second, programmers should be able to define their own operations or function to apply across TVECs or to use as combining operators in reductions and scans. (For functional programming folks, think map, fold, and foldlist.) This flavor of data parallel programming can be referred to as “kernel” style, as opposed to the “vector” flavor also supported in Ct. Both are useful in differing circumstances. For linear algebra, we find the “vector” style easier to express algorithms, whereas for some signal and image processing algorithms, we find the “kernel” style easier. The underlying machinery is the same, but having both styles is an expressive convenience to the programmer.
The programmer can write:
Where the programmer has defined:
This is a basic example. Things get interesting when the programmer introduces some control flow (If-Then-Else's, For loops, While loops) or wants to touch neighboring pixels, say, for a 3x3 filter.
The alternative would have been to shift this image around, logically creating 4 additional copies. Though the compiler would have optimized these copies away, it’s a little misleading and not transparent enough in terms of the programmer’s intent. One can even express partial orders on the applications of these kernels (think deblocking filters for H.264):
With reduction and scans, things get more interesting. For circuit folks, carry look-ahead adders are just specialized scans. Tasks in Ct
The Ct API is a declarative way of specifying parallel tasks that are dependent on each other. So, collections of Ct operations on TVECs become collections of task dependence graphs. This still might be too constraining for those who would like to get at tasking abstractions directly. To this end, Ct introduces two abstractions for task parallelism. Execution Semantics of Ct
Ct is a dynamically compiled via a language runtime through library initialization. This is done so that one can expose these higher-level abstractions to heavy compiler optimizations, especially as one migrates code between platforms. This means that the Ct API calls used in a program may execute out of order and asynchronously with respect to the rest of one's C/C++ code. The only Ct API calls that are guaranteed to be “synchronous” with the C/C++ code are those that move data back and forth between TVECs and native C/C++ memory. The semantics of Ct allow this aggressive reordering by the compiler and runtime, though there will be some effort required if one is heavily intermixing C/C++ code and Ct API calls to optimize one's code. For functional programmers, the evaluation of Ct API calls can be viewed as “lenient” (Ct is strict and purely functional). For web programmers, the execution model will look similar to Python in that one “executes” all the code at least once, including the object declarations (which are bits of code that create and initialize the objects), then one uses the objects and their methods. In Ct, the analogy would be that the C/C++ sequencing of Ct API calls is the first pass through, then the runtime’s code cache is used to avoid recompilation. Learn More
Please follow the following links for more information. As a research project, Ct is not presently available for download and use.
|