The Titanium language draws on the experience from other global address space languages like Split-C and AC. Titanium takes its control model, SPMD parallelism, from message passing libraries like MPI. It takes the communication model from thread-based languages, such as POSIX threads, in which threads communicate implicitly by reading and writing to shared variables. Like Split-C and AC, though, the Titanium address space is partitioned into separate areas, typically one per processor, to give programmers control over data layout. Titanium also draws on the experience of Colella, Hilfinger, and Semenzato on a language for finite differencing techniques, called FiDIL. Titanium includes support for multidimensional arrays and a separate abstraction for index spaces to ease the programming of stencil operations, hierarchical algorithms such as multigrid, and adaptive meshing refinement (AMR). It also provides support for immutable classes, region-based memory management, and operator overloading. An overview paper of the Titanium language is available, along with a language reference and incomplete tutorial. Several of the static analysis techniques listed below involve language support along with compiler and runtime support. An early language design effort was based on C++, and the initial Java version was a language called Split Java, designed by David Gay as a course project.
The Titanium language distinguishes between local and global references: a local reference is one that refers to (points to) an object on the same processor, while global references may point to objects on a remote processor. This distinction is important on distributed memory implementations of Titanium, because global pointers (the more general case) incur runtime overhead to determine whether communication is required on a dereference. A programmer may declare a reference as local to avoid this overhead. Using LQI, the compiler automatically infers that certain references are local, which has been shown to produce savings on some applications. The implementation of LQI in the Titanium compiler makes use of the BANE analysis toolkit. A paper is available in postscript or pdf. See Ben Liblit's LQI page for the latest information.
Many parallel programs, including those written in Titanium, Split-C, and message passing systems like MPI, are written in SPMD style, i.e. by running the same sequential program on all processes. SPMD programs include synchronization, but it is easy to write incorrect synchronization patterns. We propose a system that verifies a program's synchronization pattern. We also propose language features to make the synchronization pattern more explicit and easily checked. We have implemented a prototype of our system for Split-C and successfully verified the synchronization structure of realistic programs. This work is presented in a POPL98 paper and a master's thesis by David Gay.
In Titanium, as in Java, a kind of safe memory management is used to keep programmers from accessing unallocated memory. Rather than implementing this with garbage collection, Titanium uses a region-based memory management scheme in which each allocation specifies a region, and memory is reclaimed by destroying a region, freeing all the storage allocated therein. This language feature is based on a study done on region-based memory management in allocation-intensive C programs, which shows that regions are competitive with malloc/free and sometimes substantially faster due to locality improvements. Our implementation uses static analysis to improve safety and reduce the cost of runtime checking. Experience with our benchmarks suggests that regions are easy to use. Regions with static analysis are presented in a PLDI98 paper and a related report describing a fully dynamic region-based approach.
Titanium, Split-C, and most thread-based extensions of C and Fortran involve separate threads that communicate through a global address space. Any type of code motion on such explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. This analysis is similar to dependence analysis, but is called cycle detection, because it involves looking for cycles in the shared variable accesses between processors. This work is related to a large body of research on memory consistency models, and in particular, cycle detection can be use to implement a strong semantic model (such as sequential consistency) on top of hardware that support some weaker model. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared memory style into a Split-C program, which has primitives for non-blocking memory operations and one-way communication. A paper describing the basic analysis, along with a prototype implementation for Split-C on the Connection Machine CM-5 appeared in the Journal of Parallel and Distributed Computation, 1996. A PhD Thesis by Arvind Krishnamurthy describes more recent work, including theoretical bounds on the algorithm and performance data from the Berkeley NOW. The optimizations include message pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data re-use.
One of the most striking trends in machine architecture over the last decade has been the widening gap between CPU and memory speeds. Today, the number of cache misses is often a better measure of program performance than the number of instructions executed. The Titanium language has an unordered loop construct called "foreach" to simplify the optimization of loops over arrays. We are developing analyses to reorder the computation on some of the computational kernels that are common on block-structured meshes.
In this paper, we discuss the design and implementation of a method inlining optimization added to the Titanium compiler. We address issues relevant to the design of any inlining optimization, and specifics relevant to Java and Titanium. Two heuristics are developed for guiding the inlining decisions, a manual, programmer-assisted heuristic and an automatic heuristic based on call graph structure and an analysis of method characteristics. Results are presented detailing the effectiveness of each heuristic and the performance improvements realized.
There is ongoing work to make the Titanium compiler more portable. These include implementations to the Cray T3E, which directly uses the hardware instruction for remote memory operations, the Tera machine, which exploits loop-level parallelization to generate large numbers of threads to mask memory latency, and a runtime layer on top of the VIA message standard, which runs on the Millennium cluster at Berkeley. We've also recently added support for UDP-based distributed computing clusters that should work on any cluster with standard TCP/IP networking support.
The latest versions of Titanium include distributed-memory backends
that communicate using
GASNet,
a high-performance communication interface designed especially for
SPMD global address-space languages like Titanium (and UPC)
that offers better portability and higher-level operations which can
leverage hardware-specific features that support a global-address space model.
Titanium also supports using Active Messages 2.0 as the standardized networking interface for some of the older cluster-based parallel backends. Active Messages is a low-level, high-performance communication paradigm first proposed by von Eicken et al. that basically amounts to a super-lightweight RPC mechanism, which is generally implemented as a zero-copy, fully user-level protocol that is highly tuned to the networking hardware. Titanium uses several different AM 2.0 implementations for various backends: Lanai AM AMMPI AMUDP AMLAPI |
We perform an empirical comparison of the Cray-T3E and the Cray-T3D from the perspective of compiling a global address space language. Both machines provide an elaborate shell to support different forms of global memory access. We provide a detailed performance characterization of the machine primitives and evaluate their utility in code generation for a parallel language. We observe that the changes made in the T3E result in fewer options for implementing remote access and overall a much simpler compilation strategy. Unfortunately, the raw hardware performance of some of the remote access mechanisms have worse performance on the T3E due to extra logic on the destination processor and the presence of a second level cache on the source processor. However, the language implementation adds less overhead than a corresponding implementation on the T3D resulting in better end-to-end performance for some of the primitives. Papers describing the T3D support and a comparison between the T3D and T3E are both available.
The file I/O classes present in Java have proven too inefficient to meet
the demands of high-performance applications that perform large amounts of
I/O. The inefficiencies stem primarily from the library interface which
requires programs to read arrays a single element at a time. We present two
extensions to the Java I/O libraries which alleviate this problem. The
first adds bulk (array) I/O operations to the existing libraries, removing
much of the overhead currently associated with array I/O. The second is a
new library that adds direct support for asynchronous I/O to en-able
masking I/O latency with overlapped computation. The extensions were
implemented in Titanium, a high-performance, parallel dialect of Java. We
present experimental results that compare the performance of the extensions
with the existing I/O libraries on a simple, external merge sort
application. The results demonstrate that our extensions deliver vastly
superior I/O per-formance for this array-based application.
The paper (
PDF,
gzipped-Postscript )
describing this work will appear in the Java Grande 2000 Conference.
This copy is posted by permission of ACM and may not be reditributed.
The definitive version of this paper is located
here.
A later paper (
Postscript ) on the same topic.
Also see the Bulk I/O Documentation
Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access, including network co-processors that range in capability for limited special-purpose communication devices to general purpose co-processors with the same computational capabilities as the main processor. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our implementations of a global address language on the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, Cray T3D, and Berkeley NOW. This evaluation includes a range of compilation strategies that make varying use of the network processor; each is optimized for the target architecture and the particular strategy. We analyze a family of interacting issues that determine the performance trade-offs in each implementation, quantify the resulting latency, overhead, and bandwidth of the global access operations, and demonstrate the effects on application performance. A paper describing this work appeared in ASPLOS '96.
Titanium defines a data consistency model equivalent to the Java specification. This consistency model calls for locally sequential consistency, with global consistency attained at synchronization points and arbitrary write reordering. The Titanium language is targeted at homogeneous multiprocessors on distributed memory architectures. Accesses to distant memory locations may result in communication over a network. The Themis system builds a Titanium-compliant consistency interface for Active Messages running on a NOW. This paper examines several methods for designing a Themis system that performs well for Titanium applications. Microbenchmark tests and a Conjugate Gradient Method test show the viability of each of these designs. One of the methods, a caching mechanism called OrderCache, displays promise for reducing communication loads and exploiting the locality of data within applications. A paper is available in postscript or pdf. See Ben Liblit's Themis page for further information.
Titanium is an explicitly-parallel SPMD language, which means that there are fixed set of threads started at program startup time. In most implementations, there is a single thread per processor. While SPMD programming is easy to understand, it lacks the expressiveness of multi-threading, which allows the programming to separate functionally distinct segments of control flow. In this report we introduce a multi-threaded extension of Titanium that can coexist with SPMD programming and the global synchronization primitives, such as barriers, that are common in SPMD program. We discuss the changes to the Titanium language and the consequences on ensuring correct and deadlock-free code. We also describe a prototype implementation of this extended Titanium.
Titanium allows for both global synchronization, such as barriers and reductions, and pairwise synchronization for mutual exclusion. The language mechnanism for the latter is taken from Java: methods or clode blocks may be annotated with the keyword "synchronized" which guarantees that at most one thread will be in that block or method at the same time. The runtime systems supports this from of synchronization by associating locks with objects. In Titanium implementations on shared memory machines, this is built directly on the underlying locking mechanism from a system like POSIX threads. On distributed memory platforms, like the Berkeley NOW, locking is integrated into the global referencin mechanism so that the difference between shared and distributed memory is functionally transparent. a
Sdb is a GDB 4.17 based debugger with extensions for debugging executables created with the reference Titanium compiler. It supports a wide range of platforms as well as the various runtime models generated by the compiler. The User's Manual is available in html or gzipped postscript, along with an example Titanium program. See Carleton Miyamoto's sdb page for up-to-date information on sdb.
As an initial study of extending Titanium to support unstructured grid, we have developed an interface to the PETSc library for finite element problems. Petsc is implemented in C using and object-oriented style; this work is also exploring the general problem of interfacing to libraries written in other languages.
Titanium targets scientific applications, which are dominated by floating point computations. Research on floating point support in Java and a proposal for improvement was done by Joe Darcy.
Complex biological systems containing tissue immersed in a viscous incompressible fluid are ubiquitous. The tissue may be elastic or active, and it may posess complicated internal structure. Its interaction with the fluid is often coupled with other physical processes, such as biochemical reactions, electrical currents and heat diffusion. We have developed an algorithm for efficient computation of the immersed boundary method on distributed systems using Titanium. The IB software package is based on this algorithm.
Full-sized movie |
The software is currently being used in several current and future immersed boundary simulation applications. Currently the two main projects are a larger, more detailed model of the cochlea, and a distributed implementation of C. Peskin and D. McQueen's heart model, which we are working on refining. This work is part of the NPACI Alpha program. | Full-sized movie |
The goal of this project is twofold. First, we want
to build the CHOMBO
infrastructure for AMR (adaptive mesh
refinement) applications in the Titanium
environment, meanwhile providing a test case for
Titanium's performance and programmability. We hope
new AMR algorithms can be developed easier in
Titanium by leveraging its strength - language-level,
one-sided high-performance communication.
Currently, we are developing AMR algorithms for ocean modeling. |
This application is an elliptic solver that solves the Poisson equation with variable coefficients; the algorithm is for a finite domain in 3 dimensions. The algorithm uses block-structured adaptive mesh refinement (AMR) that highlights the utility of Titanium's multidimensional array, point, and domain abstractions. Some examples of physical phenomenon that are modeled by the Poisson equations are: electrostatic and gravitational potential, the heat equation, and vorticity. We have worked with Astrophysists Richard Klein and Chris McKee to use this solver on problems that arise in the collapse phase of star formation. One of the challenges in achieving high performance in a multigrid-style algorithm is the high ratio of communication to computation. Values from adjacent grid cells sometimes live on different processors.
This algorithm solves the Poisson equation for constant coefficients over an infinite domain. The current implementation is for two dimensional domains and is non-adaptive; there are plans to extend it to three dimensions in the near future, and possibly to develop an adaptive version of the algorithm. One of the key features of this algorithm is its scalability. Rather than performing some kind of relaxation across the entire domain, complete solves are done locally on a fine grid patches without communication, and then a single coarse grid is constructed over the entire domain. There are implementation of this algorithm in both Titanium and in the Kelp library.
This code solves the compressible Euler equations for gas dynamics, which is an example of a hyperbolic system of conservation force laws. An example is advection. The algorithm is currently implemented for two dimensional domains and is nonadaptive, although an adaptive version is underway and the modification to three dimensions is planned for the near future. The algorithm uses infrequent communication, and should therefore scale well to large numbers of processors. A paper describing this work is available.
The combined availability of a rapidly growing database of genomic sequences, and the technology to create oligonucleotide
arrays provides novel opportunities for genome-wide experimental measurements of biological systems. One challenge in
exploiting this technology is the selection of oligonucleotide probe sequences for making arrays.
Because of the increasing size of genomic sequence databases, determining an optimal set of probe sequences is expensive
computationally. We've implemented a parallel application in Titanium for selecting optimal oligonucleotide sequences from
the entire genome of simple organisms. Oligos are rated and selected based on melting temperature, secondary structure
(folding properties), and uniqueness amongst the genome of interest.
Also available:
A Paper
More Info
Several smaller benchmarks have also been developed in Titanium, written by members of the group. These include: Cholesky factorization for dense matrices, Guassian elimination (for dense matrices, without pivoting), Quicksort, Term unification, SAXPY, 2D Tridiagonal solver, Matrix Multiplication (including Cannon's algorithm), Particle-in-Cell method, MP3D, 1d FFT, simple 2D Multigrid, Bitonic sorting, Sparse Matrix-Vector Multiplication, EM3D, and two particle simulations.