Titanium Research Projects

The Titanium-related research is divided into the following categories:

Language design
Static analyses and compiler optimizations
Runtime systems
Support tools
Applications

Suggested BiBTeX entries are provided for some of the papers mentioned below.

Language Research

A Brief History of the Titanium Language
The Titanium language draws on the experience from other global address space languages like Split-C and AC. Titanium takes its control model, SPMD parallelism, from message passing libraries like MPI. It takes the communication model from thread-based languages, such as POSIX threads, in which threads communicate implicitly by reading and writing to shared variables. Like Split-C and AC, though, the Titanium address space is partitioned into separate areas, typically one per processor, to give programmers control over data layout. Titanium also draws on the experience of Colella, Hilfinger, and Semenzato on a language for finite differencing techniques, called FiDIL. Titanium includes support for multidimensional arrays and a separate abstraction for index spaces to ease the programming of stencil operations, hierarchical algorithms such as multigrid, and adaptive meshing refinement (AMR). It also provides support for immutable classes, region-based memory management, and operator overloading. An overview paper of the Titanium language is available, along with a language reference and incomplete tutorial. Several of the static analysis techniques listed below involve language support along with compiler and runtime support. An early language design effort was based on C++, and the initial Java version was a language called Split Java, designed by David Gay as a course project.

Static Analyses and Optimizations

Local Qualification Inference (LQI), Ben Liblit
The Titanium language distinguishes between local and global references: a local reference is one that refers to (points to) an object on the same processor, while global references may point to objects on a remote processor. This distinction is important on distributed memory implementations of Titanium, because global pointers (the more general case) incur runtime overhead to determine whether communication is required on a dereference. A programmer may declare a reference as local to avoid this overhead. Using LQI, the compiler automatically infers that certain references are local, which has been shown to produce savings on some applications. The implementation of LQI in the Titanium compiler makes use of the BANE analysis toolkit. A paper is available in postscript or pdf. See Ben Liblit's LQI page for the latest information.
Barrier Inference, Alexander Aiken and David Gay
Many parallel programs, including those written in Titanium, Split-C, and message passing systems like MPI, are written in SPMD style, i.e. by running the same sequential program on all processes. SPMD programs include synchronization, but it is easy to write incorrect synchronization patterns. We propose a system that verifies a program's synchronization pattern. We also propose language features to make the synchronization pattern more explicit and easily checked. We have implemented a prototype of our system for Split-C and successfully verified the synchronization structure of realistic programs. This work is presented in a POPL98 paper and a master's thesis by David Gay.
Memory Management with Explicit Regions, Alexander Aiken and David Gay
In Titanium, as in Java, a kind of safe memory management is used to keep programmers from accessing unallocated memory. Rather than implementing this with garbage collection, Titanium uses a region-based memory management scheme in which each allocation specifies a region, and memory is reclaimed by destroying a region, freeing all the storage allocated therein. This language feature is based on a study done on region-based memory management in allocation-intensive C programs, which shows that regions are competitive with malloc/free and sometimes substantially faster due to locality improvements. Our implementation uses static analysis to improve safety and reduce the cost of runtime checking. Experience with our benchmarks suggests that regions are easy to use. Regions with static analysis are presented in a PLDI98 paper and a related report describing a fully dynamic region-based approach.
Analyses and Optimizations for Shared Address Space Programs, A. Krishnamurthy and K. Yelick
Titanium, Split-C, and most thread-based extensions of C and Fortran involve separate threads that communicate through a global address space. Any type of code motion on such explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. This analysis is similar to dependence analysis, but is called cycle detection, because it involves looking for cycles in the shared variable accesses between processors. This work is related to a large body of research on memory consistency models, and in particular, cycle detection can be use to implement a strong semantic model (such as sequential consistency) on top of hardware that support some weaker model. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared memory style into a Split-C program, which has primitives for non-blocking memory operations and one-way communication. A paper describing the basic analysis, along with a prototype implementation for Split-C on the Connection Machine CM-5 appeared in the Journal of Parallel and Distributed Computation, 1996. A PhD Thesis by Arvind Krishnamurthy describes more recent work, including theoretical bounds on the algorithm and performance data from the Berkeley NOW. The optimizations include message pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data re-use.
Memory Hiearchy Optimizations in Titanium, Geoff Pike
One of the most striking trends in machine architecture over the last decade has been the widening gap between CPU and memory speeds. Today, the number of cache misses is often a better measure of program performance than the number of instructions executed. The Titanium language has an unordered loop construct called "foreach" to simplify the optimization of loops over arrays. We are developing analyses to reorder the computation on some of the computational kernels that are common on block-structured meshes.

Method Inlining Optimization in Titanium, Dan Bonachea

In this paper, we discuss the design and implementation of a method inlining optimization added to the Titanium compiler. We address issues relevant to the design of any inlining optimization, and specifics relevant to Java and Titanium. Two heuristics are developed for guiding the inlining decisions, a manual, programmer-assisted heuristic and an automatic heuristic based on call graph structure and an analysis of method characteristics. Results are presented detailing the effectiveness of each heuristic and the performance improvements realized.

Runtime Systems Research

Making Titanium Portable
There is ongoing work to make the Titanium compiler more portable. These include implementations to the Cray T3E, which directly uses the hardware instruction for remote memory operations, the Tera machine, which exploits loop-level parallelization to generate large numbers of threads to mask memory latency, and a runtime layer on top of the VIA message standard, which runs on the Millennium cluster at Berkeley. We've also recently added support for UDP-based distributed computing clusters that should work on any cluster with standard TCP/IP networking support.
High-performance communication systems
The latest versions of Titanium include distributed-memory backends that communicate using GASNet, a high-performance communication interface designed especially for SPMD global address-space languages like Titanium (and UPC) that offers better portability and higher-level operations which can leverage hardware-specific features that support a global-address space model.
Titanium also supports using Active Messages 2.0 as the standardized networking interface for some of the older cluster-based parallel backends. Active Messages is a low-level, high-performance communication paradigm first proposed by von Eicken et al. that basically amounts to a super-lightweight RPC mechanism, which is generally implemented as a zero-copy, fully user-level protocol that is highly tuned to the networking hardware. Titanium uses several different AM 2.0 implementations for various backends: Lanai AM AMMPI AMUDP AMLAPI

Support Tools

SDB: A Titanium Debugger, Carleton Miyamoto
Sdb is a GDB 4.17 based debugger with extensions for debugging executables created with the reference Titanium compiler. It supports a wide range of platforms as well as the various runtime models generated by the compiler. The User's Manual is available in html or gzipped postscript, along with an example Titanium program. See Carleton Miyamoto's sdb page for up-to-date information on sdb.
Foreign Function Interfaces
As an initial study of extending Titanium to support unstructured grid, we have developed an interface to the PETSc library for finite element problems. Petsc is implemented in C using and object-oriented style; this work is also exploring the general problem of interfacing to libraries written in other languages.
Floating Point in Java
Titanium targets scientific applications, which are dominated by floating point computations. Research on floating point support in Java and a proposal for improvement was done by Joe Darcy.

Applications

Immersed Boundary Method Simulations, Ed Givelberg, Armando Solar
Complex biological systems containing tissue immersed in a viscous incompressible fluid are ubiquitous. The tissue may be elastic or active, and it may posess complicated internal structure. Its interaction with the fluid is often coupled with other physical processes, such as biochemical reactions, electrical currents and heat diffusion. We have developed an algorithm for efficient computation of the immersed boundary method on distributed systems using Titanium. The IB software package is based on this algorithm.

Full-sized movie The software is currently being used in several current and future immersed boundary simulation applications. Currently the two main projects are a larger, more detailed model of the cochlea, and a distributed implementation of C. Peskin and D. McQueen's heart model, which we are working on refining. This work is part of the NPACI Alpha program.
Full-sized movie
Titanium Chombo Project, Tong Wen and Phillip Colella

The goal of this project is twofold. First, we want to build the CHOMBO infrastructure for AMR (adaptive mesh refinement) applications in the Titanium environment, meanwhile providing a test case for Titanium's performance and programmability. We hope new AMR algorithms can be developed easier in Titanium by leveraging its strength - language-level, one-sided high-performance communication.
Currently, we are developing AMR algorithms for ocean modeling.
AMR Poisson Solver, Luigi Semenzato, Geoff Pike, and Phillip Colella
This application is an elliptic solver that solves the Poisson equation with variable coefficients; the algorithm is for a finite domain in 3 dimensions. The algorithm uses block-structured adaptive mesh refinement (AMR) that highlights the utility of Titanium's multidimensional array, point, and domain abstractions. Some examples of physical phenomenon that are modeled by the Poisson equations are: electrostatic and gravitational potential, the heat equation, and vorticity. We have worked with Astrophysists Richard Klein and Chris McKee to use this solver on problems that arise in the collapse phase of star formation. One of the challenges in achieving high performance in a multigrid-style algorithm is the high ratio of communication to computation. Values from adjacent grid cells sometimes live on different processors.
Poisson Solver over Infinite Domains, by Greg Balls and Phillip Colella

This algorithm solves the Poisson equation for constant coefficients over an infinite domain. The current implementation is for two dimensional domains and is non-adaptive; there are plans to extend it to three dimensions in the near future, and possibly to develop an adaptive version of the algorithm. One of the key features of this algorithm is its scalability. Rather than performing some kind of relaxation across the entire domain, complete solves are done locally on a fine grid patches without communication, and then a single coarse grid is constructed over the entire domain. There are implementation of this algorithm in both Titanium and in the Kelp library.
Gas Dynamics, Peter McCorquodale and Phillip Colella
This code solves the compressible Euler equations for gas dynamics, which is an example of a hyperbolic system of conservation force laws. An example is advection. The algorithm is currently implemented for two dimensional domains and is nonadaptive, although an adaptive version is underway and the modification to three dimensions is planned for the near future. The algorithm uses infrequent communication, and should therefore scale well to large numbers of processors. A paper describing this work is available.
Microarray Optimal Oligo Selection Engine, Dan Bonachea, Jarrod Chapman and Nik Putnam
The combined availability of a rapidly growing database of genomic sequences, and the technology to create oligonucleotide arrays provides novel opportunities for genome-wide experimental measurements of biological systems. One challenge in exploiting this technology is the selection of oligonucleotide probe sequences for making arrays.
Because of the increasing size of genomic sequence databases, determining an optimal set of probe sequences is expensive computationally. We've implemented a parallel application in Titanium for selecting optimal oligonucleotide sequences from the entire genome of simple organisms. Oligos are rated and selected based on melting temperature, secondary structure (folding properties), and uniqueness amongst the genome of interest.

Also available: A Paper More Info
Titanium Benchmarks
Several smaller benchmarks have also been developed in Titanium, written by members of the group. These include: Cholesky factorization for dense matrices, Guassian elimination (for dense matrices, without pivoting), Quicksort, Term unification, SAXPY, 2D Tridiagonal solver, Matrix Multiplication (including Cannon's algorithm), Particle-in-Cell method, MP3D, 1d FFT, simple 2D Multigrid, Bitonic sorting, Sparse Matrix-Vector Multiplication, EM3D, and two particle simulations.

Back to main page
Last updated Tuesday April 29, 2014
Comments to Titanium developers