Titanium Backend Specification
Dan Bonachea
September 2001

The following specification describes the interface between the Titanium
runtime system (which includes generated code, hand-written native code, and
backend-independent runtime subsystems) and the backend implementation for a
particular parallel architecture. The interface changes from time to time, so
this document is likely to be slightly out of date by the time you read it,
but hopefully it will serve as a good starting point for people interested in
porting Titanium to a new backend, writing some native code, or understanding
the operation of the runtime system.

Note that the tic-based backends (sequential, smp, *-cluster-uniprocess,
*-cluster-smp) also have a more stable lower-level interface which may be more
appropriate for porting to some systems. These backends use pthreads for
intrabox communication and Active Messages (AM-2) for interbox communication.
On new distributed systems, you should consider implementing AM-2 on the
interconnect of the new distributed platform (as was done for the IBM SP using
the AM-to-LAPI layer) and leveraging all the existing tic code, rather than
re-creating it.

The file backend-defines.h must define the appropriate macros from the
following set (leaving them undefined if they don't apply):

  MEMORY_DISTRIBUTED - backend supports more than one local address space
  MEMORY_SHARED      - backend supports multiple threads within a single
                       address space
  HAVE_MONITORS      - backend implements monitors
  WIDE_POINTERS      - backend uses wide pointers (address + boxnum, procnum)
                       to implement Titanium pointers
  TIC_BACKEND_NAME   - character string name of the backend

The rest of the interface described below must be fully declared by including
backend.h.
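For illustration, a backend-defines.h for a hypothetical shared-memory
(smp-style) backend might look like the following; the backend name and the
particular choice of macros are invented for the example:

  /* backend-defines.h -- hypothetical smp-style settings (illustrative
     only; consult an existing backend for the authoritative choices)   */
  #define MEMORY_SHARED        /* several threads share one address space */
  /* MEMORY_DISTRIBUTED undefined: only a single local address space      */
  #define HAVE_MONITORS        /* this backend supplies monitor support   */
  /* WIDE_POINTERS undefined: titanium pointers are plain local pointers  */
  #define TIC_BACKEND_NAME "example-smp"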
*** Parallel job description globals ***

For the purposes of this document, the term "thread" refers to a thread of
execution corresponding to a Titanium language-level thread (this is also
often referred to in the source and macro names as a "proc", even though it
doesn't necessarily correspond to a physical processor or UNIX process). The
term "box" refers to a single local memory space which may be shared by
several threads.

The system must define each of the following quantities that describe the
layout of the parallel job. They may be implemented using constants, global
variables, macros, function calls or whatever is appropriate. Care should be
given to making them as efficient as possible, as they are called frequently
within the runtime system.

  MYPROC     - global 0-based thread number
  MYBOX      - global 0-based box number
  PROCS      - global number of threads
  BOXES      - global number of boxes (shared memory spaces)
  MYBOXPROCS - number of threads on this box
  MYBOXPROC  - 0-based index of this thread on this box

The following relations must hold over these values for all threads in the
system:

  !MEMORY_SHARED && !MEMORY_DISTRIBUTED =>
      PROCS == BOXES == MYBOXPROCS == 1
      MYPROC == MYBOX == MYBOXPROC == 0

  MEMORY_SHARED && !MEMORY_DISTRIBUTED =>
      PROCS >= 1
      0 <= MYPROC < PROCS
      MYBOXPROCS == PROCS
      MYBOXPROC == MYPROC
      MYBOX == 0
      BOXES == 1

  !MEMORY_SHARED && MEMORY_DISTRIBUTED =>
      PROCS >= 1
      BOXES >= 1
      0 <= MYPROC < PROCS
      MYBOX == MYPROC
      MYBOXPROCS == 1
      MYBOXPROC == 0

  MEMORY_SHARED && MEMORY_DISTRIBUTED =>
      PROCS >= 1
      BOXES >= 1
      0 <= MYPROC < PROCS
      0 <= MYBOX < BOXES
      0 <= MYBOXPROC < MYBOXPROCS
      threads are numbered contiguously within each box

*** Global memory reads & writes ***

Most of the global memory macros include one of the following prefixes, which
indicates the Titanium type of the data being transferred in the read/write.
The specific implementation of each type is platform-dependent and defined in
runtime/primitives.h:

  jboolean, jbyte, jchar, jdouble,
  jfloat, jint, jlong, jshort - standard Java types
  lp   - a local pointer
  gp   - a global pointer
  bulk - C structs, most likely corresponding to a Titanium immutable
  void - ??? (never used?)

In each of the following, ptr is a global pointer to a value of the
appropriate type:

  ASSIGN_GLOBAL_{prefix}(ptr, val)
    - write val into the location pointed to by ptr
  ASSIGN_GLOBAL_anonymous_bulk(ptr, count, local)
    - copy from the given local pointer to the location pointed to by ptr;
      the number of bytes copied will be sizeof(*ptr)*count
  DEREF_GLOBAL_{prefix}(val, ptr)
    - read from the location pointed to by ptr into val
  DEREF_GLOBAL_anonymous_bulk(local, count, ptr)
    - copy from the location pointed to by ptr to the location pointed to by
      the given local pointer; the number of bytes copied will be
      sizeof(*ptr)*count

  WEAK_ASSIGN_GLOBAL_{prefix}(ptr, val)
  WEAK_ASSIGN_GLOBAL_anonymous_bulk(ptr, length, val)
    - same semantics as the corresponding non-weak variants, but the
      destination memory has undefined content until ti_write_sync() is
      called
  WEAK_DEREF_GLOBAL_{prefix}(val, ptr)
  WEAK_DEREF_GLOBAL_anonymous_bulk(val, length, ptr)
    - same semantics as the corresponding non-weak variants, but the
      destination memory has undefined content until ti_read_sync() is called

  ti_bulk_read(d_addr, s_proc, s_addr, size)
    - read size bytes of data from global pointer (s_proc, s_addr) into the
      area indicated by local pointer d_addr
  ti_bulk_write(d_proc, d_addr, s_addr, size)
    - write size bytes of data from the local pointer s_addr into the area
      indicated by global pointer (d_proc, d_addr)
  ti_bulk_write_weak(d_proc, d_addr, s_addr, size)
    - same semantics as ti_bulk_write, but the destination memory has
      undefined content until ti_write_sync() is called

  ti_write_sync() - blocks until all prior weak writes have completed
  ti_read_sync()  - blocks until all prior weak reads have completed
  ti_sync()       - executes ti_write_sync() and ti_read_sync()

  char *getenvMaster(const char *)
    - given a key name, returns the corresponding value from the console
      environment
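As a usage illustration, the following sketch shows how generated code might
drive these operations (the names p, src, d_proc, d_addr and n are invented
for the example, and the representation of the global pointer p depends on
WIDE_POINTERS):

  jint tmp;
  DEREF_GLOBAL_jint(tmp, p);         /* blocking read:  tmp = *p      */
  ASSIGN_GLOBAL_jint(p, tmp + 1);    /* blocking write: *p = tmp + 1  */

  /* split-phase: issue several weak writes, then sync exactly once */
  WEAK_ASSIGN_GLOBAL_jint(p, 0);
  ti_bulk_write_weak(d_proc, d_addr, src, n);
  ti_write_sync();  /* destination contents are defined only from here on */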
*** Broadcast ***

Broadcasts are generated as calls to the following sequence of macros:

  BROADCAST_BEGIN(type, sender)
  BROADCAST_{prefix}(result, type, sender, value)
  BROADCAST_END(result, type, sender)

where:

  value  is the value being broadcast on sender, and unspecified elsewhere
  type   is the exact type of the value being broadcast (e.g. for gps, this
         will be a precise gp type)
  prefix is one of the type prefixes defined above corresponding to the type
  sender is the thread index of the broadcasting thread
  result should be assigned the result of the broadcast on all threads

*** Synchronization ***

  barrier() - routine that executes a barrier operation across all threads

The following must be defined for any backend that defines HAVE_MONITORS:

  tic_monitor_t
    - a C type corresponding to the backend's representation of Titanium
      monitors, which are embedded directly in the object representation.
      Most monitor operations are defined in terms of global pointers
      (jGPointer's) to this opaque data structure.

  monitor_init(tic_monitor_t *m)    - initialize the monitor data structure
  monitor_destroy(tic_monitor_t *m) - clean up the monitor data structure
                                      (not guaranteed to always be called)
  MONITOR_LOCK(jGPointer monitor)
  MONITOR_UNLOCK(jGPointer monitor)
  monitor_wait(jGPointer monitor, ___tic_time_t *tm);
  monitor_notify(jGPointer monitor);
  monitor_notify_all(jGPointer monitor);

Lock and unlock have the usual semantics; wait, notify and notifyAll are
described by the Java specification for the corresponding Object methods
(e.g. Object.wait()). Note that recursive locking must be supported, and that
wait/wakeup operations release/reacquire all recursively-held instances of
the lock.
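To make the monitor requirements concrete (in particular, that wait must
release every recursive acquisition), here is a minimal sketch of one possible
pthread-based representation. It is illustrative only: it omits error
checking, the timed-wait path, illegal-state checks, and the jGPointer
indirection, and every name besides tic_monitor_t is invented:

  #include <pthread.h>

  typedef struct {
    pthread_mutex_t guard;    /* protects the fields below           */
    pthread_cond_t  entry;    /* threads blocked acquiring the lock  */
    pthread_cond_t  waitset;  /* threads blocked in a wait           */
    pthread_t       owner;    /* valid only when count > 0           */
    int             count;    /* recursion depth; 0 means unlocked   */
  } tic_monitor_t;

  void example_lock(tic_monitor_t *m) {
    pthread_mutex_lock(&m->guard);
    if (m->count > 0 && pthread_equal(m->owner, pthread_self())) {
      m->count++;                          /* recursive acquisition  */
    } else {
      while (m->count > 0)
        pthread_cond_wait(&m->entry, &m->guard);
      m->owner = pthread_self();
      m->count = 1;
    }
    pthread_mutex_unlock(&m->guard);
  }

  void example_unlock(tic_monitor_t *m) {
    pthread_mutex_lock(&m->guard);
    if (--m->count == 0)
      pthread_cond_signal(&m->entry);      /* wake one lock waiter   */
    pthread_mutex_unlock(&m->guard);
  }

  void example_wait(tic_monitor_t *m) {    /* untimed wait only */
    int depth;
    pthread_mutex_lock(&m->guard);
    depth = m->count;            /* release ALL recursive instances  */
    m->count = 0;
    pthread_cond_signal(&m->entry);
    pthread_cond_wait(&m->waitset, &m->guard); /* sleep until notified */
    while (m->count > 0)                       /* then reacquire fully */
      pthread_cond_wait(&m->entry, &m->guard);
    m->owner = pthread_self();
    m->count = depth;
    pthread_mutex_unlock(&m->guard);
  }

  void example_notify(tic_monitor_t *m) {
    pthread_mutex_lock(&m->guard);
    pthread_cond_signal(&m->waitset);      /* use broadcast for notifyAll */
    pthread_mutex_unlock(&m->guard);
  }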
*** Bootstrapping and general control flow ***

  void free_resources(void)
    - perform any cleanup activities appropriate to shutting down a parallel
      job

  int main(int argc, char **argv)
    - the backend is responsible for defining the runtime system entry point.
      At the very least, the control flow should include the following:

        /* init all global constants */
        /* call initialization functions for various subsystems
           (e.g. region_init()) - see current code for specifics */
        barrier();
        ti_main(argc, argv); /* the call to the generated titanium entry
                                point (which eventually calls the user's
                                main function) */
        barrier();
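Putting this together, a skeleton main() for a hypothetical single-process
backend might look as follows; the exact set of subsystem init calls and the
trailing free_resources() call are assumptions here (consult an existing
backend such as sequential for the authoritative sequence):

  int main(int argc, char **argv) {
    /* init all global constants */
    region_init();          /* ...and the other runtime subsystems   */
    barrier();              /* trivially a no-op when PROCS == 1     */
    ti_main(argc, argv);    /* generated Titanium entry point        */
    barrier();
    free_resources();       /* assumed final cleanup (see above)     */
    return 0;
  }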