GASNet collectives design notes
===============================
* $Revision: 1.7.1.4 $
* Copyright 2004, Dan Bonachea and Paul Hargrove
* Terms of use are as specified in license.txt

//===========================================================================
* Goals of the design:

** 1) Must be asynchronous, with distinct initiation and finalization calls.
** 2) Must be "suitable" for implementing the UPC collectives. Specifically,
      we need to be able to enforce at least the minimum input/output
      constraints of the UPC sync_mode.
** 3) Must be suitable for use on all (reasonable) GASNet platforms, including
      those with unaligned segments, progress via polling only, etc.
** 4) Must be suitable for use in hierarchical implementations on CLUMPS
      (i.e. pthreaded clients and SysV shared memory bypass inside GASNet when
      it's enabled)
** 5) Must be "efficient" in terms of synchronization - for instance avoid use
      of a full barrier where point-to-point synchronization is faster and
      still sufficient for correctness. Of course, this goal will be achieved
      differently on each platform and the reference version need not be
      infinitely tunable.
** 6) Must admit optimizations using likely hardware features where
      appropriate (barriers, broadcast, RDMA-atomics).
** 7) Must NOT break any existing GASNet semantics.
** 8) MAY assume that calls which initiate collective calls are, in fact,
      collective with calls made in the same order on all participating nodes,
      in the same barrier phase. Synchronization calls may happen in different
      orders on different nodes, but collectives will never be in-flight
      across a barrier notify/wait interval.
** 9) Must allow for address (and related) arguments which are not
      "single-valued" in the sense used by the UPC spec. For instance, when
      implementing the Titanium exchange operation we have an operation which
      is similar to a upc_all_gather_all, except that each node knows only its
      own source and destination addresses, not the addresses on the other
      nodes. It is reasonable to expect that GASNet can move the addresses
      around more efficiently than a network-independent GASNet client.
      (With AMs, GASNet might even move the data w/o ever moving the
      addresses.)
** 10) MAY perform "extra" communication and/or computation where there is
      some net savings (examples include implementing upc_all_scatter() with
      a broadcast and then discarding some of the data at each node, or use
      of full barriers where hardware support makes them superior to pairwise
      synchronization.)
*** 11) Must be suitable for implementing the collective support required by
      Titanium (which may include future extensions to utilize the new GASNet
      support) and Co-Array FORTRAN. Suitability for implementing the MPI
      collectives is also valuable, but not crucial.
*** 12) Should provide support for aggregating the message and synchronization
      traffic of separate in-flight collectives, both within the GASNet
      implementation and with explicit help from the client. For example, the
      compiler should be able to transform several source-level
      IN_ALLSYNC|OUT_ALLSYNC collectives issued in a row which it determines
      to be data-independent into a series of GASNet calls expressing that
      all the collectives should be worked on simultaneously - ideally causing
      the implementation to piggy-back any data movement together, and use a
      single entry and exit barrier for the entire group of collectives.
*** 13) Should support operating on in-segment and out-of-segment data
*** 14) Should allow extension to variable-contribution data movement
      collectives in the future.
*** 15) Should allow extension to team-based collectives in the future.

//===========================================================================
* Collectives and barriers

At present it is not legal for GASNet collectives and barriers to be
outstanding simultaneously. This means that collectives may not be initiated
between gasnet_barrier_notify() and the following _try() or _wait(), and
collectives initiated before a gasnet_barrier_notify() call must also be
synced before it. We may relax this constraint in a future specification.

//===========================================================================
* Collectives and handlers

GASNet collectives may not be initiated or synchronized from within AM
handlers. This constraint will NOT be relaxed.

//===========================================================================
* Handles

All GASNet collective initiation functions will have blocking and
explicit-handle non-blocking versions. The blocking versions exist because in
some cases hardware support (such as the blocking hardware barrier in elan)
can implement them more efficiently than the trivial one-liner. We omit
implicit-handle operations because we anticipate collectives will be used
primarily by library authors with the number of outstanding operations being
small and known at compile time, not by code generators where the number of
operations may be potentially large and unknown at compile time.

For many reasons we desire to have a handle type which is distinct from the
gasnet_handle_t used for the Extended API. For the moment I'm assuming a
gasnet_coll_* prefix for functions. So we introduce "gasnet_coll_handle_t" and
"GASNET_COLL_INVALID_HANDLE" to the GASNet specification. Meanings are mostly
analogous to the existing explicit handle, including:
  + A handle has per-thread scope and must be synced by the same thread which
    initiated it.
  + A non-blocking collective may return GASNET_COLL_INVALID_HANDLE to
    indicate the operation was completed synchronously.
  + The invalid handle is all zero bits.
  + A handle is "dead" once it has been successfully synced and may not be
    used in future synchronization operations.

//===========================================================================
* Synchronization modes

In order to efficiently implement the UPC collectives, the GASNet collectives
will include the same nine sync modes that are described in the UPC
collectives specification (though names and binary representations need not
follow UPC). To keep the synchronization calls lightweight, the
synchronization mode will be passed to the initiation function, but not to the
synchronization function.

Here, using the UPC naming, are the three input sync modes given in terms of
GASNet semantics:
  + IN_NOSYNC: All data movement may begin on any GASNet node as soon as any
    node has entered the collective initiation function. This is sufficient
    synchronization when, for instance, no input data has been modified in
    the current barrier phase.
  + IN_MYSYNC: Each block of data movement may begin as soon as both its
    source and destination nodes have entered the collective initiation
    function. This is sufficient synchronization when, for instance, all
    modification to the input data in the current barrier phase (if any) was
    done locally.
  + IN_ALLSYNC: Data movement may begin only when all nodes have entered the
    collective initiation function. This is potentially the most costly
    synchronization, and is necessary when remote modification of input data
    (via a put) is possible in the current barrier phase.

Here are the three output sync modes, with the corresponding GASNet semantics:
  + OUT_NOSYNC: The sync of a collective handle can succeed at any time so
    long as the last thread to sync does not do so until all data movement
    has completed. This is sufficient synchronization when no output data
    will be consumed in the current barrier phase.
  + OUT_MYSYNC: The sync of a collective handle can succeed as soon as all
    data movement has completed to and from local data areas specified in the
    call (note that movement to or from buffers internal to the
    implementation and local to the node might still be taking place, for
    instance in tree-based broadcasts). This is sufficient when, for
    instance, output data will be consumed only locally (if at all) in the
    current barrier phase.
  + OUT_ALLSYNC: The sync of a collective handle can succeed as soon as all
    data movement has completed to and from all data areas specified in the
    call. This is weaker than a full barrier, since the equivalent of a
    zero-byte broadcast is sufficient. This synchronization mode is
    potentially the most costly, and is necessary when output data may be
    consumed remotely (via a get) in the current barrier phase.

In the presence of partial information (see "Single-valued" arguments, below),
"all data areas specified in the call" in the descriptions above means the
union of the information available from all nodes.

For the output syncs, I think the following names are more appropriate to
GASNet than the UPC names (where the prefix is TBD):
  + UPC_OUT_ALLSYNC -> *_ALL    implies all buffers are safe for reuse.
  + UPC_OUT_MYSYNC  -> *_LOCAL  implies local buffers are reusable
  + UPC_OUT_NOSYNC  -> *_NONE   implies no specific buffers are reusable,
                                until the last thread enters the next
                                collective (barrier included)

//===========================================================================
* Bulk vs. non-bulk data reuse

Consistent with the VIS (Vector, Indexed and Strided) interface, the
collectives are specified only for "bulk" data lifetimes. This means that if
the client reads or writes source or destination memory areas between the
initiation and synchronization of that memory, the results are undefined. As
noted in "Synchronization Modes", the point at which the memory (as opposed to
the collective operation) is considered synchronized depends on the
synchronization mode.
  + _OUT_ALLSYNC: memory on all participating nodes is synchronized when the
    collective operation is synchronized on any node.
  + _OUT_MYSYNC:  memory on each participating node is synchronized when the
    collective operation is synchronized on the same node.
  + _OUT_NOSYNC:  memory is synchronized when the collective operation is
    synchronized on all participating nodes.

//===========================================================================
* Bulk vs. non-bulk alignment

The GASNet collectives for data movement do not assume any data alignment
(consistent with UPC and with the GASNet-VIS interface). The computational
collectives (reduce, prefix_reduce and sort) require alignment correct for
the operation's data type.

//===========================================================================
* Synchronization calls

We'll support the same family of explicit-handle try and wait calls as for
the gets and puts.
See the section "Synchronization Modes" for a description of what success of
a sync call implies.

  void gasnet_coll_wait_sync(gasnet_coll_handle_t handle);
  int  gasnet_coll_try_sync(gasnet_coll_handle_t handle);

  void gasnet_coll_wait_sync_all(gasnet_coll_handle_t *handles, size_t n);
  int  gasnet_coll_try_sync_all(gasnet_coll_handle_t *handles, size_t n);

  void gasnet_coll_wait_sync_some(gasnet_coll_handle_t *handles, size_t n);
  int  gasnet_coll_try_sync_some(gasnet_coll_handle_t *handles, size_t n);

Synchronization of a GASNet collective is NOT itself a collective operation,
and the returns from the synchronization calls on individual nodes may occur
as early as permitted by the sync mode argument passed to the collective
initiation function. Specifically, synchronization calls need not occur in
the same order on all nodes.

//===========================================================================
* Metadata lifetime

Where arguments are passed by reference to the collective initiation
functions, the argument values must remain unchanged between the initiation
and synchronization of the collective operation on the node owning the
referenced memory. If this metadata lies in the GASNet segment, clients
should take care to prevent remote modification of the metadata, especially
since a remote node may sync before the local node has finished with the
metadata.

//===========================================================================
* Progress

We expect the non-blocking collectives to make progress in gasnet_AMPoll() or
the sync call as the worst case. A separate section addresses how this will
interface with the conduit implementation to make progress in gasnet_AMPoll()
or asynchronously when possible.

//===========================================================================
* Collectives and threads

Collectives must be initiated in the same order on all participating nodes.
To execute a collective operation a collective initiation function must be
called on all nodes. Depending on flag values, this call may be made either
by all threads on each node (when GASNET_COLL_ALL_THREADS appears in the
flags argument), or exactly once on each node, by any one representative
thread (when GASNET_COLL_ALL_THREADS is absent from the flags). When
GASNET_COLL_ALL_THREADS is included in the flags, all threads must sync their
respective handles.

To disambiguate the order in which collectives are initiated by
representative threads on multi-threaded clients, the client must serialize
calls to the collective initiation functions which lack
GASNET_COLL_ALL_THREADS in the flags, to ensure that such calls are never
concurrent with any other collective initiation calls on the same node
(regardless of the presence or absence of GASNET_COLL_ALL_THREADS in the
flags).

At present, all nodes must agree on the use of GASNET_COLL_ALL_THREADS in the
flags. However, this requirement may be relaxed in a future specification.

The type gasnet_coll_handle_t specifies a handle which may be synchronized
only by the thread to which it was returned when initiating the collective
(i.e. the gasnet_coll_handle_t handle is thread-specific). This imposes a
non-trivial synchronization requirement on a multi-threaded client when not
using GASNET_COLL_ALL_THREADS.
Specifically, the thread (A) elected to be the local representative for the
purpose of initiating a given non-blocking collective must make independent
progress towards syncing that collective, and must not be blocked waiting for
some action by another local thread (B) which is contingent on first
completing the collective with a try-sync, as this dependency cycle would
lead to deadlock. This requirement is user-visible and will need to be
exposed in the non-blocking extensions to the UPC collective library.

Handles are also node-specific - each node gets a separate handle_t for the
operation, which it independently synchronizes in any order with respect to
other outstanding collectives. Handles may not be passed across nodes.

//===========================================================================
* Data segment

For the present we want to implement only cases where the source and
destination of the data movement collectives are in the GASNet segment.
However, we also wish to allow a path for extension. So, we specify the
following flags:
  + *_DST_IN_SEGMENT -> If set, this bit is an assertion by the client that
    the destination address argument(s) to this collective operation lie in
    the GASNet segment.
  + *_SRC_IN_SEGMENT -> If set, this bit is an assertion by the client that
    the source address argument(s) to this collective operation lie in the
    GASNet segment.
It is an error to set either of these flag bits when any portion of the
corresponding memory lies outside the GASNet segment.

The current specification *REQUIRES* that these bits are *BOTH* set, and thus
currently only supports collective operations on data within the GASNet
segment. This restriction should be relaxed in some future revision.

//===========================================================================
* "Single-valued" arguments

To implement things like Titanium's Exchange operation, we want a mechanism
to deal with the case that not all nodes know all addresses. One approach
would be to require the client to move addresses around to construct a call
to the GASNet collectives which is "single-valued" in the sense used in the
specification of the UPC collectives. However, there exist a number of
implementation choices which would allow the GASNet conduit to perform
collectives with only partial information on each node, and to do so more
efficiently than having the client perform a gather_all just to collect the
single-valued arguments. We also want to be able to use the fact that a
client implementing the UPC collectives *will* provide us with single-valued
arguments.

The GASNet collectives are *NOT* limited to being called with single-valued
arguments. We will have flag values indicating if address arguments to the
collective initiation function are given in the UPC or Titanium style.
  + *_LOCAL  -> The arguments provided by the local initiation call include
    correct local addresses, but not correct remote addresses.
  + *_SINGLE -> The arguments provided by the initiation calls on all nodes
    include correct local and remote addresses, and they agree.
In the future we may wish to add *_GLOBAL, which would mix freely with other
nodes specifying *_LOCAL. This would allow some nodes to know all addresses
while others knew only local ones. We have no consumers for this at present,
and thus don't specify it yet.

If any node calls with the _SINGLE flags, then all nodes must do so, and all
addresses must agree across all nodes.
The 'nbytes', 'root' (for broadcast, scatter and gather) and synchronization
mode must agree across all nodes regardless of the _LOCAL/_SINGLE flag.

//===========================================================================
* Number outstanding

There exists some concern that implementing synchronization using puts rather
than AMs may require bounding the number of collective ops which are
outstanding on the network and/or forcing ops to be completed (inside the
conduit) in the order issued.

It has been determined that none of this should be visible to the client: the
interface will expose a limit of no less than 2^16-1. The implementation may
internally use back-pressure flow control at a smaller limit, but initiation
calls must always independently make progress without any synchronization
action by the client on any node.

Sync calls must be non-collective to support a number of important paradigms
(e.g. spawn many MYSYNC collectives and compute on the result as try-sync
reports them to be complete). It's not even clear what it would mean to
require a collective try-sync on a MYSYNC or NOSYNC operation (where some
nodes succeed and others fail). However, internally completing the
collectives in order is acceptable when it's necessary. That still provides
significantly more flexibility to the client than forcing the syncs to be
collective.

//===========================================================================
* Multiple Addresses

Supporting a non-pthreaded UPC client with aligned segments requires the
simplest form of addressing. Because the segments are aligned and shared
arrays are aligned within their segments, a single address describes the data
placement on all participating nodes. However, with unaligned segments one
needs at least one address per node. To perform direct data placement with
multiple threads per node one needs multiple addresses, with the alignment of
segments determining whether you need one address for every application
thread or only as many addresses as there are threads per node. For Titanium
clients we have the added wrinkle that a given node has no knowledge of the
addresses on the others.

Here is a summary of the five distinct cases that have been identified:
  + UPC or CAF w/ aligned segments, 1 thread/node (or multi-threaded w/
    hierarchical data movement)
  + UPC or CAF w/ unaligned segments, 1 thread/node (or multi-threaded w/
    hierarchical data movement)
  + UPC or CAF multi-threaded w/ direct data placement
  + Titanium, 1 thread/node (or multi-threaded w/ hierarchical data movement)
  + Ti multi-threaded w/ direct data placement

The current proposal allows for all of these cases by offering two types of
interface - one taking scalar address arguments and one taking lists of
addresses.

//===========================================================================
* Teams:

Note that the major issues with adding teams to UPC are related to pointer
arithmetic and memory allocation, neither of which affect GASNet. Therefore,
the GASNet collectives are designed to be team-ready immediately (even if we
delay implementation until we have a client). Therefore, all collectives take
a team handle as the first argument. GASNET_TEAM_ALL is predefined and is
analogous to MPI_COMM_WORLD. By looking for GASNET_TEAM_ALL we can identify
non-team collectives at compile time and avoid paying any runtime cost for
the team generality when it is not needed.

To avoid deadlock when using team collectives, a client must follow certain
ordering constraints.
Informally, we require that for any pair of collectives, all images which
participate in both must agree on their ordering. Formally this may be
expressed as follows:

  For any collective X let Participants(X) denote the set of images which are
  members of the corresponding team. For non-team collectives, Participants()
  is the set of all images across all the GASNet nodes. For image I, and
  collectives X and Y, let Precedes_I(X,Y) be a predicate true when X != Y
  and collective X is initiated before collective Y on the node hosting
  image I.

  For all collectives X and Y:
    For all images I and J in Intersection(Participants(X), Participants(Y)):
      Precedes_I(X,Y) = Precedes_J(X,Y).

  Note there is no ambiguity when X==Y or I==J.

An interface for the creation of teams is still an open issue. We also need
to think about what thread serialization we require for calling collective
initiation functions when images on a node participate in teams that do not
encompass all the local images. Specifically, requiring client-level
serialization for two team-collective initiation calls where the sets of
locally participating images are disjoint can easily lead to deadlock, even
for the simple case of blocking ALL/ALL collectives (because the collective
ordering property does not apply across disjoint teams, so different nodes
may invert the call ordering between independent teams).

//===========================================================================
END OF NORMATIVE STUFF (MOSTLY AGREED TO)
START OF OPEN ISSUES
//===========================================================================

* OPEN ISSUE: Variable contributions

What about calls where each node may contribute a different sized source to a
gather or gather_all, or a different sized dest to a scatter (with a proper
constraint on the sums)? This would be a natural(?) "v" extension to the
current two-flavor proposal.

* OPEN ISSUE: Progress

In-flight collectives should be guaranteed to make progress, provided all
nodes poll occasionally. Allowing them to make progress independent of
polling activity would be nice, but is not a firm requirement.

Non-blocking initiation calls are collective, so in some implementations
(esp. with IN_ALLSYNC) they may stall waiting for all nodes to arrive. This
specifically means that it would be an error for a GASNet client to construct
a dependency cycle where node A stalls before reaching the collective
initiation, waiting for the result of some action that will be taken by node
B when it exits the collective initiation. We need to formalize this
requirement in the spec.

* OPEN ISSUE: Co-Array FORTRAN (CAF)

The CAF "spec" does not provide any library or intrinsic collective
interfaces. However, the FORTRAN 95 array syntax is available and MIGHT
express the collective operations (minus synchronization) in a manner that
the compiler could easily translate into gasnet_coll_*() calls.
Unfortunately, the lack of a defined interface encourages calls like Exchange
to be hand-coded with staggered indexes to avoid communication bottlenecks.
Such hand-optimized collectives are NOT good candidates for automatic
translation by the compiler. This fact is noted in an October 2003 paper from
the group at Rice, which also states that they are working on a specification
for CAF intrinsics for collective communication.

CAF permits team syncs and team collectives. We need to look at that when
trying to extend GASNet for teams.

* OPEN ISSUE: Computational collectives (reduce, prefix reduce and sort)
** Data lifetime and alignment

Like the other collectives we want bulk lifetimes for both data and metadata.
Unlike the other collective operations, we require that the data be properly
aligned for its type. At present the GASNet spec says that AM medium payload
will have an alignment of 8 bytes or higher. However, we don't know if some
hypothetical system could require a higher alignment for some data type of
interest (128-bit types?). Therefore the library implementation should be
parameterized by a constant value GASNET_DATA_ALIGNMENT which gives the
alignment enforced by the computational collectives. This value will default
to 8 (the AM Medium payload alignment), but at configuration time this could
be increased on a given platform having stronger requirements.

** Client-provided (prefix) reduction function

GASNet computational collectives use an untyped interface in which the client
provides the number and size of elements, but not a data type. GASNet has no
built-in reductions for particular data types or operations, nor built-in
sorts.

GASNet does not assume a single-image environment and thus the address of the
same function may differ across nodes. Therefore the GASNet implementation of
the computational collectives is constrained to only invoke a function
pointer on the same node which provided it. Functions are referenced by a
handle type to hide this issue.

Like AM handlers, GASNet does not ensure that the client-provided function is
executed in the context of any particular thread, and may legally execute it
in a thread internal to the implementation. GASNet also does not provide any
serialization of these function invocations. However, the client does have
control over whether these functions can be called from handler context.
There is currently nothing in the UPC spec that would prohibit the
app-provided function from accessing remote memory, or UPC thread-specific
values/locations such as MYTHREAD or global-scope TLD. Therefore the library
does not evaluate callback functions from AM handler context or a
conduit-internal thread unless we have assurance from the client that it is
safe to do so. For this purpose, we include the GASNET_COLL_AMSAFE flag,
which may be included in the 'flags' argument when registering a function
(see below). Inclusion of this flag is a hint to the GASNet implementation
which permits it to evaluate the client-provided function (reduction,
prefix-reduction, or comparison) in an AM context and/or using
conduit-internal threads, which the implementation must never do without
this flag. More specifically, all callbacks registered without the
GASNET_COLL_AMSAFE flag are guaranteed to only run synchronously on an
application thread in a context where it is legal to invoke any
(non-collective) GASNet functions.

In the case of blocked and block-cyclic layouts one may potentially benefit
from an interface which can pass an entire block. This interface has the
advantage of placing the function call outside the loops. This means that we
should be able to get compiler optimization of the inner reduction loops.
This includes vectorization on some platforms, and things like loop
unrolling, software pipelining and SIMD floating point on others.
It is explicitly undefined how many times the callback functions will be
invoked on behalf of a given collective or on which nodes they will be
invoked - specifically, the callback may be invoked to calculate partial or
redundant results, resulting in the reduction operator globally being
evaluated a total number of times which exceeds the count one might expect
from a naive translation of the reduction operation (i.e. more than
total_num_elements - 1 evaluations). It is additionally undefined which data
locations may be passed as input and output arguments to the reduction
function - they may include a mix of references to the client data areas and
intermediate buffers which may or may not reside in the GASNet segment.

We keep a single signature for both commutative and non-commutative operators
and for reduction and prefix-reduction, so that we can pass the function
pointers through the same call. Note that a prefix reduction function, when
called for exactly one output argument, is semantically identical to a
reduction. This provides important freedom to the GASNet implementation of a
prefix reduction, which can freely perform reductions of intermediate results
while only requiring the client to provide a single prefix-reduce function,
and not a distinct function for reduction only.

We can easily allow for full generality of the implementation by making no
assumptions about the number of partial reductions which might be present at
any stage of the computation. We may have zero or more operands which do not
have corresponding outputs and will always have one or more operands which do
have corresponding outputs. In the case of a non-commutative prefix
reduction, we know that the operands which have corresponding outputs will
always lie to the right of those without. In the case of a reduction we have
exactly one output, but need not associate it with any particular input.
Thus we require all client callbacks to support an arbitrary number of
left-hand and a non-zero number of right-hand operands. This leads to a
simple interface with two pointers to vectors of operands and two scalars
giving the vector lengths. A third vector of outputs is also required, and
will have the same length as the right-hand operand vector. This is the
natural choice for a prefix-reduction since there are never "optional"
right-hand operands.

To allow client freedom in implementing (prefix) reduction functions, we pass
the size of the data elements, the commutativity and a client-provided
integer argument from the initiation function through to the invocation of
the client-provided function.

Following the ideas above leads to this prototype:

  /* For flags to the reduce_fn: */
  #define GASNET_COLL_NONCOMM ???

  typedef void (gasnet_coll_reduce_fn)(void *results, size_t result_count,
                                       const void *left_operands,
                                       size_t left_count,
                                       const void *right_operands,
                                       size_t elem_size, int flags, int arg);

  results: Output array
      This argument is the address of the space into which the result(s)
      should be written.
  result_count: Result count
      The number of results to be generated. Since there is a one-to-one
      correspondence between results and right-hand operands, this is also
      the length of the 'right_operands' array. The client may safely assume
      that a callback invoked on behalf of a (non-prefix) reduce function
      will always have result_count == 1. Otherwise, it will be >= 1.
  left_operands: Optional left-hand operand array
      This argument is the address of 0 or more operands which lie to the
      left of those given by the 'right_operands' argument. The number of
      operands in this array is given by the 'left_count' argument. If
      'left_count' is zero then this argument is ignored.
  left_count: Left-hand operand count
      The number of operands in the 'left_operands' array. The client may
      safely assume that a callback invoked on behalf of a (non-prefix)
      reduce function will always have left_count > 0. No other guarantees
      are provided.
  right_operands: Right-hand operand array
      This argument is the address of 1 or more operands which lie to the
      right of those given by the 'left_operands' argument. The number of
      operands in this array is given by the 'result_count' argument.
  elem_size: Size of the operand and result elements
      This argument gives the size of the data elements which are to be
      manipulated.
  flags: Flag values
      This argument gives information about the operation and is composed of
      a bitwise OR of values. The value of 'flags' will include
      GASNET_COLL_NONCOMM if and only if the client passed this same flag to
      the gasnet_coll_init() function when registering this function. The
      remaining bits of 'flags' are reserved for future use and the client
      must not assume they will have any particular value.
  arg: Client-supplied argument
      This is the client-supplied integer argument passed into the collective
      initiation function.

Here is an example for a reduction (summation of integers), which takes
(slight) advantage of commutativity to start with the right-hand operands,
knowing that they must be non-empty:

  void summation_example(void *results, size_t result_count,
                         const void *left_operands, size_t left_count,
                         const void *right_operands,
                         size_t elem_size, int flags, int arg) {
    const int *left  = (const int *)left_operands;
    const int *right = (const int *)right_operands;
    int sum;
    size_t i;

    sum = right[0];
    for (i = 1; i < result_count; ++i)
      sum += right[i];
    for (i = 0; i < left_count; ++i)
      sum += left[i];
    *(int *)results = sum;
  }

Here is an example where we call a non-commutative function on doubles to
perform a prefix reduction. Note that the application-provided function is
passed "over" GASNet in the 'arg', which must be an index into a table,
rather than an actual pointer.

  /* client-maintained table of application-level functions, indexed by 'arg' */
  extern double (*my_fn_tbl[])(double, double);

  void noncomm_prefix_example(void *results, size_t result_count,
                              const void *left_operands, size_t left_count,
                              const void *right_operands,
                              size_t elem_size, int flags, int arg) {
    double (*func)(double, double) = my_fn_tbl[arg];
    double *output = (double *)results;
    const double *left  = (const double *)left_operands;
    const double *right = (const double *)right_operands;
    double tmp;
    size_t i;

    /* The computation of the first result will depend on the presence
     * or absence of partial reductions from "left neighbors". */
    if (left_count > 0) {
      /* Accumulate one or more partial reductions */
      tmp = left[0];
      for (i = 1; i < left_count; ++i) {
        tmp = (*func)(tmp, left[i]);
      }
      tmp = (*func)(tmp, right[0]);
    } else {
      tmp = right[0];
    }

    /* Compute and store all the prefix reductions corresponding to the
     * right-hand operands */
    output[0] = tmp;
    for (i = 1; i < result_count; ++i) {
      tmp = output[i] = (*func)(tmp, right[i]);
    }
  }

From these examples, one can see how this choice of interface can enable
vectorization.

** UPC-imposed complications (data layout)

It is not hard to deal with cyclic, block and block-cyclic layouts of data by
implementing block-cyclic and allowing the other two to fall out as the two
natural extremes.
Using the generality allowed by the multiple-address data-movement
collectives is enough to describe a block layout even for unaligned segments
and multiple heaps per node. The inclusion of a scalar "element size" is
already assumed by the choice of an untyped interface. The addition of one
more scalar, "block size", allows us to express the block-cyclic layout. A
value blk_sz==1 conveys pure-cyclic. The UPC interfaces have blk_sz==0 to
convey indefinite layout and that seems appropriate here as well, but more on
that below.

If UPC required phase==0 affinity=node0 for the source of a reduction (and
dest of a prefix reduction), then indefinite layout would be the only corner
case to deal with. However, the UPC collective spec allows arbitrary
alignment of the input and output arrays. The first element of the input need
not have affinity to thread 0, nor is it required to have a phase of 0. On
top of that, the prefix reduction does not require the input and output
arrays to have the same alignment. This requires at least some level of
generality in the interface. Rather than go to the full generality of
variable contributions to deal with this, we can add just one more scalar.

Let us assume we have an interface which allows expression of a block-cyclic
layout, including an element size, number of elements per node (one scalar
value), block size and the addresses of the contributions on each node. Then
to deal with the non-aligned case we can pass a block-cyclic "bounding box"
plus two additional scalars: the index of the first element to use (as input
or output), and the total number of elements to use. The second of these need
not be a new argument, but can instead replace the element count which
previously described the full "bounding box". The number of elements in the
bounding box can be easily computed if it is ever needed.

So, to express the input array requires the following 5-tuple:
  (elem_sz, elem_count, blk_sz, offset, address(es))
where the addresses have the same freedom of expression as used in the
data-movement collectives. When specifying the output array for a prefix
reduction the elem_sz and elem_count must be the same, so we only need the
other three: blk_sz, offset and the address(es).

Note that UPC specifies the prefix reduction to have the same blk_sz but
allows for a different starting offset (affinity and phase). I propose to
give GASNet the additional generality of different block sizes. Once
indefinite layout is accounted for (below), this will allow
prefix_reduce+gather and scatter+prefix_reduce to become single prefix
reductions. I expect to support this reblocking in the sort as well.

To allow for reblocking, indefinite layout is something we need to express
cleanly. As in UPC, we can use blk_sz=0 to express the fact that a given src
or dst array has indefinite layout. We would then seem to need an additional
argument to indicate to which node the array has affinity. However, the
offset needed for block-cyclic arrays is not needed for its original purpose
in the indefinite case. Therefore, the "offset" would be an image number for
the indefinite case and would remain an element count for the other cases.

** Commutativity/Associativity

In the UPC spec it is stated that all reduction ops are assumed associative
and all are assumed commutative except UPC_NONCOMM_FUNC.
The associativity assumption is consistent with MPI, which urges implementers
to evaluate things in a predetermined order so that multiple runs produce
identical results (with the same node/image layout), even in the presence of
floating-point roundoff. I would reiterate this advice as advice to GASNet
implementers as well.

GASNet provides the GASNET_COLL_NONCOMM flag so the client may indicate that
the client-provided (prefix) reduction function is non-commutative. The
default, when this flag is not passed, is to assume a commutative operator.

** Registration of functions

Because we cannot assume a single-image environment we must deal with the
possibility that a client-provided function may not have the same address on
all nodes. To deal with this issue the function for a computational
collective is passed as an integer-valued handle. Similar to AM handler ids,
these handles provide a mechanism to ensure that all the nodes can
unambiguously refer to the same function on different nodes. Additionally, by
creating these handles before they are referenced, we can safely invoke
functions on nodes that have not yet entered an IN_NOSYNC collective.

To ensure that these handles are instantiated on all nodes prior to their
use, the creation of these handles must be collective. We envision the set of
client-provided functions as those defined by some language or library
specification. Therefore registration is static, performed in the
gasnet_coll_init() call.

These handles are "global". Once registered on a given node, the same handle
can be used by any thread on that node and can be safely passed between
nodes. The same handle table is used for the functions passed to reduction,
prefix reduction and (when implemented) sort collectives. The reduction and
prefix reduction take functions with the same signature, while sort does not.
Use of a function handle in the "wrong type" of collective call has undefined
results.

Some functions are safe to call from an AM handler context while others are
not. Some reduction functions are commutative while others may not be. Both
of these properties are considered to be features of the functions, rather
than the collective calls in which they are used. Therefore there are flags
to the registration of a function. The corresponding flags are
GASNET_COLL_AMSAFE and GASNET_COLL_NONCOMM. The effect of specifying the
NONCOMM flag for a sort function is undefined.

Unlike the registration of AM handlers, we are not permitting the client to
specify independent handle values. The handles are simply the position of
each function in the function table, and are not explicitly returned from the
registration operation.

To implement application-provided functions requires a level of indirection.
Because one cannot typically know how many distinct functions will be
invoked, the GASNet client will need to manage a table. If one imagines
keeping a simple FIFO or LRU cache of the application-provided functions,
then an index into that cache can be used as the single-valued argument to
the GASNet collective. Ensuring that the cache entry is set correctly on all
nodes before first use can be as simple as promoting IN_NOSYNC to IN_MYSYNC
for any collective call which references a newly installed entry. Note that
in this scenario the GASNet client will also need to track in-flight
collectives that use app-provided functions to ensure that a cache entry is
never evicted while in use (unbounded tables would eliminate eviction, but
may still be undesirable).
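As a non-normative illustration of the static registration described at the
start of this subsection, here is a minimal sketch (the table name and the
flag choices are illustrative assumptions) using the gasnet_coll_fn_entry_t
type and gasnet_coll_init() prototype that appear under PROTOTYPES below, and
the two example callbacks defined earlier:

  /* Hypothetical sketch only: statically register the two example callbacks.
   * Handles are simply positions in this table, so summation_example would
   * be referenced as handle 0 and noncomm_prefix_example as handle 1. */
  static gasnet_coll_fn_entry_t coll_fn_tbl[] = {
    { (gasnet_coll_fn_t)summation_example,      0 },                   /* commutative */
    { (gasnet_coll_fn_t)noncomm_prefix_example, GASNET_COLL_NONCOMM }  /* non-commutative */
  };

  void example_coll_startup(void) {
    /* One image per node in this sketch, so 'images' is NULL and 'my_image'
     * is ignored; 'init_flags' must be 0.  Neither entry sets
     * GASNET_COLL_AMSAFE, so both callbacks get the conservative default. */
    gasnet_coll_init(NULL, 0, coll_fn_tbl,
                     sizeof(coll_fn_tbl)/sizeof(coll_fn_tbl[0]), 0);
  }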
Once GASNet and UPC grow real teams, this client-managed table of
application-provided functions will likely need to be managed per-team.

* OPEN ISSUE: Teams

We need a representation for a team. Once constructed, we want a team to be
an immutable object. Thus an opaque handle of some sort is appropriate. We
will need an unambiguous way for the implementation to refer to the teams on
the wire. Therefore we anticipate providing handles which will have the same
bitwise representation across all team members. (Note that this does not
prevent the same bitwise value from simultaneously denoting another team with
a disjoint node membership; a property that allows the id to be generated by
a reduction over the member nodes if desired.) Either way, we need to be
clear about what guarantees (if any) are provided to the client regarding the
team handles - specifically, can they safely be passed across threads and/or
nodes, and can they be compared for equality to test team equality.

We can imagine needing to access a team in at least three different ways.

The simplest is to ask "does image I belong to team T?" With a vector of
image numbers we can perform this test in O(|T|) time, or in O(log(|T|)) if
the vector is sorted. If we have a bitmap representation then we can perform
the is_member() test in O(1) time.

Another is a potential need to iterate over the members of a team. For a team
with |T| << P, we'd like to perform the iteration in O(|T|). The image-vector
implementation allows for an O(|T|) implementation of iteration (total time
for all |T| iterations).

We may also need a mapping between global image number and a "rank" within a
team. The image-vector representation makes the rank->image mapping trivial.
The inverse mapping will require either O(I) additional storage to perform
the mapping in O(1) time, or could be done by binary search within the image
vector in O(log(|T|)) time.

By using an opaque identifier for teams, we can internally keep both the
image-vector representation and a bitmap (and the image->rank mapping array
if desired).

//===========================================================================
END OF OPEN ISSUES
START OF CURRENT PROPOSAL
//===========================================================================

There are two flavors of interface to each collective.

The simple addressing case moves an equal amount of contiguous data to or
from each node, with the exception of the root node in the asymmetric
collectives (broadcast, scatter and gather). This type of interface takes a
single src address and a single dst address. This is suitable for the case of
UPC collectives with aligned segments (and a single UPC thread per node or
hierarchical movement of data for multiple threads), and also suitable for
Titanium-style collectives in which each node may have a distinct source or
destination, but lacks knowledge of the other addresses.

The second flavor is the multiple-address interfaces, having an "m" suffix.
Like the gasnet_{get,put}i() family of interfaces, the multiple-address
interfaces take lists of sources and destinations. This allows for the cases
not covered by the basic interface. If a UPC runtime has unaligned segments,
then the list could specify exactly one address per node with the freedom for
them to differ. If there are multiple UPC or Titanium threads in the address
space of a single GASNet node, then the lists could provide multiple
addresses per node to avoid the need for hierarchical movement of data.
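To make the contrast concrete, here is a rough, non-normative usage sketch of
the two flavors for a broadcast, using the prototypes given under PROTOTYPES
below and the GASNET_COLL_* flag names defined in the following paragraphs
(the wrapper functions and their argument variables are hypothetical):

  /* Flavor 1 - single-address, UPC-style (aligned segments): every caller
   * passes the same single-valued dst/src addresses, valid on all images. */
  void example_bcast_single(void *dst, gasnet_image_t root, void *src,
                            size_t nbytes) {
    gasnet_coll_broadcast(GASNET_TEAM_ALL, dst, root, src, nbytes,
                          GASNET_COLL_SINGLE |
                          GASNET_COLL_IN_ALLSYNC | GASNET_COLL_OUT_ALLSYNC);
  }

  /* Flavor 2 - multiple-address "m" interface, Titanium-style: each node
   * passes only its own local destination list (one entry per local image)
   * and its own src, which is only examined on the root image's node. */
  void example_bcast_local(void * const dstlist[], gasnet_image_t root,
                           void *src, size_t nbytes) {
    gasnet_coll_broadcastm(GASNET_TEAM_ALL, dstlist, root, src, nbytes,
                           GASNET_COLL_LOCAL |
                           GASNET_COLL_IN_MYSYNC | GASNET_COLL_OUT_MYSYNC);
  }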
Note that this interface is designed for a static number of images, such as
one per language-level thread, and not for varying the size of the
contribution from each node dynamically.

Common to all these interfaces is the presence of either GASNET_COLL_LOCAL or
GASNET_COLL_SINGLE in the flags argument. In the SINGLE case we have
UPC-style arguments in which the addresses are known (and thus passed in) on
all nodes. In the LOCAL case we have Titanium-style arguments in which only
the local addresses are known, and passed in.

Note that when SINGLE is given, then all arguments are "single valued",
providing identical information on all nodes (for arrays the contents of the
arrays must be identical, not their addresses). When LOCAL is given the
address arguments may differ among the nodes while all remaining arguments
are "single valued". For instance the nodes must all agree on the "root" of a
broadcast, scatter or gather, even though only the root node will provide its
address. Of course all nodes must agree on the nbytes and the flags.
Operation of any collective in which the arguments fail to agree as required
will be undefined. In the event of such a user error, high-quality
implementations are encouraged to issue an explanatory fatal error message if
debugging is enabled or if checking for agreement does not require additional
communication.

For the multiple-address interfaces, there is additional metadata required to
indicate how many images exist on each node. Since these interfaces are
designed for this metadata to be static, it is provided exactly once, through
the gasnet_coll_init() function. This function must be called by clients
before they may invoke any other gasnet_coll_* function (even if they never
use the multiple-address interfaces). It is therefore a collective operation,
called by all nodes in the same order (always first) among collectives. It is
also subject to the constraint that the arguments agree across nodes. Unlike
other collectives the _init call MUST be made collectively by every thread on
a node. This is a blocking call. This function may be called at most once by
each thread and may only be called between gasnet_attach() and gasnet_exit().

The 'images' argument to gasnet_coll_init() is an array of gasnet_image_t's,
having exactly gasnet_nodes() elements. The ith element of this array
indicates the number of images present on node i. Images are analogous to
language-level threads in all the clients we envision, but this need not be
the case. The images are always given in "node-major" order, meaning that the
blocks of data in a gather operation (for instance) include all the images
from node 0, followed by all the images from node 1, etc. (consistent with
the default thread layout of all the runtimes we examined). All elements of
the 'images' array must be non-zero.

For use in the descriptions that follow, define
  NImages := SUM(i=0..gasnet_nodes(), images[i]).
NImages must be less than or equal to SSIZE_MAX (as defined by POSIX).

A NULL pointer for the 'images' argument indicates a single image per GASNet
node. In a GASNET_SEQ configuration, 'images' must be NULL. If 'images' is
not NULL, then the 'my_image' argument to gasnet_coll_init() gives the image
number of the calling thread. Otherwise it is ignored.

There is a 'flags' value of GASNET_COLL_AGGREGATE which can be used to
request that consecutive non-blocking collective initiation calls be
aggregated into a single collective operation with a single handle; a usage
sketch appears below, and the precise rules follow.
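As a rough, non-normative sketch of the intended usage (the wrapper function,
its variables and the choice of three broadcasts are hypothetical; the rules
governing aggregation are given below):

  void example_aggregate(void *dstA, void *srcA, void *dstB, void *srcB,
                         void *dstC, void *srcC, gasnet_image_t root,
                         size_t nbytes, int flags) {
    gasnet_coll_handle_t h;
    /* Handles returned by calls carrying GASNET_COLL_AGGREGATE are undefined
     * and must not be synced, so they are simply discarded: */
    (void) gasnet_coll_broadcast_nb(GASNET_TEAM_ALL, dstA, root, srcA, nbytes,
                                    flags | GASNET_COLL_AGGREGATE);
    (void) gasnet_coll_broadcast_nb(GASNET_TEAM_ALL, dstB, root, srcB, nbytes,
                                    flags | GASNET_COLL_AGGREGATE);
    /* The final call omits GASNET_COLL_AGGREGATE and yields the one handle
     * covering the whole aggregate: */
    h = gasnet_coll_broadcast_nb(GASNET_TEAM_ALL, dstC, root, srcC, nbytes,
                                 flags);
    gasnet_coll_wait_sync(h);
  }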
It is illegal to pass GASNET_COLL_AGGREGATE to a blocking collective
initiation call.

To simplify the definitions, we'll consider a single collective to be a
trivial one-element aggregate. A collective initiation call is the First
member of an aggregate if either the previous collective initiation call for
the current team did not have GASNET_COLL_AGGREGATE in the 'flags', or if it
is the first collective initiation call for the current team. A collective
initiation call is the Last member of an aggregate if it does not have
GASNET_COLL_AGGREGATE in the 'flags'.

With these definitions, a sequence of initiation calls all without the
GASNET_COLL_AGGREGATE flag are each trivially the First and Last members of
their own aggregates and each has a distinct handle. A sequence of initiation
calls on a given team with the GASNET_COLL_AGGREGATE flag set form a
multiple-member aggregate, together with the first following call without
GASNET_COLL_AGGREGATE.

The handle returned by any collective initiation call with
GASNET_COLL_AGGREGATE in the flags is undefined. If a client attempts to sync
on a handle returned by such a call then the behavior is undefined. If the
Last member of an aggregate is a non-blocking initiation call, then it will
return a valid handle, successful synchronization of which indicates that all
members of the aggregate have completed to the extent required by their
synchronization flags. If the Last member of an aggregate is a blocking call
then it will not return until all members of the aggregate have completed to
the extent required by their synchronization flags.

GASNet non-blocking collectives have bulk data lifetime semantics regardless
of the use of aggregation. The lifetime of any given buffer starts with the
initiation call describing that buffer, and is not deferred until the Last
member of the aggregate.

It is permitted to make other GASNet calls between the First and Last calls
of an aggregate, excepting those calls that would be prohibited even in the
absence of aggregation, such as the gasnet_barrier_* calls. However, since
some implementations of aggregation may delay the start of communication
until the Last member is seen, use of any potentially time-consuming
operations between the First and Last collective initiation calls is
discouraged.

Note to implementers: The description above allows for an implementation
which does nothing but aggregate the handles in the spirit of an nbi access
region, starting communications just as it would in the absence of
aggregation.

* PROTOTYPES:

typedef void (*gasnet_coll_fn_t)();

typedef struct {
  gasnet_coll_fn_t fnptr;
  unsigned int     flags;
} gasnet_coll_fn_entry_t;

/* gasnet_coll_init: Initialize GASNet collectives
 *
 *  images:     Array of gasnet_nodes() elements giving the number of
 *              images present on each node. This must have the
 *              same contents on all callers or the behavior is undefined.
 *              If NULL, then there is one image per node.
 *              In GASNET_SEQ mode, NULL is the only legal value.
 *  my_image:   If 'images' is non-NULL, this gives the image number of
 *              the calling thread. These image numbers are zero-based,
 *              always assigning image 0 to node 0. Image numbers on a
 *              given node are consecutive, and increase as node numbers
 *              increase.
 *  fn_tbl:     An array of type gasnet_coll_fn_entry_t, specifying
 *              the functions which can be invoked for the
 *              computational collectives. This may safely differ
 *              in fnptr contents (but not flags or size) across nodes,
 *              but must agree across all images on a given node.
 *              Multiple images on the same node may safely pass the
 *              same pointer value, but this is not required.
 *              The client may safely reuse the associated memory after
 *              this call has returned (in all local images if the same
 *              pointer was used across them).
 *  fn_count:   The number of entries in 'fn_tbl'. Must agree across
 *              all callers or the behavior is undefined.
 *  init_flags: Presently unused. Must be 0.
 */
void gasnet_coll_init(const gasnet_image_t images[],
                      gasnet_image_t my_image,
                      const gasnet_coll_fn_entry_t fn_tbl[],
                      size_t fn_count, int init_flags);

/* Argument conventions for data movement collectives:
 * In the prototypes that follow the argument names are used
 * consistently, so the following definitions apply to all arguments
 * with the given name:
 *
 * team: Team handle
 *      This argument is an opaque handle of type gasnet_team_handle_t.
 *      This provides for an extension of the GASNet collectives to
 *      support teams. Presently this argument must have the value
 *      GASNET_TEAM_ALL.
 * flags: Flags
 *      The value of this argument must be equal across all callers or
 *      the behavior is undefined.
 *      This argument supplies flags which determine how the operation
 *      will be performed. This includes information about the
 *      synchronization required and the scope of addresses.
 *      Input synchronization flags are:
 *      + GASNET_COLL_IN_NOSYNC: All data movement may begin on any
 *        GASNet node as soon as any caller has entered the collective
 *        initiation function.
 *      + GASNET_COLL_IN_MYSYNC: Data movement in or out of the client
 *        memory of a given image may begin no earlier than the entry to
 *        the collective initiation function by that image.
 *      + GASNET_COLL_IN_ALLSYNC: Data movement may begin only when
 *        all participating images have entered the collective
 *        initiation function.
 *      Output synchronization flags are:
 *      + GASNET_COLL_OUT_NOSYNC: The sync of a collective handle may
 *        succeed at any time so long as the last thread to sync does
 *        not do so until all data movement has completed.
 *      + GASNET_COLL_OUT_MYSYNC: The sync of a collective handle may
 *        succeed as soon as all data movement has completed to and from
 *        local client data areas specified in the call (note that
 *        movement to or from buffers internal to the implementation and
 *        local to the node might still be taking place, for instance in
 *        tree-based broadcasts).
 *      + GASNET_COLL_OUT_ALLSYNC: The sync of a collective handle may
 *        succeed as soon as all data movement has completed to and from
 *        all client data areas specified in the call. This is weaker
 *        than a full barrier. For instance, the equivalent of a
 *        zero-byte broadcast is sufficient synchronization in the case
 *        of a rooted operation (broadcast, scatter and gather).
 *      Addressing scope flags are:
 *      + GASNET_COLL_SINGLE: This indicates UPC-type addressing
 *        in which every caller provides addresses for the source(s) and
 *        destination(s) on all participating images, and these arguments
 *        agree.
 *      + GASNET_COLL_LOCAL: This indicates Titanium-type addressing
 *        in which each caller provides only the local address(es).
 *        See GASNET_COLL_ALL_THREADS below, for more information.
 *      Defaults:
 *        There are no default values for these three categories of flags.
 *        The client must provide exactly one flag value from each of the
 *        three categories above.
 *      Aggregation:
 *        If present, the flag GASNET_COLL_AGGREGATE indicates that a
 *        sequence of non-blocking collective initiation calls should be
 *        aggregated together into an operation with a single handle.
 *      Callers:
 *        If present, the flag GASNET_COLL_ALL_THREADS indicates that all
 *        threads on each node will call this function collectively. In
 *        this case the input and output synchronization flags are applied
 *        at a per-image granularity, where the thread-to-image
 *        correspondence is that established by the call to
 *        gasnet_coll_init(). Combining this flag with GASNET_COLL_LOCAL
 *        means that the address(es) provided by the caller are local to
 *        the calling image only.
 *        If this flag is absent, then one representative thread from each
 *        node will call this function, and the synchronization flags will
 *        be applied to the union of all images on each node. Without this
 *        flag, use of GASNET_COLL_LOCAL means that the address(es)
 *        provided by the caller are those of all images on the local node.
 *        In GASNET_SEQ mode, this flag is legal, but meaningless.
 *        In GASNET_PARSYNC mode, this flag is illegal.
 *        In GASNET_PAR mode, this flag is legal and meaningful.
 *
 * nbytes: Byte count
 *      This is the number of bytes to be transferred per source block.
 *      The value must be greater than zero.
 *      The value of this argument must be equal across all callers or
 *      the behavior is undefined.
 *
 * {dst,src}image: Destination/Source image
 *      This argument specifies the root image of an asymmetric collective
 *      operation: Broadcast, Scatter or Gather.
 *      The value of this argument must be equal across all callers or
 *      the behavior is undefined. In a GASNET_SEQ configuration, the
 *      image number is identical to a node number. Otherwise, the
 *      image numbering is the one established by the 'my_image' arguments
 *      to gasnet_coll_init().
 *
 * {dst,src}: Destination (Source) address
 *      When GASNET_COLL_ALL_THREADS is not specified in the 'flags', the
 *      one calling thread on each node represents all of the threads on
 *      the same node. Therefore, the term "root image/node" will be used
 *      here to refer either to the root image (when GASNET_COLL_ALL_THREADS
 *      is given) or to the one calling thread on the same node as the
 *      given root image (otherwise).
 *
 *      For Broadcast and Scatter the source address is only applicable
 *      to the root image/node. For the Gather operation the destination
 *      address is only applicable to the root image/node. For the
 *      remaining operations both destination and source apply on all
 *      images.
 *
 *      In the SINGLE case this argument is given an identical value
 *      by all callers. This value is used as the destination (source)
 *      address on all images where the argument is applicable. Where a
 *      given argument is not applicable to the given image the identical
 *      value must still be provided.
 *
 *      In the LOCAL case this argument may be passed independent values
 *      by each caller. Each caller's value is used as its destination
 *      (source) address where applicable. Where a given argument is
 *      not applicable to the given caller the argument is ignored.
 *
 * {dst,src}list: Destination (Source) address list
 *      In the SINGLE case this argument is an array of NImages addresses.
 *      The contents of this array must be identical across all callers.
 *      The ith address is used as the destination (source) address of the
 *      ith image. The correspondence between images and GASNet nodes is
 *      the one given by the gasnet_coll_init() call.
 *
 *      In the LOCAL case this argument may be passed independent values
 *      on each node. On node i, this argument is an array of images[i]
 *      local addresses, where images[] is the argument passed to
 *      gasnet_coll_init().
 *
 * If the arguments to a multiple-address interface specify data
 * movement involving two blocks from images coexisting on a
 * single node such that the blocks actually overlap in memory,
 * the behavior is undefined (even if both are source blocks).
 * Except where otherwise noted, if any of the source areas overlap
 * any of the destination areas, the behavior is undefined.
 *
 * The following are the explicit exceptions to the above rule concerning
 * overlap of source and destination areas, where "identical" means
 * having the same values of node, address and length. They have in common
 * that omitting a copy between matching source and destination blocks
 * would still yield correct results.
 * + Broadcast:
 * The source block may legally be identical to any one of the
 * destination blocks.
 * + Scatter, Gather and Gather_all:
 * The i'th block of the source may legally be identical to the
 * i'th block of the destination.
 * + Exchange:
 * There are no exceptions for the exchange operation.
 *
 * There are important interactions to be aware of between the flags
 * GASNET_COLL_LOCAL and GASNET_COLL_ALL_THREADS:
 *
 * Calling the single-address interfaces with GASNET_COLL_LOCAL in
 * the 'flags' but without GASNET_COLL_ALL_THREADS is NOT permitted
 * when there are multiple images on any node. This is the case
 * because there would be no way to obtain the addresses for the
 * non-calling images on such nodes.
 *
 * Calling the multiple-address interfaces with GASNET_COLL_LOCAL
 * set in the 'flags' is NOT permitted in a GASNET_SEQ configuration
 * or in combination with GASNET_COLL_ALL_THREADS. This is because
 * there can only ever be one address per caller in these cases and
 * identical behavior can therefore be achieved less expensively via
 * the single-address interfaces.
 */
gasnet_coll_handle_t gasnet_coll_broadcast_nb(
        gasnet_team_handle_t team, void *dst,
        gasnet_image_t srcimage, void *src, size_t nbytes, int flags);
gasnet_coll_handle_t gasnet_coll_broadcastm_nb(
        gasnet_team_handle_t team, void * const dstlist[],
        gasnet_image_t srcimage, void * src, size_t nbytes, int flags);
void gasnet_coll_broadcast(
        gasnet_team_handle_t team, void *dst,
        gasnet_image_t srcimage, void *src, size_t nbytes, int flags);
void gasnet_coll_broadcastm(
        gasnet_team_handle_t team, void * const dstlist[],
        gasnet_image_t srcimage, void * src, size_t nbytes, int flags);

gasnet_coll_handle_t gasnet_coll_scatter_nb(
        gasnet_team_handle_t team, void *dst,
        gasnet_image_t srcimage, void *src, size_t nbytes, int flags);
gasnet_coll_handle_t gasnet_coll_scatterm_nb(
        gasnet_team_handle_t team, void * const dstlist[],
        gasnet_image_t srcimage, void * src, size_t nbytes, int flags);
void gasnet_coll_scatter(
        gasnet_team_handle_t team, void *dst,
        gasnet_image_t srcimage, void *src, size_t nbytes, int flags);
void gasnet_coll_scatterm(
        gasnet_team_handle_t team, void * const dstlist[],
        gasnet_image_t srcimage, void * src, size_t nbytes, int flags);

gasnet_coll_handle_t gasnet_coll_gather_nb(
        gasnet_team_handle_t team, gasnet_image_t dstimage, void *dst,
        void *src, size_t nbytes, int flags);
gasnet_coll_handle_t gasnet_coll_gatherm_nb(
        gasnet_team_handle_t team, gasnet_image_t dstimage, void * dst,
        void * const srclist[], size_t nbytes, int flags);
void gasnet_coll_gather(
        gasnet_team_handle_t team, gasnet_image_t dstimage, void *dst,
        void *src, size_t nbytes, int flags);
void gasnet_coll_gatherm(
        gasnet_team_handle_t team, gasnet_image_t dstimage, void * dst,
        void * const srclist[], size_t nbytes, int flags);

gasnet_coll_handle_t gasnet_coll_gather_all_nb(
        gasnet_team_handle_t team, void *dst, void *src, size_t nbytes, int flags);
gasnet_coll_handle_t gasnet_coll_gather_allm_nb(
        gasnet_team_handle_t team, void * const dstlist[], void * const srclist[],
        size_t nbytes, int flags);
void gasnet_coll_gather_all(
        gasnet_team_handle_t team, void *dst, void *src, size_t nbytes, int flags);
void gasnet_coll_gather_allm(
        gasnet_team_handle_t team, void * const dstlist[], void * const srclist[],
        size_t nbytes, int flags);

gasnet_coll_handle_t gasnet_coll_exchange_nb(
        gasnet_team_handle_t team, void *dst, void *src, size_t nbytes, int flags);
gasnet_coll_handle_t gasnet_coll_exchangem_nb(
        gasnet_team_handle_t team, void * const dstlist[], void * const srclist[],
        size_t nbytes, int flags);
void gasnet_coll_exchange(
        gasnet_team_handle_t team, void *dst, void *src, size_t nbytes, int flags);
void gasnet_coll_exchangem(
        gasnet_team_handle_t team, void * const dstlist[], void * const srclist[],
        size_t nbytes, int flags);

/* Argument conventions for computation collectives:
 * In the prototypes that follow the argument names are used
 * consistently, so the following definitions apply to all arguments
 * with the given name except for the destination arguments as noted:
 *
 * Destination arguments for Reduction
 * The result of a reduction is a single data element of the size
 * given by the 'elem_size' argument.
 * dstimage: Destination Image
 * This argument gives the destination image for a reduction. The
 * value of this argument must be equal across all callers or the
 * behavior is undefined. In a GASNET_SEQ configuration, the
 * image number is identical to a node number. Otherwise, the
 * image numbering is the one established by the 'my_image' arguments
 * to gasnet_coll_init().
 * dst: Destination Address
 * When GASNET_COLL_ALL_THREADS is not specified in the 'flags' the
 * one calling thread on each node represents all of the threads on
 * the same node. Therefore, the term "destination image/node" will
 * be used here to refer either to the destination image (when
 * GASNET_COLL_ALL_THREADS is given) or the one calling thread on the
 * same node as the given destination image otherwise.
 *
 * This argument gives the address at which the result will be
 * placed on the destination image/node. For GASNET_COLL_LOCAL the
 * 'dst' address is ignored on all callers except the destination
 * image/node. For GASNET_COLL_SINGLE the value of this argument
 * must be equal across all callers or the behavior is undefined.
 *
 * Destination arguments for Prefix Reduction
 * The result of a prefix reduction is an array of 'elem_count'
 * elements, each of the size given by the 'elem_size' argument.
 * dst, dst_blksz, dst_offset: Destination (prefix_reduce)
 * These arguments give the address or addresses of the destination
 * array for the single-address interfaces. See the descriptions
 * of the arguments 'src', 'src_blksz' and 'src_offset' for the
 * interpretation of this triplet as a specification of a
 * distributed array.
 * dstlist, dst_blksz, dst_offset: Destination address(es) (prefix_reducem)
 * These arguments give the address or addresses of the destination
 * array for the multiple-address interfaces. See the descriptions
 * of the arguments 'srclist', 'src_blksz' and 'src_offset' for the
 * interpretation of this triplet as a specification of a
 * distributed array.
 *
 * Common Arguments
 * team: Team handle
 * This argument is an opaque handle of type gasnet_team_handle_t.
 * This provides for an extension of the GASNet collectives to
 * support teams. Presently this argument must have the value
 * GASNET_TEAM_ALL.
 * src: Source address(es)
 * This argument gives the address or addresses of the source
 * array for the single-address interfaces. The array holds
 * 'elem_count' data items, each of size 'elem_size' bytes. The
 * layout of elements across the images depends on additional
 * arguments. When 'src_blksz' is zero, we have indefinite
 * layout and all 'elem_count' data elements reside on a single
 * image, given by 'src_offset'. When 'src_blksz' is non-zero,
 * the data has a block-cyclic layout consisting of at least
 * ('elem_count'+'src_offset') elements with a block size of
 * 'src_blksz' elements per block.
 * When GASNET_COLL_SINGLE is included in the 'flags' argument,
 * the same value must be passed by all callers, and is used as the
 * starting address of the array on all nodes (when 'src_blksz'
 * is non-zero) or exactly one node (when 'src_blksz' is zero).
 * When GASNET_COLL_LOCAL is included in the 'flags' argument,
 * the value passed by each caller is independent and provides only
 * the address of the array on the given image.
 * srclist: Source address(es)
 * This argument replaces 'src' for use in the multiple-address
 * interfaces. When GASNET_COLL_SINGLE is included in the
 * 'flags' argument, this argument is an array of NImages
 * addresses, which must be identical across all callers. The ith
 * address is used as the address of the ith image, where the
 * correspondence between images and GASNet nodes is the one
 * given by the gasnet_coll_init() call. In the LOCAL case this
 * argument may be passed independent values by each caller and on
 * node i, is an array of images[i] local addresses, where
 * images[] is the argument passed to gasnet_coll_init().
 * src_blksz: Source blocksize
 * This argument gives the blocksize of the data layout for the
 * array. When equal to zero, 'src_blksz' denotes indefinite
 * layout and 'src_offset' gives the index of the corresponding
 * node or image. Otherwise, this is the blocksize, in units of
 * elements ('elem_size' bytes), of a block-cyclic layout. The
 * value of this argument must be equal across all callers or the
 * behavior is undefined.
 * src_offset: Source offset
 * This argument gives the offset to the first data element,
 * relative to the base of the distributed array defined by the
 * 'src' or 'srclist' and 'src_blksz' arguments. When
 * 'src_blksz' is zero the offset is in units of images, and
 * gives the index of the image holding the data. For the
 * single-address interfaces images are synonymous with nodes,
 * while for the multiple-address interfaces the node/image
 * correspondence is that given by the gasnet_coll_init() call.
 * When 'src_blksz' is non-zero the offset is in units of data
 * elements ('elem_size' bytes) and follows the ordering of the
 * block-cyclic layout. The value of this argument must be equal
 * across all callers or the behavior is undefined.
 * Except where otherwise noted, if any of the source areas overlap any of
 * the destination areas, the behavior is undefined.
 * If the arguments to a multiple-address interface specify data
 * movement involving two blocks from images coexisting on a
 * single node such that the blocks actually overlap in memory,
 * the behavior is undefined (even if both are source blocks).
 * elem_size: Element size
 * This argument gives the size, in bytes, of the data elements
 * to be reduced, and must be equal across all callers or the
 * behavior is undefined.
 * elem_count: Element count
 * This argument gives the number of elements to be reduced, and
 * must be equal across all callers or the behavior is undefined.
 * func, func_arg: Reduction function and argument
 * These arguments define the client-provided function which will
 * perform the reduction or prefix reduction and its one client-
 * provided argument. The 'func' argument is an integer index
 * into the fn_tbl table passed to gasnet_coll_init().
 * Both arguments must have uniform values across all callers or
 * the behavior is undefined. Passing of pointers in 'func_arg'
 * is unsafe both because a pointer may be larger than an int,
 * and because the single-valued constraint may not be
 * satisfiable for pointers on some platforms.
 * flags: Flags argument
 * The value of this argument must be equal across all callers or
 * the behavior is undefined. This argument gives a bitwise OR
 * of flags which determine how the operation will be performed.
 * Flags are identical to those defined above for the data
 * movement collectives.
 *
 * There are important interactions to be aware of between the flags
 * GASNET_COLL_LOCAL and GASNET_COLL_ALL_THREADS:
 *
 * Calling the single-address interfaces with GASNET_COLL_LOCAL in
 * the 'flags' but without GASNET_COLL_ALL_THREADS is NOT permitted
 * when there are multiple images on any node. This is the case
 * because there would be no way to obtain the addresses for the
 * non-calling images on such nodes.
 *
 * Calling the multiple-address interfaces with GASNET_COLL_LOCAL
 * set in the 'flags' is NOT permitted in a GASNET_SEQ configuration
 * or in combination with GASNET_COLL_ALL_THREADS. This is because
 * there can only ever be one address per caller in these cases and
 * identical behavior can therefore be achieved less expensively via
 * the single-address interfaces.
 */
gasnet_coll_handle_t gasnet_coll_reduce_nb(
        gasnet_team_handle_t team, gasnet_image_t dstimage, void *dst,
        void *src, size_t src_blksz, size_t src_offset,
        size_t elem_size, size_t elem_count, int func, int func_arg, int flags);
gasnet_coll_handle_t gasnet_coll_reducem_nb(
        gasnet_team_handle_t team, gasnet_image_t dstimage, void *dst,
        void * const srclist[], size_t src_blksz, size_t src_offset,
        size_t elem_size, size_t elem_count, int func, int func_arg, int flags);
void gasnet_coll_reduce(
        gasnet_team_handle_t team, gasnet_image_t dstimage, void *dst,
        void *src, size_t src_blksz, size_t src_offset,
        size_t elem_size, size_t elem_count, int func, int func_arg, int flags);
void gasnet_coll_reducem(
        gasnet_team_handle_t team, gasnet_image_t dstimage, void *dst,
        void * const srclist[], size_t src_blksz, size_t src_offset,
        size_t elem_size, size_t elem_count, int func, int func_arg, int flags);

gasnet_coll_handle_t gasnet_coll_prefix_reduce_nb(
        gasnet_team_handle_t team, void *dst, size_t dst_blksz, size_t dst_offset,
        void *src, size_t src_blksz, size_t src_offset,
        size_t elem_size, size_t elem_count, int func, int func_arg, int flags);
gasnet_coll_handle_t gasnet_coll_prefix_reducem_nb(
        gasnet_team_handle_t team, void * const dstlist[], size_t dst_blksz, size_t dst_offset,
        void * const srclist[], size_t src_blksz, size_t src_offset,
        size_t elem_size, size_t elem_count, int func, int func_arg, int flags);
void gasnet_coll_prefix_reduce(
        gasnet_team_handle_t team, void *dst, size_t dst_blksz, size_t dst_offset,
        void *src, size_t src_blksz, size_t src_offset,
        size_t elem_size, size_t elem_count, int func, int func_arg, int flags);
void gasnet_coll_prefix_reducem(
        gasnet_team_handle_t team, void * const dstlist[], size_t dst_blksz, size_t dst_offset,
        void * const srclist[], size_t src_blksz, size_t src_offset,
        size_t elem_size, size_t elem_count, int func, int func_arg, int flags);
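As an illustration of how the flags, handles and prototypes above combine, here is
a minimal usage sketch. The header name and the explicit-handle sync call name are
assumptions (this proposal has not yet pinned down either), as noted in the comments;
it also assumes one image per node and that gasnet_coll_init() has already been called.

/* Usage sketch only -- names marked "assumed" are not fixed by this proposal. */
#include <gasnet.h>
#include <gasnet_coll.h>     /* assumed header for the collective interfaces */

void example_broadcast(void *dst, void *src, size_t nbytes) {
  /* SINGLE addressing: every caller passes the same dst/src values and
   * image 0 is the root supplying 'src'.  Loose input synchronization,
   * per-image output synchronization, as described above. */
  int flags = GASNET_COLL_IN_MYSYNC | GASNET_COLL_OUT_MYSYNC | GASNET_COLL_SINGLE;
  gasnet_coll_handle_t handle =
      gasnet_coll_broadcast_nb(GASNET_TEAM_ALL, dst, /* srcimage = */ 0,
                               src, nbytes, flags);
  /* ... independent computation may overlap the collective here ... */
  if (handle != GASNET_COLL_INVALID_HANDLE)   /* may have completed synchronously */
    gasnet_coll_wait_sync(handle);            /* assumed name of the explicit-handle sync */
}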
//===========================================================================
END OF CURRENT PROPOSAL
START OF IMPLEMENTATION NOTES
//===========================================================================
* Thoughts on progress of implementation:

While not part of the GASNet spec for users, it is important that we have a good
design for how the reference implementation will interact with the individual
conduits. In particular I see at least three ways in which a conduit would wish
to make progress on collectives:

1) Simple polling, for instance from gasnet_AMPoll(). This would be what we
include in template-conduit as well as pure-polling implementations such as
mpi-conduit.

2) Blocked thread. A conduit may wish to create an extra thread just for the
purpose of making progress on collectives. For this reason it would be nice to
have something like a condition variable on which such a thread could block
until there was something to do.

3) Interrupts. A conduit such as LAPI may be able to switch to a different
operating mode in which the application is interrupted when network traffic
arrives. This would make it more responsive to the network for the purpose of
advancing a collective operation, but the added overhead is not desirable for
normal operation. So, one might wish to enter interrupt mode when one or more
collectives are "live" and return to polled mode when none are live.

There is similarity between #2 and #3 that argues for a general approach that
they can each plug into. In fact, #1 can be implemented over the same
generalization as a part of the template-conduit.

The general solution is to have hooks (provided by the conduit) which are called
at each entry to a collective initiation function and when the collective is
completed (probably in the "kick" function called by gasnete_poll()).

/* Routine to make progress on collectives (if any): */
extern void gasnete_poll(void);

/* Conduit-specific hooks for counting live collectives: */
extern void gasnete_collective_entry_hook(void);
extern void gasnete_collective_leave_hook(void);

Here are versions of all three designs (not guaranteed to compile). I would put
these all into the template-conduit:

1) Simple polling:
Implement the following in gasnet_extended_fwd.h:

extern gasneti_atomic_t gasnetc_collective_count;
GASNET_INLINE_MODIFIER(gasnete_collective_entry_hook)
void gasnete_collective_entry_hook(void) {
  gasneti_atomic_increment(&gasnetc_collective_count);
}
GASNET_INLINE_MODIFIER(gasnete_collective_leave_hook)
void gasnete_collective_leave_hook(void) {
  gasneti_atomic_decrement(&gasnetc_collective_count);
}
#define GASNETE_POLL_IF_NEEDED() \
  if_pf (gasneti_atomic_read(&gasnetc_collective_count) != 0) \
    { gasnete_poll(); }

In the core, one would then call GASNETE_POLL_IF_NEEDED() in gasnetc_poll(), at
least in template-conduit, and perhaps at other places where it could be helpful
(such as right after an AM handler is run).

2) Blocking thread:
To do this we just need to add a little to the case above. Note that we are
about to add condition variables to GASNet. They should be a natural wrapper
around pthread condition variables, taking an hsl in place of a pthread mutex.
Note also that a handler may NEVER wait on a condition variable, only signal it.
Prototypes for gasneti_cond* appear in the next section.

Implement the following in gasnet_extended_fwd.h:

extern gasneti_atomic_t gasnetc_collective_count;
extern gasnet_hsl_t gasnetc_collective_hsl;
extern gasneti_cond_t gasnetc_collective_cond;
GASNET_INLINE_MODIFIER(gasnete_collective_entry_hook)
void gasnete_collective_entry_hook(void) {
  gasneti_atomic_increment(&gasnetc_collective_count);
  gasneti_assert(gasneti_atomic_read(&gasnetc_collective_count) >= 1);
  gasnet_hsl_lock(&gasnetc_collective_hsl);
  gasneti_cond_signal(&gasnetc_collective_cond);
  gasnet_hsl_unlock(&gasnetc_collective_hsl);
}
GASNET_INLINE_MODIFIER(gasnete_collective_leave_hook)
void gasnete_collective_leave_hook(void) {
  gasneti_atomic_decrement(&gasnetc_collective_count);
}

In the core, the thread would use something such as:
void gasnetc_collective_poll_thread(void) {
  while (!EXITING()) {
    /* Note the "outer" unlocked test to reduce lock traffic.
     * In a race it can only cause extra calls to gasnete_poll(),
     * which are safe.  Missing a wakeup is NOT safe, so the count
     * is re-checked while holding the hsl before each wait: a
     * signal sent by the entry hook then cannot fall between our
     * test and our call to gasneti_cond_wait(). */
    if (gasneti_atomic_read(&gasnetc_collective_count) == 0) {
      gasnet_hsl_lock(&gasnetc_collective_hsl);
      while (!EXITING() &&
             gasneti_atomic_read(&gasnetc_collective_count) == 0) {
        gasneti_cond_wait(&gasnetc_collective_cond, &gasnetc_collective_hsl);
      }
      gasnet_hsl_unlock(&gasnetc_collective_hsl);
    }
    if_pt (!EXITING()) { gasnete_poll(); }
    // TODO: need a throttling mechanism here so that this thread
    // doesn't starve the working threads (which it may be waiting on)
    // for CPU and lock resources
  }
  gasnetc_exit(0);
}

3) Interrupt mode
Implement the following in gasnet_extended_fwd.h:

extern unsigned gasnetc_collective_count;
extern gasnet_hsl_t gasnetc_collective_hsl;
GASNET_INLINE_MODIFIER(gasnete_collective_entry_hook)
void gasnete_collective_entry_hook(void) {
  gasnet_hsl_lock(&gasnetc_collective_hsl);
  gasnetc_collective_count++;
  if (gasnetc_collective_count == 1) {
    ENTER_INTERRUPT_MODE();
  }
  gasnet_hsl_unlock(&gasnetc_collective_hsl);
}
GASNET_INLINE_MODIFIER(gasnete_collective_leave_hook)
void gasnete_collective_leave_hook(void) {
  gasnet_hsl_lock(&gasnetc_collective_hsl);
  gasnetc_collective_count--;
  if (gasnetc_collective_count == 0) {
    LEAVE_INTERRUPT_MODE();
  }
  gasnet_hsl_unlock(&gasnetc_collective_hsl);
}

It might be tempting to merge the common ideas in #2 and #3 into an "edge
triggered" pair of hooks, one called on the 0->1 transition and the other on the
1->0 transition. However, this would not work very cleanly with the condition
variable unless the hsl used to trigger the hooks was exposed for use in the
waiting thread. I dislike that as an unclear interface, and the code duplicated
between #2 and #3 is too small to justify such a design.

//===========================================================================
* Proposal for gasneti_cond_t:

In order to permit a conduit to block waiting for something that might happen in
another thread, particularly in a handler executed in another thread, we want
something like pthread_cond_t, but which takes an hsl in place of a
pthread_mutex_t. Note that these are for use INTERNALLY, and we don't want to
add them to the GASNet spec where client code might use them. (At least *I*
don't want to do so.)

Here is an attempt (not run through the compiler) to wrap pthread_cond_t for use
with our "reference" hsl, which is just a struct holding a gasneti_mutex_t:

typedef struct _gasneti_cond_t {
  pthread_cond_t cond;
} gasneti_cond_t;
#define GASNETI_COND_INITIALIZER { PTHREAD_COND_INITIALIZER }
#define gasneti_cond_init(P)     pthread_cond_init(&((P)->cond), NULL)
#define gasneti_cond_destroy(P)  pthread_cond_destroy(&((P)->cond))
#define gasneti_cond_signal(P)   pthread_cond_signal(&((P)->cond))
#define gasneti_cond_wait(P,H)   gasneti_cond_wait_mutex(P, &((H)->lock))
extern void gasneti_cond_wait_mutex(gasneti_cond_t *cond, gasneti_mutex_t *lock);

void gasneti_cond_wait_mutex(gasneti_cond_t *cond, gasneti_mutex_t *lock) {
#if !GASNETI_USE_TRUE_MUTEXES
  gasneti_fatalerror("There's only one thread: waiting on condition variable => deadlock");
#endif
#if GASNET_DEBUG
  gasneti_assert(lock->owner == GASNETI_THREADIDQUERY());
  lock->owner = (uintptr_t)GASNETI_MUTEX_NOOWNER;
  pthread_cond_wait(&(cond->cond), &(lock->lock));
  gasneti_assert(lock->owner == (uintptr_t)GASNETI_MUTEX_NOOWNER);
  lock->owner = GASNETI_THREADIDQUERY();
#else
  pthread_cond_wait(&(cond->cond), lock);
#endif
}

//===========================================================================
* Notes on (non-)commutative (prefix) reduction operators.
Imagine X # Y # Z, where # is some operator implemented by f(.,.).
Let C={True,False} denote commutativity.

C = False: Freedom to change grouping, but not order.
  Only legal options are f(f(X,Y), Z) and f(X, f(Y,Z)).
C = True: Freedom to change grouping and order, allowing 12 options:
  f(f(X,Y),Z)   f(X,f(Y,Z))
  f(f(X,Z),Y)   f(X,f(Z,Y))
  f(f(Y,X),Z)   f(Y,f(X,Z))
  f(f(Y,Z),X)   f(Y,f(Z,X))
  f(f(Z,X),Y)   f(Z,f(X,Y))
  f(f(Z,Y),X)   f(Z,f(Y,X))

For operators which are non-commutative (e.g. matrix multiply on non-square
matrices) we can perform partial reductions over the blocks in parallel, but
cannot reduce multiple blocks per row on a given node w/o communication. The
communication that IS required for a reduction can be done in O(log(P)) stages
which moves n_elems/blk_sz/P elements per node in each stage. [This is not the
only way to do it.]

For operators which are commutative we can reduce the multiple blocks on a node
to a single scalar before starting communication. We can perform the full
reduction in O(log(P)) stages requiring only a single scalar communicated per
node per stage.

Note that for pure-blocked layout there is no difference in implementation
between commutative and non-commutative operators. For the case of pure-cyclic
the difference is the most extreme.

//===========================================================================
* Notes on algorithms

Broadcast:
----------
In [Karp93] we learn how to construct an optimal schedule for broadcast. The
tree can be built knowing only the ratio A=ELL/max(O_s,g). Using a heap data
structure one can determine the tree structure for a P-node broadcast in
O(log(P!)) time. Note that log(P!) grows as Theta(P*log(P)), so the cost to
construct the schedule grows superlinearly in P. However, it doesn't require
any communication (the deterministic algorithm allows all nodes to compute the
same schedule independently).

For the case of A>>1 the communication is latency dominant and the optimal
broadcast is the naive one in which the root sends data to each other node in
sequence. For the case of A near 1, the optimal tree shape is a Binomial tree
[Kielmann99].

Here are, for P=8, model results for the "EEL" of a broadcast with four
different tree shapes on three different networks. Note that these are in
microseconds (us) and are directly comparable between networks. However, they
are just model results, not real timings. The "Flat" tree is the one-level tree
of the naive broadcast. The percentages are relative to the optimal.

              gm            vapi          mpi-conduit
  Optimal     21            24            334
  Binomial    21            30 (+28%)     570 (+171%)
  Binary      26 (+24%)     30 (+28%)     570 (+171%)
  Flat        43 (+105%)    31 (+32%)     334

The binary tree and the "flat" tree are the easiest two to implement. With just
these two choices, we can get within 30% of the optimal on the two fast
networks, and the flat tree is the optimal for the slow network. I don't have
any bound on just how bad we could do with these two, or any heuristic for
choosing between them. However, I expect to implement them as the first two
versions. Later I hope to implement the construction of the optimal trees. I
can see doing it two ways:
  1) At init time for some fixed A, ignoring the data size.
  2) At runtime, using some size-dependent (and conduit-specific) model for A
     and caching the results so they can be reused when the same size is
     repeated. Note that the data to cache would just be
     P*sizeof(gasnet_node_t) bytes.
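For concreteness, here is a small self-contained sketch (illustration only, not
the reference implementation) of the send schedule of the binomial tree
mentioned above, rooted at node 0; other roots amount to relabeling ranks
relative to the root. In round i every node that already holds the data
forwards it to (its rank + 2^i), so each non-root node receives exactly once.

#include <stdio.h>

int main(void) {
  const int P = 8;                  /* number of nodes in the broadcast */
  int round, p;
  for (round = 0; (1 << round) < P; ++round) {
    for (p = 0; p < (1 << round); ++p) {
      int peer = p + (1 << round);  /* node p forwards to p + 2^round */
      if (peer < P)
        printf("round %d: node %d sends to node %d\n", round, p, peer);
    }
  }
  return 0;
}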
Scatter:
--------
The LogGP paper [Alexandrov95] describes an algorithm for constructing the
optimal scatter, analogous to the broadcast schedule of [Karp93]. Being based
on LogGP, it takes proper consideration of the varying sizes of the messages
when a tree is used to forward data.

Gather:
-------
The Gather is exactly the same as Scatter with all the data moving in the
opposite direction. Therefore whatever I decide for Scatter will also be used
to implement Gather.

Exchange:
---------
The final data placement is equivalent to doing P scatters, one with each node
as the root. In a simple LogP model where we can only send a single data item
per communication operation, the optimal [Karp93] is to perform P naive
scatters in lock-step, but choosing the order of each node's sends so each node
receives exactly once per step. The simplest such ordering is for node p to
send to ((p+1) mod P), ((p+2) mod P), ... ((p+P-1) mod P).

Gather_All:
-----------
The final data placement is equivalent to doing P broadcasts. In the LogP model
with single-item sends the optimal [Karp93] is the same as described above for
Exchange, but with the same data item instead of distinct ones.

If we allow a node to send more than a single data item per send, then we can
improve on the methods above. If we divide the communication into ceil(lg(P))
rounds (where lg denotes the base-2 log) then each round can forward an
exponentially increasing amount of data (1 item in round 0, 2 items in round 1,
2^i items in round i). At the end of the final round each node has received
2^ceil(lg(P)) - 1 >= P - 1 items (counting any duplicates). With the correct
pairings for these sends, we can ensure that all P items are received at each
node (and with a little care we can even avoid sending the "surplus"
2^ceil(lg(P)) - P items that exist when P is not a power of two). The ordering
for this is that in round i (starting from 0) node p will send to node
((p+2^i) mod P). As it happens, this is exactly the same communication pattern
used in the Dissemination barrier [Mellor-Crummey91; sec 3.3], but now we are
also moving real data. A simulation of this schedule appears below.

For small P the Exchange-like algorithm is faster, even if we assume the timings
are constant as the payload size grows in each round. As the ratio A
(=ELL/max(O_s,g)) increases the crossover point moves toward larger P. The
dominant term for the Exchange-like algorithm is O(P*max(O_s,g)), due to the gap
between the P sends from each node. The dominant term for the
Dissemination-like algorithm is O(lg(P)*ELL), due to the need to wait for data
to arrive (before resending it) at each of the lg(P) rounds. Both algorithms
move the exact same number of data bytes in and out of each node. The
Dissemination-like algorithm tries to reduce the number of sends needed to
actually move the data, but must pay a cost proportional to ELL to wait for the
needed data to arrive.
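To make the dissemination-style gather_all schedule concrete, here is a small
serial simulation (mine, purely illustrative, not conduit code) which checks
that when node p forwards everything it currently holds to ((p+2^i) mod P) in
round i, every node ends up holding all P contributions, including when P is
not a power of two.

#include <stdio.h>
#include <string.h>

#define P 6   /* deliberately not a power of two */

int main(void) {
  /* have[p][q] != 0 iff node p currently holds node q's contribution */
  static unsigned char have[P][P], next[P][P];
  int p, q, step;
  for (p = 0; p < P; ++p) have[p][p] = 1;
  for (step = 1; step < P; step <<= 1) {          /* step == 2^i in round i */
    memcpy(next, have, sizeof(have));
    for (p = 0; p < P; ++p) {
      int dst = (p + step) % P;                   /* node p sends to dst */
      for (q = 0; q < P; ++q)
        if (have[p][q]) next[dst][q] = 1;         /* forward everything held */
    }
    memcpy(have, next, sizeof(have));
  }
  for (p = 0; p < P; ++p)
    for (q = 0; q < P; ++q)
      if (!have[p][q]) { printf("node %d never received item %d\n", p, q); return 1; }
  printf("all %d nodes hold all %d contributions\n", P, P);
  return 0;
}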
Permute:
--------
Other than implementing the UPC synchronization, there is no obvious mechanism
by which GASNet can do any better than the use of Puts.

Reduce:
-------
As noted in [Karp93], the optimal communication pattern for reduction in the
LogP model is the same tree as for Broadcast. The order and direction of the
comms are reversed and the data elements moving are partial "sums" from each
subtree. There is no big concern over unequal contributions here, since each
node must contribute exactly 1 or 0 "sums" to the total reduction. The
implementation of non-commutative operators will take a little extra work,
since in a cyclic or block-cyclic layout we can't reduce all the elements on a
given node to a single scalar before we begin communication. Getting
non-commutative operators right will also slightly restrict the trees we could
legally use. However, the optimal trees are still optimal under any permutation
of the node labels other than the root. Thus, we can always construct trees
such that each subtree contains only consecutive contributions to the
reduction. We'd probably construct our tree this way regardless of the
commutativity of our operator.

PrefixReduce:
-------------
The interface which has developed for this addresses how to express the unequal
contributions from the nodes. I've whiteboarded some ideas and found that the
communication pattern used by the dissemination barrier (and discussed above as
a possible Gather_All implementation) is also an efficient algorithm for prefix
reduction, regardless of the commutativity of the operator. A serial sketch of
these four steps, for the simplest layout, appears after Step 4 below.

Step 1: Setup dst array
We first copy the src to the dst. In the case that the first elements of the
src and dst arrays are not identically aligned, this will be a collective data
movement not expressed by any of the existing interfaces. Where the src and dst
are aligned, this step is entirely free of communication.

Step 2: Partial sums
Once the dst array is initialized with a copy of the src array, each node
performs a reduction (NOT a prefix reduction) over each "row" of its
block-cyclic contribution. This results in ceil(n_elem/blk_sz/P) partial
reductions per node, with some nodes potentially having exactly one fewer if
things aren't aligned perfectly. This step is entirely free of communication.

Step 3: Disseminating the partial reductions
Using a communication pattern identical to that of the dissemination barrier,
the partial reductions can be passed around in such a way that every node can
accumulate the partial reductions it will need to complete the prefix
reduction. At each of log(P) phases each node will send to exactly one peer and
receive from exactly one peer, thus there are no hot spots. Each of these
pairwise communications includes the partial reductions for each "row", and
thus sends n_elem/blk_sz/P elements (give or take 1). In each phase the partial
reductions received are accumulated with the per-row reductions already
received before sending the combined result in the next phase. This step is
hard to convey w/o a whiteboard. So, you'll have to take my word that when done
in this way we naturally accumulate the partial reductions (w/o needing to
assume the operator is commutative) that would be needed to perform either an
all-reduce-all or a prefix-reduce. We can actually reduce the size of some of
the messages since we don't need any partial reductions coming "from the
right". However, in the block-cyclic case it is only the partial reductions of
the first and last "row" that might potentially be eliminated from some
messages.

Step 4: Prefix reduction
At this point each node has the partial reductions it needs, one per "row". For
instance, if a given node in a block-cyclic layout holds elements 2 and 3 in
one "row" and elements 6 and 7 in another, then step 3 has generated
SUM(a[0]..a[1]) and SUM(a[2]..a[5]). Using these two values one can finally
perform the prefix reduction.
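Here is the promised sketch: a serial simulation (mine, purely illustrative) of
Steps 1-4 for the simplest case of a pure-blocked layout, one "row" per node,
with integer addition standing in for the operator. Values are always combined
in index order, so the same structure remains valid for a non-commutative (but
associative) operator. Step 3 is shown as a plain serial scan rather than the
log(P) dissemination-style exchanges described above.

#include <stdio.h>

#define P 4   /* nodes */
#define B 3   /* elements per node (pure-blocked layout, one "row" per node) */

int main(void) {
  int src[P][B], dst[P][B], partial[P], left[P];
  int p, j, run;

  for (p = 0; p < P; ++p)
    for (j = 0; j < B; ++j) src[p][j] = p*B + j + 1;   /* arbitrary input data */

  /* Step 1: initialize dst with a copy of src (no communication needed here). */
  for (p = 0; p < P; ++p)
    for (j = 0; j < B; ++j) dst[p][j] = src[p][j];

  /* Step 2: each node reduces its own block to a single partial result. */
  for (p = 0; p < P; ++p) {
    partial[p] = dst[p][0];
    for (j = 1; j < B; ++j) partial[p] = partial[p] + dst[p][j];
  }

  /* Step 3: every node obtains the combination of all partials to its left.
   * (Shown as a serial scan; the text above does this with log(P)
   * dissemination-style pairwise exchanges of the partial reductions.) */
  left[0] = 0;                                  /* identity element for '+' */
  for (p = 1; p < P; ++p) left[p] = left[p-1] + partial[p-1];

  /* Step 4: each node finishes the prefix reduction over its own block. */
  for (p = 0; p < P; ++p) {
    run = left[p];
    for (j = 0; j < B; ++j) { run = run + dst[p][j]; dst[p][j] = run; }
  }

  for (p = 0; p < P; ++p)
    for (j = 0; j < B; ++j)
      printf("prefix through element %d = %d\n", p*B + j, dst[p][j]);
  return 0;
}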
Sort:
-----
I've no bright ideas here. The most naive method would seem to be Gather, Sort
on one node, Scatter. However, the Gather and Scatter want equal contribution
from each node, and Sort does not require it. This might be another case for
implementing collectives with variable-sized contributions. I have not yet
finished searching for literature specific to this problem, but there are
certainly methods much better than the Gather+Scatter described above.

//===========================================================================
Notes on MPI Collectives:

The MPI collectives interface design and implementations are an important
precedent for our work. Here are a few notes from a brief reading of the spec
that are relevant to our future work:

* MPI's inter-communicators are primarily intended to support client-server
paradigms and dynamic job layouts that are outside the scope of GASNet. We can
safely ignore all the inter-communicator goop.

* MPI's collectives are all blocking and all have semantics analogous to
IN_MYSYNC/OUT_MYSYNC.

* MPI collectives allow varying contributions from each node and do not require
single-valued arguments. We should consider generalizing UPC/GASNet collectives
similarly.

* MPI reduce and scan operations allow the user to compute multiple,
independent reductions at once using a single call and receive all the results
(the data from each node is reduced element-wise with the corresponding
elements on the other nodes, and the results are returned similarly). This
trivially allows coalescing of messages, and is the only way to get
communication overlap of logically independent MPI collectives with respect to
a given node. GASNet provides non-blocking collectives, loose synchronization
modes and aggregation to expose communication overlap (and is more fully
general because it allows aggregation/overlap with reductions of other types,
collectives of other varieties, collective operations on other teams, and other
arbitrary computation and communication). However, the current UPC interface
does not expose non-blocking operations, which means that from a user level the
MPI reduction interface is currently more expressive. One notable advantage to
the MPI-style reduction interface is that multiple independent reduction
results can be computed using a single call to a user-provided callback - our
current callback interface does not allow this optimization.

* MPI provides the interesting reduction operators MINLOC and MAXLOC, which
compute a minimum/maximum value and the index of the node holding that value
(the operator is carefully defined to break ties in such a way that the
operator is both commutative and associative). It may be nice for UPC to
provide these as built-in operators (of course, a user could write these for
themselves, but they may not realize that's what they want or that it can be
done using a commutative/associative operator).

* MPI provides a reduce-to-all operation that is currently lacking in
UPC/GASNet, and can presumably be implemented more efficiently in some cases
than a reduce followed by a broadcast.

* MPI has a very powerful and expressive interface for efficiently creating and
manipulating teams (groups/communicators). We should definitely use it as a
model when defining our team constructors and ensure we can provide an equal
level of generality. However, we should probably leave out the silly attribute
caching in MPI communicators - it adds no useful functionality (all of the
provided rationales for it can trivially be addressed by adding a level of
indirection), so it just needlessly adds complexity. The MPI topology
description library is similarly extraneous.

* MPI-2 adds a few new features to the MPI 1.2 collectives library which may be
relevant for us:
  - "in-place" versions of most of the collectives - an MPI_IN_PLACE flag
    generally specifies the source arguments are ignored and the destination
    arguments specify the source and destination for an in-place collective.
    In-place versions are likely to add some copying costs, but may be helpful
    for applications doing collectives over very large data sets (avoid
    doubling the necessary heap space just for communication, especially if
    the collective library can manage internal temporary space to get similar
    performance with much less memory overhead). There may also be some
    benefits in cache/firehose locality using an in-place operation that has a
    smaller memory footprint. We could add a similar flag if we convince UPC
    this is a useful feature.
  - "Exclusive-scan" prefix reduction - a prefix reduction where image 0 gets
    an undefined result and image i gets the result of 0 op 1 op .. op i-1.
    Exclusive scan is more general than inclusive scan and can implement
    inclusive scan with one final local evaluation of the reduction operator
    (the converse is not true for important, non-invertible operators like
    MIN/MAX). We should consider providing both (and implement the inclusive
    one using the exclusive one).

//===========================================================================
Bibliography:

@inproceedings{Karp93,
  author    = {Richard M. Karp and Abhijit Sahay and Eunice E. Santos and Klaus Erik Schauser},
  title     = {Optimal broadcast and summation in the LogP model},
  booktitle = {Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures},
  year      = {1993},
  isbn      = {0-89791-599-2},
  pages     = {142--153},
  location  = {Velen, Germany},
  doi       = {http://doi.acm.org/10.1145/165231.165250},
  publisher = {ACM Press},
}

@inproceedings{Kielmann99,
  author    = {Thilo Kielmann and Rutger F. H. Hofman and Henri E. Bal and Aske Plaat and Raoul A. F. Bhoedjang},
  title     = {MagPIe: MPI's collective communication operations for clustered wide area systems},
  booktitle = {Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming},
  year      = {1999},
  isbn      = {1-58113-100-3},
  pages     = {131--140},
  location  = {Atlanta, Georgia, United States},
  doi       = {http://doi.acm.org/10.1145/301104.301116},
  publisher = {ACM Press},
}

@article{Mellor-Crummey91,
  author    = {John M. Mellor-Crummey and Michael L. Scott},
  title     = {Algorithms for scalable synchronization on shared-memory multiprocessors},
  journal   = {ACM Trans. Comput. Syst.},
  volume    = {9},
  number    = {1},
  year      = {1991},
  issn      = {0734-2071},
  pages     = {21--65},
  doi       = {http://doi.acm.org/10.1145/103727.103729},
  publisher = {ACM Press},
}

@article{Santos02,
  author    = {Eunice E. Santos},
  title     = {Optimal and efficient algorithms for summing and prefix summing on parallel machines},
  journal   = {J. Parallel Distrib. Comput.},
  volume    = {62},
  number    = {4},
  year      = {2002},
  issn      = {0743-7315},
  pages     = {517--543},
  doi       = {http://dx.doi.org/10.1006/jpdc.2000.1698},
  publisher = {Academic Press, Inc.},
}

@article{Dusseau96,
  author    = {Andrea C. Dusseau and David E. Culler and Klaus Erik Schauser and Richard P. Martin},
  title     = {Fast Parallel Sorting Under Log{P}: Experience with the {CM}-5},
  journal   = {IEEE Transactions on Parallel and Distributed Systems},
  volume    = {7},
  number    = {8},
  pages     = {791--805},
  year      = {1996},
  url       = {citeseer.ist.psu.edu/dusseau96fast.html},
}
@inproceedings{Alexandrov95,
  author    = {Albert Alexandrov and Mihai F. Ionescu and Klaus E. Schauser and Chris Scheiman},
  title     = {LogGP: incorporating long messages into the LogP model -- one step closer towards a realistic model for parallel computation},
  booktitle = {Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures},
  year      = {1995},
  isbn      = {0-89791-717-0},
  pages     = {95--105},
  location  = {Santa Barbara, California, United States},
  doi       = {http://doi.acm.org/10.1145/215399.215427},
  publisher = {ACM Press},
  note      = {See also extended version: UCSB TRCS95-09},
}

Verriet, "Scheduling tree-structured programs in the LogP model"