This file contains miscellaneous platform-specific and/or site-specific
notes about installing and using Titanium on various systems. Note that
many supported platforms are not represented in this list - it only
includes special instructions for systems not covered by the general
INSTALL document. For general usage notes regarding various backends,
tracing functionality, native code use, etc., please see the INSTALL
file and the Titanium documentation page here:
http://titanium.cs.berkeley.edu/software.html

Itanium-2/x86/Opteron
=====================
The recommended backend C compiler is Intel C (gives best performance).
Other supported backend C compilers include gcc, Portland Group C, and
PathScale C. The recommended C++ compiler is GNU g++ or Intel C++.

PowerPC (including IBM SP)
==========================
The recommended backend C compiler is VisualAge C (xlc) (gives best
performance). Other supported backend C compilers include gcc.
The recommended C++ compiler is GNU g++ (gives best compile-time
performance); VisualAge C++ (xlC) is also supported.

MIPS (including SGI Origin)
===========================
The recommended backend C compiler is MIPSpro C (cc) (gives best
performance). Other supported backend C compilers include gcc.
The recommended C++ compiler is GNU g++.

IBM SP / AIX
============
The recommended parallel backends are: gasnet-lapi-smp, gasnet-lapi-uni
Other backends available are: sequential, smp, gasnet-mpi-smp,
gasnet-mpi-uni, mpi-cluster-smp, mpi-cluster-uniprocess,
udp-cluster-smp, udp-cluster-uniprocess

If you have trouble building the garbage collector, try configuring
with --disable-shared

There are reports that some SP configurations enable checkpointing by
default for large jobs, and AIX's checkpointing is incompatible with
pthread-based IPC such as that used by the gasnet-lapi-* backends (and
LAPI itself). The workaround is to disable system checkpointing by
adding the following line to the LoadLeveler commands at the top of
your batch script:

  #@ checkpoint = no

LAPI performance is greatly affected by the mechanism used to retrieve
incoming communication (POLLING or INTERRUPT). POLLING seems to give
the best overall performance and is currently the default. You can
select the mode by setting the GASNET_LAPI_MODE environment variable
to POLLING or INTERRUPT before running your application.

When using the LAPI-based backends, you need to indicate the use of
LAPI in your poe command and/or job script to ensure correct startup.
You do this with the poe argument "-msg_api lapi" in your run command,
or alternatively set MP_MSG_API=lapi in your script to get the same
effect. The recommended poe mantra (used by tcrun) is:

  poe <program> <args> -nodes <N> -tasks_per_node <T> \
      -euilib us -msg_api lapi \
      -rmpool 1 -retry 1 -retrycount 1000

Note that because the SP is a cluster of SMPs, there is hierarchical
parallelism. There are several ways to take advantage of this in
Titanium: you can either run with the gasnet-lapi-uni backend (which
gives one Titanium thread per UNIX process) and specify
-tasks_per_node > 1 to get multiple processes per node, or
alternatively use the gasnet-lapi-smp backend (which allows multiple
Titanium threads per process, aka task), specify -tasks_per_node 1,
and set TI_THREADS to specify the thread layout within each process,
for example:

  export TI_THREADS="8 8 8 8" ; poe myprog -nodes 4 -tasks_per_node 1 ...
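For batch jobs, the pieces above combine naturally into a LoadLeveler
script. The following is a minimal sketch only, assuming a stock
LoadLeveler/poe installation - the node count, thread layout, and
program name (./myprog) are placeholder values, and most sites require
additional #@ keywords (class, wall_clock_limit, network settings,
etc.):

  #!/bin/sh
  #@ job_type       = parallel
  #@ node           = 4
  #@ tasks_per_node = 1
  #@ checkpoint     = no
  #@ queue
  # gasnet-lapi-smp layout: 8 Titanium threads in each of the 4 tasks
  export TI_THREADS="8 8 8 8"
  export MP_MSG_API=lapi           # same effect as "poe ... -msg_api lapi"
  export GASNET_LAPI_MODE=POLLING  # the default; INTERRUPT is the alternative
  poe ./myprog

Under LoadLeveler, poe takes the node and task counts from the job's
allocation, so the -nodes/-tasks_per_node arguments shown earlier are
only needed for interactive runs.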
On SPs using recent versions of MPI, you can now run a Titanium
application with mixed LAPI and MPI communication using the full
number of processes per node. All you need to do to use this
functionality is pass the poe argument "-msg_api mpi_lapi" in your run
command - this tells MPI and LAPI to coexist peacefully on the network
in your job. If you're using batch scripts, you can alternatively set
MP_MSG_API=mpi_lapi in your script to get the same effect.

If for some reason you decide to use the mpi-* backends on the IBM SP,
you should definitely consider setting the environment variable:

  setenv MP_SINGLE_THREAD yes

This tells AIX to use the higher-performance, non-thread-safe version
of MPI, and improves small-message latency in mpi-* Titanium
applications by about 20%. By default, AIX uses the thread-safe
version of MPI, which includes lots of extra locking overhead that you
probably do not need - Titanium's mpi-* backends perform their own
thread-safety locking above the MPI level, so Titanium code never
needs a thread-safe MPI (no matter how many Titanium pthreads you
have). The only corner case under which you might need thread-safe MPI
is if you're calling thread-safe MPI-enabled libraries (which notably
does *not* include FFTW-MPI, which is not thread-safe).

AlphaServer / Tru64
===================
The recommended backend C compiler is Compaq C (cc) (gives best
performance). Other supported backend C compilers include gcc.
The recommended C++ compiler is GNU g++ 3.3+.

The recommended parallel backends are: gasnet-elan-smp, gasnet-elan-uni
Other backends available are: sequential, smp, gasnet-mpi-smp,
gasnet-mpi-uni, mpi-cluster-smp, mpi-cluster-uniprocess,
udp-cluster-smp, udp-cluster-uniprocess

The PAPI support on AlphaServer / Tru64 is very poor and incurs a high
runtime overhead. PAPI on PSC's lemieux is only supported in jobs
running via the batch system, and Titanium/PAPI users need the
following in their batch script or interactive batch session:

  # set up the counters
  setenv DCPID "-dyn -slot cycles -slot pm -slot bmiss+retires"
  # start the PAPI server
  prun -N $RMS_NODES -n $RMS_NODES dcpi_start &
  # run Titanium/PAPI app as usual
  prun -N $RMS_NODES -n $RMS_NODES ./titanium-foo
  # if you run more Titanium/PAPI apps here, you need to restart the
  # PAPI server for each one

Cray X-1
========
The recommended parallel backend is: gasnet-shmem-uni
Also available: sequential, smp, mpi-cluster-smp,
mpi-cluster-uniprocess, udp-cluster-smp, udp-cluster-uniprocess

Note the X1 has slow scalar processors, so very large compilations
might need to run via the batch system to avoid timing out.
tcbuild --verbose provides detailed info about compilation progress.
Cross-compilation is highly recommended on this system to avoid the
long compile times for building the compiler and applications. If you
have login access to the X1's Solaris/Linux compile server, see the
cross-compilation instructions in INSTALL.

Debugging runs can be done interactively with:

  aprun -n 4 prog args..

but production runs will need to use the batch system (a minimal
script sketch appears below):
http://www.ccs.ornl.gov/Phoenix/PBS.html

Also note there is currently no GC on Unicos (and probably won't be
any time soon).
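As a starting point for such production runs, here is a minimal PBS
script sketch - the job name, time limit, and process count are
hypothetical placeholders, and X1 sites typically require additional
site-specific #PBS resource directives (see the PBS page above):

  #!/bin/sh
  #PBS -N titanium-foo
  #PBS -l walltime=1:00:00
  cd $PBS_O_WORKDIR     # PBS jobs start in $HOME by default
  aprun -n 4 ./prog args..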
Here are some notes on Cray C compile options to monitor/control
vectorization. You can use them with tcbuild --keep --cc-flags "whatever.."

To generate an optimization report (in tc-gen*/*.lst):
  -h list=mi          - then grep VECTOR |wc in that file
  -h report=imsvf

Other interesting Cray C compiler options:
  -h tolerant         - disables ANSI aliasing rules
  -h restrict=(a,f,t) - states that all pointers, argument pointers, or
                        this pointers are restrict pointers (no access
                        via aliases)
  -h aggress          - internal compiler tables are expanded to
                        accommodate larger loop bodies (longer compile
                        time)
  -h msp / -h ssp     - change execution mode
  -h ivdep            - ignore all loop dependencies (per-loop
                        equivalent: #pragma ivdep)
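For example, to build with the reporting options above and then count
the loops the compiler reports as vectorized, something like the
following should work (myapp.ti is a placeholder source file):

  # keep the generated C and pass the report flags to Cray cc
  tcbuild --keep --cc-flags "-h list=mi -h report=imsvf" myapp.ti
  # count VECTOR messages in the listing files
  grep VECTOR tc-gen*/*.lst | wc -l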