Task-Oriented Parallel C (TOP-C) This is version 0.9 of Task-Oriented Parallel C (TOP-C). There are still unpolished pieces of code (such as the Makefile), but it has been tested and works in my environment on several workstation architectures. It provides task-oriented parallelism to the end user. The same application code runs on top of (a) a shared memory library using threads (currently supporting SGI, HP Convex, and SUN Solaris), (b) a distributed memory library using MPI, or (c) a single-process sequential library (useful for development and debugging). A loader version is supplied for dynamically adding new processors in case b. The directory, mpinu, contains a small subset implementation of MPI suitable for supporting this distribution. Several large computations have been completed using this model. (See http://www.ccs.neu.edu/home/gene/par-tools.html) To unpack: gzip -dc top-c.tar.gz | tar xvf - Don't forget to modify the provided Makefile and procgroup file. Please write the author, Gene Cooperman (gene@ccs.neu.edu), if there are any questions. BASIC CONCEPT: The system is based on two key concepts: tasks in the context of a master/slave environment; and global, shared environment with lazy updates. Task descriptions (inputs) are generated on the master, and assigned to a slave. The slave executes the task and returns the result to the master. The master may update its own private data based on the result, or it may update data on all processes. Such global updates take place on each slave after the slave completes its current task. A SPMD (Single Program Multiple Data) style of programming is encouraged. In both shared and distributed memory architectures, one must worry about the order of reads and writes as multiple slaves autonomously update data. The utilities below are meant to ease that chore, by supporting the ease of the SPMD programming style, while still maintaining good efficiency and generality for a broad range of applications. The software can easily be ported to a variety of architectures. ================================================================== OVERVIEW: (ON MASTER) task (ON A SLAVE) result (ON MASTER) get_task() --------> do_task(task) ---------> get_task_result(result, task) if (action == UPDATE) (ON ALL PROCESSES) -----------------------> update_environment(result, task) When there is only one slave, the effect is the same as the following sequential C code: { void *task; while ((task = get_task(), NOTASK != task)) get_task_result(do_task(), task); } For more flexibility, get_task_result returns a value (an action) that is used to provide further options. The most common option is to specify that master_slave should call a routine, update_environment(result, task), which is executed ON MASTER AND ALL SLAVES (master alone, if a shared memory architecture). The full set of return values and options for get_task_result() is specified as part of the description of get_task_result() in the section on DETAILS. Finally, a task is a user-defined data structure. It must be cast to the type (void *) before being passed to the system. In a shared-memory architecture, it should contain either data fitting inside a machine pointer, such as "int", or else it should point to the heap, where the shared memory lies (not to the stack). In a distributed memory architecture, the entire data of the task must be passed as a message. To do so, it must typically be stored contiguously in memory. If a heterogeneous architecture is used, there is an issue of converting data formats. For a further overview, see the section, TRACING AND OTHER DEBUGGING TECHNIQUES, for sequential code that is equivalent to the operation of master_slave when there is only one slave. ================================================================== DETAILS: void init_master_slave( req_num_slaves ) int num_slaves; Must be called before any other routines, below. The system will allocate at most req_num_slaves. If it's 0 or less, the system will allocate an "efficient" number, usually the same as the number of free processors. #include "mas-slave.h" void master_slave(get_task, do_task, get_task_result, update_environment) void *(*get_task)(), *(*do_task)(), (*update_environment)(); int (*get_task_result)(); MSG(ptr, size) Specifies arbitrary user data structure; get_task()/do_task() should return such a data structure. In do_task(), ptr must not reference a buffer on stack; Declare buffer static to avoid this. The user must have define the following four functions: void *get_task( void ) acts on master; returns an arbitrary user data structure, TASK. TASK should be a user buffer specified by MSG(ptr, size). It should return NOTASK, when there are no more tasks, and it should continue to return NOTASK if invoked again. void *do_task( void *task ) acts on slave; operates on the result of get_task(); returns another arbitrary user data structure, RESULT. RESULT must be a static or global user buffer specified by MSG(ptr, size). int get_task_result( void *result ) acts on master; operates on the result of get_task_result(); returns an ACTION that determines what happens to the task next. The pointer argument, get_task_result, may be NULL, in which case, the action is always NO_ACTION. CONTINUATION() is a parametrized action that may be returned. is_up_to_date() may usefully be called by get_task_result() (see below) Actions returned by get_task_result: UPDATE: C constant, invoking update_environment( void *task) (see below) also updates bookkeeping for sake of is_up_to_date() (see below) NO_ACTION: C constant, causing no further action for task REDO: Invoke do_task() on result of get_task() again, on original slave; useful if global variables have changed since original do_task; see is_up_to_date(), below CONTINUATION( void *task ): Like REDO, but if the result of CONTINUATION( newtask ) is returned by get_task_result(), then do_task( newtask ) is called on the original slave. useful if only the master can decide whether task is complete. Note that any pending calls to update_environment() will have occurred on the slave before the new call to do_task(). void update_environment( void *result, void *task ) acts on master and all slaves; operates on the result of do_task(), and the original task returned by get_task(); called only if get_task_result() returned UPDATE; useful for updating non-shared, global variables in all processes; The pointer argument, update_environment, may be NULL, and a modification of the data structure will still be recorded for is_up_to_date() (below). In a shared memory environment, only the master does update_environment(). It is the user's responsibility to ensure that the slaves read consistent data, when the master modifies it. is_up_to_date( void ) returns TRUE or FALSE; returns TRUE if and only if get_task_result() has not returned the result UPDATE (invoking update_environment) between the time when get_task() was originally called on the current task, and the time when the corresponding get_task_result() was called. Typical usage: int get_task_result( result ) void *result; { if (is_up_to_date()) return NO_ACTION; else return REDO; } get_last_source( void ) returns unique id, as a C int, for the last process from which a message was received. This is trivial, when called on a slave, but it is useful on the master. is_master( void ) returns true if and only if executed on the master. This is primarily useful for efficiency. One can choose to save time or space by computing a certain data structure only on the master. ms_barrier() - called on master; only useful for shared memory applications. waits until current slave tasks complete, and then returns to finish current master routine. Provides a critical section when threads are guaranteed not to be running. MSG( void *buf, int size ) - returned by get_task and do_task; Currently get_task and do_task may also return an int cast to (void *), instead of MSG(), but this is experimental, and currently discouraged. declare_recv_buf(char *buf, int size) - useful only for distributed memory. If message size of your application is too large, error message will ask you to declare your own receive buffer on master and/or slave. ================================================================== COURTESY TO OTHERS It is easy for parallel jobs to demand excessive resources. Some simple UNIX system calls prevent this. #include alarm(int SECS) - kill job after SECS. Place in main() or else place in get_task() and do_task(). In the latter case, the alarm resets to SECS after each task, so that it will allow unlimited tasks, but kill hung jobs. #include #include setpriority(PRIO_PROCESS,getpid(),prio) - prio = 10 still gives you some CPU time. prio = 19 means that any job of higher priority always runs before you. Place in main() #include struct rlimit rlp; rlp.rlim_max = rlp.rlim_cur = SIZE; setrlimit(RLIMIT_RSS, &rlp) - SIZE is RAM limit (bytes). If your system has significant paging, the system will prefer to keep your process from growing beyond SIZE bytes of resident RAM. Even if you set nice to priority 20, this is still important. Otherwise you may cause someone to page out much of his or her job in your favor during one of your infrequent quantum slices of CPU time. Place in main() ================================================================== TRACING AND OTHER DEBUGGING TECHNIQUES First, link your application code with mas-slave-seq.c, and make sure that your application works correctly sequentially. Only after you have confidence in the correctness of the sequential code, should you begin to debug the parallel version. Code development should initially be done on one master and one slave, thus ensuring that one debugs essentially sequential code. If possible, the master and slave should be the same CPU, so as to minimize network delays and ill effects on other users. When that code works correctly, it can then be tested on two slaves, and finally all possible slaves. Beyond that, one has two alternative debugging strategies. One is to trace messages between master and slaves (for any number of slaves). In addition, to tracing, the global variables, task_size and result_size, are available to confirm the size of the respective messages received. A second strategy is to debug in the context of a single slave. In this case, the code is "almost" sequential. If the bug is still present, once can test further by replacing the call to master_slave() by: /* FULL CASE */ { void *task, result, cont_task = NULL; int action; while ((task = (cont_task == NULL ? get_task() : cont_task), NOTASK != task)) { cont_task = NULL; result = do_task(task); action = get_task_result(result, task); switch (action) { case NO_ACTION: ; case UPDATE: if (update_environment != NULL) update_environment(result, task); case REDO: error("REDO shouldn't happen in sequential case\n"); case CONTINUE: cont_task = continuation_ptr; default: error("illegal return value of get_task_result\n"); } } In this context, one is debugging a completely sequential program. The following variables are provided to trace messages between master and slave. int master_slave_trace; Should have value TRUE, FALSE, or NOSTATS (default is FALSE); If set to TRUE, will provide a trace of communication between master and slave. If FALSE, provides only summary statistics at end. If NOSTATS, no extra printing at all. The statistics refer only to time in master_slave(). Typically, about 4 milliseconds will also be spent in init_master_slave(), and about 20 milliseconds in master_slave_stats(); void (*master_slave_trace_cmd)(int dest, void *msg); Global pointer to function. User can set it to his or her own trace function. The default function does: printf("%x (%d)",msg,msg); For example, if you pass integers, use: printf("%d",*(int *)msg); void master_slave_stats(); Prints cumulative statistics from all invocations of master_slave(); In MPI version, this is required to terminate process. Note that tracing takes place entirely on the master. So, any print statements produced by a slave may be asynchronous with the trace printing and other printing on the master. If you find the master hanging, waiting for a slave message, then the probably cause is that do_task() is doing something bad (hanging, infinite loop, bus/segmentation error, etc.). If you are really desperate, note that gdb (the GNU C debugger) includes an "attach" command that allows you to attach and debug a separate running process. This lets you debug a running slave, if it is running o the same processor. Finally, there is a common TOP-C performance bug related to SMD operating system schedulers that can occur under certain conditions. The routine wait_a_little() in threads.c for shared memory machines causes a thread to sleep a little while waiting for the next task. While the requested sleep time is 10 milliseconds, an operating system scheduler often sets the quantum for a time sharing slice at 100 milliseconds. If other processes are competing for the processor, and if your task completes in much less than 100 milliseconds, then you are effectively giving away the processor during the rest of your quantum. So, it a task requires 10 milliseconds, you then give away the processor for the next 100 milliseconds, resulting in you getting 10 % of the time instead of your fair share of 50 % of the time. Other useful techniques that may improve performance of certain applications are: (1) set up multiple slaves on each processor (if slave processors are sometimes idle) (2) re-write the code to bundle a set of tasks as a single task (to improve the granularity of your parallelism) ================================================================== RAW_MASTER_SLAVE There are instances when tasks are most naturally generated deep inside nested loops. In such circumstances, it may be difficult to re-write the code to create a function get_code(), since that would require turning the loops inside out. (If you don't know what this refers to, then you don't need raw_master_slave(). When you need raw_master_slave(), it will become obvious what this refers to. void raw_master_slave(parallel_loop, do_task, get_task_result, update_environment) void (*parallel_loop)(), *(*do_task)(), (*update_environment)(); int (*get_task_result)(); This behaves like master_slave, except that parallel_loop() should be a routine whose code contains the nested loops referred to above. Inside parallel_loop(), whenever the user wishes to generate a new task for solution by the slave, he should invoke: void set_task( void *task ) Invoked from inside user function, parallel_loop(); The argument, TASK, corresponds to what would be returned by get_task() in the routine master_slave(). TASK will be executed be processed by do_task() and its siblings, just as in master_slave()). There can be multiple occrences of set_task() in parallel_loop(). ================================================================== LOADER There are also situations in which one wishes to add new slaves during a currently running session (for advanced users, only). This is made possible for homogeneous architectures by invoking the loader. This was written jointly by Gene Cooperman and Victor Grinberg. The .../top-c/loader/config.h file should be modified with the network address of the running job: MASTERNAME and MASTERPORT (and currently change them in C files or use command line args) Other parameters can be left as default. One then recompiles: (cd .../top-c/loader; make; cd .../top-c; make APPLICATION) For an example APPLICATION do: make parfactor-loader Finally, one runs the application: ./a.out -masterport PORTNUMBER and on a remote processor of the same architecture, one can download the binary, .../top-c/loader, and execute: ./loader -p PORTNUMBER The loader application will continue to run until killed, so that the machine will continue to act as a CPU server for further jobs. PORTNUMBER will default to MASTERPORT. The provision for adding PORTNUMBER to the command line is provided only in case the default port is in use. ================================================================== IMPROVING PERFORMANCE If your application runs too slowly due to excessive time for communication, consider running multiple slave processes on a single processor. This allows one process to continue computing while another is communicating or idle waiting for a new task to be generated by the master. If communication overhead or idle time is still to high, consider if it is possible to increase the granularily of your task -- perhaps by amalgamating several consecutive tasks as a single larger task to be performed by a single process. Finally, if you have a more efficient version of MPI (perhaps a vendor version tuned to your hardware), consider replacing LIBMPI in .../top-c/Makefile by your vendor's limbpi.a, and delete or modify the the LIBMPI target in the Makefile. ================================================================== ACKNOWLEDGEMENTS The author wishes to thank the Mariner Project at Boston University for the use of facilities which helped in the development of this software. An earlier, experimental version of mpinu was written by Markos Kyzas. The loader module is joint with Victor Grinberg.