Nicol, David M.
Overview
Works:  121 works in 268 publications in 1 language and 4,808 library holdings 

Genres:  Conference papers and proceedings 
Roles:  Author, Editor 
Classifications:  LB2331, 378.12 
Most widely held works by David M Nicol
McKeachie's teaching tips : strategies, research, and theory for college and university teachers by
Marilla D Svinicki (Book)
4 editions published between 2010 and 2014 in English and held by 390 WorldCat member libraries worldwide
This indispensable handbook provides helpful strategies for dealing with both the everyday challenges of university teaching and those that arise in efforts to maximize learning for every student. The suggested strategies are supported by research and adaptable to specific classroom situations. Rather than suggest a "set of recipes" to be followed mechanically, the book gives instructors the tools they need to deal with the ever-changing dynamics of teaching and learning
Schedules for mapping irregular parallel computations by
David M Nicol
4 editions published in 1987 in English and held by 291 WorldCat member libraries worldwide
An optimal repartitioning decision policy by
David M Nicol
4 editions published in 1986 in English and held by 290 WorldCat member libraries worldwide
Automated parallelization of discrete state-space generation by
David M Nicol (Book)
7 editions published between 1997 and 2000 in English and held by 140 WorldCat member libraries worldwide
We consider the problem of generating a large state-space in a distributed fashion. Unlike previously proposed solutions that partition the set of reachable states according to a hashing function provided by the user, we explore heuristic methods that completely automate the process. The first step is an initial random walk through the state space to initialize a search tree, duplicated in each processor. Then, the reachability graph is built in a distributed way, using the search tree to assign each newly found state to classes assigned to the available processors. Furthermore, we explore two remapping criteria that attempt to balance memory usage or future workload, respectively. We show how the cost of computing the global snapshot required for remapping will scale up for system sizes in the foreseeable future. An extensive set of results is presented to support our conclusions that remapping is extremely beneficial
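The automated-assignment idea in this abstract can be sketched as follows. This is a minimal, sequential illustration under invented details: the "search tree" is reduced to sorted key boundaries derived from a sample walk, states are small integers, and the paper's distributed execution and remapping are not modelled.

```python
import bisect
from collections import deque

def explore(initial, successors, num_procs, sample):
    """State-space generation with automated state-to-processor
    assignment (sequential sketch). `sample` stands in for states seen
    on the initial random walk; its sorted order yields per-processor
    class boundaries, so each new state is assigned by binary search
    rather than by a user-supplied hash function."""
    keys = sorted(sample)
    # One boundary per processor: cut the sampled keys into equal runs.
    step = max(1, len(keys) // num_procs)
    bounds = keys[step::step][:num_procs - 1]
    owner = lambda s: bisect.bisect_left(bounds, s)

    classes = [set() for _ in range(num_procs)]
    frontier = deque([initial])
    classes[owner(initial)].add(initial)
    while frontier:
        s = frontier.popleft()
        for t in successors(s):
            c = classes[owner(t)]
            if t not in c:          # owner() is deterministic, so the
                c.add(t)            # per-class sets act as a global
                frontier.append(t)  # visited set
    return classes
```

The classes partition the reachable set; balancing their sizes is exactly what the paper's remapping criteria address.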
Parallel algorithms for simulating continuous time Markov chains by
David M Nicol (Book)
5 editions published in 1992 in English and held by 115 WorldCat member libraries worldwide
Abstract: "We have previously shown that the mathematical technique of uniformization can serve as the basis of synchronization for the parallel simulation of continuous-time Markov chains. This paper reviews the basic method and compares five different methods based on uniformization, evaluating their strengths and weaknesses as a function of problem characteristics. The methods vary in their use of optimism, logical aggregation, communication management, and adaptivity. Performance evaluation is conducted on the Intel Touchstone Delta multiprocessor, using up to 256 processors."
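The uniformization technique this abstract builds on can be sketched in a few lines. This is a minimal sequential illustration, not one of the paper's five parallel methods: the chain's rate function `rates` and the uniformization rate `Lam` (an upper bound on every state's total exit rate) are assumptions of the sketch.

```python
import random

def uniformized_step(state, rates, Lam, rng):
    """One step of the uniformized DTMC: jump to j with probability
    q(state, j) / Lam, else take a self-loop (a 'pseudo-event')."""
    u = rng.random() * Lam
    acc = 0.0
    for j, q in rates(state):
        acc += q
        if u < acc:
            return j
    return state  # self-loop

def simulate(state, rates, Lam, t_end, rng):
    """Simulate the CTMC on [0, t_end): event epochs form a Poisson
    process of rate Lam, shared by real jumps and pseudo-events --
    this fixed global event rate is what parallel methods exploit."""
    t = rng.expovariate(Lam)
    while t < t_end:
        state = uniformized_step(state, rates, Lam, rng)
        t += rng.expovariate(Lam)
    return state
```

A useful property: the uniformized discrete chain has the same stationary distribution as the original CTMC, so long-run statistics can be read off the embedded steps.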
Advanced techniques in reliability model representation and solution by
Daniel L Palumbo (Book)
5 editions published in 1992 in English and held by 106 WorldCat member libraries worldwide
Proceedings : Workshop on Principles of Advanced and Distributed Simulation (PADS 2005), Monterey, California, June 1-3, 2005 by
Workshop on Principles of Advanced and Distributed Simulation (Book)
8 editions published in 2005 in English and held by 97 WorldCat member libraries worldwide
User's guide to the Reliability Estimation System Testbed (REST) by
David M Nicol (Book)
4 editions published in 1992 in English and Undetermined and held by 96 WorldCat member libraries worldwide
Distributed simulation, 1988 : proceedings of the SCS Multiconference on Distributed Simulation, 3-5 February, 1988, San Diego,
California by
SCS Multiconference on Distributed Simulation (Book)
11 editions published between 1985 and 1990 in English and held by 92 WorldCat member libraries worldwide
Advances in parallel and distributed simulation : proceedings of the SCS Multiconference on Advances in Parallel and Distributed
Simulation, 23-25 January 1991, Anaheim, California by SCS Multiconference on Advances in Parallel and Distributed Simulation (Book)
7 editions published between 1990 and 1991 in English and held by 89 WorldCat member libraries worldwide
Optimistic barrier synchronization by
David M Nicol (Book)
5 editions published in 1992 in English and held by 89 WorldCat member libraries worldwide
Abstract: "Barrier synchronization is a fundamental operation in parallel computation. In many contexts, at the point a processor enters a barrier it knows that it has already processed all work required of it prior to the synchronization. This paper treats the alternative case, when a processor cannot enter a barrier with the assurance that it has already performed all necessary pre-synchronization computation. The problem arises when the number of pre-synchronization messages to be received by a processor is unknown, for example, in a parallel discrete simulation or any other computation that is largely driven by an unpredictable exchange of messages. We describe an optimistic O(log² P) barrier algorithm for such problems, study its performance on a large-scale parallel system, and consider extensions to general associative reductions, as well as associative parallel prefix computations."
A sweep algorithm for massively parallel simulation of circuit-switched networks by
Bruno Gaujal (Book)
4 editions published in 1992 in English and held by 87 WorldCat member libraries worldwide
A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described and corresponding experiments on a 16384-processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described and corresponding experiments on an Intel iPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude
Massively parallel algorithms for trace-driven cache simulations by
David M Nicol (Book)
3 editions published in 1991 in English and held by 86 WorldCat member libraries worldwide
Abstract: "Trace-driven cache simulation is central to computer design. A trace is a very long sequence, x₁, ..., x_N, of references to lines (contiguous locations) from main memory. At the t-th instant, reference x_t is hashed into a set of cache locations, the contents of which are then compared with x_t. If at the t-th instant x_t is not present in the cache, then it is said to be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory line, and making x_t present for the (t + 1)-st instant. The problem of parallel simulation of a subtrace of N references directed to a C-line cache set is considered, with the aim of determining which references are misses and related statistics. A simulation method is presented for the Least-Recently-Used (LRU) policy, which regardless of the set size C runs in time O(log N) using N processors on the exclusive read, exclusive write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. We present timings of the second algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference-based line replacement policies is considered, which includes LRU as well as the Least-Frequently-Used and Random replacement policies. A simulation method is presented for any such policy that on any trace of length N directed to a C-line set runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well-suited for SIMD implementations."
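The sequential computation being parallelized here can be stated compactly as a reference LRU set simulator. This sketch relies on the stack-distance property the parallel algorithms exploit (a reference hits iff fewer than C distinct lines intervene since its last use); it is not the O(log N) parallel method itself.

```python
from collections import OrderedDict

def lru_misses(trace, C):
    """Simulate one C-line LRU cache set; return the miss indices.
    A reference hits iff fewer than C distinct lines were touched
    since its previous use (its LRU stack distance is < C)."""
    cache = OrderedDict()  # keys = resident lines, in LRU order
    misses = []
    for t, x in enumerate(trace):
        if x in cache:
            cache.move_to_end(x)  # refresh recency on a hit
        else:
            misses.append(t)
            cache[x] = None
            if len(cache) > C:
                cache.popitem(last=False)  # evict least-recently-used
    return misses
```

For example, the trace "abab" on a 2-line set misses only on the first two references, while "abcab" misses on every reference.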
Rectilinear partitioning of irregular data parallel computations by
David M Nicol (Book)
5 editions published in 1991 in English and held by 86 WorldCat member libraries worldwide
Abstract: "This paper describes new mapping algorithms for domain-oriented data-parallel computations, where the workload is distributed irregularly throughout the domain, but exhibits localized communication patterns. We consider the problem of partitioning the domain for parallel processing in such a way that the workload on the most heavily loaded processor is minimized, subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are useful on architectures that have a fast local mesh network and a relatively slower global network; these partitions heuristically attempt to maximize the fraction of communication carried by the local network. This paper provides an improved algorithm for finding the optimal partition in one dimension, new algorithms for partitioning in two dimensions, and shows that optimal partitioning in three dimensions is NP-complete. We discuss our application of these algorithms to real problems."
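In one dimension, the problem reduces to cutting an array of workloads into at most P contiguous pieces while minimizing the heaviest piece. A standard way to solve it is binary search on the bottleneck value with a greedy feasibility check; this is a sketch of the problem setting, not necessarily the paper's improved algorithm.

```python
def min_bottleneck(weights, P):
    """Smallest achievable maximum load when the integer `weights`
    are cut into at most P contiguous pieces (1-D rectilinear
    partition)."""
    def pieces_needed(cap):
        # Greedily pack weights left to right without exceeding cap.
        count, load = 1, 0
        for w in weights:
            if w > cap:
                return float("inf")  # cap infeasible for this weight
            if load + w > cap:
                count, load = count + 1, w
            else:
                load += w
        return count

    lo, hi = max(weights), sum(weights)
    while lo < hi:                   # binary search on the answer
        mid = (lo + hi) // 2
        if pieces_needed(mid) <= P:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For instance, `[7, 2, 5, 10, 8]` split into 2 contiguous pieces has optimal bottleneck 18 (pieces `[7, 2, 5]`... wait, `[7, 2, 5]` sums to 14 and `[10, 8]` to 18).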
Inflated speedups in parallel simulations via malloc() by
David M Nicol (Book)
4 editions published in 1990 in English and held by 85 WorldCat member libraries worldwide
Discrete-event simulation programs make heavy use of dynamic memory allocation in order to support simulation's very dynamic space requirements. When programming in C one is likely to use the malloc() routine. However, a parallel simulation which uses the standard Unix System V malloc() implementation may achieve an overly optimistic speedup, possibly superlinear. An alternate implementation provided on some (but not all) systems can avoid the speedup anomaly, but at the price of significantly reduced available free space. This is especially severe on most parallel architectures, which tend not to support virtual memory. This paper illustrates the problem, then shows how a simply implemented user-constructed interface to malloc() can both avoid artificially inflated speedups, and make efficient use of the dynamic memory space. The interface simply caches blocks on the basis of their size. We demonstrate the problem empirically, and show the effectiveness of our solution both empirically and analytically
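The size-based caching policy described above can be modelled in a few lines. This is an illustrative Python model only: the actual interface is C code wrapping malloc(), and the names `SizeCachingAllocator` and `backing_allocs` are invented for the sketch (the counter stands in for calls that fall through to the real allocator).

```python
from collections import defaultdict

class SizeCachingAllocator:
    """Model of a malloc() wrapper that caches freed blocks by size:
    free() banks the block in a per-size list, and a later alloc()
    of the same size reuses it instead of calling the underlying
    allocator."""
    def __init__(self):
        self.free_lists = defaultdict(list)
        self.backing_allocs = 0  # stand-in for real malloc() calls

    def alloc(self, size):
        bucket = self.free_lists[size]
        if bucket:
            return bucket.pop()   # reuse a cached block of this size
        self.backing_allocs += 1
        return bytearray(size)    # stand-in for malloc(size)

    def free(self, size, block):
        # Cache rather than return to the system allocator.
        self.free_lists[size].append(block)
```

Because a freed block is reused only for an identical request size, the wrapper's cost is uniform across processors, which is what removes the timing anomaly the abstract describes.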
Optimal processor assignment for pipeline computations (Book)
4 editions published in 1991 in English and held by 85 WorldCat member libraries worldwide
The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives interest us: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, we assume that a large number of processors are to be assigned to a relatively small number of tasks. In this paper we develop efficient assignment algorithms for different classes of task structures. For a p-processor system and a series-parallel precedence graph with n constituent tasks, we provide an O(np²) algorithm that finds the optimal assignment for a response time optimization problem; we find the assignment optimizing the constrained throughput in O(np² log p) time. Special cases of linear, independent, and tree graphs are also considered. In addition, we also examine more efficient algorithms when certain restrictions are placed on the problem parameters. Our techniques are applied to a task system in computer vision
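A simple greedy version of the throughput side of this problem conveys the flavor. The sketch assumes independent pipeline stages and a measured response-time table `times[i][k]` (stage i on k+1 processors); pipeline throughput is limited by the slowest stage, so each spare processor goes to the current bottleneck. The paper's algorithms find provably optimal assignments, which this greedy rule does not guarantee.

```python
def assign_for_throughput(times, p):
    """Greedy sketch: hand each spare processor to the bottleneck
    stage. `times[i][k]` is stage i's response time on k+1
    processors; returns (allocation, achieved throughput)."""
    n = len(times)
    assert p >= n                 # every stage needs one processor
    alloc = [1] * n
    for _ in range(p - n):
        # Stage that currently limits throughput.
        i = max(range(n), key=lambda s: times[s][alloc[s] - 1])
        if alloc[i] < len(times[i]):
            alloc[i] += 1
    bottleneck = max(times[s][alloc[s] - 1] for s in range(n))
    return alloc, 1.0 / bottleneck
```

With two stages measured at `[4, 2, 1]` and `[6, 3, 2]` and four processors, the greedy rule gives each stage two processors, for a bottleneck stage time of 3.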
Performing out-of-core FFTs on parallel disk systems by
Thomas H Cormen (Book)
4 editions published in 1996 in English and held by 85 WorldCat member libraries worldwide
The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel input/output systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive with, or better than, those for in-core methods even when they run entirely in memory
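The four-step (transpose-based) FFT decomposition that out-of-core methods build on can be sketched as follows. Naive DFT passes are used for clarity on tiny sizes; disk passes, the PDM, and the paper's algorithm itself are not modelled.

```python
import cmath

def dft(v):
    """Naive DFT, adequate for the small pass sizes in this sketch."""
    N = len(v)
    return [sum(v[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def four_step_fft(x, N1, N2):
    """Length N1*N2 DFT via N1 row DFTs of length N2, a twiddle
    scaling, then N2 column DFTs of length N1. Each pass sweeps the
    data in regular chunks, which is what makes disk-resident
    (out-of-core) variants efficient."""
    N = N1 * N2
    # View x as an N1 x N2 array: A[n1][n2] = x[n1 + N1*n2].
    A = [[x[n1 + N1 * n2] for n2 in range(N2)] for n1 in range(N1)]
    B = [dft(row) for row in A]                      # pass 1: rows
    for n1 in range(N1):                             # pass 2: twiddle
        for k2 in range(N2):
            B[n1][k2] *= cmath.exp(-2j * cmath.pi * n1 * k2 / N)
    cols = [dft([B[n1][k2] for n1 in range(N1)])     # pass 3: columns
            for k2 in range(N2)]
    # Output index k = k2 + N2*k1 maps to cols[k2][k1].
    return [cols[k % N2][k // N2] for k in range(N)]
```

The result agrees term by term with a direct length-N DFT of the same input.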
Parametric binary dissection by
Shahid H Bokhari (Book)
3 editions published in 1993 in English and held by 85 WorldCat member libraries worldwide
Abstract: "Binary dissection is widely used to partition nonuniform domains over parallel computers. This algorithm does not consider the perimeter, surface area, or aspect ratio of the regions being generated and can yield decompositions that have poor communication to computation ratio. Parametric Binary Dissection (PBD) is a new algorithm in which each cut is chosen to minimize load + λ × (shape). In a 2 (or 3) dimensional problem, load is the amount of computation to be performed in a subregion and shape could refer to the perimeter (respectively surface) of that subregion. Shape is a measure of communication overhead and the parameter λ permits us to trade off load imbalance against communication overhead. When λ is zero, the algorithm reduces to plain binary dissection. This algorithm can be used to partition graphs embedded in 2- or 3-d. Here load is the number of nodes in a subregion, shape the number of edges that leave that subregion, and λ the ratio of time to communicate over an edge to the time to compute at a node. We present an algorithm that finds the depth-d parametric dissection of an embedded graph with n vertices and e edges in O(max[n log n, de]) time, which is an improvement over the O(dn log n) time of plain binary dissection. We also present parallel versions of this algorithm; the best of these requires O((n/p) log³ p) time on a p-processor hypercube, assuming graphs of bounded degree. We describe how PBD is applied to 3-d unstructured meshes and yields partitions that are better than those obtained by plain dissection. We also discuss its application to the color image quantization problem, in which samples in high-resolution color space are mapped onto a lower resolution space in a way that minimizes the color error."
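A single PBD-style cut can be illustrated on a set of weighted cells. This is a toy sketch: `shape` is approximated by a bounding-box perimeter, only vertical cuts are tried, and each candidate is scored by the larger side's load + λ × shape; the paper's algorithm is far more efficient and general.

```python
def pbd_cut(cells, lam):
    """One parametric cut (sketch). `cells` is a list of (x, y, w)
    triples; try every vertical cut and keep the one minimizing the
    larger side's load + lam * shape, where shape is the side's
    bounding-box perimeter (a proxy for communication surface)."""
    def cost(side):
        if not side:
            return 0.0
        xs = [c[0] for c in side]
        ys = [c[1] for c in side]
        load = sum(c[2] for c in side)
        shape = 2 * ((max(xs) - min(xs) + 1) + (max(ys) - min(ys) + 1))
        return load + lam * shape

    xs = sorted({c[0] for c in cells})
    best = None
    for cut in xs[1:]:                      # cells with x < cut go left
        left = [c for c in cells if c[0] < cut]
        right = [c for c in cells if c[0] >= cut]
        score = max(cost(left), cost(right))
        if best is None or score < best[0]:
            best = (score, left, right)
    return best[1], best[2]
```

With λ = 0 this degenerates to plain load-balancing binary dissection, exactly as the abstract notes; increasing λ penalizes long, thin subregions.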
Binary dissection : variants & applications by
Shahid H Bokhari (Book)
3 editions published in 1997 in English and held by 84 WorldCat member libraries worldwide
Partitioning is an important issue in a variety of applications. Two examples are domain decomposition for parallel computing and color image quantization. In the former we need to partition a computational task over many processors; in the latter we need to partition a high-resolution color space into a small number of representative colors. In both cases, partitioning must be done in a manner that yields good results as defined by an application-specific metric. Binary dissection is a technique that has been widely used to partition nonuniform domains over parallel computers. It proceeds by recursively partitioning the given domain into two parts, such that each part has approximately equal computational load. The basic dissection algorithm does not consider the perimeter, surface area or aspect ratio of the two subregions generated at each step and can thus yield decompositions that have poor communication to computation ratios. We have developed and implemented several variants of the binary dissection approach that attempt to remedy this limitation. These variants are faster than the basic algorithm, can be applied to a variety of problems, and are amenable to parallelization. We first present the Parametric Binary Dissection (PBD) algorithm, which takes into account volume and surface area when partitioning computational domains for use in parallel computing applications. We then consider another variant, the Fast Adaptive Dissection (FAD) algorithm, which provides rapid spatial partitioning for use in color image quantization. We describe the performance of PBD and FAD on representative problems and present ways of parallelizing the PBD algorithm on 2- or 3-d meshes and on hypercubes
Accurate modeling of parallel scientific computations by
David M Nicol (Book)
6 editions published between 1988 and 1989 in English and held by 84 WorldCat member libraries worldwide
Scientific codes are usually parallelized by partitioning a grid among processors. To achieve top performance it is necessary to partition the grid so as to balance workload and minimize communication/synchronization costs. This problem is particularly acute when the grid is irregular, changes over the course of the computation, and is not known until load time. Critical mapping and remapping decisions rest on our ability to accurately predict performance, given a description of a grid and its partition. This paper discusses one approach to this problem, and illustrates its use on a one-dimensional fluids code. The models we construct are shown empirically to be accurate, and are used to find optimal remapping schedules. Keywords: Parallel processing; Dynamic remapping; Analytic modeling
Related Identities
 Langley Research Center
 Institute for Computer Applications in Science and Engineering
 Saltz, Joel Author
 Reynolds, Paul F. (Paul Francis)
 Svinicki, Marilla D. 1946- Author 
 McKeachie, Wilbert James 1921- Author 
 Universities Space Research Association
 Palumbo, Daniel L. Author
 Bokhari, Shahid H. Author
 Mao, Weizhen Author
Associated Subjects
Algorithms Binary system (Mathematics) Cache memory Cartography College teaching Color Combinatorial optimization Computer architecture Computer programming Computer science Computer simulation Computer software--Reliability Computers--Reliability Decision making Decomposition (Mathematics) Digital computer simulation Dissection Electronic data processing Electronic data processing--Distributed processing First year teachers Fourier transformations Magnetic disks Markov processes Mathematical models Memory management (Computer science) Monte Carlo method Parallel algorithms Parallel computers Parallel processing (Electronic computers) Partitions (Mathematics) Petri nets Pipelines Production scheduling Programming languages (Electronic computers) Queuing theory Reliability (Engineering) Scheduling Switching circuits--Simulation methods Synchronization System analysis Teaching