WorldCat Identities

Nicol, David M.

Overview
Works: 121 works in 268 publications in 1 language and 4,808 library holdings
Genres: Conference papers and proceedings 
Roles: Author, Editor
Classifications: LB2331, 378.12
Most widely held works by David M Nicol
McKeachie's teaching tips : strategies, research, and theory for college and university teachers by Marilla D Svinicki( Book )

4 editions published between 2010 and 2014 in English and held by 390 WorldCat member libraries worldwide

This indispensable handbook provides helpful strategies for dealing with both the everyday challenges of university teaching and those that arise in efforts to maximize learning for every student. The suggested strategies are supported by research and adaptable to specific classroom situations. Rather than suggest a "set of recipes" to be followed mechanically, the book gives instructors the tools they need to deal with the ever-changing dynamics of teaching and learning
Schedules for mapping irregular parallel computations by David M Nicol( )

4 editions published in 1987 in English and held by 291 WorldCat member libraries worldwide

An optimal repartitioning decision policy by David M Nicol( )

4 editions published in 1986 in English and held by 290 WorldCat member libraries worldwide

Automated parallelization of discrete state-space generation by David M Nicol( Book )

7 editions published between 1997 and 2000 in English and held by 140 WorldCat member libraries worldwide

We consider the problem of generating a large state-space in a distributed fashion. Unlike previously proposed solutions that partition the set of reachable states according to a hashing function provided by the user, we explore heuristic methods that completely automate the process. The first step is an initial random walk through the state space to initialize a search tree, duplicated in each processor. Then, the reachability graph is built in a distributed way, using the search tree to assign each newly found state to classes assigned to the available processors. Furthermore, we explore two remapping criteria that attempt to balance memory usage or future workload, respectively. We show how the cost of computing the global snapshot required for remapping will scale up for system sizes in the foreseeable future. An extensive set of results is presented to support our conclusions that remapping is extremely beneficial
Parallel algorithms for simulating continuous time Markov chains by David M Nicol( Book )

5 editions published in 1992 in English and held by 115 WorldCat member libraries worldwide

Abstract: "We have previously shown that the mathematical technique of uniformization can serve as the basis of synchronization for the parallel simulation of continuous-time Markov chains. This paper reviews the basic method and compares five different methods based on uniformization, evaluating their strengths and weaknesses as a function of problem characteristics. The methods vary in their use of optimism, logical aggregation, communication management, and adaptivity. Performance evaluation is conducted on the Intel Touchstone Delta multiprocessor, using up to 256 processors."
Advanced techniques in reliability model representation and solution by Daniel L Palumbo( Book )

5 editions published in 1992 in English and held by 106 WorldCat member libraries worldwide

Proceedings : Workshop on Principles of Advanced and Distributed Simulation (PADS 2005), Monterey, California, June 1-3, 2005 by Workshop on Principles of Advanced and Distributed Simulation( Book )

8 editions published in 2005 in English and held by 97 WorldCat member libraries worldwide

User's guide to the Reliability Estimation System Testbed (REST) by David M Nicol( Book )

4 editions published in 1992 in English and Undetermined and held by 96 WorldCat member libraries worldwide

Distributed simulation, 1988 : proceedings of the SCS Multiconference on Distributed Simulation, 3-5 February, 1988, San Diego, California by SCS Multiconference on Distributed Simulation( Book )

11 editions published between 1985 and 1990 in English and held by 92 WorldCat member libraries worldwide

Advances in parallel and distributed simulation : proceedings of the SCS Multiconference on Advances in Parallel and Distributed Simulation, 23-25 January 1991, Anaheim, California by SCS Multiconference on Advances in Parallel and Distributed Simulation( Book )

7 editions published between 1990 and 1991 in English and held by 89 WorldCat member libraries worldwide

Optimistic barrier synchronization by David M Nicol( Book )

5 editions published in 1992 in English and held by 89 WorldCat member libraries worldwide

Abstract: "Barrier synchronization is a fundamental operation in parallel computation. In many contexts, at the point a processor enters a barrier it knows that it has already processed all work required of it prior to the synchronization. This paper treats the alternative case, when a processor cannot enter a barrier with the assurance that it has already performed all necessary pre-synchronization computation. The problem arises when the number of pre-synchronization messages to be received by a processor is unknown, for example, in a parallel discrete simulation or any other computation that is largely driven by an unpredictable exchange of messages. We describe an optimistic O(log² P) barrier algorithm for such problems, study its performance on a large-scale parallel system, and consider extensions to general associative reductions, as well as associative parallel prefix computations."
A sweep algorithm for massively parallel simulation of circuit-switched networks by Bruno Gaujal( Book )

4 editions published in 1992 in English and held by 87 WorldCat member libraries worldwide

A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described and corresponding experiments on an Intel iPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude
Massively parallel algorithms for trace-driven cache simulations by David M Nicol( Book )

3 editions published in 1991 in English and held by 86 WorldCat member libraries worldwide

Abstract: "Trace-driven cache simulation is central to computer design. A trace is a very long sequence, x₁, …, x_N, of references to lines (contiguous locations) from main memory. At the t-th instant, reference x_t is hashed into a set of cache locations, the contents of which are then compared with x_t. If at the t-th instant x_t is not present in the cache, it is said to be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory line, and making x_t present for the (t+1)-st instant. The problem of parallel simulation of a subtrace of N references directed to a C-line cache set is considered, with the aim of determining which references are misses and related statistics. A simulation method is presented for the Least-Recently-Used (LRU) policy which, regardless of the set size C, runs in time O(log N) using N processors on the exclusive-read, exclusive-write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. We present timings of the second algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference-based line replacement policies is considered, which includes LRU as well as the Least-Frequently-Used and Random replacement policies. A simulation method is presented for any such policy that, on any trace of length N directed to a C-line set, runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well-suited for SIMD implementations."
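The parallel algorithms above are best understood against the sequential baseline they accelerate. As a point of reference, counting misses for a single fully associative C-line set under LRU might be sketched as follows (an illustrative sketch, not code from the paper; the `lru_misses` name and signature are our own):

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses for one C-line, fully associative LRU cache set."""
    cache = OrderedDict()  # keys are memory lines, least recently used first
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the line
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[line] = True
    return misses
```

For the trace 1, 2, 1, 3, 2, 4 on a 2-line set, only the second reference to 1 hits, giving 5 misses.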
Rectilinear partitioning of irregular data parallel computations by David M Nicol( Book )

5 editions published in 1991 in English and held by 86 WorldCat member libraries worldwide

Abstract: "This paper describes new mapping algorithms for domain-oriented data-parallel computations, where the workload is distributed irregularly throughout the domain, but exhibits localized communication patterns. We consider the problem of partitioning the domain for parallel processing in such a way that the workload on the most heavily loaded processor is minimized, subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are useful on architectures that have a fast local mesh network and a relatively slower global network; these partitions heuristically attempt to maximize the fraction of communication carried by the local network. This paper provides an improved algorithm for finding the optimal partition in one dimension, new algorithms for partitioning in two dimensions, and shows that optimal partitioning in three dimensions is NP-complete. We discuss our application of these algorithms to real problems."
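The one-dimensional case mentioned above has a simple optimal formulation: split a 1-D workload into k contiguous chunks so that the heaviest chunk is as light as possible. One standard way to solve it, shown here as an illustrative sketch rather than the paper's improved algorithm (all names are our own), is binary search on the bottleneck value:

```python
def min_bottleneck_partition(weights, k):
    """Split a 1-D workload into at most k contiguous chunks,
    minimizing the heaviest chunk, via binary search on the bottleneck."""
    def fits(cap):
        # Greedily pack chunks no heavier than cap; count how many we need.
        chunks, cur = 1, 0
        for w in weights:
            if w > cap:
                return False
            if cur + w > cap:
                chunks, cur = chunks + 1, 0
            cur += w
        return chunks <= k

    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For weights [1, 2, 3, 4, 5] and k = 2, the best split is [1, 2, 3] | [4, 5], with bottleneck 9.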
Inflated speedups in parallel simulations via malloc() by David M Nicol( Book )

4 editions published in 1990 in English and held by 85 WorldCat member libraries worldwide

Discrete-event simulation programs make heavy use of dynamic memory allocation in order to support simulation's very dynamic space requirements. When programming in C one is likely to use the malloc() routine. However, a parallel simulation which uses the standard Unix System V malloc() implementation may achieve an overly optimistic speedup, possibly superlinear. An alternate implementation provided on some (but not all) systems can avoid the speedup anomaly, but at the price of significantly reduced available free space. This is especially severe on most parallel architectures, which tend not to support virtual memory. This paper illustrates the problem, then shows how a simply implemented user-constructed interface to malloc() can both avoid artificially inflated speedups and make efficient use of the dynamic memory space. The interface simply caches blocks on the basis of their size. We demonstrate the problem empirically, and show the effectiveness of our solution both empirically and analytically
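The caching policy described above fits in a few lines. The paper's interface wraps C's malloc(); the Python class below (our own naming and representation, not the paper's code) illustrates the same idea of recycling freed blocks by exact size so that repeat requests never reach the underlying allocator:

```python
class SizeCachingAllocator:
    """Cache freed blocks by size so repeat requests are served locally."""
    def __init__(self):
        self.free_lists = {}  # block size -> list of cached freed blocks

    def alloc(self, size):
        cached = self.free_lists.get(size)
        if cached:
            return cached.pop()     # reuse a block of exactly this size
        return bytearray(size)      # otherwise fall through to the system

    def free(self, block):
        # Never return the block to the system; park it on its size list.
        self.free_lists.setdefault(len(block), []).append(block)
```

A freed 64-byte block is handed back verbatim on the next 64-byte request, while a 32-byte request still goes to the system allocator.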
Optimal processor assignment for pipeline computations( Book )

4 editions published in 1991 in English and held by 85 WorldCat member libraries worldwide

The availability of large-scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives interest us: minimal response time given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, we assume that a large number of processors are to be assigned to a relatively small number of tasks. In this paper we develop efficient assignment algorithms for different classes of task structures. For a p-processor system and a series-parallel precedence graph with n constituent tasks, we provide an O(np²) algorithm that finds the optimal assignment for a response time optimization problem; we find the assignment optimizing the constrained throughput in O(np² log p) time. Special cases of linear, independent, and tree graphs are also considered. In addition, we examine more efficient algorithms when certain restrictions are placed on the problem parameters. Our techniques are applied to a task system in computer vision
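For the simplest special case, a linear chain of tasks, the response-time objective reduces to a small dynamic program over processor budgets. The sketch below is illustrative only: the paper's O(np²) algorithm handles general series-parallel graphs, and the function name and input layout here are our own assumptions:

```python
import math

def min_response(times, p):
    """times[i][q-1] = measured response time of stage i on q processors.
    Minimize total response of a linear chain using at most p processors."""
    dp = [math.inf] * (p + 1)  # dp[tot] = best response using tot processors
    dp[0] = 0.0
    for stage in times:
        new = [math.inf] * (p + 1)
        for tot in range(1, p + 1):
            # give q of the tot processors to this stage
            for q in range(1, min(tot, len(stage)) + 1):
                cand = dp[tot - q] + stage[q - 1]
                if cand < new[tot]:
                    new[tot] = cand
        dp = new
    return min(dp)
```

With two stages whose response times on 1, 2, 3 processors are (10, 6, 4) and (8, 5, 3), a budget of 4 processors is best spent 2 + 2, for a total response of 11.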
Performing out-of-core FFTs on parallel disk systems by Thomas H Cormen( Book )

4 editions published in 1996 in English and held by 85 WorldCat member libraries worldwide

The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel input/output systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive with, or better than, those for in-core methods even when the latter run entirely in memory
Parametric binary dissection by Shahid H Bokhari( Book )

3 editions published in 1993 in English and held by 85 WorldCat member libraries worldwide

Abstract: "Binary dissection is widely used to partition non- uniform domains over parallel computers. This algorithm does not consider the perimeter, surface area, or aspect ratio of the regions being generated and can yield decompositions that have poor communication to computation ratio. Parametric Binary Dissection (PBD) is a new algorithm in which each cut is chosen to minimize load + [lambda]X(shape). In a 2 (or 3) dimensional problem, load is the amount of computation to be performed in a subregion and shape could refer to the perimeter (respectively surface) of that subregion. Shape is a measure of communication overhead and the parameter [lambda] permits us to trade off load imbalance against communication overhead. When [lambda] is zero, the algorithm reduces to plain binary dissection. This algorithm can be used to partition graphs embedded in 2 or 3-d. Here load is the number of nodes in a subregion, shape the number of edges that leave that subregion, and [lambda] the ratio of time to communicate over an edge to the time to compute at a node. We present an algorithm that finds the depth d parametric dissection of an embedded graph with n vertices and e edges in O(max[n log n, de]) time, which is an improvement over the O(dn log n) time of plain binary dissection. We also present parallel versions of this algorithm; the best of these requires O((n/p) log³ p) time on a p processor hypercube, assuming graphs of bounded degree. We describe how PBD is applied to 3-d unstructured meshes and yields partitions that are better than those obtained by plain dissection. We also discuss its application to the color image quantization problem, in which samples in high-resolution color space are mapped onto a lower resolution space in a way that minimizes the color error."
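The cut criterion load + λ×(shape) is easy to illustrate. The following sketch is not the paper's O(max[n log n, de]) algorithm; the function names, the 2-D weight-grid representation, and the exhaustive search over cut positions are our own simplifications. It recursively dissects a grid, choosing each cut to minimize the worse of the two halves' load + λ·perimeter; with λ = 0 it reduces to plain binary dissection:

```python
def cost(grid, r0, r1, c0, c1, lam):
    """Parametric cost of the subgrid [r0:r1, c0:c1]: load + lam*perimeter."""
    load = sum(grid[r][c] for r in range(r0, r1) for c in range(c0, c1))
    perim = 2 * ((r1 - r0) + (c1 - c0))
    return load + lam * perim

def pbd(grid, r0, r1, c0, c1, depth, lam):
    """Depth-d parametric binary dissection; returns rectangles (r0, r1, c0, c1)."""
    if depth == 0:
        return [(r0, r1, c0, c1)]
    best = None
    for r in range(r0 + 1, r1):  # candidate horizontal cuts
        worst = max(cost(grid, r0, r, c0, c1, lam),
                    cost(grid, r, r1, c0, c1, lam))
        if best is None or worst < best[0]:
            best = (worst, ('h', r))
    for c in range(c0 + 1, c1):  # candidate vertical cuts
        worst = max(cost(grid, r0, r1, c0, c, lam),
                    cost(grid, r0, r1, c, c1, lam))
        if best is None or worst < best[0]:
            best = (worst, ('v', c))
    axis, pos = best[1]
    if axis == 'h':
        return (pbd(grid, r0, pos, c0, c1, depth - 1, lam) +
                pbd(grid, pos, r1, c0, c1, depth - 1, lam))
    return (pbd(grid, r0, r1, c0, pos, depth - 1, lam) +
            pbd(grid, r0, r1, pos, c1, depth - 1, lam))
```

On a uniform 4×4 grid with λ = 0, a depth-2 dissection yields four equal-load subregions.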
Binary dissection : variants & applications by Shahid H Bokhari( Book )

3 editions published in 1997 in English and held by 84 WorldCat member libraries worldwide

Partitioning is an important issue in a variety of applications. Two examples are domain decomposition for parallel computing and color image quantization. In the former we need to partition a computational task over many processors; in the latter we need to partition a high-resolution color space into a small number of representative colors. In both cases, partitioning must be done in a manner that yields good results as defined by an application-specific metric. Binary dissection is a technique that has been widely used to partition non-uniform domains over parallel computers. It proceeds by recursively partitioning the given domain into two parts, such that each part has approximately equal computational load. The basic dissection algorithm does not consider the perimeter, surface area, or aspect ratio of the two sub-regions generated at each step and can thus yield decompositions that have poor communication to computation ratios. We have developed and implemented several variants of the binary dissection approach that attempt to remedy this limitation; these variants are faster than the basic algorithm, can be applied to a variety of problems, and are amenable to parallelization. We first present the Parametric Binary Dissection (PBD) algorithm, which takes into account volume and surface area when partitioning computational domains for use in parallel computing applications. We then consider another variant, the Fast Adaptive Dissection (FAD) algorithm, which provides rapid spatial partitioning for use in color image quantization. We describe the performance of PBD and FAD on representative problems and present ways of parallelizing the PBD algorithm on 2- or 3-d meshes and on hypercubes
Accurate modeling of parallel scientific computations by David M Nicol( Book )

6 editions published between 1988 and 1989 in English and held by 84 WorldCat member libraries worldwide

Scientific codes are usually parallelized by partitioning a grid among processors. To achieve top performance it is necessary to partition the grid so as to balance workload and minimize communication/synchronization costs. This problem is particularly acute when the grid is irregular, changes over the course of the computation, and is not known until load-time. Critical mapping and remapping decisions rest on our ability to accurately predict performance, given a description of a grid and its partition. This paper discusses one approach to this problem, and illustrates its use on a one-dimensional fluids code. The models we construct are shown empirically to be accurate, and are used to find optimal remapping schedules. Keywords: Parallel processing; Dynamic remapping; Analytic modeling
 
Audience level: 0.54 (from 0.34 for McKeachie' ... to 0.68 for Distribute ...)

Languages: English (99)