WorldCat Identities

Burger, Doug 1969-

Overview
Works: 31 works in 35 publications in 1 language and 46 library holdings
Genres: Academic theses 
Roles: Author
Most widely held works by Doug Burger
Hardware techniques to improve the performance of the processor/memory interface by Doug Burger( )

3 editions published in 1998 in English and held by 4 WorldCat member libraries worldwide

Finally, the distribution of processing power into physical memory, to reduce both memory latency and traffic, is explored. One such architecture is evaluated in detail (the DataScalar architecture), and it is shown that--for memory-limited applications--this scheme can offer significant speedups (9% to 100%)
Parallelizing Appbt for a shared-memory multiprocessor by Doug Burger( Book )

1 edition published in 1995 in English and held by 3 WorldCat member libraries worldwide

Abstract: "The NAS Parallel Benchmarks are a collection of simplified computational fluid dynamic (CFD) applications. We have rewritten and parallelized Appbt -- a CFD application that uses the solution of a block-tridiagonal system -- to run efficiently on shared-memory multiprocessors. We tested our algorithm through simulation on the Wisconsin Wind Tunnel. We tested our code on two major types of shared-memory multiprocessors, one using a dir_N NB directory protocol and one using the Scalable Coherent Interface cache-coherence protocol. We simulated the code with these protocols for machines ranging from 1 to 128 processors. We found that our parallelization methodology worked well for up to 128 processors, and will apparently scale to even larger systems. Further study is required to confirm this hypothesis, however."
The declining effectiveness of dynamic caching for general-purpose microprocessors by Doug Burger( Book )

2 editions published in 1995 in English and held by 3 WorldCat member libraries worldwide

Abstract: "The computational power of commodity general-purpose microprocessors is racing to truly amazing levels. As peak levels of performance rise, the building of memory systems that can keep pace becomes increasingly problematic. We claim that in addition to the latency associated with waiting for operands, the bandwidth of the memory system, especially that across the chip boundary, will become a progressively greater limit to high performance. After describing the current state of microsolutions aimed at alleviating the memory bottleneck, this paper postulates that dynamic caches themselves use memory inefficiently and will impede attempts to solve the memory problem. We present an analysis of several important algorithms, which shows that increasing levels of integration will result in computational requirements outstripping off-chip bandwidth, thereby preserving the memory bottleneck. We then present results from two sets of simulations, which measured both the efficiency with which current caching techniques use memory (generally less than 20%), and how well (or poorly) caches reduce traffic to main memory (up to 2000 times worse than optimal for the cache sizes studied). We then discuss how two classes of techniques, (i) decoupling memory operations from computation, and (ii) explicit compiler management of the memory hierarchy, provide better long-term solutions to lowering a program's memory latencies and bandwidth requirements. Finally, we describe Galileo, a new project that will attempt to provide a long-term solution to the pernicious memory bottleneck."
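The cache memory efficiency this abstract quantifies -- the fraction of data brought on-chip that is actually used -- can be illustrated with a toy direct-mapped cache simulation. This is a hypothetical sketch for illustration only (the cache parameters, function name, and traces are invented), not the paper's simulator or methodology:

```python
# Toy measurement of cache "memory efficiency": the fraction of words
# fetched on-chip in cache lines that are actually touched before the
# line is evicted. Illustrative sketch only, not the paper's simulator.

WORDS_PER_LINE = 8
NUM_SETS = 64  # direct-mapped cache

def line_efficiency(trace):
    """trace: iterable of word addresses. Returns used/fetched word ratio."""
    sets = {}            # set index -> (tag, set of touched word offsets)
    used = fetched = 0
    for addr in trace:
        line = addr // WORDS_PER_LINE
        idx, tag = line % NUM_SETS, line // NUM_SETS
        offset = addr % WORDS_PER_LINE
        entry = sets.get(idx)
        if entry is None or entry[0] != tag:   # miss: fetch a new line
            if entry is not None:              # count words used by the evicted line
                used += len(entry[1])
            fetched += WORDS_PER_LINE
            sets[idx] = (tag, {offset})
        else:                                  # hit: mark this word as used
            entry[1].add(offset)
    for _, touched in sets.values():           # flush lines still resident
        used += len(touched)
    return used / fetched if fetched else 0.0

sparse = range(0, 8192, WORDS_PER_LINE)  # one word touched per fetched line
dense = range(0, 8192)                   # every word of every line touched
```

For the strided pattern only one of the eight words in each fetched line is ever touched, so the efficiency is 0.125; the dense sequential sweep uses every fetched word and scores 1.0, showing how access patterns alone can leave most fetched memory unused.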
Design of wide-issue high-frequency processors in wire delay dominated technologies by Hrishikesh Sathyavasu Murukkathampoondi( )

1 edition published in 2004 in English and held by 3 WorldCat member libraries worldwide

Special issue on tools for computer architecture research( Book )

1 edition published in 2004 in English and held by 2 WorldCat member libraries worldwide

Paging tradeoffs in distributed-shared-memory multiprocessors( Book )

1 edition published in 1994 in English and held by 2 WorldCat member libraries worldwide

Abstract: "Massively parallel processors have begun using commodity operating systems that support demand-paged virtual memory. To evaluate the utility of virtual memory, we measured the behavior of seven shared-memory parallel application programs on a simulated distributed-shared-memory machine. Our results (i) confirm the importance of gang CPU scheduling, (ii) show that a page-faulting processor should spin rather than invoke a parallel context switch, (iii) show that our parallel programs frequently touch most of their data, and (iv) indicate that memory, not just CPUs, must be 'gang scheduled'. Overall, our experiments demonstrate that demand paging has limited value on current parallel machines because of the applications' synchronization and memory reference patterns and the machines' high page-fault and parallel-context-switch overheads."
Quantifying memory bandwidth limitations of current and future microprocessors by Doug Burger( Book )

1 edition published in 1996 in English and held by 2 WorldCat member libraries worldwide

Abstract: "This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches -- implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips."
Simulation of the SCI transport layer on the Wisconsin Wind Tunnel by Doug Burger( Book )

1 edition published in 1995 in English and held by 2 WorldCat member libraries worldwide

Abstract: "Parallel simulation of parallel machines is fast becoming a critical technique for the evaluation of new parallel architectures and architectural extensions. Fast and accurate simulation of the interconnection network in parallel simulators is extremely difficult, but also extremely important. In this report, we describe an extension to the Wisconsin Wind Tunnel that simulates the transport layer of the Scalable Coherent Interface. This module enables an evaluation of switch designs and network topologies that uses real parallel codes. It also enables the exploration of architectural and protocol optimization effects on network performance. Finally, our extension increases confidence in SCI-related results that were obtained without detailed network simulation."
Accuracy vs. performance in parallel simulation of interconnection networks by Doug Burger( Book )

1 edition published in 1995 in English and held by 2 WorldCat member libraries worldwide

Abstract: "Parallel simulation is emerging as the dominant technique for studying parallel computers. However, the interconnection networks of these machines can be modeled at many different levels of abstraction, allowing researchers to trade off accuracy and performance. In this paper, we use the Wisconsin Wind Tunnel, a parallel simulator for cache-coherent shared-memory machines, to study the trade-offs of accuracy versus performance for six different network simulation models. We evaluate these models for a variety of parallel applications, cache- coherence protocols, and topologies. We show that only the two most expensive models -- which model contention at individual links -- are robust in the presence of high network loads or non-uniform traffic patterns."
A routing network for the grid processor architecture by Vincent Ajay Singh( Book )

2 editions published in 2003 in English and held by 2 WorldCat member libraries worldwide

This is a technical report on the proposed in-grid network/router for the grid architecture. This router architecture demonstrates a lightweight and robust solution for on-chip operand networks, and incorporates backpressure and dynamic routing techniques. A representative design was implemented in Verilog to test functionality, and at the circuit level to examine worst-case delay. The results show that, in the common case, operands will be available to the next processor without incurring anything but transmission delay. Worst-case delay estimates for the control logic and the transmission delay are presented for 100nm and 35nm technologies
Memory hierarchy extensions to the SimpleScalar tool set by Doug Burger( )

1 edition published in 1999 in English and held by 1 WorldCat member library worldwide

In this report we describe memory hierarchy extensions to the SimpleScalar tool set. The extensions allow for the modeling of an arbitrary hierarchy of caches and associated buses. The caches are non-blocking, with a finite number of miss status holding registers, and support both virtual and physical addressing. We also model address translation in detail, including hardware translation look-aside buffer (TLB) fills and page table walks. We describe the cache and memory organization that we model, present the implementation in detail, explain how these extensions can be configured, and provide a sample configuration
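The miss status holding registers (MSHRs) mentioned in this abstract are the mechanism that lets a non-blocking cache keep servicing requests while misses are outstanding. A minimal sketch of the standard MSHR policy (class and method names are invented for illustration; this is not the SimpleScalar code):

```python
# Simplified non-blocking cache miss handling with a finite pool of
# MSHRs (miss status holding registers). A secondary miss to a block
# already being fetched is merged into the existing MSHR; a primary
# miss allocates a free MSHR; with no free MSHR the cache must stall.
# Illustrative sketch only, not the SimpleScalar implementation.

class MSHRFile:
    def __init__(self, num_mshrs):
        self.num_mshrs = num_mshrs
        self.pending = {}  # block address -> list of waiting request ids

    def handle_miss(self, block_addr, request_id):
        """Returns 'merged', 'allocated', or 'stall'."""
        if block_addr in self.pending:          # secondary miss: merge
            self.pending[block_addr].append(request_id)
            return "merged"
        if len(self.pending) < self.num_mshrs:  # primary miss: allocate
            self.pending[block_addr] = [request_id]
            return "allocated"
        return "stall"                          # structural hazard: no free MSHR

    def fill(self, block_addr):
        """Memory returned the block; free the MSHR and wake all merged requests."""
        return self.pending.pop(block_addr, [])
```

With two MSHRs, two misses to the same block share one entry, a miss to a second block takes the other, and a miss to a third block stalls until a fill frees an entry -- which is why the number of MSHRs bounds a cache's memory-level parallelism.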
Application binary interface (ABI) manual by Aaron Lee Smith( )

1 edition published in 2005 in English and held by 1 WorldCat member library worldwide

This document specifies the TRIPS Application Binary Interface (ABI) for the TRIPS architecture, a novel, scalable, and low power architecture for future technologies
End-to-end validation of architectural power models by Madhu Saravana Sibi Govindan( )

1 edition published in 2008 in English and held by 1 WorldCat member library worldwide

While researchers have invested substantial effort to build architectural power models, validating such models has proven difficult at best. In this paper, we examine the accuracy of commonly used architectural power models on a custom ASIC microprocessor. Our platform is the TRIPS system for which we have readily available high-level simulators, RTL simulators, and hardware. Access to all three levels of the design provides insight that is missing from previous published studies. First, we show that applying common architectural power models out-of-the-box to TRIPS results in an underestimate of the total power by 65%. Next, using a detailed breakdown of an accurate RTL power model (6% average error), we identify and quantify the major sources of inaccuracies in the architectural power model. Finally, we show how fixing these sources of errors decreases the inaccuracy to 24%. While further reductions are difficult due to systematic modeling error in the simulator, we conclude with recommendations to improve architectural level power modeling
TRIPS processor reference manual by Robert McDonald( )

1 edition published in 2005 in English and held by 1 WorldCat member library worldwide

This document describes the TRIPS Processor, including its instruction set, register set, and general processing model. TRIPS is a novel, scalable, and low power architecture for future technologies
Modeling the impact of device and pipeline scaling on the soft error rate of processor elements by Premkishore Shivakumar( )

1 edition published in 2002 in English and held by 1 WorldCat member library worldwide

This paper examines the effect of technology scaling and microarchitectural trends on the rate of soft errors in CMOS memory and logic circuits. We describe and validate an end-to-end model that enables us to compute the soft error rates (SER) for existing and future microprocessor-style designs. The model captures the effects of two important masking phenomena, electrical masking and latching-window masking, which inhibit soft errors in combinational logic. We quantify the SER due to high-energy neutrons in SRAM cells, latches, and logic circuits for feature sizes from 600nm to 50nm and clock periods from 16 to 6 fan-out-of-4 inverter delays. Our model predicts that the SER per chip of logic circuits will increase nine orders of magnitude from 1992 to 2011 and at that point will be comparable to the SER per chip of unprotected memory elements. Our result emphasizes that computer system designers must address the risks of soft errors in logic circuits for future designs
Low-power, high-performance analog neural branch prediction by Renée St. Amant( )

1 edition published in 2008 in English and held by 1 WorldCat member library worldwide

Shrinking transistor sizes and a trend toward low-power processors have caused increased leakage, high per-device variation, and a larger number of hard and soft errors. Maintaining precise digital behavior on these devices grows more expensive with each technology generation. In some cases, replacing digital units with analog equivalents can allow similar computation to be performed at higher speed and lower power. The units that can most easily benefit from this approach are those whose results do not have to be precise, such as various types of predictors. This paper describes an analog implementation of a neural branch predictor, which uses current summation to approximate the expensive dot-product computation required in digital perceptron predictors. The analog neural predictor we simulate produces accuracy equivalent to a digital neural predictor that requires 128 additions per prediction. The analog version, however, can run in 200 picoseconds, with the analog portion of the prediction computation consuming less than 0.4 milliwatts at a 45 nm technology, which is negligible compared to the power required for the table lookups in this and conventional predictors
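The dot product that the analog circuit approximates with current summation is, in a digital perceptron predictor, a signed sum of small integer weights selected by the global branch history. A minimal sketch of the standard perceptron predictor (history length, threshold, and names are illustrative choices, not the paper's configuration):

```python
# Minimal digital perceptron branch predictor. The prediction is the
# sign of a dot product between a weight vector and the global history
# encoded as +1/-1 -- the computation the analog design replaces with
# current summation. Standard perceptron sketch, not the paper's circuit.

HISTORY_LEN = 8
THRESHOLD = int(1.93 * HISTORY_LEN + 14)  # commonly used training threshold

class Perceptron:
    def __init__(self):
        self.weights = [0] * (HISTORY_LEN + 1)  # weights[0] is the bias

    def predict(self, history):
        """history: list of +1/-1 past outcomes. Returns (taken?, y)."""
        y = self.weights[0] + sum(w * h for w, h in zip(self.weights[1:], history))
        return y >= 0, y

    def train(self, history, taken, y):
        """Update weights on a misprediction or a low-confidence output."""
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= THRESHOLD:
            self.weights[0] += t
            for i, h in enumerate(history):
                self.weights[i + 1] += t * h

p = Perceptron()
hist = [1] * HISTORY_LEN      # e.g. the last 8 branches were all taken
pred, y = p.predict(hist)
p.train(hist, taken=True, y=y)
```

The prediction cost is one addition per history bit (128 additions for the 128-entry digital predictor cited above), which is exactly the serial work an analog current-summing circuit collapses into a single settling delay.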
Skirting Amdahl's Law: using SPSD execution with optical interconnects by Doug Burger( Book )

1 edition published in 1996 in English and held by 1 WorldCat member library worldwide

Abstract: "Optical interconnects provide new parallel processing opportunities through inexpensive broadcasts and high-bandwidth, point-to-point connections. However, the problems of flow control and buffering inhibit current parallel architectures from effectively exploiting the advantages of optical interconnects. We propose using an execution model, called Single Program, Single Data stream (SPSD), to exploit inexpensive optical broadcasts and reduce the serial overheads of parallel programs. We describe one possible implementation of such a system (DataScalar), and discuss how future systems can be designed to better exploit optical interconnects."
DSP extensions to the TRIPS ISA by Kevin Beckwith Bush( )

1 edition published in 2007 in English and held by 1 WorldCat member library worldwide

In this paper, we propose a set of DSP extensions to the TRIPS ISA and evaluate their performance. By extending the TRIPS ISA with specialized DSP instructions, we offer an explorative look at the interaction of conventional specialization techniques (such as SIMD instructions) with EDGE ISAs. We discuss the implementation and its feasibility and provide non-intrusive compiler support through hand-written library functions. Finally, we evaluate the performance benefits of our extensions with custom library-emphasizing benchmarks and compare our results with those of the industry standard TI c6416 digital signal processor
Assessment of MRAM technology characteristics and architectures by Rajagopalan Desikan( )

1 edition published in 2001 in English and held by 1 WorldCat member library worldwide

On-chip MRAM as a high-bandwidth, low-latency replacement for DRAM physical memories by Rajagopalan Desikan( )

1 edition published in 2002 in English and held by 1 WorldCat member library worldwide

Impediments to main memory performance have traditionally been due to the divergence in processor versus memory speed and the pin bandwidth limitations of modern packaging technologies. In this paper we evaluate a magneto-resistive memory (MRAM)-based hierarchy to address these future constraints. MRAM devices are nonvolatile, and have the potential to be faster than DRAM, denser than embedded DRAM, and can be integrated into the processor die in layers above those of conventional wiring. We describe basic MRAM device operation, develop detailed models for MRAM banks and layers, and evaluate an MRAM-based memory hierarchy in which all off-chip physical DRAM is replaced by on-chip MRAM. We show that this hierarchy offers extremely high bandwidth, resulting in a 15% improvement in end-program performance over conventional DRAM-based main memory systems. Finally, we compare the MRAM hierarchy to one using a chip-stacked DRAM technology and show that the extra bandwidth of MRAM enables it to outperform this nearer-term technology. We expect that the advantage of MRAM-like technologies will increase with the proliferation of chip multiprocessors due to increased memory bandwidth demands
 
Audience level: 0.72 (from 0.56 for Skirting Amdahl's Law to 0.85 for Design of wide-issue high-frequency processors)

Alternative Names
Burger, Douglas C., 1969-

Burger, Douglas Christopher, 1969-

Languages
English (24)