Ian Foster, University of Chicago & Argonne National Laboratory
"The Grid" has been my area of research for several years now. Before going to graduate school I spent a year working for a company called Bibliographic Retrieval Services. Now I find myself involved in something we call "grid computing." More particularly, what I will talk about today are very large-scale data grids. Over the last several years more and more of what I do is concerned with or at least impinges upon issues of data curation and metadata and discovery issues relating not only to data but also computational procedures, the results of computations, and so forth. Although working for many years in high-performance computing, I have increasingly been encountering these problems of grids.
First, I want to say a few words about the technology landscape. I will talk a bit about what we mean by grid computing and why it is significant and where it is going. Then I will talk about the analysis and management of very large amounts of scientific data.
Exponentials
It is worth reminding ourselves of some of the forces that shape our scientific landscape, especially a variety of exponentials. Moore's law is a significant force in many aspects of society and particularly in scientific computing. The basic formulation of this is that transistor counts tend to double roughly every 18 months. This, of course, means that our ability to simulate complex systems, such as star formation, increases rapidly. It also has a significant impact on our ability to construct sensors that collect data about our physical systems. Both computing and sensors are improving fairly quickly.
Within the world of computational science, one implication of this is that any scientist now can construct a very high-performance supercomputer out of a small number of PCs wrapped together in one fashion or another and can therefore generate data at tremendous rates, as well as process data, such as sensor data, at very high rates.
The second exponential that we need to be concerned about is storage. Storage density is increasing even faster by most counts than transistor count and therefore computer speed. It is quite feasible to start seeing individual scientific disciplines talking about community data repositories that are close to a petabyte in size today and, by 2005, 10 petabytes, and so on. A petabyte is 1,000 terabytes, or a million gigabytes.
This is certainly having a transforming effect in the physical sciences, as I am sure you know, and in the biological sciences—and I think very soon in the humanities. The goal of a proposed project at the University of Chicago called The Digital Dig is to digitize information about Egyptian antiquities, which are disappearing very quickly. I find that project very interesting as an example of how something that you think of as not being data intensive at all could easily become extremely data intensive, with terabytes of data.
We are going up this curve from gigabytes and terabytes to petabytes of data partly because of simple improvements in sensor and storage capability, but also partly as a result of research communities realizing the importance of data and having the technology to produce it and therefore increasing investment. Some of these datasets that have been produced represent tremendous investments and are therefore tremendously valuable community resources, in areas such as earth observations, gravitational wave searches, nuclear physics, and digital sky surveys.
It is not that long since an astronomer, rather like a biologist, would spend a lifetime studying an individual star. A biologist would study an individual bug. Now it is becoming feasible to do whole-sky surveys that take pictures of the entire sky at some particular wavelength. Such sky surveys are coming online now. Within five years there should be a dozen different wavelength surveys of the sky, each of them producing multiple terabytes of data; that is not a huge amount by the standards that you would find in physics. It is already allowing for some really interesting scientific discovery based not on studying individual specimens but on "epidemiological" studies of millions of stars—a very different approach to scientific investigation.
For example, images of the Crab Nebula at four different wavelengths let you see very different information in each of them. Of course, the correspondences between them lead to some interesting information.
This is just a prelude to what is coming. The planned Large Synoptic Survey Telescope will start doing these sky surveys not just once, like the Sloan Digital Sky Survey, but every few days so that you will be able to watch behavior—and in the process produce about 10 petabytes of data per year. It is going to completely change the face of astronomy.
The same thing is starting to happen in other disciplines, not just physical sciences. Biology and medicine are going through the same sorts of revolutions: the Human Genome Project and other genome databases; proteomics and studies of protein interactions and drug delivery; high-resolution x-ray crystallography; medical data—from x-ray to mammography to digitized patient records; high-resolution medical instrumentation. For example, there is the notion of a Virtual Population Laboratory, which is a computer system designed to model the behavior of disease outbreaks within the world's population.
Similar work is going on with brain scans. The brain is a lot of data, it turns out. Not so long ago you could image brains noninvasively at centimeter resolution, which produced three kilobytes of data. Then people got to the point of being able to do millimeter resolution, three megabytes. Currently they can reach around 10 microns, which generates three terabytes of data for a single brain scan. That is pretty impressive. The goal is to be able to track the location of individual cells, which will require one-micron resolution and will generate three petabytes of data for a single brain scan.
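The brain-scan figures follow from cubic scaling: refining the resolution tenfold in each of three dimensions multiplies the number of volume elements by a thousand. A minimal sketch of that arithmetic, assuming data volume grows with the cube of the linear refinement and taking the quoted 3 kilobytes at centimeter resolution as the baseline (the function name and defaults are illustrative):

```python
def scan_bytes(resolution_m, base_resolution_m=1e-2, base_bytes=3e3):
    """Estimated bytes for one 3-D scan at the given resolution in meters.

    Assumes data volume grows with the cube of the linear refinement,
    anchored at roughly 3 KB for a scan at 1 cm resolution.
    """
    refinement = base_resolution_m / resolution_m
    return base_bytes * refinement ** 3


for label, res in [("1 cm", 1e-2), ("1 mm", 1e-3),
                   ("10 microns", 1e-5), ("1 micron", 1e-6)]:
    print(f"{label:>10}: {scan_bytes(res):.1e} bytes")
```

Under these assumptions the four resolutions come out at about 3 kilobytes, 3 megabytes, 3 terabytes, and 3 petabytes, matching the sequence quoted above.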
The next thing that people want to do, once you have gone to this tremendous amount of effort, is to collect these different brain scans together, hundreds or thousands of them, and then start to form comparisons. That, of course, is where some of these grid concepts come into play.
I have gone over two exponentials, computing and storage, but we also need to be aware of a third one: networks. Network performance is also increasing exponentially, and actually with a different coefficient from the other two. People argue as to what the exact coefficients are, but the key is that network speed is doubling roughly every nine months as against 18 or 12 months for computing and storage. The difference is an order of magnitude every five years, which has a dramatic impact on the relative performance of networks and computers and storage.
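The order-of-magnitude claim can be checked with a few lines of arithmetic. This is only a sketch, assuming fixed doubling times of 9, 18, and 12 months for networks, computing, and storage; the exact coefficients are, as noted, a matter of debate.

```python
def growth_factor(years, doubling_months):
    """Multiplicative growth over `years` for a fixed doubling time."""
    return 2 ** (years * 12 / doubling_months)


YEARS = 5
network = growth_factor(YEARS, 9)    # networks: doubling every 9 months
compute = growth_factor(YEARS, 18)   # computing: doubling every 18 months
storage = growth_factor(YEARS, 12)   # storage: doubling every 12 months

print(f"network  x{network:.0f}")
print(f"compute  x{compute:.0f}")
print(f"storage  x{storage:.0f}")
print(f"network relative to compute: x{network / compute:.1f}")
print(f"network relative to storage: x{network / storage:.1f}")
```

With these doubling times, networks gain roughly a factor of ten on computing over five years; against a 12-month doubling time the gap is smaller, roughly a factor of three.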
Just to illustrate what this means in practice, here are some numbers taken from the National Science Foundation supercomputer centers in the US:
- Over a 15-year period the speed of the supercomputers installed at those centers has increased by a factor of 200. That is a tremendous amount, from giga (10^9) operations per second to tera (10^12) operations per second.
- Over the same 15-year period we have gone from a 56-kilobit-per-second dial-up modem that was the state of the art in 1986 to a 40-gigabit-per-second backbone, which is currently being put in place to connect the major supercomputer centers. That is a factor of roughly 700,000, basically three orders of magnitude more growth than the supercomputers saw.
- Over the next few years we can presumably expect a similar difference. Computers will get much, much faster, but networks will get much, much, much faster. What is happening is that things are getting much closer together in some sense. The speed of light, of course, stays the same, but otherwise the topology of our world is changing, and we are able to do new things with this huge amount of data that is being produced and the huge amount of computing power that is becoming available to us.
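Taking the quoted figures at face value, one can back out the doubling times they imply. A small sketch, assuming steady exponential growth over the whole 15 years (the helper name is illustrative):

```python
import math


def doubling_months(growth, years):
    """Doubling time in months implied by a total growth factor over `years`."""
    return years * 12 / math.log2(growth)


compute_growth = 200            # supercomputer speed-up quoted for the period
network_growth = 40e9 / 56e3    # 40 Gb/s backbone versus a 56 kb/s link

print(f"network growth factor: {network_growth:,.0f}")
print(f"compute doubling time: {doubling_months(compute_growth, 15):.1f} months")
print(f"network doubling time: {doubling_months(network_growth, 15):.1f} months")
```

The network figure works out to a little over nine months per doubling, in line with the rate mentioned earlier.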
Evolution of the Scientific Process
It is clear that these changes have profound impacts on the nature of the scientific process. If we can talk about pre-electronic and post-electronic ages, pre-electronic scientists would mostly work alone or on a small team. They would theorize or conduct experiments and they would typically communicate their results by publishing a paper or maybe going to a scientific meeting and giving a talk.
There are disciplines, of course, that still communicate in that way, but it is no longer the only way or necessarily even the usual way of engaging in scientific research. It is just as common now that people will work on very large teams. They will engage in a collaborative process that has as its goal the construction of very large databases or the extraction of information from those databases. They may engage primarily in the development of complex simulation codes. For example, in the climate-modeling community much of the scientific activity revolves around the development, analysis, and verification of computer programs. Then they may communicate not by publishing papers but by exchanging information within very large distributed and multidisciplinary teams.
The Grid
This leads us to this notion of the Grid. This term was coined by some colleagues to capture an analogy with the electric power grid, which is part of what I think the Grid is about—perhaps not the most important part. The analogy is intended to convey this notion of instantaneous, on-demand access to computing capabilities or information capabilities. Throughout the US, you can plug in anywhere and obtain power in a standard fashion. We want to do even better and more widely than the electric power grid has done.
These technologies are about enabling resource sharing within these virtual teams, or what I often call virtual organizations, as they tackle problems of common interest such as the solution of various scientific problems. These virtual teams will tend to be distributed across many institutions, each with different goals and different capabilities and resources of various sorts.
A picture of the Grid from the perspective of a computer technologist shows that we have to deal with things like network partitions and overlapping virtual organizations and different trust relationships, that is, lots of things that we think about as technical problems but that of course are primarily social or policy problems and therefore perhaps beyond the scope of computer technology. But I suspect they are very much within the scope of what you people think about.
That is the basic notion of the Grid: a set of technologies and then a set of computer applications that enable and profit from resource sharing within these dynamic virtual organizations. An example of a contemporary virtual organization would be the various groups around the world who are participating in the design and ultimately the construction of the Large Hadron Collider at CERN. This is moderate-scale in terms of the number of people—a couple of thousand physicists—and fairly large in terms of the number of participating institutes and countries. These people are trying to solve a single common problem, which at some level looks like a device-construction program but is really largely a data analysis problem, because that is what LHC is about more than anything.
Once the LHC comes online, they are going to be generating data in Switzerland at a rate of hundreds of megabytes per second, 24 hours a day, seven days a week. They are going to want to distribute this data to these 2,000 physicists in their 32 countries and however many hundred institutions around the world. These same people are going to want to pose questions against that data. Has this sort of event happened? What about this? What about that? How does it compare with this set of simulated data that I have constructed?
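To get a feel for what such a rate means in storage terms, here is a rough conversion from a sustained data rate to yearly volume. The specific rates are assumptions chosen only to span "hundreds of megabytes per second", and the calculation assumes uninterrupted running with decimal units.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def petabytes_per_year(megabytes_per_second):
    """Yearly volume for a sustained rate, in decimal units (1 PB = 1e9 MB)."""
    return megabytes_per_second * SECONDS_PER_YEAR / 1e9


# Illustrative rates only; the talk gives no exact figure.
for rate in (100, 300, 500):
    print(f"{rate} MB/s -> {petabytes_per_year(rate):.1f} PB/year")
```

Even the lowest of these assumed rates yields a few petabytes per year, which is why a single data center is no longer seen as sufficient.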
This community has embraced this notion of a data grid: a worldwide, distributed, integrated set of computing resources, storage resources, and network resources as a means of dealing with this computational and data storage challenge. No longer do they believe that it is feasible to build a big data center in one single location. They believe they need to be able to harness resources on a global scale to solve this problem.
Many communities are embracing this notion of resource sharing within virtual organizations, supported by grid technologies, as a way of addressing hard technical problems. Certainly this includes people whose primary goal is profound scientific research, such as physicists and astronomers. It also includes people whose primary goal is the design and engineering of systems, for example civil engineers who want to collect data from shake tables and wave tanks and centrifuges and so forth.
It also increasingly involves industry: people who don't have anything in particular to do with science or engineering but are simply concerned with constructing very reliable, large-scale, low-cost, secure infrastructures for information processing. It is important to realize that there is an industrial component to this grid computing concept, because that is going to have a profound impact on what actually happens. The industrial interest leads to headlines in The Economist and The New York Times and so forth. A bit of hype is good. Some of it is sort of true, but our focus is really on: What does this mean for the scientist? How do we turn these concepts into tools that can change the way that we process scientific data?
People, of course, have been doing distributed computing for a very long time, but it was really in the early to mid-1990s that various people, primarily in the US initially and then very soon after in Europe, got interested in what you could do with very high-speed networks and started to realize that high-speed networks really were different—because things were so much closer together and, therefore, things could be coupled together in such different fashions.
After a period of experimentation with, for example, the I-WAY and academic software projects such as Globus and Legion, we got to where we are today, which is a pretty interesting point. We have many dozens of projects, not only in the US but also in the UK, others funded by the European Union, and increasingly in Japan. We have a significant technology base that has commercial support and growing industrial interest. We even have our own virtual organization, the Global Grid Forum, which had its first meeting in Amsterdam last year and is now up to roughly 500 people who meet every four months or so to talk about standards and so forth.
Challenging Technical Requirements
I am a computer scientist and a technologist. A lot of what interests me in this are the underlying technical problems: how you form and manage these virtual organizations and what technologies you need to manage that process. How do you discover what resources and services and people exist within a virtual organization, and how do you obtain access to them and link them together? How do you build systems that deliver multiple qualities of service? The quality of service might include performance, security, reliability, etc. How do you build an infrastructure that does all this in a reasonably automated fashion?
These are problems that we are pursuing at the moment in our Globus project in collaboration with many others, including IBM, Microsoft, and Sun. We have this thing called the Open Grid Services Architecture, which I encourage you to look at if you are interested in the technology (see More Information).
Let me focus in now on some of the grid problems that relate particularly to data management and data analysis. Grid computing is a fairly broad umbrella covering a lot of things that don't particularly relate to data management or analysis, but one of the most important drivers to emerge has been data-intensive science.
With the emergence of massive data collections, we have the realization that these data collections are important community resources that need to be managed and made available to a community—and the realization that the analysis of this data is itself a community process in many cases, and the procedures used to analyze data are themselves community resources. Then, of course, there is the practical factor that the data and the computers and the people who want to perform these analyses of the data are geographically distributed, so we need to be able to link these things together in some effective fashion.
Data grids are the infrastructure that may allow us to collect, manage, assess, and ultimately interpret these huge amounts of data. A large number of data grid projects are now underway worldwide. In the US the Particle Physics Data Grid and the Grid Physics Network project are collectively designing and building the infrastructure required for very large-scale analysis of large amounts of data. The International Virtual Data Grid Laboratory (iVDGL) is a global grid laboratory to support those two projects. The TeraGrid project is sort of a parallel effort to the iVDGL; it is a very large, distributed supercomputing system for, among other things, data-intensive science.
In Europe a similar set of projects—the EU Data Grid, CrossGrid, and DataTAG projects—are looking at various aspects of data-intensive science. Japan is just getting going, with some large projects coming online, we hope, in the near future that will focus on various projects in the Japanese context.
All of these are big collaborations, with in some cases hundreds of people in collaboration with application scientists and computer scientists. They are all focused on infrastructure development and deployment. They all run the Globus software.
There is more to grids than physics and astronomy. A National Institutes of Health initiative is pursuing the problem of processing the brain image data that I referred to earlier. The Biomedical Informatics Research Network (BIRN) is starting to link 10 labs within the US. Each now has a BIRN system, which provides a few terabytes of storage and a modest amount of computing. The goal is to scale these up and link them into other systems to produce a national-scale Biomedical Informatics Research Network with petabytes of data and large amounts of computing for National Institutes of Health-style problems.
Data Grid Technologies and Facilities
The Grid Physics Network (GriPhyN) is a reasonably large research and development project (about 13 million dollars over five years) that is intended to produce data-grid technologies for the analysis of data from four physics experiments. These are US-CMS, US-ATLAS, the Sloan Digital Sky Survey, and LIGO/LSC, where the idea is to detect the very minor movements in space-time that occur as a result of remote collisions of black holes and things like that. One element is basic research in information technology, trying to work out how we might represent the computational procedures and the data generated or used within these data grids. The second is actually deploying this machinery in a way that these experiments can use it for large-scale data analysis, so that they can make some sense out of these petabytes of data rather than having them sit unused in storage systems somewhere. A third important element is delivery to the larger scientific community of toolkits that will allow them to apply the same technologies within their projects.
This is a big collaboration. There are lots more collaborators besides the primary ones. It is based at the University of Florida and the University of Chicago, and Paul Avery from the University of Florida and I lead it.
This picture tries to capture some of the technical issues that are being addressed in the project. At the highest level we have one of these virtual organizations. In the case of LIGO it might be fairly small; in the case of an LHC experiment it could be thousands of physicists: individual investigators, some production team responsible for processing the data that is produced, perhaps work groups focusing on different technical subproblems. They are all going to be posing queries to what we want them to be able to view as a single, large, virtual data grid. This notion of virtual data is very important in what we are doing.
Under the covers there are more data sources of various sorts; certainly very large databases. There are also lots of computational resources and program resources, and computational procedures that have been developed by the virtual organization that is using these technologies. The goal is that people will be able to ask for data products, presumably using some very sophisticated metadata mechanisms, and then the overall system, the petascale virtual data grid, will be smart enough to find the piece of data you asked for and deliver it to you, or find that it doesn't exist. Perhaps it corresponds to some piece of virtual data that could be produced by running one of the transforms. In that case it will work out what transforms need to be run and how to schedule the required data movements and computations. The virtual data tools tell you what has to be done to produce the data. The request planning and scheduling tools plan the actual operations, and then the request execution and management tools actually coordinate what may be thousands of computer and storage resources in order to produce the piece of information that you want.
For example, LIGO is a gravitational wave observatory just coming online now. It will be collecting information signals from gamma-ray bursts. Then some scientist might go off and produce a procedure that would calculate the quantity for the gravitational strain. At some later point a user might come along and say, "Get me the gravitational strain for two minutes around all of the gamma-ray bursts that you have seen over the last year."
What we want to be able to do is find out if somebody has already asked for that piece of data and, if so, retrieve it. Otherwise, we will work out what procedure has to be run and how expensive it will be, and schedule the computations required to produce it. Of course, there is lots of interesting technology required to make that work, especially for the level of abstraction that we are providing to scientists.
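The planning step can be illustrated with a toy example. This is a minimal sketch and not the GriPhyN planner: it assumes a catalog that maps each data product to the transformation and inputs that would produce it, plus a set of products already in storage. All product and transformation names are hypothetical.

```python
def plan(target, recipes, materialized):
    """Return the derivation steps needed to produce `target`, in run order.

    recipes: product name -> (transformation name, list of input product names)
    materialized: set of product names that already exist in storage
    Assumes the recipes contain no cycles.
    """
    steps = []
    planned = set()

    def visit(product):
        if product in materialized or product in planned:
            return  # already stored, or already scheduled earlier in this plan
        if product not in recipes:
            raise LookupError(f"no stored copy and no recipe for {product!r}")
        transformation, inputs = recipes[product]
        for name in inputs:
            visit(name)  # make sure every input exists before this step runs
        steps.append((transformation, tuple(inputs), product))
        planned.add(product)

    visit(target)
    return steps


# Hypothetical catalog loosely modeled on the gravitational-strain request.
recipes = {
    "calibrated_signal": ("calibrate", ["raw_signal"]),
    "strain": ("compute_strain", ["calibrated_signal"]),
    "strain_near_bursts": ("cut_windows", ["strain", "burst_times"]),
}
materialized = {"raw_signal", "burst_times", "calibrated_signal"}

steps = plan("strain_near_bursts", recipes, materialized)
for transformation, inputs, output in steps:
    print(f"run {transformation}{inputs} -> {output}")
```

Because the calibrated signal already exists in this example, only the two later steps are planned. A real planner would also have to choose sites, estimate costs, and schedule data movement, none of which is modeled here.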
The technical problems we are dealing with include lots of interesting scheduling issues: how you represent the policies that govern who is allowed to use which resources for which purposes, where to put data once it has been produced, who is allowed to actually see different things, and how to estimate the cost not only of retrieving data but also of generating it.
The GriPhyN research agenda includes technical execution management, instrumentation and performance analysis, and a virtual data toolkit that enables technology transfer to other scientific domains. What is perhaps more interesting for other sciences is that we are already starting to release the software so that it can be used by other communities. We are seeing other communities, for example in biology, beginning to pick up on some of the basic technology and use it for managing the data that they are producing.
Roughly a year after we had written the GriPhyN proposal to NSF, we realized that progress in this area really required us to create a real experimental facility in which we could start to experiment on a large scale with some of these virtual data concepts. We wrote a second proposal to create, operate, and evaluate an international research laboratory for data-intensive science.
The iVDGL, the International Virtual Data Grid Lab, is intended to be a global lab linking countries or resources around the world. We are starting in the US and Europe as our two initial focal points. The idea is that it will be a production facility that can be used for CERN experiments and others, and also an experimental laboratory for other scientists. We are hoping, for example, that we will get various people involved in proteomics and other biological areas to start using these resources for their disciplines. It represents a reasonably substantial investment by the National Science Foundation—13.65 million dollars—with a lot of cost sharing and partnerships with other organizations internationally.
The sites that are initially being linked in the US are an interesting mixture of big labs, significant key universities, and also smaller universities and some minority institutions, so there is a real outreach component. They include Boston University, Brookhaven National Laboratory, California Institute of Technology (Caltech), Fermilab, Hampton University, Johns Hopkins University, Pennsylvania State University, Salish Kootenai College, University of California at San Diego, University of Florida, Indiana University, and University of Wisconsin. Many others already want to join, including a significant number of sites within Europe, and there is interest in Japan, Australia, and other countries.
GriPhyN Institutions
Argonne National Laboratory | Boston University | Brookhaven National Laboratory
California Inst. of Technology (Caltech) | Fermilab | Harvard University
Indiana University | Johns Hopkins University | Lawrence Berkeley Laboratory
Northwestern University | San Diego Supercomputer Center | Stanford University
Univ. of California, Berkeley | Univ. of California, San Diego | University of Chicago
University of Florida | Univ. of Illinois, Chicago | University of Pennsylvania
Univ. of Southern California / ISI | Univ. of Texas, Brownsville | Univ. of Wisconsin, Madison
Univ. of Wisconsin, Milwaukee
At each of these sites we will have some storage and some computing, and some common software running (the Globus and other software that I have talked about) so that people can discover what resources are available there to store and access data, and a common set of policies so that people wanting to request virtual data quantities can use these resources in some coordinated fashion.
Programs as Community Resources
I find particularly interesting at a technical level the question of what is the real nature of these virtual organizations and what are the resources that they are attempting to share as they pursue their common goals. Historically, as I said, the scientific communities worked by sharing technical information either by talks or written papers. It has started to be the case that scientific data is viewed as a significant resource—and also various forms of instrumentation or computer systems. Big databases are now significant resources.
But for the experiments I have been describing, the data the scientists are talking about is not their raw measurements. They don't spend their time looking at the original digital images collected from their telescopes or their MRI machines or whatever. Essentially all the data is computationally corrected or reconstructed in some fashion. Increasingly, so-called data is actually produced by some form of numerical simulation. Climate-model simulations, which may be very large and represent a huge investment in computer time, are significant resources in many communities.
We have this interesting situation where data is certainly important, but so are the programs that produce the data and so are the executions of these programs. You don't just have data; you have what I would call transformations, the computational procedures that produce or modify data, and then executions of these transformations, which I would call derivations. A derivation is an execution of a transformation: it consumes some data and generates some data, so every derived piece of data can be traced to the particular transformation that created it.
In our GriPhyN project we are attempting to codify not just information about data, which of course people have done for a long time, but also information about transformations and derivations. The idea is that people can then start to ask questions like, "Here is a piece of data that someone has told me about, but I need to know what computational procedures were applied to it, the way it was derived from the original data, before I know I can use it for some purposes." That is certainly a big concern of astronomers and I think also of people in genomics, for example. As you apply various forms of gene prediction procedures, you end up with data about which your certainty is less and less.
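One way to picture this bookkeeping is a small data model with the three kinds of record (data products, transformations, and derivations) and a function that answers the question "what was applied to this piece of data?". This is an illustrative sketch, not the project's actual schema; the class names, fields, and example products are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """A computational procedure, for example a program and its version."""
    name: str
    version: str


@dataclass(frozen=True)
class Derivation:
    """One execution of a transformation on specific data products."""
    transformation: Transformation
    inputs: tuple    # names of the data products consumed
    outputs: tuple   # names of the data products generated


def lineage(product, derivations):
    """Transformations applied to reach `product`, earliest first.

    Assumes each product has at most one producing derivation and that
    the derivation records contain no cycles. Raw data returns [].
    """
    producer = next((d for d in derivations if product in d.outputs), None)
    if producer is None:
        return []
    history = []
    for name in producer.inputs:
        for t in lineage(name, derivations):
            if t not in history:
                history.append(t)
    if producer.transformation not in history:
        history.append(producer.transformation)
    return history


# Hypothetical sky-survey example.
calibrate = Transformation("calibrate_image", "1.0")
detect = Transformation("detect_objects", "2.1")
derivations = [
    Derivation(calibrate, ("raw_image",), ("clean_image",)),
    Derivation(detect, ("clean_image",), ("object_list",)),
]

for t in lineage("object_list", derivations):
    print(t.name, t.version)
```

Recording the transformation version alongside each derivation is what lets a later reader judge whether a derived product is fit for a given purpose.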
Similarly, you might at some later point find out that there was actually a mistake in your original data, and now you need to know which derived data you have to correct. If you have an explicit representation of these relationships, you can work that out.
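Working out what is affected by a bad input amounts to walking the derivation records forward. A minimal sketch, assuming each derivation is recorded simply as a pair of input and output product names (the record format and the example names are hypothetical):

```python
from collections import deque


def affected_products(bad_product, derivations):
    """All products derived, directly or indirectly, from `bad_product`.

    derivations: iterable of (inputs, outputs) pairs of product names.
    """
    derivations = list(derivations)
    affected = set()
    queue = deque([bad_product])
    while queue:
        current = queue.popleft()
        for inputs, outputs in derivations:
            if current in inputs:
                for out in outputs:
                    if out not in affected:
                        affected.add(out)   # newly found downstream product
                        queue.append(out)   # follow its own consumers too
    return affected


records = [
    (("raw_image",), ("clean_image",)),
    (("clean_image", "reference_catalog"), ("matched_objects",)),
    (("other_image",), ("other_clean_image",)),
]
print(sorted(affected_products("raw_image", records)))
```

Products that never consumed the faulty input, such as `other_clean_image` here, are left alone, which is what keeps the correction effort bounded.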
Finally, there is the basic virtual data concept that I referred to earlier. I would like to know if somebody has already run a particular procedure so that I don't have to recompute it. Obviously, the potential cost savings here are tremendous if you can avoid recomputations.
We are putting a lot of effort in the GriPhyN project into building database structures that can keep track of this sort of information. We have a system that comprises a virtual data catalog that keeps track of not only the data but also derivations and transformations, a virtual data language that allows one to query and update this virtual data catalog, and various applications that interact with this information in some interesting ways.
Let's say I ask for some data product that does not exist. We are set up now so that we can automatically construct a data execution schedule that will go out and run computations on potentially hundreds of computers at different sites across the US on our data grid in order to generate the data that I have asked for. The next time I ask for it, of course, the same computation will not need to take place because the data will already exist and has been indexed in our catalogs.
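The reuse behavior described here can be sketched with a toy catalog that derives a missing product on first request and indexes the result so that later requests are served from storage. This is an assumption-laden illustration that runs everything locally; the class name and the example recipes are hypothetical and do not reflect the actual virtual data catalog or language.

```python
class ToyVirtualDataCatalog:
    """Returns stored products; derives and indexes missing ones on demand."""

    def __init__(self, recipes, stored):
        self.recipes = recipes       # product -> (callable, list of input products)
        self.stored = dict(stored)   # product -> value already materialized
        self.runs = 0                # number of transformations executed so far

    def request(self, product):
        if product in self.stored:
            return self.stored[product]          # already exists: no computation
        if product not in self.recipes:
            raise LookupError(f"cannot find or derive {product!r}")
        func, inputs = self.recipes[product]
        args = [self.request(name) for name in inputs]  # derive inputs first
        self.runs += 1
        value = func(*args)
        self.stored[product] = value             # index it for later requests
        return value


recipes = {
    "squares": (lambda numbers: [n * n for n in numbers], ["numbers"]),
    "total": (sum, ["squares"]),
}
catalog = ToyVirtualDataCatalog(recipes, {"numbers": [1, 2, 3]})

first = catalog.request("total")
runs_after_first = catalog.runs
second = catalog.request("total")
print(first, second, runs_after_first, catalog.runs)
```

The second request returns the same value without running any transformation, which is the saving the virtual data idea is after. In the real setting each transformation would be dispatched to remote computers rather than called in-process.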
A histogram of the number of tasks running over time would show that in these sorts of runs we are sometimes managing execution on hundreds of computers, and sometimes many more.
Most of what I have talked about has really been concerned with computation, management of data movement, pretty low-level things. This is not to say that issues relating to information management and knowledge management are not important for these data grid applications. In fact, as part of our team, Reagan Moore at the San Diego Supercomputer Center is trying to work out how we will push up higher in the stack from data to information to knowledge as we build these data grid systems. Some of the work we are doing with these virtual data catalogs and the encoding of information for computational procedures is really pushing in that direction. We are starting to work out how to build knowledge bases that will tell us which computational procedures are appropriate for particular purposes, for example.
New Programs
A number of new ongoing initiatives are going to make what I have been talking about today even more important in the next few years as they push throughout the sciences.
This Grid concept, back in the early to mid-1990s, was an academic exercise being constructed by a few eccentrics. Then it started to become mainstream within a few disciplines. At this point we are seeing some major programs underway to apply it on a very large scale. The United Kingdom was one of the first to pursue this really wholeheartedly. They have what they call the eScience program—which I am sure anyone in the UK knows a lot about—which is pursuing very effectively this notion of application of grid computing concepts to a whole range of disciplines. They have projects in genomics, physical sciences, climate, environmental science, all trying to apply these concepts to various scientific disciplines.
In Europe, of course, there is the work currently going on to define the 6th Framework program, and it looks like it has a significant grid component to it. Japan is just getting going, as I said, with their Japanese Grid initiative.
This last Thursday in the US, the preliminary report of the so-called Blue Ribbon Panel on Cyberinfrastructure was released, and it had some very interesting comments to make on some of these topics. The panel believes that we are at a point where we need to see major new investments in computing and communication infrastructure to revolutionize science and engineering research at NSF and worldwide. That infrastructure includes supercomputing, but also storage, networking, software, collaboration, visualization, and human resources. Current centers such as the National Center for Supercomputing Applications, San Diego Supercomputer Center, and Pittsburgh Supercomputing Center are key resources. The panel has focused on the application of these grid concepts on a large scale, and they are talking about a huge amount of money. Whether they manage to get it or not remains to be seen, but they are talking about an incremental 650 million dollars a year. They also recommend empowering a central office to initiate path-breaking applications within NSF, coordinate policy and allocations across fields and projects, develop high-quality middleware and other essential software, and manage individual computational, storage, and networking resources at least a hundred times larger than individual projects or universities can provide. It shows that something is going to happen even more than has been happening in the past.
To summarize:
- It is always worth reminding ourselves that exponentials are so surprising in their implications. We have exponentially more computing, more exponentially more data, and then even more exponentially more networking. These different exponentials are having some really profound impacts on how we all do our business.
- The Grid is a concept but also a set of technologies that have proven to be very effective, which is allowing us to start sharing resources on large scales. These resources include certainly physical resources—big databases and big computers and networks and so forth, which are so important for many of the sciences—but I would hope that there would be the more subtle information and data resources that you are concerned with.
- I believe very strongly that this notion of petascale virtual data grids (a horrible phrase but maybe a meaningful one) represents at least a significant part of the future in science infrastructure. This is the sort of environment that people are going to be working in within just a few years; in fact, some already are.
- I have suggested that one important part of this future is that not only is data a community resource that we need to be able to describe and share and make available to the community, but so also are computers and programs and the results of computations—things that we have not perhaps thought of previously as shared information resources that need care taken of them.
From the Discussion that Followed
Discussion ranged from scientific methods and data reconstruction to the tensions and conflicts of resource sharing, to long-term access and just who does the archiving.
Participant 1: I notice that your community tends to talk about data curation rather than long-term retention or digital preservation. There is great significance in that fact.
Participant 2: In the UK the ownership of the huge datasets is with the research councils. In a way, it is about protecting that data when those people have finished, for that repurposing. I am from the British Library, so we have the India Office Library and Records and the India Office log books, which are now being used extensively to look at climate change, because these are the people who were recording climate in the 18th century and earlier.
When it is no longer of interest to the original community, perhaps it has to be taken away from that community. At the moment, certainly in the UK, it is the responsibility of the community.
Moderator: Essentially, there is no single record or archived document which is nowadays used for the same purpose for which it was originally created. The original creator doesn't care any more, and is long dead and forgotten. To take a Dutch example, why is our meteorological data so sought out? Because already in the 16th century the people who ran the tow boat between Leiden and Haarlem and between Haarlem and Amsterdam noted down when the canal was frozen so that they could not operate the service. They constructed a very fine set of data of winters in Holland. At that time the company didn't think about that possible reuse, just like scientists now are not thinking of any future use of material. We, as curators and preservers of the heritage of the past, know what is going on.
Participant 3: Somehow since the budgets are considerably larger with those scientific communities, that sense of responsibility has to move where the money is greater.
Ian Foster: It is unfortunately the case that even in these disciplines that sometimes have tremendously large budgets it is very hard to get money for data analysis and also for data storage.
Participant 4: I think there is another problem in that the new society that the Grid represents is cross-boundary, cross-domain in every respect and, yet, we as custodians are funded to build nationally or locally or institutionally and we are just not positioned other than through consortia like this to look at the scope of these huge transnational activities. That seems to me to be a fairly major problem that we have to confront fairly soon if we are going to solve it.
Participant 5: I am thinking that it is not necessarily the scientific community failing to look for long-term storage. It may well be the library community's failure to keep up with the scientific community in the way that they collect and use data, and we are trailing quite a bit behind. Our obligation is to develop funding options and to communicate about the long-term value of storage and access.
Moderator: I don't think one could debate this question in terms of failure. In the 19th century libraries had taken over some of the responsibilities of scientists and writers and said, "We will care for them." Throughout 100 or 200 years, the scientific community could think, "This is something the library will do for us." Now we have to turn back again, I think.
Participant 5: No, I think we have done a much better job in worrying about the literature and arts and other creative sides of our intellectual property, but less well on science partially because of the scale that is involved in supporting data archiving and retrieving.
Moderator: I think in libraries and in archives we were for a very long time at the back end of the chain. Now in the digital era we have to move to the front end of the system where the data or information is used. We have something to say there, and we should put our requirements at the front end and say, "If you are creating and using data information, if you are creating records in a digital record-keeping system, the requirements for preservation in the long run and metadata are these and these." We have to move forward in the chain.
Participant 6: I think one of the interesting things for me to take away from what you have shown us is this notion of data as a community resource. Responsibility for the data has in fact been taken over by the community. Whether we are happy with the way it is being done, it is good enough for their purpose, and that is what they are thinking about.
What about our traditional roles for the constituencies that we have traditionally been much more involved in serving? What happens when they go virtual? How will that change what we are doing? Are we going to encounter a similar kind of evolution, and what is the implication for the caretaking associated with that data?
There are two things that might happen here. One is that we could say that, because we are not contributing to the process, those virtual communities will take care of themselves. Or, we could say that this is an opportunity for us to learn from what is being modeled in the scientific community, and we are a little bit ahead of the curve, and we can figure out how in fact we can be an important part of the way our constituencies are going to be doing their work in the future.
I am not sure which of those things it will be. What are the implications going to be for our work a few years from now?